DS0705 -

Discovery of Complex Schemas for RDF Knowledge Bases – DICOS

Recent years have seen the rise of large knowledge bases such as DBpedia, YAGO, Freebase, and Google’s Knowledge Graph. The advance of the Linked Open Data project, which now contains thousands of knowledge bases, is a case in point. These knowledge bases use RDF and are thus inherently schema-less. We propose to use rule mining to deduce schema constraints automatically from the data.

Mining schemas automatically

Building on recent advances in the field, we propose to enlarge the scope of automated rule mining to numerical and existential rules. The resulting constraints could be used to spot errors in the data or even to predict missing pieces in the knowledge. The particular challenge in the context of knowledge bases is the absence of counterexamples, which requires a new approach to mining rules.

Rule Mining

Our central insight is that logical rules provide a general and expressive framework to
mine all of these aspects together. Logical rules take the form
type(x,movie) ∧ stars(x,y) ⇒ type(y,actor)
Such rules typically come with a weight or confidence score. They can be mined efficiently from large
KBs [48, 50]. If the rule language could be extended to numerical attributes, existential rules, and
negated atoms, then we would be able to mine much richer constraints than previously possible with
the approaches in isolation. We could discover, e.g., that people who teach at a university usually have a
doctoral degree, that ISBNs should be unique, that a year together with a title identifies a movie uniquely,
or that race cars are generally faster than standard cars. These rules could also be used to spot and
eliminate erroneous facts. For example, we could automatically detect that “Titanic (movie)” should be
classified as a movie and not as a ship, because it has actors; we could check the taxonomy to make sure
that more general classes (with fewer attributes) include more specific classes (with more attributes); or
we could detect that a birth date must be wrong because it appears after the death date. If we are able to
mine that people usually have no more nationalities than those mentioned in the KB, we could find areas
where the KB is “complete” and where it is not. Learning rules that estimate the completeness of a KB
could open up new ways of reasoning and evaluation, and this topic has not yet been touched by current
research. If such regularities could be found and exploited automatically, that would mean a huge step
forward for the community.
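
To make the notion of a rule with a confidence score concrete, here is a minimal Python sketch that scores the example rule on a toy knowledge base. The triples, the identifiers (m1, p1, ...), and the function name rule_statistics are assumptions made for illustration only. The second score counts a person as a counterexample only if the KB already knows some type for that person, in the spirit of the partial completeness assumption from the rule mining literature. It is one possible way to cope with missing counterexamples, not necessarily the approach developed in the project.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy knowledge base; every fact here is invented.
KB: Set[Triple] = {
    ("m1", "type", "movie"),
    ("m2", "type", "movie"),
    ("m1", "stars", "p1"),
    ("m1", "stars", "p2"),
    ("m1", "stars", "p3"),
    ("m2", "stars", "p1"),
    ("m2", "stars", "p4"),
    ("p1", "type", "actor"),
    ("p2", "type", "actor"),
    ("p3", "type", "singer"),
    # p4 has no type fact at all: unknown, not necessarily wrong.
}


def rule_statistics(kb: Set[Triple]) -> Tuple[int, float, float]:
    """Score the rule type(x,movie) AND stars(x,y) => type(y,actor).

    Returns (support, standard confidence, partial-completeness confidence),
    counted over distinct values of y.
    """
    movies = {s for (s, p, o) in kb if p == "type" and o == "movie"}
    # All y that satisfy the body of the rule for some movie x.
    body_ys = {o for (s, p, o) in kb if p == "stars" and s in movies}
    actors = {s for (s, p, o) in kb if p == "type" and o == "actor"}
    typed = {s for (s, p, _) in kb if p == "type"}

    # Support: y values for which body and head both hold.
    support = len(body_ys & actors)
    # Standard confidence treats every missing head fact as a counterexample.
    standard = support / len(body_ys) if body_ys else 0.0
    # The alternative score ignores y values without any known type.
    known = body_ys & typed
    partial = support / len(known) if known else 0.0
    return support, standard, partial


if __name__ == "__main__":
    support, standard, partial = rule_statistics(KB)
    print(f"support={support} standard={standard:.2f} partial={partial:.2f}")
```

On this toy data the untyped entity p4 lowers the standard confidence but not the second score, which shows why the treatment of missing facts matters when no explicit counterexamples are available.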

Results

The work progressed on four axes:

Wikidata
Our idea is to spot mistakes in the data of Wikidata, to learn how the contributors of Wikidata corrected these mistakes in the past, and to propose corrections for similar mistakes in the current data. This falls under Work Package 1 of the DICOS project, “Mining Dependencies”.

Dynamic Knowledge Bases
We work on schema mining for dynamic knowledge bases (i.e., knowledge bases that are accessible only through Web services). The idea is to pin down all queries that can be answered given the services. This amounts to a characterization of the part of the knowledge base that is accessible from the outside. This is a special case of Work Package 1 that we decided to treat because it has a clear use case.

Conditional Key Mining
The idea is to mine constraints that identify an entity uniquely in a certain context. For example, a German PhD student can have only a single advisor. Thus, the student uniquely identifies the advisor – but only in Germany. This work falls under Work Package 1 as well. I worked on this project with two colleagues, one PhD student, and one postdoc.
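
As an illustration of what such a constraint means, the following Python sketch checks whether a given set of attributes acts as a key among the records that satisfy a condition. It only verifies one candidate and is not the mining algorithm of the project. The record layout, the toy data, and the names violations and is_conditional_key are assumptions introduced here.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]

# Hypothetical supervision records; all values are invented.
RECORDS: List[Record] = [
    {"student": "s1", "advisor": "a1", "country": "Germany"},
    {"student": "s2", "advisor": "a1", "country": "Germany"},
    {"student": "s3", "advisor": "a2", "country": "France"},
    {"student": "s3", "advisor": "a3", "country": "France"},
]


def violations(
    records: List[Record], key_attrs: List[str], condition: Dict[str, str]
) -> Dict[Tuple, Set[Tuple]]:
    """Group distinct records that match `condition` and share the key values.

    Any group with more than one record is a violation of the candidate key.
    """
    groups: Dict[Tuple, Set[Tuple]] = defaultdict(set)
    for record in records:
        # Keep only records that satisfy every attribute=value condition.
        if all(record.get(a) == v for a, v in condition.items()):
            key = tuple(record.get(a) for a in key_attrs)
            groups[key].add(tuple(sorted(record.items())))
    return {k: g for k, g in groups.items() if len(g) > 1}


def is_conditional_key(
    records: List[Record], key_attrs: List[str], condition: Dict[str, str]
) -> bool:
    """True if no two distinct matching records agree on `key_attrs`."""
    return not violations(records, key_attrs, condition)


if __name__ == "__main__":
    print(is_conditional_key(RECORDS, ["student"], {"country": "Germany"}))
    print(is_conditional_key(RECORDS, ["student"], {}))
```

In the toy data, the student identifies the supervision record under the condition country=Germany, but not over all records, because s3 appears with two advisors.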

Mining of obligatory attributes
We work on mining obligatory attributes in knowledge bases. The idea is to find out whether an attribute (e.g., “hasNationality” or “isMarried”) applies to all instances of a class in the real world (i.e., whether all people are married in the real world) – given only the incomplete knowledge of the knowledge base. This falls under Work Package 2 of the DICOS project, “Mining Rules with Existential Quantifiers”.
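
The difficulty can be seen with a naive baseline, sketched below in Python: the fraction of instances of a class for which the KB records the attribute. This is not the method of the project. A ratio below 1 does not show that the attribute is optional, since the KB may simply be missing facts. The toy triples and the function name observed_coverage are assumptions.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy knowledge base; every fact here is invented.
KB: Set[Triple] = {
    ("p1", "type", "person"),
    ("p2", "type", "person"),
    ("p3", "type", "person"),
    ("p4", "type", "person"),
    ("p1", "hasNationality", "c1"),
    ("p2", "hasNationality", "c1"),
    ("p3", "hasNationality", "c2"),
    ("p1", "isMarried", "p2"),
}


def observed_coverage(kb: Set[Triple], cls: str, attribute: str) -> float:
    """Fraction of known instances of `cls` with at least one `attribute` fact."""
    instances = {s for (s, p, o) in kb if p == "type" and o == cls}
    if not instances:
        return 0.0
    with_attribute = {s for (s, p, _) in kb if p == attribute and s in instances}
    return len(with_attribute) / len(instances)


if __name__ == "__main__":
    for attribute in ("hasNationality", "isMarried"):
        print(attribute, observed_coverage(KB, "person", attribute))
```

Both ratios are below 1 here, so the observed data alone cannot separate an obligatory attribute with missing facts from a genuinely optional one.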

Prospects

Scientific productions and patents

Thomas Pellissier Tanon, Camille Bourgaux, Fabian M. Suchanek:
“Learning How to Correct a Knowledge Base from the Edit History”
Full paper at The Web Conference (WWW), 2019

Jonathan Lajus, Fabian M. Suchanek:
“Are All People Married? Determining Obligatory Attributes in Knowledge Bases”
Full paper at the Web Conference (WWW), 2018

Danai Symeonidou, Luis Galárraga, Nathalie Pernelle, Fatiha Saïs, Fabian M. Suchanek:
“VICKEY: Mining Conditional Keys on Knowledge Bases”
Full paper at the International Semantic Web Conference (ISWC), 2017

Fabian M. Suchanek:
“Extraction d’informations”
Book chapter in Les Big Data à découvert, 2017

Submission summary

Recent years have seen the rise of large knowledge bases such as DBpedia, YAGO, Freebase, and Google's Knowledge Graph. The advance of the Linked Open Data project, which now contains thousands of knowledge bases, is a case in point. These knowledge bases use RDF and are thus inherently schema-less. We propose to use rule mining to deduce schema constraints automatically from the data. Building on recent advances in the field, we propose to enlarge the scope of automated rule mining to numerical and existential rules. The resulting constraints could be used to spot errors in the data or even to predict missing pieces in the knowledge. The particular challenge in the context of knowledge bases is the absence of counterexamples, which requires a new approach to mining rules.

Fabian Suchanek (Institut Mines-Télécom)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LTCI - TELECOM PARISTECH Institut Mines-Télécom

ANR funding: 250,043 euros
Start and duration of the scientific project: September 2016, 36 months
