CE23 - Data, Knowledge, Big Data, Multimedia Content, Artificial Intelligence

Effective Inference of Cleaning Programs from Data Annotations – InfClean


Besides reliable models for decision making, we need data that has been processed from its original, raw state into a curated form, a process referred to as “data cleaning”. In this process, engineers and domain experts collect specifications, such as business rules. Specifications are then encoded in programs to be executed over the raw data to identify and fix errors. This process is expensive and does not provide any formal guarantee on the ultimate quality of the data.

A formal framework that reduces the human effort in cleaning data

The goal of InfClean is to rethink the data cleaning field with an inclusive formal framework that radically reduces the human effort in cleaning data.
As described in the original proposal, the project is executed along three research directions:
1. Laying the theoretical foundations of synthesizing specifications directly with the domain experts;
2. Designing and implementing new automated techniques that use external information to identify and repair data errors;
3. Modeling the interactive cleaning process with a principled optimization framework that guarantees quality requirements.

We use different approaches for data cleaning. The two main methods are based on first-order logic, for declarative rules, and on deep learning, for obtaining data representations from the data itself in an unsupervised fashion.
More specifically, we presented an algorithm and a system for mining declarative rules over RDF knowledge graphs (KGs). We discover rules expressing both positive relationships between elements, e.g., “if two persons share at least one parent, they are likely to be siblings,” and negative patterns identifying data contradictions, e.g., “if two persons are married, one cannot be the child of the other”. While the first kind of rule identifies new facts in the KG, the second enables the detection of incorrect triples and the generation of negative training examples for learning algorithms. Our approach increases the expressive power of the supported rule language with respect to existing systems, and the method is robust to errors and missing data in the input graph.
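To make the two kinds of rules concrete, the following Python sketch evaluates one positive and one negative rule over a toy set of triples. The predicates and facts are hypothetical and the code only illustrates the idea; it is not the project's mining algorithm.

```python
# Minimal sketch (not the project's rule miner): checking one positive and one
# negative rule over a toy set of RDF-style triples. Predicate names and the
# tiny KG are hypothetical, for illustration only.

KG = {
    ("anna", "hasParent", "carl"), ("bob", "hasParent", "carl"),
    ("anna", "siblingOf", "bob"),
    ("dave", "marriedTo", "eve"), ("dave", "hasParent", "eve"),  # contradiction
}

def objects(subj, pred):
    """All objects o such that (subj, pred, o) is in the KG."""
    return {o for (s, p, o) in KG if s == subj and p == pred}

# Positive rule: hasParent(x, z) AND hasParent(y, z) AND x != y  =>  siblingOf(x, y)
def positive_rule_predictions():
    people = {s for (s, _, _) in KG} | {o for (_, _, o) in KG}
    preds = {(x, "siblingOf", y)
             for x in people for y in people
             if x != y and objects(x, "hasParent") & objects(y, "hasParent")}
    return preds - KG  # new facts not yet in the KG

# Negative rule: marriedTo(x, y) AND hasParent(x, y)  =>  contradiction
def negative_rule_violations():
    return {(x, y) for (x, p, y) in KG
            if p == "marriedTo" and y in objects(x, "hasParent")}

print("Predicted facts:", positive_rule_predictions())
print("Contradictions:", negative_rule_violations())
```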
For learning local embeddings for relational data, we use a two-step approach. In the first step, we leverage a graph-based representation of relational datasets that captures syntactic and semantic relationships between cell values. We use Token nodes for the unique values in the dataset, Record Id nodes for tuples, and Column Id nodes for columns/attributes; these nodes are connected by edges based on their relationships. This compact graph highlights value overlap and explicitly represents the primitives of data integration tasks, i.e., records and attributes. In the second step, we formulate the problem of obtaining local embeddings as a graph embedding generation problem. We use random walks to quantify the similarity between neighboring nodes and to exploit metadata such as tuple and attribute IDs; this ensures that nodes sharing similar neighborhoods are close in the final embedding space. The corpus used to train our local embeddings is generated by materializing these random walks.
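As a rough illustration of this two-step construction (assuming a toy two-column relation and hypothetical node labels; this is not the released implementation), the sketch below builds the token/record/column graph and materializes a few random walks that would form the training corpus.

```python
# Minimal sketch: build the heterogeneous graph (token / record-id / column-id
# nodes) for a toy relation and materialize random walks. The resulting corpus
# would then be fed to an off-the-shelf skip-gram trainer (e.g. word2vec) to
# obtain the local embeddings.
import random
from collections import defaultdict

rows = [
    {"name": "anna", "city": "paris"},
    {"name": "bob",  "city": "paris"},
]

# Adjacency: each cell value (token) links to its record id and its column id.
graph = defaultdict(set)
for rid, row in enumerate(rows):
    record_node = f"R{rid}"
    for col, value in row.items():
        for a, b in [(f"T_{value}", record_node), (f"T_{value}", f"C_{col}")]:
            graph[a].add(b)
            graph[b].add(a)

def random_walk(start, length=6):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(sorted(graph[walk[-1]])))
    return walk

random.seed(0)
corpus = [random_walk(node) for node in graph for _ in range(3)]
print(corpus[0])  # a walk alternating token nodes and record/column-id nodes
```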

For the first direction, our main result is a homogeneous framework for data cleaning based on deep learning. The solution, presented in a SIGMOD 2020 paper, shows how to create embeddings for relational datasets in a completely unsupervised manner. These embeddings can then be used for several data integration and data cleaning tasks. The results show that the new embeddings can be used in both supervised and unsupervised approaches, with performance that beats the state of the art.
For the second direction, we introduced several algorithms aimed at increasing the automation of the cleaning process. Our contributions include a model that naturally represents and integrates such external information (the SIGMOD 2020 paper), an algorithm to mine logical rules from the data (JDIQ 2020 article), and a system that combines rules and Web evidence to validate facts in an explainable fashion (TTO 2019 article). For the latter task, we also developed a benchmark for the systematic evaluation of the outcome of such systems (CIKM 2019 paper and VLDB 2019 demo). Algorithms for data repairing have also been published in the VLDB Journal with external collaborators.
For the third direction, we started studying the problem of identifying good examples to be checked by the users. Our initial effort has focused on time series data, in collaboration with a French start-up. This work has led to a patent and to a paper at ICDE 2020.

Looking forward, we plan to keep working on the second and third directions by focusing our attention on the possibilities offered by recent advances in deep learning. We believe recent architectures, such as transformers (BERT, XLM), can be effectively used to solve data cleaning tasks. However, there are several open questions on how to integrate external domain information, such as logical rules, into these solutions, and we aim to make contributions in this direction.

- N. Ahmadi, P. Huynh, V. Meduri, P. Papotti, S. Ortona.
Mining Expressive Rules in Knowledge Graphs.
Journal of Data and Information Quality (JDIQ), 2020.

- F. Geerts, G. Mecca, P. Papotti, D. Santoro.
Cleaning data with Llunatic.
VLDB Journal, 2019.

- R. Cappuzzo, P. Papotti, S. Thirumuruganathan.
Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks.
SIGMOD, 2020.

- P. Huynh, P. Papotti.
A Benchmark for Fact Checking Algorithms Built on Knowledge Bases.
CIKM, 2019.

- P. Huynh, P. Papotti.
Buckle: Evaluating Fact Checking Algorithms Built on Knowledge Bases.
VLDB (demo), 2019.

- N. Ahmadi, J. Lee, P. Papotti, M. Saeed.
Explainable Fact Checking with Probabilistic Answer Set Programming.
Conference for Truth and Trust Online (TTO), 2019.

Our main results have been disseminated through the technical documents mentioned above (and the corresponding presentations at the conferences), in the form of code repositories, invited talks and keynotes (IMT Webinar, talk at QCRI and Telecom Paris, invited talk at DEXA 2020), and through online activities (LinkedIn and Twitter).

This proposal addresses a pressing need in data science applications: besides reliable models for decision making, we need data that has been processed from its original, raw state into a curated form, a process referred to as “data cleaning”. In this process, data engineers collaborate with domain experts to collect specifications, such as business rules on salaries, physical constraints for molecules, or representative training data. Specifications are then encoded in cleaning programs to be executed over the raw data to identify and fix errors. This human-centric process is expensive and, given the overwhelming amount of today’s data, is conducted with a best-effort approach, which does not provide any formal guarantee on the ultimate quality of the data.
The goal of InfClean is to rethink the data cleaning field from its assumptions with an inclusive formal framework that radically reduces the human effort in cleaning data. This will be achieved in three steps:
(1) by laying the theoretical foundations of synthesizing specifications directly with the domain experts;
(2) by designing and implementing new automated techniques that use external information to identify and repair data errors;
(3) by modeling the interactive cleaning process with a principled optimization framework that guarantees quality requirements.

The project will lay a solid foundation for data cleaning, enabling a formal framework for specification synthesis, algorithms for increased automation, and a principled optimizer with quality performance guarantees for the user interaction. It will also broadly enable accelerated information discovery, as well as the economic benefits of early, well-informed, trustworthy decisions. To provide the right context for evaluating these new techniques and to highlight the impact of the project in different fields, InfClean plans to address its objectives by using real case studies from different domains, including health and biodiversity data.

Project coordination

Paolo Papotti (EURECOM)

The author of this summary is the project coordinator, who is responsible for the content of this summary. The ANR declines any responsibility as for its contents.

Partner

EURECOM

ANR grant: 213,320 euros
Beginning and duration of the scientific project: - 48 Months
