DS0707 - Interactions between the physical world, humans and the digital world

Generating Text from Semantic Web Data – WEB-NLG

Natural Language Generation for the Semantic Web

With the emergence of the Big Data phenomenon, there is a growing need for technologies that give humans easy access to the machine-oriented Web of data. Because it maps data to text, Natural Language Generation (NLG) provides a natural means of presenting this data in an organized, coherent and accessible way. In this context, the WebNLG project aims to further the development of robust and portable high-quality NLG systems capable of producing natural-sounding text from Semantic Web data.

Generating natural sounding text from arbitrary RDF and OWL input.

The two main goals of the WebNLG project are:

1. To develop techniques and tools for verbalising data encoded in a format supported by the semantic web (OWL, RDF). Emphasis will be on generating short text of high quality as required by longer user queries or by the verbalisation of class descriptions. Hybrid symbolic/stochastic techniques will be drawn upon to support efficiency, robustness and portability.

2. To promote research on NLG for the Semantic Web through the establishment of common benchmarks and evaluation metrics. To this end, the WebNLG project will organise an international shared task on verbalising semantic web data.

Automated lexicon and grammar extraction. Using both symbolic and machine learning techniques for lexicon and grammar extraction, the WebNLG project proposes new solutions to a major technological bottleneck in developing data-to-text generators, namely the excessive cost of grammar and lexicon development. To generate from the KBGen biology knowledge base, we exploit a parallel corpus to induce a generation grammar and a comparable corpus to induce a generation lexicon. We use an embedding-based approach to learn a mapping between DBPedia properties and candidate lexicalisations.
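As an illustration of the idea of ranking candidate lexicalisations for a property by embedding similarity, the following minimal sketch uses cosine similarity over toy vectors. It is not the project's model: the vectors, the phrase inventory and the function names (`cosine`, `rank_lexicalisations`) are all hypothetical, and real embeddings would be learned from text rather than written by hand.

```python
import math

# Toy 3-dimensional vectors, written by hand for illustration only.
# In practice both sets of vectors would be learned from corpora.
PROPERTY_VECTORS = {
    "birthPlace": [0.9, 0.1, 0.0],
    "spouse": [0.0, 0.2, 0.9],
}
PHRASE_VECTORS = {
    "was born in": [0.8, 0.2, 0.1],
    "is married to": [0.1, 0.1, 0.9],
    "works for": [0.2, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_lexicalisations(prop):
    """Return candidate phrases sorted from most to least similar."""
    vec = PROPERTY_VECTORS[prop]
    scored = [(cosine(vec, pv), phrase) for phrase, pv in PHRASE_VECTORS.items()]
    return [phrase for _, phrase in sorted(scored, reverse=True)]


if __name__ == "__main__":
    print(rank_lexicalisations("birthPlace"))
```

With these toy vectors, the top-ranked phrase for `birthPlace` is "was born in" and for `spouse` it is "is married to".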

Linguistically Principled Grammars. Existing approaches to ontology verbalisation usually rely on templates or on hand-made grammars, thereby ignoring the wealth of research conducted on computational grammars and, in particular, on grammars that interface syntax and semantics. In the WebNLG project, we exploit semantic Lexicalised Feature-Based Tree Adjoining Grammars, which provide a natural model of the linking between text, syntax and semantics.
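For contrast, the template-based style of verbalisation mentioned above can be sketched in a few lines. This is only a baseline illustration, not the grammar-based approach of the project; the template inventory, the fallback wording and the helper names (`label`, `verbalise`) are assumptions made for the example.

```python
# Hypothetical templates keyed by property name. {s} and {o} stand for
# the subject and object of an RDF triple.
TEMPLATES = {
    "birthPlace": "{s} was born in {o}.",
    "occupation": "{s} worked as {o}.",
}


def label(resource):
    """Turn a resource name such as 'Alan_Bean' into a readable label."""
    return resource.replace("_", " ")


def verbalise(triple):
    """Verbalise one (subject, property, object) triple with a template.

    Properties without a template fall back to a generic pattern,
    which is where template-based output tends to sound unnatural.
    """
    s, p, o = triple
    template = TEMPLATES.get(p, "{s} has " + p + " {o}.")
    return template.format(s=label(s), o=label(o))


if __name__ == "__main__":
    print(verbalise(("Alan_Bean", "birthPlace", "Wheeler,_Texas")))
    print(verbalise(("Alan_Bean", "mission", "Apollo_12")))
```

Each template handles exactly one property and one syntactic pattern, which is why such systems are costly to port to new domains and offer little variation in the output.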

Uncontrolled Natural Language. While existing work on ontology verbalisation generally assumes verbalisation in a controlled natural language, we exploit ambiguous grammars, thereby allowing for greater variability in the output and for a wider range of applications using the same grammar.


A Hybrid Symbolic-Stochastic Approach. While most existing work on ontology verbalisation is symbolic, we will combine symbolic and stochastic approaches so as to increase robustness and efficiency, decrease the need for manual intervention and support linguistic variability. In particular, we combine techniques such as automatic data-to-text alignment, automatic grammar extraction, ranking and classification with the use of symbolic grammars to produce linguistically fine-grained but computationally robust and efficient tools for ontology verbalisation and querying.

While there has been much work in recent years on data-driven natural language generation, little attention has been paid to the fine-grained interactions that arise during micro-planning between aggregation, surface realization and sentence segmentation. The WebNLG project developed a hybrid symbolic/statistical approach to jointly model the interactions arising in Natural Language Generation between syntactic, aggregation and sentence segmentation choices. Our approach integrates a small hand-written grammar, a statistical hypertagger and a surface realization algorithm. It is applied to the verbalization of knowledge base queries and tested on 13 knowledge bases to demonstrate domain independence. We evaluate our approach in several ways. A quantitative analysis shows that the hybrid approach outperforms a purely symbolic approach in terms of both speed and coverage. Results from a human study indicate that users find the output of this hybrid statistical/symbolic system more fluent than that of both a template-based and a purely symbolic grammar-based approach. Finally, we illustrate by means of examples that our approach can account for various factors impacting aggregation, sentence segmentation and surface realization.

This work has been accepted for publication in the journal Computational Linguistics and gave rise to a keynote at the Spanish conference for Natural Language Processing (SEPLN 2015).
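The interaction between aggregation and sentence segmentation described above can be made concrete with a deliberately simple, rule-based sketch. It is not the statistical, grammar-based model of the project: the fixed limit `max_per_sentence`, the pronoun choice and the function names are assumptions, and the verb phrases are taken as already lexicalised.

```python
def join_phrases(phrases):
    """Coordinate verb phrases: 'a', 'a and b', 'a, b and c'."""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]


def microplan(subject, verb_phrases, max_per_sentence=2, pronoun="He"):
    """Aggregate facts about one entity and segment them into sentences.

    Aggregation: facts sharing the subject are coordinated in one clause.
    Segmentation: a new sentence starts after max_per_sentence facts.
    """
    sentences = []
    for start in range(0, len(verb_phrases), max_per_sentence):
        chunk = verb_phrases[start:start + max_per_sentence]
        # The first sentence names the entity; later ones use a pronoun.
        head = subject if start == 0 else pronoun
        sentences.append(head + " " + join_phrases(chunk) + ".")
    return " ".join(sentences)


if __name__ == "__main__":
    facts = ["was born in Wheeler", "worked as a test pilot", "flew on Apollo 12"]
    print(microplan("Alan Bean", facts))
```

Changing `max_per_sentence` shifts the balance between fewer, longer sentences and more, shorter ones. A fixed threshold of this kind ignores the syntactic factors that a joint model is meant to capture.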


Organisation of two international workshops on Natural Language Generation and the Semantic Web (Nancy, June 2015; Edinburgh, September 2016).

A new international project (Nancy, Paris 13, Mexico) was accepted on generating text from the output of FRED, a machine reading tool that produces semantic representations in RDF format.

During the first half of the project, we worked on generating from knowledge bases and developed a statistical grammar-based approach to micro-planning which adequately captures the interactions between surface realisation, aggregation and sentence segmentation. We also worked on unsupervised verbalisation of n-ary relations, uploaded demos of our various generation algorithms to the website, organised two international workshops and started new collaborations with Semantic Web researchers (Aldo Gangemi, PIRAT project with Mexico and Paris 13) and with the US National Library of Medicine, Bethesda, Maryland, USA (Yassine M'rabet).

During the second half, we will work on generation from RDF and linked data focusing on content selection (how to select from an RDF knowledge base the content to be verbalised?) and on micro-planning (how to produce good quality text out of the selected content?). For content selection, we are currently exploring how Integer Linear Programming can be used to select representative content of an entity belonging to a given category (e.g., Alan Bean of category Astronaut). The aim is to be able to automatically produce short summaries of e.g., DBPedia entities which are both accurate and relevant. For micro-planning, we will work on extending the approach we developed for knowledge base queries and binary relations to n-ary relations and to arbitrary event or entity descriptions. This includes devising methods for automatically inducing lexicons mapping an event and its arguments to a syntactic frame and to the linking information required to map syntactic and semantic arguments (lexicalisation of n-ary relations) and modeling the interactions between higher-order event-to-event relations, surface realisation, aggregation and sentence segmentation.

As planned, we will also work on constructing training and test data for a shared campaign to be launched in 2016/2017.
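The content selection step described above can be viewed as a small 0/1 optimisation problem: choose the subset of facts that maximises total relevance under constraints. The sketch below states one such problem and solves it by exhaustive enumeration, which stands in for an Integer Linear Programming solver and is only practical for a handful of candidates. The relevance scores, the two constraints (a fact budget and at most one fact per property) and the function name `select_content` are assumptions for illustration, not the project's formulation.

```python
from itertools import combinations


def select_content(candidates, budget):
    """Pick the highest-scoring subset of facts under two constraints.

    candidates: list of (property, object, relevance) tuples.
    Constraints: at most `budget` facts, at most one fact per property.
    Objective: maximise the sum of relevance scores.
    """
    best, best_score = (), 0.0
    for k in range(1, budget + 1):
        for subset in combinations(candidates, k):
            props = [p for p, _, _ in subset]
            if len(props) != len(set(props)):
                continue  # violates the one-fact-per-property constraint
            score = sum(s for _, _, s in subset)
            if score > best_score:
                best, best_score = subset, score
    return list(best)


if __name__ == "__main__":
    # Hypothetical relevance scores for facts about an astronaut.
    facts = [
        ("birthPlace", "Wheeler, Texas", 0.6),
        ("occupation", "Test pilot", 0.5),
        ("mission", "Apollo 12", 0.9),
        ("mission", "Skylab 3", 0.7),
    ]
    print(select_content(facts, budget=2))
```

With a budget of two, the two `mission` facts have the highest combined score but cannot be selected together, so the sketch returns the birth place and the Apollo 12 mission. In an ILP formulation each fact would become a binary variable and each constraint a linear inequality over those variables.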

C. Gardent and L. Perez-Beltrachini. A Statistical, Grammar-Based Approach to Micro-Planning. In Computational Linguistics, Volume 43, Issue 1 - March 2017.

C. Gardent, A. Shimorina, S. Narayan and L. Perez-Beltrachini. Creating Training Corpora for NLG Micro-Planning. ACL 2017. Vancouver (Canada).

S. Narayan, C. Gardent, S. Cohen and A. Shimorina. Split and Rephrase. EMNLP 2017. Copenhagen (Denmark).

L. Perez-Beltrachini, R. Sayed and C. Gardent. Building RDF Content for Data-to-Text Generation. COLING 2016, Osaka (Japan).

L. Perez-Beltrachini and C. Gardent. Learning Embeddings to Lexicalise RDF Properties. *SEM 2016, The Fifth Joint Conference on Lexical and Computational Semantics, August 11-12 2016, Berlin (Germany).



Enrico Franconi, Claire Gardent, Ximena I. Juarez-Castro and Laura Perez-Beltrachini. Quelo Natural Language Interface: Generating Queries and Answer Descriptions. In Proceedings of the ISWC 2014 Workshop on Natural Language Interfaces for Web of Data (NLIWoD), Riva del Garda, Trentino, Italy.

Bikash Gyawali, Claire Gardent and Christophe Cerisara. A Domain Agnostic Approach to Verbalizing n-ary Events without Parallel Corpora. In Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), September 2015, Brighton, UK. Pp 18-27.

Bikash Gyawali, Claire Gardent and Christophe Cerisara. Automatic Verbalisation of Biological Events. Proceedings of the 2nd Workshop on Definitions in Ontologies (IWOOD 2015). July 2015, Portugal.

There is a growing need in the semantic web (SW) community for technologies that give humans easy access to the machine-oriented Web of data. Because it maps data to text, Natural Language Generation (NLG) provides a natural means of presenting this data in an organized, coherent and accessible way. Conversely, the representation languages used by the semantic web (e.g., OWL ontologies and RDF data) are a natural starting ground for NLG systems.

The aim of the WebNLG project is to exploit this synergy between NLG and the Semantic Web and to further the development of robust and portable, high-quality NLG systems capable of producing natural-sounding text from SW data (e.g., Knowledge Bases, Linked Data).

The project will build on an ongoing collaboration between LORIA (Nancy, France), the KRDB group (Bolzano, Italy) and SRI International (Palo Alto, USA), bringing together high-level academic partners with internationally recognised expertise in both NLG (LORIA) and knowledge processing (KRDB, SRI).

Project coordination

Claire GARDENT (Laboratoire Lorrain de Recherche en Informatique et ses Applications)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

Partners

LORIA Laboratoire Lorrain de Recherche en Informatique et ses Applications
KRDB KRDB Research Center for Knowledge and Data
SRI SRI International

ANR funding: 251,925 euros
Beginning and duration of the scientific project: September 2014 - 36 Months
