DS0704 -

Advanced Models for Multilingual Semantic Processing – MULTISEM

Submission summary

The MultiSem project will propose novel advanced models for multilingual semantic processing. Existing data-driven models use robust machine learning techniques to handle vast amounts of textual data, but they overlook the intricacies of the mechanisms involved in language processing, which automatic methods should reflect. At the same time, findings in computational semantics fail to make their way into large-scale NLP systems, mainly because the field focuses on small lexical samples, which limits the ability of its models to scale up and be applied to unrestricted text. Interaction between the disciplines has thus been limited so far, and the potential mutual benefits of their respective research remain unclear. At this moment of burgeoning interest in multilingual processing and semantics-related research, the MultiSem project proposes to bridge the gap between the disciplines by combining the efficiency and robustness of state-of-the-art approaches to semantic analysis with linguistically motivated semantic representations.

The main novelty of the semantic processing models proposed in MultiSem is that they will adapt processing to different lexical items and text types, inspired by findings on the organisation of semantic information in the mental lexicon and the role of context in meaning activation. It has been shown that, instead of considering all possible interpretations of words in context, human bilinguals and translators restrict their choice to specific senses. This focus is largely influenced by the parameters of the communicative context and by the domain and topic of the processed texts, while finer-grained filtering occurs only when needed to improve text understanding. Based on these findings, the models developed in MultiSem will differentiate semantic processing according to the disambiguation needs of specific words, contexts and textual genres. To achieve this ambitious goal, we intend to combine continuous space representations and topic models with traditional vector-space models for ambiguity resolution. The selection of the optimal representation for specific lexical items and text types will be guided by the output of an ambiguity type detection mechanism, combined with genre and domain identification techniques. These parameters have so far been left unexploited in favor of models that adopt a uniform approach (either topic-based or fine-grained) to handling different words and types of text. This is largely due to the difficulty of identifying the disambiguation needs of specific lexical items and texts, a challenge that MultiSem intends to address.
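The selection step described above can be illustrated with a minimal sketch. It is not the project's implementation: the names `AmbiguityType`, `Occurrence`, `detect_ambiguity` and `select_representation`, the toy lexicon, and the two-way split between a high-level and a fine-grained model are all assumptions made for illustration. The sketch only shows the idea of routing each word occurrence to a different kind of model according to a detected ambiguity type and an identified domain, rather than applying one uniform approach.

```python
from dataclasses import dataclass
from enum import Enum


class AmbiguityType(Enum):
    COARSE = "coarse"  # senses that topic or domain information can separate
    FINE = "fine"      # closely related senses that need fine-grained context


@dataclass
class Occurrence:
    word: str
    domain: str  # assumed output of a domain/genre identification step


# Hypothetical toy lexicon standing in for a learned ambiguity type detector.
# Keys are (word, domain) pairs; "*" matches any domain.
TOY_AMBIGUITY = {
    ("bank", "*"): AmbiguityType.COARSE,
    ("run", "*"): AmbiguityType.FINE,
    ("run", "sports"): AmbiguityType.COARSE,
}


def detect_ambiguity(occ: Occurrence) -> AmbiguityType:
    """Look up the ambiguity type, preferring a domain-specific entry."""
    specific = TOY_AMBIGUITY.get((occ.word, occ.domain))
    if specific is not None:
        return specific
    # Unknown words default to the cheaper, high-level treatment.
    return TOY_AMBIGUITY.get((occ.word, "*"), AmbiguityType.COARSE)


def select_representation(occ: Occurrence) -> str:
    """Route the occurrence to a high-level or a fine-grained model."""
    if detect_ambiguity(occ) is AmbiguityType.FINE:
        return "fine-grained (vector-space)"
    return "high-level (topic or continuous-space)"


if __name__ == "__main__":
    for occ in (Occurrence("bank", "finance"),
                Occurrence("run", "news"),
                Occurrence("run", "sports")):
        print(occ.word, occ.domain, "->", select_representation(occ))
```

In this toy setup the same word can be routed differently depending on the domain, which mirrors the claim that disambiguation needs vary across words and text types; a real detector would be learned from data rather than read from a hand-written table.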

The models will be mainly data-driven and enriched with knowledge from large-scale semantic resources, which have been shown to improve the performance of machine learning methods for semantic processing. Combining high-level ambiguity resolution techniques (topic models and neural networks) with fine-grained (vector-based) models, and exploiting the knowledge available in these resources, will enhance the descriptive and processing capacities of the models. The research conducted in MultiSem will renew scientific perspectives in multilingual NLP, and also in linguistics and semantics, thanks to the knowledge that will be extracted from large volumes of data. The proposed multi-layer ambiguity resolution models will also be exploited to improve lexical selection in translation applications. Lexical errors are found to be the predominant type of error in automatically produced translations, and they could be avoided if Machine Translation (MT) systems were able to identify the meaning of words and larger textual units. By improving the quality of the generated translations, MultiSem will enhance the experience of numerous users of MT systems and will have an important social impact, given the current pressing demand for quality processing of large volumes of digital content.

Marianna Apidianaki (Laboratoire d'informatique pour la mécanique et les sciences de l'ingénieur)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility for its contents.

LIMSI Laboratoire d'informatique pour la mécanique et les sciences de l'ingénieur

ANR funding: 255,611 euros
Start and duration of the scientific project: February 2017, 42 months
