DS0707 - Human-machine interaction, connected objects, digital content, big data and knowledge

Flexibility for expressive speech synthesis – SynPaFlex

Submission summary

Nowadays, text-to-speech synthesis achieves very good quality, and the use of very large corpora contributes largely to this success. Despite that, emotion, intention and speaking style are missing from the generated synthetic speech. We are currently unable to synthesize a voice with all the expressivity needed for applications such as audiobook reading without recording a very large corpus in a specific style.

Some works in the literature investigate phenomena linked to expressivity and bring interesting results that partly characterize how expressivity functions and how it is realized in the acoustic material. Here, we consider the joint treatment of emotion, intention and elocution style, since they are all linked in practical situations, with a view to their integration into speech synthesis systems.

The idea of the SynPaFlex project is to investigate the different characteristics that make up voice expressivity in order to build a speaker-specific prosodic model, as well as a speaker-specific phonemic model that predicts modifications to the phoneme sequence. These models will then be used to integrate expressivity into text-to-speech synthesis systems, notably concatenative ones. Finally, complementary work will focus on the post-processing steps needed to overcome imperfections in the selected units; a voice-conversion-based approach will be investigated as a selection post-processing step.

These steps will improve our knowledge of how a speech synthesis system can be modified, in terms of speech-unit attributes, the unit-selection cost function and post-processing. The work will be conducted for both French and English to preserve a degree of genericity.
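For context on the unit-selection cost function mentioned above: concatenative synthesis typically picks a sequence of recorded units by minimizing a weighted sum of a target cost (how well each unit matches the desired specification) and a concatenation cost (how smoothly adjacent units join). The sketch below is a minimal, generic illustration of that standard scheme, not the project's actual cost function; the Euclidean feature distances and the weights `wt`/`wc` are illustrative assumptions.

```python
import math

def target_cost(unit, spec):
    # Euclidean distance between a unit's feature vector (e.g. pitch,
    # duration) and the target specification. Illustrative choice.
    return math.dist(unit, spec)

def concat_cost(prev, cur):
    # Mismatch at the join, approximated here by feature distance.
    return math.dist(prev, cur)

def select_units(candidates, target_specs, wt=1.0, wc=1.0):
    """Viterbi search over candidate units, minimizing the total
    weighted sum of target and concatenation costs.

    candidates[i]   -- list of candidate units (feature tuples) for slot i
    target_specs[i] -- desired feature tuple for slot i
    """
    n = len(target_specs)
    # best[i][j] = (cumulative cost, backpointer) for candidate j at slot i
    best = [[(wt * target_cost(u, target_specs[0]), -1)
             for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            tc = wt * target_cost(u, target_specs[i])
            cost, back = min(
                (best[i - 1][k][0] + wc * concat_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1]))
            row.append((cost + tc, back))
        best.append(row)
    # Backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Integrating expressivity into such a system amounts to enriching the unit attributes and re-weighting or extending these cost terms, which is one of the modification points the project studies.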

The major contribution of the project lies in the feasibility of expressive text-to-speech synthesis, whose applications are, for the moment, not widespread. Prospects are expected in video games (artificial voice diversification, expressive voices adapted to the game situation), language learning (dictation, elocution style) and vocal aids designed for people with disabilities.

Project coordination

Damien LOLIVE (Institut de Recherche en Informatique et Systèmes Aléatoires)

The author of this summary is the project coordinator, who is responsible for its content. The ANR declines any responsibility as to its content.

Partner

IRISA – Institut de Recherche en Informatique et Systèmes Aléatoires

ANR funding: 245,648 euros
Beginning and duration of the scientific project: September 2015 - 42 Months

Useful links

Explore our database of funded projects

ANR makes its datasets on funded projects available online.
