Friday, 14th June - 11am - Benoît Sagot: Seminar

TITLE: Doing more with sentence embeddings

 

ABSTRACT:

Sentence representation learning is a well-researched area in NLP. Some studies focus on pre-training objectives to create contextual representations, while others aim to develop multilingual sentence embeddings that closely encode paraphrases and translations. Their main applications are sentence classification and the mining of parallel textual data. However, sentence embeddings are capable of more than just that.

In this talk, I will first describe how we extended the LASER sentence embedding space to the speech modality in order to build encoders that embed speech and text sentences in multiple languages into a single sentence embedding space. I will show how we can successfully train decoders that generate text or speech in different languages from such sentence embeddings. I will demonstrate that we can perform zero-shot cross-modal translation by combining our encoders and decoders in a modular way, an approach we call T-modules, and show that we achieve competitive results in all translation tasks despite the fixed-size sentence embedding bottleneck and the absence of cross-modal labelled translation data during training.

I will then describe how we further extended the scope of our approach by adapting it to user-generated content (UGC), a type of data characterised by a high level of lexical variation that deviates from the standard texts on which most models, including our encoders, were trained. We show that, trained only on standard and synthetic UGC-like data, our robust sentence encoder for English, called RoLASER, significantly improves LASER's robustness to both natural and artificial UGC data. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
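The modular composition described above can be caricatured in a few lines of code. This is only a conceptual sketch of the interface, not the actual models: every function here is a hypothetical stand-in (the "encoders" just hash their input), and it illustrates only the key structural idea that any encoder can be paired with any decoder through one fixed-size embedding, even pairs never trained together.

```python
import hashlib

EMBED_DIM = 8  # real LASER embeddings are 1024-dimensional; tiny here for illustration

# Hypothetical stand-in encoders: each maps its input to one fixed-size
# vector in the shared space. The real encoders are trained networks;
# these toys only mimic the modular interface.
def encode_text_fr(sentence: str) -> list[float]:
    digest = hashlib.sha256(sentence.encode()).digest()
    return [b / 255.0 for b in digest[:EMBED_DIM]]

def encode_speech_en(samples: list[float]) -> list[float]:
    digest = hashlib.sha256(repr(samples).encode()).digest()
    return [b / 255.0 for b in digest[:EMBED_DIM]]

# Hypothetical stand-in decoder: generates output in one language/modality
# from ANY embedding in the shared space, regardless of which encoder made it.
def decode_text_en(embedding: list[float]) -> str:
    assert len(embedding) == EMBED_DIM
    return "<English text decoded from a shared-space embedding>"

# Zero-shot cross-modal translation = composing an encoder and a decoder
# through the fixed-size bottleneck, with no jointly trained pair required.
print(decode_text_en(encode_text_fr("Bonjour le monde")))   # text -> text
print(decode_text_en(encode_speech_en([0.0, 0.1, -0.2])))   # speech -> text
```

The design point is that the fixed-size embedding acts as an interchange format: adding one new encoder (a new language or modality) immediately composes with every existing decoder.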

The work on cross-modal MT between text and speech using sentence embeddings was carried out in the context of Paul-Ambroise Duquenne's PhD thesis at META and Inria, which I co-supervised with Holger Schwenk (META). The work on making sentence embeddings robust to UGC is being carried out in the context of Lydia Nishimwe's PhD thesis at Inria, which I am co-supervising with Rachel Bawden (Inria). Both lines of work fall within my chair and Rachel's chair at the PRAIRIE institute, which is funding Lydia's PhD.

BIO:

Benoît Sagot is a computer scientist specialised in natural language processing (NLP). He is a Senior Researcher (Directeur de Recherche) at Inria, where he heads the Inria research team ALMAnaCH in Paris, France. He also holds a chair in the PRAIRIE institute dedicated to artificial intelligence, and currently holds the annual chair for computer science at the Collège de France. His research focuses on language modelling, machine translation, language resource development and computational linguistics, with particular attention to French in all its forms and to less-resourced languages.


This event is co-organised by ILCC and by the UKRI Centre for Doctoral Training in Natural Language Processing, https://nlp-cdt.ac.uk.

Location: IF G.03, with a Teams invite.