27 September 2019 - Ivan Vulic: Seminar
Title: Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?
Cross-lingual word representations offer an elegant and language-pair independent way to represent content across different languages. They enable us to reason over word meaning in multilingual contexts and serve as an integral source of knowledge for enabling language technology in low-resource languages through cross-lingual transfer. A current research focus is on resource-lean projection-based embedding models which require only cheap word-level bilingual supervision. In the extreme, fully unsupervised methods require no supervision at all: this property makes such approaches conceptually attractive and potentially applicable to a wide spectrum of language pairs and cross-lingual scenarios. However, their only core difference from weakly supervised projection-based methods is in the way they obtain a seed dictionary used to initialize an iterative self-learning procedure. While the primary use case of fully unsupervised approaches should be low-resource target languages and distant language pairs, in this talk we show that even the most robust and effective fully unsupervised approaches still struggle in these challenging settings, lacking robustness and yielding suboptimal solutions. What is more, we empirically demonstrate that even when fully unsupervised methods succeed, they never surpass the performance of weakly supervised methods (seeded with only 500-1,000 translation pairs) using the same self-learning procedure. These findings call for improving robustness and revisiting the main motivations behind fully unsupervised cross-lingual word embedding methods.
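To make the abstract's terminology concrete, the following is a minimal sketch of a projection-based method with iterative self-learning: an orthogonal mapping between two embedding spaces is fitted from a seed dictionary (the Procrustes solution), then a new dictionary is induced via nearest neighbours and the mapping is refitted. The function names and the simple argmax dictionary induction are illustrative simplifications, not the specific systems discussed in the talk.

```python
import numpy as np

def procrustes(X, Y):
    # Orthogonal W minimizing ||X W - Y||_F: W = U V^T from SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def self_learning(src, tgt, seed_pairs, n_iters=5):
    """src, tgt: (vocab, dim) embedding matrices with L2-normalized rows.
    seed_pairs: list of (src_index, tgt_index) translation pairs
    (a weakly supervised seed dictionary, e.g. 500-1,000 pairs;
    unsupervised methods would induce this initial dictionary instead)."""
    pairs = list(seed_pairs)
    for _ in range(n_iters):
        X = src[[i for i, _ in pairs]]
        Y = tgt[[j for _, j in pairs]]
        W = procrustes(X, Y)
        # Self-learning step: induce a new dictionary by taking the
        # nearest target neighbour of every mapped source word.
        sims = (src @ W) @ tgt.T
        pairs = [(i, int(np.argmax(sims[i]))) for i in range(src.shape[0])]
    return W
```

On clean synthetic data (one space being an exact rotation of the other), a handful of seed pairs suffices to recover the mapping; the talk's point is that on real distant language pairs the induced dictionary, and hence the whole procedure, is far less reliable.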
Ivan Vulić is a Senior Research Associate in the Language Technology Lab, University of Cambridge, and a Senior Scientist at PolyAI. He holds a PhD in Computer Science from KU Leuven, awarded summa cum laude. His core expertise is in representation learning, cross-lingual learning, human language understanding, distributional, lexical, multi-modal, and knowledge-enhanced semantics in monolingual and multilingual contexts, and transfer learning for enabling cross-lingual NLP applications such as conversational AI in low-resource languages. He has published more than 70 papers at top-tier NLP and IR conferences and in journals. He co-lectured a tutorial on monolingual and multilingual topic models and applications at ECIR 2013 and WSDM 2014; a tutorial on word vector space specialization at EACL 2017, ESSLLI 2018, and EMNLP 2019; and tutorials on cross-lingual representation learning and cross-lingual NLP at EMNLP 2017 and ACL 2019. He also co-lectured a tutorial on conversational AI at NAACL 2018. He recently co-authored a book on cross-lingual word representations for the Morgan & Claypool Handbook series (published in June 2019). He serves as an area chair and regularly reviews for major NLP and Machine Learning conferences (ACL, EMNLP, NAACL-HLT, EACL, COLING, ICLR, ICML, NeurIPS) and journals (Computational Linguistics, JAIR, Transactions of the ACL, Computer Speech & Language). Ivan has given invited talks in academia and industry, including Apple Inc., University of Cambridge, UCL, University of Copenhagen, Paris-Saclay, Bar-Ilan University, University of Helsinki, KU Leuven, University of Stuttgart, the London REWORK summit, and others.