Friday, 22nd March, 11am - Alessandro Suglia: Seminar

Title: “Multimodal Embodied Models: Enabling Embodied Artificial Agents to Act and Reason Using Multimodal Perception”

Abstract:

Recently, the research community has built models that fuse the visual and language modalities, facilitating the development of AI systems that can solve multimodal tasks such as caption generation and visual question answering. However, most of these approaches ignore the fact that, as humans, we learn by interacting with the world and with other agents. Interaction can be modelled as an interleaved sequence of perceptual experiences. In this talk, I will use the concept of language games to define a theoretical framework for investigating the weaknesses and strengths of current Vision+Language tasks in the literature [1]. This analysis justifies the need for benchmarks that treat embodiment and interaction as first-class citizens, and I will present EMMA, our foundation model for language-guided embodied task completion, which we developed during the Amazon SimBot Challenge [2]. Finally, I will present research outcomes that feed into my research agenda, which aims to create Embodied Multimodal Artificial Agents inspired by Barsalou’s Grounded Cognition theory (e.g., PIXAR [3]).

Bio:

Alessandro Suglia is an Assistant Professor at Heriot-Watt University (HWU) and Head of Visual Dialogue at Alana AI, a startup developing Multimodal Foundation Models for healthcare. He is also a member of the ELLIS network and the academic liaison between HWU and the Alan Turing Institute. Alessandro’s research focuses on designing artificial agents that learn language by leveraging sensory information derived from interacting with the world and with other agents. During his PhD, he was one of the main developers of Alana, the Heriot-Watt conversational AI that ranked 3rd in the Amazon Alexa Prize challenge in 2018. As an Assistant Professor at HWU, he led the HWU team “EMMA”, the only non-American university team among the finalists of the Amazon SimBot Challenge, the first Amazon competition to push the boundaries of Embodied Conversational AI. Alongside several academic collaborations, he has also completed research collaborations with Amazon Alexa AI, Meta AI, and the European Space Agency, focused on developing innovative Multimodal Generative AI models for embodied and situated human-robot interaction tasks.

References:

[1]: Suglia, A., Konstas, I., & Lemon, O. (2024). Visually Grounded Language Learning: a review of language games, datasets, tasks, and models. Journal of Artificial Intelligence Research, 79, 173-239.

[2]: Pantazopoulos, G., Nikandrou, M., Parekh, A., Hemanthage, B., Eshghi, A., Konstas, I., ... & Suglia, A. (2023, December). Multitask Multimodal Prompted Training for Interactive Embodied Task Completion. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 768-789).

[3]: Tai, Y., Liao, X., Suglia, A., & Vergari, A. (2024). PIXAR: Auto-Regressive Language Modeling in Pixel Space. arXiv preprint arXiv:2401.03321.

Mar 22 2024

This event is co-organised by ILCC and the UKRI Centre for Doctoral Training in Natural Language Processing (https://nlp-cdt.ac.uk).

IF G.03