12 August 2020 - Marcus Rohrbach: Seminar
Speaker
Marcus Rohrbach
Title
Understanding Images with Text - Visually Grounded Reading Comprehension
Abstract
Written natural language is omnipresent in human environments, and thus, not surprisingly, many of the questions visually impaired users ask about images involve reading text in the image. In this talk I will introduce two novel vision & language tasks and datasets, one for visual question answering and one for image captioning, which require reasoning about images and the text they contain. Our datasets challenge a model to recognize text, relate it to its visual context, and decide which parts of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities such as objects. We find that text in images is very different from, e.g., questions or standard image captioning datasets, and must be treated as a separate modality. We propose a model that addresses several of these challenges, and I will discuss its success and failure modes. Our human evaluation of captioning models reveals that, in contrast to standard captioning datasets, there is still a large gap to human performance, and shows that automatic metrics correlate well with human judgments. I will conclude with open challenges and relate this work to other recent research, including work on long-tail recognition, multi-tasking, and video & language.
Biography
Marcus Rohrbach is a research scientist at Facebook AI Research. Previously, he was a postdoc at the University of California, Berkeley (EECS and ICSI) with Trevor Darrell (2014-2017), and he did his PhD at the Max Planck Institute for Informatics with Bernt Schiele (2010-2014). His interests include computer vision, computational linguistics, and machine learning, and how these areas can best work together.
Blackboard Link
https://eu.bbcollab.com/guest/25ad6b0ad9f44f4e8cfa2e96f8164088