Understanding Images with Text - Visually Grounded Reading Comprehension


Written natural language is omnipresent in human environments, so, not surprisingly, many of the questions visually impaired users ask about images involve reading text in the image. In this talk I will introduce two novel vision & language tasks and datasets, one for visual question answering and one for image captioning, which require reasoning about images and their text. Our datasets challenge a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We find that text in images is very different from, e.g., questions or standard image captioning datasets, which requires treating it as a different modality. We propose a model to address several of these challenges, and I will discuss its success and failure modes. Our human evaluation of captioning models reveals that, in contrast to standard captioning datasets, there is still a large gap to human performance, and shows that automatic metrics correlate well with human judgments. I will conclude with open challenges and relate them to some other recent work, including long-tail, multi-tasking, and video & language.


Marcus Rohrbach is a research scientist at Facebook AI Research. Previously, he was a postdoc at the University of California, Berkeley (EECS and ICSI) with Trevor Darrell (2014-2017), and he did his PhD at the Max Planck Institute for Informatics with Bernt Schiele (2010-2014). His interests include computer vision, computational linguistics, and machine learning, and how these areas can best work together.

This event is co-organised by ILCC and the UKRI Centre for Doctoral Training in Natural Language Processing, https://nlp-cdt.ac.uk

