Grounding Actions and Action Modifiers in Instructional Videos
Grounding is the process of associating a word or a phrase with the perceptual entity it refers to. For example, the noun phrase "the brown dog" is associated with a dog-shaped region in the visual input, and the verb phrase "walk slowly" is associated with a particular sequence of body movements. Any agent operating in a realistic environment requires grounding, for instance to understand and execute commands, or to communicate successfully about pictures, diagrams, or videos. This project combines work in computer vision and natural language processing to enable multimodal grounding for complex tasks. We will use weakly supervised learning methods to infer grounding from instructional videos with verbal narration. Our models will combine vision transformers and pretrained language models to learn a mapping from regions in videos to words in narrations, with attention distributions representing grounding decisions. A small amount of annotated data will be created for evaluation purposes.

Our project image is an example of adverb grounding. The adverb "slowly" is identified as modifying the verb "turn" (rather than the verb "come" in the same sentence), and "you" and "bowl" are identified as the (pro-)nouns that depend on this verb. The verb and the adverb are grounded in the video by temporally locating them (red boxes), and the noun "bowl" is grounded by spatially locating it (green boxes). Note that the pronoun "you" needs to be recognized as ungrounded. Images and narration taken from Doughty et al. (CVPR 2020), grounding added.
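The attention-based grounding decision described above can be sketched as a simple cross-modal attention computation. This is a minimal numpy illustration, not the project's actual model: the embeddings, dimensions, and function names are hypothetical, and a real system would use learned vision-transformer and language-model features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ground_words(word_emb, region_emb, temperature=1.0):
    """For each narration word, compute an attention distribution over
    video regions; the argmax of each row is the grounding decision.
    word_emb: (n_words, d), region_emb: (n_regions, d), shared space."""
    scores = word_emb @ region_emb.T / temperature  # (n_words, n_regions)
    return softmax(scores, axis=-1)

# Toy example: 3 narration words, 4 video regions, shared 8-dim space.
rng = np.random.default_rng(0)
words = rng.normal(size=(3, 8))
regions = rng.normal(size=(4, 8))
attn = ground_words(words, regions)
assert attn.shape == (3, 4)
assert np.allclose(attn.sum(axis=1), 1.0)  # each row is a distribution
```

An ungrounded word such as the pronoun "you" would, in a fuller model, be handled by adding a "no region" column so the attention can place its mass outside the visual input.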
- Arushi Goel, Basura Fernando, Frank Keller, Hakan Bilen, Semi-supervised Multimodal Coreference Resolution in Image Narrations, (2023).
- Arushi Goel, Basura Fernando, Frank Keller, Hakan Bilen, Not All Relations are Equal: Mining Informative Labels for Scene Graph Generation, arXiv preprint arXiv:2111.13517, (2021).
- Davide Moltisanti, Frank Keller, Hakan Bilen, Laura Sevilla-Lara, Learning Action Changes by Measuring Verb-Adverb Textual Relationships, arXiv preprint arXiv:2303.15086, (2023). Code and dataset available.
- Pinelopi Papalampidi, Frank Keller, Mirella Lapata, Film Trailer Generation via Task Decomposition, arXiv preprint arXiv:2111.08774, (2021).
- Radina Dobreva, Frank Keller, Investigating Negation in Pre-trained Vision-and-language Models, Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, (2021).
- Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, Marcus Rohrbach, A New Split for Evaluating True Zero-Shot Action Recognition, DAGM German Conference on Pattern Recognition, pp. 191-205, (2021).
- Shreyank N Gowda, Laura Sevilla-Lara, Frank Keller, Marcus Rohrbach, CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition, arXiv preprint arXiv:2101.07042, (2021).
Understanding adverbs in videos - Davide Moltisanti (Talk held at the Informatics Forum, University of Edinburgh).
Abstract: Given a video showing a person performing an action, we are interested in understanding how the action is performed (e.g. chop quickly, chop finely). Current methods for this underexplored task model adverbs as invertible action modifiers in a joint visual-text embedding space. However, these methods do not guide the model to look for salient visual cues in the video to learn how actions are performed. We therefore suspect that models learn spurious data correlations rather than the actual visual signature of an adverb. We first aim to demonstrate this, showing that when videos are altered (e.g. objects are masked, playback is edited) adverb recognition performance does not drop considerably. To address this limitation, we then plan to design a mixture-of-experts method that is trained to look for specific visual cues: for example, the model should look at temporal dynamics for speed adverbs (e.g. quickly/slowly) and at spatial regions for completeness adverbs (e.g. fully/partially).
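The mixture-of-experts idea sketched in the abstract can be illustrated with a toy gating computation. This is a hypothetical sketch under simple assumptions (two experts, hand-coded cue functions, random weights); the talk describes a planned method, not this implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_predict(adverb_feat, video_feat, gate_w, experts):
    """A gating network, conditioned on the adverb representation,
    weighs cue-specific experts and mixes their outputs."""
    gates = softmax(gate_w @ adverb_feat)                 # (n_experts,)
    outputs = np.stack([e(video_feat) for e in experts])  # (n_experts, d)
    return gates @ outputs                                # (d,)

# Toy experts: a 'temporal' expert looks at frame-to-frame change
# (a cue for speed adverbs), a 'spatial' expert averages frame
# features (a cue for completeness adverbs).
temporal = lambda v: np.abs(np.diff(v, axis=0)).mean(axis=0)
spatial = lambda v: v.mean(axis=0)

frames = np.ones((5, 4))   # fake video: 5 frames, 4-dim features
frames[2:] = 2.0           # an abrupt change mid-video
rng = np.random.default_rng(1)
pred = moe_predict(rng.normal(size=6), frames, rng.normal(size=(2, 6)),
                   [temporal, spatial])
assert pred.shape == (4,)
```

In a trained model the gate weights and experts would be learned jointly, so that e.g. "quickly" routes most of its mass to the temporal expert.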
Learning Action Changes by Measuring Verb-Adverb Textual Relationships - Davide Moltisanti (Talk held at the Informatics Forum, University of Edinburgh).
Abstract: The goal of this work is to understand the way actions are performed in videos. That is, given a video, we aim to predict an adverb indicating a modification applied to the action (e.g. cut "finely"). We cast this problem as a regression task. We measure textual relationships between verbs and adverbs to generate a regression target representing the action change we aim to learn. We test our approach on a range of datasets and achieve state-of-the-art results on both adverb prediction and antonym classification. Furthermore, we outperform previous work when we lift two commonly assumed conditions: the availability of action labels during testing and the pairing of adverbs as antonyms. Existing datasets for adverb recognition are either noisy, which makes learning difficult, or contain actions whose appearance is not influenced by adverbs, which makes evaluation less reliable. To address this, we collect a new high-quality dataset: Adverbs in Recipes (AIR). We focus on instructional recipe videos, curating a set of actions that exhibit meaningful visual changes when performed differently. Videos in AIR are more tightly trimmed and were manually reviewed by multiple annotators to ensure high labelling quality. Results show that models learn better from AIR given its cleaner videos. At the same time, adverb prediction on AIR is challenging for models, demonstrating that there is considerable room for improvement.
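One way to picture the textual regression target mentioned above is as the displacement an adverb induces on a verb in text embedding space. This is a hedged illustration only: the vectors below are hand-made stand-ins, and `action_change_target` is a hypothetical name, not the paper's actual formulation.

```python
import numpy as np

def action_change_target(verb_vec, phrase_vec):
    """Illustrative regression target: how far the adverb-modified
    phrase (e.g. 'cut finely') moves away from the bare verb
    ('cut') in embedding space, as 1 - cosine similarity."""
    cos = (verb_vec @ phrase_vec
           / (np.linalg.norm(verb_vec) * np.linalg.norm(phrase_vec)))
    return 1.0 - cos

# Toy embeddings standing in for a language model's vectors.
verb = np.array([1.0, 0.0, 0.0])     # 'cut'
phrase = np.array([0.8, 0.6, 0.0])   # 'cut finely'
t = action_change_target(verb, phrase)
assert abs(t - 0.2) < 1e-9  # a non-zero change to regress against
```

A video model would then be trained to regress this scalar from the clip, tying the visual action change to the verb-adverb textual relationship.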
Modelling object changes to understand adverbs in videos - Davide Moltisanti (Talk held at the Informatics Forum, University of Edinburgh).
Abstract: Adverb recognition is the task of understanding how an action is performed in a video (e.g. "chop something finely or coarsely"). Objects carry a strong visual signal regarding the way actions are performed. For example, if we chop parsley coarsely, the final state of the vegetable will look quite different compared to how it would look if we chopped it finely. In other words, the way objects transition from one state to another can help us understand the way actions are performed. Current approaches for this task ignore this signal, and in this talk we will explore ideas on how we can model object changes to understand action changes in videos.