Grounding Actions and Action Modifiers in Instructional Videos

Led by Hakan Bilen, Frank Keller, and Laura Sevilla with Davide Moltisanti (Postdoctoral Researcher)


Grounding is the process of associating a word or a phrase with the perceptual entity it refers to. For example, the noun phrase "the brown dog" is associated with a dog-shaped region in the visual input, and the verb phrase "walk slowly" is associated with a particular sequence of body movements. Any agent operating in a realistic environment requires grounding, for instance to understand and execute commands, or to communicate successfully about pictures, diagrams, or videos. This project combines work in computer vision and natural language processing to enable multimodal grounding for complex tasks. We will use weakly supervised learning methods to infer grounding from instructional videos with verbal narration. Our models will combine vision transformers and pretrained language models to learn a mapping from regions in videos to words in narrations, with attention distributions representing grounding decisions. A small amount of annotated data will be created for evaluation purposes.

Our project image is an example of adverb grounding. The adverb "slowly" is identified as modifying the verb "turn" (rather than the verb "come" in the same sentence), and "you" and "bowl" are identified as the (pro-)nouns that depend on this verb. The verb and the adverb are grounded in the video by temporally locating them (red boxes), and the noun "bowl" is grounded by spatially locating it (green boxes). Note that the pronoun "you" needs to be recognized as ungrounded. Images and narration taken from Doughty et al. (CVPR 2020), grounding added.
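To make the idea of attention distributions as grounding decisions more concrete, here is a minimal sketch, not the project's actual model. It assumes that word and region features are already available as plain vectors (in practice they would come from a pretrained language model and a vision transformer), scores each word against each region with scaled dot-product attention, and treats a word as ungrounded when no region receives enough attention. The function name `ground_words`, the toy vectors, and the `null_threshold` rule are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground_words(word_vecs, region_vecs, null_threshold=0.5):
    """Attend from each word over all regions.

    Returns one (attention, region_index) pair per word, where region_index
    is None if the highest attention weight falls below null_threshold.
    The threshold rule is an assumption of this sketch.
    """
    dim = len(region_vecs[0])
    results = []
    for word in word_vecs:
        # Scaled dot-product score between the word and every region.
        scores = [
            sum(w * r for w, r in zip(word, region)) / math.sqrt(dim)
            for region in region_vecs
        ]
        attention = softmax(scores)
        best = max(range(len(attention)), key=attention.__getitem__)
        grounded = best if attention[best] >= null_threshold else None
        results.append((attention, grounded))
    return results


# Toy features (hypothetical): three video regions and two words.
region_vecs = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
word_vecs = [
    [0.0, 4.0, 0.0],  # "bowl": aligned with region 1
    [0.0, 0.0, 0.0],  # "you": no preference for any region
]

results = ground_words(word_vecs, region_vecs)
for name, (attention, grounded) in zip(["bowl", "you"], results):
    print(name, grounded)  # bowl -> 1, you -> None
```

In this toy setting the attention for "bowl" concentrates on one region, while the attention for "you" is uniform and so falls below the threshold, mirroring the requirement that some words be recognized as ungrounded.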
