Video2Commonsense: Generating commonsense descriptions to enrich video captioning

Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

29 Scopus citations

Abstract

Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes, such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, videos contain actions that are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus, for video understanding, such as when captioning videos or answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset, “Video-to-Commonsense (V2C)”, that contains ∼9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally, we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.

Original language: English (US)
Title of host publication: EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 840-860
Number of pages: 21
ISBN (Electronic): 9781952148606
State: Published - 2020
Event: 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020 - Virtual, Online
Duration: Nov 16 2020 → Nov 20 2020

Publication series

Name: EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference: 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020
City: Virtual, Online
Period: 11/16/20 → 11/20/20

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics
