TY - GEN
T1 - Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
T2 - 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020
AU - Fang, Zhiyuan
AU - Gokhale, Tejas
AU - Banerjee, Pratyay
AU - Baral, Chitta
AU - Yang, Yezhou
N1 - Funding Information:
The authors acknowledge support from the NSF Robust Intelligence Program project #1816039, the DARPA KAIROS program (LESTAT project), the DARPA SAIL-ON program, and ONR award N00014-20-1-2332. ZF, TG, YY thank the organizers and the participants of the Telluride Neuromorphic Cognition Workshop, especially the Machine Common Sense (MCS) group.
Publisher Copyright:
© 2020 Association for Computational Linguistics
PY - 2020
Y1 - 2020
N2 - Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus, for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset “Video-to-Commonsense (V2C)” that contains ∼9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally, we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.
UR - http://www.scopus.com/inward/record.url?scp=85099829754&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099829754&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85099829754
T3 - EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 840
EP - 860
BT - EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
Y2 - 16 November 2020 through 20 November 2020
ER -