TY - GEN
T1 - Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning
T2 - 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020
AU - Fang, Zhiyuan
AU - Gokhale, Tejas
AU - Banerjee, Pratyay
AU - Baral, Chitta
AU - Yang, Yezhou
N1 - Funding Information:
The authors acknowledge support from the NSF Robust Intelligence Program project #1816039, the DARPA KAIROS program (LESTAT project), the DARPA SAIL-ON program, and ONR award N00014-20-1-2332. ZF, TG, YY thank the organizers and the participants of the Telluride Neuromorphic Cognition Workshop, especially the Machine Common Sense (MCS) group.
Publisher Copyright:
© 2020 Association for Computational Linguistics
PY - 2020
Y1 - 2020
N2 - Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus, for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset “Video-to-Commonsense (V2C)” that contains ∼9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally, we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.
UR - http://www.scopus.com/inward/record.url?scp=85099829754&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099829754&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85099829754
T3 - EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 840
EP - 860
BT - EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
Y2 - 16 November 2020 through 20 November 2020
ER -