CAVAN: Commonsense Knowledge Anchored Video Captioning

Huiliang Shao, Zhiyuan Fang, Yezhou Yang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

A video clip carries not merely an aggregation of static entities, but also a variety of interactions and relations among these entities. Challenges remain for a video captioning system to generate descriptions that focus on the prominent interest and align with the latent aspects beyond observations. In this work, we present a Commonsense knowledge Anchored Video cAptioNing (CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model with a novel paradigm for sentence-level semantic alignment. Specifically, we acquire commonsense knowledge complementing each training caption by querying a generic knowledge atlas (ATOMIC [1]), and form a commonsense-caption entailment corpus. A BERT [2] based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model, and penalizes the model for generating semantically misaligned captions. Experimental results with ablations on the MSRVTT [3], V2C [4] and VATEX [5] datasets validate the effectiveness of CAVAN and reveal that the use of commonsense knowledge benefits video caption generation.

Original language: English (US)
Title of host publication: 2022 26th International Conference on Pattern Recognition, ICPR 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4095-4102
Number of pages: 8
ISBN (Electronic): 9781665490627
State: Published - 2022
Externally published: Yes
Event: 26th International Conference on Pattern Recognition, ICPR 2022 - Montreal, Canada
Duration: Aug 21 2022 to Aug 25 2022

Publication series

Name: Proceedings - International Conference on Pattern Recognition
Volume: 2022-August
ISSN (Print): 1051-4651

Conference

Conference: 26th International Conference on Pattern Recognition, ICPR 2022
Country/Territory: Canada
City: Montreal
Period: 8/21/22 to 8/25/22

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
