TY - GEN
T1 - Visuo-Linguistic Question Answering (VLQA) challenge
AU - Sampat, Shailaja Keyur
AU - Yang, Yezhou
AU - Baral, Chitta
N1 - Funding Information:
We thank the anonymous reviewers for their feedback. This work is partially supported by the National Science Foundation grant IIS-1816039.
Publisher Copyright:
© 2020 Association for Computational Linguistics
PY - 2020
Y1 - 2020
N2 - Understanding images and text together is an important aspect of cognition and of building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over the language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and natural language processing (NLP) systems. We propose a novel task to derive joint inferences over a given image-text pair and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first evaluate the best existing vision-language architectures on VLQA subsets and show that they are unable to reason well. We then develop a modular method with slightly better baseline performance, but it is still far behind human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code, and leaderboard are available at https://shailaja183.github.io/vlqa/.
AB - Understanding images and text together is an important aspect of cognition and of building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over the language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and natural language processing (NLP) systems. We propose a novel task to derive joint inferences over a given image-text pair and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first evaluate the best existing vision-language architectures on VLQA subsets and show that they are unable to reason well. We then develop a modular method with slightly better baseline performance, but it is still far behind human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code, and leaderboard are available at https://shailaja183.github.io/vlqa/.
UR - http://www.scopus.com/inward/record.url?scp=85115362650&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115362650&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85115362650
T3 - Findings of the Association for Computational Linguistics: EMNLP 2020
SP - 4606
EP - 4616
BT - Findings of the Association for Computational Linguistics: EMNLP 2020
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the Association for Computational Linguistics: EMNLP 2020
Y2 - 16 November 2020 through 20 November 2020
ER -