TY - GEN
T1 - Visuo-Linguistic Question Answering (VLQA) challenge
AU - Sampat, Shailaja Keyur
AU - Yang, Yezhou
AU - Baral, Chitta
N1 - Funding Information:
We thank the anonymous reviewers for their feedback. This work is partially supported by the National Science Foundation grant IIS-1816039.
Publisher Copyright:
© 2020 Association for Computational Linguistics
PY - 2020
Y1 - 2020
N2 - Understanding images and text together is an important aspect of cognition and of building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over the language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and natural language processing (NLP) systems. We propose a novel task to derive joint inferences over a given image-text pair and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first evaluate the best existing vision-language architectures on VLQA subsets and show that they are unable to reason well. We then develop a modular method with slightly better baseline performance, but it is still far behind human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code, and leaderboard are available at https://shailaja183.github.io/vlqa/.
AB - Understanding images and text together is an important aspect of cognition and of building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over the language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and natural language processing (NLP) systems. We propose a novel task to derive joint inferences over a given image-text pair and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first evaluate the best existing vision-language architectures on VLQA subsets and show that they are unable to reason well. We then develop a modular method with slightly better baseline performance, but it is still far behind human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code, and leaderboard are available at https://shailaja183.github.io/vlqa/.
UR - http://www.scopus.com/inward/record.url?scp=85115362650&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115362650&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85115362650
T3 - Findings of the Association for Computational Linguistics: EMNLP 2020
SP - 4606
EP - 4616
BT - Findings of the Association for Computational Linguistics: EMNLP 2020
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the Association for Computational Linguistics: EMNLP 2020
Y2 - 16 November 2020 through 20 November 2020
ER -