TY - GEN
T1 - VQA-LOL: Visual Question Answering Under the Lens of Logic
T2 - 16th European Conference on Computer Vision, ECCV 2020
AU - Gokhale, Tejas
AU - Banerjee, Pratyay
AU - Baral, Chitta
AU - Yang, Yezhou
N1 - Funding Information:
Acknowledgments. Support from NSF Robust Intelligence Program (1816039 and 1750082), DARPA (W911NF2020006) and ONR (N00014-20-1-2332) is gratefully acknowledged.
Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding.
AB - Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding.
KW - Logical robustness
KW - Visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85097401041&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097401041&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-58589-1_23
DO - 10.1007/978-3-030-58589-1_23
M3 - Conference contribution
AN - SCOPUS:85097401041
SN - 9783030585884
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 379
EP - 396
BT - Computer Vision – ECCV 2020 – 16th European Conference, Proceedings
A2 - Vedaldi, Andrea
A2 - Bischof, Horst
A2 - Brox, Thomas
A2 - Frahm, Jan-Michael
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 23 August 2020 through 28 August 2020
ER -