TY - GEN
T1 - VQA-LOL: Visual Question Answering Under the Lens of Logic
T2 - 16th European Conference on Computer Vision, ECCV 2020
AU - Gokhale, Tejas
AU - Banerjee, Pratyay
AU - Baral, Chitta
AU - Yang, Yezhou
N1 - Funding Information:
Acknowledgments. Support from NSF Robust Intelligence Program (1816039 and 1750082), DARPA (W911NF2020006) and ONR (N00014-20-1-2332) is gratefully acknowledged.
Publisher Copyright:
© 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding.
AB - Logical connectives and their implications for the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding.
KW - Logical robustness
KW - Visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85097401041&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097401041&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-58589-1_23
DO - 10.1007/978-3-030-58589-1_23
M3 - Conference contribution
AN - SCOPUS:85097401041
SN - 9783030585884
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 379
EP - 396
BT - Computer Vision – ECCV 2020 – 16th European Conference, Proceedings
A2 - Vedaldi, Andrea
A2 - Bischof, Horst
A2 - Brox, Thomas
A2 - Frahm, Jan-Michael
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 23 August 2020 through 28 August 2020
ER -