TY - GEN
T1 - Spatial knowledge distillation to aid visual reasoning
AU - Aditya, Somak
AU - Saha, Rudra
AU - Yang, Yezhou
AU - Baral, Chitta
N1 - Funding Information:
The support of the National Science Foundation under the Robust Intelligence Program (1816039 and 1750082), and a gift from Verisk AI are gratefully acknowledged. We also acknowledge NVIDIA for the donation of GPUs.
Publisher Copyright:
© 2019 IEEE
PY - 2019/3/4
Y1 - 2019/3/4
N2 - For tasks involving language and vision, the current state-of-the-art methods tend not to leverage any additional information that might be present to gather relevant (commonsense) knowledge. A representative task is Visual Question Answering, where large diagnostic datasets have been proposed to test a system’s capability of answering questions about images. The training data is often accompanied by annotations of individual object properties and spatial locations. In this work, we take a step towards integrating this additional privileged information in the form of spatial knowledge to aid in visual reasoning. We propose a framework that combines recent advances in knowledge distillation (teacher-student framework), relational reasoning, and probabilistic logical languages to incorporate such knowledge in existing neural networks for the task of Visual Question Answering. Specifically, for a question posed against an image, we use a probabilistic logical language to encode the spatial knowledge and the spatial understanding about the question in the form of a mask that is directly provided to the teacher network. The student network learns from the ground-truth information as well as the teacher’s prediction via distillation. We also demonstrate the impact of predicting such a mask inside the teacher’s network using attention. Empirically, we show that both methods improve the test accuracy over a state-of-the-art approach on a publicly available dataset.
AB - For tasks involving language and vision, the current state-of-the-art methods tend not to leverage any additional information that might be present to gather relevant (commonsense) knowledge. A representative task is Visual Question Answering, where large diagnostic datasets have been proposed to test a system’s capability of answering questions about images. The training data is often accompanied by annotations of individual object properties and spatial locations. In this work, we take a step towards integrating this additional privileged information in the form of spatial knowledge to aid in visual reasoning. We propose a framework that combines recent advances in knowledge distillation (teacher-student framework), relational reasoning, and probabilistic logical languages to incorporate such knowledge in existing neural networks for the task of Visual Question Answering. Specifically, for a question posed against an image, we use a probabilistic logical language to encode the spatial knowledge and the spatial understanding about the question in the form of a mask that is directly provided to the teacher network. The student network learns from the ground-truth information as well as the teacher’s prediction via distillation. We also demonstrate the impact of predicting such a mask inside the teacher’s network using attention. Empirically, we show that both methods improve the test accuracy over a state-of-the-art approach on a publicly available dataset.
UR - http://www.scopus.com/inward/record.url?scp=85063574220&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063574220&partnerID=8YFLogxK
U2 - 10.1109/WACV.2019.00030
DO - 10.1109/WACV.2019.00030
M3 - Conference contribution
AN - SCOPUS:85063574220
T3 - Proceedings - 2019 IEEE Winter Conference on Applications of Computer Vision, WACV 2019
SP - 227
EP - 235
BT - Proceedings - 2019 IEEE Winter Conference on Applications of Computer Vision, WACV 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th IEEE Winter Conference on Applications of Computer Vision, WACV 2019
Y2 - 7 January 2019 through 11 January 2019
ER -