TY - GEN
T1 - Spatial knowledge distillation to aid visual reasoning
AU - Aditya, Somak
AU - Saha, Rudra
AU - Yang, Yezhou
AU - Baral, Chitta
N1 - Funding Information:
The support of the National Science Foundation under the Robust Intelligence Program (1816039 and 1750082), and a gift from Verisk AI are gratefully acknowledged. We also acknowledge NVIDIA for the donation of GPUs.
Publisher Copyright:
© 2019 IEEE
PY - 2019/3/4
Y1 - 2019/3/4
N2 - For tasks involving language and vision, the current state-of-the-art methods tend not to leverage any additional information that might be present to gather relevant (commonsense) knowledge. A representative task is Visual Question Answering, where large diagnostic datasets have been proposed to test a system’s capability of answering questions about images. The training data is often accompanied by annotations of individual object properties and spatial locations. In this work, we take a step towards integrating this additional privileged information in the form of spatial knowledge to aid in visual reasoning. We propose a framework that combines recent advances in knowledge distillation (teacher-student framework), relational reasoning, and probabilistic logical languages to incorporate such knowledge in existing neural networks for the task of Visual Question Answering. Specifically, for a question posed against an image, we use a probabilistic logical language to encode the spatial knowledge and the spatial understanding about the question in the form of a mask that is directly provided to the teacher network. The student network learns from the ground-truth information as well as the teacher’s prediction via distillation. We also demonstrate the impact of predicting such a mask inside the teacher’s network using attention. Empirically, we show that both methods improve the test accuracy over a state-of-the-art approach on a publicly available dataset.
AB - For tasks involving language and vision, the current state-of-the-art methods tend not to leverage any additional information that might be present to gather relevant (commonsense) knowledge. A representative task is Visual Question Answering, where large diagnostic datasets have been proposed to test a system’s capability of answering questions about images. The training data is often accompanied by annotations of individual object properties and spatial locations. In this work, we take a step towards integrating this additional privileged information in the form of spatial knowledge to aid in visual reasoning. We propose a framework that combines recent advances in knowledge distillation (teacher-student framework), relational reasoning, and probabilistic logical languages to incorporate such knowledge in existing neural networks for the task of Visual Question Answering. Specifically, for a question posed against an image, we use a probabilistic logical language to encode the spatial knowledge and the spatial understanding about the question in the form of a mask that is directly provided to the teacher network. The student network learns from the ground-truth information as well as the teacher’s prediction via distillation. We also demonstrate the impact of predicting such a mask inside the teacher’s network using attention. Empirically, we show that both methods improve the test accuracy over a state-of-the-art approach on a publicly available dataset.
UR - http://www.scopus.com/inward/record.url?scp=85063574220&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85063574220&partnerID=8YFLogxK
U2 - 10.1109/WACV.2019.00030
DO - 10.1109/WACV.2019.00030
M3 - Conference contribution
AN - SCOPUS:85063574220
T3 - Proceedings - 2019 IEEE Winter Conference on Applications of Computer Vision, WACV 2019
SP - 227
EP - 235
BT - Proceedings - 2019 IEEE Winter Conference on Applications of Computer Vision, WACV 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th IEEE Winter Conference on Applications of Computer Vision, WACV 2019
Y2 - 7 January 2019 through 11 January 2019
ER -