TY - JOUR
T1 - Image Understanding using vision and reasoning through Scene Description Graph
AU - Aditya, Somak
AU - Yang, Yezhou
AU - Baral, Chitta
AU - Aloimonos, Yiannis
AU - Fermüller, Cornelia
N1 - Funding Information:
Yiannis Aloimonos and Cornelia Fermüller acknowledge the support of the National Science Foundation under grants SMA 1540917 and CNS 1544797.
Publisher Copyright:
© 2017 Elsevier Inc.
PY - 2018/8
Y1 - 2018/8
AB - Two of the fundamental tasks in image understanding using text are caption generation and visual question answering (Antol et al., 2015; Xiong et al., 2016). This work presents an intermediate knowledge structure that can be used for both tasks to obtain increased interpretability. We call this knowledge structure a Scene Description Graph (SDG): a directed labeled graph representing objects, actions, and regions, as well as their attributes, along with inferred concepts and semantic (from the KM-Ontology (Clark et al., 2004)), ontological (e.g., superclass, hasProperty), and spatial relations. A general architecture is then proposed in which a system can represent both the content and the underlying concepts of an image using an SDG. The architecture is implemented using generic visual recognition techniques and commonsense reasoning to extract graphs from images. The utility of the generated SDGs is demonstrated in image captioning, image retrieval, and, through examples, in visual question answering. The experiments in this work show that the extracted graphs capture the syntactic and semantic content of images with reasonable accuracy.
KW - Commonsense Reasoning
KW - Image Understanding
KW - Reasoning
KW - Vision
UR - http://www.scopus.com/inward/record.url?scp=85039696240&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85039696240&partnerID=8YFLogxK
U2 - 10.1016/j.cviu.2017.12.004
DO - 10.1016/j.cviu.2017.12.004
M3 - Article
AN - SCOPUS:85039696240
VL - 173
SP - 33
EP - 45
JO - Computer Vision and Image Understanding
JF - Computer Vision and Image Understanding
SN - 1077-3142
ER -