Image Understanding using vision and reasoning through Scene Description Graph

Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos, Cornelia Fermüller

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Two of the fundamental tasks in image understanding using text are caption generation and visual question answering (Antol et al., 2015). This work presents an intermediate knowledge structure that can be used for both tasks to obtain increased interpretability. We call this knowledge structure the Scene Description Graph (SDG): a directed labeled graph that represents objects, actions, and regions, together with their attributes, inferred concepts, and semantic (from KM-Ontology (Clark et al., 2004)), ontological (i.e., superclass, hasProperty), and spatial relations. We propose a general architecture in which a system can represent both the content and the underlying concepts of an image using an SDG. The architecture is implemented using generic visual recognition techniques and commonsense reasoning to extract graphs from images. The utility of the generated SDGs is demonstrated in image captioning and image retrieval, and through examples in visual question answering. The experiments show that the extracted graphs capture the syntactic and semantic content of images with reasonable accuracy.
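
As a rough illustration of what a directed labeled graph of this kind might look like in code, the sketch below stores typed nodes and labeled edges. It is not the authors' implementation: the class name, the method names, the example scene (a person riding a horse), and the edge labels `agent`, `object`, and `on` are all hypothetical; only `superclass` and `hasProperty` are relation names taken from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneDescriptionGraph:
    """Toy directed labeled graph (hypothetical, not the paper's code)."""

    # node name -> node kind, e.g. "object", "action", "region", "concept"
    nodes: dict[str, str] = field(default_factory=dict)
    # (source, label, target) triples; direction runs source -> target
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, name: str, kind: str) -> None:
        self.nodes[name] = kind

    def add_edge(self, source: str, label: str, target: str) -> None:
        # Require both endpoints to exist so that no edge dangles.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("add both endpoints before connecting them")
        self.edges.append((source, label, target))

    def outgoing(self, source: str, label: str | None = None) -> list[tuple[str, str]]:
        """Return (label, target) pairs leaving `source`, optionally filtered."""
        return [
            (lbl, tgt)
            for src, lbl, tgt in self.edges
            if src == source and (label is None or lbl == label)
        ]


def build_example() -> SceneDescriptionGraph:
    """Build a graph for an assumed image of a person riding a brown horse."""
    g = SceneDescriptionGraph()
    g.add_node("person", "object")
    g.add_node("horse", "object")
    g.add_node("ride", "action")
    g.add_node("animal", "concept")    # inferred concept
    g.add_node("brown", "attribute")
    g.add_edge("ride", "agent", "person")        # semantic relation (assumed label)
    g.add_edge("ride", "object", "horse")        # semantic relation (assumed label)
    g.add_edge("horse", "superclass", "animal")  # ontological relation
    g.add_edge("horse", "hasProperty", "brown")  # ontological relation
    g.add_edge("person", "on", "horse")          # spatial relation (assumed label)
    return g
```

A question such as "who is riding?" could then be approximated by a lookup like `build_example().outgoing("ride", "agent")`, which hints at why a single graph could serve captioning, retrieval, and question answering.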

Original language: English (US)
Journal: Computer Vision and Image Understanding
DOI: 10.1016/j.cviu.2017.12.004
State: Accepted/In press - Jan 1 2017

Fingerprint

Image understanding
Semantics
Directed graphs
Image retrieval
Syntactics
Ontology
Experiments

Keywords

  • Commonsense Reasoning
  • Image Understanding
  • Reasoning
  • Vision

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition

Cite this

Image Understanding using vision and reasoning through Scene Description Graph. / Aditya, Somak; Yang, Yezhou; Baral, Chitta; Aloimonos, Yiannis; Fermüller, Cornelia.

In: Computer Vision and Image Understanding, 01.01.2017.

@article{97477ecf1b00436b8838617b29b36b73,
title = "Image Understanding using vision and reasoning through Scene Description Graph",
abstract = "Two of the fundamental tasks in image understanding using text are caption generation and visual question answering (Antol et al., 2015). This work presents an intermediate knowledge structure that can be used for both tasks to obtain increased interpretability. We call this knowledge structure the Scene Description Graph (SDG): a directed labeled graph that represents objects, actions, and regions, together with their attributes, inferred concepts, and semantic (from KM-Ontology (Clark et al., 2004)), ontological (i.e., superclass, hasProperty), and spatial relations. We propose a general architecture in which a system can represent both the content and the underlying concepts of an image using an SDG. The architecture is implemented using generic visual recognition techniques and commonsense reasoning to extract graphs from images. The utility of the generated SDGs is demonstrated in image captioning and image retrieval, and through examples in visual question answering. The experiments show that the extracted graphs capture the syntactic and semantic content of images with reasonable accuracy.",
keywords = "Commonsense Reasoning, Image Understanding, Reasoning, Vision",
author = "Somak Aditya and Yezhou Yang and Chitta Baral and Yiannis Aloimonos and Cornelia Ferm{\"u}ller",
year = "2017",
month = "1",
day = "1",
doi = "10.1016/j.cviu.2017.12.004",
language = "English (US)",
journal = "Computer Vision and Image Understanding",
issn = "1077-3142",
publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Image Understanding using vision and reasoning through Scene Description Graph

AU - Aditya, Somak

AU - Yang, Yezhou

AU - Baral, Chitta

AU - Aloimonos, Yiannis

AU - Fermüller, Cornelia

PY - 2017/1/1

Y1 - 2017/1/1

N2 - Two of the fundamental tasks in image understanding using text are caption generation and visual question answering (Antol et al., 2015). This work presents an intermediate knowledge structure that can be used for both tasks to obtain increased interpretability. We call this knowledge structure the Scene Description Graph (SDG): a directed labeled graph that represents objects, actions, and regions, together with their attributes, inferred concepts, and semantic (from KM-Ontology (Clark et al., 2004)), ontological (i.e., superclass, hasProperty), and spatial relations. We propose a general architecture in which a system can represent both the content and the underlying concepts of an image using an SDG. The architecture is implemented using generic visual recognition techniques and commonsense reasoning to extract graphs from images. The utility of the generated SDGs is demonstrated in image captioning and image retrieval, and through examples in visual question answering. The experiments show that the extracted graphs capture the syntactic and semantic content of images with reasonable accuracy.

KW - Commonsense Reasoning

KW - Image Understanding

KW - Reasoning

KW - Vision

UR - http://www.scopus.com/inward/record.url?scp=85039696240&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85039696240&partnerID=8YFLogxK

U2 - 10.1016/j.cviu.2017.12.004

DO - 10.1016/j.cviu.2017.12.004

M3 - Article

JO - Computer Vision and Image Understanding

JF - Computer Vision and Image Understanding

SN - 1077-3142

ER -