Unsupervised Linking of Visual Features to Textual Descriptions in Long Manipulation Activities

Eren Erdal Aksoy, Ekaterina Ovchinnikova, Adil Orhan, Yezhou Yang, Tamim Asfour

Research output: Contribution to journal › Article › peer-review

8 Scopus citations


We present a novel unsupervised framework that links continuous visual features with symbolic textual descriptions of manipulation activity videos. First, we extract the semantic representation of visually observed manipulations by applying a bottom-up approach to the continuous image streams. We then employ rule-based reasoning to link the visual and linguistic inputs. The proposed framework allows robots 1) to autonomously parse, classify, and label sequentially and/or concurrently performed atomic manipulations (e.g., 'cutting' or 'stirring'), 2) to simultaneously categorize and identify manipulated objects without using any standard feature-based recognition techniques, and 3) to generate textual descriptions for long activities, e.g., 'breakfast preparation.' We evaluated the framework using a dataset of 120 atomic manipulations and 20 long activities.

Original language: English (US)
Article number: 7856986
Pages (from-to): 1397-1404
Number of pages: 8
Journal: IEEE Robotics and Automation Letters
Issue number: 3
State: Published - Jul 2017


Keywords

  • Cognitive human-robot interaction
  • learning and adaptive systems
  • semantic scene understanding

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Biomedical Engineering
  • Human-Computer Interaction
  • Mechanical Engineering
  • Computer Vision and Pattern Recognition
  • Computer Science Applications
  • Control and Optimization
  • Artificial Intelligence

