Abstract
We present a novel unsupervised framework that links continuous visual features with symbolic textual descriptions of manipulation activity videos. First, we extract the semantic representation of visually observed manipulations by applying a bottom-up approach to the continuous image streams. We then employ rule-based reasoning to link the visual and linguistic inputs. The proposed framework allows robots 1) to autonomously parse, classify, and label sequentially and/or concurrently performed atomic manipulations (e.g., 'cutting' or 'stirring'), 2) to simultaneously categorize and identify manipulated objects without using any standard feature-based recognition techniques, and 3) to generate textual descriptions for long activities, e.g., 'breakfast preparation.' We evaluated the framework on a dataset of 120 atomic manipulations and 20 long activities.
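To make the linking step more concrete, below is a minimal sketch of how rule-based reasoning could map bottom-up visual relation predicates onto verb labels such as 'cutting' or 'stirring'. The predicate vocabulary, rule table, and function names are illustrative assumptions for this sketch and are not the authors' implementation.

```python
# Minimal sketch (hypothetical, not the paper's code): rule-based mapping from
# visually extracted object-object relation predicates to atomic manipulation labels.
from typing import Dict, FrozenSet, List, Tuple

# A visual segment is summarized as a set of relation predicates,
# e.g. ("hand", "grasps", "tool"), extracted bottom-up from the image stream.
Predicate = Tuple[str, str, str]

# Hypothetical rule base: each atomic manipulation is linked to the
# (relation, role) patterns that must hold during the observed segment.
RULES: Dict[str, List[Tuple[str, str]]] = {
    "cutting":  [("grasps", "tool"), ("divides", "object")],
    "stirring": [("grasps", "tool"), ("moves_inside", "container")],
    "pouring":  [("grasps", "container"), ("flows_into", "container")],
}


def classify_segment(relations: FrozenSet[Predicate]) -> str:
    """Return the first manipulation label whose rule pattern is satisfied."""
    observed = {(rel, role) for (_, rel, role) in relations}
    for label, pattern in RULES.items():
        if all(p in observed for p in pattern):
            return label
    return "unknown"


if __name__ == "__main__":
    segment = frozenset({
        ("hand", "grasps", "tool"),
        ("knife", "divides", "object"),
    })
    print(classify_segment(segment))  # -> "cutting"
```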
| Original language | English (US) |
| --- | --- |
| Article number | 7856986 |
| Pages (from-to) | 1397-1404 |
| Number of pages | 8 |
| Journal | IEEE Robotics and Automation Letters |
| Volume | 2 |
| Issue number | 3 |
| DOIs | |
| State | Published - Jul 2017 |
Keywords
- Cognitive human-robot interaction
- learning and adaptive systems
- semantic scene understanding
ASJC Scopus subject areas
- Control and Systems Engineering
- Biomedical Engineering
- Human-Computer Interaction
- Mechanical Engineering
- Computer Vision and Pattern Recognition
- Computer Science Applications
- Control and Optimization
- Artificial Intelligence