TY - GEN
T1 - Towards a Watson that sees: Language-guided action recognition for robots
T2 - 2012 IEEE International Conference on Robotics and Automation, ICRA 2012
AU - Teo, Ching L.
AU - Yang, Yezhou
AU - Daumé III, Hal
AU - Fermüller, Cornelia
AU - Aloimonos, Yiannis
PY - 2012
Y1 - 2012
AB - For robots of the future to interact seamlessly with humans, they must be able to reason about their surroundings and take actions appropriate to the situation. Such reasoning is possible only when the robot has knowledge of how the world functions, which must either be learned or hard-coded. In this paper, we propose an approach that exploits language as an important source of high-level knowledge that a robot can use, akin to IBM's Watson in Jeopardy!. In particular, we show how language can be leveraged to reduce the ambiguity that arises when recognizing actions involving hand tools in video data. Starting from the premise that tools and actions are intrinsically linked, with one explaining the existence of the other, we train a language model over a large corpus of English newswire text so that this relationship can be extracted directly. This model is then used as a prior to select the tool and action that best explain the video. We formalize the approach in the context of 1) an unsupervised recognition scenario and 2) a supervised classification scenario, using an EM formulation for the former and integrating language features for the latter. Results are validated on a new hand-tool action dataset, and comparisons with state-of-the-art STIP features show significantly improved performance when language is used. In addition, we discuss the implications of these results and how they provide a framework for integrating language into vision in other robotic applications.
UR - http://www.scopus.com/inward/record.url?scp=84864473231&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84864473231&partnerID=8YFLogxK
U2 - 10.1109/ICRA.2012.6224589
DO - 10.1109/ICRA.2012.6224589
M3 - Conference contribution
AN - SCOPUS:84864473231
SN - 9781467314039
T3 - Proceedings - IEEE International Conference on Robotics and Automation
SP - 374
EP - 381
BT - 2012 IEEE International Conference on Robotics and Automation, ICRA 2012
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 May 2012 through 18 May 2012
ER -