TY - GEN
T1 - Towards a Watson that sees: Language-guided action recognition for robots
T2 - 2012 IEEE International Conference on Robotics and Automation, ICRA 2012
AU - Teo, Ching L.
AU - Yang, Yezhou
AU - Daumé III, Hal
AU - Fermüller, Cornelia
AU - Aloimonos, Yiannis
PY - 2012
Y1 - 2012
AB - For robots of the future to interact seamlessly with humans, they must be able to reason about their surroundings and take actions appropriate to the situation. Such reasoning is possible only when the robot has knowledge of how the world functions, which must either be learned or hard-coded. In this paper, we propose an approach that exploits language as an important source of high-level knowledge that a robot can use, akin to IBM's Watson in Jeopardy!. In particular, we show how language can be leveraged to reduce the ambiguity that arises when recognizing actions involving hand tools in video data. Starting from the premise that tools and actions are intrinsically linked, with one explaining the existence of the other, we train a language model over a large corpus of English newswire text so that this relationship can be extracted directly. This model is then used as a prior to select the tool and action that best explain the video. We formalize the approach in the context of 1) an unsupervised recognition scenario and 2) a supervised classification scenario, using an EM formulation for the former and integrating language features for the latter. Results are validated on a new hand-tool action dataset, and comparisons with state-of-the-art STIP features show significantly improved performance when language is used. In addition, we discuss the implications of these results and how they provide a framework for integrating language into vision in other robotic applications.
UR - http://www.scopus.com/inward/record.url?scp=84864473231&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84864473231&partnerID=8YFLogxK
U2 - 10.1109/ICRA.2012.6224589
DO - 10.1109/ICRA.2012.6224589
M3 - Conference contribution
AN - SCOPUS:84864473231
SN - 9781467314039
T3 - Proceedings - IEEE International Conference on Robotics and Automation
SP - 374
EP - 381
BT - 2012 IEEE International Conference on Robotics and Automation, ICRA 2012
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 May 2012 through 18 May 2012
ER -