TY - GEN
T1 - Tragedy Plus Time
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
AU - Chakravarthy, Arnav
AU - Fang, Zhiyuan
AU - Yang, Yezhou
N1 - Funding Information:
This work was supported by the National Science Foundation under Grants CMMI-1925403, IIS-2132724, and IIS-1750082.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In videos that contain actions performed unintentionally, agents do not achieve their desired goals. In such videos, it is challenging for computer vision systems to understand high-level concepts such as goal-directed behavior, an ability present in humans from a very early age. Inculcating this ability in artificially intelligent agents would make them better social learners by allowing them to evaluate human action under a teleological lens. To examine the ability of deep learning models to perform this task, we curate the W-Oops dataset, built upon the Oops dataset [11]. W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations. Because the segment annotation procedure is expensive, we propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions of a video, leveraging solely video-level labels. In particular, we employ an attention-based strategy that predicts the temporal regions that contribute the most to a classification task. Meanwhile, our designed overlap regularization allows the model to focus on distinct portions of the video when inferring the goal-directed and unintentional activity, while guaranteeing their temporal ordering. Extensive quantitative experiments verify the validity of our localization method. We further conduct a video captioning experiment which demonstrates that the proposed localization module does indeed assist teleological action understanding. The project website can be found at: https://asu-apg.github.io/TragedyPlusTime.
AB - In videos that contain actions performed unintentionally, agents do not achieve their desired goals. In such videos, it is challenging for computer vision systems to understand high-level concepts such as goal-directed behavior, an ability present in humans from a very early age. Inculcating this ability in artificially intelligent agents would make them better social learners by allowing them to evaluate human action under a teleological lens. To examine the ability of deep learning models to perform this task, we curate the W-Oops dataset, built upon the Oops dataset [11]. W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations. Because the segment annotation procedure is expensive, we propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions of a video, leveraging solely video-level labels. In particular, we employ an attention-based strategy that predicts the temporal regions that contribute the most to a classification task. Meanwhile, our designed overlap regularization allows the model to focus on distinct portions of the video when inferring the goal-directed and unintentional activity, while guaranteeing their temporal ordering. Extensive quantitative experiments verify the validity of our localization method. We further conduct a video captioning experiment which demonstrates that the proposed localization module does indeed assist teleological action understanding. The project website can be found at: https://asu-apg.github.io/TragedyPlusTime.
UR - http://www.scopus.com/inward/record.url?scp=85137826105&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137826105&partnerID=8YFLogxK
U2 - 10.1109/CVPRW56347.2022.00384
DO - 10.1109/CVPRW56347.2022.00384
M3 - Conference contribution
AN - SCOPUS:85137826105
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 3404
EP - 3414
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
PB - IEEE Computer Society
Y2 - 19 June 2022 through 20 June 2022
ER -