TY - GEN
T1 - Tragedy Plus Time
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
AU - Chakravarthy, Arnav
AU - Fang, Zhiyuan
AU - Yang, Yezhou
N1 - Funding Information:
This work was supported by the National Science Foundation under Grants CMMI-1925403, IIS-2132724, and IIS-1750082.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - In videos that contain actions performed unintentionally, agents do not achieve their desired goals. In such videos, it is challenging for computer vision systems to understand high-level concepts such as goal-directed behavior, an ability present in humans from a very early age. Inculcating this ability in artificially intelligent agents would make them better social learners by allowing them to evaluate human action under a teleological lens. To examine the ability of deep learning models to perform this task, we curate the W-Oops dataset, built upon the Oops dataset [11]. W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations. Because the segment annotation procedure is expensive, we propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions of a video, leveraging solely video-level labels. In particular, we employ an attention-based strategy that predicts the temporal regions that contribute the most to a classification task. Meanwhile, our designed overlap regularization allows the model to focus on distinct portions of the video when inferring the goal-directed and unintentional activity, while guaranteeing their temporal ordering. Extensive quantitative experiments verify the validity of our localization method. We further conduct a video captioning experiment which demonstrates that the proposed localization module does indeed assist teleological action understanding. The project website can be found at: https://asu-apg.github.io/TragedyPlusTime.
AB - In videos that contain actions performed unintentionally, agents do not achieve their desired goals. In such videos, it is challenging for computer vision systems to understand high-level concepts such as goal-directed behavior, an ability present in humans from a very early age. Inculcating this ability in artificially intelligent agents would make them better social learners by allowing them to evaluate human action under a teleological lens. To examine the ability of deep learning models to perform this task, we curate the W-Oops dataset, built upon the Oops dataset [11]. W-Oops consists of 2,100 unintentional human action videos, with 44 goal-directed and 30 unintentional video-level activity labels collected through human annotations. Because the segment annotation procedure is expensive, we propose a weakly supervised algorithm for localizing the goal-directed as well as unintentional temporal regions of a video, leveraging solely video-level labels. In particular, we employ an attention-based strategy that predicts the temporal regions that contribute the most to a classification task. Meanwhile, our designed overlap regularization allows the model to focus on distinct portions of the video when inferring the goal-directed and unintentional activity, while guaranteeing their temporal ordering. Extensive quantitative experiments verify the validity of our localization method. We further conduct a video captioning experiment which demonstrates that the proposed localization module does indeed assist teleological action understanding. The project website can be found at: https://asu-apg.github.io/TragedyPlusTime.
UR - http://www.scopus.com/inward/record.url?scp=85137826105&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137826105&partnerID=8YFLogxK
U2 - 10.1109/CVPRW56347.2022.00384
DO - 10.1109/CVPRW56347.2022.00384
M3 - Conference contribution
AN - SCOPUS:85137826105
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 3404
EP - 3414
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022
PB - IEEE Computer Society
Y2 - 19 June 2022 through 20 June 2022
ER -