Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task

Shailaja Keyur Sampat; Pratyay Banerjee; Yezhou Yang; Chitta Baral

Ira A. Fulton Schools of Engineering (IAFSE)

Research output: Contribution to conference › Paper › peer-review

Abstract

'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. The CLEVR_HYP (Sampat et al., 2021) is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine the aforementioned encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.

Original language	English (US)
Pages	5943-5953
Number of pages	11
State	Published - 2022
Event	2022 Findings of the Association for Computational Linguistics: EMNLP 2022 - Abu Dhabi, United Arab Emirates; Duration: Dec 7 2022 → Dec 11 2022

Conference

Conference	2022 Findings of the Association for Computational Linguistics: EMNLP 2022
Country/Territory	United Arab Emirates
City	Abu Dhabi
Period	12/7/22 → 12/11/22

ASJC Scopus subject areas

Computational Theory and Mathematics
Computer Science Applications
Information Systems

Cite this

@conference{9c07aec6685745cb9d66e0b3b09d3a96,

title = "Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task",

abstract = "'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions \& Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. The CLEVR_HYP (Sampat et al., 2021) is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine the aforementioned encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.",

author = "Sampat, {Shailaja Keyur} and Pratyay Banerjee and Yezhou Yang and Chitta Baral",

note = "Funding Information: We are thankful to the anonymous reviewers for the constructive feedback. This work is partially supported by the grants NSF 1816039 and NSF 2132724. Publisher Copyright: {\textcopyright} 2022 Association for Computational Linguistics.; 2022 Findings of the Association for Computational Linguistics: EMNLP 2022 ; Conference date: 07-12-2022 Through 11-12-2022",

booktitle = "2022 Findings of the Association for Computational Linguistics: EMNLP 2022",

year = "2022",

language = "English (US)",

pages = "5943--5953",

}

TY - CONF

T1 - Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task

AU - Sampat, Shailaja Keyur

AU - Banerjee, Pratyay

AU - Yang, Yezhou

AU - Baral, Chitta

N1 - Funding Information: We are thankful to the anonymous reviewers for the constructive feedback. This work is partially supported by the grants NSF 1816039 and NSF 2132724. Publisher Copyright: © 2022 Association for Computational Linguistics.

PY - 2022

Y1 - 2022

N2 - 'Actions' play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform 'Reasoning about Actions & Change' (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. The CLEVR_HYP (Sampat et al., 2021) is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine the aforementioned encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.

UR - http://www.scopus.com/inward/record.url?scp=85149901927&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85149901927&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85149901927

SP - 5943

EP - 5953

T2 - 2022 Findings of the Association for Computational Linguistics: EMNLP 2022

Y2 - 7 December 2022 through 11 December 2022

ER -
