NUMGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks

Swaroop Mishra; Arindam Mitra; Neeraj Varshney; Bhavdeep Sachdeva; Peter Clark; Chitta Baral; Ashwin Kalyan

Engineering, Ira A. Fulton Schools of (IAFSE)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Scopus citations

Abstract

Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple calculations is an important skill for AI systems. While many datasets and models have been developed to this end, state-of-the-art AI systems are brittle, failing to perform the underlying mathematical reasoning when it appears in a slightly different scenario. Drawing inspiration from GLUE (Wang et al., 2018), which was proposed in the context of natural language understanding, we propose NUMGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks that, at their core, require simple arithmetic understanding. We show that this benchmark is far from being solved: neural models, including state-of-the-art large-scale language models, perform significantly worse than humans (lower by 46.4%). Further, NUMGLUE promotes sharing knowledge across tasks, especially those with limited training data, as evidenced by the superior performance (average gain of 3.4% on each task) when a model is jointly trained on all the tasks as opposed to task-specific modeling. Finally, we hope that NUMGLUE will encourage systems that perform robust and general arithmetic reasoning within language, a first step towards being able to perform more complex mathematical reasoning.

Original language	English (US)
Title of host publication	ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
Editors	Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Publisher	Association for Computational Linguistics (ACL)
Pages	3505-3523
Number of pages	19
ISBN (Electronic)	9781955917216
State	Published - 2022
Event	60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - Dublin, Ireland Duration: May 22 2022 → May 27 2022

Publication series

Name	Proceedings of the Annual Meeting of the Association for Computational Linguistics
Volume	1
ISSN (Print)	0736-587X

Conference

Conference	60th Annual Meeting of the Association for Computational Linguistics, ACL 2022
Country/Territory	Ireland
City	Dublin
Period	5/22/22 → 5/27/22

ASJC Scopus subject areas

Computer Science Applications
Linguistics and Language
Language and Linguistics

Cite this

Mishra, S., Mitra, A., Varshney, N., Sachdeva, B., Clark, P., Baral, C., & Kalyan, A. (2022). NUMGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (pp. 3505-3523). (Proceedings of the Annual Meeting of the Association for Computational Linguistics; Vol. 1). Association for Computational Linguistics (ACL).

@inproceedings{043f295c3152494894bda2bdc3612fad,

title = "NUMGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks",

author = "Swaroop Mishra and Arindam Mitra and Neeraj Varshney and Bhavdeep Sachdeva and Peter Clark and Chitta Baral and Ashwin Kalyan",

note = "Publisher Copyright: {\textcopyright} 2022 Association for Computational Linguistics.; 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 ; Conference date: 22-05-2022 Through 27-05-2022",

year = "2022",

language = "English (US)",

series = "Proceedings of the Annual Meeting of the Association for Computational Linguistics",

publisher = "Association for Computational Linguistics (ACL)",

pages = "3505--3523",

editor = "Smaranda Muresan and Preslav Nakov and Aline Villavicencio",

booktitle = "ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)",

}

TY - GEN

T1 - NUMGLUE

T2 - 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022

AU - Mishra, Swaroop

AU - Mitra, Arindam

AU - Varshney, Neeraj

AU - Sachdeva, Bhavdeep

AU - Clark, Peter

AU - Baral, Chitta

AU - Kalyan, Ashwin

PY - 2022

Y1 - 2022

UR - http://www.scopus.com/inward/record.url?scp=85137576954&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85137576954&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85137576954

T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics

SP - 3505

EP - 3523

BT - ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)

A2 - Muresan, Smaranda

A2 - Nakov, Preslav

A2 - Villavicencio, Aline

PB - Association for Computational Linguistics (ACL)

Y2 - 22 May 2022 through 27 May 2022

ER -
