Abstract
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LĪLA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 datasets by collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHĀSKARA, a general-purpose mathematical reasoning model trained on LĪLA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best-performing model only obtains 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
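The abstract notes that each task's solutions are collected as Python programs, so an answer can be both explained and verified by execution. As a rough illustration only (the field names and the word problem below are hypothetical, not LĪLA's actual schema), a program-annotated instance might look like:

```python
# A minimal, hypothetical sketch of a LILA-style instance. Field names and
# the example problem are illustrative, not the actual dataset schema.
# Each instance pairs a natural-language question with a Python program;
# executing the program yields the answer and serves as its explanation.

def solution():
    # "A grocery bag holds 12 apples. How many apples are in 4 bags?"
    apples_per_bag = 12
    num_bags = 4
    return apples_per_bag * num_bags

instance = {
    "instruction": "Solve the arithmetic word problem.",
    "question": "A grocery bag holds 12 apples. How many apples are in 4 bags?",
    "program": "def solution():\n    return 12 * 4",
    "answer": solution(),  # 48, obtained by running the program
}

print(instance["answer"])  # prints 48
```

Evaluating by executing the program, rather than string-matching a final number, is what lets the benchmark score explainable solutions in addition to the correct answer.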
| Original language | English (US) |
| --- | --- |
| Pages | 5807-5832 |
| Number of pages | 26 |
| State | Published - 2022 |
| Externally published | Yes |
| Event | 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 - Abu Dhabi, United Arab Emirates |
| Duration | Dec 7 2022 → Dec 11 2022 |
Conference

| Conference | 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 |
| --- | --- |
| Country/Territory | United Arab Emirates |
| City | Abu Dhabi |
| Period | 12/7/22 → 12/11/22 |
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Information Systems