TY - CONF
T1 - SUPER-NATURALINSTRUCTIONS: Generalization via Declarative Instructions on 1600+ NLP Tasks
T2 - 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022
AU - Wang, Yizhong
AU - Mishra, Swaroop
AU - Alipoormolabashi, Pegah
AU - Kordi, Yeganeh
AU - Mirzaei, Amirreza
AU - Arunkumar, Anjana
AU - Ashok, Arjun
AU - Dhanasekaran, Arut Selvan
AU - Naik, Atharva
AU - Stap, David
AU - Pathak, Eshaan
AU - Karamanolakis, Giannis
AU - Lai, Haizhi Gary
AU - Purohit, Ishan
AU - Mondal, Ishani
AU - Anderson, Jacob
AU - Kuznia, Kirby
AU - Doshi, Krima
AU - Patel, Maitreya
AU - Pal, Kuntal Kumar
AU - Moradshahi, Mehrad
AU - Parmar, Mihir
AU - Purohit, Mirali
AU - Varshney, Neeraj
AU - Kaza, Phani Rohitha
AU - Verma, Pulkit
AU - Puri, Ravsehaj Singh
AU - Karia, Rushang
AU - Sampat, Shailaja Keyur
AU - Doshi, Savan
AU - Mishra, Siddhartha
AU - Reddy, Sujan
AU - Patro, Sumanta
AU - Dixit, Tanay
AU - Shen, Xudong
AU - Baral, Chitta
AU - Choi, Yejin
AU - Smith, Noah A.
AU - Hajishirzi, Hannaneh
AU - Khashabi, Daniel
N1 - Funding Information:
We thank the anonymous reviewers, our colleagues from AI2 and UWNLP, especially Matthew Peters for his encouraging conversations that motivated this project. We also thank the student contributors of Arizona State University's CSE 576 “Topics in NLP” course and all other contributors to our data repository. All experiments were run on AI2's Beaker GPU clusters and Google's research TPUs. This work was supported in part by ONR MURI N00014-18-1-2670, ONR N00014-18-1-2826, and DARPA MCS N66001-19-2-4031 grants.
Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce SUPER-NATURALINSTRUCTIONS, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions: training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones. Furthermore, we build Tk-INSTRUCT, a transformer model trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples). Our experiments show that Tk-INSTRUCT outperforms existing instruction-following models such as InstructGPT by over 9% on our benchmark despite being an order of magnitude smaller. We further analyze generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes. We hope our dataset and model facilitate future progress towards more general-purpose NLP models.
AB - How well can NLP models generalize to a variety of unseen tasks when provided with task instructions? To address this question, we first introduce SUPER-NATURALINSTRUCTIONS, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions. Our collection covers 76 distinct task types, including but not limited to classification, extraction, infilling, sequence tagging, text rewriting, and text composition. This large and diverse collection of tasks enables rigorous benchmarking of cross-task generalization under instructions: training models to follow instructions on a subset of tasks and evaluating them on the remaining unseen ones. Furthermore, we build Tk-INSTRUCT, a transformer model trained to follow a variety of in-context instructions (plain language task definitions or k-shot examples). Our experiments show that Tk-INSTRUCT outperforms existing instruction-following models such as InstructGPT by over 9% on our benchmark despite being an order of magnitude smaller. We further analyze generalization as a function of various scaling parameters, such as the number of observed tasks, the number of instances per task, and model sizes. We hope our dataset and model facilitate future progress towards more general-purpose NLP models.
UR - http://www.scopus.com/inward/record.url?scp=85143257592&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143257592&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85143257592
SP - 5085
EP - 5109
Y2 - 7 December 2022 through 11 December 2022
ER -