Toward Generalizable Models of I/O Throughput

Mihailo Isakov; Eliakin Del Rosario; Sandeep Madireddy; Prasanna Balaprakash; Philip Carns; Robert B. Ross; Michel A. Kinsy

doi:10.1109/ROSS51935.2020.00010

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

Many modern HPC applications do not make good use of the limited available I/O bandwidth. Developing an understanding of the I/O subsystem is a critical first step toward better utilizing an HPC system. While expert insight is indispensable, I/O experts are in short supply. We seek to automate this effort by developing and interpreting models of I/O throughput. Such interpretations may be useful both to application developers, who can use them to improve their codes, and to facility operators, who can use them to identify larger problems in an HPC system.

The application of machine learning (ML) to HPC system analysis has been shown to be a promising direction. However, the direct application of ML methods to I/O throughput prediction often leads to brittle models with low extrapolative power. In this work, we set out to understand why common methods underperform in this specific problem domain and how to build models that generalize better to unseen data. We show that commonly used cross-validation testing yields training and test sets that are too similar, preventing us from detecting overfitting. We propose a method for generating test sets that encourages training-test set separation. Next, we explore the limits of I/O throughput prediction and show that we can estimate I/O contention noise by observing repeated runs of an application. Finally, we show that by using our new test sets, we can better discriminate between different ML model architectures in terms of how well they generalize.
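
The two ideas in the abstract, separating training and test sets and estimating contention noise from repeated runs, can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not describe how its test sets are generated, so a simple hold-out of whole applications is swapped in as a stand-in, and the noise estimate is a pooled standard deviation of log10 throughput over repeated runs. The record fields `app`, `config`, and `throughput`, and both function names, are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def grouped_split(jobs, test_fraction=0.25):
    """Hold out whole applications so no application is in both sets.

    Stand-in for a separation-encouraging split; assumes each job is a
    dict with a hypothetical "app" key.
    """
    apps = sorted({job["app"] for job in jobs})
    n_test = max(1, round(len(apps) * test_fraction))
    test_apps = set(apps[-n_test:])  # last n_test apps in sorted order
    train = [job for job in jobs if job["app"] not in test_apps]
    test = [job for job in jobs if job["app"] in test_apps]
    return train, test


def contention_noise(jobs):
    """Pooled standard deviation of log10 throughput across repeated runs.

    Runs sharing the same (app, config) pair are treated as repeats, so
    their spread is attributed to system noise such as I/O contention.
    Groups with a single run carry no information and are skipped.
    """
    groups = defaultdict(list)
    for job in jobs:
        groups[(job["app"], job["config"])].append(math.log10(job["throughput"]))

    sum_sq, dof = 0.0, 0
    for values in groups.values():
        if len(values) < 2:
            continue
        mean = statistics.fmean(values)
        sum_sq += sum((v - mean) ** 2 for v in values)
        dof += len(values) - 1
    return math.sqrt(sum_sq / dof) if dof else float("nan")
```

Under these assumptions, the noise estimate gives a rough floor on the prediction error any model could reach on the repeated jobs, while the grouped split tests how a model behaves on applications it has never seen.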

Original language	English (US)
Title of host publication	Proceedings of ROSS 2020
Subtitle of host publication	10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	41-49
Number of pages	9
ISBN (Electronic)	9781665422680
DOIs	https://doi.org/10.1109/ROSS51935.2020.00010
State	Published - Nov 2020
Externally published	Yes
Event	10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020 - Virtual, Atlanta, United States Duration: Nov 13 2020 → …

Publication series

Name	Proceedings of ROSS 2020: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference	10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020
Country/Territory	United States
City	Virtual, Atlanta
Period	11/13/20 → …

Keywords

High-Performance Computing
I/O Analysis
Machine Learning
Optimization

ASJC Scopus subject areas

Software
Information Systems and Management
Safety, Risk, Reliability and Quality

Access to Document

10.1109/ROSS51935.2020.00010

Cite this

Isakov, M., Del Rosario, E., Madireddy, S., Balaprakash, P., Carns, P., Ross, R. B., & Kinsy, M. A. (2020). Toward Generalizable Models of I/O Throughput. In Proceedings of ROSS 2020: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 41-49). Article 9307941 (Proceedings of ROSS 2020: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ROSS51935.2020.00010

@inproceedings{e0217e26f50747c2a4e21314e126850b,

title = "Toward Generalizable Models of I/O Throughput",

abstract = "Many modern HPC applications do not make good use of the limited available I/O bandwidth. Developing an understanding of the I/O subsystem is a critical first step toward better utilizing an HPC system. While expert insight is indispensable, I/O experts are in short supply. We seek to automate this effort by developing and interpreting models of I/O throughput. Such interpretations may be useful both to application developers, who can use them to improve their codes, and to facility operators, who can use them to identify larger problems in an HPC system. The application of machine learning (ML) to HPC system analysis has been shown to be a promising direction. However, the direct application of ML methods to I/O throughput prediction often leads to brittle models with low extrapolative power. In this work, we set out to understand why common methods underperform in this specific problem domain and how to build models that generalize better to unseen data. We show that commonly used cross-validation testing yields training and test sets that are too similar, preventing us from detecting overfitting. We propose a method for generating test sets that encourages training-test set separation. Next, we explore the limits of I/O throughput prediction and show that we can estimate I/O contention noise by observing repeated runs of an application. Finally, we show that by using our new test sets, we can better discriminate between different ML model architectures in terms of how well they generalize.",

keywords = "High-Performance Computing, I/O Analysis, Machine Learning, Optimization",

author = "Mihailo Isakov and {Del Rosario}, Eliakin and Sandeep Madireddy and Prasanna Balaprakash and Philip Carns and Ross, {Robert B.} and Kinsy, {Michel A.}",

note = "Funding Information: This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. Publisher Copyright: {\textcopyright} 2020 IEEE.; 10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020 ; Conference date: 13-11-2020",

year = "2020",

month = nov,

doi = "10.1109/ROSS51935.2020.00010",

language = "English (US)",

series = "Proceedings of ROSS 2020: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "41--49",

booktitle = "Proceedings of ROSS 2020",

}

TY - GEN

T1 - Toward Generalizable Models of I/O Throughput

AU - Isakov, Mihailo

AU - Del Rosario, Eliakin

AU - Madireddy, Sandeep

AU - Balaprakash, Prasanna

AU - Carns, Philip

AU - Ross, Robert B.

AU - Kinsy, Michel A.

N1 - Funding Information: This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. Publisher Copyright: © 2020 IEEE.

PY - 2020/11

Y1 - 2020/11

AB - Many modern HPC applications do not make good use of the limited available I/O bandwidth. Developing an understanding of the I/O subsystem is a critical first step toward better utilizing an HPC system. While expert insight is indispensable, I/O experts are in short supply. We seek to automate this effort by developing and interpreting models of I/O throughput. Such interpretations may be useful both to application developers, who can use them to improve their codes, and to facility operators, who can use them to identify larger problems in an HPC system. The application of machine learning (ML) to HPC system analysis has been shown to be a promising direction. However, the direct application of ML methods to I/O throughput prediction often leads to brittle models with low extrapolative power. In this work, we set out to understand why common methods underperform in this specific problem domain and how to build models that generalize better to unseen data. We show that commonly used cross-validation testing yields training and test sets that are too similar, preventing us from detecting overfitting. We propose a method for generating test sets that encourages training-test set separation. Next, we explore the limits of I/O throughput prediction and show that we can estimate I/O contention noise by observing repeated runs of an application. Finally, we show that by using our new test sets, we can better discriminate between different ML model architectures in terms of how well they generalize.

KW - High-Performance Computing

KW - I/O Analysis

KW - Machine Learning

KW - Optimization

UR - http://www.scopus.com/inward/record.url?scp=85099545745&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85099545745&partnerID=8YFLogxK

U2 - 10.1109/ROSS51935.2020.00010

DO - 10.1109/ROSS51935.2020.00010

M3 - Conference contribution

AN - SCOPUS:85099545745

T3 - Proceedings of ROSS 2020: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

SP - 41

EP - 49

BT - Proceedings of ROSS 2020

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020

Y2 - 13 November 2020

ER -
