TY - GEN
T1 - Toward Generalizable Models of I/O Throughput
AU - Isakov, Mihailo
AU - Del Rosario, Eliakin
AU - Madireddy, Sandeep
AU - Balaprakash, Prasanna
AU - Carns, Philip
AU - Ross, Robert B.
AU - Kinsy, Michel A.
N1 - Funding Information:
This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - Many modern HPC applications do not make good use of the limited available I/O bandwidth. Developing an understanding of the I/O subsystem is a critical first step in order to better utilize an HPC system. While expert insight is indispensable, I/O experts are in rare supply. We seek to automate this effort by developing and interpreting models of I/O throughput. Such interpretations may be useful to both application developers who can use them to improve their codes and to facility operators who can use them to identify larger problems in an HPC system. The application of machine learning (ML) to HPC system analysis has been shown to be a promising direction. However, the direct application of ML methods to I/O throughput prediction often leads to brittle models with low extrapolative power. In this work, we set out to understand the reasons why common methods underperform on this specific problem domain, and how to build models that better generalize on unseen data. We show that commonly used cross-validation testing yields sets that are too similar, preventing us from detecting overfitting. We propose a method for generating test sets that encourages training-test set separation. Next we explore limits of I/O throughput prediction and show that we can estimate I/O contention noise by observing repeated runs of an application. Then we show that by using our new test sets, we can better discriminate different architectures of ML models in terms of how well they generalize.
AB - Many modern HPC applications do not make good use of the limited available I/O bandwidth. Developing an understanding of the I/O subsystem is a critical first step in order to better utilize an HPC system. While expert insight is indispensable, I/O experts are in rare supply. We seek to automate this effort by developing and interpreting models of I/O throughput. Such interpretations may be useful to both application developers who can use them to improve their codes and to facility operators who can use them to identify larger problems in an HPC system. The application of machine learning (ML) to HPC system analysis has been shown to be a promising direction. However, the direct application of ML methods to I/O throughput prediction often leads to brittle models with low extrapolative power. In this work, we set out to understand the reasons why common methods underperform on this specific problem domain, and how to build models that better generalize on unseen data. We show that commonly used cross-validation testing yields sets that are too similar, preventing us from detecting overfitting. We propose a method for generating test sets that encourages training-test set separation. Next we explore limits of I/O throughput prediction and show that we can estimate I/O contention noise by observing repeated runs of an application. Then we show that by using our new test sets, we can better discriminate different architectures of ML models in terms of how well they generalize.
KW - High-Performance Computing
KW - I/O Analysis
KW - Machine Learning
KW - Optimization
UR - http://www.scopus.com/inward/record.url?scp=85099545745&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85099545745&partnerID=8YFLogxK
U2 - 10.1109/ROSS51935.2020.00010
DO - 10.1109/ROSS51935.2020.00010
M3 - Conference contribution
AN - SCOPUS:85099545745
T3 - Proceedings of ROSS 2020: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 41
EP - 49
BT - Proceedings of ROSS 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020
Y2 - 13 November 2020
ER -