Toward Generalizable Models of I/O Throughput

Mihailo Isakov, Eliakin Del Rosario, Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert B. Ross, Michel A. Kinsy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Many modern HPC applications do not make good use of the limited available I/O bandwidth. Developing an understanding of the I/O subsystem is a critical first step toward better utilizing an HPC system. While expert insight is indispensable, I/O experts are in short supply. We seek to automate this effort by developing and interpreting models of I/O throughput. Such interpretations may be useful both to application developers, who can use them to improve their codes, and to facility operators, who can use them to identify larger problems in an HPC system. The application of machine learning (ML) to HPC system analysis has been shown to be a promising direction. However, the direct application of ML methods to I/O throughput prediction often leads to brittle models with low extrapolative power. In this work, we set out to understand why common methods underperform in this specific problem domain, and how to build models that generalize better to unseen data. We show that commonly used cross-validation testing yields training and test sets that are too similar, preventing us from detecting overfitting. We propose a method for generating test sets that encourages training-test set separation. Next, we explore the limits of I/O throughput prediction and show that we can estimate I/O contention noise by observing repeated runs of an application. Finally, we show that by using our new test sets, we can better discriminate between different architectures of ML models in terms of how well they generalize.
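As a rough illustration of two ideas mentioned in the abstract, the sketch below estimates run-to-run noise from repeated runs and builds a test set from whole groups of runs rather than from randomly sampled individual runs. It is not the authors' method: the `(group_key, throughput)` record layout, the function names `estimate_noise` and `grouped_split`, and the use of log10 throughput are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict
from statistics import mean, stdev


def estimate_noise(runs):
    """Mean within-group sample std of log10 throughput.

    `runs` is a list of (group_key, throughput) pairs, where group_key
    is a hypothetical identifier for "the same application run the same
    way". Only groups with at least two runs contribute.
    """
    groups = defaultdict(list)
    for key, throughput in runs:
        groups[key].append(math.log10(throughput))
    spreads = [stdev(values) for values in groups.values() if len(values) >= 2]
    return mean(spreads) if spreads else None


def grouped_split(runs, test_fraction=0.25, seed=0):
    """Hold out whole groups so no group appears in both train and test."""
    keys = sorted({key for key, _ in runs})
    rng = random.Random(seed)
    rng.shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [run for run in runs if run[0] not in test_keys]
    test = [run for run in runs if run[0] in test_keys]
    return train, test
```

Under these assumptions, a model whose test error falls well below the spread reported by `estimate_noise` would be a hint that the test set overlaps too closely with the training set.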

Original language: English (US)
Title of host publication: Proceedings of ROSS 2020
Subtitle of host publication: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 41-49
Number of pages: 9
ISBN (Electronic): 9781665422680
State: Published - Nov 2020
Externally published: Yes
Event: 10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020 - Virtual, Atlanta, United States
Duration: Nov 13 2020 → …

Publication series

Name: Proceedings of ROSS 2020: 10th International Workshop on Runtime and Operating Systems for Supercomputers, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Conference

Conference: 10th IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/13/20 → …

Keywords

  • High-Performance Computing
  • I/O Analysis
  • Machine Learning
  • Optimization

ASJC Scopus subject areas

  • Software
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
