HPC I/O throughput bottleneck analysis with explainable local models

Mihailo Isakov, Eliakin Del Rosario, Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert B. Ross, Michel A. Kinsy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

With the growing complexity of high-performance computing (HPC) systems, achieving high performance can be difficult because of I/O bottlenecks. We analyze multiple years' worth of Darshan logs from the Argonne Leadership Computing Facility's Theta supercomputer in order to understand causes of poor I/O throughput. We present Gauge: a data-driven diagnostic tool for exploring the latent space of supercomputing job features, understanding behaviors of clusters of jobs, and interpreting I/O bottlenecks. We find groups of jobs that at first sight are highly heterogeneous but share certain behaviors, and analyze these groups instead of individual jobs, allowing us to reduce the workload of domain experts and automate I/O performance analysis. We conduct a case study where a system owner using Gauge was able to arrive at several clusters that do not conform to conventional I/O behaviors, as well as find several potential improvements, both on the application level and the system level.

Original language: English (US)
Title of host publication: Proceedings of SC 2020
Subtitle of host publication: International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9781728199986
DOIs
State: Published - Nov 2020
Externally published: Yes
Event: 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020 - Virtual, Atlanta, United States
Duration: Nov 9 2020 - Nov 19 2020

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 2020-November
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020
Country/Territory: United States
City: Virtual, Atlanta
Period: 11/9/20 - 11/19/20

Keywords

  • clustering
  • diagnostics
  • HPC
  • I/O
  • machine learning

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software
