Understanding Capacity-Driven Scale-Out Neural Recommendation Inference

Michael Lui, Yavuz Yetim, Ozgur Ozkan, Zhuoran Zhao, Shin-Yeh Tsai, Carole-Jean Wu, Mark Hempstead

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes, which load an entire model onto a single server, cannot support this scale. One approach to supporting these models is distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers. This work is a first step for the systems community toward developing novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work is the first to describe and characterize scale-out deep learning recommender inference using data-center serving infrastructure. It specifically explores latency-bounded inference systems, in contrast to the throughput-oriented training systems of other recent works. We find that the latency and compute overheads of distributed inference are largely attributable to a model's static embedding table distribution and to the sparsity of inference request inputs. We evaluate three embedding table mapping strategies on three representative models and specify the challenging design trade-offs in terms of end-to-end latency, compute overhead, and resource efficiency. Overall, we observe a modest latency overhead with distributed inference: P99 latency increases by only 1% in the best-case configuration. The latency overheads are a result of the commodity infrastructure used and the sparsity of embedding tables. Encouragingly, we also show how distributed inference can account for efficiency improvements in data-center-scale recommendation serving.
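The core problem the abstract describes, statically mapping embedding tables to servers so a terabyte-scale model fits across machines, can be illustrated with a minimal sketch. The greedy largest-first placement below is a hypothetical example of one such mapping strategy, not the paper's actual method; the table names and sizes are invented for illustration.

```python
def place_tables(table_sizes, num_servers):
    """Statically assign each embedding table to the currently
    least-loaded server (greedy largest-first bin packing)."""
    loads = [0] * num_servers   # total bytes placed on each server
    mapping = {}                # table name -> server index
    # Placing the largest tables first keeps per-server loads balanced,
    # which bounds the memory capacity any single server must provide.
    for name, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        target = min(range(num_servers), key=loads.__getitem__)
        mapping[name] = target
        loads[target] += size
    return mapping, loads

# Hypothetical tables with sizes in GB.
tables = {"user_id": 400, "item_id": 300, "category": 120, "geo": 80}
mapping, loads = place_tables(tables, num_servers=2)
```

Because the mapping is fixed before serving, each inference request must fan out lookups to every server holding a table it touches; that fan-out, together with input sparsity, is where the latency and compute overheads discussed in the abstract come from.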

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 162-171
Number of pages: 10
ISBN (Electronic): 9781728186436
DOIs
State: Published - Mar 2021
Externally published: Yes
Event: 2021 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2021 - Virtual, Stony Brook, United States
Duration: Mar 28 2021 - Mar 30 2021

Publication series

Name: Proceedings - 2021 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2021

Conference

Conference: 2021 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2021
Country/Territory: United States
City: Virtual, Stony Brook
Period: 3/28/21 - 3/30/21

Keywords

  • deep learning
  • distributed systems
  • recommendation

ASJC Scopus subject areas

  • Hardware and Architecture
  • Information Systems
  • Software
  • Safety, Risk, Reliability and Quality
  • Artificial Intelligence
  • Computer Science Applications
