Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

Zhaoxia Deng, Jongsoo Park, Ping Tak Peter Tang, Haixin Liu, Jie Yang, Hector Yuen, Jianyu Huang, Daya Khudia, Xiaohan Wei, Ellie Wen, Dhruv Choudhary, Raghuraman Krishnamoorthi, Carole Jean Wu, Satish Nadathur, Changkyu Kim, Maxim Naumov, Sam Naghshineh, Mikhail Smelyanskiy

Research output: Contribution to journal › Article › peer-review

Abstract

Tremendous success of machine learning (ML) and the unabated growth in model complexity motivated many ML-specific designs in hardware architectures to speed up the model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Nevertheless, recommender systems important to Facebook's personalization services are demanding and complex: They must serve billions of users per month responsively with low latency while maintaining high prediction accuracy. Do these low-precision architectures work well with our production recommendation systems? They do. But not without significant effort. In this article, we share our search strategies to adapt reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the tool chain to maintain our models' accuracy throughout their lifespan. We believe our lessons from the trenches can promote better codesign between hardware architecture and software engineering, and advance the state of the art of ML in industry.

Original language: English (US)
Pages (from-to): 93-100
Number of pages: 8
Journal: IEEE Micro
Volume: 41
Issue number: 5
State: Published - Sep 1 2021
Externally published: Yes

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Electrical and Electronic Engineering
