TY - JOUR
T1 - Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale
AU - Deng, Zhaoxia
AU - Park, Jongsoo
AU - Tang, Ping Tak Peter
AU - Liu, Haixin
AU - Yang, Jie
AU - Yuen, Hector
AU - Huang, Jianyu
AU - Khudia, Daya
AU - Wei, Xiaohan
AU - Wen, Ellie
AU - Choudhary, Dhruv
AU - Krishnamoorthi, Raghuraman
AU - Wu, Carole-Jean
AU - Nadathur, Satish
AU - Kim, Changkyu
AU - Naumov, Maxim
AU - Naghshineh, Sam
AU - Smelyanskiy, Mikhail
N1 - Publisher Copyright:
© 1981-2012 IEEE.
PY - 2021/9/1
Y1 - 2021/9/1
N2 - The tremendous success of machine learning (ML) and the unabated growth in model complexity have motivated many ML-specific designs in hardware architectures to speed up model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Nevertheless, the recommender systems important to Facebook's personalization services are demanding and complex: They must serve billions of users per month responsively, with low latency, while maintaining high prediction accuracy. Do these low-precision architectures work well with our production recommendation systems? They do, but not without significant effort. In this article, we share our search strategies for adapting reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the toolchain we use to maintain our models' accuracy throughout their lifespan. We believe our lessons from the trenches can promote better codesign between hardware architecture and software engineering and advance the state of the art of ML in industry.
AB - The tremendous success of machine learning (ML) and the unabated growth in model complexity have motivated many ML-specific designs in hardware architectures to speed up model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most. Nevertheless, the recommender systems important to Facebook's personalization services are demanding and complex: They must serve billions of users per month responsively, with low latency, while maintaining high prediction accuracy. Do these low-precision architectures work well with our production recommendation systems? They do, but not without significant effort. In this article, we share our search strategies for adapting reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the toolchain we use to maintain our models' accuracy throughout their lifespan. We believe our lessons from the trenches can promote better codesign between hardware architecture and software engineering and advance the state of the art of ML in industry.
UR - http://www.scopus.com/inward/record.url?scp=85107209568&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107209568&partnerID=8YFLogxK
U2 - 10.1109/MM.2021.3081981
DO - 10.1109/MM.2021.3081981
M3 - Article
AN - SCOPUS:85107209568
SN - 0272-1732
VL - 41
SP - 93
EP - 100
JO - IEEE Micro
JF - IEEE Micro
IS - 5
ER -