DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Gupta Udit,Hsia Samuel,Saraph Vikram,Wang Xiaodong,Reagen Brandon,Wei Gu-Yeon,Lee Hsien-Hsin S.,Brooks David,Wu Carole-Jean

2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)（2020）

引用 169|浏览648

暂无评分

摘要

Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting significant compute demand of cloud infrastructure. Thus, improving the execution efficiency of recommendation directly translates into infrastructure capacity saving. In this paper, we propose DeepRecSched, a recommendation inference scheduler that maximizes latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, model architectures, and underlying hardware systems. By carefully optimizing task versus data-level parallelism, DeepRecSched improves system throughput on server class CPUs by $2 \times$ across eight industry-representative models. Next, we deploy and evaluate this optimization in an at-scale production datacenter which reduces end-to-end tail latency across a wide variety of recommendation models by 30%. Finally, DeepRecSched demonstrates the role and impact of specialized AI hardware in optimizing system level performance (QPS) and power efficiency (QPS/watt) of recommendation inference.In order to enable the design space exploration of customized recommendation systems shown in this paper, we design and validate an end-to-end modeling infrastructure, DeepRecInfra. DeepRecInfra enables studies over a variety of recommendation use cases, taking into account at-scale effects, such as query arrival patterns and recommendation query sizes, observed from a production datacenter, as well as industry-representative models and tail latency targets.

查看译文

关键词

recommendation,end-to-end,at-scale

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要