CDPU: Co-designing Compression and Decompression Processing Units for Hyperscale Systems

Sagar Karandikar, Aniruddha N. Udipi, Junsun Choi, Joonho Whangbo, Jerry Zhao, Svilen Kanev, Edwin Lim, Jyrki Alakuijala, Vrishab Madduri, Yakun Sophia Shao, Borivoje Nikolic, Krste Asanovic, Parthasarathy Ranganathan

Proceedings of the 50th Annual International Symposium on Computer Architecture, ISCA 2023 (2023)

Abstract

General-purpose lossless data compression and decompression ("(de)compression") are used widely in hyperscale systems and are key "datacenter taxes". However, designing optimal hardware compression and decompression processing units ("CDPUs") is challenging due to the variety of algorithms deployed, input data characteristics, and evolving costs of CPU cycles, network bandwidth, and memory/storage capacities. To navigate this vast design space, we present the first large-scale data-driven analysis of (de)compression usage at a major cloud provider by profiling Google's datacenter fleet. We find that (de)compression consumes 2.9% of fleet CPU cycles and 10-50% of cycles in key services. Demand is also artificially limited; 95% of bytes compressed in the fleet use less capable algorithms to reduce compute, motivating a CDPU that changes cost vs. size tradeoffs. Prior work has improved the microarchitectural state-of-the-art for CDPUs supporting various algorithms in fixed contexts. However, we find that higher-level design parameters like CDPU placement, hash table sizing, history window sizes, and more have as significant an impact on the viability of CDPU integration, but are not well-studied. Thus, we present the first end-to-end design/evaluation framework for CDPUs, including: 1. An open-source RTL-based CDPU generator that supports many run-time and compile-time parameters. 2. Integration into an open-source RISC-V SoC for rapid performance and silicon area evaluation across CDPU placements and parameters. 3. An open-source (de)compression benchmark, HyperCompressBench, that is representative of (de)compression usage in Google's fleet. Using our framework, we perform an extensive design space exploration running HyperCompressBench. Our exploration spans a 46x range in CDPU speedup, 3x range in silicon area (for a single pipeline), and evaluates a variety of CDPU integration techniques to optimize CDPU designs for hyperscale contexts. Our final hyperscale-optimized CDPU instances are up to 10x to 16x faster than a single Xeon core, while consuming a small fraction (as little as 2.4% to 4.7%) of the area.

Keywords

compression, decompression, hardware-acceleration, warehouse-scale computing, hyperscale systems, profiling
