High Performance and Power Efficient Accelerator for Cloud Inference.

Jianguo Yao,Hao Zhou, Yalin Zhang, Ying Li, Chuang Feng, Shi Chen, Jiaoyan Chen, Yongdong Wang, Qiaojuan Hu

HPCA(2023)

引用 0|浏览9
暂无评分
摘要
Facing the growing complexity of Deep Neural Networks (DNNs), high-performance and power-efficient AI accelerators are desired to provide effective and affordable cloud inference services. We introduce our flagship product, i.e., the Cloudblazer i20 accelerator, which integrates the innovated Deep Thinking Unit (DTU 2.0). The design is driven by requests drawn from various AI inference applications and insights learned from our previous products. With careful tradeoffs in hardware-software co-design, Cloudblazer i20 delivers impressive performance and energy efficiency while maintaining acceptable hardware costs and software complexity/flexibility. To tackle computation- and data-intensive workloads, DTU 2.0 integrates powerful vector/matrix engines and a large-capacity multi-level memory hierarchy with high bandwidth. It supports comprehensive data flow and synchronization patterns to fully exploit parallelism in computation/memory access within or among concurrent tasks. Moreover, it enables sparse data compression/decompression, data broadcasting, repeated data transfer, and kernel code prefetching to optimize bandwidth utilization and reduce data access overheads. To utilize the underlying hardware and simplify the development of customized DNNs/operators, the software stack enables automatic optimizations (such as operator fusion and data flow tuning) and provides diverse programming interfaces for developers. Lastly, the energy consumption is optimized through dynamic power integrity and efficiency management, eliminating integrity risks and energy wastes. Based on the performance requirement, developers also can assign their workloads with the entire or partial hardware resources accordingly. Evaluated with 10 representative DNN models widely adopted in various domains, Cloudblazer i20 outperforms Nvidia T4 and A10 GPUs with a geometric mean of 2.22x and 1.16x in performance and 1.04x and 1.17x in energy efficiency, respectively. The improvements demonstrate the effectiveness of Cloudblazer i20’s design that emphasizes performance, efficiency, and flexibility.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要