Composition of Experts on the SN40L Reconfigurable Dataflow Unit

Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Joshua Brot, Calvin Leung, Tuowen Zhao, Mark Gottscho, Edison Chen, Kaizhao Liang, Swayambhoo Jain, Urmish Thakker, Kevin J. Brown, Kunle Olukotun

IEEE Micro (2024)

Abstract
Monolithic large language models (LLMs) pose significant challenges in training and serving during active deployment. In contrast, Composition of Experts (CoE) is a modular approach that lowers the cost and complexity of training and serving. In this article, we explore the unique hardware challenges posed by CoE models, such as lower operational intensity and the cost of switching between models. We describe the SambaNova SN40L Reconfigurable Dataflow Unit (RDU), which combines streaming dataflow with a new three-tier memory system of SRAM, HBM, and DDR DRAM. A single 8-socket SN40L Node achieves speedups between 2× and 13× over an optimized baseline due to aggressive operator fusion. The SN40L Node deploys Samba-CoE, a 1-trillion-parameter CoE, with a 19× smaller machine footprint, speeds up model switching time by 15× to 31×, and achieves an overall speedup of 3.7× over a DGX H100 and 6.6× over a DGX A100.
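The abstract describes Composition of Experts only at a high level. The sketch below illustrates the general idea of routing each request to one of several smaller expert models so that only the selected expert runs; the router heuristic, expert names, and dispatch code are hypothetical placeholders for illustration, not the Samba-CoE design described in the paper.

```python
# Minimal sketch of Composition-of-Experts (CoE) routing.
# Assumption: a router selects one expert model per request; everything
# below (keyword router, expert pool) is an illustrative stand-in.

from typing import Callable, Dict


def make_expert(name: str) -> Callable[[str], str]:
    """Stand-in for a smaller expert LLM; a real system would load model weights."""
    def expert(prompt: str) -> str:
        return f"[{name}] response to: {prompt}"
    return expert


# Hypothetical pool of domain experts that together replace one monolithic LLM.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "code": make_expert("code-expert"),
    "math": make_expert("math-expert"),
    "general": make_expert("general-expert"),
}


def route(prompt: str) -> str:
    """Toy router: keyword matching stands in for a learned routing model."""
    lowered = prompt.lower()
    if any(k in lowered for k in ("def ", "class ", "compile")):
        return "code"
    if any(k in lowered for k in ("integral", "prove", "equation")):
        return "math"
    return "general"


def coe_generate(prompt: str) -> str:
    """Dispatch the prompt to a single expert. Only that expert's weights need
    to be resident on the accelerator, which is why model-switching cost and
    memory tiering matter for CoE serving."""
    return EXPERTS[route(prompt)](prompt)


if __name__ == "__main__":
    print(coe_generate("Prove the equation holds for all n."))
```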