2.4 ATOMUS: A 5nm 32TFLOPS/128TOPS ML System-on-Chip for Latency Critical Applications
2024 IEEE International Solid-State Circuits Conference (ISSCC)
Abstract
The growing computational demands of AI inference have led to the widespread use of hardware accelerators across platforms, spanning from the edge to the datacenter/cloud. Certain AI application areas, such as high-frequency trading (HFT) [1–2], impose a hard inference latency deadline for successful execution. We present our new AI accelerator, which achieves high inference capability with outstanding single-stream responsiveness for demanding service-level objective (SLO)-based AI services and pipelined inference applications, including large language models (LLMs). Owing to its low thermal design power (TDP), the scale-out solution can effectively support multi-stream applications as well as total cost of ownership (TCO)-centric systems.
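To make the SLO framing concrete, below is a minimal sketch (not from the paper) of how single-stream responsiveness might be evaluated against a hard latency deadline: requests are issued one at a time at batch size 1, and per-request latencies are compared to an SLO budget. The names `run_inference` and `SLO_DEADLINE_MS` are hypothetical placeholders, and the 2 ms budget is an assumed example, not a figure reported for ATOMUS.

```python
import time
import statistics

SLO_DEADLINE_MS = 2.0  # hypothetical hard per-inference latency budget


def measure_single_stream_latency(run_inference, n_requests=1000):
    """Issue requests one at a time (batch size 1) and record per-request latency."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        run_inference()  # one synchronous inference call on the accelerator
        latencies_ms.append((time.perf_counter() - start) * 1e3)

    # Fraction of requests that miss the hard deadline (SLO violations).
    violations = sum(1 for t in latencies_ms if t > SLO_DEADLINE_MS)
    return {
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": statistics.quantiles(latencies_ms, n=100)[98],
        "slo_violation_rate": violations / n_requests,
    }
```

For latency-critical workloads such as HFT, the tail percentile (p99) and the violation rate, rather than mean throughput, are the quantities that determine whether an accelerator meets the service objective.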
Keywords
Synchronization, Data Center, Data Streams, High Bandwidth, Edge Computing, Task Scheduling, Hardware Accelerators, External Memory, Data Cache, Memory Bandwidth, Public Cloud, Computational Graph, Neural Engineering, Off-chip Memory, Neural Clusters, Local Bus, Separate Core