
2.4 ATOMUS: A 5nm 32TFLOPS/128TOPS ML System-on-Chip for Latency Critical Applications

Chang-Hyo Yu, Hyo-Eun Kim, Sungho Shin, Kyeongryeol Bong, Hyunsuk Kim, Yoonho Boo, Jaewan Bae, Minjae Kwon, Karim Charfi, Jinseok Kim, Hongyun Kim, Myeongbo Shim, Changsoo Ha, Wongyu Shin, Jae-Sung Yoon, Miock Chi, Byungjae Lee, Sungpill Choi, Donghan Kim, Jeongseok Woo, Seokju Yoon, Hyunje Jo, Hyunho Kim, Hyungseok Heo, Young-Jae Jin, Jiun Yu, Jaehwan Lee, Hyunsung Kim, Minhoo Kang, Seokhyeon Choi, Seung-Goo Kim, Myunghoon Choi, Jungju Oh, Yunseong Kim, Haejoon Kim, Sangeun Je, Junhee Ham, Juyeong Yoon, Jaedon Lee, Seonhyeok Park, Youngseob Park, Jaebong Lee, Boeui Hong, Jaehun Ryu, Hyunseok Ko, Kwanghyun Chung, Jongho Choi, Sunwook Jung, Yashael Faith Arthanto, Jonghyeon Kim, Heejin Cho, Hyebin Jeong, Sungmin Choi, Sujin Han, Junkyu Park, Kwangbae Lee, Sung-Il Bae, Jaeho Bang, Kyeong-Jae Lee, Yeongsang Jang, Jungchul Park, Sanggyu Park, Jueon Park, Hyein Shin, Sunghyun Park, Jinwook Oh

2024 IEEE International Solid-State Circuits Conference (ISSCC)

Abstract
The growing computational demands of AI inference have led to widespread use of hardware accelerators across platforms, spanning from the edge to the datacenter/cloud. Certain AI application areas, such as high-frequency trading (HFT) [1-2], impose a hard inference latency deadline for successful execution. We present our new AI accelerator, which achieves high inference capability with outstanding single-stream responsiveness for demanding service-level objective (SLO)-based AI services and pipelined inference applications, including large language models (LLMs). Owing to its low thermal design power (TDP), the scale-out solution can also effectively support multi-stream applications as well as total cost of ownership (TCO)-centric systems.
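To make the SLO constraint concrete: a latency-critical service is judged on whether each single-stream (batch-1) inference finishes before a hard deadline, usually tracked at a tail percentile rather than the mean. The Python sketch below illustrates that bookkeeping; it is a hypothetical measurement harness, not ATOMUS's runtime API, and the infer callable, dummy workload, and 5 ms deadline are assumptions chosen for illustration.

    import time
    import statistics

    SLO_DEADLINE_MS = 5.0  # hypothetical hard deadline; the paper does not publish one

    def measure_single_stream_latency(infer, inputs):
        # Issue requests one at a time (batch size 1) so each sample reflects
        # single-stream responsiveness rather than batched throughput.
        latencies_ms = []
        for x in inputs:
            start = time.perf_counter()
            infer(x)  # stand-in for a call into the accelerator runtime
            latencies_ms.append((time.perf_counter() - start) * 1e3)
        return latencies_ms

    # Demo with a dummy 2 ms "model"; a real check would call the deployed model.
    samples = measure_single_stream_latency(lambda x: time.sleep(0.002), range(1000))
    p99 = statistics.quantiles(samples, n=100)[98]  # 99th-percentile (tail) latency
    print(f"p99 = {p99:.3f} ms; SLO met: {p99 <= SLO_DEADLINE_MS}")

Tail percentiles such as p99 are the customary SLO metric here because a single late response in an HFT pipeline is a failed execution, which is why the abstract emphasizes single-stream responsiveness over raw throughput.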
Key words
Synchronization, Data Center, Data Streams, High Bandwidth, Edge Computing, Task Scheduling, Hardware Accelerators, External Memory, Data Cache, Memory Bandwidth, Public Cloud, Computational Graph, Neural Engineering, Off-chip Memory, Neural Clusters, Local Bus, Separate Core