CoDA: A Co-Design Framework for Versatile and Efficient Attention Accelerators

IEEE Transactions on Computers (2024)

Abstract
As a primary component of Transformers, the attention mechanism suffers from quadratic computational complexity. To enable efficient implementations, its hardware accelerator designs have attracted great research interest. However, most existing accelerators support only a single type of application and a single type of attention, making it difficult to meet the demands of diverse application scenarios. Additionally, they mainly focus on dynamic pruning of attention matrices, which requires the deployment of pre-processing units and thereby reduces overall hardware efficiency. This paper presents CoDA, an algorithm, dataflow, and architecture co-design framework for versatile and efficient attention accelerators. The designed accelerator supports both NLP and CV applications, and can be configured to support either low-rank attention or low-rank plus sparse attention. We apply algorithmic transformations to low-rank attention to significantly reduce computational complexity. To prevent the increase in storage overhead that these transformations would otherwise incur, we carefully design the dataflows in a block-wise fashion. Down-scaling softmax is further supported through architecture and dataflow co-design. Moreover, we propose a softmax sharing strategy to reduce area cost. Our experimental results demonstrate that the proposed accelerator outperforms state-of-the-art designs in throughput, area efficiency, and energy efficiency.
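To make the complexity reduction concrete, the sketch below contrasts standard attention with a Linformer-style low-rank variant in which learned projections shrink the key/value sequence length from n to k, cutting the score matrix from n x n to n x k. The projection scheme, matrix names (E, F), and dimensions are illustrative assumptions; the abstract does not specify CoDA's exact algorithmic transformation.

```python
# Illustrative sketch only: a Linformer-style low-rank attention is assumed here;
# CoDA's actual transformation and dataflow are described in the paper, not below.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # O(n^2 * d): the n x n score matrix dominates both compute and storage.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n)
    return softmax(scores) @ V                       # (n, d)

def low_rank_attention(Q, K, V, E, F):
    # O(n * k * d) with k << n: keys and values are projected to length k,
    # so the score matrix is only n x k.
    K_proj = E @ K                                   # (k, d)
    V_proj = F @ V                                   # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])     # (n, k)
    return softmax(scores) @ V_proj                  # (n, d)

# Toy dimensions (hypothetical): sequence length n, head dim d, projected length k.
n, d, k = 512, 64, 32
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E, F = (rng.standard_normal((k, n)) / np.sqrt(n) for _ in range(2))
out = low_rank_attention(Q, K, V, E, F)              # shape (n, d)
```

In a hardware setting, the benefit of such a reordering is that no n x n intermediate ever needs to be materialized, which is consistent with the abstract's point that block-wise dataflows keep storage overhead in check.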
Keywords
Transformers,attention mechanism,hardware accelerator,low-rank approximation,co-design framework