High-Throughput Edge Inference for BERT Models via Neural Architecture Search and Pipeline

GLSVLSI '23: Proceedings of the Great Lakes Symposium on VLSI 2023 (2023)

Abstract
There has been growing interest in improving BERT inference throughput on resource-constrained edge devices to deliver a satisfactory user experience. One methodology is heterogeneous computing, which utilizes multiple processing elements to accelerate inference. Another is Neural Architecture Search (NAS), which finds optimal solutions in the accuracy-throughput design space. In this paper, for the first time, we incorporate NAS with pipelining for BERT models. We show that performing NAS with pipelining achieves on average 53% higher throughput than NAS with a homogeneous system. Additionally, we propose a NAS algorithm that incorporates hardware performance feedback to accelerate the NAS process. Our proposed NAS algorithm speeds up the search process by approximately 4x and 5.5x on the design spaces of BERT and CNN models, respectively. Furthermore, by exploring the accuracy-throughput design space of BERT models, we demonstrate that performing pipelining then NAS (Pipeline-then-NAS) can lead to solutions with up to 9x higher inference throughput than homogeneous inference on the BERT-base model, with only a 1.3% decrease in accuracy.
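To make the hardware-feedback idea in the abstract concrete, below is a minimal, illustrative sketch of a NAS loop that uses measured on-device throughput as an early filter, so the expensive accuracy evaluation only runs on candidates that already meet the deployment constraint. This is not the paper's algorithm: the search space, the toy cost model in `measured_throughput`, and the accuracy proxy in `evaluate_accuracy` are all hypothetical placeholders; a real flow would time each candidate on the target pipeline (e.g., partitioned across ARM big.LITTLE clusters) and fine-tune it to measure accuracy.

```python
# Illustrative sketch of hardware-feedback-guided NAS (not the authors'
# implementation). All models below are hypothetical placeholders.
import random

# Hypothetical BERT-like search space: layers, hidden size, attention heads.
SEARCH_SPACE = {
    "num_layers": [4, 6, 8, 12],
    "hidden_size": [256, 384, 512, 768],
    "num_heads": [4, 8, 12],
}

def sample_candidate():
    """Draw one architecture uniformly at random from the search space."""
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def measured_throughput(arch):
    """Placeholder for an on-device throughput measurement (samples/s).
    A real flow would run the candidate on the target hardware pipeline
    and time it; here a toy inverse-cost model stands in."""
    cost = arch["num_layers"] * arch["hidden_size"] * arch["num_heads"]
    return 1e6 / cost

def evaluate_accuracy(arch):
    """Placeholder for the expensive step (training and evaluating the
    candidate); a toy monotone proxy stands in."""
    return (0.80
            + 0.015 * SEARCH_SPACE["num_layers"].index(arch["num_layers"])
            + 0.010 * SEARCH_SPACE["hidden_size"].index(arch["hidden_size"]))

def search(num_trials=100, min_throughput=25.0):
    """Random search with hardware feedback: candidates that miss the
    throughput target are pruned before the costly accuracy evaluation."""
    best = None
    for _ in range(num_trials):
        arch = sample_candidate()
        tput = measured_throughput(arch)
        if tput < min_throughput:      # hardware feedback: prune early
            continue
        acc = evaluate_accuracy(arch)  # expensive step, run rarely
        if best is None or acc > best[1]:
            best = (arch, acc, tput)
    return best

if __name__ == "__main__":
    result = search()
    if result:
        arch, acc, tput = result
        print(f"best: {arch}  acc={acc:.3f}  throughput={tput:.1f}/s")
```

Pruning on throughput before accuracy evaluation is one plausible way to realize the abstract's claimed search speedup, since hardware measurement is far cheaper than training each candidate.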
Keywords
Edge inference, Throughput, Pipeline, ARM big.LITTLE, NAS