APQ: Automated DNN Pruning and Quantization for ReRAM-Based Accelerators

IEEE Transactions on Parallel and Distributed Systems (2023)

Abstract
Emerging ReRAM-based accelerators support in-memory computation to accelerate deep neural network (DNN) inference. Weight matrix pruning is a widely used technique to reduce the size of DNN models, thereby reducing the resource and energy consumption of ReRAM-based accelerators. However, existing pruning works for ReRAM-based accelerators have three major issues. First, they use heuristics or rules from domain experts to prune the weights, leading to sub-optimal pruning policies. Second, they prune weights at row- or column-level coarse granularity, resulting in poor compression rates under model accuracy constraints. Third, they apply weight pruning in isolation, missing the compression opportunity of combining pruning and quantization. In this article, we propose an Automated DNN Pruning and Quantization framework, named APQ, for ReRAM-based accelerators. First, APQ adopts reinforcement learning (RL) to automatically determine the pruning policy of each DNN layer toward a global optimum. Second, it prunes and maps weight matrices to a ReRAM-based accelerator at the finer granularity of column-vectors, which improves compression rates under the accuracy constraints. To address the resulting dislocation problem, it uses a new data path in ReRAM-based accelerators to correctly index and feed inputs to the matrix-vector computation. Third, to further reduce resource consumption, APQ also leverages reinforcement learning to automatically determine the quantization bitwidth of each layer of the pruned DNN model. Experimental results show that APQ achieves up to a 4.52X compression rate, 4.11X area efficiency, and 4.51X energy efficiency with similar or even higher model accuracy, compared to the state-of-the-art work.
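The abstract describes column-vector-granularity pruning and per-layer quantization only at a high level. The following minimal NumPy sketch is not the paper's implementation; the vector length, the L2-norm scoring of column-vectors, and the symmetric uniform quantizer are assumptions made purely for illustration of how a weight matrix could be pruned at column-vector granularity and then quantized to a fixed per-layer bitwidth.

```python
# Hypothetical illustration (not the authors' code): column-vector pruning
# followed by per-layer uniform quantization, using NumPy only.
import numpy as np

def prune_column_vectors(weight, vec_len=8, sparsity=0.5):
    """Prune a 2-D weight matrix at column-vector granularity.

    Each column is split into contiguous vectors of `vec_len` weights
    (roughly matching how weights map onto ReRAM crossbar columns); the
    vectors with the smallest L2 norms are zeroed until the target
    `sparsity` ratio is reached.
    """
    rows, cols = weight.shape
    pad = (-rows) % vec_len
    padded = np.pad(weight, ((0, pad), (0, 0)))
    # Shape: (num_vectors_per_column, vec_len, cols)
    vectors = padded.reshape(-1, vec_len, cols)
    norms = np.linalg.norm(vectors, axis=1)          # (num_vectors, cols)
    k = int(sparsity * norms.size)
    if k > 0:
        threshold = np.partition(norms.ravel(), k - 1)[k - 1]
        mask = (norms > threshold)[:, None, :]       # keep vectors above threshold
        vectors = vectors * mask
    return vectors.reshape(-1, cols)[:rows, :]

def quantize_uniform(weight, bits=4):
    """Symmetric uniform quantization of one layer to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weight).max() / qmax if weight.any() else 1.0
    q = np.clip(np.round(weight / scale), -qmax - 1, qmax)
    return q * scale  # de-quantized weights for accuracy evaluation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 32)).astype(np.float32)
    w_pruned = prune_column_vectors(w, vec_len=8, sparsity=0.5)
    w_quant = quantize_uniform(w_pruned, bits=4)
    print(f"fraction of weights kept: {np.count_nonzero(w_pruned) / w.size:.2f}")
```

In APQ, the sparsity ratio and bitwidth used above would not be fixed constants but per-layer decisions produced by the reinforcement learning agents; this sketch only shows the effect of applying one such decision to a single layer.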
Keywords
automated DNN pruning, quantization, ReRAM-based