Gradient-based Automatic Per-Weight Mixed Precision Quantization for Neural Networks On-Chip
arXiv (2024)
Abstract
Model size and inference speed at deployment time are major challenges in
many deep learning applications. A promising strategy to overcome these
challenges is quantization. However, straightforward uniform quantization to
very low precision can result in significant accuracy loss. Mixed-precision
quantization, based on the idea that certain parts of the network can
accommodate lower precision than others without compromising performance,
offers a potential solution. In this work, we present High Granularity
Quantization (HGQ), an innovative quantization-aware training method that
automatically fine-tunes per-weight and per-activation precision for
ultra-low-latency, low-power neural networks to be deployed on FPGAs. We
demonstrate that HGQ can outperform existing methods by a substantial
margin, achieving resource reduction by up to a factor of 20 and latency
improvement by a factor of 5 while preserving accuracy.
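
Below is a minimal, hypothetical sketch of the core idea in PyTorch, not the authors' Keras/hls4ml implementation: each weight carries its own continuous, trainable bitwidth, rounding uses a straight-through estimator so gradients reach both the weights and the bitwidths, and a plain bit-count penalty stands in for the paper's resource estimate. Names such as `PerWeightQuantizer` and the `1e-4` penalty weight are illustrative assumptions.

```python
import torch


class PerWeightQuantizer(torch.nn.Module):
    """Fixed-point rounding with one continuous, trainable bitwidth per weight.

    Illustrative only; the real HGQ parameterization and surrogate gradient
    may differ.
    """

    def __init__(self, shape, init_bits: float = 8.0):
        super().__init__()
        # One fractional-bitwidth parameter per weight element.
        self.bits = torch.nn.Parameter(torch.full(shape, init_bits))

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        step = 2.0 ** -self.bits.clamp(0.0, 16.0)  # quantization step size
        scaled = w / step
        # Straight-through estimator: forward uses round(), backward treats
        # rounding as the identity. Because `step` stays in the autograd
        # graph, `bits` also receives a gradient (the rounding residual).
        scaled_q = scaled + (scaled.round() - scaled).detach()
        return scaled_q * step


# Usage: quantize a linear layer's weights and penalize the total bit count,
# so training trades accuracy against precision, weight by weight.
layer = torch.nn.Linear(16, 4)
quant = PerWeightQuantizer(layer.weight.shape)
x = torch.randn(32, 16)
y = torch.nn.functional.linear(x, quant(layer.weight), layer.bias)
loss = y.pow(2).mean() + 1e-4 * quant.bits.clamp(min=0.0).sum()
loss.backward()  # gradients reach the weights and the per-weight bitwidths
```

The gradient with respect to each bitwidth here is the LSQ-style rounding residual that falls out of keeping the step size in the graph; under the bit-count penalty, weights whose rounding error barely affects the task loss are driven toward fewer bits, which is the per-weight mixed-precision behavior the abstract describes.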