More is Less - Byte-quantized models are faster than bit-quantized models on the edge.

Big Data (2022)

Abstract
Model quantization has been a popular approach to trading accuracy for speed during model serving, especially as new models grow ever larger. Traditionally, low-precision quantized models achieve better inference speed than their high-precision counterparts. However, the story may change with the advent of new machine learning instructions in modern processors. In this paper, we make this case using ARM processors. Our quantitative analysis shows that low-precision quantized models can be inferior in both accuracy and inference speed on modern processors. In response, we present MiL, a new quantized neural network package. Operators in MiL are optimized for inference using the machine learning instructions newly available on the target platform. Experiments show that serving neural networks using PyTorch with MiL outperforms all state-of-the-art alternatives, including Riptide, TFLite with Ruy, and PyTorch with QNNPACK.
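The contrast the abstract draws hinges on byte-level (int8) quantization mapping directly onto modern dot-product instructions such as ARM's SDOT, whereas bit-level schemes need extra unpacking work. The following is a minimal illustrative sketch (not the paper's MiL implementation) of symmetric int8 quantization and the integer dot-product pattern such instructions accelerate:

```python
# Illustrative sketch, NOT the MiL package: symmetric int8 ("byte")
# quantization, the scheme that maps directly onto byte-wide
# dot-product instructions such as ARM's SDOT.

def quantize_int8(weights):
    """Map float weights to int8 values with a single scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def int8_dot(qa, sa, qb, sb):
    """Integer dot product with one dequantization at the end --
    the inner loop a byte-wide ML instruction executes in hardware."""
    acc = sum(x * y for x, y in zip(qa, qb))  # int32 accumulation
    return acc * sa * sb

a = [0.5, -1.27, 0.03, 1.0]
b = [1.27, 0.2, -0.4, 0.8]
qa, sa = quantize_int8(a)
qb, sb = quantize_int8(b)
approx = int8_dot(qa, sa, qb, sb)
exact = sum(x * y for x, y in zip(a, b))
```

Because the accumulation stays in integers, a byte-quantized model can use one hardware instruction per group of multiply-adds, while a bit-quantized model must first expand its packed bits into a format the instruction accepts.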
Keywords
quantization,neural network,edge computing