Towards an Accurate Latency Model for Convolutional Neural Network Layers on GPUs

MILCOM (2021)

Abstract
Convolutional Neural Networks (CNNs) have shown great success in many sensing and recognition applications. However, their excessive resource demand remains a major barrier to deployment on low-end devices. Optimizations such as model compression are thus needed for practical deployment. To fully exploit existing system resources, platform-aware optimizations have emerged in recent years, for which an execution-time model becomes a necessity. However, non-monotonicity over the network configuration space makes execution-time modeling a challenging task. Data-driven approaches have the advantage of being portable across platforms by treating the hardware and software stack as a black box, but at the cost of extremely long profiling time. On the other hand, analytical models from the architecture and systems literature do not need heavy profiling but require laborious analysis by domain experts. In this paper, we focus on building a general latency model for convolutional layers, which account for the majority of the total execution time in CNN models. We identify two major non-linear modes in the relationship between latency and convolution parameters, and analyze the mechanisms behind them. The resulting model has better interpretability and can reduce the profiling workload. Evaluation results show that our model outperforms baselines on different platforms and CNN models.
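To make the modeling problem concrete, the sketch below computes the arithmetic work of a convolutional layer and derives a naive roofline-style latency estimate from it. This is an illustrative baseline, not the paper's model: the function name, the assumed peak throughput, and the "latency = work / throughput" form are all assumptions, and the paper's point is precisely that real GPU latency is non-monotonic in these parameters, so a purely proportional model like this falls short.

```python
def conv2d_macs(h, w, c_in, c_out, k, stride=1):
    """Multiply-accumulate (MAC) count for a 2D convolution.

    h, w     : input feature-map height and width
    c_in/out : input and output channel counts
    k        : square kernel size
    Assumes 'same'-style padding so output spatial size is h/stride x w/stride.
    """
    out_h = h // stride
    out_w = w // stride
    # Each output element needs c_in * k * k MACs.
    return out_h * out_w * c_out * c_in * k * k


# Illustrative: a 3x3 conv over a 56x56x64 feature map producing 64 channels.
macs = conv2d_macs(56, 56, 64, 64, 3)

# Naive latency estimate under an assumed 10 TFLOP/s peak (2 FLOPs per MAC).
# Real measurements deviate from this proportional estimate, which motivates
# the non-linear modes analyzed in the paper.
PEAK_FLOPS = 10e12  # assumption, not a measured figure
est_latency_ms = (2 * macs / PEAK_FLOPS) * 1e3
```

The gap between such an estimate and measured latency (e.g. due to tiling, occupancy, and kernel-selection effects) is what a learned or analytical correction must capture.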
Keywords
CNN models,convolutional neural network layers,recognition applications,resource demand,low-end devices,model compression,system resources,platform-aware optimizations,execution-time model,network configuration space,data-driven approaches,software stack,extremely long profiling time,analytical models,system literature,general latency model,total execution time,nonlinear modes,latency convolution parameters,GPU