VirtualFlow: Decoupling Deep Learning Model Execution from Underlying Hardware

arXiv (2020)

Abstract
State-of-the-art deep learning systems tightly couple model execution with the underlying hardware. This coupling poses important challenges in a world where the scale of deep learning workloads is growing rapidly: workloads with high resource requirements are inaccessible to most users, experimentation on smaller test beds is impossible, and results are difficult to reproduce across different hardware. We propose VirtualFlow, a novel system approach leveraging virtual node processing to decouple model execution from the hardware. In each execution step, the batch is divided and processed with data parallelism on many virtual nodes instead of physical devices (GPUs, TPUs), and the gradients are aggregated and applied to the model after all virtual nodes finish processing their data. With multiple virtual nodes mapped to each device, the system allows users to run models at much larger batch sizes that would otherwise exceed the memory limits of the underlying physical resources. VirtualFlow significantly improves model training reproducibility across different hardware, and enables models to run on shared clusters with dynamically changing resources for better efficiency. Our implementation of VirtualFlow enables virtual node processing with elasticity for TensorFlow. Evaluation with representative deep learning models (ResNet, BERT, Transformer) demonstrates strong convergence guarantees on different hardware with out-of-the-box hyperparameters, and up to 48% lower job completion times with resource elasticity.
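The core mechanism described in the abstract, splitting each batch across virtual nodes that time-share a physical device and applying the aggregated gradient only after all virtual nodes finish, can be illustrated as gradient accumulation in TensorFlow. The sketch below is a minimal illustration, not VirtualFlow's actual API: the train_step helper and the model, loss_fn, and optimizer objects are hypothetical Keras-style placeholders assumed for the example.

import tensorflow as tf

def train_step(model, loss_fn, optimizer, batch_x, batch_y, num_virtual_nodes):
    # Split the global batch across virtual nodes that share one device.
    x_shards = tf.split(batch_x, num_virtual_nodes)
    y_shards = tf.split(batch_y, num_virtual_nodes)

    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
    total_loss = 0.0
    for x, y in zip(x_shards, y_shards):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        # Accumulate per-shard gradients; only one shard's activations
        # need to be resident in device memory at a time.
        accumulated = [a + g for a, g in zip(accumulated, grads)]
        total_loss += loss

    # Average over virtual nodes and apply a single update, so the step
    # matches one update at the full (virtual) batch size.
    accumulated = [a / num_virtual_nodes for a in accumulated]
    optimizer.apply_gradients(zip(accumulated, model.trainable_variables))
    return total_loss / num_virtual_nodes

Because the model update is applied once per full batch regardless of how many virtual nodes the batch is split over, a single device can emulate a much larger effective batch size than its memory allows, at the cost of processing the virtual nodes sequentially.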
Keywords
deep learning model execution, underlying hardware, deep learning