Accelerating container-based deep learning hyperparameter optimization workloads.

ACM Conference on Management of Data(2022)

引用 2|浏览16
暂无评分
摘要
DocuSign is advancing at a great pace for artificial intelligence and embracing a continuous shift towards developing and deploying an increasing number of deep learning models. During the development stage, developers usually build a number of deep learning models and train them using a bunch of potential hyperparameter configurations to find the best-performed one, which is called hyperparameter optimization (HPO). Such HPO jobs can run for a long time due to ever-larger models and numerous hyperparameter configurations. Furthermore, the HPO jobs at DocuSign are processed in container-based environments so that the best-performed model can be deployed and maintained in production reliably and efficiently. The workload consists of the long-running and containerized HPO jobs that can saturate the current machine learning infrastructure in DocuSign rapidly, but the key resource (e.g., GPU memory or computing unit) are not always full utilized, for example, some hyperparameter configurations may only take a fraction of the GPU memory but will occupy the entire device due to containerization. Suffering from this issue, the users may have to either wait or manually coordinate with others for the resource to run the jobs, and such HPO workloads often take an unexpectedly long time to be completed. To address this problem, we propose Relish , a system designed specifically to accelerate HPO workloads by segmenting HPO jobs and efficiently sharing GPU resources in container-based environments so that multiple containerized segmented jobs can be executed in parallel. We conduct an HPO workload based on a three-month-long trace from a multi-tenant GPU cluster of a research and development team in DocuSign to evaluate Relish, the results demonstrate that Relish can significantly improve GPU utilization and accelerate the workload through efficient multiple jobs execution.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要