UMA-MF: A Unified Multi-CPU/GPU Asynchronous Computing Framework for SGD-Based Matrix Factorization

IEEE Transactions on Parallel and Distributed Systems (2023)

Abstract
Recent research has shown that collaborative computing of CPUs and GPUs in the same system can effectively accelerate large-scale SGD-based matrix factorization (MF), but it faces limited scalability due to parameter synchronization on the server. Theoretically, asynchronous methods can overcome this shortcoming. However, through a series of tests, observations, and analyses, we find that developing an effective asynchronous multi-CPU/GPU MF framework faces several major design challenges: underutilized CPUs, high communication overhead, and the asynchronous data-safety issue. This article presents a unified multi-CPU/GPU asynchronous computing framework for SGD-based matrix factorization, named UMA-MF. UMA-MF treats the CPUs and GPUs in the system as distributed workers that train matrix datasets in parallel and update feature parameters asynchronously. It provides a cache-friendly CPU external working mode, which improves the CPU's cache hit rate and thereby promotes efficient use of CPUs. It offers an algorithm to find the shortest communication ring topology over heterogeneous CPU/GPU workers and builds computing-communication pipelines to minimize communication overhead. It implements a wait-free structure and load-balanced data distribution to achieve asynchronous data safety. UMA-MF can effectively accelerate SGD-based MF on multi-CPU/GPU systems in an asynchronous way. On a physical platform with configurations ranging from a single-processor system to a 2-CPU/4-GPU system, on five common datasets (Netflix, R1, R2, Goodreads, and de-dense), UMA-MF achieves up to a 3.56x speedup over HCC-MF, the state-of-the-art multi-CPU/GPU synchronous computing framework for SGD-based MF. UMA-MF also shows good scalability: when the system is scaled to 2 CPUs and 4 GPUs, its training-time speedup reaches 70%-97% of the ideal speedup.
Keywords
Matrix factorization, multi-CPU/GPU, parallel systems, stochastic gradient descent