Model Parallelism on Distributed Infrastructure: A Literature Review from Theory to LLM Case-Studies
arXiv (2024)
Abstract
Neural networks have become a cornerstone of machine learning. As these
models continue to grow in complexity, so does the underlying hardware and
software infrastructure for training and deployment. In this
survey we answer three research questions: "What types of model parallelism
exist?", "What are the challenges of model parallelism?", and "What is a modern
use-case of model parallelism?" We answer the first question by expressing
neural networks as operator graphs and examining the dimensions along which
they can be parallelised: intra-operator and inter-operator parallelism. We answer
the second question by collecting and listing both the implementation
challenges of each type of parallelism and the problem of optimally
partitioning the operator graph. We answer the last question by surveying how
parallelism is applied in modern multi-billion-parameter transformer networks,
to the extent that this is possible with the limited information shared about
these networks.
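
As an illustration of the two dimensions named in the abstract, the following
minimal NumPy sketch (not taken from the paper; all names and sizes are
illustrative) splits a single matrix multiply across "devices" (intra-operator)
and places whole layers on different "devices" (inter-operator), with device
placement simulated by plain Python functions:

import numpy as np

# Toy two-layer MLP: y = relu(x @ W1) @ W2
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(32, 4))

# Intra-operator parallelism: partition one operator (the first matmul)
# column-wise; each shard computes a slice of the activation, and the
# slices are concatenated afterwards. In practice each shard would run
# on its own accelerator.
W1_shards = np.split(W1, 2, axis=1)
h_shards = [x @ w for w in W1_shards]
h = np.maximum(np.concatenate(h_shards, axis=1), 0.0)

# Inter-operator parallelism: place whole operators (layers) on different
# "devices" and pass activations between them, as in pipeline parallelism.
def device_0(x):          # holds layer 1
    return np.maximum(x @ W1, 0.0)

def device_1(h):          # holds layer 2
    return h @ W2

y = device_1(device_0(x))

# Sanity check: the sharded intra-operator path matches the unsharded layer 1.
assert np.allclose(h, device_0(x))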