Towards Fast Crash-Consistent Cluster Checkpointing

2022 IEEE High Performance Extreme Computing Conference (HPEC) (2022)

Abstract
Machine learning models are expensive to train: they require costly high-compute hardware and have long training times. Training runs are therefore especially sensitive to program faults and unexpected system crashes, which can erase hours, if not days, of work. While there are plenty of strategies designed to mitigate the risk of unexpected system downtime, the most popular strategy in machine learning is checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is effective, but it requires carefully balancing two factors: how often a checkpoint is made (the checkpointing schedule) and the cost of creating each checkpoint. In this paper, we leverage the Python Memory Manager (PyMM), which provides Python support for persistent memory, together with emerging persistent memory technology (Optane DC), to accelerate the checkpointing operation while maintaining crash consistency. We first show that when checkpointing models, PyMM with persistent memory can save from minutes to days of checkpointing runtime. We then further optimize the checkpointing operation with PyMM and demonstrate our approach with the KMeans and Gaussian Mixture Model (GMM) algorithms on two real-world datasets, MNIST and MusicNet. Through evaluation, we show that, compared with current state-of-the-art checkpointing approaches, our approach achieves a checkpointing speedup of between 10x and 75x for KMeans and over 3x for GMM. We also verify that our solution recovers from crashes, while traditional approaches cannot.
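To make the notions of a checkpointing schedule and crash consistency concrete, the following is a minimal file-based sketch in Python. It is not the PyMM approach described in the abstract and uses no persistent memory: it relies on the common write-to-temporary-file-then-rename pattern, so that a crash during a save leaves either the previous checkpoint or the new one, never a partially written file. The function names (`save_checkpoint`, `load_checkpoint`, `train`), the `checkpoint_every` parameter, and the toy update rule are all hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the bytes to storage
        os.replace(tmp_path, path)  # atomically swap in the new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_iters, checkpoint_every, path):
    """Toy training loop that resumes from the last checkpoint, if any."""
    # Hypothetical "model": one running estimate plus an iteration counter.
    state = load_checkpoint(path, {"iteration": 0, "centroid": 0.0})
    for i in range(state["iteration"], total_iters):
        state["centroid"] += 0.5 * (1.0 - state["centroid"])  # toy update
        state["iteration"] = i + 1
        # The checkpointing schedule: save once every checkpoint_every steps.
        if state["iteration"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

A smaller `checkpoint_every` loses less work after a crash but pays the save cost more often, which is the trade-off the abstract describes. For stronger durability guarantees on some file systems, the containing directory would also need to be synced after the rename; that step is omitted here for brevity.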
Keywords
towards fast crash-consistent cluster checkpointing,Machine Learning models,high-compute hardware,long training times,unexpected system crashes,unexpected system downtime,popular strategy,persistent storage,checkpoint,checkpointing schedule,leverage Python Memory Manager,PyMM,persistent memory,checkpointing operation,crash consistency,checkpointing models,checkpointing runtime,Gaussian Mixture Model algorithms,checkpointing speedup,current state-of-the-art checkpointing approaches