A Case Study of Data Management Challenges Presented in Large-Scale Machine Learning Workflows

2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)(2023)

Abstract
Running scientific workflow applications on high-performance computing systems yields promising results in terms of accuracy and scalability. One example is particle track reconstruction research in high-energy physics, which consists of multiple machine-learning tasks. However, as modern HPC systems scale up, researchers spend more effort coordinating the individual workflow tasks because of their growing demands for computational power, large memory footprints, and data movement across various storage devices. These issues are further exacerbated when intermediate result data must be shared among different tasks, each of which is optimized to fulfill its own design goals, such as the shortest runtime or minimal memory footprint. In this paper, we investigate the data management challenges presented in scientific workflows. We observe that individual tasks, such as data generation, data curation, model training, and inference, often use data layouts that are best for their own I/O performance but ill-suited to their successor tasks. We propose solutions that employ alternative data structures and layouts chosen jointly for two tasks running consecutively in the workflow. Our experimental results show up to a 16.46x and 3.42x speedup in initialization time and I/O time, respectively, compared to previous approaches.
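The layout mismatch the abstract describes can be illustrated with a minimal sketch (not the paper's code, and using in-memory NumPy arrays rather than the paper's HDF5 files): a producer task naturally emits a row-major array, while the consumer task reads columns, so converting the physical layout once at the handoff replaces repeated strided reads with contiguous ones.

```python
import numpy as np

# Hypothetical example: two consecutive workflow tasks share an
# intermediate array. The producer writes in row-major (C) order,
# but the consumer accesses columns.
producer_out = np.arange(12, dtype=np.float32).reshape(3, 4)  # C order

# Re-layout once at the task boundary so the consumer's column
# accesses become contiguous in memory.
consumer_in = np.asfortranarray(producer_out)

# Same logical content, different physical layout.
assert np.array_equal(producer_out, consumer_in)
assert producer_out.flags["C_CONTIGUOUS"]
assert consumer_in.flags["F_CONTIGUOUS"]
# A column slice of the column-major copy is contiguous.
assert consumer_in[:, 1].flags["F_CONTIGUOUS"]
```

The same idea carries over to on-disk formats such as HDF5, where chunk shape plays the role of memory order: choosing a chunk layout that matches the downstream reader's access pattern, rather than the writer's, avoids reading unrelated data on every access.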
Keywords
Data Management,Parallel I/O,HDF5,High Performance Computing,Machine Learning,Scientific Workflows