A Case Study of Data Management Challenges Presented in Large-Scale Machine Learning Workflows

2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)(2023)

Abstract
Running scientific workflow applications on high-performance computing systems yields promising results in terms of accuracy and scalability. One example is particle track reconstruction research in high-energy physics, which consists of multiple machine-learning tasks. However, as modern HPC systems scale up, researchers spend more effort coordinating the individual workflow tasks because of their growing demands for computational power, large memory footprints, and data movement across various storage devices. These issues are further exacerbated when intermediate result data must be shared among different tasks, each of which is optimized to fulfill its own design goals, such as the shortest runtime or minimal memory footprint. In this paper, we investigate the data management challenges presented in scientific workflows. We observe that individual tasks, such as data generation, data curation, model training, and inference, often use data layouts that are best for their own I/O performance but ill-suited to their successor tasks. We propose solutions that employ alternative data structures and layouts chosen jointly for two tasks running consecutively in the workflow. Our experimental results show up to a 16.46x and 3.42x speedup in initialization time and I/O time, respectively, compared to previous approaches.
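The layout mismatch the abstract describes can be illustrated with a minimal sketch (not the paper's code, and using in-memory NumPy arrays rather than the paper's HDF5 files): a producer task naturally emits a row-major array, while the consumer task reads columns, so converting the physical layout once at the handoff replaces repeated strided reads with contiguous ones.

```python
import numpy as np

# Hypothetical example: two consecutive workflow tasks share an
# intermediate array. The producer writes in row-major (C) order,
# but the consumer accesses columns.
producer_out = np.arange(12, dtype=np.float32).reshape(3, 4)  # C order

# Re-layout once at the task boundary so the consumer's column
# accesses become contiguous in memory.
consumer_in = np.asfortranarray(producer_out)

# Same logical content, different physical layout.
assert np.array_equal(producer_out, consumer_in)
assert producer_out.flags["C_CONTIGUOUS"]
assert consumer_in.flags["F_CONTIGUOUS"]
# A column slice of the column-major copy is contiguous.
assert consumer_in[:, 1].flags["F_CONTIGUOUS"]
```

The same idea carries over to on-disk formats such as HDF5, where chunk shape plays the role of memory order: choosing a chunk layout that matches the downstream reader's access pattern, rather than the writer's, avoids reading unrelated data on every access.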
Keywords
Data Management,Parallel I/O,HDF5,High Performance Computing,Machine Learning,Scientific Workflows