Design and Implementation of an Analysis Pipeline for Heterogeneous Data
arXiv (2024)
Abstract
Managing and preparing complex data for deep learning, a prevalent approach
in large-scale data science, can be challenging. Data transfer for model
training also presents difficulties, impacting scientific fields like genomics,
climate modeling, and astronomy. A large-scale solution like Google Pathways
with a distributed execution environment for deep learning models exists but is
proprietary. Integrating existing open-source, scalable runtime tools and data
frameworks on high-performance computing (HPC) platforms is crucial to address
these challenges. Our objective is to establish a smooth and unified method of
combining data engineering and deep learning frameworks with diverse execution
capabilities that can be deployed on various high-performance computing
platforms, including cloud and supercomputers. We aim to support heterogeneous
systems with accelerators, where Cylon and other data engineering and deep
learning frameworks can utilize heterogeneous execution. To achieve this, we
propose Radical-Cylon, a heterogeneous runtime system with a parallel and
distributed data framework to execute Cylon as a task of Radical Pilot. We
thoroughly explain Radical-Cylon's design and development and the execution
process of Cylon tasks using Radical Pilot. This approach enables the use of
heterogeneous MPI-communicators across multiple nodes. Radical-Cylon achieves
better performance than Bare-Metal Cylon with minimal and constant overhead.
Radical-Cylon achieves (4~15)% faster execution time than batch execution while
performing similar join and sort operations with 35 million and 3.5 billion
rows with the same resources. The approach aims to excel in both scientific and
engineering research HPC systems while demonstrating robust performance on
cloud infrastructures. This dual capability fosters collaboration and
innovation within the open-source scientific research community.