High-Level Data Abstraction and Elastic Data Caching for Data-Intensive AI Applications on Cloud-Native Platforms

IEEE Transactions on Parallel and Distributed Systems (2023)

Abstract
Nowadays, it is prevalent to train deep learning (DL) models on cloud-native platforms, which actively leverage containerization and orchestration technologies for high elasticity, low and flexible operation cost, and many other benefits. However, cloud-native training also faces new challenges, and our work focuses on those related to I/O throughput: complex data access, I/O provisioning that fails to match dynamic I/O requirements, and inefficient I/O resource scheduling across jobs. We propose Fluid, a cloud-native platform that provides DL training jobs with a high-level data abstraction, called Fluid Dataset, to access training data from heterogeneous sources with elastic data acceleration. In addition, Fluid includes an on-the-fly cache system autoscaler that tracks the online training speed and adaptively increases the number of cache replicas to alleviate I/O bottlenecks. To improve the overall performance of multiple DL jobs, Fluid co-orchestrates data caches and DL jobs by arranging job scheduling in an appropriate order, and can also place a data cache and its DL jobs on the same node to exploit cache affinity. Experimental results show significant performance improvements for individual DL jobs running with dynamic computing resources under Fluid. For scheduling multiple DL jobs over the same datasets, Fluid achieves around 2x speedup when integrated with existing widely used and cutting-edge scheduling solutions, thanks to the appropriate job scheduling order. The cache affinity scheduling policy also improves job execution performance significantly. Fluid is now an open-source project hosted by the Cloud Native Computing Foundation (CNCF) with many production adopters.
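As a rough illustration of the Fluid Dataset abstraction described above: in the open-source Fluid project, a dataset and its cache runtime are declared as Kubernetes custom resources. The following is a minimal sketch; the bucket path, names, and replica count are hypothetical, and field details may differ across Fluid versions.

```yaml
# Declares a Fluid Dataset backed by a (hypothetical) object-storage bucket.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: imagenet          # hypothetical dataset name
spec:
  mounts:
    - mountPoint: oss://my-bucket/imagenet   # hypothetical remote data source
      name: imagenet
---
# Binds an elastic cache runtime to the Dataset; the autoscaler can
# adjust the number of cache worker replicas to match training I/O demand.
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: imagenet          # must match the Dataset name
spec:
  replicas: 2             # initial cache replicas (illustrative)
  tieredstore:
    levels:
      - mediumtype: MEM
        quota: 8Gi        # per-worker memory cache quota (illustrative)
```

Training jobs then mount the Dataset as an ordinary persistent volume, so the same job spec works regardless of whether the data comes from object storage, HDFS, or another source.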
Key words
Training, Processor scheduling, Fluids, Data models, Graphics processing units, Containers, Job shop scheduling, Cloud native, dataset abstraction, elastic data cache, job scheduling