Distributed Incremental Machine Learning for Big Time Series Data.

Dhaval Salwala,Seshu Tirupathi,Brian Quanz,Wesley M. Gifford,Stuart Siegel,Vijay Ekambaram,Arindam Jati

Big Data（2022）

引用 1|浏览9

暂无评分

摘要

Today’s highly instrumented systems generate large amounts of time series data from many different domains. In order to create meaningful insights from these data, techniques are needed to handle the collection, processing, and analysis at scale. The high frequency and volume of data that is generated introduces several challenges including data transformation, managing concept drift, the operational cost of model re-training and tracking, and scaling hyperparameter optimization.Incremental machine learning can provide a viable solution to handle these kinds of data. Further, distributed machine learning can be an efficient technique to improve performance, increase accuracy, and scale to larger input sizes.In this paper, we introduce a framework that combines the computational capabilities of Apache Spark and the workflow parallelization of Ray for distributed incremental learning. We conduct an empirical analysis of our framework for time series forecasting using the Walmart M5 dataset. The system can perform a parameter search on streaming data with concept drift producing a robust pipeline that fits high-volume data effectively. The results are encouraging and substantiate system proficiency over traditional big data analysis approaches that exclusively use either offline or online training.

查看译文

关键词

Big Data,incremental learning,distributed learning,hyperparameter optimization

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要