DistForest: A Parallel Random Forest Training Framework Based on Supercomputer

2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)(2018)

Abstract
The random forest algorithm is an ensemble classifier based on the decision tree model and is widely applied in machine learning, data mining, and other fields. With the advent of the big data era, training a random forest has become very time-consuming. Most studies accelerate random forest training on small clusters or high-performance devices; few target high-performance supercomputers. In this paper, we propose DistForest, a parallel random forest training framework for supercomputers that can utilize multiple nodes to train random forests on large data sets concurrently. First, DistForest adopts a master-slave architecture in which the system selects some nodes as master nodes to distribute large tasks to the slave nodes. Second, DistForest exploits a multilevel parallel strategy that pushes small tasks into a local task queue rather than continuing to distribute them among other slaves. Third, DistForest can exploit the heterogeneous architecture of supercomputers to accelerate the training process. Finally, DistForest balances computing tasks between devices with different computing capabilities. Our performance results on Tianhe-2 show that our implementation achieves substantial performance improvements.
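The master-slave scheme with a local queue for small tasks, as described in the abstract, can be sketched roughly as follows. This is a minimal single-process illustration, not the DistForest implementation: all names (`SIZE_THRESHOLD`, `train_tree`, `master`) and the size threshold are illustrative assumptions, and worker nodes are stood in for by a thread pool.

```python
# Hypothetical sketch of the distribution idea from the abstract:
# large tree-training tasks are shipped to workers ("slave nodes"),
# small ones are kept in a local task queue on the master.
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
import random

SIZE_THRESHOLD = 500  # assumed cutoff: smaller tasks stay local

def train_tree(sample_indices):
    """Stand-in for training one decision tree on a bootstrap sample."""
    return len(sample_indices)  # placeholder "model"

def master(n_trees, n_samples, n_workers=4):
    local_queue = Queue()  # small tasks handled on the master itself
    results = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = []
        for _ in range(n_trees):
            # Bootstrap sample (with replacement) for one tree.
            sample = [random.randrange(n_samples) for _ in range(n_samples)]
            if len(sample) < SIZE_THRESHOLD:
                local_queue.put(sample)   # too small to be worth distributing
            else:
                futures.append(pool.submit(train_tree, sample))
        # Drain the local queue without involving the workers.
        while not local_queue.empty():
            results.append(train_tree(local_queue.get()))
        results.extend(f.result() for f in futures)
    return results

forest = master(n_trees=8, n_samples=1000)
print(len(forest))  # 8 trained "trees"
```

On an actual supercomputer the thread pool would be replaced by message passing (e.g. MPI) between nodes, and the threshold would depend on communication cost versus task size; the queue-versus-distribute decision is the part this sketch aims to show.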
Keywords
Parallel Random Forest, Supercomputers, Speedup, Framework