Time and Cost Prediction Models for Language Classification Over a Large Corpus on Spark

SSCI (2020)

Abstract
This paper investigates the performance impact of varying five factors (input data size, number of nodes, cores, memory, and disks) on the Spark big data processing framework when applying a distributed implementation of Naive Bayes for text classification of a large corpus. Since performance depends on multiple factors and cloud hardware is priced by time slice, knowing the effect of each factor on time and cost beforehand becomes particularly important. The goal is to explain the functional relationship between factors and performance and to develop linear predictor models for time and cost. The approach is based on the solid statistical principles of the Design of Experiments (DoE), particularly the randomized two-level fractional factorial design with replications. The research involved 48 real clusters with different hardware arrangements, and linear models with appropriate metrics were employed for screening, ranking, and measuring each factor's impact. Our findings with DoE included prediction models and showed the small influence of cores, the neutrality of memory and disks with respect to total execution time, and the non-significant impact of data scale on cost. Conversely, data scale was the most relevant factor in degrading execution time, while the number of nodes was the most important in improving it. Finally, cost was impacted in a balanced way by the numbers of nodes and cores. The experiments consistently evidenced the usefulness of employing DoE to analyze factor influence on cluster performance.
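The screening approach described in the abstract (a two-level factorial design analyzed with a linear model) can be sketched as follows. This is an illustrative example, not the paper's code: the factor names follow the abstract, but the design uses a full rather than fractional factorial for simplicity, and the response values are synthetic, chosen to mimic the reported findings (scale degrades time, nodes improve it, cores matter little, memory and disks are neutral).

```python
# Illustrative DoE screening sketch: fit a linear model over coded factor
# levels and rank factors by absolute effect size. Synthetic data only.
import itertools
import numpy as np

factors = ["scale", "nodes", "cores", "memory", "disks"]

# Full 2^5 design matrix with coded levels -1/+1 (a fractional design, as in
# the paper, would use a carefully chosen subset of these 32 runs).
design = np.array(list(itertools.product([-1, 1], repeat=len(factors))),
                  dtype=float)

# Synthetic execution times: scale has a large positive (degrading) effect,
# nodes a large negative (improving) effect, cores a small one, and memory
# and disks none. A small noise term stands in for replication variability.
true_effects = np.array([30.0, -20.0, -5.0, 0.0, 0.0])
rng = np.random.default_rng(0)
y = 100.0 + design @ true_effects + rng.normal(0.0, 1.0, len(design))

# Fit y = b0 + sum_i(b_i * x_i) by ordinary least squares.
X = np.column_stack([np.ones(len(design)), design])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rank factors by the magnitude of their estimated effect.
ranking = sorted(zip(factors, coef[1:]), key=lambda t: -abs(t[1]))
for name, effect in ranking:
    print(f"{name}: {effect:+.1f}")
```

With coded ±1 levels, each coefficient is half the estimated change in the response when a factor moves from its low to its high level, which is what makes the ranking directly comparable across factors.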
Key words
Big Data, Design of Experiments, Distributed Machine Learning, Natural Language Processing, Spark