Programming Languages in Data Science - a Comparison from a Database Angle.

IEEE BigData (2021)

Abstract
In a typical data science project, the analyst uses many programming languages to explore and analyze big data coming from diverse sources. A major challenge is managing and pre-processing so much data, which may have inconsistent content, significant redundancy, diverse formats, and varying quality. Database systems research has tackled such problems for a long time, but mostly on relational databases. With this motivation, this paper compares, from a database perspective, the strengths and weaknesses of three languages popular today: Python, R and SQL. We discuss the entire analytic pipeline, from data integration, cleaning and pre-processing to model application and tuning. From a database systems perspective, we present a comprehensive survey of storage mechanisms, data processing algorithms, external algorithms, run-time memory management, consistency, optimizations and parallel processing. From a programming languages angle, we consider elegance, expressiveness, abstraction, composability, interactive behavior and automatic code optimization. We present a short experimental evaluation comparing the performance of the three languages on typical data exploration and pre-processing tasks. Our conclusion: there is no winner.
Keywords
programming languages,data science,database angle