SAC: A System for Big Data Lineage Tracking

2019 IEEE 35th International Conference on Data Engineering (ICDE), 2019

Abstract
In the era of big data, a data processing flow contains various types of tasks. It is nontrivial to trace data as it moves from source to destination and to monitor the transformations and hops it undergoes along the way in an enterprise environment. Data lineage, or provenance, is therefore useful for learning how the data is transformed along the way, how its representation and parameters change, and how it splits or converges after each hop. However, existing systems offer limited support for such use cases in a distributed computing setup. To address this issue, we build Spark-Atlas-Connector (SAC for short), a new system that tracks data lineage on a distributed computation platform such as Spark. SAC tracks the processes involved in a data flow and their dependencies, and supports different data storage systems (e.g., HBase, HDFS, and Hive) and data processing paradigms (e.g., SQL, ETL, machine learning, and streaming). SAC provides a visual representation of data lineage for tracking data from its origin to its downstream uses, and it is deployed in a distributed production environment to demonstrate its efficiency and scalability.
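To make the idea of lineage concrete, the following is a minimal sketch of a lineage graph in which datasets are nodes and processes are labeled edges. It is an illustration only and is not SAC's or Apache Atlas's actual data model or API. The class `LineageGraph`, its methods, and all dataset and process names are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage graph: datasets are nodes, processes are labeled edges."""

    def __init__(self):
        # dataset -> list of (process, input dataset) that produced it
        self._upstream = defaultdict(list)
        # dataset -> list of (process, output dataset) derived from it
        self._downstream = defaultdict(list)

    def record(self, process, inputs, outputs):
        """Record that `process` read `inputs` and wrote `outputs`."""
        for src in inputs:
            for dst in outputs:
                self._upstream[dst].append((process, src))
                self._downstream[src].append((process, dst))

    def _walk(self, start, edges):
        # Breadth-first traversal that returns every dataset reachable
        # from `start`, in discovery order, excluding `start` itself.
        seen, order, queue = {start}, [], deque([start])
        while queue:
            node = queue.popleft()
            for _process, neighbor in edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    order.append(neighbor)
                    queue.append(neighbor)
        return order

    def origins(self, dataset):
        """All datasets that `dataset` was directly or indirectly derived from."""
        return self._walk(dataset, self._upstream)

    def downstream(self, dataset):
        """All datasets directly or indirectly derived from `dataset`."""
        return self._walk(dataset, self._downstream)


# Hypothetical two-step pipeline: an ETL job followed by model training.
graph = LineageGraph()
graph.record("etl_job", ["hdfs://raw/events"], ["hive://db.events_clean"])
graph.record(
    "train_model",
    ["hive://db.events_clean", "hbase://features"],
    ["hdfs://models/churn"],
)

model_origins = graph.origins("hdfs://models/churn")
raw_downstream = graph.downstream("hdfs://raw/events")
```

In this sketch, `model_origins` lists every dataset that feeds the model (the convergence of the Hive and HBase inputs), and `raw_downstream` lists everything derived from the raw events. A production system would also need to capture these events from the engine automatically and persist them in a metadata store, which this sketch does not attempt.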
Keywords
Sparks, Data processing, Metadata, Memory, Pipelines, Machine learning