Enabling useful provenance in scripting languages with a human-in-the-loop.

ACM Conference on Management of Data (2022)

Abstract
Most data scientists must build substantial data pipelines using scripting languages like Python and R. These pipelines are hard to get correct because of the large volume of data they process (and hence their long execution times) and because they are tested mainly by inspecting the quality of the output data. It is therefore crucial for developers to reason about data through each step in the pipeline, starting from the raw input; this information is akin to data provenance in a relational setting. Past efforts to capture data provenance for scripting languages have either required substantial manual modifications to the scripts or yielded information that is too inflexible for many debugging tasks. We instead propose a "human-in-the-loop" provenance generation model with three key improvements: (1) allowing humans to express the desired provenance through a provenance schema, (2) enabling one-time execution capture of scripts to produce traces that are later combined with different provenance schemata to yield useful provenance for different tasks, and (3) providing a modular rule-based recommendation component that helps users design provenance schemata through an interactive interface. We describe the concepts and the user experience with our system, explain the system components, and present preliminary experimental results.
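The idea of capturing a trace once and then combining it with different provenance schemata can be illustrated with a small sketch. The abstract does not describe the system's actual trace format or schema language, so everything below is a hypothetical stand-in: `TraceEvent`, `TRACE`, and `provenance` are illustrative names, and a "schema" is reduced to the set of pipeline steps a user wants reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One captured step (hypothetical format): inputs read, output produced."""
    step: str      # label of the pipeline step
    inputs: tuple  # ids of the rows that were read
    output: str    # id of the row that was produced


# Trace recorded once, during a single run of a toy three-step pipeline.
TRACE = [
    TraceEvent("load", ("raw:1",), "a1"),
    TraceEvent("load", ("raw:2",), "a2"),
    TraceEvent("clean", ("a1",), "b1"),
    TraceEvent("clean", ("a2",), "b2"),
    TraceEvent("aggregate", ("b1", "b2"), "c1"),
]


def provenance(trace, schema, target):
    """Walk the trace backwards from `target`, reporting only steps in `schema`.

    `schema` is a set of step labels (an assumption made for this sketch,
    not the paper's schema language).
    """
    producers = {event.output: event for event in trace}
    result, stack, seen = [], [target], set()
    while stack:
        row = stack.pop()
        if row in seen or row not in producers:
            continue  # already visited, or a raw input with no producer
        seen.add(row)
        event = producers[row]
        if event.step in schema:
            result.append((event.step, event.inputs, event.output))
        stack.extend(event.inputs)  # keep walking even through hidden steps
    return sorted(result)


# The same trace answers two different questions without re-running the pipeline.
coarse = provenance(TRACE, {"aggregate"}, "c1")
fine = provenance(TRACE, {"load", "clean", "aggregate"}, "c1")
```

Here `coarse` reports only the aggregation step, while `fine` reports every step back to the raw inputs; both are derived from the single recorded `TRACE`.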