What's in a (Data) Type? Meaningful Type Safety for Data Science.

Research Challenges in Information Science (RCIS), 2022

Abstract
Data science incorporates a variety of processes, concepts, techniques, and domains to transform data that represents real-world phenomena into meaningful insights and to inform decision-making. In practice, data science relies on simple datatypes such as strings and integers to represent complex real-world phenomena such as time and geospatial regions. Reducing semantically rich types to simplistic ones creates problems because it ignores common and significant relationships in data science, including time, mereology, and provenance. Current solutions to this problem, including documentation standards, provenance tracking, and knowledge model integration, are opaque, lack standardization, and require manual intervention to validate. We introduce the meaningful type safety framework (MeTS) to ensure meaningful and correct data science through semantically rich datatypes based on dependent types. Our solution encodes the assumptions and rules of common real-world concepts, such as time, geospatial regions, and populations, and automatically detects violations of these rules and assumptions. Additionally, our type system is provenance-integrated: the type environment is updated with every data operation. To illustrate the effectiveness of our system, we present a case study based on real-world datasets from Statistics Canada (StatCAN). We also include a proof-of-concept implementation of our system in the Idris programming language.
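To make the idea of a semantically rich datatype concrete, the following is a minimal sketch in Python. It is not the MeTS implementation and it does not use dependent types: the checks run at runtime rather than at type-checking time, and every name in it (`Measure`, `MeaningError`, the field names, the file names, and the numbers) is a hypothetical stand-in chosen for illustration. It only shows the general pattern of attaching meaning (concept, region, reference year) to a value, rejecting an operation that violates a rule, and extending a provenance record on every operation.

```python
from dataclasses import dataclass
from typing import Tuple


class MeaningError(TypeError):
    """Raised when an operation violates a rule attached to a semantic type."""


@dataclass(frozen=True)
class Measure:
    """A number tagged with what it represents (hypothetical fields)."""
    value: float
    concept: str                       # e.g. "population"
    region: str                        # e.g. "Region A"
    year: int                          # reference period of the value
    provenance: Tuple[str, ...] = ()   # operations that produced the value

    def __add__(self, other: "Measure") -> "Measure":
        # Rule (assumed): only values of the same concept can be summed.
        if self.concept != other.concept:
            raise MeaningError(f"cannot add {self.concept} to {other.concept}")
        # Rule (assumed): values must share the same reference year.
        if self.year != other.year:
            raise MeaningError(f"year mismatch: {self.year} vs {other.year}")
        # Rule (assumed): the same region must not be counted twice.
        if self.region == other.region:
            raise MeaningError(f"region {self.region} would be double counted")
        step = f"add({self.region}, {other.region})"
        return Measure(
            value=self.value + other.value,
            concept=self.concept,
            region=f"{self.region}+{other.region}",
            year=self.year,
            # The provenance record grows with every data operation.
            provenance=self.provenance + other.provenance + (step,),
        )


# Made-up values, not taken from any real dataset.
a = Measure(100.0, "population", "Region A", 2020, ("load(a.csv)",))
b = Measure(250.0, "population", "Region B", 2020, ("load(b.csv)",))
c = Measure(300.0, "population", "Region C", 2021, ("load(c.csv)",))

total = a + b  # allowed: same concept, same year, distinct regions
```

In this sketch, `a + c` raises `MeaningError` because the reference years differ, whereas plain floats would silently add. A dependently typed language such as Idris can reject this kind of mismatch before the program runs, which is the property the abstract attributes to MeTS.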
Keywords
Data science, Dependent types, Type safety, Data provenance, Meaningful types