Enhancing public research on citizen data: An empirical investigation of data synthesis using Statistics New Zealand's Integrated Data Infrastructure

Information Processing & Management (2024)

Abstract
The Integrated Data Infrastructure (IDI) in New Zealand is a critical asset that integrates citizen data from various public and private organizations for population-level analyses. However, access restrictions within the IDI environment present challenges for fully utilizing its potential. This study examines synthetic data as a potential solution, offering a comprehensive framework for generating customizable and easily implementable synthetic data. The evaluation of multiple data synthesis algorithms considers statistical similarity, machine learning utility, and privacy concerns. The findings reveal that distance-based algorithms, like SMOTE, strike a balance between accuracy and computational cost, making them suitable for IDI. The study also identifies the need for a clear release guide for micro-level synthetic data and proposes exploring a fully automatic data evaluation pipeline in future research. Additionally, the study highlights opportunities enabled by synthetic data, such as familiarization with administrative datasets, reproducibility of studies, pilot analyses, and enhanced cross-domain collaboration. Overall, the proposed framework and findings offer valuable insights and guidance for synthetic data projects within the IDI, advancing synthetic data privacy research and facilitating reproducibility, collaboration, and data sharing in the IDI ecosystem.
Keywords
Synthetic data, Privacy, IDI, Data science, Machine learning
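
The abstract names SMOTE as an example of a distance-based synthesis algorithm. As a rough illustration of that idea only, the sketch below generates synthetic numeric rows by interpolating between a randomly chosen record and one of its nearest neighbours. It is not the paper's framework or pipeline: the function name `smote_like_synthesize`, the parameters `k` and `seed`, and the toy rows are all assumptions made for this example, and no IDI data is involved.

```python
import math
import random


def smote_like_synthesize(records, n_samples, k=3, seed=0):
    """Create synthetic rows by SMOTE-style interpolation (illustrative sketch).

    Each synthetic row lies on the line segment between a randomly chosen
    real row and one of its k nearest neighbours (Euclidean distance).
    """
    if len(records) < 2:
        raise ValueError("need at least two records to interpolate")
    rng = random.Random(seed)
    k = min(k, len(records) - 1)
    synthetic = []
    for _ in range(n_samples):
        i = rng.randrange(len(records))
        base = records[i]
        # Rank every other row by its distance to the base row.
        others = [row for j, row in enumerate(records) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        # Pick a random point between the base row and the neighbour.
        u = rng.random()
        synthetic.append([b + u * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy rows (age, income); assumed values, not drawn from the IDI.
toy = [[25.0, 40000.0], [31.0, 52000.0], [47.0, 61000.0], [52.0, 58000.0]]
synthetic_rows = smote_like_synthesize(toy, n_samples=5)
```

Two caveats apply to this sketch. The columns are left unscaled, so income dominates the distance calculation; a practical version would standardize the features first. Interpolated rows also stay close to real records, which is one reason the study evaluates privacy alongside statistical similarity and machine learning utility.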