Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space
ICLR 2024(2023)
Abstract
Recent advances in tabular data generation have greatly enhanced synthetic
data quality. However, extending diffusion models to tabular data is
challenging because tabular data mixes data types whose distributions vary
widely. This paper introduces Tabsyn, a methodology that
synthesizes tabular data by leveraging a diffusion model within the latent
space of a variational autoencoder (VAE). The key advantages of the proposed
Tabsyn include (1) Generality: the ability to handle a broad spectrum of data
types by converting them into a single unified space and explicitly capturing
inter-column relations; (2) Quality: optimizing the distribution of latent
embeddings to enhance the subsequent training of diffusion models, which helps
generate high-quality synthetic data; (3) Speed: far fewer reverse
steps and faster synthesis than existing diffusion-based methods.
Extensive experiments on six datasets with five metrics demonstrate that Tabsyn
outperforms existing methods. Specifically, it reduces the error rates by 86%
and 67% for column-wise distribution and pair-wise column correlation
estimations compared with the most competitive baselines.
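
The idea of converting mixed data types into a single unified space and then running diffusion there can be illustrated with a rough sketch. This is not the paper's implementation: the learned VAE encoder is replaced here by a fixed stand-in (z-scoring numeric columns and one-hot encoding a categorical column), and only the forward Gaussian perturbation of a diffusion process is shown, not score-network training or reverse sampling. The names `encode_rows` and `add_noise`, the toy data, and the noise level are all hypothetical.

```python
import numpy as np

def encode_rows(numeric, categorical, n_categories):
    """Map mixed-type columns into one continuous matrix.

    numeric: (n, d_num) float array; categorical: (n,) int array of category ids.
    Assumption: a fixed encoding stands in for the learned VAE encoder.
    """
    # Standardize each numeric column to zero mean and unit variance.
    num = (numeric - numeric.mean(axis=0)) / (numeric.std(axis=0) + 1e-8)
    # One-hot encode the categorical column.
    one_hot = np.eye(n_categories)[categorical]
    return np.concatenate([num, one_hot], axis=1)

def add_noise(z0, sigma, rng):
    """Forward perturbation z_t = z_0 + sigma * eps, with eps ~ N(0, I)."""
    return z0 + sigma * rng.standard_normal(z0.shape)

rng = np.random.default_rng(0)
numeric = np.array([[25.0, 50_000.0], [40.0, 82_000.0], [31.0, 61_000.0]])  # toy data
categorical = np.array([0, 2, 1])  # toy category ids
z0 = encode_rows(numeric, categorical, n_categories=3)
zt = add_noise(z0, sigma=0.5, rng=rng)
print(z0.shape, zt.shape)
```

In a full pipeline, a denoising network would be trained on pairs like `(zt, z0)`, and a decoder would map sampled latents back to numeric and categorical columns; both are omitted here.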
Keywords
Tabular data, tabular generation, diffusion models