Signature Informed Sampling for Transcriptomic Data

bioRxiv (Cold Spring Harbor Laboratory) (2023)

Abstract
Working with transcriptomic data is challenging in deep learning applications because of its high dimensionality and the small number of patients typically available. Deep learning models tend to overfit such data and generalize poorly to out-of-distribution samples and new cohorts. Data augmentation strategies help alleviate this problem by introducing synthetic data points and acting as regularisers. However, existing approaches are either computationally intensive or require parametric estimates. We introduce a new solution to an old problem: a simple, non-parametric data augmentation approach inspired by the phenomenon of chromosomal crossover. Based on the assumption that there exist non-overlapping gene signatures describing each phenotype of interest, we demonstrate how new synthetic data points can be generated by sampling gene signatures from different patients under certain phenotypic constraints. As a case study, we apply our method to transcriptomic data of colorectal cancer. Through discriminative and generative experiments on two different datasets, we show that our method improves patient stratification by generating samples that mirror biological variability, and that it improves the models' robustness to overfitting and distribution shift. Our approach requires little to no computation and outperforms, or at the very least matches, established augmentation methods.
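The sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an expression matrix of shape (patients, genes), one phenotype label per patient, and gene signatures given as non-overlapping lists of column indices that together cover every gene. The function name `signature_crossover`, its parameters, and the specific constraint used here (every donor shares the phenotype of the synthetic sample) are hypothetical choices made for illustration.

```python
import numpy as np


def signature_crossover(X, y, signatures, n_new, rng=None):
    """Build synthetic samples by mixing signature blocks across patients.

    Assumption (not taken from the paper): each synthetic sample is given a
    phenotype label, and every signature block is copied from a randomly
    chosen real patient that has that same label.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # The signatures must partition the gene axis: no overlaps, no gaps.
    covered = np.sort(np.concatenate([np.asarray(c) for c in signatures]))
    if not np.array_equal(covered, np.arange(X.shape[1])):
        raise ValueError("signatures must be non-overlapping and cover all genes")

    # Draw a phenotype label for each synthetic sample.
    labels = rng.choice(np.unique(y), size=n_new)
    X_new = np.empty((n_new, X.shape[1]))

    for i, label in enumerate(labels):
        pool = np.flatnonzero(y == label)  # patients with this phenotype
        for cols in signatures:
            donor = rng.choice(pool)       # independent donor per signature
            X_new[i, cols] = X[donor, cols]
    return X_new, labels
```

Because each block is copied unchanged from a real patient, the sketch needs no parameter estimation and only index lookups, which is in line with the low computational cost claimed in the abstract. How signatures are defined and which phenotypic constraints apply are left to the paper itself.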
Keywords
signature, data