Interpretable and predictive models to harness the life science data revolution

Joshua P. Jahner, C. Alex Buerkle, Dustin G. Gannon, Eliza M. Grames, S. Eryn McFarlane, Andrew Siefert, Katherine L. Bell, Victoria L. DeLeo, Matthew L. Forister, Joshua G. Harrison, Daniel C. Laughlin, Amy C. Patterson, Breanna F. Powers, Chhaya M. Werner, Isabella A. Oleksy

bioRxiv (2024)

Abstract
The proliferation of high-dimensional biological data is kindling hope that life scientists will be able to fit statistical and machine learning models that are both highly predictive and interpretable. However, large biological data sets commonly carry an inherent trade-off: in-sample prediction improves as additional predictors are included in the model, but this can come at the cost of poor predictive accuracy and limited generalizability for future or unsampled observations (out-of-sample prediction). To confront this problem of overfitting, sparse models zero in on the causal predictors by correctly placing low weight on unimportant variables. We compared nine methods to quantify their performance in variable selection and prediction using simulated data with different sample sizes, numbers of predictors, and strengths of effects. Overfitting was typical for many methods and simulation scenarios. Despite this, in-sample and out-of-sample prediction converged on the true predictive target for simulations with more observations, larger causal effects, and fewer predictors. Accurate variable selection to support process-based understanding will be unattainable for many realistic sampling schemes. We use our analyses to characterize data attributes in which statistical learning is possible, and illustrate how some sparse methods can achieve predictive accuracy while mitigating and learning the extent of overfitting.

Competing Interest Statement: The authors have declared no competing interest.
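To make the in-sample versus out-of-sample trade-off concrete, below is a minimal sketch, not the authors' code or their nine compared methods, of one sparse approach in this family: simulate data in which only a few of many predictors have causal effects, fit a cross-validated lasso, and compare in-sample with out-of-sample fit. The settings (`n_obs`, `n_pred`, `n_causal`, `effect`) are assumed values chosen only for illustration.

```python
# Illustrative sketch (assumed settings, not the paper's simulations):
# a few causal predictors among many noise predictors, fit with a
# cross-validated lasso, then compare in-sample and out-of-sample R^2.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_obs, n_pred, n_causal, effect = 200, 500, 5, 0.5  # assumed scenario

X = rng.normal(size=(n_obs, n_pred))
beta = np.zeros(n_pred)
beta[:n_causal] = effect                 # only the first few predictors matter
y = X @ beta + rng.normal(size=n_obs)    # outcome = causal signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

fit = LassoCV(cv=5).fit(X_tr, y_tr)      # sparse model with CV-chosen penalty
selected = np.flatnonzero(fit.coef_)     # predictors given nonzero weight

print("predictors retained:", selected)
print("in-sample R^2:      ", round(fit.score(X_tr, y_tr), 2))
print("out-of-sample R^2:  ", round(fit.score(X_te, y_te), 2))
```

With few observations relative to predictors, the in-sample R^2 will typically exceed the out-of-sample R^2 and some noise predictors may be retained; as sample size or effect strength grows, the two measures converge, mirroring the pattern the abstract describes.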