Pseudoreplication in genomic-scale data sets

Molecular Ecology Resources (2022)

Abstract
In genomic-scale data sets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (df') compared to the nominal degrees of freedom, df. This issue has been known for some time, but consequences have not been systematically quantified across the entire genome. Here, we measured pseudoreplication (quantified by the ratio df'/df) for a common metric of genetic differentiation (F_ST) and a common measure of linkage disequilibrium between pairs of loci (r^2). Based on data simulated using models (SLiM and msprime) that allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated df' and df'/df by measuring the rate of decline in the variance of mean F_ST and mean r^2 as more loci were used. For both indices, df' increases with N_e and genome size, as expected. However, even for large N_e and large genomes, df' for mean r^2 plateaus after a few thousand loci, and a variance components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for F_ST, but df'/df <= 0.01 can occur in data sets using tens of thousands of loci. Commonly used block-jackknife methods consistently overestimated var(F_ST), producing very conservative confidence intervals. Predicting df' based on our modelling results as a function of N_e, L, S, and genome size provides a robust way to quantify precision associated with genomic-scale data sets.
Keywords
degrees of freedom, F_ST, genome size, jackknife variance, linkage disequilibrium, N_e, simulations
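
Illustrative sketch. The abstract describes estimating pseudoreplication from how quickly the variance of a mean statistic declines as loci are added. The toy example below illustrates that general idea only; it is not the authors' SLiM/msprime pipeline or their estimator. It assumes a hypothetical model in which loci fall into blocks of fixed size, loci within a block share a component with correlation `rho`, and each per-locus value has unit variance. Under those assumptions, an effective number of independent loci can be taken as the single-locus variance divided by the variance of the mean, and the ratio of effective to nominal loci should be close to 1 / (1 + rho * (block_size - 1)). The function names and parameters are invented for this illustration.

```python
import random
import statistics


def simulate_mean(n_loci, block_size, rho, rng):
    """Return the mean of a per-locus statistic over n_loci loci (one replicate).

    Toy assumption: loci in the same block share a standard-normal component,
    so any two loci in a block have correlation rho; each locus has variance 1.
    """
    total = 0.0
    shared = 0.0
    for i in range(n_loci):
        if i % block_size == 0:
            # Start a new block: draw a fresh shared component.
            shared = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        total += (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * noise
    return total / n_loci


def effective_ratio(n_loci, block_size, rho, n_reps=2000, seed=1):
    """Estimate (effective loci) / (nominal loci) from replicate means.

    With independent loci, Var(mean) = 1 / n_loci, so the ratio is about 1.
    Correlation inflates Var(mean) and pushes the ratio below 1.
    """
    rng = random.Random(seed)
    means = [simulate_mean(n_loci, block_size, rho, rng) for _ in range(n_reps)]
    var_of_mean = statistics.variance(means)
    single_locus_var = 1.0  # fixed by construction in this toy model
    effective_loci = single_locus_var / var_of_mean
    return effective_loci / n_loci


if __name__ == "__main__":
    for rho in (0.0, 0.2, 0.5):
        ratio = effective_ratio(n_loci=100, block_size=10, rho=rho)
        expected = 1.0 / (1.0 + rho * (10 - 1))
        print(f"rho={rho:.1f}  simulated ratio={ratio:.3f}  toy-model expectation={expected:.3f}")
```

In this toy setting the ratio depends only on block size and within-block correlation. In real genomes the correlation structure depends on recombination, N_e, and sample size, which is why the abstract relies on pedigree-controlled simulations rather than a closed-form expression.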