SeqWho: Reliable, rapid determination of sequence file identity using k-mer frequencies

biorxiv(2021)

引用 0|浏览1
暂无评分
摘要
With the vast improvements in sequencing technologies and increased number of protocols, sequencing is finding more applications to answer complex biological problems. Thus, the amount of publicly available sequencing data has tremendously increased in repositories such as SRA, EGA, and ENCODE. With any large online database, there is a critical need to accurately document study metadata, such as the source protocol and organism. In some cases, this metadata may not be systematically verified by the hosting sites and may result in a negative influence on future studies. Here we present SeqWho, a program designed to heuristically assess the quality of sequencing files and reliably classify the organism and protocol type. This is done in an alignment-free algorithm that leverages a Random Forest classifier to learn from native biases in k -mer frequencies and repeat sequence identities between different sequencing technologies and species. Here, we show that our method can accurately and rapidly distinguish between human and mouse, nine different sequencing technologies, and both together, 98.32%, 97.86%, and 96.38% of the time in high confidence calls respectively. This demonstrates that SeqWho is a powerful method for reliably checking the identity of the sequencing files used in any pipeline and illustrates the program’s ability to leverage k -mer biases. ### Competing Interest Statement The authors have declared no competing interest.
更多
查看译文
关键词
sequence file identity,k-mer
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要