DGCF: A Distributed Greedy Clustering Framework for Large-Scale Genomic Sequences

2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2019

Abstract
Clustering is a fundamental but time-consuming operation in biological sequence analysis. New sequencing technologies such as next-generation sequencing (NGS) and third-generation sequencing (3GS) have dramatically increased both dataset sizes and the length of individual reads. Existing tools, however, do not scale to large datasets or to long sequences. A feasible solution is to use parallel and distributed systems, but deploying such systems efficiently requires a high degree of parallelism in the software implementation as well as algorithmic optimizations. In this paper, we propose DGCF, a Distributed Greedy Clustering Framework capable of handling large-scale datasets and long sequences. Our framework adopts a greedy clustering strategy that overlaps communication with computation across many distributed computing nodes. We also design and implement a sparse suffix array (SSA)-based alignment algorithm that supports long sequences. Experiments show that our framework achieves near-linear speedups on a distributed-memory cluster.
Keywords
greedy clustering, sparse suffix array, sequence analysis, parallel computing
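
To make the term "greedy clustering" concrete, the following is a minimal single-node sketch of the general greedy incremental idea: sequences are visited longest first, and each one either joins the first existing representative it is similar enough to or becomes a new representative. This is not DGCF's implementation. The function names, the `threshold` and `k` parameters, and the k-mer Jaccard similarity (used here in place of the SSA-based alignment the abstract describes) are all assumptions made for illustration, and the sketch omits distribution and communication/computation overlap entirely.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (assumed stand-in for an alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def greedy_cluster(seqs, threshold=0.5, k=3):
    """Greedily assign each sequence to the first matching representative."""
    # Visit longer sequences first so they tend to become representatives.
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    reps = []      # indices of cluster representatives, in creation order
    clusters = {}  # representative index -> list of member indices
    for i in order:
        for r in reps:
            if similarity(seqs[i], seqs[r], k) >= threshold:
                clusters[r].append(i)
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append(i)
            clusters[i] = [i]
    return clusters


seqs = ["ACGTACGTAC", "ACGTACGTA", "TTTTGGGGCC", "TTTTGGGGC"]
result = greedy_cluster(seqs)
```

In this sketch each sequence is compared only against representatives, never against every other sequence. In general, that property is what makes greedy strategies attractive for large inputs, and it is also why the per-comparison cost (alignment, in a real tool) dominates the running time.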