Constructing Antidictionaries In Output-Sensitive Space

2019 Data Compression Conference (DCC), 2019

Abstract
A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, y_2, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $M^{\ell}_{y_1\#\cdots\#y_k}$ of minimal absent words of length at most $\ell$ of the word $y = y_1\#y_2\#\cdots\#y_k$, where $\# \notin \Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. This computation generally requires $\Omega(n)$ space for $n = |y|$ using any of the many available $O(n)$-time algorithms, because they construct an $\Omega(n)$-sized text index over $y$, which can be impractical for large $n$. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\|M^{\ell}_{y_1\#\cdots\#y_N}\| = o(n)$ for all $N \in [1, k]$. For instance, in the human genome, $n \approx 3 \times 10^{9}$ but $\|M^{12}_{y_1\#\cdots\#y_k}\| \approx 10^{6}$. We consider a constant-sized alphabet for stating our results. We show that all $M^{\ell}_{y_1}, \ldots, M^{\ell}_{y_1\#\cdots\#y_k}$ can be computed in $O\left(kn + \sum_{N=1}^{k} \|M^{\ell}_{y_1\#\cdots\#y_N}\|\right)$ total time using $O(\mathrm{MAXIN} + \mathrm{MAXOUT})$ space, where $\mathrm{MAXIN}$ is the length of the longest word in $\{y_1, \ldots, y_k\}$ and $\mathrm{MAXOUT} = \max\{\|M^{\ell}_{y_1\#\cdots\#y_N}\| : N \in [1, k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
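To make the definition concrete, the following is a minimal brute-force sketch in Python that enumerates the minimal absent words of length at most a given bound for a collection of words, and then for every prefix of that collection. It is not the algorithm of the paper: it stores every short factor explicitly, so it does not achieve output-sensitive space. The function names (`factors_up_to`, `minimal_absent_words`, `incremental_maws`) are hypothetical and chosen only for illustration. It relies on the fact that a word has all its proper factors present exactly when its longest proper prefix and longest proper suffix are both present, and on the separator # never appearing in a candidate word, so the factors of interest are those of the individual input words.

```python
def factors_up_to(words, max_len):
    """Collect every factor of length <= max_len of each word, plus the empty word."""
    facs = {""}
    for w in words:
        for i in range(len(w)):
            for j in range(i + 1, min(len(w), i + max_len) + 1):
                facs.add(w[i:j])
    return facs


def minimal_absent_words(words, alphabet, max_len):
    """Brute-force minimal absent words of length <= max_len.

    A candidate x = p + c is minimal absent when x does not occur,
    its prefix p occurs, and its suffix x[1:] occurs.
    """
    facs = factors_up_to(words, max_len)
    maws = set()
    for p in facs:
        if len(p) >= max_len:
            continue  # extending p would exceed the length bound
        for c in alphabet:
            x = p + c
            if x not in facs and x[1:] in facs:
                maws.add(x)
    return maws


def incremental_maws(words, alphabet, max_len):
    """Return the sets for the prefixes y1, y1#y2, ..., y1#...#yk (illustrative only)."""
    return [
        minimal_absent_words(words[:n], alphabet, max_len)
        for n in range(1, len(words) + 1)
    ]


if __name__ == "__main__":
    print(sorted(minimal_absent_words(["abaab"], "ab", 5)))
    print(incremental_maws(["ab", "ba"], "ab", 3))
```

For the single word abaab over the alphabet {a, b}, this sketch yields the four words aaa, aaba, bab and bb. The space used grows with the number of distinct short factors of the input, which is the cost the paper's incremental approach is designed to avoid.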
Keywords
absent words, antidictionaries, string algorithms, output-sensitive algorithms, data compression