Constructing Antidictionaries In Output-Sensitive Space
2019 Data Compression Conference (DCC), 2019
Abstract
A word $x$ that is absent from a word $y$ is called minimal if all its proper factors occur in $y$. Given a collection of $k$ words $y_1, y_2, \ldots, y_k$ over an alphabet $\Sigma$, we are asked to compute the set $M^{\ell}_{y_1\#\cdots\#y_k}$ of minimal absent words of length at most $\ell$ of the word $y = y_1\#y_2\#\cdots\#y_k$, where $\# \notin \Sigma$. In data compression, this corresponds to computing the antidictionary of $k$ documents. In bioinformatics, it corresponds to computing words that are absent from a genome of $k$ chromosomes. This computation generally requires $\Omega(n)$ space for $n = |y|$ using any of the many available $O(n)$-time algorithms, because an $\Omega(n)$-sized text index is constructed over $y$, which can be impractical for large $n$. We perform the same computation incrementally using output-sensitive space. This goal is reasonable when $\|M^{\ell}_{y_1\#\cdots\#y_N}\| = o(n)$ for all $N \in [1, k]$. For instance, in the human genome, $n \approx 3 \times 10^{9}$ but $\|M^{12}_{y_1\#\cdots\#y_k}\| \approx 10^{6}$. We consider a constant-sized alphabet for stating our results. We show that all $M^{\ell}_{y_1}, \ldots, M^{\ell}_{y_1\#\cdots\#y_k}$ can be computed in $O(kn + \sum_{N=1}^{k} \|M^{\ell}_{y_1\#\cdots\#y_N}\|)$ total time using $O(\mathrm{MAXIN} + \mathrm{MAXOUT})$ space, where $\mathrm{MAXIN}$ is the length of the longest word in $\{y_1, \ldots, y_k\}$ and $\mathrm{MAXOUT} = \max\{\|M^{\ell}_{y_1\#\cdots\#y_N}\| : N \in [1, k]\}$. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
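To make the definition concrete, the following is a minimal brute-force sketch of minimal absent words for a small collection. It is not the output-sensitive algorithm of the paper: it stores every factor of length at most the bound, so its space is far from output-sensitive. The function name `minimal_absent_words` and its parameters are hypothetical and chosen for illustration only. The sketch relies on the fact that a word is a minimal absent word exactly when it is absent while its longest proper prefix and longest proper suffix both occur.

```python
def minimal_absent_words(words, alphabet, max_len):
    """Naive sketch: minimal absent words of length <= max_len.

    `words` plays the role of y_1, ..., y_k. Factors are collected per word,
    so no factor spans two words (the role of the separator '#').
    """
    # Collect every factor of length <= max_len of every input word.
    # The empty word is a factor of any word.
    factors = {""}
    for y in words:
        for i in range(len(y)):
            for j in range(i + 1, min(i + max_len, len(y)) + 1):
                factors.add(y[i:j])

    maws = set()
    for u in factors:
        if len(u) >= max_len:
            continue  # extending u would exceed the length bound
        for c in alphabet:
            x = u + c  # the longest proper prefix u occurs by construction
            # x is minimal absent if it is absent and its longest proper
            # suffix x[1:] occurs (all other proper factors lie inside
            # the prefix u or the suffix x[1:]).
            if x not in factors and x[1:] in factors:
                maws.add(x)
    return maws


print(sorted(minimal_absent_words(["abaab"], "ab", 4)))
# prints ['aaa', 'aaba', 'bab', 'bb']
```

Calling the function on growing prefixes of the collection, for example `["abaab"]` and then `["abaab", "bb"]`, mimics the incremental sets for $N = 1, 2, \ldots, k$, although this sketch recomputes each set from scratch.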
Keywords
absent words, antidictionaries, string algorithms, output-sensitive algorithms, data compression