Chunk2vec: A novel resemblance detection scheme based on Sentence-BERT for post-deduplication delta compression in network transmission

Chunzhi Wang, Keguan Wang, Min Li, Feifei Wei, Neal Xiong

IET Communications (2024)

Abstract
Delta compression, a complementary technique to data deduplication, has gained widespread attention in network storage systems. It eliminates redundant data between non-duplicate but similar chunks, which data deduplication alone cannot identify. Used together, data deduplication and delta compression can greatly reduce the network transmission overhead between servers and clients. Resemblance detection is the technique that identifies similar chunks for post-deduplication delta compression in network storage systems. Existing resemblance detection approaches rely on a fixed similarity threshold and therefore fail to detect similar chunks with arbitrary similarity, which can be suboptimal. In this paper, the authors propose Chunk2vec, a novel resemblance detection scheme for delta compression that combines deep learning with Approximate Nearest Neighbour Search to detect similar chunks within any given similarity range. Chunk2vec uses a deep neural network, Sentence-BERT, to extract an approximate feature vector for each chunk while preserving its similarity with other chunks, and then applies Approximate Nearest Neighbour Search over the chunks' fingerprint feature vectors to find matches within the given similarity range. Experimental results on five real-world datasets indicate that Chunk2vec improves the accuracy of resemblance detection for delta compression and achieves a higher compression ratio than the state-of-the-art resemblance detection technique. This approach can significantly improve the accuracy of resemblance detection for post-deduplication delta compression and greatly reduce the network transmission overhead between servers and clients.
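The pipeline described in the abstract (map each chunk to a feature vector, search for stored chunks within a similarity range, then delta-compress against the match) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation and rests on several labelled assumptions: byte n-gram counts stand in for the Sentence-BERT encoder, an exhaustive cosine scan stands in for Approximate Nearest Neighbour Search, and zlib's preset dictionary stands in for a delta encoder. All function names (`embed`, `find_similar`, `delta_size`, `pseudo_random`) and the demo data are hypothetical.

```python
import hashlib
import math
import zlib
from collections import Counter


def embed(chunk: bytes, n: int = 4) -> Counter:
    """Stand-in encoder (NOT Sentence-BERT): sparse vector of byte n-gram counts."""
    return Counter(chunk[i:i + n] for i in range(len(chunk) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_similar(query: bytes, index: dict, low: float, high: float) -> list:
    """Exhaustive scan (stand-in for ANN search): stored chunks whose
    similarity to `query` lies in [low, high], most similar first."""
    q = embed(query)
    scored = ((key, cosine(q, vec)) for key, vec in index.items())
    return sorted((hit for hit in scored if low <= hit[1] <= high),
                  key=lambda hit: hit[1], reverse=True)


def delta_size(target: bytes, base: bytes) -> int:
    """Size of `target` encoded against `base`, using zlib's preset
    dictionary as a simple stand-in for a delta encoder."""
    comp = zlib.compressobj(level=9, zdict=base)
    return len(comp.compress(target) + comp.flush())


def pseudo_random(seed: bytes, blocks: int = 128) -> bytes:
    """Deterministic, incompressible demo data (32 bytes per block)."""
    return b"".join(hashlib.sha256(seed + str(i).encode()).digest()
                    for i in range(blocks))


# Hypothetical demo chunks: two stored chunks and two incoming ones.
base = pseudo_random(b"a")
unrelated = pseudo_random(b"b")
similar = base[:2000] + b"EDITED" + base[2000:]   # small edit of `base`
half = base[:2048] + unrelated[:2048]             # half of each stored chunk

index = {"base": embed(base), "unrelated": embed(unrelated)}

print(find_similar(similar, index, 0.5, 1.0))     # near-duplicate range
print(find_similar(half, index, 0.3, 0.7))        # moderate-similarity range
print(delta_size(similar, base), len(zlib.compress(similar, 9)))
```

The `low` and `high` bounds are what make the query a range search rather than a single-threshold test: the same index can answer both a near-duplicate query and a moderate-similarity query. A real system would replace the exhaustive scan with an approximate index, since scanning every stored vector does not scale.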
Keywords
data deduplication, deep learning, delta compression, natural language processing, network transmission, resemblance detection