CLIMBER++: Pivot-Based Approximate Similarity Search over Big Data Series
CoRR(2024)
Abstract
The generation and collection of big data series are becoming an integral
part of many emerging applications in sciences, IoT, finance, and web
applications among several others. The terabyte-scale of data series has
motivated recent efforts to design fully distributed techniques for supporting
operations such as approximate kNN similarity search, which is a building block
operation in most analytics services on data series. Unfortunately, these
techniques are heavily geared towards achieving scalability at the cost of
sacrificing the results' accuracy. State-of-the-art systems report accuracy
below 10% [...] applications. In this paper, we investigate the root problems in
these existing techniques that limit their ability to achieve a better trade-off between
scalability and accuracy. Then, we propose a framework, called CLIMBER, that
encompasses a novel feature extraction mechanism, indexing scheme, and query
processing algorithms for supporting approximate similarity search in big data
series. For CLIMBER, we propose a new loss-resistant dual representation
composed of rank-sensitive and ranking-insensitive signatures capturing data
series objects. Based on this representation, we devise a distributed two-level
index structure supported by an efficient data partitioning scheme. Our
similarity metrics tailored for this dual representation enable meaningful
comparison and distance evaluation between the rank-sensitive and
ranking-insensitive signatures. Finally, we propose two efficient query
processing algorithms, CLIMBER-kNN and CLIMBER-kNN-Adaptive, for answering
approximate kNN similarity queries. Our experimental study on real-world and
benchmark datasets demonstrates that CLIMBER, unlike existing techniques,
features results' accuracy above 80% [...] terabytes of data.