Data stream algorithms

Encyclopedia of Database Systems(2009)

Abstract
In recent years, there has been a dramatic growth of interest in developing algorithms for massive data sets. In particular, the data stream model has received a lot of attention. Many applications that deal with massive data, such as Internet traffic analysis and database mining, motivate this model. In the data stream model, the data is treated as a sequence, and the only feasible way of accessing it is sequentially. This is a more appropriate computational model than the classic random access machine (RAM) model when dealing with massive data sets for which random access is expensive or even impossible. Due to these constraints, many data stream algorithms are randomized and/or compute only an approximation of the exact answer. Designing such algorithms often involves trade-offs between time, space and accuracy. In this thesis we study several problems in this computational model. The first result is on estimating the frequency of each element in a data stream using a small amount of memory. Previous research on this topic has focused on algorithms that compute probabilistic results. Our study shows that an element's frequency can be estimated with guaranteed accuracy, and we give a near-optimal trade-off between space and accuracy in Chapter 3. Then we study the problem of quickly answering range mode and range median queries; the mode and the median are two of the most important statistics of a data set. We propose the first non-trivial solutions to the approximate versions of the problem, first on one-dimensional arrays (Chapter 4), then on matrices in higher-dimensional (d≥2) space (Chapter 5). Finally, we study one of the earliest data stream problems, namely, sorting large data sets stored on tapes with limited internal memory. Previous results leave a gap of a factor of 4 between the lower and upper bounds. We close this gap in Chapter 6. We also derive the first probabilistic lower bound for the problem.
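To make the idea of deterministic frequency estimation in small memory concrete, here is a minimal sketch of the classic Misra-Gries summary. It is an illustration only and is not claimed to be the algorithm of Chapter 3; the function name `misra_gries` and the parameter `k` are chosen here for the example. With at most k - 1 counters and one sequential pass over a stream of n items, every reported count c for an element with true frequency f satisfies f - n/k ≤ c ≤ f, with no randomness involved.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Single sequential pass that keeps at most k - 1 counters.

    Each returned count underestimates the true frequency by at most n / k,
    where n is the stream length. Elements absent from the result have an
    estimated count of 0.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            # Already tracked: count this occurrence.
            counters[item] += 1
        elif len(counters) < k - 1:
            # Free slot available: start tracking the item.
            counters[item] = 1
        else:
            # No free slot: decrement every counter and drop the ones
            # that reach zero. The incoming item is not stored.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    example = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
    print(misra_gries(example, k=3))
```

Increasing `k` tightens the error bound n/k at the cost of more counters, which is one simple instance of the space and accuracy trade-off the abstract describes.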
Keywords
data stream model,data stream,massive data set,accessing data,data set,data stream algorithm,earliest data stream problem,large data,massive data,appropriate computational model