Data Lake Organization

Fatemeh Nargesian, Ken Pu, Bahar Ghadiri-Bashardoost, Erkang Zhu, Renee J. Miller

IEEE Transactions on Knowledge and Data Engineering (2023)

Abstract
We consider the problem of building an organizational directory of data lakes to support effective user navigation. The organizational directory is defined as an acyclic graph whose nodes represent sets of attributes and whose edges indicate subset relationships between nodes. We construct a probabilistic model of user navigation behaviour, which also predicts the likelihood of users finding relevant tables in a data lake given an organization. We formulate the data lake organization problem as an optimization over the organizational structure that maximizes the expected likelihood of discovering tables by navigation. We propose an approximation algorithm and analyze its error bound. The effectiveness and efficiency of the algorithm are evaluated on both synthetic and real data lakes. Our experiments show that the algorithm constructs organizations that outperform existing organizations, including a hand-curated taxonomy, a linkage graph, and a common baseline organization. We have also conducted a formal user study, which shows that navigation can help users discover relevant tables that are not easily accessible by keyword search queries. This suggests that keyword search and navigation using an organization are complementary modalities for data discovery in data lakes.
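The structure described in the abstract can be illustrated with a small sketch. It is not the authors' model or algorithm: the abstract does not give the transition probabilities, so the Jaccard similarity, the softmax-style choice rule, the `gamma` parameter, the function names, and the attribute names below are all assumptions made for illustration. The sketch builds an acyclic graph whose nodes are attribute sets linked by subset relationships, then propagates the probability that a user starting at the root reaches each node while looking for a target attribute.

```python
import math
from typing import Dict, FrozenSet, List

Node = FrozenSet[str]  # a node is a set of attribute names


def build_edges(nodes: List[Node]) -> Dict[Node, List[Node]]:
    """Link each node to its maximal proper subsets (its direct children)."""
    children: Dict[Node, List[Node]] = {n: [] for n in nodes}
    for parent in nodes:
        subsets = [c for c in nodes if c < parent]
        for c in subsets:
            # Keep c only if no other subset lies strictly between c and parent.
            if not any(c < mid for mid in subsets):
                children[parent].append(c)
    return children


def jaccard(a: Node, b: Node) -> float:
    """Assumed similarity between a node and the user's target."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reach_probabilities(root: Node, target: Node,
                        children: Dict[Node, List[Node]],
                        gamma: float = 2.0) -> Dict[Node, float]:
    """Probability of reaching each node from the root (assumed choice rule)."""
    # Children are proper subsets, so sorting by size visits parents first.
    order = sorted(children, key=len, reverse=True)
    reach = {n: 0.0 for n in children}
    reach[root] = 1.0
    for node in order:
        kids = children[node]
        if not kids or reach[node] == 0.0:
            continue
        # Softmax-style choice: more similar children are more likely picked.
        weights = [math.exp(gamma * jaccard(k, target)) for k in kids]
        total = sum(weights)
        for k, w in zip(kids, weights):
            reach[k] += reach[node] * w / total
    return reach


# Hypothetical attribute names, chosen only for the example.
root = frozenset({"city", "country", "price", "currency"})
nodes = [
    root,
    frozenset({"city", "country"}),
    frozenset({"price", "currency"}),
    frozenset({"city"}),
    frozenset({"country"}),
    frozenset({"price"}),
    frozenset({"currency"}),
]
children = build_edges(nodes)
target = frozenset({"city"})
reach = reach_probabilities(root, target, children)
print(f"P(reach {set(target)}) = {reach[target]:.3f}")
```

Under these assumptions, an optimizer would compare candidate organizations by averaging such reach probabilities over all targets of interest and keeping the structure with the highest average.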
Keywords
Data lake, dataset discovery, taxonomy, structure learning