GraphAr: An Efficient Storage Scheme for Graph Data in Data Lakes
arXiv (2023)
Abstract
Data lakes, increasingly adopted for their ability to store and analyze
diverse types of data, commonly use columnar storage formats like Parquet and
ORC for handling relational tables. However, these traditional setups fall
short when it comes to efficiently managing graph data, particularly data
conforming to the Labeled Property Graph (LPG) model. To address this gap, this
paper introduces GraphAr, a specialized storage scheme designed to enhance
existing data lakes for efficient graph data management. Leveraging the
strengths of Parquet, GraphAr captures LPG semantics precisely and facilitates
graph-specific operations such as neighbor retrieval and label filtering.
Through innovative data organization, encoding, and decoding techniques,
GraphAr dramatically improves performance. Our evaluations reveal that GraphAr
outperforms conventional Parquet and Acero-based methods, achieving an average
speedup of $3283\times$ for neighbor retrieval, $6.0\times$ for label
filtering, and $29.5\times$ for end-to-end workloads. These findings highlight
GraphAr's potential to extend the utility of data lakes by enabling efficient
graph data management.
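
To make the neighbor-retrieval operation concrete, here is a minimal sketch in plain Python. The abstract does not describe GraphAr's actual layout, so this assumes a generic CSR-style organization: edges sorted by source vertex plus an offset array. The edge data and the names `build_offsets`, `neighbors_scan`, and `neighbors_indexed` are hypothetical illustrations, not GraphAr's API.

```python
# Hypothetical toy edge list of (src, dst) pairs; not data from the paper.
edges = [(2, 0), (0, 1), (1, 2), (0, 2), (2, 3)]


def build_offsets(sorted_edges, num_vertices):
    """Return offsets so that the edges of vertex v occupy rows
    offsets[v]:offsets[v + 1] of the source-sorted edge table."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in sorted_edges:
        offsets[src + 1] += 1          # count out-edges per source vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]   # prefix sum turns counts into row offsets
    return offsets


def neighbors_scan(edge_list, v):
    """Baseline: filter the whole edge table, as a plain columnar scan would."""
    return [dst for src, dst in edge_list if src == v]


def neighbors_indexed(sorted_edges, offsets, v):
    """Assumed CSR-style lookup: read only the contiguous rows for vertex v."""
    return [dst for _, dst in sorted_edges[offsets[v]:offsets[v + 1]]]


sorted_edges = sorted(edges)           # sort by source, then destination
offsets = build_offsets(sorted_edges, num_vertices=4)

print(neighbors_scan(edges, 2))
print(neighbors_indexed(sorted_edges, offsets, 2))
```

Both functions return the same neighbors. The indexed version touches only the rows belonging to the queried vertex, whereas the scan reads every edge, which is the kind of gap a graph-aware layout aims to close.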