ForestZip: An Effective Parallel Parser for Log Compression.

Yuezhou Zhou, Yuxin Su

Guangdong-Hong Kong-Macao Greater Bay Area Artificial Intelligence and Big Data Forum (2023)

Abstract
Cloud services now generate large volumes of log streams, and storing them consumes substantial disk space at high cost. Traditional compression tools and algorithms work well for small-scale text processing but are poorly suited to the large-scale log data generated by production systems. Existing log-oriented compression algorithms compress data by extracting invariant log structures, relying on log templates obtained from log parsing. However, existing log parsing methods are not adaptive or versatile enough to ensure high accuracy on all types of datasets; achieving optimal performance requires manually designed regular expressions or fine-tuned hyperparameters. We propose a log parsing and compression method applicable to diverse log streams, in which each log entry can be independently extracted and subsequently compressed without domain knowledge or parameter tuning. Specifically, we construct a Prefix-Forest to represent the structure of log messages and minimize the impact of noise in log files. Prefix-Forest divides the logs into multiple partitions, parses each partition independently, and generates a prefix tree for each partition. The resulting templates are used by ForestZip, which separates logs into templates and parameters using a prefix forest based on template matching to achieve more efficient compression. We evaluate Prefix-Forest and ForestZip on representative, widely used log datasets as well as on log datasets that have not been explored in other papers. ForestZip's compression ratio is 1.23 to 2.14 times higher than Logzip's and 1.89 to 8.58 times higher than gzip's. Its compression speed is 1.51 to 8.4 times faster than Logzip's. Furthermore, both Prefix-Forest and ForestZip are designed for high parallelization and incur only negligible overhead.
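
The pipeline the abstract describes (partition the logs, build one prefix tree per partition, then split each entry into a template and its parameters before compression) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the partitioning rule, the variable-detection rule, or the storage layout, so the sketch assumes partitioning by token count, treats any token containing a digit as a parameter, and uses a simple text layout compressed with `zlib`. Every function name and the `<*>` placeholder are hypothetical.

```python
import zlib
from collections import defaultdict

WILDCARD = "<*>"  # hypothetical placeholder for a parameter slot


def is_variable(token):
    # Assumed heuristic (not from the paper): digits mark a parameter.
    return any(ch.isdigit() for ch in token)


def partition(logs):
    # Assumed partitioning rule: group entries by their token count.
    parts = defaultdict(list)
    for line in logs:
        tokens = line.split()
        parts[len(tokens)].append(tokens)
    return parts


def build_prefix_tree(token_lists):
    # One nested-dict prefix tree per partition; parameters collapse to WILDCARD.
    root = {}
    for tokens in token_lists:
        node = root
        for tok in tokens:
            node = node.setdefault(WILDCARD if is_variable(tok) else tok, {})
    return root


def templates_of(tree, prefix=()):
    # Each root-to-leaf path of a prefix tree is one template.
    if not tree:
        yield " ".join(prefix)
        return
    for key, child in tree.items():
        yield from templates_of(child, prefix + (key,))


def split_line(line):
    # Separate one entry into its template and its parameter values.
    tokens = line.split()
    template = " ".join(WILDCARD if is_variable(t) else t for t in tokens)
    return template, [t for t in tokens if is_variable(t)]


def compress(logs):
    # Each partition's tree is built independently of the others.
    forest = {n: build_prefix_tree(tl) for n, tl in partition(logs).items()}
    templates = sorted(t for tree in forest.values() for t in templates_of(tree))
    ids = {t: i for i, t in enumerate(templates)}
    rows = []
    for line in logs:
        template, params = split_line(line)
        rows.append(" ".join([str(ids[template])] + params))
    payload = "\n".join(templates) + "\n\n" + "\n".join(rows)
    return zlib.compress(payload.encode("utf-8"))


def decompress(blob):
    head, body = zlib.decompress(blob).decode("utf-8").split("\n\n", 1)
    templates = head.split("\n")
    lines = []
    for row in body.split("\n"):
        fields = row.split(" ")
        params = iter(fields[1:])
        slots = templates[int(fields[0])].split(" ")
        lines.append(" ".join(next(params) if s == WILDCARD else s for s in slots))
    return lines


logs = [
    "Connected to 10.0.0.1 port 22",
    "Connected to 10.0.0.2 port 80",
    "Disk full on node7",
]
assert decompress(compress(logs)) == logs
```

The round trip is lossless only for non-empty entries whose tokens are separated by single spaces and that do not contain the literal token `<*>`; a real system would need to preserve the original delimiters. Because each partition's tree is built without reference to the others, the per-partition step in `compress` could be mapped across worker processes, which is in the spirit of the parallel design the abstract claims.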