Performance Improvement of Hadoop ext4-based Disk I/O

2020 Eighth International Symposium on Computing and Networking (CANDAR), 2020

Abstract
Hadoop is one of the most popular big-data analytics platforms, and it often relies on hard disk drives to store data volumes that exceed the capacity of solid-state drives. Unlike other data-intensive applications, such as database management systems, big-data processing jobs frequently require extensive sequential I/O. Previously proposed methods for improving sequential I/O performance modified the block usage bitmap of the Ext2/3 filesystem so that the faster disk zones, which are the outer zones of each hard disk drive, are actively used. However, these methods do not support Ext4, the current version of the Ext filesystem family. In this paper, we discuss a method for improving the sequential I/O performance of the Ext4 filesystem. First, we evaluate sequential file access throughput on the Ext3, Ext4, and XFS filesystems, and point out that Ext4 does not actively use the area freed by deleting existing files, which degrades file access performance. Second, we propose a method for improving Ext4 sequential file access performance: the improved Ext4 actively uses the faster zones of storage devices by controlling where files are placed. Third, we evaluate the proposed filesystem and show that it outperforms existing filesystems. For TeraSort, Hadoop on the proposed Ext4 filesystem outperforms Hadoop on the original Ext4 filesystem by as much as 30.1%.
Keywords
Big data, ext4, Filesystem, Sequential access