defoe: A Spark-based Toolbox for Analysing Digital Historical Textual Data

2019 15th International Conference on eScience (eScience) (2019)

Abstract
This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It runs text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.
Keywords
text mining, Apache Spark, High-Performance Computing, XML schemas, digital tools, humanities research, historical sources, distributed queries