Human-Machine Information Extraction Simulator for Biological Collections.

BigData(2019)

引用 1|浏览52
暂无评分
摘要
In the last decade, institutions from around the world have implemented initiatives for digitizing biological collections (biocollections) and sharing their information online. The transcription of the metadata from photographs of specimens’ labels is performed through human-centered approaches (e.g., crowdsourcing) because fully automated Information Extraction (IE) methods still generate a significant number of errors. The integration of human and machine tasks has been proposed to accelerate the IE from the billions of specimens waiting to be digitized. Nevertheless, in order to conduct research and trying new techniques, IE practitioners need to prepare sets of images, crowdsourcing experiments, recruit volunteers, process the transcriptions, generate ground truth values, program automated methods, etc. These research resources and processes require time and effort to be developed and architected into a functional system. In this paper, we present a simulator intended to accelerate the ability to experiment with workflows for extracting Darwin Core (DC) terms from images of specimens. The so-called HuMaIN Simulator includes the engine, the human-machine IE workflows for three DC terms, the code of the automated IE methods, crowdsourced and ground truth transcriptions of the DC terms of three biocollections, and several experiments that exemplify its potential use. The simulator adds Human-in-the-loop capabilities, for iterative IE and research on optimal methods. Its practical design permits the quick definition, customization, and implementation of experimental IE scenarios.
更多
查看译文
关键词
Information extraction,simulator,human-machine,human-in-the-loop,crowdsourcing,optical character recognition,natural language processing
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要