SELFIE: Self-Aware Information Extraction from Digitized Biocollections

2017 IEEE 13th International Conference on e-Science (e-Science) (2017)

Abstract
Biological collections store information with broad societal and environmental impact. Over the last 15 years, worldwide investments and crowdsourcing efforts have led to the digitization of 25% of collected specimens, a process that includes imaging the text attached to specimens and then extracting information from the resulting images. This information extraction (IE) process is complex and therefore slow, and it typically involves human tasks. We propose a hybrid (human-machine) information extraction model that efficiently uses resources of different cost (machines, volunteers, and/or experts) and speeds up the digitization of biocollections, while striving to maintain the same quality as human-only IE processes. In the proposed model, called SELFIE, self-aware IE processes determine whether their output quality is satisfactory. If the quality is unsatisfactory, additional or alternative processes that yield higher-quality output at higher cost are triggered. The effectiveness of this model is demonstrated by three SELFIE workflows for the extraction of Darwin Core terms from specimen images. Compared to the traditional human-driven IE approach, SELFIE workflows showed, on average, a 27% reduction in information-capture time and a 32% decrease in the required number of humans and their associated cost, while the quality of the results decreased by a negligible 0.27%.
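
The escalation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stage names, the relative costs, the confidence scores, the 0.8 threshold, and the image identifiers are all hypothetical, and each stage is a stub that returns a fixed value with a self-assessed confidence. The sketch only shows the control flow, in which a cheaper process runs first and a costlier one is triggered when the reported quality is unsatisfactory.

```python
from dataclasses import dataclass
from typing import Callable, List, NamedTuple, Tuple


@dataclass
class Stage:
    """One IE process; all fields are illustrative assumptions."""
    name: str
    cost: float  # hypothetical relative cost of running this stage once
    extract: Callable[[str], Tuple[str, float]]  # returns (value, confidence)


class Result(NamedTuple):
    stage: str
    value: str
    confidence: float
    total_cost: float


def run_cascade(image_id: str, stages: List[Stage], threshold: float = 0.8) -> Result:
    """Run stages from cheapest to costliest until one reports enough confidence."""
    if not stages:
        raise ValueError("at least one stage is required")
    total_cost = 0.0
    for stage in stages:
        value, confidence = stage.extract(image_id)
        total_cost += stage.cost
        result = Result(stage.name, value, confidence, total_cost)
        if confidence >= threshold:
            break  # quality is satisfactory, so no escalation is needed
    return result  # if no stage passed, this is the last (costliest) attempt


# Stub extractors standing in for OCR, a volunteer task, and an expert review.
def machine_stage(image_id: str) -> Tuple[str, float]:
    simulated = {
        "img-clear": ("Quercus alba", 0.95),
        "img-blurry": ("Qu3rcus a1ba", 0.40),
    }
    return simulated[image_id]


def volunteer_stage(image_id: str) -> Tuple[str, float]:
    return ("Quercus alba", 0.90)


def expert_stage(image_id: str) -> Tuple[str, float]:
    return ("Quercus alba", 1.0)


STAGES = [
    Stage("machine", 1.0, machine_stage),
    Stage("volunteer", 10.0, volunteer_stage),
    Stage("expert", 50.0, expert_stage),
]

if __name__ == "__main__":
    for image_id in ("img-clear", "img-blurry"):
        print(image_id, run_cascade(image_id, STAGES))
```

In this toy setup the clear image is resolved by the machine stage alone, while the blurry one escalates to the volunteer stage and accumulates the cost of both stages. How a real process would estimate its own output quality is not specified in the abstract and is left as a stub here.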
Keywords
information extraction,self-awareness,digitization,human-machine,biocollections