Data Bridges: Modeling Marine Science Information to Heterogeneous Information Network for Research Data Management

crossref(2024)

Abstract
Research Data Management (RDM) in the natural sciences provides a structured foundation for organizing and preserving scientific data. Effective management of, and access to, these diverse data sources is crucial for supporting domain scientists in future knowledge discovery. Scientific publications, a primary data source usually distributed in Portable Document Format (PDF), are rich in information: they contain text, tables, figures, and metadata. These components convey information individually or in combination and open up promising research directions. Exploiting them, however, requires acquiring the data components from the publications and applying information extraction suited to each component. Modeling the extracted information as a Heterogeneous Information Network (HIN) of publications further improves accessibility, collaboration, and information harvesting within the natural sciences. We developed a comprehensive framework, designed for user accessibility and broad applicability, that models diverse information from marine science publications as an HIN. The framework comprises three modules: Data Acquisition, Information Extraction, and Information Modeling. The Data Acquisition (DA) module extracts the data components from the relevant publications and transforms them into machine-readable formats. The Information Extraction (IE) module includes two sub-modules: Named Entity Recognition (NER) modules trained on annotated marine science text, capable of extracting eight types of entities from plain text; and an information parser module responsible for extracting quantitative information from tabular data. The parser first detects and then extracts scientific measurements, relevant spatial information, and other available characteristics.
Finally, the Information Modeling module presents the information extracted from the data components and links it together. The information is thereby structured into an HIN of scientific publications, which delivers diverse information effectively to domain experts and supports the Research Data Management initiative.
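To make the idea of a Heterogeneous Information Network more concrete, the following is a minimal sketch of typed nodes connected by typed edges. It is not the framework's implementation. The class names (`Node`, `HIN`), the node types ("publication", "location", "measurement"), the edge labels, and all sample values are hypothetical and chosen only for illustration; the abstract does not name the eight entity types or the network schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_type: str  # illustrative label, e.g. "publication" or "location"
    key: str        # identifier unique within its type


@dataclass
class HIN:
    # Adjacency map: source node -> list of (edge_type, target node).
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def link(self, src: Node, edge_type: str, dst: Node) -> None:
        """Store a typed edge plus its inverse so the network can be walked both ways."""
        self.edges[src].append((edge_type, dst))
        self.edges[dst].append((f"inverse_{edge_type}", src))

    def neighbors(self, node: Node, node_type: Optional[str] = None) -> list:
        """Return adjacent nodes, optionally restricted to one node type."""
        return [
            dst
            for _, dst in self.edges.get(node, [])
            if node_type is None or dst.node_type == node_type
        ]


# All values below are made up for the example.
hin = HIN()
paper_a = Node("publication", "paper-A")
paper_b = Node("publication", "paper-B")
region = Node("location", "Baltic Sea")
salinity = Node("measurement", "salinity=7.2 PSU")

hin.link(paper_a, "mentions", region)      # e.g. an entity found in plain text
hin.link(paper_b, "mentions", region)
hin.link(paper_a, "reports", salinity)     # e.g. a value parsed from a table
hin.link(salinity, "measured_at", region)  # linking a measurement to spatial information

# Follow publication -> location -> publication to find papers sharing a location.
related = [p.key for p in hin.neighbors(region, "publication")]
print(related)
```

Because nodes and edges both carry a type, a query can be phrased as a path over types (here, publications connected through a shared location), which is what distinguishes this structure from an untyped citation graph.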