Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

Genomics, Proteomics & Bioinformatics(2020)

引用 14|浏览31
暂无评分
摘要
The volume of biological database records is growing rapidly, populated by complex records drawn from heterogeneous sources. A specific challenge is duplication, that is, the presence of redundancy (records with high similarity) or inconsistency (dissimilar records that correspond to the same entity). The characteristics (which records are duplicates), impact (why duplicates are significant), and solutions (how to address duplication), are not well understood. Studies on the topic are neither recent nor comprehensive. In addition, other data quality issues, such as inconsistencies and inaccuracies, are also of concern in the context of biological databases. A primary focus of this paper is to present and consolidate the opinions of over 20 experts and practitioners on the topic of duplication in biological sequence databases. The results reveal that survey participants believe that duplicate records are diverse; that the negative impacts of duplicates are severe, while positive impacts depend on correct identification of duplicates; and that duplicate detection methods need to be more precise, scalable, and robust. A secondary focus is to consider other quality issues. We observe that biocuration is the key mechanism used to ensure the quality of this data, and explore the issues through a case study of curation in UniProtKB/Swiss-Prot as well as an interview with an experienced biocurator. While biocuration is a vital solution for handling of data quality issues, a broader community effort is needed to provide adequate support for thorough biocuration in the face of widespread quality concerns.
更多
查看译文
关键词
Duplication,Redundancy,Data quality,Biocuration,Biological databases
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要