The systematic assessment of completeness of public metadata accompanying omics studies

Yu-Ning Huang, Pooja Vinod Jaiswal, Anushka Rajesh, Anushka Yadav, Dottie Yu, Fangyun Liu, Grace Scheg, Grigore Boldirev, Irina Nakashidze, Aditya Sarkar, Jay Himanshu Mehta, Ke Wang, Khooshbu Kantibhai Patel, Mustafa Ali Baig Mirza, Kunali Chetan Hapani, Qiushi Peng, Ram Ayyala, Ruiwei Guo, Shaunak Kapur, Tejasvene Ramesh, Malak S. Abedalthagafi, Serghei Mangul

bioRxiv (Cold Spring Harbor Laboratory) (2023)

Abstract
Recent advances in high-throughput sequencing technologies have made it possible to collect and share a massive amount of omics data, along with its associated metadata. Enhancing metadata availability is critical to ensure data reusability and reproducibility and to facilitate novel biomedical discoveries through effective data reuse. Yet incomplete metadata accompanying public omics data limits the reproducibility and reusability of millions of omics samples. In this study, we performed a comprehensive assessment of the completeness of metadata shared in scientific publications and/or public repositories by analyzing over 253 studies encompassing over 164 thousand samples. We observed that studies often omit over a quarter of important phenotypes, with an average of only 74.8% of them shared either in the text of the publication or in the corresponding repository. Notably, public repositories alone contained 62% of the metadata, surpassing the textual content of publications by 3.5%. Only 11.5% of studies shared all phenotypes, while 37.9% shared less than 40% of the phenotypes. Studies involving non-human samples were more likely to share metadata than studies involving human samples. We observed similar results on the extended dataset spanning 2.1 million samples across over 61,000 studies from the Gene Expression Omnibus repository. The limited availability of metadata reported in our study emphasizes the necessity for improved metadata sharing practices and standardized reporting. Finally, we discuss the numerous benefits of improving the availability and quality of metadata to the scientific community and beyond, supporting data-driven decision-making and policy development in the field of biomedical research.

Competing Interest Statement
The authors have declared no competing interest.
Key words
public metadata, systematic assessment, omics, studies