36. Gene Normalizer: A tool to resolve genetic ambiguity through data harmonization

Cancer Genetics (2023)

Abstract
Gene symbols, maintained by gene naming authorities such as HGNC, are error-prone when used as identifiers for describing genes in databases and biomedical literature. Gene symbols change over time and may conflict with community aliases for gene loci, leading to potential errors. We investigated the scale of this issue by evaluating the gene symbols and aliases of two authoritative gene sets: NCBI Gene and HGNC. We found 3,940 gene records (2.3%) containing aliases that identically matched the primary symbol of another gene record. For example, KRAS is both the primary symbol for an NCBI Entrez gene (ncbigene:3845) and an alias for the related but distinct RAS-family gene NRAS (ncbigene:4893). Our analysis illustrates how these findings may impact downstream gene data analyses, including natural language processing and literature curation. As this example shows, intersections between aliases and gene symbols occur in well-classified and frequently referenced genes, making disambiguation a recurring issue and a challenge to resolve. To raise awareness of this issue and provide policies for resolving these challenges, we have developed the Gene Normalizer. This resource harmonizes data and improves corroboration for gene records across commonly used resources. The development of the Gene Normalizer is part of a larger effort to improve clinical application workflows that depend on efficient processes and precise genetic information for patient treatment.
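The collision check described in the abstract (an alias on one record that identically matches the primary symbol of another record) can be illustrated with a minimal sketch. This is not the Gene Normalizer's implementation or API. The record layout, the function name `find_alias_symbol_collisions`, and the `example:1` record are assumptions made for illustration. The KRAS/NRAS entries only mirror the example given in the abstract.

```python
from collections import defaultdict


def find_alias_symbol_collisions(records):
    """Return (record_id, alias, other_record_id) triples where an alias of
    one record exactly equals the primary symbol of a different record."""
    # Index every primary symbol to the record(s) that own it.
    symbol_to_ids = defaultdict(set)
    for rec in records:
        symbol_to_ids[rec["symbol"]].add(rec["id"])

    collisions = []
    for rec in records:
        for alias in rec["aliases"]:
            # Exact (identical) string match, as described in the abstract.
            for other_id in sorted(symbol_to_ids.get(alias, ())):
                if other_id != rec["id"]:
                    collisions.append((rec["id"], alias, other_id))
    return collisions


# Toy data: the first two records mirror the abstract's example;
# the third is a hypothetical record with no collision.
TOY_RECORDS = [
    {"id": "ncbigene:3845", "symbol": "KRAS", "aliases": []},
    {"id": "ncbigene:4893", "symbol": "NRAS", "aliases": ["KRAS"]},
    {"id": "example:1", "symbol": "GENEX", "aliases": ["GX1"]},
]

if __name__ == "__main__":
    for record_id, alias, other_id in find_alias_symbol_collisions(TOY_RECORDS):
        print(f"{record_id}: alias {alias!r} is the primary symbol of {other_id}")
```

Indexing primary symbols first keeps the check linear in the total number of symbols and aliases. A real analysis would also need to decide how to treat case differences and withdrawn symbols, which this sketch does not address.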
Key words
gene normalizer, genetic ambiguity, data harmonization