Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations

semanticscholar(2021)

Abstract
Considering recent progress in NLP, deep learning techniques and biomedical language models, there is a pressing need to generate annotated resources and comparable evaluation scenarios that enable the development of advanced biomedical relation extraction systems that extract interactions between drugs/chemical entities and genes, proteins or miRNAs. Building on the results and experience of the CHEMDNER, CHEMDNER patents and ChemProt tracks, we have posed the DrugProt track at BioCreative VII. The DrugProt track focused on the evaluation of automatic systems able to extract 13 different types of drug-gene/protein relations of importance to understand gene regulatory and pharmacological mechanisms. The DrugProt track addressed regulatory associations (direct/indirect, activator/inhibitor relations), certain types of binding associations (antagonist and agonist relations) as well as metabolic associations (substrate or product relations). To promote development of novel tools and offer a comparative evaluation scenario we have released 61,775 manually annotated gene mentions, 65,561 chemical and drug mentions and a total of 24,526 relationships manually labeled by domain experts. A total of 30 teams submitted results for the DrugProt main track, while 9 teams submitted results for the large-scale text mining subtrack that required processing of over 2.3 million records. Teams obtained very competitive results, with predictions reaching F-measures of over 0.92 for some relation types (antagonist) and F-measures across all relation types close to 0.8.

INTRODUCTION
Among the most relevant biological and pharmacological relation types are those that involve (a) chemical compounds and drugs as well as (b) gene products including genes, proteins and miRNAs.
A variety of associations between chemicals and genes/proteins are described in the biomedical literature, and there is a growing interest in facilitating a more systematic extraction of these relations from the literature, either for manual database curation initiatives or to generate large knowledge graphs of importance for drug discovery, drug repurposing, building regulatory or interaction networks, or characterizing off-target interactions of drugs that might help to better understand adverse drug reactions. At BioCreative VI, the ChemProt track tried to promote the development of novel systems for extracting relations between chemicals and genes for groups of biologically related association types (ChemProt track relation groups or CPRs). Although the obtained results did have a considerable impact on the development and evaluation of new biomedical relation extraction systems, a limitation of grouping more specific relation types into broader groups was the difficulty of directly exploiting the results for database curation efforts and biomedical knowledge graph mining application scenarios. The considerable interest in the integration of chemical and biomedical data for drug-discovery purposes, together with the ongoing curation of relationships between biological and chemical entities from scientific publications and patents due to the recent COVID-19 pandemic, motivated the DrugProt track of BioCreative VII, which proposed using more granular relation types. In order to facilitate the development of more granular relation extraction systems, large manually annotated corpora are needed. Those corpora should include high-quality manually labeled entity mentions together with exhaustive relation annotations generated by domain experts.

TRACK AND CORPUS DESCRIPTION
Corpus description
To carry out the DrugProt track at BioCreative VII, we have released a large manually labelled corpus including annotations of mentions of chemical compounds and drugs as well as genes, proteins and miRNAs. Domain experts with experience in biomedical literature annotation and database curation annotated all abstracts by hand using the BRAT annotation interface. The manual labeling of chemicals and genes was done in separate steps and by different experts to avoid introducing biases during the text annotation process. The manual tagging of entity mentions of chemicals and drugs as well as genes, proteins and miRNAs was done following a carefully designed annotation process and in line with publicly released annotation guidelines. Gene/protein entity mentions were manually mapped to their corresponding biological database identifiers whenever possible and classified as either normalizable to databases (tag: GENE-Y) or non-normalizable mentions (tag: GENE-N). Teams that participated in the DrugProt track were only provided with this classification of gene mentions and not the actual database identifier, to avoid usage of external knowledge bases for producing their predictions. The corpus construction process required first annotating exhaustively all chemical and gene mentions (phase 1). The relation annotation phase followed (phase 2), where relationships between these two types of entities had to be labeled according to publicly available annotation guidelines. To facilitate the annotation of chemical-protein interactions, the DrugProt track organizers constructed very granular relation annotation rules, described in a 33-page annotation guidelines document. These guidelines were refined during an iterative process based on the annotation of sample documents.
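As an illustration of the GENE-Y / GENE-N distinction, the following minimal sketch parses entity mentions written in BRAT's standoff notation for text-bound annotations (`ID<TAB>LABEL START END<TAB>TEXT`). It rests on several assumptions: the format in which the corpus was actually distributed is not stated here, the `CHEMICAL` label name is hypothetical, the sample lines are invented, and only contiguous spans are handled.

```python
from dataclasses import dataclass

# GENE-Y and GENE-N are the tags named in the text.
# CHEMICAL is an assumed label name used only for illustration.
GENE_LABELS = {"GENE-Y", "GENE-N"}
CHEMICAL_LABEL = "CHEMICAL"


@dataclass(frozen=True)
class Mention:
    ann_id: str
    label: str
    start: int
    end: int
    text: str

    @property
    def is_gene(self) -> bool:
        return self.label in GENE_LABELS

    @property
    def is_normalizable(self) -> bool:
        # Only GENE-Y mentions could be mapped to a database identifier.
        return self.label == "GENE-Y"


def parse_text_bound(line: str) -> Mention:
    """Parse one contiguous text-bound line: 'T1<TAB>LABEL START END<TAB>TEXT'."""
    ann_id, span, text = line.rstrip("\n").split("\t", 2)
    label, start, end = span.split(" ")
    return Mention(ann_id, label, int(start), int(end), text)


# Invented sample annotations, not taken from the corpus.
sample = [
    "T1\tCHEMICAL 0 9\tgefitinib",
    "T2\tGENE-Y 19 23\tEGFR",
    "T3\tGENE-N 40 56\ttyrosine kinases",
]
mentions = [parse_text_bound(line) for line in sample]
```

Participants would see only the label on each gene mention, so a system could branch on `is_normalizable` but could not look up the underlying database identifier.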
The guidelines provided the basic details of the chemical-protein interaction annotation task and the conventions that had to be followed during the corpus construction process. They incorporated suggestions made by curators as well as observations of annotation inconsistencies encountered when comparing results from different human curators. In brief, DrugProt interactions covered direct interactions (when a physical contact existed between a chemical/drug and a gene/protein) as well as indirect regulatory interactions that alter either the function or the quantity of the gene/gene product. The aim of the iterative manual annotation cycle was to improve the quality and consistency of the guidelines. During the planning of the guidelines some rules had to be reformulated to make them more explicit and clear, and additional rules were added wherever necessary to better cover the practical annotation scenario and to be more complete. The manual annotation task consisted of manually marking the interactions in the article abstracts through a customized BRAT web interface. Figure 1 summarizes the DrugProt relation types included in the annotation guidelines.

Fig. 1. Overview of the DrugProt relation type hierarchy.

The corpus annotation carried out for the DrugProt track was exhaustive for all the types of interactions previously specified. This implied that mentions of other kinds of relationships between chemicals and genes (e.g. phenotypic and biological responses) were not manually labelled. Moreover, the DrugProt relations are directed in the sense that only relations of “what a chemical does to a gene/protein” (chemical → gene/protein direction) were annotated, and not vice versa. To establish an easy-to-understand relation nomenclature and avoid redundant class definitions, we reviewed several chemical repositories that included chemical-biology information.
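The directionality constraint (chemical → gene/protein only) can be expressed as a simple check. The sketch below assumes relations are written in BRAT's standoff relation notation (`ID<TAB>TYPE Arg1:Tn Arg2:Tm`); the `CHEMICAL` label name, the entity IDs and the sample lines are hypothetical, and `INHIBITOR` stands in for any of the track's relation types.

```python
from typing import Dict, NamedTuple

GENE_LABELS = {"GENE-Y", "GENE-N"}
CHEMICAL_LABEL = "CHEMICAL"  # assumed label name, for illustration only


class Relation(NamedTuple):
    rel_type: str
    arg1: str  # ID of the source entity
    arg2: str  # ID of the target entity


def parse_relation(line: str) -> Relation:
    """Parse one BRAT relation line: 'R1<TAB>TYPE Arg1:T1 Arg2:T2'."""
    _rel_id, body = line.rstrip("\n").split("\t", 1)
    rel_type, arg1, arg2 = body.split(" ")
    return Relation(rel_type, arg1.split(":", 1)[1], arg2.split(":", 1)[1])


def is_chemical_to_gene(rel: Relation, labels: Dict[str, str]) -> bool:
    """True only if the relation points from a chemical to a gene/protein."""
    return (
        labels.get(rel.arg1) == CHEMICAL_LABEL
        and labels.get(rel.arg2) in GENE_LABELS
    )


# Invented entity labels and relation lines.
labels = {"T1": "CHEMICAL", "T2": "GENE-Y"}
forward = parse_relation("R1\tINHIBITOR Arg1:T1 Arg2:T2")
reverse = parse_relation("R2\tINHIBITOR Arg1:T2 Arg2:T1")
```

Under these assumptions, `is_chemical_to_gene(forward, labels)` holds while the reversed relation is rejected, mirroring the rule that gene → chemical statements were not annotated.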
We reviewed DrugBank, the Therapeutic Targets Database (TTD) and ChEMBL, assay normalization ontologies (BAO) and previously existing formalizations for the annotation of relationships: the Biological Expression Language (BEL), curation guidelines for transcription regulation interactions (DNA-binding transcription factor – target gene interaction) and SIGNOR, a database of causal relationships between biological entities. Each of these resources inspired the definition of the subclasses DIRECT REGULATOR (e.g. DrugBank, ChEMBL, BAO and SIGNOR) and INDIRECT REGULATOR (e.g. BEL, curation guidelines for transcription regulation interactions and SIGNOR). For example, DrugBank relationships for drugs included a total of 22 definitions: some overlapped with CHEMPROT subclasses (e.g. “Inhibitor”, “Antagonist”, “Agonist”, ...), some were regarded as highly specific for the purpose of this task (e.g. “intercalation”, “cross-linking/alkylation”) or referred to biological roles (e.g. “Antibody”, “Incorporation into and Destabilization”), and others that partially overlapped with each other (e.g. “Binder” and “Ligand”) were merged into a single class. Concerning indirect regulatory aspects, the five classes of causal relationships between a subject and an object term defined by BEL (“decreases”, “directlyDecreases”, “increases”, “directlyIncreases” and “causesNoChange”) were highly inspiring. Subclass definitions of pharmacological modes of action were defined according to the IUPHAR/BPS Guide to Pharmacology in 2016. For the DrugProt track a very granular chemical-protein relation annotation was carried out, with the aim of covering most of the relations that are of importance from a biochemical and pharmacological/biomedical perspective. Nevertheless, for the DrugProt track only a total of 13 relation types were used, keeping those that had enough training instances/examples and sufficient manual annotation consistency.
The final list of relation types used for this shared task was: INDIRECT-DOWNREGULATOR, INDIRECT-UPREGULATOR, DIRECT-REGULATOR, ACTIVATOR, INHIBITOR, AGONIST, ANTAGONIST, AGONIST-ACTIVATOR, AGONIST-INHIBITOR, PRODUCT-OF, SUBSTRATE, SUBSTRATE_PRODUCT-OF and PART-OF. The DrugProt corpus was split randomly into training, development and test sets. We also included a background and a large-scale background collection of records that were automatically annotated with drugs/chemicals and genes/proteins/miRNAs using an entity tagger trained on the manual DrugProt entity mentions. The background collections were merged with the test set so that team predictions could also be obtained for these records. Table 1 shows a su