
The PSI-MOD community standard for representation of protein modification data

NATURE BIOTECHNOLOGY(2008)

Abstract
As workers in proteomics, mass spectrometry and bioinformatics, acting with others to develop and promote standards for storing data, and submitting and publishing results, we propose a community standard ontology that reconciles complementary descriptions of protein residue modifications in a hierarchical representation and serves as a tool for precisely annotating ambiguous or incomplete experimental results. This ontology is being developed and maintained by a work group of the Proteomics Standards Initiative (PSI), founded by the Human Proteome Organization (HUPO), as a community effort to create standards for the representation and exchange of proteomics data1, 2. Three freely accessible web resources dedicated to protein modifications follow different approaches in describing those modifications. The RESID Database of Protein Modifications (http://www.ebi.ac.uk/RESID/index.html) is a comprehensive compilation of naturally occurring modifications3 annotated in the UniProt Protein Knowledgebase4. Proposed modifications later shown not to exist or to be artifacts are tagged as 'deprecated'. The UNIMOD database (http://www.unimod.org/) is dedicated to mass spectrometry and contains both natural and nonnatural modifications with essential annotations in a relational database5. DeltaMass (http://www.abrf.org/index.cfm/dm.home) is a list of modifications and mass spectrometry decomposition products ordered by mass difference6. These web resources were not designed to provide the consistent, hierarchically ordered definitions that are required to support dissemination of data under the PSI data exchange standards. Mass spectrometry–based protein identification and structural characterization software, from public or commercial sources, uses dedicated or proprietary databases of modifications that do not provide the required hierarchically ordered definitions.
Researchers find it difficult to integrate protein modification data because the underlying terms and criteria they rely on are incompatible. As in other areas of proteomics, research is hampered by the fragmentation of publicly available information. Protein modification data, in particular, is sometimes difficult to interpret because of the frequent use of different nomenclatures or ways of describing protein modifications, especially when experimental methods give ambiguous or incomplete determinations of those modifications. A community effort is required to deal with these difficulties. Two PSI working groups, Proteomics Informatics (PSI-PI) and Molecular Interactions (PSI-MI), are developing data exchange standards7 that provide a community consensus based on a standard data exchange document format specified in an XML (extensible markup language) schema, hierarchical controlled vocabularies relating to the data schema in the Open Biomedical Ontologies (OBO) file format8 and minimum requirement recommendations for release of data in the public domain. In the development of these standards, both PSI-PI and PSI-MI require the precise annotation of protein modifications at different levels of experimental resolution. To avoid both duplication of effort and the introduction of more conflicting terminologies, PSI-MOD is designed to be a shared ontology for protein modifications9. It attempts to represent both naturally occurring and nonnatural modifications with a comprehensive, hierarchical, controlled vocabulary, providing terms for the annotation of ambiguous structures, and includes searchable information on modifications that would allow them to be identified by experimentally determined masses or mass differences. 
In addition to complementing the data standardization efforts of the PSI-PI and PSI-MI, the proposed PSI-MOD provides a comprehensive controlled vocabulary for proteomic researchers to use in reporting protein modification experimental observations. PSI-MOD is being used in the PRIDE database10, a centralized, standards-compliant, public data repository for proteomics data, such as mass spectra for protein and peptide identifications with their observed modifications. PSI-MOD is an ontology, a hierarchical controlled vocabulary forming a directed acyclic graph11 consisting of terms and definitions for protein chemical modifications, the nodes, logically linked by specific relationships, the edges. In its current public release, the PSI-MOD ontology has 57 top-level nodes and provides alternative hierarchical paths for classifying protein modifications by either the molecular structure of the modification, the amino acid residue that is modified or isobaric sets at different levels of precision. Five defined relationships are used in the ontology.
The PSI-MOD ontology is produced in an OBO format version 1.2 file and an equivalent OBO.XML file. As recommended by the Gene Ontology Consortium12, each term is assigned a unique, numeric identifier following the prefix 'MOD' registered as the ontology namespace at the OBO website. Terms in the ontology have a text definition and cross references to the source databases and to appropriate literature references cited by PubMed identifiers. The terms are provided with a collection of synonyms, including IUPAC (International Union of Pure and Applied Chemistry) systematic names and other names that have appeared in the literature, including tagged misnomers. American English is used in the terms and definitions, as recommended by OBO guidelines, and British spellings are provided as synonyms. In addition to the unique, but nondescriptive, numeric identifier, two types of descriptive labels are provided for many of the terms.
There are unique, descriptive identifiers that are usable as short labels in computer applications and for identifying modifications by the other PSI working groups. The unique short labels have a length of 20 characters or fewer and generally follow IUPAC and IUBMB (International Union of Biochemistry & Molecular Biology) recommendations for amino acid derivatives and other biochemical abbreviations13, 14, 15, 16, 17, 18. These short unique labels are exact synonyms flagged in the list of synonyms with the type 'PSI-MOD-short'. There is also a second set of descriptive identifiers, used in mass spectrometry, adopting a modification nomenclature based simply upon the mass change from the unmodified residue rather than on the structure of the product. Thus, isomeric modifications with the same mass have the same mass spectrometry identifier, and modifications that do not have a defined mass change, such as generic peptide–DNA or peptide–RNA modifications, are not covered by this nomenclature. The mass spectrometry PSI working group has produced a list of proposed descriptive labels in an attempt to reconcile different names used for general modifications by public and commercial mass spectrometry search engines. The set of labels for types of protein modifications observed in mass spectrometry analysis covers about 540 specific modifications in the ontology, and the labels are shown as related synonyms with the type 'PSI-MS-label'. These names are generally more intuitive to the reader but less systematic. A document containing the mapping of the descriptive labels to various database and search engine names, along with proposed rules and recommendations, is available as a spreadsheet from the PSI-MOD web page (http://psidev.sourceforge.net/mod/ms/PSI_MS_mod_nomenclature.xls).
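As an illustration of how the identifiers and typed synonyms described above might be consumed by software, the following minimal Python sketch parses a small OBO-style fragment and pulls out the labels of a given synonym type. The stanza contents, including the identifiers MOD:99900 and MOD:99901 and every name in them, are invented for illustration and are not taken from the released PSI-MOD file; the function names parse_terms and labels_of_type are likewise hypothetical. Only the id, name, synonym and is_a tags are handled, so this is a sketch rather than a complete OBO reader.

```python
import re

# Invented sample stanzas in OBO 1.2 style (NOT real PSI-MOD entries).
SAMPLE_OBO = '''\
[Term]
id: MOD:99900
name: illustrative parent class

[Term]
id: MOD:99901
name: illustrative modified residue
synonym: "ExMod" EXACT PSI-MOD-short []
synonym: "Example" RELATED PSI-MS-label []
synonym: "illustrative modified residue, alternate spelling" EXACT []
is_a: MOD:99900 ! illustrative parent class
'''

# synonym: "text" SCOPE [optional type] [dbxrefs]
SYNONYM_RE = re.compile(
    r'^"(?P<text>[^"]*)"\s+(?P<scope>\w+)(?:\s+(?P<type>[\w-]+))?\s*\['
)


def parse_terms(text):
    """Return {term_id: term_dict} for the [Term] stanzas in an OBO-style text."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"id": None, "name": None, "synonyms": [], "is_a": []}
            continue
        if line.startswith("["):  # any other stanza type is skipped
            current = None
            continue
        if not line or current is None or ":" not in line:
            continue
        tag, value = line.split(":", 1)
        value = value.strip()
        if tag == "id":
            current["id"] = value
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! comment" that repeats the parent name.
            current["is_a"].append(value.split(" ! ")[0].strip())
        elif tag == "synonym":
            match = SYNONYM_RE.match(value)
            if match:
                current["synonyms"].append(match.groupdict())
    return terms


def labels_of_type(term, synonym_type):
    """Return the synonym texts carrying the given synonym type."""
    return [s["text"] for s in term["synonyms"] if s["type"] == synonym_type]


terms = parse_terms(SAMPLE_OBO)
term = terms["MOD:99901"]
short_labels = labels_of_type(term, "PSI-MOD-short")
ms_labels = labels_of_type(term, "PSI-MS-label")
```

Here short_labels holds the exact-synonym short label and ms_labels the related-synonym mass spectrometry label, while term["is_a"] gives the parent identifiers from which the hierarchy can be rebuilt.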
To be useful to the mass spectrometry community, some terms in PSI-MOD contain an elemental formula and mass information calculated by both the chemical average and by the most common isotope (monoisotopic) methods. The formula and mass data are not strictly part of the ontology but are property values of the terms. Nine property values can appear.
PSI-MOD was initially constructed by preparing term entries from the three source databases, the RESID Database, UniMod and DeltaMass, and loading them into spreadsheets with common columns. Term entries were identified as probably being identical by comparing masses, mass differences or the similarity of names with synonyms available in the RESID database. The prospective entries were then automatically written in OBO format. The source databases for each term entry are cited in the definition cross-references by 'RESID:', 'UniMod:' and 'DeltaMass:', followed by the permanent identifiers for RESID and UniMod, and by an internal identifier for DeltaMass. Annotators merge terms that are recognized to be identical. Differences in the formula or calculated masses have to be resolved when entries are matched only by a similarity in name. When an entry is merged into another, the entry identifier is not lost. The entry is tagged 'obsolete' and a dead-end is_obsolete relationship is applied, but the identifier of the remapped entry is placed in the definition. Automatic update procedures are not performed on obsolete entries. Several different cases of one-to-many mappings can arise in merging entries from these databases. Some entries in RESID represent classes of compounds that can be matched with different specific class members present in UniMod or DeltaMass. These entries have a RESID cross-reference tagged 'variant'. Some entries in RESID represent the same modification that can be produced from different amino acids. These entries have a RESID cross-reference tagged 'resulting'.
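The mass-comparison step mentioned in this passage, in which entries from different source lists are flagged as probably identical, can be sketched roughly as follows. This is not the procedure the authors used; it is a minimal illustration that assumes each entry carries a monoisotopic mass difference in a hypothetical diff_mono field, and it uses invented source identifiers ("A:1", "B:7" and so on). The mass values are approximate textbook figures for phosphorylation, sulfation, acetylation and methylation, included only to show why a loose tolerance produces ambiguous candidates that an annotator would still have to resolve.

```python
import bisect


def candidate_matches(entries_a, entries_b, tol):
    """Pair entries whose monoisotopic mass differences agree within tol (Da).

    Returns (id_a, id_b) tuples; these are only candidates for a human
    annotator to confirm or reject, not confirmed merges.
    """
    b_sorted = sorted(entries_b, key=lambda e: e["diff_mono"])
    b_masses = [e["diff_mono"] for e in b_sorted]
    pairs = []
    for a in entries_a:
        lo = bisect.bisect_left(b_masses, a["diff_mono"] - tol)
        hi = bisect.bisect_right(b_masses, a["diff_mono"] + tol)
        for b in b_sorted[lo:hi]:
            pairs.append((a["id"], b["id"]))
    return pairs


# Hypothetical entries from two source lists (identifiers are invented).
source_a = [
    {"id": "A:1", "name": "phospho", "diff_mono": 79.96633},
    {"id": "A:2", "name": "acetyl", "diff_mono": 42.01057},
]
source_b = [
    {"id": "B:7", "name": "sulfo", "diff_mono": 79.95682},
    {"id": "B:8", "name": "phosphoryl", "diff_mono": 79.96633},
    {"id": "B:9", "name": "methyl", "diff_mono": 14.01565},
]

tight = candidate_matches(source_a, source_b, tol=0.001)
loose = candidate_matches(source_a, source_b, tol=0.02)
```

With the tight tolerance only the two phosphorylation entries are paired. With the loose tolerance the nearly isobaric sulfation entry is also proposed, which is the kind of case where formulas and names have to be compared before entries are merged.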
Some entries in UniMod and DeltaMass represent a modification process with a specific difference formula and 'delta-mass' that occurs on different amino acids, or on any amino acid at the N or C terminus. These entries have UniMod or DeltaMass cross-references tagged 'site'. The hierarchical organization of the PSI-MOD ontology is imposed by the addition of parent terms. These higher-level ontological terms do not have a cross-reference to any of the source databases but carry a stub 'PSI-MOD:ref' that will be replaced by a suitable literature cross-reference to the PSI-MOD ontology. An 'uncategorized' node is the parent node for all entries that have not yet been assigned to a class based on either chemical structure or source amino acid. These uncategorized terms are gradually being resolved as additional chemical categories are adopted. In the ongoing work, annotations are being added to supplement what was extracted from the source databases, and definitions and descriptive labels are being written to replace the stub fields produced during the assembly process. PSI-MOD contains 1,300 terms, including all the current entries in the source protein modification databases: the RESID database (439 entries in release 53 of 31 March 2008), UniMod (335 entries split into 630 sites as of 21 April 2008) and DeltaMass (353 entries as of 6 June 2006). More than 300 terms are identified as common to two or more of the source databases, about 780 terms are unique among the source databases and, so far, about 250 terms have been added in building the hierarchy. A subset category, slim, has been created for the most frequently encountered modifications and all the ontological parent terms of those modifications. This subset category can be selected to produce a smaller PSI-MOD slim version of the ontology. The PSI-MOD ontology can be searched by mass at the PRIDE19 site and at the Ontology Lookup Service site by term or by tree browsing.
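The idea behind the slim subset, a set of frequently encountered modifications together with all of their ontological parent terms, amounts to an ancestor closure over the directed acyclic graph. The following Python sketch shows one way such a subset could be computed. The toy hierarchy, its term names and the function names ancestors and build_slim are all assumptions made for illustration; in the released ontology the subset is selected by its subset tag rather than recomputed like this.

```python
def ancestors(term_id, parents):
    """Return every term reachable from term_id by following parent links."""
    seen = set()
    stack = list(parents.get(term_id, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def build_slim(frequent_terms, parents):
    """Return the frequent terms plus all of their ancestors in the DAG."""
    slim = set(frequent_terms)
    for term_id in frequent_terms:
        slim |= ancestors(term_id, parents)
    return slim


# Toy hierarchy (invented names). A term may have more than one parent,
# mirroring the alternative classification paths by structure and by residue.
PARENTS = {
    "root": [],
    "by structure": ["root"],
    "by residue": ["root"],
    "phosphorylated residue": ["by structure"],
    "modified serine": ["by residue"],
    "phosphoserine": ["phosphorylated residue", "modified serine"],
    "rarely seen modification": ["by structure"],
}

slim_terms = build_slim({"phosphoserine"}, PARENTS)
```

In this toy case slim_terms keeps both classification paths leading to the frequent term and leaves out the rarely seen leaf, so the reduced hierarchy remains connected up to the root.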
The PRIDE search site allows users to search for modifications either within a range of masses or as a single mass with a given precision. The Ontology Lookup Service is another product of the PRIDE project, which required a centralized query interface for ontology and controlled vocabulary lookup, allowing users to do text searches of term names in multiple ontologies. The PSI-MOD search allows users to query the full content of a number of existing public resources about protein modification and navigate easily through the hierarchy of modifications. The PSI-MOD is maintained and updated by the PSI-MOD working group to include newly reported modifications or newly developed modification reagents. PSI-MOD, following standard PSI procedures, is edited to include new modification terms or synonyms reflecting changes in the source databases and requests from the user community. PSI-MOD is open to public review. Editorial comments and suggestions may be submitted through the SourceForge.net collaborative development system. Requests may also be directed to the PSI-MOD working group mailing list. The mass spectrometry names and rules are open to public comment, and comments should be addressed to this mailing list. Reports will be periodically published on the mass spectrometry component of PSI-MOD. News about PSI-MOD and the work of the other PSI working groups, and their mailing lists, are available at the homepage (http://www.psidev.info/MOD) as well as at the URLs of the source databases.
We gratefully acknowledge the work of David Horn of Agilent Technologies, Santa Clara, California, USA, and Detlev Suckau of Bruker Daltonik GmbH, Bremen, Germany, while working on development of the PSI-MS list of descriptive labels. We also acknowledge the contribution made by Ken Mitchelhill in preparing the DeltaMass list and by Len Packman in maintaining it for the Association of Biomolecular Resource Facilities.
We thank Richard Cote of the European Bioinformatics Institute Proteomics Services Team for his work in developing the Ontology Lookup Service and the Ontology Browser for the PRIDE Project, and especially for developing the PSI-MOD mass search server. We thank Henning Hermjakob of the European Bioinformatics Institute for his leadership of the Proteomics Services Team and his contributions to the Proteomics Standards Initiative. The PSI meetings were supported by the Human Proteome Organization and by generous sponsorship from academic and commercial organizations. The collaborative development process has been facilitated by the infrastructure provided by SourceForge.
Key words
Life Sciences,general,Biotechnology,Biomedicine,Agriculture,Biomedical Engineering/Biotechnology,Bioinformatics