Nomenclature for the description of protein sequence modifications

From TheGPMWiki

Revision as of 15:29, 15 December 2011

Rationale

While there have been efforts to create ontologies and controlled vocabularies to describe the various types of amino acid modifications that can be observed in proteomics, no succinct notation has been proposed for describing those modifications in their biological context. Existing methods of describing protein modifications also tend to be "crisp" (assigning modifications to specific residues), while most measurements are less precise (assigning modifications to regions of a protein). This document lays out a notation meant to be used to describe the results of proteomics experiments realistically, using a format similar to the Human Genome Variation Society's (HGVS) notation for describing protein amino acid polymorphisms. Once results have been rendered into this notation, they can be used to query databases of existing proteomics information for validation, QA/QC or general information about specific proteins. These queries can be constructed from experimental data, domain-prediction algorithms or any other type of research into protein modifications and amino acid polymorphisms.

Unlike the HGVS notation, amino acid residues may only be represented as their single letter codes: three letter codes or full names are not allowed. For example, "S" may be used to represent serine, but "Ser" or "serine" are not allowed.

Prerequisites

In order to use this notation, it is necessary to meet the following preconditions:

  1. the protein sequence being annotated must be archived and readily available;
  2. the accession number used for the protein must be unique, i.e., there must be a 1:1 relationship between the accession number and the protein sequence; and
  3. the annotation should reflect the information available in the original measurement as completely as possible.

Simple modifications

(+) Specifying modifications as side chain changes

The general nomenclature format proposed for the case where the modification is described as a change to the structure of the amino acid residue in question is:

(1)               ACCESSION:pm.Xnn+MODIFICATION;

where "ACCESSION" is the accession number for the protein, "pm." indicates that it is a protein modification, "X" is the single letter symbol for the amino acid residue, "nn" is its ordinal position in protein co-ordinates, and "+MODIFICATION" specifies the change. For example, using the PSI-MS vocabulary for modifications, the notation

(2)               ENSP00000339186:pm.T262+Phospho;

indicates that for the protein accession number ENSP00000339186, the threonine residue number 262 is phosphorylated. It should be noted that the "+" symbol indicates that the side chain has changed: the change may result in either an increase or decrease in residue molecular mass.
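
As an illustration, form (1) can be assembled from its parts with a few lines of Python. This is only a sketch: the function name is hypothetical, and the checks (a single upper-case residue letter, a position of at least 1) are assumptions drawn from the rules on single letter codes and protein co-ordinates in this document.

```python
def format_side_chain_change(accession, residue, position, modification):
    """Build 'ACCESSION:pm.Xnn+MODIFICATION;' from its parts (hypothetical helper)."""
    # Assumption: residues are single upper-case letter codes.
    if len(residue) != 1 or not residue.isalpha() or not residue.isupper():
        raise ValueError("residue must be a single upper-case letter code")
    # Protein co-ordinates count from the N-terminus, which is residue 1.
    if position < 1:
        raise ValueError("protein co-ordinates start at 1")
    return f"{accession}:pm.{residue}{position}+{modification};"


print(format_side_chain_change("ENSP00000339186", "T", 262, "Phospho"))
```

Called with the values from example (2), the function reproduces that string.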

(=) Specifying modifications as new amino acid residues

Another approach taken to specifying residue modifications is to consider each modified residue to be a completely new amino acid: the PSI-MOD vocabulary uses this philosophy. If this type of residue replacement idea is to be used, then the notation will be:

(3)               ACCESSION:pm.Xnn=NEW_RESIDUE;

where "=NEW_RESIDUE" uses the residue-specification ontology required. Using this strategy for naming the modification in the previous example:

(4)               ENSP00000339186:pm.T262=MOD:00047;

indicates that for protein accession number ENSP00000339186, the threonine residue number 262 has been replaced with the residue specified by the PSI-MOD ID "MOD:00047" (O-phospho-L-threonine).

In the special case where one amino acid residue has been substituted by another residue at the DNA or mRNA level (an amino acid polymorphism), the HGVS annotation (:p.) should be used rather than this type of modification annotation:

(5)               ACCESSION:p.XnnY;

where "X" is the residue in the sequence associated with the accession number and "Y" is the new residue.

(#) Specifying modifications as a change in mass

Sometimes, when using mass spectrometry-based approaches to proteomics, it is not possible (or wise) to attribute a particular measured mass difference to a specific residue modification. In such a case, the appropriate nomenclature would be:

(6)               ACCESSION:pm.Xnn#DELTA;

where "#DELTA" is the difference between the measured mass and the residue mass of "X" in Daltons. Using the example above,

(7)               ENSP00000339186:pm.T262#79.9663;

indicates that for protein accession number ENSP00000339186, an experimental measurement supports the assignment that the threonine at residue number 262 has been modified by the addition of 79.9663 Daltons.
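
The three simple styles differ only in the operator that follows the residue position, so one pattern can read all of them. The following Python sketch assumes a grammar inferred from forms (1), (3) and (6); the names SIMPLE_SPEC, STYLE and parse_simple_modification are hypothetical and not part of the notation.

```python
import re

# Assumed grammar for forms (1), (3) and (6): one residue, one operator, one value.
SIMPLE_SPEC = re.compile(
    r"^(?P<accession>.+?):pm\.(?P<residue>[A-Z])(?P<position>\d+)"
    r"(?P<operator>[+=#])(?P<value>[^;]+);$"
)

# Labels chosen for this sketch only.
STYLE = {"+": "side_chain_change", "=": "new_residue", "#": "mass_delta"}


def parse_simple_modification(spec):
    """Split a single-residue specification into its parts."""
    match = SIMPLE_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not a simple modification specification: {spec!r}")
    parts = match.groupdict()
    parts["position"] = int(parts["position"])
    parts["style"] = STYLE[parts.pop("operator")]
    if parts["style"] == "mass_delta":
        parts["value"] = float(parts["value"])  # mass difference in Daltons
    return parts


for example in (
    "ENSP00000339186:pm.T262+Phospho;",
    "ENSP00000339186:pm.T262=MOD:00047;",
    "ENSP00000339186:pm.T262#79.9663;",
):
    print(parse_simple_modification(example))
```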

The number used in this style of notation should be governed by the normal rules for significant figures in scientific reporting. For example, if the error in a number is ± 0.1 Da, then the number should only be reported to one decimal place.
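
One way to apply this rule in software is to round the mass difference to the decimal place of the leading digit of its error. The Python sketch below assumes that convention and half-up rounding; neither is prescribed by this document, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_delta(delta, error):
    """Round a mass difference to the decimal place of the leading digit of its error."""
    # adjusted() gives the exponent of the most significant digit, e.g. -1 for 0.1.
    places = max(0, -Decimal(str(error)).adjusted())
    quantum = Decimal(1).scaleb(-places)
    rounded = Decimal(str(delta)).quantize(quantum, rounding=ROUND_HALF_UP)
    return f"{rounded:f}"


print(format_delta(79.9663, 0.1))    # one decimal place
print(format_delta(79.9663, 0.001))  # three decimal places
```

The returned string can be placed directly after the "#" character of a specification.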

Modification site occupancy

In most real cases a residue is not fully converted to a modified form. Instead, in a population of protein molecules only a fraction of the residues at a particular site will be modified. To express this concept, the fraction of residues that have been modified can be added to the specification, using braces ("{" and "}"), e.g.:

(8)               ENSP00000339186:pm.T262+Dehydration{0.8};

where "{0.8}" is the fraction of residues with this modification.
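
A parser might separate the occupancy from the rest of the specification before interpreting the modification itself. The sketch below assumes that the braces appear immediately before the terminating semi-colon, as in example (8), and that a missing occupancy is reported as None; both choices and the names are assumptions.

```python
import re

# Assumption: the occupancy, when present, sits directly before the final ";".
WITH_OCCUPANCY = re.compile(r"^(?P<body>.*?)(?:\{(?P<fraction>[^{}]*)\})?;$")


def split_occupancy(spec):
    """Return (specification without occupancy, fraction or None)."""
    match = WITH_OCCUPANCY.match(spec.strip())
    if match is None:
        raise ValueError(f"specification must end with ';': {spec!r}")
    fraction = match.group("fraction")
    if fraction is not None:
        fraction = float(fraction)
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("occupancy is a fraction between 0 and 1")
    return match.group("body") + ";", fraction


print(split_occupancy("ENSP00000339186:pm.T262+Dehydration{0.8};"))
```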

Modifications in a region

In this case, the specific residue modified cannot be determined from a measurement. It is possible, however, to determine a specific region of the protein that has been modified. For example, in mass spectrometry-based proteomics it is common to use tryptic peptides to identify modifications, and each tryptic peptide corresponds to a region of the original protein. Similarly, alanine-scanning techniques may show similar biological effects of a modification at several closely sited similar residues.

The word "region" in this context means any continuous peptide in a protein primary sequence, defined by a start and end residue. These residues are numbered from the N-terminus of the protein, which is defined as residue 1. The ordinal number of each residue may be referred to as the protein co-ordinate of that residue.

Modification within a region on specific residues

The general format for this type of notation is:

(9)               ACCESSION:pm.(XY)mm-nn+MODIFICATION[cc];

where "ACCESSION" is the protein accession number, "(XY)" is a list of single letter amino acid codes for the specific types of residues that may be modified, "mm-nn" are the start and end of the region in protein co-ordinates, and "[cc]" is the number of residues in the region with this modification. For example, if a serine/threonine phosphorylation has been traced to the peptide "141 SSSTQLR 147" of the protein with the UniProt accession code "up|QRTFRC|", this would be expressed as:

(10)               up|QRTFRC|:pm.(ST)141-147+Phospho;

If "[cc]" is not specified, it is taken to be the default value "[1]". The number of modifications may be a range, e.g. "[1-3]". If the range is unknown, then "cc" should be replaced with a question mark, e.g. "[?]".
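
The region form and the rules for "[cc]" can be sketched as a small Python parser. The pattern is an assumption inferred from form (9) and the examples in this section; the count is returned as an inclusive (minimum, maximum) pair, with None standing for "[?]". All names are hypothetical.

```python
import re

# Assumed grammar for form (9); "(?)" is accepted for unknown residue types.
REGION_SPEC = re.compile(
    r"^(?P<accession>.+?):pm\.\((?P<residues>[A-Z]+|\?)\)"
    r"(?P<start>\d+)-(?P<end>\d+)"
    r"(?P<operator>[+=#])(?P<value>[^;\[\]]+)"
    r"(?:\[(?P<count>\?|\d+(?:-\d+)?)\])?;$"
)


def parse_count(text):
    """Map the content of '[cc]' to an inclusive (min, max) pair; None means unknown."""
    if text is None:
        return (1, 1)  # "[cc]" omitted: the default value is "[1]"
    if text == "?":
        return None
    low, _, high = text.partition("-")
    return (int(low), int(high or low))


def parse_region_modification(spec):
    match = REGION_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not a region specification: {spec!r}")
    parts = match.groupdict()
    parts["start"], parts["end"] = int(parts["start"]), int(parts["end"])
    parts["count"] = parse_count(parts["count"])
    return parts


print(parse_region_modification("up|QRTFRC|:pm.(ST)141-147+Phospho;"))
print(parse_region_modification("up|QRTFRC|:pm.(ST)141-147+Phospho[1-3];"))
```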

Modification within a region on unknown residues

If the type of residue modified is unknown, the character "?" is used instead of a list of specific residues:

(11)              up|QRTFRC|:pm.(?)20-34#22[2];

indicates that two residues in the protein region including residues 20 to 34 have been modified by the addition of 22 Da.
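
When the sequence of the region is available, the positions that a region specification could refer to can be listed directly. The Python sketch below uses the peptide "141 SSSTQLR 147" from the earlier example; the function name and the idea of enumerating candidate sites are illustrative assumptions, not part of the notation.

```python
def candidate_sites(peptide, start, residues="?"):
    """List (protein co-ordinate, residue) pairs that could carry the modification."""
    return [
        (start + offset, amino_acid)
        for offset, amino_acid in enumerate(peptide)
        # "?" means the residue type is unknown, so every position qualifies.
        if residues == "?" or amino_acid in residues
    ]


print(candidate_sites("SSSTQLR", 141, "ST"))
print(candidate_sites("SSSTQLR", 141, "?"))
```

With "(ST)" only the serine and threonine positions remain; with "(?)" every residue in the region is a candidate.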

Multiple modifications in a region

Experimentally determined modifications in a region are frequently reported together. To facilitate this, multiple modification specifications may be listed together in a single string, using a semi-colon ";" to separate the individual specifications:

(12)              ACCESSION:pm.MOD_SPEC1;MOD_SPEC2; etc.

For example, using one of the examples above:

(13)              up|QRTFRC|:pm.S141+Phospho;T144+Phospho;Q145+Deamidated;

would be the notation for two specific residue phosphorylations and one specific deamidation.

Combinations of different styles of modification specification can be used together in this type of notation. The following is equivalent to the previous example:

(14)              up|QRTFRC|:pm.S141+Phospho;T144=MOD:00047;Q145#0.984;
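
Splitting such a string into its individual specifications is straightforward, because the accession cannot contain ":pm." and every specification ends with a semi-colon. A minimal Python sketch, with a hypothetical function name:

```python
def split_specifications(notation):
    """Return (accession, list of individual modification specifications)."""
    accession, marker, body = notation.strip().partition(":pm.")
    if not marker:
        raise ValueError("missing ':pm.' after the accession")
    if not body.endswith(";"):
        raise ValueError("a complete specification must end with ';'")
    # Each item keeps its own style (+, = or #), so styles can be mixed freely.
    return accession, [item for item in body.split(";") if item]


print(split_specifications("up|QRTFRC|:pm.S141+Phospho;T144=MOD:00047;Q145#0.984;"))
```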

Adding comments

In many practical cases it may be useful to add text comments, hints or other types of unspecified extensions to this type of notation. The special character combination "\\" can be used as a delimiter to insert any comment you wish into a modification specification. For example, if your application requires the amino acid sequence of a peptide to be part of the specification, you could insert it using a comment:

(15)               up|QRTFRC|:pm.(ST)141-147\\SSSTQLR\\+Phospho;

A comment can be any text string that does not include "\\", and it can be inserted anywhere in the notation. Parsers should remove comments prior to parsing the notation, deleting the comment text and its delimiters with no replacement characters inserted.
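
Comment removal can be done with a single substitution before any other parsing. The Python sketch below treats "\\" as two backslash characters, as written in example (15); the rejection of an unpaired delimiter is an assumption, since this document does not say how that case should be handled.

```python
import re

# Two backslashes, any text without that pair, then two backslashes again.
COMMENT = re.compile(r"\\\\.*?\\\\")


def strip_comments(notation):
    """Delete every comment together with its delimiters, inserting nothing."""
    stripped = COMMENT.sub("", notation)
    if "\\\\" in stripped:
        # Assumption: a delimiter without a partner is treated as an error.
        raise ValueError("unpaired comment delimiter")
    return stripped


print(strip_comments(r"up|QRTFRC|:pm.(ST)141-147\\SSSTQLR\\+Phospho;"))
```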

White space and readability

The notation used in this document is largely for machine use: its main application was designed to be communicating experimental results to and from computer software. To facilitate its use in human-readable forms, white space may be added before and after any of the specified punctuation, specifically the characters : . + = ; # ( ) [ ]. This white space will be removed by any automated parser and should not be interpreted as meaningful by the reader.

For example, the previous modification notation may be written using different combinations of white space while retaining the same meaning:

(16.1)             up|QRTFRC| : pm. S141+Phospho ;T144=MOD:00047 ;Q145#0.984 ;

                                          or

(16.2)             up|QRTFRC|:pm. S141 + Phospho; T144 = MOD:00047; Q145 # 0.984;

Other readability considerations:

  1. the order in which modifications are specified in this notation does not have any meaning;
  2. white space before or at the end of a complete specification will be ignored;
  3. protein accession numbers cannot contain the character combinations ":pm." or ":p."; and
  4. a complete specification must be terminated with a semi-colon.
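
These rules suggest a simple normalisation step: remove the white space around the listed punctuation, check the terminating semi-colon, and compare specifications as an unordered set. The Python sketch below is one possible reading of the rules; the function names are hypothetical.

```python
import re

# White space is allowed before and after these characters: : . + = ; # ( ) [ ]
PUNCTUATION = r"[:.+=;#()\[\]]"
AROUND_PUNCTUATION = re.compile(rf"\s*({PUNCTUATION})\s*")


def normalize(notation):
    """Remove optional white space and check the terminating semi-colon."""
    compact = AROUND_PUNCTUATION.sub(r"\1", notation.strip())
    if not compact.endswith(";"):
        raise ValueError("a complete specification must end with ';'")
    return compact


def equivalent(first, second):
    """Compare two notations, ignoring white space and the order of specifications."""
    def key(notation):
        accession, _, body = normalize(notation).partition(":pm.")
        return accession, frozenset(item for item in body.split(";") if item)
    return key(first) == key(second)


print(normalize("up|QRTFRC| : pm. S141 + Phospho ; T144 = MOD:00047 ;"))
print(equivalent("up|QRTFRC|:pm.S141+Phospho;T144=MOD:00047;",
                 "up|QRTFRC|:pm. T144 = MOD:00047; S141 + Phospho;"))
```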

Comparison with XML representation

There has been a conscientious effort by the HUPO Proteomics Standards Initiative to create an eXtensible Markup Language (XML) format capable of storing a variety of elements associated with observed protein modifications. This language (MzIdentML) does not have all of the capabilities of the notation given above; however, it can be used to represent specific, individual modifications. The goal of our notation is not to replace MzIdentML (or other languages like it), but to make the associated text simpler to send (and interpret) for a wider range of annotation cases than was possible using the more general XMLs.

For example, the oxidation of methionine 240 in α-actinin-4 can be represented by:

(17a)             IPI00013808.1:pm. M240 + Oxidation;

This modification can also be represented in MzIdentML, but a minimal, valid XML document to replace (17a) would require the following text (an abridged version of an example from the MzIdentML project web site):

(17b)
<?xml version="1.0" encoding="UTF-8"?>
<MzIdentML id="MPC_use_case_handcrafted_edited" 
           name="MPC_example_edited" 
           creationDate="2011-12-14T16:00:00"
           version="1.1.0"
           xsi:schemaLocation="http://psidev.info/psi/pi/mzIdentML/1.1 ../../schema/mzIdentML1.1.0.xsd"
           xmlns="http://psidev.info/psi/pi/mzIdentML/1.1"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <cvList>
       <cv id="PSI-MS" 
            fullName="Proteomics Standards Initiative Mass Spectrometry Vocabularies"
            uri="...."
            version="2.25.0"/>
       <cv id="BTO"
            fullName="BRENDA tissue / enzyme source" 
            uri="http://www.brenda-enzymes.info/"
            version="12/2007"/>
       <cv id="UNIMOD" 
           fullName="UNIMOD CV for modifications" 
           uri="http://www.unimod.org/obo/unimod.obo"/>
       <cv id="UO" 
           fullName="Unit Ontology" 
           uri="http://obo.cvs.sourceforge.net/*checkout*/obo/obo/ontology/phenotype/unit.obo"/>
   </cvList>
   <SequenceCollection>
       <DBSequence id="prot1_IPI" 
            accession="IPI00013808.1" 
            searchDatabase_ref="ipi.HUMAN_decoy">
            <Seq>
                MVDYHAANQSYQYGPSSAGNGAGGGGSMGDYMAQEDDWDRDLLLDPAWEKQQRKTFTAWCN
                SHLRKAGTQIENIDEDFRDGLKLMLLLEVISGERLPKPERGKMRVHKINNVNKALDFIASK
                GVKLVSIGAEEIVDGNAKMTLGMIWTIILRFAIQDISVEETSAKEGLLLWCQRKTAPYKNV
                NVQNFHISWKDGLAFNALIHRHRPELIEYDKLRKDDPVTNLNNAFEVAEKYLDIPKMLDAE
                DIVNTARPDEKAIMTYVSSFYHAFSGAQKAETAANRICKVLAVNQENEHLMEDYEKLASDL
                LEWIRRTIPWLEDRVPQKTIQEMQQKLEDFRDYRRVHKPPKVQEKCQLEINFNTLQTKLRL
                SNRPAFMPSEGKMVSDINNGWQHLEQAEKGYEEWLLNEIRRLERLDHLAEKFRQKASIHEA
                WTDGKEAMLKHRDYETATLSDIKALIRKHEAFESDLAAHQDRVEQIAAIAQELNELDYYDS
                HNVNTRCQKICDQWDALGSLTHSRREALEKTEKQLEAIDQLHLEYAKRAAPFNNWMESAME
                DLQDMFIVHTIEEIEGLISAHDQFKSTLPDADREREAILAIHKEAQRIAESNHIKLSGSNP
                YTTVTPQIINSKWEKVQQLVPKRDHALLEEQSKQQSNEHLRRQFASQANVVGPWIQTKMEE
                IGRISIEMNGTLEDQLSHLKQYERSIVDYKPNLDLLEQQHQLIQEALIFDNKHTNYTMEHI
                RVGWEQLLTTIARTINEVENQILTRDAKGISQEQMQEFRASFNHFDKDHGGALGPEEFKAC
                LISLGYDVENDRQGEAEFNRIMSLVDPNHSGLVTFQAFIDFMSRETTDTDTADQVIASFKV
                LAGDKNFITAEELRRELPPDQAEYCIARMAPYQGPDAVPGALDYKSFSTALYGESDL
           </Seq>
           <cvParam accession="MS:1001088" 
               name="protein description"
               cvRef="PSI-MS" 
               value="IPI:IPI00013808.1; Tax_Id=9606 Gene_Symbol=ACTN4 Alpha-actinin-4"/>
       </DBSequence>
       <Peptide id="prot1_pep4">
           <PeptideSequence>MLDAEDIVNTARPDEK</PeptideSequence>
           <Modification location="1">
               <cvParam accession="UNIMOD:35"
                        name="Oxidation" 
                        cvRef="UNIMOD"/>
           </Modification>
       </Peptide>
       <PeptideEvidence id="PE1_SEQ_spec10_pep1" 
            peptide_ref="prot1_pep4" 
            start="240" 
            end="255" 
            pre="K" 
            post="A" 
            isDecoy="false"
            dBSequence_ref="prot1_IPI" />
   </SequenceCollection>
</MzIdentML>
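
The relationship between the two representations can be made concrete with a short Python sketch that reads an abridged fragment of (17b) and emits the compact notation. It assumes that the "location" attribute counts residues within the peptide starting at 1, and that "start" is the protein co-ordinate of the first residue of the peptide; the function name and the trimmed fragment are illustrative only.

```python
import xml.etree.ElementTree as ET

NS = {"mz": "http://psidev.info/psi/pi/mzIdentML/1.1"}

# Abridged fragment of (17b): only the elements read below are kept.
DOCUMENT = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SequenceCollection>
    <DBSequence id="prot1_IPI" accession="IPI00013808.1"/>
    <Peptide id="prot1_pep4">
      <PeptideSequence>MLDAEDIVNTARPDEK</PeptideSequence>
      <Modification location="1">
        <cvParam accession="UNIMOD:35" name="Oxidation" cvRef="UNIMOD"/>
      </Modification>
    </Peptide>
    <PeptideEvidence id="PE1_SEQ_spec10_pep1" peptide_ref="prot1_pep4"
                     start="240" end="255" dBSequence_ref="prot1_IPI"/>
  </SequenceCollection>
</MzIdentML>"""


def to_notation(xml_text):
    """Convert the first PeptideEvidence and its modifications to ':pm.' notation."""
    root = ET.fromstring(xml_text)
    evidence = root.find(".//mz:PeptideEvidence", NS)
    peptide = root.find(f".//mz:Peptide[@id='{evidence.get('peptide_ref')}']", NS)
    protein = root.find(f".//mz:DBSequence[@id='{evidence.get('dBSequence_ref')}']", NS)
    sequence = peptide.find("mz:PeptideSequence", NS).text.strip()
    specs = []
    for modification in peptide.findall("mz:Modification", NS):
        location = int(modification.get("location"))
        if location < 1:
            continue  # terminal modifications are outside the scope of this sketch
        # Peptide position 1 corresponds to protein co-ordinate "start".
        position = int(evidence.get("start")) + location - 1
        name = modification.find("mz:cvParam", NS).get("name")
        specs.append(f"{sequence[location - 1]}{position}+{name};")
    return f"{protein.get('accession')}:pm." + "".join(specs)


print(to_notation(DOCUMENT))
```

For the fragment shown, the result is the same specification as (17a), written without optional white space.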

Comments and suggestions

Anyone interested in making suggestions or commenting on the ideas in this document should send them by email to Ron Beavis, rbeavis@thegpm.org.

Draft, last changed 2011.12.14
