Nomenclature for the description of protein sequence modifications

From TheGPMWiki

This document is a request for comment on a proposed new notation for describing protein residue modifications. The RFC process began on December 14, 2011 and ended on January 14, 2012. The RFC has been adopted by GPM and GPMDB.

The JSON extension of the original specification began September 14, 2012.

While there have been efforts to create ontologies and controlled vocabularies to describe the various types of amino acid modifications that can be observed in proteomics, there has been no proposed succinct notation for describing those modifications in their biological context. Existing methods of describing protein modifications also tend to be "crisp" (assigning modifications to specific residues) while most measurements are less precise (assigning modifications to regions of a protein). This document lays out a notation meant to be used to describe the results of proteomics experiments realistically, using a format similar to the Human Genome Variation Society's (HGVS) notation for describing protein amino acid polymorphisms. When results have been rendered into this notation, they can then be used to query databases of existing proteomics information to generate validation, QA/QC or general information about residue modifications relevant to a particular protein sequence. These queries can be constructed from experimental data, domain-prediction algorithms or any other type of research into protein modifications and amino acid polymorphisms.

In addition to the compact text notation, this document also provides a JSON-formatted alternative that can be used for the automated exchange of information. JSON (JavaScript Object Notation) is a commonly used method to encapsulate complex information in a machine readable form. Many programming languages have specialized parsers that make the use of information in JSON format comparatively simple.



§ 1. In order to use this notation, it is necessary to meet the following preconditions:

  1. the protein sequence being annotated must be archived and readily available;
  2. the accession number used for the protein must be unique, i.e., there must be a 1:1 relationship between the accession number and the protein sequence;
  3. the annotation is not case sensitive; and
  4. the annotation should reflect the information available as completely as possible.

Like the HGVS notation, amino acid residues may only be represented as either their single letter codes or three letter codes. For example, "S" or "Ser" may be used to represent serine. Unlike the HGVS notation, the use of three letter codes is strongly discouraged: they should only be used in cases where there is no acceptable single-letter code substitute.

The full set of commonly accepted amino acid codes for protein residues is given here.

Simple modifications

§ 2. The main purpose of this document is to describe a clear, easy-to-use notation for protein amino acid residue modification. This section describes the fundamental building blocks of such a notation. It is important to note that this notation does not depend on a specific ontology or vocabulary for describing specific modifications: that choice is left to implementers.

(+) Specifying modifications as side chain changes

§ 2.1. The general nomenclature format proposed for the case where the modification is being described as a change to the structure of the amino acid residue in question:

(2.1.1) COMPACT:  ACCESSION:pm.Xnn+MODIFICATION;
           JSON:  {"acc" : "ACCESSION", "res" : "X", "pos" : nn, "mod" : "MODIFICATION"}

where "ACCESSION" is the accession number for the protein, "pm." indicates that it is a protein modification, "X" is the single letter symbol for the amino acid residue, "nn" is its ordinal position in protein co-ordinates and "+MODIFICATION" specifies the change. For example, using the PSI-MS vocabulary for modifications, the notation

(2.1.2) COMPACT:  ENSP00000339186:pm.T262+Phospho;
           JSON:  {"acc" : "ENSP00000339186", "res" : "T", "pos" : 262, "mod" : "Phospho"}

indicates that for the protein accession number ENSP00000339186, the threonine residue number 262 is phosphorylated. It should be noted that the "+" symbol indicates that the side chain has changed: the change may result in either an increase or decrease in residue molecular mass.
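
The mapping between the compact and JSON forms in § 2.1 can be sketched in a few lines of Python. This is only an illustration, not a reference parser: the pattern name SIDE_CHAIN and the function side_chain_to_json are hypothetical, and the sketch assumes a single-letter residue code with no white space or comments in the input.

```python
import json
import re

# Hypothetical pattern for the compact form ACCESSION:pm.Xnn+MODIFICATION;
# Assumes a single-letter residue code and no white space or comments.
SIDE_CHAIN = re.compile(
    r"^(?P<acc>.+):pm\.(?P<res>[A-Za-z])(?P<pos>\d+)\+(?P<mod>[^;]+);$",
    re.IGNORECASE,  # the annotation is not case sensitive (precondition 3)
)


def side_chain_to_json(spec: str) -> dict:
    """Convert a compact side-chain specification into a JSON-style dict."""
    match = SIDE_CHAIN.match(spec)
    if match is None:
        raise ValueError(f"not a simple side-chain specification: {spec!r}")
    return {
        "acc": match["acc"],
        "res": match["res"].upper(),
        "pos": int(match["pos"]),
        "mod": match["mod"],
    }


if __name__ == "__main__":
    print(json.dumps(side_chain_to_json("ENSP00000339186:pm.T262+Phospho;")))
```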

(=) Specifying modifications as amino acid residue substitutions

§ 2.2. Another approach taken to specifying residue modifications is to consider each modified residue to be a completely new amino acid: the PSI-MOD vocabulary uses this philosophy. If this type of residue substitution idea is to be used, then the notation will be:

(2.2.1) COMPACT:  ACCESSION:pm.Xnn=NEW_RESIDUE;
           JSON:  {"acc" : "ACCESSION", "res" : "X", "pos" : nn, "sub" : "NEW_RESIDUE"}

where "=NEW_RESIDUE" uses the residue-specification ontology required. Using this strategy for naming the modification in the previous example:

(2.2.2) COMPACT:  ENSP00000339186:pm.T262=MOD:00047;
           JSON:  {"acc" : "ENSP00000339186", "res" : "T", "pos" : 262, "sub" : "MOD:00047"}

indicates that for protein accession number ENSP00000339186, the threonine residue number 262 has been replaced with the residue specified by the PSI-MOD ID "MOD:00047" (O-phospho-L-threonine).

There are some common amino acid variants that are probably best reported in this format. Amino acid changes caused by tRNA misloading, such as norvaline, norleucine or selenomethionine, are commonly thought of as residues rather than as modifications, so the "=" notation may be the clearest choice. Selenocysteine, citrulline, ornithine and homoserine are examples of other non-standard residues that may be best reported using the "=" notation.

(#) Specifying modifications as a change in mass

§ 2.3. Sometimes, when using mass spectrometry based approaches to proteomics, it is not possible (or wise) to attribute a particular measured mass difference to a specific residue modification. In such a case, the appropriate nomenclature would be:

(2.3.1) COMPACT:  ACCESSION:pm.Xnn#DELTA;
           JSON:  {"acc" : "ACCESSION", "res" : "X", "pos" : nn, "del" : DELTA}

where "#DELTA" is the difference between the measured mass and the residue mass of "X" in Daltons. Using the example above,

(2.3.2) COMPACT:  ENSP00000339186:pm.T262#79.9663;
           JSON:  {"acc" : "ENSP00000339186" , "res" : "T", "pos" : 262, "del" : 79.9663}

indicates that for protein accession number ENSP00000339186, an experimental measurement supports the assignment that the threonine at residue number 262 has been modified by the addition of 79.9663 Daltons.

The number used in this style of notation should be governed by the normal rules for significant figures in scientific reporting. For example, if the error in a number is ± 0.1 Da, then the number should only be reported to one decimal place.
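
As a rough illustration of the significant-figure rule, the following Python sketch derives the number of decimal places from the measurement error and formats the mass difference accordingly. The function names decimals_for_error and format_delta are hypothetical, and the rule used (keep digits down to the leading digit of the error) is one common reading of the guideline rather than a requirement of this document.

```python
import math


def decimals_for_error(error_da: float) -> int:
    """Number of decimal places justified by a mass error given in Daltons."""
    if error_da <= 0:
        raise ValueError("error must be positive")
    # The leading digit of the error sets the last meaningful decimal place.
    # The small offset guards against floating point noise in log10.
    return max(0, -math.floor(math.log10(error_da) + 1e-9))


def format_delta(delta_da: float, error_da: float) -> str:
    """Format a mass difference to the precision supported by its error."""
    return f"{delta_da:.{decimals_for_error(error_da)}f}"


if __name__ == "__main__":
    # With an error of 0.1 Da only one decimal place is reported.
    print(f"ENSP00000339186:pm.T262#{format_delta(79.9663, 0.1)};")
```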

(*) Protein backbone cleavage

§ 2.4. One of the most common protein post-translational modifications is the scission of the protein's peptide backbone by hydrolysis to form two new chains. These scissions can be part of many biological processes, such as activating a pro-protein, removal of a signal sequence or inactivation of a mature protein. The notation for this type of modification will be:

(2.4.1) COMPACT:  ACCESSION:pm.Xnn*;
           JSON:  {"acc" : "ACCESSION", "res" : "X", "cut" : nn}

which implies that the protein corresponding to ACCESSION has a backbone cleavage between residues nn and nn+1. For example:

(2.4.2) COMPACT:  ENSP00000339186:pm.M1*;
           JSON:  {"acc" : "ENSP00000339186", "res" : "M", "cut" : 1}

implies that this protein sequence is cleaved between residues 1 and 2.
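
The co-ordinate convention for cleavage can be made concrete with a short Python sketch. The function name cleave and the sequence used below are made up for illustration; the only thing taken from this section is that a "cut" value of nn means a scission between residues nn and nn+1, with residues numbered from 1.

```python
from __future__ import annotations


def cleave(sequence: str, cut: int) -> tuple[str, str]:
    """Split a sequence between residues cut and cut+1 (1-based co-ordinates)."""
    if not 1 <= cut < len(sequence):
        raise ValueError("cut must lie between residue 1 and the penultimate residue")
    # Python slices are 0-based, so the first `cut` residues form the N-terminal chain.
    return sequence[:cut], sequence[cut:]


if __name__ == "__main__":
    # Hypothetical sequence, not the real sequence of ENSP00000339186.
    n_chain, c_chain = cleave("MKTAYIAK", 1)
    print(n_chain, c_chain)
```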

(:p.) Specifying amino acid residue changes caused by nucleic acid sequence variants

§ 2.5. In the special case where the peptide sequence has been affected by changes at the DNA or mRNA level (e.g., amino acid polymorphisms, deletions, insertions, indels, frameshifts, etc.), the HGVS annotation (:p.) should be used rather than the (:pm.) modification annotation. In the most common case, an amino acid polymorphism generated by a non-synonymous single nucleotide polymorphism, the HGVS notation would simply be as follows:

           JSON:  {"acc" : "ACCESSION", "res" : "X", "pos" : nn, "poly" : "Y",
                   "del" : DELTA, "id" : "ID"}

where "X" is the residue in the sequence associated with the accession number, "Y" is the new residue, "DELTA" (optional) is the associated mass change in Daltons and "ID" (optional) is an external identifier or accession number for associated SNP.

({f}) Modification site occupancy

§ 2.6. In most real cases a residue is not fully converted to a modified form. Instead, in a population of protein molecules only a fraction of the residues at a particular site will be modified. To express this concept, the fraction of residues that have been modified can be added to the specification, using curly braces, e.g.:

(2.6.1) COMPACT:  ENSP00000339186:pm.T262+Dehydration{0.8};
           JSON:  {"acc" : "ENSP00000339186" , "res":"T", "pos" : 262,
                   "mod" : "Dehydration" , "occ" : 0.8}

where "{0.8}" is the fraction of residues with this modification.
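
A minimal Python sketch of how a parser might separate the occupancy suffix from the rest of a single modification item is given below. The names OCCUPANCY and split_occupancy are hypothetical, and the check that the value lies between 0 and 1 is an assumption based on the description of the value as a fraction.

```python
from __future__ import annotations

import re

# Hypothetical pattern: an item optionally ending in {fraction}.
OCCUPANCY = re.compile(r"^(?P<body>.*?)\{(?P<occ>\d*\.?\d+)\}$")


def split_occupancy(item: str) -> tuple[str, float | None]:
    """Return the item without its occupancy suffix, plus the occupancy if given."""
    match = OCCUPANCY.match(item)
    if match is None:
        return item, None  # no occupancy was specified
    occ = float(match["occ"])
    if not 0.0 <= occ <= 1.0:
        raise ValueError(f"occupancy must be a fraction between 0 and 1: {occ}")
    return match["body"], occ


if __name__ == "__main__":
    print(split_occupancy("T262+Dehydration{0.8}"))
```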

(x/y) Ambiguous modifications

§ 2.7. Given current measurement technology, there are some cases in which it is impossible to distinguish between two (or more) types of modifications based on the evidence provided by an experiment. For example, when using mass spectrometry-based proteomics, the lysine modification trimethylation (K#42.046950) may be indistinguishable from acetylation (K#42.010565) unless unusual care is taken with the mass measurement. To represent this type of ambiguity, the possible modifications should be presented as a list, delimited by the "/" symbol, e.g.:

(2.7.1) COMPACT:  ENSP00000339186:pm.K26+Trimethyl/Acetyl;
           JSON:  {"acc" : "ENSP00000339186", "res" : "K",
                   "pos" : 26, "mod" : ["Trimethyl","Acetyl"]}

which implies that ENSP00000339186 residue lysine 26 has been modified by either trimethylation or acetylation.
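
To show why the trimethylation/acetylation case is ambiguous, the Python sketch below compares the gap between the two mass differences quoted above with a measurement tolerance. The function name distinguishable, the 2000 Da peptide mass and the ppm tolerances are illustrative assumptions; the criterion used (tolerance smaller than half the gap, so the two tolerance windows do not overlap) is one simple choice among several.

```python
TRIMETHYL_DA = 42.046950  # mass difference quoted in the text above
ACETYL_DA = 42.010565     # mass difference quoted in the text above
GAP_DA = abs(TRIMETHYL_DA - ACETYL_DA)  # about 0.036 Da


def distinguishable(peptide_mass_da: float, tolerance_ppm: float) -> bool:
    """True if the tolerance windows around the two candidates do not overlap."""
    tolerance_da = peptide_mass_da * tolerance_ppm / 1e6
    return tolerance_da < GAP_DA / 2


if __name__ == "__main__":
    # Hypothetical 2000 Da peptide measured at two different mass tolerances.
    for ppm in (5.0, 10.0):
        mods = ["Trimethyl"] if distinguishable(2000.0, ppm) else ["Trimethyl", "Acetyl"]
        print(ppm, "/".join(mods))
```

When the check fails, both candidates would be listed with the "/" delimiter; when it passes, a further comparison against the measured mass (not shown) would pick one of them.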

Modifications in a region

§ 3. In this case, the specific residue modified cannot be specified from a measurement. It is possible, however, to determine a specific region of the protein that has been modified. For example, in mass spectrometry-based proteomics it is common to use tryptic peptides to identify modifications and each tryptic peptide corresponds to a region of the original protein. Similarly, alanine-scanning techniques may show similar biological effects of a modification at several closely sited similar residues.

The word "region" in this context means any continuous peptide in a protein primary sequence, defined by a start and end residue. These residues are numbered from the N-terminus of the protein, which is defined as residue 1. The ordinal number of each residue may be referred to as the protein co-ordinate of that residue.

Modification within a region on specific residues

§ 3.1. The general format for this type of notation is:

(3.1.1) COMPACT:  ACCESSION:pm.(XY)mm-nn+MODIFICATION[cc];
           JSON:  {"acc" : "ACCESSION", "res" : ["X","Y"],
                   "pos" : {"start" : mm, "end" : nn}, "mod" : "MODIFICATION", "occ" : cc}

where "ACCESSION" is the protein accession number, "(XY)" is a list of single letter amino acid codes giving the specific types of residues that may be modified, "mm-nn" are the start and end of the region in protein co-ordinates and "[cc]" is the number of residues in the region with this modification. For example, if a serine/threonine phosphorylation has been traced to the peptide "141 SSSTQLR 147" of the protein with the Uniprot accession code "up|QRTFRC|", this would be expressed as:

(3.1.2) COMPACT:  up|QRTFRC|:pm.(ST)141-147+Phospho[1];
           JSON:  {"acc" : "up|QRTFRC|", "res" : ["S", "T"],
                   "pos" : {"start" : 141, "end" : 147}, "mod" : "Phospho", "occ" : 1}

If "[cc]" is not specified, it is taken to be the default value "[1]". The number of modifications may be a range, e.g. "[1-3]". If the range is unknown, then "cc" should be replaced with a question mark, e.g. "[?]".
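
The relationship between a peptide, its protein co-ordinates and the region notation can be sketched as follows. The function names candidate_sites and region_spec are hypothetical; the sketch assumes the peptide sequence and its start co-ordinate are already known, as in the "141 SSSTQLR 147" example.

```python
from __future__ import annotations


def candidate_sites(peptide: str, start: int, residues: str) -> list[str]:
    """List residue/co-ordinate pairs in a region whose type is in `residues`."""
    wanted = set(residues.upper())
    return [f"{aa}{start + i}" for i, aa in enumerate(peptide.upper()) if aa in wanted]


def region_spec(acc: str, peptide: str, start: int, residues: str,
                mod: str, count: int = 1) -> str:
    """Build a region specification of the form used in this section."""
    end = start + len(peptide) - 1  # co-ordinate of the last residue in the region
    return f"{acc}:pm.({residues}){start}-{end}+{mod}[{count}];"


if __name__ == "__main__":
    print(candidate_sites("SSSTQLR", 141, "ST"))
    print(region_spec("up|QRTFRC|", "SSSTQLR", 141, "ST", "Phospho"))
```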

In a case where there is evidence for the modification of specific residues within a region but not others of the same type, a comma separated list of residues-coordinates pairs may be used. For example, using the previous sequence, if there was evidence that a single phosphorylation could be on S141 or S143, but not S142 or T144, then the appropriate notation would be:

(3.1.3) COMPACT:  up|QRTFRC|:pm.S141,S143+Phospho[1];
           JSON:  {"acc" : "up|QRTFRC|", "res" : ["S"],
                   "pos" : [141, 143], "mod" : "Phospho", "occ" : 1}

Modification within a region on unknown residues

§ 3.2. If the type of residue modified is unknown, the character "?" is used instead of a list of specific residues:

(3.2.1) COMPACT:  up|QRTFRC|:pm.(?)20-34#22[2];
           JSON:  {"acc" : "up|QRTFRC|", "res" : "?",
                   "pos" : {"start" : 20, "end" : 34}, "del" : 22, "occ" : 2}

indicates that two residues in the protein region including residues 20 to 34 have been modified by the addition of 22 Da.

Multiple modifications in a region

§ 4. Experimentally determined modifications in a region are frequently reported together. To facilitate this, multiple modification specifications may be listed together in a single string, using a semi-colon ";" to separate the individual specifications:

         JSON:  [JSON1, JSON2, ... ]

For example, using one of the examples above:

(4.2) COMPACT:  up|QRTFRC|:pm.S141+Phospho;T144+Phospho;Q145+Deamidated;
         JSON:  [{"acc" : "up|QRTFRC|", "res" : "S", "pos" : 141, "mod" : "Phospho"},
                 {"acc" : "up|QRTFRC|", "res" : "T", "pos" : 144, "mod" : "Phospho"},
                 {"acc" : "up|QRTFRC|", "res" : "Q", "pos" : 145, "mod" : "Deamidated"}]

would be the notation for two specific residue phosphorylations and one specific deamidation.

Combinations of different styles of modification specification can be used together in this type of notation. The following is equivalent to the previous example:

(4.3) COMPACT:  up|QRTFRC|:pm.S141+Phospho;T144=MOD:00047;Q145#0.984;
         JSON:  [{"acc" : "up|QRTFRC|", "res" : "S", "pos" : 141, "mod" : "Phospho"},
                 {"acc" : "up|QRTFRC|", "res" : "T", "pos" : 144, "sub" : "MOD:00047"},
                 {"acc" : "up|QRTFRC|", "res" : "Q", "pos" : 145, "del" : 0.984}]
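
A Python sketch of how a combined specification might be split into one JSON-style record per modification is shown below. The names ITEM, KEYS and split_specification are hypothetical. The sketch handles only single-residue items using "+", "=" or "#", assumes comments and white space have already been removed, and emits positions as numbers, as in § 2.1.

```python
from __future__ import annotations

import re

# Hypothetical pattern for one single-residue item: Xnn followed by +, = or #.
ITEM = re.compile(r"^(?P<res>[A-Za-z])(?P<pos>\d+)(?P<op>[+=#])(?P<value>.+)$")
# JSON key used for each style of specification in § 2.1 to § 2.3.
KEYS = {"+": "mod", "=": "sub", "#": "del"}


def split_specification(spec: str) -> list[dict]:
    """Split ACCESSION:pm.ITEM;ITEM;...; into one record per item."""
    acc, sep, body = spec.strip().partition(":pm.")
    if not sep or not body.endswith(";"):
        raise ValueError("expected ACCESSION:pm.<items> terminated by ';'")
    records = []
    for item in body[:-1].split(";"):
        match = ITEM.match(item)
        if match is None:
            raise ValueError(f"unsupported item: {item!r}")
        # Mass differences are numeric; the other styles carry a name or identifier.
        value = float(match["value"]) if match["op"] == "#" else match["value"]
        records.append({
            "acc": acc,
            "res": match["res"].upper(),
            "pos": int(match["pos"]),
            KEYS[match["op"]]: value,
        })
    return records


if __name__ == "__main__":
    for record in split_specification(
        "up|QRTFRC|:pm.S141+Phospho;T144=MOD:00047;Q145#0.984;"
    ):
        print(record)
```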

Occupancy can be used to describe cases where a single residue may be modified to form more than one final post-translational modification product. For example:

(4.4) COMPACT:  ENSP00000339186:pm.T262+Phospho{0.33};T262+GlcNAc{0.5};
         JSON:  [{"acc" : "ENSP00000339186", "res" : "T",
                  "pos" : 262, "mod" : "Phospho", "occ" : 0.33},
                 {"acc" : "ENSP00000339186", "res" : "T",
                  "pos" : 262, "mod" : "GlcNAc", "occ" : 0.5}]

implies that for protein ENSP00000339186, the fraction of proteins with threonine residue 262 modified by either phosphorylation or glycosylation is 0.33 or 0.5, respectively.

Adding comments

§ 5. In many practical cases it may be useful to add text comments, hints or other types of unspecified extensions to this type of notation. The special character combination "\\" can be used as a delimiter to insert any comment you wish into a modification specification. For example, if your application requires any additional text blocks as part of the specification, you could insert them using comments:

(5.1) COMPACT:  up|QRTFRC|:pm.(ST)141-147\\COMMENT 1\\+Phospho\\COMMENT 2\\;
         JSON:  {"acc" : "up|QRTFRC|", "res" : ["S", "T"],
                 "pos" : {"start" : 141, "end" : 147}, "mod" : "Phospho",
                 "comment" : ["COMMENT 1", "COMMENT 2"]}

Comments can be any text string that does not include "\\" and can be inserted anywhere in the notation. Parsers should remove comments before parsing the notation, deleting all comment characters without inserting replacement characters.
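
Comment removal as described above can be sketched with a single regular expression in Python. The names COMMENT, strip_comments and extract_comments are hypothetical, and the sketch assumes that comment delimiters always occur in pairs.

```python
from __future__ import annotations

import re

# A comment is any text enclosed by a pair of "\\" (two backslash) delimiters.
COMMENT = re.compile(r"\\\\(.*?)\\\\", re.DOTALL)


def strip_comments(spec: str) -> str:
    """Delete all comments, inserting no replacement characters."""
    return COMMENT.sub("", spec)


def extract_comments(spec: str) -> list[str]:
    """Return the comment texts in the order they appear."""
    return COMMENT.findall(spec)


if __name__ == "__main__":
    # A raw string is used so that each \\ stays as two backslash characters.
    spec = r"up|QRTFRC|:pm.(ST)141-147\\COMMENT 1\\+Phospho\\COMMENT 2\\;"
    print(strip_comments(spec))
    print(extract_comments(spec))
```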

White space and readability

§ 6. The notation used in this document is largely for machine use: it was designed mainly for communicating experimental results to and from computer software. In order to facilitate its use in human-readable forms, white space may be added before and following any of the specified punctuation, specifically the characters : . + = ; # ( ) [ ]. The white space will be removed by any automated parser and should not be interpreted as meaningful by the reader.

For example, the previous modification notation may be written using different combinations of white space while retaining the same meaning:

(6.1)             up|QRTFRC| : pm. S141+Phospho ;T144=MOD:00047 ;Q145#0.984 ;


(6.2)             up|QRTFRC|:pm. S141 + Phospho; T144 = MOD:00047; Q145 # 0.984;

Other readability considerations:

  1. the order in which modifications are specified in this notation does not have any meaning;
  2. white space before or at the end of a complete specification will be ignored;
  3. protein accession numbers cannot contain the character combinations ":pm." or ":p."; and
  4. a complete specification must be terminated with a semi-colon.
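
The white space rule can be sketched in Python as a normalization step applied before parsing. The names AROUND and normalize are hypothetical; the sketch removes white space only next to the punctuation characters listed in § 6 and assumes comments have already been stripped.

```python
import re

# The punctuation characters listed in § 6, written for use inside a character class.
PUNCTUATION = r":.+=;#()\[\]"
AROUND = re.compile(rf"\s*([{PUNCTUATION}])\s*")


def normalize(spec: str) -> str:
    """Remove white space before and after the listed punctuation characters."""
    return AROUND.sub(r"\1", spec.strip())


if __name__ == "__main__":
    print(normalize("  up|QRTFRC| : pm. S141+Phospho ;T144=MOD:00047 ;Q145#0.984 ;"))
    print(normalize("up|QRTFRC|:pm. S141 + Phospho; T144 = MOD:00047; Q145 # 0.984;"))
```

Both calls reduce the readable forms (6.1) and (6.2) to the same compact string.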

Comparison with XML representation

There has been a conscientious effort by the HUPO Proteomics Standards Initiative to create an eXtensible Markup Language (XML) format capable of storing a variety of elements associated with observed protein modifications. This language (MzIdentML) does not have all of the capabilities of the notation given above; however, it can be used to represent specific, individual modifications. The goal of our notation is not to replace MzIdentML (or other languages like it), but to make the associated text simpler to send (and interpret) for a wider range of annotation cases than may be possible using the necessarily more general XML representations.

For example, the oxidation of methionine 240 in α-actinin-4 can be represented, using the notation described here as the following text:

(7.1) COMPACT:  IPI00013808.1:pm. M240 + Oxidation;
         JSON:  {"acc" : "IPI00013808.1", "res" : "M", "pos" : 240, "mod" : "Oxidation"}

This modification can also be represented in MzIdentML, but a minimal, valid XML document to replace example (7.1) would require the text shown in example (7.2) (an abridged version of an example from the MzIdentML project web site):

<?xml version="1.0" encoding="UTF-8"?>
<MzIdentML id="MPC_use_case_handcrafted_edited" 
           xsi:schemaLocation=" ../../schema/mzIdentML1.1.0.xsd"
       <cv id="PSI-MS" 
            fullName="Proteomics Standards Initiative Mass Spectrometry Vocabularies"
       <cv id="BTO"
            fullName="BRENDA tissue / enzyme source" 
       <cv id="UNIMOD" 
           fullName="UNIMOD CV for modifications" 
       <cv id="UO" 
           fullName="Unit Ontology" 
       <DBSequence id="prot1_IPI" 
           <cvParam accession="MS:1001088" 
               name="protein description"
               value="IPI:IPI00013808.1; Tax_Id=9606 Gene_Symbol=ACTN4 Alpha-actinin-4"/>
       <Peptide id="prot1_pep4">
           <Modification location="1">
               <cvParam accession="UNIMOD:35"
       <PeptideEvidence id="PE1_SEQ_spec10_pep1" 
            dBSequence_ref="prot1_IPI" />

Implementation of the notation

The notation described in this document may have more options than are required for a particular implementation. If you have created a parser or interface that implements some part of the notation, please indicate that by listing the section numbers (§) that are relevant for your project in the user and API documentation. This document should be referred to as "GPM-2011.12.14".

It is recommended that if a section is listed, then the parser should comply as fully as possible with the specification in that section.

All implementations should comply with § 1 and § 5, as well as at least one of § 2.1, § 2.2, § 2.3 or § 2.4.
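
The minimum compliance rule above can be expressed as a small check. The names REQUIRED, ONE_OF and is_compliant are hypothetical, and section numbers are assumed to be given as strings such as "2.1".

```python
from __future__ import annotations

REQUIRED = {"1", "5"}                    # sections every implementation must support
ONE_OF = {"2.1", "2.2", "2.3", "2.4"}    # at least one of these must be supported


def is_compliant(sections: set[str]) -> bool:
    """Check a list of implemented sections against the minimum requirement."""
    return REQUIRED <= sections and bool(ONE_OF & sections)


if __name__ == "__main__":
    print(is_compliant({"1", "2.1", "2.6", "5"}))
    print(is_compliant({"1", "2.1"}))
```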

Revision date and status

Reference name    Revision date    Document status           Stable URL
GPM-2011.12.14    2012.11.16       Accepted specification