Nomenclature for the description of protein sequence modifications

From TheGPMWiki
Revision as of 20:13, 7 December 2011 by WikiSysop (Talk | contribs)

1. Rationale

While there have been efforts to create ontologies and controlled vocabularies to describe the various types of amino acid modifications that can be observed in proteomics, no succinct notation has been proposed for describing those modifications in their biological context. Existing methods of describing protein modifications also tend to be "crisp" (assigning modifications to specific residues), while most measurements are less precise (assigning modifications to protein domains). This document lays out a notation meant to be used to describe the results of proteomics experiments realistically, using a format similar to the Human Genome Variation Society (HGVS) notation for describing protein amino acid polymorphisms.

Unlike the HGVS notation, amino acid residues may only be represented as their single letter codes: three letter codes or full names are not allowed. For example, "S" may be used to represent serine, but "Ser" or "serine" are not allowed.

2. Simple modifications

2.1 (+) Specifying modifications as side chain changes

The general nomenclature format proposed for the case where the modification is described as a change to the structure of the amino acid residue in question is:

ACCESSION:pm.Xnn+MODIFICATION

where "ACCESSION" is the accession number for the protein, "pm." indicates that it is a protein modification, "X" is the single letter symbol for the amino acid residue, "nn" is its ordinal position in protein co-ordinates, and "+MODIFICATION" specifies the change. For example, using the PSI-MS vocabulary for modifications, the notation

ENSP00000339186:pm.T262+Phospho

indicates that for the protein accession number ENSP00000339186, the threonine residue number 262 is phosphorylated. It should be noted that the "+" symbol indicates that the side chain has changed: the change may result in either an increase or decrease in residue molecular mass.
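As an illustration, the side chain form can be split into its four fields with a regular expression. This is only a sketch: the pattern and the function name `parse_side_chain` are assumptions made here, not part of the notation itself.

```python
import re

# Illustrative pattern for ACCESSION:pm.Xnn+MODIFICATION (an assumption of
# this sketch, not something defined by the notation).
SIDE_CHAIN = re.compile(
    r"^(?P<accession>.+?):pm\."
    r"(?P<residue>[A-Z])(?P<position>\d+)"
    r"\+(?P<modification>.+)$"
)


def parse_side_chain(notation):
    """Split a side chain modification string into its four fields."""
    match = SIDE_CHAIN.match(notation)
    if match is None:
        raise ValueError(f"not a side chain modification: {notation!r}")
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


print(parse_side_chain("ENSP00000339186:pm.T262+Phospho"))
```

Because the residue must be a single upper-case letter followed directly by digits, a three letter code such as "Thr262" is rejected by this pattern.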

2.2 (=) Specifying modifications as new amino acid residues

Another approach taken to specifying residue modifications is to consider each modified residue to be a completely new amino acid: the PSI-MOD ontology uses this philosophy. If this type of residue replacement idea is to be used, then the notation will be:

ACCESSION:pm.Xnn=NEW_RESIDUE

where "=NEW_RESIDUE" uses the residue-specification ontology required. Using this strategy for naming the modification in the previous example:

ENSP00000339186:pm.T262=MOD:00047

indicates that for protein accession number ENSP00000339186, the threonine residue number 262 has been replaced with the residue specified by the PSI-MOD ID "MOD:00047" (O-phospho-L-threonine).

In the special case where one amino acid residue has been substituted by another residue at the DNA or mRNA level, the HGVS annotation (:p.) should be used rather than this type of modification annotation:

ACCESSION:p.XnnY

where X is the residue in the sequence associated with the accession number and Y is the new residue.
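A parser therefore needs to tell the two prefixes apart. The sketch below shows one way this might be done; the function name `classify` and the substitution "T262A" are hypothetical and chosen only for illustration.

```python
import re

# Illustrative pattern for the HGVS-style substitution ACCESSION:p.XnnY,
# restricted to single letter residue codes as this document requires.
SUBSTITUTION = re.compile(
    r"^(?P<accession>.+?):p\.(?P<old>[A-Z])(?P<position>\d+)(?P<new>[A-Z])$"
)


def classify(notation):
    """Return 'modification' for :pm. strings and 'substitution' for :p. strings."""
    accession, separator, rest = notation.partition(":pm.")
    if separator and accession and rest:
        return "modification"
    if SUBSTITUTION.match(notation):
        return "substitution"
    raise ValueError(f"unrecognised notation: {notation!r}")


print(classify("ENSP00000339186:pm.T262=MOD:00047"))  # a modification
print(classify("ENSP00000339186:p.T262A"))  # hypothetical substitution
```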

2.3 (#) Specifying modifications as a change in mass

Sometimes, when using mass spectrometry based approaches to proteomics, it is not possible (or wise) to attribute a particular measured mass difference to a specific residue modification. In such a case, the appropriate nomenclature would be:

ACCESSION:pm.Xnn#DELTA

where "#DELTA" is the difference between the measured mass and the residue mass of "X" in Daltons. Using the example above,

ENSP00000339186:pm.T262#79.9663

indicates that for protein accession number ENSP00000339186, a measurement shows that the threonine residue number 262 has been modified by the addition of 79.9663 Daltons.
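The following sketch illustrates why a mass difference can be the safer description. It reads the delta from the notation and lists the named modifications that fall within a mass tolerance. The two mass shifts are assumed here from the Unimod database, and the names `parse_delta` and `candidates` are hypothetical.

```python
# Monoisotopic mass shifts in Da, assumed from the Unimod database.
KNOWN_SHIFTS = {"Phospho": 79.966331, "Sulfo": 79.956815}


def parse_delta(notation):
    """Return (accession, residue, position, delta) for ACCESSION:pm.Xnn#DELTA."""
    accession, _, spec = notation.partition(":pm.")
    site, _, delta = spec.partition("#")
    return accession, site[0], int(site[1:]), float(delta)


def candidates(delta, tolerance):
    """Names of known modifications whose shift lies within the tolerance."""
    return sorted(
        name
        for name, shift in KNOWN_SHIFTS.items()
        if abs(shift - delta) <= tolerance
    )


_, _, _, delta = parse_delta("ENSP00000339186:pm.T262#79.9663")
print(candidates(delta, 0.02))   # both Phospho and Sulfo fit
print(candidates(delta, 0.005))  # only Phospho fits
```

With a tolerance of 0.02 Da the measurement cannot distinguish phosphorylation from sulfation, which is the kind of situation where reporting the delta is preferable to naming a modification.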

3. Modified domains

In some cases, a measurement cannot identify the specific residue that has been modified, but it can identify a specific domain of the protein that contains the modification. For example, in mass spectrometry-based proteomics it is common to use tryptic peptides to identify modifications, and each tryptic peptide is a domain of the original protein. Similarly, alanine-scanning techniques may show similar biological effects of a modification at several closely situated similar residues.

3.1 Modification within a domain on specific residues

The general format for this type of notation is:

ACCESSION:pm.(XY)mm-nn+MODIFICATION[cc]

where "ACCESSION" is the protein accession number, "(XY)" is a list of single letter amino acid codes for the types of residue that may be modified, "mm-nn" gives the start and end of the domain in protein co-ordinates, and "[cc]" is the number of residues in the domain with this modification. If "[cc]" is not specified, it is taken to be the default value "[1]". For example, if a serine/threonine phosphorylation has been traced to the domain "141 SSSTQLR 147" of the protein with the UniProt accession code "up|QRTFRC|", this would be expressed as:

up|QRTFRC|:pm.(ST)141-147+Phospho
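To make the meaning of the residue list concrete, the sketch below enumerates the residues in the example domain that could carry the modification. The function name `candidate_sites` is hypothetical.

```python
def candidate_sites(domain_sequence, start, residue_types):
    """List residues in the domain whose type appears in the residue list.

    `start` is the protein co-ordinate of the first residue in the domain.
    """
    return [
        f"{residue}{start + offset}"
        for offset, residue in enumerate(domain_sequence)
        if residue in residue_types
    ]


# Domain "141 SSSTQLR 147" with the residue list (ST)
print(candidate_sites("SSSTQLR", 141, "ST"))
# → ['S141', 'S142', 'S143', 'T144']
```

Since "[cc]" is omitted in the example, exactly one of these four sites is taken to be phosphorylated.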

3.2 Modification within a domain on unknown residues

If the type of residue modified is unknown, the character "?" is used instead of a list of specific residues:

up|QRTFRC|:pm.(?)20-34#22.999[2]

indicates that two residues in the protein domain spanning residues 20 to 34 have each been modified by the addition of 22.999 Da.
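Both domain forms can be read with a single pattern, as sketched below. The pattern accepts only the "+" and "#" operators because those are the ones shown for domains in this document; the pattern and the name `parse_domain` are assumptions of this sketch.

```python
import re

# Illustrative pattern for ACCESSION:pm.(XY)mm-nn+MODIFICATION[cc] and the
# "(?)" variant. Only "+" and "#" are accepted here (an assumption).
DOMAIN = re.compile(
    r"^(?P<accession>.+?):pm\."
    r"\((?P<residues>\?|[A-Z]+)\)"
    r"(?P<start>\d+)-(?P<end>\d+)"
    r"(?P<operator>[+#])(?P<value>[^\[\]]+)"
    r"(?:\[(?P<count>\d+)\])?$"
)


def parse_domain(notation):
    """Split a domain modification string into a dictionary of fields."""
    match = DOMAIN.match(notation)
    if match is None:
        raise ValueError(f"not a domain modification: {notation!r}")
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    # "[cc]" defaults to 1 when it is not given.
    fields["count"] = int(fields["count"]) if fields["count"] else 1
    return fields


print(parse_domain("up|QRTFRC|:pm.(ST)141-147+Phospho"))
print(parse_domain("up|QRTFRC|:pm.(?)20-34#22.999[2]"))
```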

4. Multiple modifications within a domain

Experimentally determined modifications on a domain are frequently reported together. To facilitate this, multiple modification specifications may be concatenated into a single string, using a semi-colon ";" to separate the individual specifications:

ACCESSION:pm.MOD_SPEC1;MOD_SPEC2; etc.

For example, using one of the examples above:

up|QRTFRC|:pm.S141+Phospho;T144+Phospho;Q145+Deamidated

would be the notation for two specific phosphorylations and one deamidation.
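A concatenated string can be taken apart by splitting on the first ":pm." and then on the semi-colons. The sketch below does this for single-residue specifications; the function name `split_specifications` and the tuple layout are assumptions made for illustration.

```python
import re

# One single-residue specification: residue, position, operator, value.
SPEC = re.compile(r"^([A-Z])(\d+)([+=#])(.+)$")


def split_specifications(notation):
    """Return the accession and a list of (residue, position, operator, value)."""
    accession, separator, body = notation.partition(":pm.")
    if not separator:
        raise ValueError(f"missing ':pm.' in {notation!r}")
    parsed = []
    for spec in body.split(";"):
        match = SPEC.match(spec)
        if match is None:
            raise ValueError(f"bad specification: {spec!r}")
        residue, position, operator, value = match.groups()
        parsed.append((residue, int(position), operator, value))
    return accession, parsed


accession, specs = split_specifications(
    "up|QRTFRC|:pm.S141+Phospho;T144+Phospho;Q145+Deamidated"
)
print(accession)
for spec in specs:
    print(spec)
```

Splitting on the first ":pm." rather than on every colon keeps values such as "MOD:00047" intact.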

Combinations of different styles of modification specification can be used together in this type of notation. The following is equivalent to the previous example (deamidation increases the residue mass by 0.984016 Da):

up|QRTFRC|:pm.S141+Phospho;T144=MOD:00047;Q145#0.984016

5. White space and readability

The notation used in this document is largely for machine use: it was designed mainly for communicating experimental results to and from computer software. To facilitate its use in human-readable forms, white space may be added before and after any of the specified punctuation, specifically the characters ":.+=;#[]()". The white space will be removed by any automated parser and should not be interpreted as meaningful by the reader.

For example, the previous domain notation may be written as:

up|QRTFRC| : pm. S141+Phospho ;T144=MOD:00047 ;Q145#0.984016
or
up|QRTFRC|:pm. S141 + Phospho; T144 = MOD:00047; Q145 # 0.984016
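A parser might remove this white space before doing anything else. The sketch below strips white space only where it touches one of the listed punctuation characters; the function name `strip_white_space` is hypothetical.

```python
import re

# White space on either side of any of the characters :.+=;#[]()
PUNCTUATION = re.compile(r"\s*([:.+=;#\[\]()])\s*")


def strip_white_space(notation):
    """Remove optional white space around the notation's punctuation."""
    return PUNCTUATION.sub(r"\1", notation).strip()


print(strip_white_space("up|QRTFRC|:pm. S141 + Phospho; T144 = MOD:00047"))
# → up|QRTFRC|:pm.S141+Phospho;T144=MOD:00047
```

Restricting the substitution to white space next to punctuation, rather than deleting all white space, leaves any other characters in the string untouched.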