Tornado

From TheGPMWiki

Purpose

A FASTA file of protein sequences for a species cannot properly record information about known single nucleotide-induced amino acid polymorphisms (SNAPs) or post-translational modifications (PTMs). The GPM Tornado Project was created to investigate methods to compensate for this difficulty, which is common to all proteomics informatics.

Methods

To solve this annotation problem in a manner consistent with the structure of the GPM and the input requirements of the X! series search engines, a layered XML approach was used.

SNAPs

In the case of SNAPs, a set of files was constructed using the ENSEMBL Biomart system to record the position, type and IDs for all amino acid changes associated with human, mouse and rat non-synonymous SNPs. The paths to each of these files were recorded in the taxonomy.xml commonly used by the X! series search engines to look up sequence information associated with a particular species keyword. For example, the following entry for the keyword "human" points to a sequence FASTA file and a file with the associated SNAP information:

<taxon label="human">
       <file format="peptide" URL="../fasta/human.fasta" />
       <file format="saps" URL="../saps/human_snaps.xml" />
</taxon>
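
The lookup described above can be sketched in a few lines of Python. This is only an illustration of reading such an entry, not the code used by the X! series search engines: the function name files_for_taxon and the enclosing <bioml> wrapper are assumptions, and only the <taxon> entry itself is taken from the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy content: the <bioml> wrapper is an assumption;
# the <taxon> entry itself is copied from the example above.
TAXONOMY_XML = """<?xml version="1.0"?>
<bioml label="example taxonomy">
<taxon label="human">
       <file format="peptide" URL="../fasta/human.fasta" />
       <file format="saps" URL="../saps/human_snaps.xml" />
</taxon>
</bioml>"""


def files_for_taxon(taxonomy_xml, keyword):
    """Return {format: [URL, ...]} for every <file> listed under the keyword."""
    root = ET.fromstring(taxonomy_xml)
    found = {}
    for taxon in root.iter("taxon"):
        if taxon.get("label") != keyword:
            continue
        for entry in taxon.iter("file"):
            found.setdefault(entry.get("format"), []).append(entry.get("URL"))
    return found


paths = files_for_taxon(TAXONOMY_XML, "human")
print(paths)
```

A keyword with no matching <taxon> entry simply yields an empty dictionary in this sketch.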

The SNAPs file uses the following format to record the sequence variation information:

<?xml version="1.0"?>
<bioml label="human snap listing, 2008.07.31, for ENSEMBL human, version 50">
<protein id="ENSP00000229384">
       <aa id="rs17431446" at="3" type="S" mut="L" />
</protein>
<protein id="ENSP00000243349">
       <aa id="rs7594480" at="482" type="I" mut="V" />
       <aa id="rs35500979" at="355" type="I" mut="V" />
       <aa id="rs17856675" at="289" type="V" mut="M" />
       <aa id="rs17852075" at="231" type="S" mut="Y" />
       <aa id="rs34742924" at="216" type="G" mut="R" />
</protein>
...
</bioml>

The SNAPs for each <protein> are listed in <aa /> tags. The "at" attribute gives the protein coordinate of the SNAP, the "type" attribute gives the amino acid in the original FASTA listing and the "mut" attribute gives the SNAP residue. The "id" attribute is optional: in the examples above it is the dbSNP ID associated with the nsSNP responsible for the SNAP. Deletions are indicated by an empty "mut" string. Multiple SNAPs at the same residue must be listed individually. Insertions are not supported at this time.
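
As a rough illustration of how these attributes can be consumed, the following Python sketch reads a SNAPs listing and applies one substitution or deletion to a sequence. It is not the search engine implementation. The helper names read_snaps and apply_snap are hypothetical, the "at" coordinate is assumed to be 1-based, and the sequence in the usage note is made up.

```python
import xml.etree.ElementTree as ET

# The <aa> entry is taken from the listing above; the shortened label is
# only a placeholder for this example.
SNAPS_XML = """<?xml version="1.0"?>
<bioml label="human snap listing">
<protein id="ENSP00000229384">
       <aa id="rs17431446" at="3" type="S" mut="L" />
</protein>
</bioml>"""


def read_snaps(snaps_xml):
    """Map protein id -> list of (position, reference, variant, snp id)."""
    snaps = {}
    for protein in ET.fromstring(snaps_xml).iter("protein"):
        entries = snaps.setdefault(protein.get("id"), [])
        for aa in protein.iter("aa"):
            entries.append((int(aa.get("at")), aa.get("type"),
                            aa.get("mut", ""), aa.get("id")))
    return snaps


def apply_snap(sequence, at, ref, mut):
    """Return the variant sequence; "at" is ASSUMED to be 1-based."""
    if sequence[at - 1] != ref:
        raise ValueError(f"expected {ref} at position {at}, found {sequence[at - 1]}")
    # An empty "mut" string is a deletion, so the residue is simply dropped.
    return sequence[:at - 1] + mut + sequence[at:]


snaps = read_snaps(SNAPS_XML)
```

With the invented sequence "MASKT", apply_snap("MASKT", 3, "S", "L") gives the substituted sequence, and passing an empty "mut" string removes the residue instead. Checking the "type" residue against the FASTA sequence guards against applying an annotation to the wrong sequence version.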

Post-Translational Modifications

In the case of PTMs, a different approach was taken. Initially, an effort was made to re-use the approach used for the SNAP case, by constructing files from the PTM information in Swissprot/Uniprot using <aa /> tags to indicate which residue was modified and its modification mass. The resulting search engine implementation worked, but it failed to detect many modifications that were known to be present in the data. Examination of the results indicated that while the Swissprot/Uniprot annotation correctly identified which proteins were modified and the type of modification, the protein coordinates for the modifications were often different in experimental data. This difficulty could be explained by a number of effects:

  1. difficulties in accurately projecting protein sequence coordinates from the reference Swissprot sequences onto ENSEMBL sequences;
  2. real biological variability in which residue is modified; and
  3. errors in the underlying data used to construct the Swissprot entries.

To solve this problem, a more generalized approach was taken to recording PTM information in the BIOML annotation files. The taxonomy file entries were analogous to the SNAP approach:

<taxon label="human">
       <file format="peptide" URL="../fasta/human.fasta" />
       <file format="mods" URL="../mods/human_mod.xml" />
</taxon>

The information associated with known PTMs was then recorded for each protein sequence in a more general format:

<?xml version="1.0"?>
<bioml label="human ENSEMBL potential modification annotation">
    <protein label="ENSP00000355265" pmods="42.010565@K" id="sp|P03928|" />
    <protein label="ENSP00000373773" pmods="79.966331@T" id="sp|Q8NB50|" />
    <protein label="ENSP00000383681" pmods="79.956815@Y" id="sp|P41597|" />
    <protein label="ENSP00000383680" pmods="79.956815@Y,79.966331@S" id="sp|P51681|" />
    ...
</bioml>

where for each <protein> tag, the known PTMs are recorded in the "pmods" attribute in the format used to indicate variable modifications in X! Tandem and P3. In this case, the search engine uses the "pmods" attribute to set the variable modifications used in the search in a protein sequence-specific manner. The "id" attribute is optional, and in this case it indicates the Swissprot/Uniprot sequence accession number associated with the annotation.
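
The "pmods" value is a comma-separated list of mass@residue pairs, so it can be split with very little code. The sketch below is a minimal illustration in Python; parse_pmods is a hypothetical helper name, not part of X! Tandem or P3.

```python
def parse_pmods(pmods):
    """Split a "mass@residue" list into (mass, residue) tuples."""
    mods = []
    for item in pmods.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas or an empty attribute
        mass, residue = item.split("@")
        mods.append((float(mass), residue))
    return mods


# Value copied from the ENSP00000383680 entry above.
mods = parse_pmods("79.956815@Y,79.966331@S")
print(mods)
```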

This approach was further extended to take into account proteins that have so many PTMs that it can be difficult to identify them using the approaches commonly used in proteomics. Examples of proteins in this category are histones (modification of lysine and arginine), collagens (hydroxylation of proline and lysine) and ubiquitins (ubiquitination of lysines). The nonsense modification "-1@B" was chosen as a hint to the search engine that a protein should always be checked for the annotated modifications. For example, the entry for one of the splice variants of human COL6A1 (ENSP00000355180) is shown here:

<?xml version="1.0"?>
<bioml label="human ENSEMBL potential modification annotation">
    ...
    <protein label="ENSP00000355180" pmods="15.994915@P,-1@B,15.994915@K" id="gpmdb|COL6A1|" />  
    ...
</bioml>

When processing a data set, the search engine X! Tandem will always check this sequence for these modifications, unless the user has explicitly specified different modifications for these residues.
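
One way to picture this rule is the Python sketch below, which separates the "-1@B" hint from the real modifications and drops annotations for residues the user has already configured. It only illustrates the behaviour described here under stated assumptions: the function name annotated_mods and the user_residues parameter are hypothetical, and the actual logic inside X! Tandem may differ.

```python
ALWAYS_CHECK_HINT = "-1@B"  # the nonsense modification described above


def annotated_mods(pmods, user_residues=frozenset()):
    """Return (always_check, mods) for one <protein> "pmods" attribute.

    user_residues is a HYPOTHETICAL set of residues for which the user has
    explicitly specified modifications; annotations on them are skipped.
    """
    items = [item.strip() for item in pmods.split(",") if item.strip()]
    always_check = ALWAYS_CHECK_HINT in items
    mods = []
    for item in items:
        if item == ALWAYS_CHECK_HINT:
            continue  # the hint is a flag, not a real modification
        mass, residue = item.split("@")
        if residue not in user_residues:
            mods.append((float(mass), residue))
    return always_check, mods


# Value copied from the COL6A1 entry above.
flag, mods = annotated_mods("15.994915@P,-1@B,15.994915@K")
print(flag, mods)
```

Calling the same function with user_residues={"K"} keeps only the proline annotation, which mirrors the exception for user-specified modifications.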

The use of SwissProt/Uniprot as a source of annotation for this process has been largely replaced by a more general system that can be used for any species, based on protein domain predictions. For example, if a protein has a predicted SH2 domain it is reasonable to expect that the protein will be phosphorylated. Similarly, if a protein has a collagen domain it is reasonable to expect hydroxyproline formation. By compiling a list of domains and their associated modifications it is possible to create an annotation file for any proteome, so long as domain predictions can be performed.
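
A domain-driven annotation file of this kind could be assembled along the following lines. The Python sketch is illustrative only: the DOMAIN_MODS table, the domain names, the choice of residues and both function names are assumptions made for the example, and the masses are simply those that appear in the listings above.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL domain-to-modification table. The masses are the ones used in
# the examples above; the domain names and residue choices are assumptions.
DOMAIN_MODS = {
    "SH2": ["79.966331@Y"],
    "Collagen": ["15.994915@P", "15.994915@K"],
}


def pmods_for_domains(domains):
    """Collect the distinct modifications implied by predicted domains."""
    mods = []
    for domain in domains:
        for mod in DOMAIN_MODS.get(domain, []):
            if mod not in mods:
                mods.append(mod)
    return ",".join(mods)


def build_annotation(predictions, label):
    """Build a <bioml> tree from {protein label: [domain, ...]} predictions."""
    root = ET.Element("bioml", {"label": label})
    for protein, domains in predictions.items():
        pmods = pmods_for_domains(domains)
        if pmods:  # proteins without any mapped domain get no entry
            ET.SubElement(root, "protein", {"label": protein, "pmods": pmods})
    return root


tree = build_annotation({"ENSP00000355180": ["Collagen"]}, "example annotation")
print(ET.tostring(tree, encoding="unicode"))
```

Keeping the domain table separate from the file-building step means the same code can be reused for any proteome once domain predictions are available.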

History

  1. The GPM Tornado Project was started in 2006.
  2. The first phase of the Project was completed in February 2008, with the release of the Tornado versions of X! Tandem, P3 and Hunter.
  3. Initial release of SNAP and PTM annotation files for ENSEMBL v 49 sequences and SwissProt PTM annotations at ftp://ftp.thegpm.org/fasta/eukaryote in the "saps" and "mods" folders, respectively.