Beavis.ron (Talk | contribs) |
A FASTA file of protein sequences for a species cannot properly record information about known single nucleotide-induced amino acid polymorphisms (SNAPs) or post-translational modifications (PTMs). The GPM Tornado Project was created to investigate methods to compensate for this difficulty, which is common to all proteomics informatics.
To solve this annotation problem in a manner consistent with the structure of the GPM and the input requirements of the X! series search engines, a layered XML approach was used.
In the case of SNAPs, a set of files was constructed using the ENSEMBL Biomart system to record the position, type and IDs of all amino acid changes associated with human, mouse and rat non-synonymous SNPs. The paths to each of these files were recorded in the taxonomy.xml file commonly used by the X! series search engines to look up sequence information associated with a particular species keyword. For example, the following entry for the keyword "human" points to a sequence FASTA file and a file with the associated SNAP information:
<taxon label="human">
  <file format="peptide" URL="../fasta/human.fasta" />
  <file format="saps" URL="../saps/human_snaps.xml" />
</taxon>
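As an illustration of how such an entry can be resolved, the following is a minimal Python sketch that looks up the files registered for a species keyword. It is not the X! series implementation. The function name `files_for_taxon` is hypothetical, and the sketch assumes that the <taxon> entries sit under a single root element (called <bioml> here, with a made-up label).

```python
import xml.etree.ElementTree as ET

# Assumption: the <taxon> entries are wrapped in one root element.
# The root name and its label are illustrative only.
TAXONOMY_XML = """<?xml version="1.0"?>
<bioml label="taxonomy sketch">
  <taxon label="human">
    <file format="peptide" URL="../fasta/human.fasta" />
    <file format="saps" URL="../saps/human_snaps.xml" />
  </taxon>
</bioml>"""


def files_for_taxon(xml_text, label):
    """Return {format: [URL, ...]} for every <taxon> whose label matches."""
    root = ET.fromstring(xml_text)
    found = {}
    for taxon in root.iter("taxon"):
        if taxon.get("label") != label:
            continue
        for entry in taxon.iter("file"):
            # Group the paths by their "format" attribute.
            found.setdefault(entry.get("format"), []).append(entry.get("URL"))
    return found


human_files = files_for_taxon(TAXONOMY_XML, "human")
print(human_files)
```

Grouping by the "format" attribute keeps the sequence file and the annotation file separate, so a caller can load the FASTA file first and then layer the annotation on top of it.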
The SNAPs file uses the following format to record the sequence variation information:
<?xml version="1.0"?>
<bioml label="human snap listing, 2008.07.31, for ENSEMBL human, version 50">
  <protein id="ENSP00000229384">
    <aa id="rs17431446" at="3" type="S" mut="L" />
  </protein>
  <protein id="ENSP00000243349">
    <aa id="rs7594480" at="482" type="I" mut="V" />
    <aa id="rs35500979" at="355" type="I" mut="V" />
    <aa id="rs17856675" at="289" type="V" mut="M" />
    <aa id="rs17852075" at="231" type="S" mut="Y" />
    <aa id="rs34742924" at="216" type="G" mut="R" />
  </protein>
  ...
</bioml>
The SNAPs for each <protein> are listed in <aa /> tags. The "at" attribute indicates the protein co-ordinate for the SNAP, the "type" attribute indicates the amino acid in the original FASTA listing and the "mut" attribute indicates the SNAP residue. The "id" attribute is optional: in these cases it is the dbSNP id associated with the nsSNP responsible for the SNAP. Deletions are indicated by an empty "mut" string. Multiple SNAPs at the same residue must be listed individually. Insertions are not supported at this time.
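The attribute rules above can be sketched in Python as follows. This is an illustrative reading of the format, not the search engine's code. The function names `read_snaps` and `apply_snap` are hypothetical, the "at" co-ordinate is assumed to be 1-based, and the sequence "MASNDK" is a made-up example, not the real sequence of ENSP00000229384.

```python
import xml.etree.ElementTree as ET

SNAP_XML = """<?xml version="1.0"?>
<bioml label="snap sketch">
  <protein id="ENSP00000229384">
    <aa id="rs17431446" at="3" type="S" mut="L" />
  </protein>
</bioml>"""


def read_snaps(xml_text):
    """Map each protein id to a list of (at, type, mut, id) tuples."""
    snaps = {}
    for protein in ET.fromstring(xml_text).iter("protein"):
        records = snaps.setdefault(protein.get("id"), [])
        for aa in protein.iter("aa"):
            # "id" is optional, so aa.get("id") may be None.
            records.append(
                (int(aa.get("at")), aa.get("type"), aa.get("mut", ""), aa.get("id"))
            )
    return snaps


def apply_snap(sequence, at, original, mutant):
    """Return the sequence with one SNAP applied.

    Assumes "at" is 1-based. An empty mutant string deletes the residue.
    """
    index = at - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position %d is outside the sequence" % at)
    if sequence[index] != original:
        raise ValueError("expected %s at position %d" % (original, at))
    return sequence[:index] + mutant + sequence[index + 1:]


snaps = read_snaps(SNAP_XML)
toy_sequence = "MASNDK"  # hypothetical sequence with S at position 3
at, original, mutant, _ = snaps["ENSP00000229384"][0]
print(apply_snap(toy_sequence, at, original, mutant))
```

Checking the "type" residue against the FASTA sequence before substituting is a cheap way to catch an annotation file that does not match the sequence version in use.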
In the case of PTMs, a different approach was taken. Initially, an effort was made to re-use the approach used for the SNAP case, by constructing files from the PTM information in Swissprot/Uniprot using <aa /> tags to indicate which residue was modified and its modification mass. The resulting search engine implementation worked, but it failed to detect many modifications that were known to be present in the data. Examination of the results indicated that while the Swissprot/Uniprot annotation correctly identified which proteins were modified and the type of modification, the protein coordinates for the modifications were often different in experimental data. This difficulty could be explained by a number of effects:
To solve this problem, a more generalized approach was taken to recording PTM information in the BIOML annotation files. The taxonomy file entries were analogous to the SNAP approach:
<taxon label="human">
  <file format="peptide" URL="../fasta/human.fasta" />
  <file format="mods" URL="../mods/human_mod.xml" />
</taxon>
The information associated with known PTMs was then recorded for each protein sequence in a more general format:
<?xml version="1.0"?>
<bioml label="human ENSEMBL potential modification annotation">
  <protein label="ENSP00000355265" pmods="42.010565@K" id="sp|P03928|" />
  <protein label="ENSP00000373773" pmods="79.966331@T" id="sp|Q8NB50|" />
  <protein label="ENSP00000383681" pmods="79.956815@Y" id="sp|P41597|" />
  <protein label="ENSP00000383680" pmods="79.956815@Y,79.966331@S" id="sp|P51681|" />
  ...
</bioml>
where for each <protein> tag, the known PTMs are recorded in the "pmods" attribute in the format used to indicate variable modifications in X! Tandem and P3. In this case, the search engine uses the "pmods" attribute to set the variable modifications used in the search in a protein sequence-specific manner. The "id" attribute is optional, and in this case it indicates the Swissprot/Uniprot sequence accession number associated with the annotation.
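To make the "pmods" format concrete, the following minimal Python sketch splits each attribute into (mass, residue) pairs keyed by protein label. It is only an illustration of the listing above. The function names `parse_pmods` and `read_mods` are hypothetical, and the sketch assumes that entries are comma-separated and that each one has the form mass@residue, as in the listing.

```python
import xml.etree.ElementTree as ET

MODS_XML = """<?xml version="1.0"?>
<bioml label="human ENSEMBL potential modification annotation">
  <protein label="ENSP00000355265" pmods="42.010565@K" id="sp|P03928|" />
  <protein label="ENSP00000383680" pmods="79.956815@Y,79.966331@S" id="sp|P51681|" />
</bioml>"""


def parse_pmods(pmods):
    """Split 'mass@residue,mass@residue' into a list of (mass, residue) pairs."""
    pairs = []
    for item in pmods.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty attribute or a trailing comma
        mass, residue = item.split("@")
        pairs.append((float(mass), residue))
    return pairs


def read_mods(xml_text):
    """Map each protein label to its list of potential modifications."""
    return {
        protein.get("label"): parse_pmods(protein.get("pmods", ""))
        for protein in ET.fromstring(xml_text).iter("protein")
    }


mods = read_mods(MODS_XML)
print(mods["ENSP00000383680"])
```

Because the annotation names a residue type and not a sequence position, a lookup of this kind only tells the search which variable modifications to consider for a given protein. Locating the modified site is left to the search itself.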