Updating Ontology Collections

From TheGPMWiki
Revision as of 19:18, 25 January 2011 by WikiSysop

The files viewable through http://gpmdb.thegpm.org/go/index.html can be both refreshed and updated on a standard installation of the GPM. The directions below describe how to refresh the existing files, how to obtain new data, and how to update the ontologies with new proteins.

The display information is stored in /thegpm/go/. Refreshing the information creates new versions of the files in that directory.

The source files for the lists of proteins in a given ontology are stored in /thegpm/scripts/ont_builder/ and its three subfolders:

  • /thegpm/scripts/ont_builder/human_go/
  • /thegpm/scripts/ont_builder/mouse_go/
  • /thegpm/scripts/ont_builder/yeast_go/

Scripts

The scripts that update the various ontologies are available in the /thegpm/scripts/ont_builder/ folder on a standard installation of the GPM's software. The scripts must currently be edited by hand before they can be used to update or refresh the gene ontology pages.

The scripts that perform the refreshing and updating work are:

  • ont_builder.pl: performs the actual creation of display files.
  • ont_builder_batch.pl: creates a list of files and calls ont_builder.pl on each of them.
  • ont_builder_update_chromosome_proteins.pl: updates the chromosome-based ontology of proteins for both human and mouse.
  • ont_builder_update_go_proteins.pl: updates the gene-based ontology of proteins for human, mouse and yeast.
  • ont_builder_update_go_details.pl: updates the file containing the identified vs. total protein counts for each species, in each gene ontology.

Getting data files

Data files for both the chromosome and GO ontologies are downloaded from BioMart on the Ensembl website. Use the most recent Ensembl release, and choose the appropriate species for each file described below.

Data files for the BTO ontologies are generated separately. The source protein lists are produced periodically through the Normal Clinical Tissue Alliance, and the BTO ontologies are updated at that time.

GO data files

Download a tab-delimited file for each species to be updated. The data files used to update the GO category displays contain the following columns:

  • Ensembl Protein Identifier
  • GO Group Name
  • GO Group Name (mf)
  • GO Group Name (cc)

The GO Group fields can be in any order, but the protein identifier must be first. Any subsequent fields will be ignored.
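
The column rules above can be illustrated with a short Python sketch. This is not how ont_builder_update_go_proteins.pl reads the file; it is only a minimal reader written under the stated rules (protein identifier first, three GO group fields in any order, later fields ignored). The function name read_go_table, the has_header option, and the sample identifiers and group names are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_go_table(handle, has_header=False):
    """Map each GO group name to the set of protein identifiers listed with it.

    has_header is an assumption: set it to True if the downloaded file
    starts with a line of column titles.
    """
    groups = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # discard the column titles
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines and rows without a protein identifier
        protein_id = row[0].strip()
        # Fields 2-4 hold the GO group names, in any order.
        # Anything after the fourth field is ignored.
        for name in row[1:4]:
            name = name.strip()
            if name:
                groups[name].add(protein_id)
    return dict(groups)


# Made-up sample rows: the first has an extra fifth field,
# the second has an empty GO group field.
sample = (
    "ENSP00000000001\tmetabolic process\tprotein binding\tcytoplasm\textra\n"
    "ENSP00000000002\tmetabolic process\t\tnucleus\n"
)
go_groups = read_go_table(io.StringIO(sample))
for name in sorted(go_groups):
    print(name, sorted(go_groups[name]))
```

Because the three GO group fields are treated identically, the sketch does not care which of them comes first, which matches the rule that only the position of the protein identifier is fixed.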

Chromosome data files

Download a tab-delimited file for each species to be updated. The data files used to update the chromosome displays must have their columns in the following order:

  • Chromosome Name
  • Start position (bp)
  • Ensembl Protein Identifier

Any subsequent fields will be ignored.
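
As a quick check that a downloaded file follows this layout, the following Python sketch reads the three required columns and groups the proteins by chromosome. It is an illustration only and does not reproduce ont_builder_update_chromosome_proteins.pl. The function name read_chromosome_table and the sample rows are made up, and sorting by start position is a choice made for this example rather than something the source specifies.

```python
import csv
import io
from collections import defaultdict


def read_chromosome_table(handle):
    """Return {chromosome name: [(start_bp, protein_id), ...]}.

    Each chromosome's list is sorted by start position (a choice made
    for this sketch).
    """
    chromosomes = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # not enough fields to be a data row
        # Only the first three fields are used; later fields are ignored.
        name, start, protein_id = (field.strip() for field in row[:3])
        if not start.isdigit() or not protein_id:
            continue  # skips a title line and incomplete rows
        chromosomes[name].append((int(start), protein_id))
    return {name: sorted(entries) for name, entries in chromosomes.items()}


# Made-up sample rows: the first carries an extra fourth field.
sample = (
    "X\t2000\tENSP00000000003\tignored\n"
    "17\t500\tENSP00000000004\n"
    "X\t100\tENSP00000000005\n"
)
by_chromosome = read_chromosome_table(io.StringIO(sample))
for name, entries in by_chromosome.items():
    print(name, entries)
```

Rows whose second field is not a whole number are skipped, so a line of column titles at the top of the file does no harm.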

Updating ontology content

Updating the list of proteins associated with an ontology modifies the source XML files from which the display files are built. This step must be done first if new information is available from Ensembl (e.g., after a new data release). If all that is required is to refresh the best expect scores and identification counts for each protein in an ontology, this step can be skipped.

Chromosomes

script editing

human

mouse

GO categories

script editing

human

mouse

yeast

Refreshing ontology content

Refreshing the content of an ontology will recalculate the identification counts and best expect scores for each of the proteins defined as belonging to the ontology in the source XML files.

Chromosomes

script editing

human

mouse

GO categories

script editing

human

mouse

yeast