The process for annotation generally involves synthesizing your knowledge of phage genomics with the evidence tracks available to you.
The CPT encourages phage annotators on the CPT’s Apollo instance to follow these conventions:
Field | Recommended Value |
---|---|
Name | Your gene’s name. Could be something like DHFR or Hypothetical Novel. |
Symbol | Do not use. |
Description | Do not use. |
DBXRefs | Use only if you are an experienced annotator, and ensure the format is correct. |
Attributes | Do not use. |
Pubmed IDs | Do not use. |
Gene Ontology IDs | Do not use. |
Comments | Apply any free-text comments you would like here. |
We integrate as many data sources as is feasible. If you need a data source that does not appear in Apollo, please contact Eric Rasche and the CPT can work on adding it to the Phage Annotation Pipeline (PAP).
We run BlastP against three databases:
These databases can give you good insights into possible names and functionalities for proteins.
The CPT’s PAP integrates gene calls from three sources: GeneMarkS, MetaGeneAnnotator, and Glimmer3.
We run a number of “phage analyses”, which are mostly tools the CPT has developed to aid in phage-specific annotation.
The candidate ISP and OSP tracks are our attempts to locate possible phage spanin components. These tracks will feature a huge number of false positives, so you should check that a candidate lies somewhere around your lysis cluster (where appropriate).
The ISP track naively searches the genome for every possible CDS, and then analyses each one with TMHMM. We do this in case you miscalled or missed your i-spanin. The OSP track searches every possible CDS for a lipobox, as defined by the CPT’s lipobox regular expression.
Both of these datasets are filtered for proximity. Co-incidence of a possible ISP gene and a possible OSP gene is a good sign, but you will need to use genomic context information to complete the functionality inference.
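The proximity idea can be illustrated with a minimal sketch. It is not the CPT's actual filter: the `(start, end)` interval representation, the function names, and the `max_gap` threshold of 100 bases are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the genome, with start <= end


def gap(a: Interval, b: Interval) -> int:
    """Distance in bases between two intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def nearby_pairs(
    isps: List[Interval], osps: List[Interval], max_gap: int = 100
) -> List[Tuple[Interval, Interval]]:
    """Keep every (ISP, OSP) candidate pair separated by at most max_gap bases.

    max_gap is a hypothetical threshold, not a value taken from the CPT pipeline.
    """
    return [(i, o) for i in isps for o in osps if gap(i, o) <= max_gap]


if __name__ == "__main__":
    isp_candidates = [(100, 200), (5000, 5200)]
    osp_candidates = [(250, 400), (9000, 9100)]
    print(nearby_pairs(isp_candidates, osp_candidates))
```

A pair that survives such a filter is only a lead; genomic context still has to support the call.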
This track analyses your BlastP against NR data for locations where two (or more) disjoint, called, CDSs match separate locations on the same target protein. An example from Phage K is illuminating here:
Both 195a and 195b align to distinct regions of the same protein, based on BLAST data. It can be theorized that these are actually one protein with one intron and two exons; however, this evidence should not be taken as conclusive. Similar results can arise for other reasons, such as evolutionary separation of the domains of a single protein, sequencing errors, and a host of other possibilities.
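The kind of check this track performs can be sketched as follows. This is an illustrative approximation rather than the CPT's implementation: the `Hit` record, its field names, and the subject identifiers in the example are hypothetical, and the sketch only tests whether two different called CDSs hit non-overlapping ranges of the same target protein.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    query: str    # ID of the called CDS (hypothetical field name)
    subject: str  # ID of the target protein in the BLAST database
    sstart: int   # start of the alignment on the subject, 1-based inclusive
    send: int     # end of the alignment on the subject, inclusive


def split_candidates(hits: List[Hit]) -> List[Tuple[str, str, str]]:
    """Return (subject, query_a, query_b) where two different CDSs
    align to non-overlapping regions of the same subject protein."""
    by_subject = defaultdict(list)
    for h in hits:
        by_subject[h.subject].append(h)

    found = []
    for subject, group in by_subject.items():
        for a, b in combinations(group, 2):
            if a.query == b.query:
                continue  # two HSPs from the same CDS are not of interest here
            if a.send < b.sstart or b.send < a.sstart:
                found.append((subject, a.query, b.query))
    return found


if __name__ == "__main__":
    hits = [
        Hit("195a", "P1", 1, 120),
        Hit("195b", "P1", 130, 300),
        Hit("gp7", "P2", 1, 200),
        Hit("gp8", "P2", 150, 400),  # overlaps gp7 on the subject, so not reported
    ]
    print(split_candidates(hits))
```

A real analysis would likely also apply e-value and coverage cut-offs and check that the CDSs are adjacent on the genome; those steps are omitted here.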
This track, likewise, is very optimistic in what it calls a possible frameshift. It searches for the XXXYYYZ pattern (allowing for some wobble) wherein a frameshift would not change both codons. This is based on evidence in 10.1016/j.molcel.2004.09.006.
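As a rough illustration of the XXXYYYZ search, the sketch below scans a nucleotide sequence with a regular expression. It is a simplification under stated assumptions: it requires an exact match (no wobble), only looks at the given strand, and the function name is made up for this example; the CPT's tool may differ in all of these respects.

```python
import re
from typing import List, Tuple

# Three identical bases, three more identical bases, then any base.
# The lookahead lets overlapping sites be reported.
SLIPPERY = re.compile(r"(?=(([ACGT])\2\2([ACGT])\3\3[ACGT]))")


def find_slippery_sites(seq: str) -> List[Tuple[int, str]]:
    """Return (0-based start, heptamer) for each exact XXXYYYZ match."""
    seq = seq.upper()
    return [(m.start(1), m.group(1)) for m in SLIPPERY.finditer(seq)]


if __name__ == "__main__":
    print(find_slippery_sites("GGCAAATTTGCC"))
```

Because the pattern is so permissive, hits are best treated as places to look more closely, not as evidence of a frameshift on their own.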
These analyses are other sequence or structural predictions.
Here TMHMM is run over your genome to pick out genes containing likely TMDs. TMHMM data is used in a number of other tracks and analyses as well.
Terminators are produced from TransTermHP. You cannot currently annotate these in Apollo. That is being worked on upstream. For now, when you go to publish a genome, the CPT will work with you to run a number of other automated annotation processes as part of a post-processing step.
ARAGORN provides us with quality tRNA annotations. You should feel comfortable annotating those in Apollo. Bear in mind that tRNAs are not likely to be embedded within genes.
Missing from this graphic is InterProScan, which is an extremely useful domain finder. The results from InterProScan can help inform protein identities and provide solid evidence for functionality.
InterProScan produces IPR##### entries which will be automatically applied to your genome as part of a finishing process. Please contact Eric for more information.