Toward a standard in structural genome annotation for prokaryotes
- H. James Tripp†1,
- Granger Sutton2,
- Owen White3,
- Jennifer Wortman4,
- Amrita Pati1,
- Natalia Mikhailova1,
- Galina Ovchinnikova1,
- Samuel H. Payne5,
- Nikos C. Kyrpides1 and
- Natalia Ivanova†1Email author
© Tripp et al. 2015
Received: 18 December 2014
Accepted: 1 July 2015
Published: 25 July 2015
In an effort to identify the best practice for finding genes in prokaryotic genomes and propose it as a standard for automated annotation pipelines, 1,004,576 peptides were collected from various publicly available resources, and were used as a basis to evaluate various gene-calling methods. The peptides came from 45 bacterial replicons with an average GC content from 31 % to 74 %, biased toward higher GC content genomes. Automated, manual, and semi-manual methods were used to tally errors in three widely used gene calling methods, as evidenced by peptides mapped outside the boundaries of called genes.
We found that the consensus set of identical genes predicted by the three methods constitutes only about 70 % of the genes predicted by each individual method (with start and stop required to coincide). Peptide data was useful for evaluating some of the differences between gene callers, but not reliable enough to make the results conclusive, due to limitations inherent in any proteogenomic study.
A single, unambiguous, unanimous best practice did not emerge from this analysis, since the available proteomics data were not adequate to provide an objective measurement of differences in the accuracy between these methods. However, as a result of this study, software, reference data, and procedures have been better matched among participants, representing a step toward a much-needed standard. In the absence of sufficient amount of exprimental data to achieve a universal standard, our recommendation is that any of these methods can be used by the community, as long as a single method is employed across all datasets to be compared.
As of July 13, 2013, more than a third of the 29,183 bacterial and archaeal genome sequencing projects listed in the Genomes On-line Database (GOLD)  are attributable to four major sequencing centers: DOE Joint Genome Institute (JGI, 4,250 projects), The Broad Institute (3,155 projects), J. Craig Venter Institute (JCVI, 1,976 projects), and Institute for Genome Sciences (IGS, 1,269 projects). Assuming an average of 3,000 gene predictions per genome for the 10,650 projects at these sequencing centers, an estimated 31,950,000 gene predictions will have been made by the completion of these projects. Given that each sequencing center has its own automated gene prediction pipeline, using software that has evolved separately over more than a decade, the question arises as to best current practices in structural genome annotation. In this context the phrase “structural gene annotation” refers only to finding the loci of protein-coding genes, not to annotating protein functions or predicting their 3D structure. Implementation of a single best practice would have the benefit of producing a single gene locus identifier for ease of cross-referencing in the scientific literature and for use by comparative genomics software . A related motivation for this study was the need to consistently reannotate public genomes whose annotations are now more than a decade old.
Functional genomics data, such as RNA sequencing (RNA-Seq) and proteomics, provide a useful reference for evaluating and improving genome annotations [3–8]. A combination of the two is especially powerful, since RNA-seq data reveals transcript boundaries, whereas proteomics helps mapping translated sequences (coding sequences or CDSs). We found very few genomes where peptide data was available to confirm RNA-seq data, and since all available prokaryotic gene finders predict translated products rather than transcript boundaries, we explored whether proteomics alone could serve as a tool to identify a best practice for updating gene calls in outdated genome annotations.
A test set of genomes with varying GC% was identified and the gene calls for GeneMarkS , Glimmer3 , and Prodigal , which are the three most popular ab initio methods, were obtained from RefSeq’s public ftp site . Peptides for most of the genomes were compiled from a PNNL website , with other data obtained from the PRIDE BioMart  and the publications of several independent research labs [6, 14, 15] (Additional file 1). We used the peptides to evaluate the accuracy of the GeneMarkS, Glimmer3, and Prodigal2.5 gene callers. In addition, we evaluated gene calling by the gene-finding post processor GenePRIMP  developed by JGI. Notably, the genome annotation versions of GeneMarkS, Glimmer3, Prodigal and GenePRIMP correspond to July 2013, when the work on this project was started. Based on the results of this work, reannotation of all public genomes integrated in IMG has begun at the JGI using the Prodigal based gene calling pipeline.
Comparison of gene predictions
Peptide coverage of genes
Peptide coverage of predicted genes, which is to say the percentage of genes in the entire dataset that had at least one peptide mapping wholly inside of the gene, was on average approximately 40 % (data not shown). Total peptide support for gene calls, which is to say the total number of peptides that fell wholly inside of any gene prediction, was highest for Prodigal (1,000,574) and lowest for Glimmer3 (994,973) with GeneMarkS intermediate between the two (996,336).
Comparison of detectible errors
No proteomic experiment can guarantee expression of every gene in the genome.
Signal peptides, are often removed from proteins, making it impossible to guarantee peptide data pertaining to the true start site of translation.
Some peptide sequences, particularly highly hydrophobic ones, are not amenable to detection by mass spectrometry.
The mapping of peptide mass spectra to genome sequence may be erroneous and thus presents an opportunity for false positives.
It is impossible to detect “too long” errors in gene start calling using peptide data, since an error correcting peptide will never appear upstream of a predicted start that is already upstream of the true start. It is important to recognize that it is possible for a lower “short” gene error rate to be offset by a higher “long” gene error rate, resulting in a better overall rate of calling correct gene starts. So the “short” gene error rate in itself is not an unbiased measurement of a gene finder’s ability to choose correct gene start sites. However, considering that “short” gene errors prevent identification of functionally important conserved domains and motifs, and therefore can result in erroneous functional predictions, we report it here with this caveat in mind. In addition, we should point out that some genes have alternative translation initiation sites. This may have caused some spurious “short” errors, however all of the gene callers were under the same handicap in this regard.
It is impossible to detect false positive gene calls using peptide data, since peptides can only confirm gene calls; they cannot deny them.
In addition to these general caveats, it must also be reiterated that the genomes chosen for analysis are not a random, representative sample. Therefore, the results presented here must be considered an estimate of gene calling performance detectable with proteomics, not a definitive and absolute measurement of true gene calling performance.
At the same time, there is no definitive measurement of true gene calling performance against a randomly chosen, fully representative set of genomes. The biological knowledge to force expression of every protein in every genome does not exist, nor do high throughput biochemical methods for detecting every amino acid residue in every translation product in a cell, even if such knowledge of expression were available. The expression rate for this study, as measured by the peptide coverage reported above, averaged less than half (~40 %), but there is no reason to assume that this sample is systematically biased for or against any particular gene caller. Also, while it is true that peptides cannot detect false positive gene calls, statistical observations can give evidence of false positives: Glimmer3 made twice as many unique gene calls as Prodigal or GeneMarkS, but had the fewest number of confirming peptides. This does not prove that it makes more false positive predictions than the other gene callers; it simply offers some evidence that it might. High throughput proteomic data is the only option available for performing a wide survey of gene calling accuracy for thousands of genes in dozens of genomes. Use of high throughput proteomics is therefore an operational necessity if one wishes to perform a survey of gene-calling methods in preparation for a task such as reannotating all public genomes.
An important parameter affecting gene predictions made by ab initio gene callers is minimum gene length. Other things being equal, a shorter minimum gene length yields more candidate ORFs, which can result in a larger number of genes called. A biologically meaningful minimum gene length is 39 nucleotides (nt), which is the length of the PatS peptide, the shortest CDS yet detected . However, such a short length generates so many spurious candidate ORFs that it is not recommended by designers of ab initio gene callers. The default minimum gene lengths recommended by program designers are: 90 nt for Prodigal, 81 nt for GeneMarkS, and 120 nt for Glimmer. These defaults were suggested by the developers of the corresponding tools to ensure their optimal performance. Selecting any other minimum gene length cutoff than 39 nt cannot be biologically justified but will undoubtedly result in poor performance; furthermore, it is likely to bias the analysis against one or another gene finder. For these reasons we chose to proceed with default cutoffs.
Turning to an analysis of the data regarding the three ab initio gene callers, it appears that Glimmer3’s “aggressive” algorithm for finding novel coding regions makes it prone to errors detectable with proteomics, while Prodigal’s design objective of eliminating false positives while retaining sensitivity makes it the least prone to such errors. The version of GeneMarkS tested, which now has been improved but not yet released, produced intermediate results. It is possible that the “aggressive” gene calling of Glimmer might be appropriate for a different set of genomes from novel single cell organisms. It may also be that the careful modeling of non-coding regions done by GeneMarkS (Mark Borodovsky, personal communication) may avoid more false positives than Prodigal in genomes with exceptionally low coding percentage and low GC content.
A similar situation occurs when sequencing error introduces an interrupted gene. GenePRIMP attempts to detect these instances and joins coding domains it deems to have likely been interrupted by the introduction of a spurious stop codon due to sequencing error. The partial coding domains are annotated as “exons,” even though they are not pieces of a gene interrupted by introns (as in eukaryotes), but rather pieces of a gene with spurious interruptions introduced by sequencing errors that result in frameshifts and internal stop codons. The missed gene data suggests that the GenePRIMP algorithm for detecting and annotating sequencing error is reliable and appropriate, albeit with a novel interpretation of “exon.” For similar reasons, NCBI has recently changed the guidelines for annotation of interrupted genes in RefSeq genomes (personal communication). Interrupted genes are annotated as partial coding regions if their translated protein products have significant similarity to full-length proteins in closely related genomes.
Proteomics is a valuable aid to evaluating and improving gene-calling programs. When applied to 45 replicons of interest to the participants of this study, a combination of ab initio gene calling by Prodigal followed by GenePRIMP post processing had a lower estimated, operational error rate than GeneMarkS followed by Glimmer3. We have also compared these data against the RefSeq pipeline (version available in Spring 2013) and the results showed that the its overall performance was between that of GeneMark and Glimmer (data not shown). Nonetheless, due to inherent biologically-based limitations, we cannot conclude that proteomics alone should be used to define a best practice as the basis for a general standard in prokaryotic structural genome annotation; this must wait for better tools and expanded datasets that cover more taxonomic groups without biases in GC content, genome length, and gene expression levels. Some participants have already improved their pipelines, especially gene data models and reference databases, with the goal of one day achieving a much needed standard. Moving forward, a consensus approach, employing multiple gene callers and additional forms of expression verification such as RNA-seq, should be also explored as the possible basis for a standard in prokaryotic structural genome annotation.
Selection of genomes of interest
The annotation was thought to be in need of updating.
The organism was well studied; preferably a type strain whose annotation had been heavily curated.
The genome added taxonomic diversity to the dataset.
Because of the genome selection criteria, the sample set is diverse and relevant to the participants, but it cannot be considered a representative random sample of genomes in nature or in public databases. The average GC content was 57.7 %, with a range of 30.8 % to 74.3 %. The average genome size was 4.3 Mbp with a range of 1.5 Mbp to 8.2 Mbp.
Collection of data and loading to MySQL warehouse
A mySQL warehouse of proteogenomic information was created by acquiring, transforming, and loading publically available gene calls and proteomic data. The data sources for gene calls and peptides are shown in Additional file 1 and are available at http://portal.nersc.gov/dna/microbial/prokpubs/SIGS_proteogenomics/. GeneMarkS, Glimmer3, and Prodigal-2.5 in .gff format were downloaded from RefSeq public ftp site at the time of this study (Spring, 2013). These gene call coordinates were extracted from the .gff files and loaded into the warehouse. It must be noted that as a result of this study, development of a new version of GeneMarkS has started (GeneMarkS-2), however, predictions made by the new version have not been used in this study. The PNNL peptide data was also provided in .gff format, with mapping to its associated GenBank nucleotide sequence, allowing easy extraction and loading into MySQL. The non-PNNL peptide data was often not provided in .gff format, and sometimes did not have end coordinates. However, it always included individual peptide sequences, allowing each peptide to be mapped to its coordinates in the corresponding GenBank fasta file. Mapping was accomplished using a Perl script that searched for exact, unique matches in one of the six translation frames of the corresponding nucleotide sequence for the peptide. Short peptides that could not be mapped unambiguously were discarded. Unambigously mapped peptides were loaded into MySQL.
Identification and analysis of peptides conflicting with gene calls
The authors wish to thank Doug Hyatt of Oak Ridge National Laboratory for a detailed description of the algorithms employed by Prodigal as well as Mark Borodovsky (Georgia Tech) and Tatiana Tatusova (NCBI) for helpful comments and discussion. This work was performed under the auspices of the US Department of Energy’s Office of Science, Biological and Environmental Research Program, and by the University of California, Lawrence Berkeley National Laboratory under contract No. DE-AC02-05CH11231.
- Reddy TB, Thomas AD, Stamatis D, Bertsch J, Isbandi M, Jansson J, et al. The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 2014, doi: 10.1093/nar/gku950.
- Kyrpides NC. Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nat Biotechnol. 2009;27(7):627–32.View ArticlePubMedGoogle Scholar
- Tanner S, Shen Z, Ng J, Florea L, Guigó R, Briggs SP, et al. Improving gene annotation using peptide mass spectrometry. Genome Res. 2007;17:231–9.View ArticlePubMedPubMed CentralGoogle Scholar
- de Souza GA, Softeland T, Koehler CJ, Thiede B, Wiker HG. Validating divergent ORF annotation of the Mycobacterium leprae genome through a full translation data set and peptide identification by tandem mass spectrometry. Proteomics. 2009;9:3233–43.View ArticlePubMedGoogle Scholar
- Zivanovic Y, Armengaud J, Lagorce A, Leplat C, Guérin P, Dutertre M, et al. Genome analysis and genome-wide proteomics of Thermococcus gammatolerans, the most radioresistant organism known amongst the Archaea. Genome Biol. 2009;10:R70.View ArticlePubMedPubMed CentralGoogle Scholar
- Baudet M, Ortet P, Gaillard JC, Fernandez B, Guérin P, Enjalbal C, et al. Proteomics-based refinement of Deinococcus deserti genome annotation reveals an unwonted use of non-canonical translation initiation codons. Molecular Cellular proteomics: MCP. 2010;9:415–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Castellana N, Bafna V. Proteogenomics to discover the full coding content of genomes: a computational perspective. J Proteome. 2010;73:2124–35.View ArticleGoogle Scholar
- Venter E, Smith RD, Payne SH. Proteogenomic analysis of bacteria and archaea: a 46 organism case study. PLoS One. 2011;6, e27587.View ArticlePubMedPubMed CentralGoogle Scholar
- Borodovsky M, Lomsadze A. Gene identification in prokaryotic genomes, phages, metagenomes, and EST sequences with GeneMarkS suite. Curr Protoc Microbiol. 2014;32(1E):7.PubMedGoogle Scholar
- Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23(6):673–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ, et al. Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119.View ArticlePubMedPubMed CentralGoogle Scholar
- Tatusova T, Ciufo S, Federhen S, Fedorov B, McVeigh R, O'Neill K, et al. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2014;42:D553–559.View ArticlePubMedPubMed CentralGoogle Scholar
- Vizcaíno JA, Côté RG, Csordas A, Dianes JA, Fabregat A, Foster JM, et al. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res. 2013;42013, 1:D1063–1069. doi:D1063.
- Gallien S, Perrodou E, Carapito C, Deshayes C, Reyrat JM, Van Dorsselaer A, et al. Ortho-proteogenomics: multiple proteomes investigation through orthology and a new MS-based protocol. Genome Res. 2009;19:128–35.View ArticlePubMedPubMed CentralGoogle Scholar
- Clair G, Roussi S, Armengaud J, Duport C. Expanding the known repertoire of virulence factors produced by Bacillus cereus through early secretome profiling in three redox conditions. Molecular Cellular proteomics: MCP. 2010;9:1486–98.View ArticlePubMedPubMed CentralGoogle Scholar
- Pati A, Ivanova NN, Mikhailova N, Ovchinnikova G, Hooper SD, Lykidis A, et al. GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes. Nat Methods. 2010;7:455–7.View ArticlePubMedGoogle Scholar
- Yoon HS, Golden JW. Heterocyst pattern formation controlled by a diffusible peptide. Science. 1998;282:935–8.View ArticlePubMedGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.