The JCVI standard operating procedure for annotating prokaryotic metagenomic shotgun sequencing data
© The Author(s) 2010
Published: 30 April 2010
The JCVI metagenomics analysis pipeline provides for the efficient and consistent annotation of shotgun metagenomics sequencing data for sampling communities of prokaryotic organisms. The process can be equally applied to individual sequence reads from traditional Sanger capillary electrophoresis sequences, newer technologies such as 454 pyrosequencing, or sequence assemblies derived from one or more of these data types. It includes the analysis of both coding and non-coding genes, whether full-length or, as is often the case for shotgun metagenomics, fragmentary. The system is designed to provide the best-supported conservative functional annotation based on a combination of trusted homology-based scientific evidence and computational assertions and an annotation value hierarchy established through extensive manual curation. The functional annotation attributes assigned by this system include gene name, gene symbol, GO terms , EC numbers , and JCVI functional role categories .
KeywordsJ. Craig Venter Institute prokaryotic shotgun metagenomics environmental sequencing functional annotation Global Ocean Sampling Sargasso Sea
Shotgun metagenomics sequencing datasets are among the most challenging types of biological information to successfully handle from the perspectives of both size and complexity . Nonetheless, they represent the best method available for sampling the largely uncharacterized diversity of microbes (and their array of functional genes) present in environmental samples [2,3]. In this context, the term “environmental” includes both traditional (e.g., soil, water, air) and host-based (e.g., oral biofilm, distal gut) samples. Moreover, this technique greatly enables the study of uncultivated organisms, which represent the vast majority of life in a number of biomes, including an astonishing 99% in soil .
The issue of how to handle this diversity and complexity became especially pressing at the time of JCVI’s Sargasso Sea and Global Ocean Survey expeditions [1,5], which each produced and required the analysis of several million Sanger reads. Although these data sets were revolutionary in their size at the time, the switch from Sanger to next-generation sequencing technologies for the study of environmental complexity has caused metagenomics data sets to become exponentially larger and more complex [6,7]. Using the JCVI prokaryotic metagenomics analysis pipeline, we routinely process data an order of magnitude larger than the original Sargasso collection.
In order to maximize scientific flexibility, the prokaryotic metagenomics pipeline, as reported here, is compartmentalized into structural and functional annotation components, which can be run together or separately. It is designed to process data sets of the scale of tens of millions of sequencing reads given JCVI’s current computing resources, and is scalable to even larger data sets.
There are a number of systems other than the JCVI metagenomics pipeline designed for the purpose of rapid and consistent functional annotation of environmental shotgun sequencing data, including MG-RAST  and IMG/M .
The JCVI prokaryotic metagenomics pipeline is designed for significant input flexibility. Gene finding, which is referred to in this paper as structural annotation, requires as input a multi fasta file containing nucleotide sequence, while the functional annotation component accepts multi fasta inputs of peptide sequence. The various structural and functional annotation activities also rely on the presence of sequence, profile, and HMM databases (e.g., Pfam, TIGRFAM) for comparison, as described in the appropriate sections below.
Structural annotation results in the identification of the most probable proteins or fragments thereof, present in nucleotide sequence data. It also produces lists of ncRNAs. Functional annotation, the assignment of functional attributes to putative protein sequences, is derived based on a value hierarchy established via homology to the corpus of available resources using BLAST , RPS-BLAST , HMM , and other homology search algorithms. The primary attributes assigned by the functional annotation component are: gene name, gene symbol, GO terms , EC numbers , and JCVI functional role categories . The EC system is a numerical classification scheme based on the chemical reaction(s) that a specific protein catalyzes , while GO terms seek to impose structured controlled vocabularies to describe molecular functions, biological processes, and cellular components . JCVI role categories are a two-level functional classification system of assignments for protein cellular function .
Third party tools, cutoffs, and parameters used in this pipeline
tRNAscan-SE -q -b -G
ncRNA finder stage 1
blastall -p blastn -i -d -e 0.1 -F ″T″-b 1 -v 1 -z 3000000000 -W 9
ncRNA finder stage 2
blastall -p blastn -i -d -e 1e-4 -F ″m L″-b 1500 -v 1500 -q -5 -r 4 -X 1500 -z 3000000000 -W 9 -U T
blastall -p blastp -v 10 -b 10 -X 15 -e 1e-5 -M BLOSUM62 -J F -K 10 -f 11 -Z 25.0 -W 3 -U F -I F -E -1 -y 7.0 -G -1 -A 40 -Y 0.0 -F ″T″ -g T -d -i -o -z 1702432768 -m 7
rpsblast -i -d -m 8 -e 1e-10
The function of this component is to identify the best possible open reading frames from the metagenomics shotgun sequencing reads. This is performed in full knowledge that the putative proteins identified are likely to be fragments of the full-length protein — and as such the beginning and end of each read are treated as putative start and stop sites. This process can be run with the “clear-range” mode either on or off; the former mode is useful primarily for Sanger data. In this case, only the region of each trace specified by the clear-range information in the fasta header is used in the analysis. Clear-ranges are established using the base correctness metric “quality values” [16,17].
Prior to identification of protein coding genes in the sequence data, ncRNAs must first be found and masked. This is accomplished through a pair of processes, tRNAScan-SE  and a set of two increasingly stringent BLAST  searches performed against a JCVI’s internal reduced-complexity rRNA database (Table 1). The latter contains a representative sampling of known 5S, 16S, 18S, and 23S rRNA sequences. In both cases, the identified regions are “soft-masked”, and the ncRNAs written to separate output multi fasta files for downstream rRNA analyses. Soft-masking is a method used by the structural annotation pipeline to prevent information loss at the read level while allowing certain zones of the fasta record to be differentiated. Converting the region of the sequence containing ncRNAs to lower case letters transfers enough information downstream to exclude ncRNAs before putative proteins are identified.
The resulting soft-masked sequences are subjected to a three step putative protein identification process. Step 1 is a naïve 6-frame translation that identifies each possible ORF with a minimum size of 180 nucleotides. This cutoff was selected based on the expected size of typical bacterial genes (∼900 bp, unpublished), such that a reasonable fraction (∼20%) of a putative gene is the minimum for annotation to proceed. Note that smaller cutoffs can be used as needed, such as in the case of viral metagenomics processing. Each run of the pipeline requires a user specified codon table that determines the length and actual amino acid sequence of each ORF. ORFs are defined as the longest possible frame from start to stop. The beginning and end of each sequence record is treated as both stop and start, for purposes of maximizing the sensitivity of the system for fragments. Step 2 requires the use of MetaGeneAnnotator , an ab initio gene predictor tool which uses empirical data including sequence base composition, distance, and orientation of genes of completely sequenced genomes to identify open reading frames. Step 3 consists of using the putative proteins identified in step 2 to “tag” the ORFs found in step 1 — those overlapping the nucleotide space of the MetaGene calls are defined as the most likely proteins, even if they extend past the defined MetaGene prediction boundary. This process produces set of the longest possible putative proteins from sequence data. The output produced is a multi fasta file of putative peptides for functional annotation. Overlaps between ncRNAs and putative proteins are allowed if they compose less than 30 of the 180 nucleotide minimum (i.e., 150 unmasked nucleotides required for structural annotation). The results of the JCVI structural annotation pipeline are frequently supplemented by putative proteins identified through incremental clustering processing of the same sequence data . The putative proteins are processed by the functional annotation component.
Unlike MG-RAST, our pipeline does not confine itself to BLASTX based peptide identification using a defined dataset, and as a result has a larger yield of putative proteins for an equivalent number of input reads. IMG/M, on the other hand, takes a two-tiered approach, with the proteins in reads assessed directly using RPS-BLAST against Pfam and COG  databases, and indirectly through BLASTX-derived “proxygenes” . Both IMG/M and MG-RAST thus consolidate the structural and functional searches into a single step, while the JCVI system obtains additional flexibility by separating them.
Functional annotation, the assignment of the most probable biological role for a given peptide, occurs in two phases: the collection of a wide array of information for each putative protein (“data collection components”), and the application of an annotation value hierarchy to that information corpus. In such a way, each putative protein is given annotation that can be conservatively supported by the available collection of homology-based scientific knowledge. For all the steps below, the pipeline provides raw outputs that may be used for downstream analysis. Note that the data collection components operate independently and in parallel, but the value hierarchy operation must wait on completion of all the data collection components to proceed to functional assignment.
BLASTP evidence classes.
BLAST Hit Class
35% identity or greater, across 85% or more of the length
less than 35% identity, but across 80% or more of the length
35% identity or greater, but across less than 80% of the length
less than 35% identity across less than 80% of the length
HMM hit isotype classes.
HMM Hit Class
All proteins scoring above the trusted cutoff have the exact same function
All domains scoring above the trusted cutoff have the same function; can be part of a multi-function protein
Defines a region of homology that may or may not have a known function, and need not be the full length of the polypeptide
Hits in this category represent full-length homology within a subgroup comtained within a gene family
This defines a set of proteins with full-length homology of sequence and domain architecture, but not necessarily the same function
Hypothetical - isotype
PFAM model cannot be assigned
The third data collection component involves a RPS-BLAST  against the PRIAM database  of metabolic enzymes. This is primarily for the purpose of assigning EC#s in those cases without a TIGRFAM hit, which would otherwise take precedence. The cutoff used is 1e-10 (Table 1).
The final data collection component involves searches for lipoprotein motifs and transmembrane helices in the putative proteins. The former is accomplished using a regular expression search in the amino acid sequence, while the latter is performed using TMHMM , a HMM-based search for transmembrane motifs (Table 1). These two searches represent annotation states that fall well short of complete functional annotation (e.g., “putative lipoprotein”), but are more informative than the absence of any functional annotation.
Metagenomics annotation hierarchy.
TIGRfam Hypothetical Equivalog
Pfam Hypothetical Equivalog
TIGRfam Hypothetical Equivalog Domain
TIGRfam Subfamily Domain
Pfam Equivalog Domain
Pfam Hypothetical Equivalog Domain
Pfam Subfamily Domain
Panda BLASTP High Confidence
Panda BLASTP Putative
Panda BLASTP Conserved Domain
Putative proteins without any evidence but identified by MetaGeneAnnotator  are classified as “hypothetical”. This set of non-conserved hypothetical peptides usually constitutes at least 30% of all putative proteins in a dataset.
Output format of the JCVI prokaryotic metagenomics functional annotation multi fasta file header, with example entries.
JCVI PEP 5160785.1
Common Name Section Starts
glutamine synthetase, catalytic domain
Common Name Evidence
Gene Symbol Section Starts
Gene Symbol Evidence
RF|YP 266724.1|71084004|NC 007205
GO Section Starts
GO:0004356 // GO:0006542
GO Term Evidence
PF00120 // PF00120
EC Section Starts
TIGR Role Section starts
Tigr Role Id
Tigr Role Evidence
The prokaryotic metagenomics pipeline is divided into structural and functional annotation components, which can be run together or separately. The underlying codebase itself is divided in two distinct sections. The first section is written in Java, and leverages JCVI’s high-throughput computing platform, which provides a framework for scalable and robust implementations of data analysis pipelines. This software platform itself is built on top of JBoss - an open-source JEE server designed to allow for easy failover setup and clustering. Layered design, strict adherence to standard interfaces such as JMS, EJB, Web Services, and high level of adoption of standard open source packages (e.g., Hibernate, DRMAA) ensures platform stability and ease of integration with other software packages. Other built-in capabilities include a robust workflow subsystem, grid integration tier, and a growing set of bioinformatics tools implemented to scale to modern data and compute requirements. The remaining portion of this application lies within the PERL codebase, which includes parsers of all the raw results and the execution of the hierarchical annotation algorithm, all of which are invoked from within the Java portion.
The progress of metagenomics as a field requires both that consistent, high quality functional annotation be achievable in a timely manner, and that new computational methods to improve those functional assignments be incorporated. The JCVI prokaryotic metagenomics analysis pipeline provides an efficient system for identifying and functionally classifying the proteins present in shotgun metagenomics sequencing data for sampling communities of prokaryotic organisms. This prokaryotic pipeline complements ongoing activity in viral metagenomics annotation also underway at JCVI.
It is JCVI’s intention to continually upgrade this tool to take advantage of the not only the newest versions of existing resources, but to incorporate new resources and technologies (e.g., cloud computing) as they become available. It is relevant to note that reproducibility of the results produced by this system depends substantially on the versions of the databases underlying the pipeline (e.g., PANDA, Pfam). As these data resources are iteratively updated over time with newer versions, both a net improvement in functional assignments and cumulative decrease in comparability between older and newer data sets are expected.
It is the intention of JCVI to make this resource available to the scientific community in the near future.
J. Craig Venter Institute
Global Ocean Sampling
Hidden Markov Model
Basic local alignment search tool
Protein and Nucleic Acid Database
open reading frame
Cluster of Orthologous Genes
non coding RNA
Extensible Markup Language
This work was performed with support from the US Department of Energy grant #DE-FG02-02ER63453, the NIH Human Microbiome Project grant #1 U54 AI84844-01, and the Gordon and Betty Moore Foundation (GBMF).
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science 2004; 304:66–74. PubMed doi:10.1126/science.1093857View ArticlePubMedGoogle Scholar
- Eisen JA. Environmental shotgun sequencing: Its potential and challenges for studying the hidden world of microbes. PLoS Biol 2007; 5:384–388. doi:10.1371/journal.pbio.0050082View ArticleGoogle Scholar
- Chen K, Pachter L. Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLOS Comput Biol 2005; 1:106–112. PubMed doi:10.1371/journal.pcbi.0010024View ArticlePubMedGoogle Scholar
- Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stahl DA. Microbial ecology and evolution: a ribosomal RNA approach. Annu Rev Microbiol 1986; 40:337–365. PubMed doi:10.1146/annurev.mi.40.100186.002005View ArticlePubMedGoogle Scholar
- Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, et al. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 2007; 5:e16. PubMed doi:10.1371/journal.pbio.0050016PubMed CentralView ArticlePubMedGoogle Scholar
- Edwards RA, Rodriguez-Brito B, Wegley L, Haynes M, Breitbart M, Peterson DM, Saar MO, Alexander S, Alexander EC, Jr., Rohwer F. Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics 2006; 7:57. PubMed doi:10.1186/1471-2164-7-57PubMed CentralView ArticlePubMedGoogle Scholar
- Anonymous. Metagenomics versus Moore’s law. Nat Methods 2009; 6:623. doi:10.1038/nmeth0909-623View ArticleGoogle Scholar
- Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, et al. The metagenomics RAST server — a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 2008; 9:386. PubMed doi:10.1186/1471-2105-9-386PubMed CentralView ArticlePubMedGoogle Scholar
- Markowitz VM, Ivanova N, Palaniappan K, Szeto E, Korzeniewski F, Lykidis A, Anderson I, Mavromatis K, Kunin V, Garcia Martin H, et al. An experimental metagenome data management and analysis system. Bioinformatics 2006; 22:e359–e367. PubMed doi:10.1093/bioinformatics/btl217View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215:403–410. PubMedView ArticlePubMedGoogle Scholar
- Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001; 29:2994–3005. PubMed doi:10.1093/nar/29.14.2994PubMed CentralView ArticlePubMedGoogle Scholar
- Eddy SR. Profile hidden Markov models. Bioinformatics 1998; 14:755–763. PubMed doi:10.1093/bioinformatics/14.9.755View ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000; 25:25–29. PubMed doi:10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Webb EC. Enzyme Nomenclature. Commission TE, editor. San Diego, California: Academic Press; 1992.Google Scholar
- Riley M. Functions of the gene products of Escherichia coli. Microbiol Rev 1993; 57:862–952. PubMedPubMed CentralPubMedGoogle Scholar
- Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998; 8:175–185. PubMedView ArticlePubMedGoogle Scholar
- Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998; 8:186–194. PubMedView ArticlePubMedGoogle Scholar
- Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997; 25:955–964. PubMed doi:10.1093/nar/25.5.955PubMed CentralView ArticlePubMedGoogle Scholar
- Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res 2008; 15:387–396. PubMed doi:10.1093/dnares/dsn027PubMed CentralView ArticlePubMedGoogle Scholar
- Yooseph S, Li W, Sutton G. Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering. BMC Bioinformatics 2008; 9:182. PubMed doi:10.1186/1471-2105-9-182PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003; 4:41. PubMed doi:10.1186/1471-2105-4-41PubMed CentralView ArticlePubMedGoogle Scholar
- Dalevi D, Ivanova NN, Mavromatis K, Hooper SD, Szeto E, Hugenholtz P, Kyrpides NC, Markowitz VM. Annotation of metagenome short reads using proxygenes. Bioinformatics 2008; 24:i7–i13. PubMed doi:10.1093/bioinformatics/btn276View ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG. Performance evaluation of amino acid substitution matrices. Proteins 1993; 17:49–61. PubMed doi:10.1002/prot.340170108View ArticlePubMedGoogle Scholar
- Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res 2007; 17:377–386. PubMed doi:10.1101/gr.5969107PubMed CentralView ArticlePubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, et al. The Pfam protein families database. Nucleic Acids Res 2004; 32(Database issue):D138–D141. PubMed doi:10.1093/nar/gkh121PubMed CentralView ArticlePubMedGoogle Scholar
- Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res 2003; 31:371–373. PubMed doi:10.1093/nar/gkg128PubMed CentralView ArticlePubMedGoogle Scholar
- Davies K. CLC bio satisfies Next-Gen Bioinformatics Cravings. Bio-IT World 2009 (September 23).
- Claudel-Renard C, Chevalet C, Faraut T, Kahn D. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res 2003; 31:6633–6639. PubMed doi:10.1093/nar/gkg847PubMed CentralView ArticlePubMedGoogle Scholar
- Sonnhammer EL, von Heijne G, Krogh A. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol 1998; 6:175–182. PubMedPubMedGoogle Scholar
- Davidsen T, Beck E, Ganapathy A, Montgomery R, Zafar N, Yang Q, Madupu R, Goetz P, Galinsky K, White O, Sutton G. The comprehensive microbial resource. Nucleic Acids Res 2010; 38(Database issue):D340–D345. PubMed doi:10.1093/nar/gkp912PubMed CentralView ArticlePubMedGoogle Scholar
- Overbeek R, Disz T, Stevens R. The SEED: A peer-to-peer environment for genome annotation. Commun ACM 2004; 47:46–51. doi:10.1145/1029496.1029525View ArticleGoogle Scholar