ClaMS: A Classifier for Metagenomic Sequences
© The Author(s) 2011
Published: 30 November 2011
ClaMS (“Classifier for Metagenomic Sequences”) is a Java application for binning assembled contigs in metagenomes using user-specified training sets and initial parameters. Because ClaMS trains on sequence composition-based genomic signatures, it is much faster than binning tools that rely on alignments to homologs: ClaMS can bin ∼20,000 sequences in 3 minutes on a laptop with a 2.4 GHz Intel Core 2 Duo processor and 2 GB RAM. ClaMS is meant to be a desktop application for biologists and can be run on any machine and operating system on which the Java Runtime Environment can be installed.
Metagenome binning is the process of assigning nucleotide sequences in a metagenome to known taxonomic groups. Mapping sequences to their taxonomic groups of origin leads to better characterization of a metagenome, which in turn supports objectives such as assembling genomes from metagenomes and improving assembly and annotation. Existing binning methods can be characterized along two axes: (1) composition-based versus homology-based binning tools, and (2) ab initio unsupervised classifiers versus supervised (training-based) classifiers. In unsupervised binning, a dataset is classified into pre-existing bins trained on genomic sequences without any intervention or supervision from the user. In supervised binning, the user integrates additional known facts about the dataset into the binning process by participating in training: specifying sequences for each training bin and/or selecting the taxonomic units into which the dataset must be binned. Homology-based classifiers such as MEGAN rely on alignments of sequences to homologs and are extremely computation-intensive. For large metagenomic datasets sequenced using next-generation sequencing technologies, homology-based binning can be prohibitive in terms of time and computation. Existing composition-based binning tools (Phylopythia, TETRA) are much faster than homology-based tools, but they are mostly unsupervised, and their accuracy is limited because information about the presence and abundance of specific phylogenetic populations is not used in the binning process, even though such information, obtained from 16S rDNA amplicon analysis, is available for many metagenomic datasets. Even in the absence of rRNA amplicon analysis experiments, some knowledge of the constituent organisms of a metagenome can be obtained from a few iterations of ab initio binning.
The objective of ClaMS is to integrate this information into the binning process, thereby achieving higher binning accuracy, and to provide a desktop/laptop application that is platform-independent, fast, and easily usable by biologists.
ClaMS characterizes a sequence by a signature vector that is derived from its composition and is described as a de Bruijn chain (DBC) signature. A double-stranded DNA sequence is treated as a walk in a de Bruijn graph, and properties such as the stationary distribution of the underlying Markov chain and the strength of connectivity of various graph components to the graph are used to compute the DBC signature. The transition probability matrix of the underlying Markov chain of even a relatively short sequence can accurately predict its stationary distribution, and this property is exploited in the computation of DBC signatures. The DBC signature is highly conserved within a species while varying between species, which can be shown both mathematically and experimentally. This property also manifests at higher taxonomic levels. The DBC signature is more complex than the oligonucleotide frequency signatures used by Phylopythia and TETRA, and different from the interpolated Markov models used by Phymm. Because a DBC signature of order k incorporates information only about k-mers and (k+1)-mers in its computation, it is much faster to train. While the greater amount of information used by applications such as Phylopythia and Phymm does mean higher accuracy, ClaMS is targeted at assembled contigs with supervision from the user, and in this scenario accuracy is not compromised. Pre-computed signatures at various word lengths (2 to 4) are included with ClaMS for all finished genomes. These signatures have been computed using the taxonomy and isolate genome sequences in IMG and will be updated with each release of ClaMS or on request. Users can define training sequence sets either by clicking a node in the phylogenetic tree in the ClaMS GUI or by uploading their own FASTA files of sequences. For each sequence to be binned, its signature, which is a vector, is computed.
This signature is compared individually with the centroid signatures of all training sets and the best match is declared as the bin for that sequence.
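The general idea of signature-based, nearest-centroid binning can be illustrated with a minimal Python sketch. This is not the ClaMS implementation and not the published DBC signature: as a simplified stand-in, the signature here is only the stationary distribution of the k-mer Markov chain estimated from both strands, and the graph-connectivity component is omitted. The function names, the pseudocount on every de Bruijn edge, and the use of Euclidean distance to centroids are all assumptions made for illustration.

```python
from itertools import product
from math import dist

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def transition_matrix(seq, k=2, pseudocount=1.0):
    """Row-stochastic matrix of k-mer -> k-mer transitions on both strands."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [[0.0] * len(kmers) for _ in kmers]
    # Assumption: a pseudocount on every edge of the order-k de Bruijn graph
    # keeps the chain irreducible, so the stationary distribution is unique.
    for src in kmers:
        for base in BASES:
            counts[index[src]][index[src[1:] + base]] += pseudocount
    seq = seq.upper()
    for strand in (seq, reverse_complement(seq)):
        # Each (k+1)-mer in the sequence is one edge of the walk.
        for i in range(len(strand) - k):
            src, dst = strand[i:i + k], strand[i + 1:i + k + 1]
            if src in index and dst in index:  # skip windows with N etc.
                counts[index[src]][index[dst]] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def stationary_distribution(P, iterations=500, tol=1e-12):
    """Approximate pi with pi = pi * P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def toy_signature(seq, k=2):
    """Simplified stand-in for a composition signature (not the DBC signature)."""
    return stationary_distribution(transition_matrix(seq, k))


def centroid(signatures):
    """Component-wise mean of a list of signature vectors."""
    return [sum(col) / len(signatures) for col in zip(*signatures)]


def assign_bin(signature, centroids):
    """Return the label of the closest centroid (Euclidean distance assumed)."""
    return min(centroids, key=lambda label: dist(signature, centroids[label]))


# Made-up toy training sets; real training sets would be genome sequences.
training = {
    "AT-rich bin": ["ATATTAAATTTATAATATTTAAATATTA" * 20,
                    "TTAATATAAATTTATATTAATAAATTAT" * 20],
    "GC-rich bin": ["GCGCCGGGCCCGCGGCGCCCGGGCGCCG" * 20,
                    "CCGGCGCGGGCCCGCGCCGGCGGGCCGC" * 20],
}
centroids = {label: centroid([toy_signature(s) for s in seqs])
             for label, seqs in training.items()}
query = "TTATAAATATTTATTAAATA" * 10
print(assign_bin(toy_signature(query), centroids))
```

In this sketch each training set is reduced to a single centroid vector, so binning a contig costs one signature computation plus one distance per training set, which is consistent with the speed advantage of composition-based methods described above.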
Results and Discussion
To demonstrate the accuracy of binning using ClaMS, we binned a real metagenome and a simulated metagenome. The real metagenome, the Phrap-assembled phosphorus removal sludge metagenome (SLU) sampled from a laboratory-scale bioreactor (IMG/M, taxon OID: 2000000000), is 56.6 million bases long, has 60.45% GC content, and contains 31,742 assembled contigs. The simulated metagenome, the assembled medium-complexity simMC dataset from FAMeS, has 15,109 non-chimeric contigs of 1000 bases or longer that were candidates for binning using ClaMS. We evaluated the results using cross-validation of the binned contigs. In the case of simMC, the correct bins of the contigs were already known for cross-validation; in the case of SLU, best hits from BLAST alignment were used to cross-validate bins.
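One way such a comparison between predicted and reference bins could be scored is sketched below. The metrics (overall accuracy, per-bin precision and sensitivity), the function name, and the four-contig toy data are assumptions for illustration only; they are not the evaluation procedure or the results of this study.

```python
from collections import Counter


def evaluate_binning(predicted, reference):
    """Compare predicted bins with reference bins, contig by contig.

    Both arguments map contig IDs to bin labels and must cover the
    same non-empty set of contigs.
    """
    if not reference or predicted.keys() != reference.keys():
        raise ValueError("predicted and reference must cover the same contigs")
    correct = Counter()
    predicted_total = Counter()
    reference_total = Counter()
    for contig, ref_bin in reference.items():
        pred_bin = predicted[contig]
        predicted_total[pred_bin] += 1
        reference_total[ref_bin] += 1
        if pred_bin == ref_bin:
            correct[ref_bin] += 1
    accuracy = sum(correct.values()) / len(reference)
    per_bin = {}
    for label in set(predicted_total) | set(reference_total):
        per_bin[label] = {
            # Fraction of contigs placed in this bin that belong there.
            "precision": (correct[label] / predicted_total[label]
                          if predicted_total[label] else 0.0),
            # Fraction of contigs belonging to this bin that were recovered.
            "sensitivity": (correct[label] / reference_total[label]
                            if reference_total[label] else 0.0),
        }
    return accuracy, per_bin


# Hypothetical toy labels, not data from SLU or simMC.
reference = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
predicted = {"c1": "A", "c2": "B", "c3": "B", "c4": "B"}
accuracy, per_bin = evaluate_binning(predicted, reference)
print(accuracy, per_bin)
```

For a simulated dataset the reference labels would be the known source genomes; for a real metagenome they would be a proxy such as best alignment hits, so the scores measure agreement with that proxy rather than with ground truth.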
ClaMS can run in command-line mode, which makes it convenient to include in processing pipelines and large-scale batch-processing jobs. Screenshots of the ClaMS user interface and a demonstration of its usage, including visualization of results, are available at http://clams.jgi-psf.org. The user-friendly interface, built-in taxonomy browser, bundled genomic signatures, and fast computations make ClaMS an ideal desktop supervised binning application for biologists.
ClaMS was developed under the auspices of the US Department of Energy Office of Science, Biological and Environmental Research Program and by the University of California, Lawrence Berkeley National Laboratory under contract DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 and Los Alamos National Laboratory under contract DE-AC02-06NA25396.
- Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res 2007; 17:377–386. doi:10.1101/gr.5969107
- McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007; 4:63–72. doi:10.1038/nmeth976
- Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO. A web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 2004; 5:163. doi:10.1186/1471-2105-5-163
- Heath LS, Pati A. Genomic signatures in de Bruijn chains. WABI 2007, LNBI 4645, 216–227.
- Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 2009; 6:673–676. doi:10.1038/nmeth.1358
- Markowitz VM, Ivanova NN, Szeto E, Palaniappan K, Chu K, Dalevi D, Chen IM, Grechkin Y, Dubchak I, Anderson I, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res 2007; 36:D534–D538. doi:10.1093/nar/gkm869
- Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods 2007; 4:495–500. doi:10.1038/nmeth1043
- García Martín H, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, Yeates C, He S, Salamov AA, Szeto E, et al. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol 2006; 24:1263–1269. doi:10.1038/nbt1247
- Comas I, Moya A, Azad RK, Lawrence JG, Gonzalez-Candelas F. The Evolutionary Origin of Xanthomonadales Genomes and the Nature of the Horizontal Gene Transfer Process. Mol Biol Evol 2006; 23:2049–2057. doi:10.1093/molbev/msl075