Large-scale contamination of microbial isolate genomes by Illumina PhiX control
© Mukherjee et al.; licensee BioMed Central. 2015
Received: 21 November 2014
Accepted: 29 January 2015
Published: 30 March 2015
With the rapid growth and development of sequencing technologies, genomes have become the first resort for exploring solutions to some of the world’s biggest challenges, such as the search for alternative energy sources and the exploration of genomic dark matter. However, progress in sequencing has been accompanied by its share of errors that can occur during template or library preparation, sequencing, imaging or data analysis. In this study we screened over 18,000 publicly available microbial isolate genome sequences in the Integrated Microbial Genomes database and identified more than 1000 genomes that are contaminated with PhiX, a control frequently used during Illumina sequencing runs. Approximately 10% of these genomes have been published in the literature, and 129 contaminated genomes were sequenced under the Human Microbiome Project. Raw sequence reads are prone to contamination from various sources, and contaminants are usually eliminated during downstream quality control steps. Detection of PhiX-contaminated genomes indicates a lapse in either the application or the effectiveness of proper quality control measures. The presence of PhiX contamination in publicly available isolate genomes can result in additional errors when such data are used in comparative genomics analyses. Such contamination of public databases has far-reaching consequences in the form of erroneous data interpretation and analyses, and necessitates better measures to proofread raw sequences before releasing them to the broader scientific community.
Keywords: Next-generation sequencing; PhiX; Contamination; Comparative genomics
The ability to produce large numbers of high-quality, low-cost reads has revolutionized the field of microbiology [1–3]. Starting from a meager 1575 registered projects in September 2005, there has been a steady increase in the number of sequencing projects according to the Genomes OnLine Database (GOLD). As of November 17th 2014, there were 41,553 bacterial and archaeal isolate genome sequencing projects reported in GOLD [4, 5]. This explosion of genome sequencing projects, especially during the last 5 years, has been largely catalyzed by the development of several next-generation sequencing (NGS) platforms offering rapid and accurate genome information at a low cost. Among the different NGS technologies available commercially, the sequencing by synthesis technology championed by Illumina is the most widely used.
Despite its high accuracy, the Illumina sequencing platform does come with its share of challenges that need to be addressed by the users of this technology. One such challenge is the protocol in which PhiX is used as a quality and calibration control for sequencing runs. PhiX is an icosahedral, nontailed bacteriophage with a single-stranded DNA genome. Its genome is tiny, at 5386 nucleotides, and was the first DNA genome to be sequenced, by Fred Sanger. Because its genome sequence is small and well defined, PhiX has been commonly used as a control for Illumina sequencing runs. For the majority of its library preparations Illumina recommends using PhiX at a low concentration of 1%, which can be raised up to 40% for low-diversity samples. Depending on the concentration of PhiX used, it can be spiked into the same lane as the sample or run in a separate lane. Adding PhiX as a sequencing control necessitates subsequent quality control steps to remove its sequences so that they are not integrated into the target genome.
Here, we identify and catalog more than 1000 genomes in public databases (e.g., GenBank) that are contaminated with PhiX sequences, including the approximately 10% of these genomes that have been published in the literature. In an era where sequencing data are growing exponentially along with the need to rapidly churn out novel sequences, our report serves as a reminder that it is equally important to develop effective downstream screening and quality control measures to prevent large-scale contamination of public databases. Since preliminary analyses of initial draft sequences lead to the formulation of key scientific questions, contamination can result in misinterpretation of data and erroneous biological conclusions.
We screened the current list of isolate microbial genomes in the Integrated Microbial Genomes (IMG v 4.0) system against the PhiX genome. The nucleotide sequence of each query genome was compared against PhiX using NCBI-BLASTn, and hits with a percent identity above 90% and an e-value cutoff of 0.01 were retained. A contig was flagged as contaminated with PhiX sequences if the total length of its hit was at least 80% of the length of the contig.
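The screening criteria described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes BLASTn results in the standard 12-column tabular format (`-outfmt 6`) with the contig as the query, and a separately supplied dictionary of contig lengths. The function names, the merging of overlapping hit intervals, and the handling of values exactly at the thresholds are illustrative assumptions.

```python
from collections import defaultdict


def merged_length(intervals):
    """Count positions covered by 1-based inclusive intervals, merging overlaps."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged interval and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def flag_contaminated_contigs(blast_lines, contig_lengths,
                              min_identity=90.0, max_evalue=0.01,
                              min_fraction=0.8):
    """Return {contig_id: covered_fraction} for contigs that look like PhiX.

    Assumed columns (BLAST -outfmt 6): qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    covered = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        contig = fields[0]
        identity = float(fields[2])
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        # Assumption: boundary values (exactly 90% or 0.01) are retained.
        if identity >= min_identity and evalue <= max_evalue:
            covered[contig].append((min(qstart, qend), max(qstart, qend)))

    flagged = {}
    for contig, intervals in covered.items():
        fraction = merged_length(intervals) / contig_lengths[contig]
        if fraction >= min_fraction:
            flagged[contig] = fraction
    return flagged
```

Calling `flag_contaminated_contigs(open("hits.tsv"), lengths)` with a hypothetical results file and a length dictionary would return only those contigs whose retained hits cover at least 80% of their length. Short PhiX-like stretches inside long contigs are deliberately left unflagged by this rule.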
Table: Summary of genomes and their corresponding scaffolds contaminated with PhiX sequences. Columns: number of genomes; number of contaminated scaffolds per genome; average contaminated sequence length (bp) per scaffold; average contaminated sequence length (bp) per genome.
The presence of PhiX sequences within individual genomes first attracted our attention while we were manually curating a small number of isolate genomes. We initially took these scaffolds for an exciting biological phenomenon or the result of horizontal gene transfer, but careful analysis showed them to be nothing but sequencing artifacts. Sequencing centers generate massive amounts of data, and the sheer volume generated on a daily basis necessitates strict, well-defined, automatic quality control protocols at the source. Once released to public databases, contaminated sequences typically feed into thousands of analysis routes and can add to error propagation and incorrect hypotheses. Thus, it is extremely important to detect contaminated sequences at the source and prevent them from affecting subsequent downstream analyses.
Contamination and sequence artifacts can come from multiple sources, including but not limited to sequencing controls such as PhiX, cloning vectors, adapters, PCR primers, nucleic acid impurities present in reagents required for sample isolation and preparation, and human error. Salter et al. identified a wide range of contaminants from DNA extraction kits and other laboratory reagents affecting the outcome of culture-independent microbiota research, while Lusk detected widespread contamination in four independent high-throughput sequencing experiments. A study scanning DNA sequences from the 1000 Genomes Project identified significant contamination by Mycoplasma sequences. While DNA contamination has been a long-standing issue in research laboratories, its potential long-term implications were highlighted recently in light of developments in high-throughput sequencing and human microbiome research. A recent commentary published in Nature summarizes the problem well.
Several tools have been developed over the years for quality control of raw sequence reads, such as Phred and Sequence Scanner (specifically for first-generation sequence data), and NCBI’s VecScreen and UniVec [24, 25] to get rid of contaminants of vector origin. More recent programs have been designed for analyzing NGS data, such as TileQC, FastQC, PRINSEQ and NGS-QC; for detecting contamination, such as DeconSeq; and for providing both rapid QC and contamination filtering of NGS data, such as multi-genome alignment (MGA) and QC-Chain. Such programs are meant to prevent the release of contaminated sequences. However, our results from scanning publicly available microbial isolate genome sequences for contamination show that a large number of errors can be detected in spite of the easy availability of multiple quality control measures. The sheer volume of PhiX-contaminated genomes is alarming and calls for the implementation of stricter quality control measures, especially at large genome centers with high rates of sequence turnover.
either partial or complete mixtures of two or more strains
genomes contaminated with short fragments of two or more species
‘isolate’ genomes where a complete genome is cloned inside another
The list of such genomes is available in Additional file 4 and their nucleotide sequences are available on a JGI public ftp site. The IMG database has already implemented a quality control step to identify and remove these artifacts during data submission, and the sequence data in the system are free of PhiX contamination. We are currently in the process of cleaning up additional contaminated genomes. Most have already been removed from IMG completely or are being reinstated after their contaminated scaffolds have been cleaned up. At the same time, most of the PhiX-contaminated genomes continue to exist in other public databases such as NCBI/RefSeq or GenBank and are easily accessible to researchers around the world. While we welcome the technological advances associated with NGS platforms and acknowledge their long-term benefits, we expect principal investigators (PIs) of large-scale sequencing projects to be aware of the possible pitfalls and take corrective measures as necessary. For the genomes contaminated with PhiX sequences, we recommend that individual PIs retract the corresponding sequences, remove contaminating scaffolds, and re-upload the clean sequences to public databases.
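Removing contaminating scaffolds from an assembly before re-upload could look roughly like the following Python sketch. It assumes the assembly is a FASTA file and that the identifiers of the contaminated scaffolds are already known. The function name and the convention of taking the first whitespace-delimited token of each header as the scaffold ID are illustrative assumptions, not part of any database's submission workflow.

```python
def drop_flagged_records(fasta_lines, flagged_ids):
    """Yield FASTA lines, skipping every record whose ID is in flagged_ids.

    The record ID is assumed to be the first whitespace-delimited token
    after '>' on the header line.
    """
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            record_id = header[0] if header else ""
            keep = record_id not in flagged_ids
        if keep:
            yield line
```

In use, the generator can be streamed straight into a new file, for example `out.writelines(drop_flagged_records(open("assembly.fasta"), {"scaffold_12"}))`, where the file name and scaffold ID are placeholders.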
IMG: Integrated Microbial Genomes
HMP: Human Microbiome Project
GOLD: Genomes OnLine Database
SBS: sequencing by synthesis
This work was performed under the auspices of the US Department of Energy’s Office of Science, Biological and Environmental Research Program and by the University of California, Lawrence Berkeley National Laboratory under contract DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344 and Los Alamos National Laboratory under contract DE-AC02-06NA25396. The work conducted by the US Department of Energy Joint Genome Institute is supported by the Office of Science of the US Department of Energy under contract DE-AC02-05CH11231.
- Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, Ivanova NN, et al.: A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 2009, 462(7276):1056–60. doi:10.1038/nature08656
- MacLean D, Jones JDG, Studholme DJ: Application of ‘next-generation’ sequencing technologies to microbial genetics. Nat Rev Micro 2009, 7(4):287–96. doi:10.1038/nrmicro2088
- Rinke C, Schwientek P, Sczyrba A, Ivanova NN, Anderson IJ, Cheng J-F, et al.: Insights into the phylogeny and coding potential of microbial dark matter. Nature 2013, 499(7459):431–7. doi:10.1038/nature12352
- Pagani I, Liolios K, Jansson J, Chen IMA, Smirnova T, Nosrat B, et al.: The Genomes OnLine Database (GOLD) v. 4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2012, 40(D1):D571–9. doi:10.1093/nar/gkr1100
- Woese CR, Kandler O, Wheelis ML: Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 1990, 87:4576–9. doi:10.1073/pnas.87.12.4576
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al.: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456(7218):53–9. doi:10.1038/nature07517
- Illumina next-generation sequencing. 2014. http://www.illumina.com/technology/next-generation-sequencing.html
- Kircher M, Heyn P, Kelso J: Addressing challenges in the production and analysis of illumina sequencing data. BMC Genomics 2011, 12(1). doi:10.1186/1471-2164-12-382
- Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes JC, et al.: Nucleotide sequence of bacteriophage φX174 DNA. Nature 1977, 265(5596):687–95. doi:10.1038/265687a0
- Markowitz VM, Mavromatis K, Ivanova NN, Chen IMA, Chu K, Kyrpides NC: IMG ER: a system for microbial genome annotation expert review and curation. Bioinformatics (Oxford, England) 2009, 25(17):2271–8. doi:10.1093/bioinformatics/btp393
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–10. doi:10.1016/S0022-2836(05)80360-2
- Everett KD, Bush RM, Andersen AA: Emended description of the order Chlamydiales, proposal of Parachlamydiaceae fam. nov. and Simkaniaceae fam. nov., each containing one monotypic genus, revised taxonomy of the family Chlamydiaceae, including a new genus and five new species, and standards for the identification of organisms. Int J Syst Bacteriol 1999, 49(Pt 2):415–40.
- Skerman VBD, McGowan V, Sneath PHA: Approved lists of bacterial names. Int J Syst Bacteriol 1980, 30:225–420. doi:10.1099/00207713-30-1-225
- Page LA: Proposal for the recognition of two species in the genus Chlamydia Jones, Rake and Stearns 1945. Int J Syst Bacteriol 1968, 18:51–66. doi:10.1099/00207713-18-1-51
- Kundim BA, Itou Y, Sakagami Y, Fudou R, Yamanaka S, Ojika M: Novel antifungal polyene amides from the myxobacterium Cystobacter fuscus: isolation, antifungal activity and absolute structure determination. Tetrahedron 2004, 60(45):10217–21. doi:10.1016/j.tet.2004.09.013
- Kyrpides NC, Ouzounis CA: Whole-genome sequence annotation: ‘Going wrong with confidence’. Mol Microbiol 1999, 32(4):886–7. doi:10.1046/j.1365-2958.1999.01380.x
- Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, et al.: Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 2014, 12(1). doi:10.1186/s12915-014-0087-z
- Lusk RW: Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data. PLoS One 2014, 9(10). doi:10.1371/journal.pone.0110808
- Langdon WB: Mycoplasma contamination in the 1000 Genomes Project. BioData Mining 2014, 7(1). doi:10.1186/1756-0381-7-3
- 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al.: A map of human genome variation from population-scale sequencing. Nature 2010, 467(7319):1061–73. doi:10.1038/nature09534
- Cressey D: Contamination threatens microbiome science. Nature 2014. doi:10.1038/nature.2014.16327
- Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998, 8(3):175–85.
- Sequence Scanner. v1.0 ed. Applied Biosystems; 2012. https://products.appliedbiosystems.com/ab/en/US/adirect/ab?cmd=catNavigate2&catID=600583&tab=Overview
- The UniVec Database. 2013. http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/#Overview
- VecScreen. 2013. http://www.ncbi.nlm.nih.gov/tools/vecscreen/about/#aboutvecScreen
- Dolan PC, Denver DR: TileQC: a system for tile-based quality control of Solexa data. BMC Bioinformatics 2008, 9. doi:10.1186/1471-2105-9-250
- Andrews S: FastQC. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Schmieder R, Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27(6):863–4. doi:10.1093/bioinformatics/btr026
- Patel RK, Jain M: NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012, 7(2). doi:10.1371/journal.pone.0030619
- Schmieder R, Edwards R: Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS One 2011, 6(3). doi:10.1371/journal.pone.0017288
- Hadfield J, Eldridge MD: Multi-genome alignment for quality control and contamination screening of next-generation sequencing data. Front Genet 2014, 5. doi:10.3389/fgene.2014.00031
- Zhou Q, Su X, Wang A, Xu J, Ning K: QC-Chain: fast and holistic quality control method for next-generation sequencing data. PLoS One 2013, 8(4). doi:10.1371/journal.pone.0060234
- Additional Contamination 2014. http://portal.nersc.gov/project/m342/contamination
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.