Quality scores for 32,000 genomes
© Land et al.; licensee BioMed Central Ltd. 2014
Received: 27 June 2014
Accepted: 4 November 2014
Published: 8 December 2014
More than 80% of the microbial genomes in GenBank are of ‘draft’ quality (12,553 draft vs. 2,679 finished, as of October, 2013). We have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences.
Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have nearly all the 102 essential genes.
The score can be used to set thresholds for screening data when analyzing “all published genomes” and reference data is either not available or not applicable. The scores highlighted organisms for which commonly used tools do not perform well. This information can be used to improve tools and to serve a broad group of users as more diverse organisms are sequenced. Unexpectedly, the comparison of predicted tRNAs across 15,000 high quality genomes showed that anticodons beginning with an ‘A’ (codons ending with a ‘U’) are almost non-existent, with the exception of one arginine codon (CGU); this has been noted previously in the literature for a few genomes, but not with the depth found here.
KeywordsDNA Sequencing Database Quality Evaluation Status
The introduction of second-generation sequencing began an exponential growth in sequencing data [1–4] and in the number of genomes submitted to public repositories. The drop in sequencing cost that came with this technology, however, had little effect the mostly manual cost of finishing genomes. Finishing second-generation sequenced genomes continues to be expensive and many researchers have no plans to finish their draft genomes . There is still an open question of whether whole genome sequencing projects with less than 5% of the genes missing is adequate quality for most purposes  or if there continues to be value in finishing most microbial genomes . Even though single molecule, or ‘third-generation’ sequencing will facilitate the generation of closed genomes, currently most of the genomes in the database are of varying levels of draft quality.
The establishment of a quality nomenclature by Chain et al. in 2009  provides a mechanism for comparing draft sequences and understanding the qualifiers associated with a single genome sequence. It does not, however, shine any light on the impact that predominately draft genomes have on the quality of the repository databases. With more than 30,000 unique publicly available genome sequences of varying qualities, there is enough data to score genomes on the basis of completeness and compare quality among data sources.
DNA sequences were obtained from two sources at GenBank and the National Center for Biotechnology Information : draft genomes (WGS or ‘draft’) and complete finished genomes (‘complete’). An assembled version of the GenBank Sequence Read Archive was obtained for analysis . Despite major overlaps, three additional data sources, the Broad Institute , Pathosystems Resource Integration Center , and the United States Department of Energy (DOE) Systems Biology Knowledgebase (http://kbase.us), were acquired because they contained additional unique genome sequences. The DNA sequences were then scored for completeness.
Estimates of genome quality were based on 1.) the sequence quality (number of contigs and number of non-standard bases); 2.) the presence of a full-length 5S, 16S, and 23S rRNA; 3.) the presence of at least one tRNA coding for all of the 20 standard amino acids; and 4.) the presence of a set of essential genes containing 102 conserved Pfam-A  domains found in nearly all bacteria and archaea (Additional file 1: Table S1). Software tools were either selected or developed to provide an estimate for each of these measures of completeness. The four individual scores ranged from zero to one and they were averaged for a combined score. The data sources and calculation of the scores are described in more detail in the Methods section.
To keep all scores comparable, we ran standard predictions using the same settings across all genomic DNA sequences; tRNAscan-SE  was used to predict tRNAs, RNAmmer  was used to predict rRNAs, and Prodigal  was used to predict protein coding genes in all the acquired sequences. HMMER3  was used to find Pfam-A  domains. We chose not to use the predictions from the source databases because consistent annotation was not available for all sequences and the resultant scores would have been a reflection of the source annotation and not just the completeness of the sequence. A score for annotation quality may be added to future versions of this scoring system.
Results and discussion
Counts of total and unique genomes acquired from each data source
Unique to this source
GenBank complete finished
GenBank SRA - assembled
As noted in the discussion, factors other than data source can influence the score. With over 20,000 genomes, it is possible to analyze some of these factors and highlight a few intriguing observations.
Sequence quality score
The sequence quality is a function of the number of contigs per megabase (counting N's as gaps) and the number of non-standard bases per genome. The sequence quality scores varied between the different sources (Figures 1a and b and Additional file 1: Table S2). Surprisingly, about 2% of the ‘complete’ genomes did not have perfect scores, while 3% of the WGS genomes received perfect scores (0.99 or better), despite being ‘draft’. As might be expected, the collection of SRA genomes had lower quality scores, with a maximum sequence quality of 0.88 and an average of 0.38. All other databases had an average score of 0.75 or better. The SRA genomes scored low enough on sequence quality that additional analysis was not done on these genomes.
The number of contigs per genome ranged from 1 to 13,915 (Additional file 2: Figure S1 and Additional file 1: Table S3). The average for the ‘complete’ genomes was about 5 replicons per genome and the average contig counts for ‘draft’, KBase, PATRIC, and Broad were 190, 130, 151, and 48, respectively. The results included a ‘complete’ genome with 930 contigs and 70 ‘draft’ genomes that contained a single contig/scaffold.
tRNA score and anticodons
Most genomes scored well with respect to having at least one tRNA codon for each amino acid (Figure 1c and Additional file 1: Table S4). Even among the WGS genomes, nearly 60% of the genomes had perfect scores for tRNA genes (all 20 types). As expected, the ‘complete’ genomes scored better than other data sources, although 7 of these genomes had a score of 0.1 (9 or more missing tRNAs). Among the low-scoring ‘complete’ genomes was Pyrobaculum calidifontis JCM 11548, Thermoproteus uzoniensis 768-20, and two “Candidatus Tremblaya princeps”. Factors that may contribute to a low score are: 1) P. calidifontis and T. usoniensis are part of the 0.1% of genomes that have 9 or more predicted pseudo tRNA genes not included in the calculations and 2) the Candidatus genomes are endosymbionts with less than 140 kbp DNA in their chromosome.
The number of tRNAs per genome were compared across the data sources (Additional file 2: Figure S2 and Additional file 1: Table S5). These did not affect the score but provided some interesting observations. The maximum number found was 280 for Escherichia coli HVH 33 (4-2174936). This genome was found at both Broad and WGS and the number of tRNAs was high compared to the average of 79 +/- 13 for the other 1500 E. coli genomes. The number of rRNA molecules for this genome is in line with other Escherichias (5 23S and 7 16S), and it is likely that this is a single genome.
Comparison of 3 selected genera and the number of genomes with rare anticodons found in the analysis
Number in study
Number with rare anticodons
rRNA score and length distribution
The lengths of the rRNA molecules were compared across data sources (Additional file 2: Figures S4-S7 and Additional file 1: Tables S10-S13). In our scoring scheme, the lowest possible score for rRNAs was 0.1. All genomes scored 0.3 or better (Figure 1d and Additional file 1: Table S9) and 5.7% of the ‘complete’ genomes scored below 0.9.
Some of the extremely long 16S predictions were investigated. The predictions of 7 genomes with extremely long rRNA genes were compared to the annotations at GenBank. Two had the same predictions in GenBank, four had more reasonable predictions at GenBank, and one genome was missing all 16S predictions, including an abnormally long one predicted by RNAmmer.
Most genomes with abnormally long 16S rRNA (greater than 2300 bases) fell into one of the following categories, 1) the DNA encoding the rRNA gene contained Ns, 2) the genome contained one or more “normal” predictions, or 3) the genome was less than 40% GC. The first is an indicator of a problem in the sequencing, the second may indicate an atypical region of the genome, and the third may be a weakness of the RNAmmer tool.
Essential gene score
Protein-coding genes were predicted for all genomes using Prodigal  and average gene length and density were calculated (Additional file 2: Figures S8-S9 and Additional file 1: Tables S14-S15). The average gene length was expected to be slightly less than 1000. The Broad genes had the longest average length at 940 bases and the PATRIC had the shortest average length at 902 bases. WGS contained the individual genome with the shortest average length (200 bases) while WGS, KBase, and PATRIC all had the genome with the longest average length (1291 bases).
The distribution of essential gene scores shows that all of the data sources are very similar (Figure 1e and Additional file 1: Table S16). The set of ‘universally conserved domains’ was surprisingly well conserved, being found in nearly all of the more than 20,000 genomes. The percent of genomes with perfect scores of 1 ranged from 96% for WGS to 99.9% for Broad. The Pfam-A model that was missed the most often is the 60-65 residue ribosomal protein S14p. This may be a reflection of the inability of the gene finder to find the genes rather than missing domains. The protein predictions for Clostridium clariflavum DSM 19732 and Clostridium thermocellum DSM 1313 were each missing a S14p domain-containing protein. A 6-frame translation search of the genomes revealed that a sequence matching the model was in the DNA.
Total combined score
The total score was calculated by averaging the other normalized four scores (Figure 1f and Additional file 1: Table S17). The ‘complete’ genomes had an average score of 0.97 and the average score was 0.85 for the WGS draft genomes. Although only 6% of the genomes had a perfect quality score, most (~88%) of the genomes had quality scores of 0.8 or better. At the other extreme, about 3% of the genomes had a score below 0.6 and probably have too low a quality to yield reliable analysis. This score corresponds to more than 1000 contigs.
While the ‘complete’ genomes were the best scoring on average, there are a few low quality genomes in this database. Among them are Borrelia valaisiana VS116 (0.27) and Bacillus anthracis str. A2012 (0.27). The draft WGS genomes on average have lower scores, although a 28 genomes in the WGS dataset scored a perfect 1.0. To achieve a perfect score, the sequence must be in one contiguous piece and contain no runs of ‘N’ greater than 10 bases.
The data, algorithm and score cards for all the genomes are accessible from our website . The results of the study can be downloaded from the results page of our website.
The data were examined to identify underlying factors that may have contributed to the score. From GenBank files and the PATRIC web site it was possible to gather the sequencing technology, the assembly method, coverage, and update date for many of the genomes. From Broad it was possible to gather a sequencing technology and coverage for many genomes. It was not possible to entirely account for the effect of read length, experience of the researchers, all version changes, the wide disparity in the number of available genomes, or the fact that the information was not available for most early genomes. Care should be used when drawing conclusions from the data.
The analysis by date showed that older genomes were predominately complete and tended to score better than newer draft genomes (Additional file 2: Figure S10). The data is consistent with graphs showing the differences between complete and draft genomes (Figure 1).
In several histograms, Broad appears to be an outlier. For example, in Additional file 2: Figure S9, Broad has a higher percentage of genomes between .9 and 1.0 genes per kilobase than any other source. This was investigated and the taxonomic makeup of the Broad is also an outlier. Eighty percent of the Broad genomes belong to 8 genera (Escherichia, Enterococcus, Staphylococcus, Brucella, Acinetobacter, Mycobacterium, Ba.cil.lus and Pseudomonas), compared to 34% in other sources (Additional file 1: Table S20).
A final conclusion from this scoring review is that widely used analysis tools performed well most of the time, but each had a point where they seemed to miss the mark. tRNAscan predicted pseudogenes rather than real genes in some genomes, RNAmmer predicted unrealistically long genes under some circumstances, and Prodigal occasionally missed a few valid small genes instead of lots of small false-positive genes.
The normalization process did not change the scores by very much. Each investigator’s preferences for a score cutoff will have greater impact on the assignment of a “good” vs a “bad” score. For future use, it is recommended that the standardization step be skipped for analysis of a single genome.
Automated annotation should be checked for the minimal quality components presented here. The annotation may need to be manually edited to compensate for the rare occasions when the tools give misleading or missing results.
Genome sequences are used by researchers in several fields to answer many different types of scientific questions. The score presented in this work is one metric among many that can be used even though none are suitable in all circumstances. For example, when comparing assembly methods and/or strains within the same species, well established measures such as N50 or use of a reference sequence will be more targeted and specific than a single score. Also, the score cannot address errors associated with sample preparation, contamination, or misidentification of the genus, species and/or strain. A very high quality sequence can lead to a flawed analysis if it is incorrectly identified.
Obtaining the data
Complete finished GenBank genomes were obtained from the NCBI ftp site . GenBank and Fasta files were extracted from each subdirectory.
Draft GenBank genomes were obtained from the NCBI WGS ftp directory . All projects labeled as belonging to the BCT division were downloaded and GenBank and Fasta files were stored for each of them.
Broad genomes were obtained from the Broad Olive web site . Fasta files were downloaded for each Broad project. Genomes were divided by sequencing status.
PATRIC genomes were obtained from the PATRIC ftp site . Fasta files were downloaded for each PATRIC genome. Genomes were divided by sequencing status.
KBase genomes were downloaded using the API commands all_entities_Genome to get the master list of Bacterial and Archaeal genomes. The API commands genomes_to_contigs and contigs_to_sequences were used to download the sequences for the genomes. KBase does not provide a sequencing status for genomes.
Additional data repositories could have been considered for this paper. It was assumed that repositories that attempt to include “all possible genomes” would have largely overlapping sets of genomes. This was supported by the data in Table 1. Only the data sets with a unique focus, such as the Sequence Read Archive and the Broad (selected genera), had a large percentage of unique genomes. The scoring system could be applied to any repository that routinely finds rRNAs, tRNAs, and functional domains for its genomes.
Definition of unique genomes
Identification of different assemblies for the same genome sequence is difficult, as strain meta-data can be missing or slightly different. For this analysis it was necessary to automatically determine unique genome sequences, due to the number of genomes. Name matching is problematic because names change with time and data sources can represent strain names with different syntax. Even if the data sources had names of other identifiers that linked their data to one or more of the other sources, there is the issue of ensuring they are the same version of the genome.
While not perfect, an MD5 checksum algorithm was used to identify duplicate assemblies. Minor differences in assemblies or treatment of gaps produce different checksums and therefore two genomes that are for all practical purposes the same, appear to be different. Most data sources had at least one internal duplicate using the MD5 checksum.
The MD5 hex checksum for a genome was calculated by first creating an MD5 checksum of all component contigs. These checksums were sorted and concatenated into a new string with comma separators and no spaces. The checksum of the genome was the checksum of this new string.
Genomes were sorted by size and the names of the largest and smallest genomes were examined. Genomes with a total size of less than 138,500 bp were found to be plasmids and genomes greater than 18,000,000 bp were found to be eukaryotic. The plasmids and eukaryotic genomes were deleted from the databases. One GenBank genome was eliminated because it was a project containing a set of 10 different genomes, rather than representing a single genome.
How the scores are assigned, by component
Calculation of quality scores was performed in two stages. In the first stage, algorithms were run to determine the range of values associated with a metric (e.g., the number of ‘essential genes’ found for a given genome). In the second stage, a score was scaled to represent our assessment of the quality of that level of completeness. For example, a score of 0.9 was assigned if 90% of essential genes were found.
The numerator was the number of “good” bases across all contigs (i.e., A, C, T, or G).
The denominator was the number of “good” bases, plus
o the count of “bad” (bad = anything except A, C, T, G, or N) bases, plus
o a 10,000 bp “penalty” bases for each contig after the first, plus
‘Complete’ genomes were penalized for gaps of 10 or more N's but were not penalized for additional contigs, which were assumed to be additional plasmids or chromosomes.
The sequence quality score was designed to estimate how close the genome was to being completely sequenced – that is, once contiguous piece per replicon. In principle, this would likely be reflective of the reliability of coverage – that is, that enough of the genome is present in good enough quality so as to minimize errors in the prediction of all the genes and features by protein-coding and RNA gene prediction algorithms. Gene prediction algorithms may fail to predict genes in extremely short contigs, or at the edges of longer contigs. It was generally assumed that each contig would lose, on average, half a gene at each edge, or one gene total, with an average size of 900-1000 bp. However, we did not consider an assembly that could only capture 90% of genes to be of high quality, so we scaled the score downward by increasing this “missing bases” penalty by a factor of 10.
Start with a minimum of 0.1.
Defined an ideal length range for each molecule type:
23S to be between 2900 and 3500
16S to be between 1450 and 1700
5S to be between 100 and 120
For each molecule type,
add .3 if length was within the ideal range
else, add .2 if length was greater than 0.5 times the minimum
else, add .1 if a prediction of any length exists
Start with a maximum of 1.0
Subtract 0.1 for each amino acid with no tRNA, until a minimum of 0.1 is reached.
Start with a maximum of 1.0
Subtract 0.01 for each missing Pfam-A until a minimum of 0.1 is reached.
The four individual scores were each standardized to a zero mean and unit variance (xnew = (x-μ)/σ) and then averaged. The average was transformed to a scale with a minimum of zero and a maximum of one. The total combined score is this transformed value.
Numbers in all the Supplementary tables were rounded to two significant digits. Some integers presented in the text (e.g., maximum length of 23S), are given with more digits of precision.
DWU conceived the idea. DWU, MLL, PDH, SRJ, and LJH designed the scoring methodology. MLL, PDH, OL, and SRJ analyzed the data. GHK implemented the website. MLL wrote the manuscript with the assistance of the other authors. All authors read and approved the final manuscript.
We would like to thank the Broad, PATRIC, and KBase for making their data available to the public for this comparison. While much of the data overlaps, the differences in objectives make the comparisons interesting. We gratefully acknowledge funding support for this research by the Genomic Science Program, U.S. Department of Energy (DOE), Office of Science, Biological and Environmental Research (BER) as part of the Plant Microbe Interfaces Scientific Focus Area (http://pmi.ornl.gov) and the BER’s BioEnergy Science Center (BESC) at the Oak Ridge National Laboratory (contract DE-PS02-06ER64304). Additional funding came from the Laboratory Directed Research and Development (LDRD) funding from Oak Ridge National Laboratory. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725.
- Schatz MC, Langmead B: The DNA data deluge fast, efficient genome-sequencing machines are spewing out more data than geneticists can analyze. Ieee Spectrum 2013,50(7):28–33.View ArticleGoogle Scholar
- Deus HF, Stanislaus R, Veiga DF, Behrens C, Wistuba II, Minna JD, Garner HR, Swisher SG, Roth JA, Correa AM, Broom B, Coombes K, Chang A, Vogel LH, Almeida JS: A semantic web management model for integrative biomedical informatics. Plos One 2008,3(8):e2946.PubMed CentralView ArticlePubMedGoogle Scholar
- Beaman RS, Traub GH, Dell CA, Santiago N, Koh J, Cellinese N: TOLKIN - Tree of Life Knowledge and Information Network: filling a gap for collaborative research in biological systematics. Plos One 2012,7(6):e39352.PubMed CentralView ArticlePubMedGoogle Scholar
- Bhattacharyya M, Bandyopadhyay S: Recent Directions in Compressing Next Generation Sequencing Data. Current Bioinformatics 2012,7(1):2–6.View ArticleGoogle Scholar
- Mavromatis K, Land ML, Brettin TS, Quest DJ, Copeland A, Clum A, Goodwin L, Woyke T, Lapidus A, Klenk HP, Cottingham RW, Kyrpides NC: The fast changing landscape of sequencing technologies and their impact on microbial genome assemblies and annotation. Plos One 2012,7(12):e48837.PubMed CentralView ArticlePubMedGoogle Scholar
- Anderson I, Rodriguez J, Susanti D, Porat I, Reich C, Ulrich LE, Elkins JG, Mavromatis K, Lykidis A, Kim E, Thompson LS, Nolan M, Land M, Copeland A, Lapidus A, Lucas S, Detter C, Zhulin IB, Olsen GJ, Whitman W, Mukhopadhyay B, Bristow J, Kyrpides N: Genome sequence of Thermofilum pendens reveals an exceptional loss of biosynthetic pathways without genome reduction. J Bacteriol 2008,190(8):2957–65.PubMed CentralView ArticlePubMedGoogle Scholar
- Fraser CM, Eisen JA, Nelson KE, Paulsen IT, Salzberg SL: The value of complete microbial genome Sequencing (you get what you pay for). J Bacteriol 2002,184(23):6403–5.PubMed CentralView ArticlePubMedGoogle Scholar
- Chain PSG, Grafham DV, Fulton RS, FitzGerald MG, Hostetler J, Muzny D, Ali J, Birren B, Bruce DC, Buhay C, Cole JR, Ding Y, Dugan S, Field D, Garrity GM, Gibbs R, Graves T, Han CS, Harrison SH, Highlander S, Hugenholtz P, Khouri HM, Kodira CD, Kolker E, Kyrpides NC, Lang D, Lapidus A, Malfatti SA, Markowitz V, Metha T: Genome Project Standards in a New Era of Sequencing. Science 2009,326(5950):236–7.View ArticlePubMedGoogle Scholar
- Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res 2013,41(D1):D36-D42.PubMed CentralView ArticlePubMedGoogle Scholar
- Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, Jelsbak L, Sicheritz-Ponten T, Ussery DW, Aarestrup FM, Lund O: Multilocus Sequence Typing of Total-Genome-Sequenced Bacteria. J Clin Microbiol 2012,50(4):1355–61.PubMed CentralView ArticlePubMedGoogle Scholar
- Institute B. Microbial Genomes Research Areas Available from: https://olive.broadinstitute.org/
- Gillespie JJ, Wattam AR, Cammer SA, Gabbard JL, Shukla MP, Dalay O, Driscoll T, Hix D, Mane SP, Mao CH, Nordberg EK, Scott M, Schulman JR, Snyder EE, Sullivan DE, Wang CX, Warren A, Williams KP, Xue T, Yoo HS, Zhang CD, Zhang Y, Will R, Kenyon RW, Sobral BW: PATRIC: the Comprehensive Bacterial Bioinformatics Resource with a Focus on Human Pathogenic Species. Infection and Immunity 2011,79(11):4286–98.PubMed CentralView ArticlePubMedGoogle Scholar
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer EL, Tate J, Punta M: Pfam: the protein families database. Nucleic Acids Res 2013,42(D1):D222-D230.PubMed CentralView ArticlePubMedGoogle Scholar
- Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997,25(5):955–64.PubMed CentralView ArticlePubMedGoogle Scholar
- Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW: RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 2007,35(9):3100–8.PubMed CentralView ArticlePubMedGoogle Scholar
- Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ: Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 2010, 11: 119.PubMed CentralView ArticlePubMedGoogle Scholar
- Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 2011, 39: W29–37.PubMed CentralView ArticlePubMedGoogle Scholar
- Chan PP, Lowe TM: GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res 2009, 37: D93-D7.PubMed CentralView ArticlePubMedGoogle Scholar
- Quality scores for 32,000 genomes Available from http://genomes.ornl.gov/studyResults.txt
- Roberts RJ, Carneiro MO, Schatz MC: The advantages of SMRT sequencing. Genome Biology 2013,14(6):405.View ArticlePubMedGoogle Scholar
- NCBI: Bacterial Genome ftp site for Finished Genomes. Available from ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria
- NCBI: Bacterial Genomes ftp site for Draft Genomes. Available from ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs
- Pathosystems Resource Integration Center (PATRIC) ftp download site: Available from: ftp://ftp.patricbrc.org/patric2/genomes/, Pathosystems Resource Integration Center (PATRIC) ftp download site: Available from: ftp://ftp.patricbrc.org/patric2/genomes/
- NCBI: Sequence Read Archive. Available from: http://www.ncbi.nlm.nih.gov/sra
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.