GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes
© The Author(s) 2009
Published: 29 September 2009
We present an interactive web application for visualizing genomic data of prokaryotic chromosomes. The tool (GeneWiz browser) allows users to carry out various analyses such as mapping alignments of homologous genes to other genomes, mapping of short sequencing reads to a reference chromosome, and calculating DNA properties such as curvature or stacking energy along the chromosome. The GeneWiz browser produces an interactive graphic that enables zooming from a global scale down to single nucleotides, without changing the size of the plot. Its ability to zoom disproportionally provides optimal readability and increased functionality compared to other browsers. The tool allows the user to select the display of various genomic features, color settings, and data ranges. Custom numerical data can be added to the plot, allowing, for example, visualization of gene expression and regulation data. Further, standard atlases are pre-generated for all prokaryotic genomes available in GenBank, providing a fast overview of all available genomes, including recently deposited genome sequences. The tool is available online from http://www.cbs.dtu.dk/services/gwBrowser. Supplemental material including interactive atlases is available online at http://www.cbs.dtu.dk/services/gwBrowser/suppl/.
The development of fast and inexpensive genome sequencing technologies has led to the generation of vast amounts of genomic information. As genomic sequencing becomes both more powerful and affordable, the handling and analysis of the generated data pose novel challenges and shift the focus away from the discovery process towards technical considerations of handling, storing and analyzing sequence data. An important step when exploring a new genome is to compare it to existing sequences, in order to identify both novel and conserved features. Many automated computational methods are available that attempt to derive protein function from sequence [1–3]. In a metagenomic study by Harrington and co-workers it was estimated that 76% of the examined protein coding genes could be assigned a function. However, to assess predictions for individual genes, visualization remains critical to provide the biologist with an overview of the genomic context. Are genes of interest situated in clusters? In operons? How are they regulated? How does their DNA base composition compare with that of the rest of the genome? In order to display such features both on a genome scale and in close-up down to the level of nucleotides, we developed the GeneWiz browser, which is based on the ‘Genome Atlas’ concept [4,5]. This tool can also display local DNA structural properties, so that regulatory or repeat regions can easily be identified and interpreted in a chromosomal context.
During development of the GeneWiz browser, it became apparent that novel sequencing technology creates a further demand. The current generation of sequencing instruments utilizes primed synthesis in flow cells to simultaneously obtain the sequences of millions of different DNA templates, an approach that changed the field of DNA sequencing [6,7]. Flow sequencing, also known as sequencing by synthesis (SBS) on a solid surface, tracks nucleotides as they are added to a growing DNA strand. SBS is used by high-throughput sequencing systems which have become commercially available in the past two years. Examples include the sequencer GS Titanium (commercialized by 454/Roche); Genome Analyser GA-II (Solexa/Illumina); and SOLiD™ 3 system (Applied Biosystems).
These developments have increased the speed of sequencing while significantly reducing its cost [9,10]. This much higher throughput provides greater coverage, but at the cost of much shorter read lengths: from 50 bases with SOLiD 3 to 75 bases with Illumina GA II. Even reads of 500 bases obtained with the 454-Titanium are still shorter than read lengths typically obtained using the Sanger method [9,11]. The output from modern high-throughput sequencing equipment challenges the assembly software by generating shorter and more ambiguous reads. Processing of this flood of sequence data has rapidly become a bottleneck, and developing the necessary skills and tools will most likely be a driving factor in the execution of second-generation sequencing. As a first step in this development, it needs to be determined to what extent assembly of short-read sequences can be trusted, an assessment for which the GeneWiz browser can also be used.
Our method of visualization is based on color-encoded lanes to display numerical information on a genome atlas similar to GeneWiz [4,5]. The color encoding can be done either using a linear scale with a fixed minimum and maximum range, or a dynamic scale of standard deviations. Using the latter, color intensity decreases as data approach average values, thereby emphasizing regions of significant variation. The web interface is divided into four optional sections, to address various biological viewpoints of chromosomes: 1) DNA properties 2) Mapping of homologous genes by BLAST 3) Mapping of short sequencing reads 4) Custom lanes such as Single Nucleotide Polymorphism (SNP) or microarray data. The output of each method is a numerical vector of length corresponding to that of the reference sequence, and the methods used for this construction are described in detail below.
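The two color-encoding modes described above can be illustrated with a minimal sketch. It is not the GeneWiz implementation: the function names, the clamping behavior, and the cut-off of three standard deviations (`max_sd`) are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def linear_intensity(value, lo, hi):
    """Map a value onto [0, 1] using a fixed minimum and maximum.

    Values outside the range are clamped (an assumption of this sketch).
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def stddev_intensity(value, avg, sd, max_sd=3.0):
    """Return a signed intensity in [-1, 1] on a scale of standard deviations.

    The intensity is 0 at the average and reaches +/-1 at max_sd deviations,
    so values near the average fade out. max_sd is a hypothetical cut-off.
    """
    if sd == 0:
        return 0.0
    z = (value - avg) / sd
    return max(-1.0, min(1.0, z / max_sd))


# A hypothetical lane: mostly average values with two outliers.
lane = [0.48, 0.50, 0.52, 0.50, 0.80, 0.50, 0.20, 0.50]
avg, sd = mean(lane), pstdev(lane)

linear = [round(linear_intensity(v, 0.0, 1.0), 2) for v in lane]
dynamic = [round(stddev_intensity(v, avg, sd), 2) for v in lane]
```

On the dynamic scale the two outliers receive strong intensities of opposite sign, while positions close to the average stay near zero. This is the emphasis on regions of significant variation that the text describes.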
Read quality assessment
Gene duplications, rRNA operons and other repetitive chromosomal regions are known to cause difficulties during the assembly of short reads. To assess the degree of ambiguity of sequencing reads, a method was developed that derives the uniqueness of all reads, accounting for both the read quality and the match to the reference genome.
From the mapping, five different parameters were calculated, which together summarize the trustworthiness of the reads given the assembly:
From all the q′r(i) values obtained at each position in the genome, the maximum uniqueness-weighted quality is chosen when all reads have been mapped.
Information content provides a number in bits of information representing to what degree the reads agree: zero bits means equal distribution of The value is plotted on a color scale whereby low information (random distribution, least expected) is given in dark colors, and high information (high conservation, most expected) in a light or neutral color. This measure may be useful for visualizing single nucleotide polymorphisms.
Read absence. A boolean where ‘one’ indicates complete absence of aligned reads.
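As an illustration of two of the parameters above, the sketch below computes an information content in bits for the bases aligned at a single position, together with the read-absence flag. The text does not give the exact formula, so the Shannon formulation used here (log2 of the alphabet size minus the entropy, without any small-sample correction) is an assumption, as are the function names.

```python
from collections import Counter
from math import log2


def information_content(bases, alphabet="ACGT"):
    """Information in bits for the bases that reads place at one position.

    Computed as log2(len(alphabet)) - H, where H is the Shannon entropy of
    the observed base frequencies. Zero bits means the bases are equally
    distributed; two bits means all reads agree.
    """
    counts = Counter(b for b in bases.upper() if b in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no aligned reads, so nothing is known about the position
    entropy = -sum((n / total) * log2(n / total) for n in counts.values())
    return log2(len(alphabet)) - entropy


def read_absence(coverage):
    """Return 1 if no reads are aligned at the position, otherwise 0."""
    return 1 if coverage == 0 else 0


agree = information_content("AAAA")     # all reads agree
mixed = information_content("AAAC")     # one read disagrees
uniform = information_content("ACGT")   # equal distribution of bases
```

A position where one read out of four disagrees scores lower than a fully consistent one, which is why such a lane can hint at single nucleotide polymorphisms.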
Visualization of whole-genome homology
ment is then mapped back to the reference genome. A match adds a ‘one’ whereas a mismatch adds a ‘zero’ at each position along the chromosome. These ones and zeros translate into smooth color zones due to binning.
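The mapping of matches and mismatches onto the reference, and the smoothing effect of binning, can be sketched as follows. The input format (a 1-based start position plus a string of `1`/`0` characters per aligned reference position) and both function names are hypothetical; the actual BLAST post-processing is not shown in the text.

```python
def match_vector(genome_length, aligned_segments):
    """Build a per-position 0/1 vector along the reference chromosome.

    aligned_segments is a list of (start, flags) pairs, where start is the
    1-based reference position of the first aligned base and flags is a
    string with '1' for a match and '0' for a mismatch (assumed format).
    Positions not covered by any alignment stay at zero.
    """
    vector = [0] * genome_length
    for start, flags in aligned_segments:
        for offset, flag in enumerate(flags):
            index = start - 1 + offset
            if 0 <= index < genome_length and flag == "1":
                vector[index] = 1
    return vector


def bin_means(values, bin_size):
    """Average consecutive bins; 0/1 values become fractions in [0, 1]."""
    means = []
    for i in range(0, len(values), bin_size):
        chunk = values[i:i + bin_size]
        means.append(sum(chunk) / len(chunk))
    return means


vec = match_vector(12, [(3, "1101"), (9, "111")])
binned = bin_means(vec, 4)
```

After binning, each value is the fraction of matching positions in its bin, so a discrete 0/1 signal turns into a graded one that can be drawn as a smooth color zone.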
DNA properties and DNA destabilization
A designated section of the GeneWiz browser is assigned for custom data. It allows the user to provide a per-nucleotide list of numerical values along with a desired color and data range. Although not presented here, this allows for visualization of additional information such as microarray data that has been pre-processed by the user, by mapping gene expression, regulation change, or p-values back to genomic coordinates. In addition to the main genome annotation covering CDSs, tRNAs, and rRNAs, the user may specify miscellaneous and pseudo-gene annotations separately. A button allows the query of selected reference genomes against a replicate of pseudogenes.org. Other annotations of possible pseudogenes can be added, such as GenePRIMP output (geneprimp.jgi-psf.org/).
The GeneWiz browser allows dynamic disproportional zooming, meaning that zooming occurs nearly instantly when requested by the user, with all components such as tracks, legends, marks and text redrawn for every view. Because the parts of the plot are not all rescaled equally, the browser can scale the plot to make use of the entire plotting area. For example, zooming 10x will stretch a data lane tenfold along the genome position axis, whereas the lane height and the distance to the neighboring lane remain constant. The dynamic nature of the GeneWiz browser requires pre-binning of data for each zoom level, all of which are stored on a central server; for improved efficiency, only data requested by the user are sent. Storing per-nucleotide information as table records in a database (e.g. MySQL) proved infeasible, as the number of records per genome runs into the millions and the construction of indexes would be very time consuming. Instead, a memory mapping technique was chosen that allows the server to obtain the values directly from binary files when provided with the zoom window and level, for any chromosome in the database. (Examples are provided as supplemental data, http://www.cbs.dtu.dk/services/gwBrowser/suppl/).
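The combination of pre-binning per zoom level and memory-mapped lookup can be sketched as below. This is only an illustration of the general technique, not the server's C implementation: the file layout (one little-endian 32-bit float per bin), the rule that a bin at level `l` spans `2**l` nucleotides, and all function names are assumptions.

```python
import mmap
import os
import struct
import tempfile

VALUE = struct.Struct("<f")  # assumed layout: one little-endian float32 per bin


def bin_size(level):
    """Nucleotides per bin at a zoom level (hypothetical rule: 2**level)."""
    return 2 ** level


def write_level(path, values, level):
    """Pre-bin per-nucleotide values by averaging and store them in a binary file."""
    size = bin_size(level)
    with open(path, "wb") as fh:
        for i in range(0, len(values), size):
            chunk = values[i:i + size]
            fh.write(VALUE.pack(sum(chunk) / len(chunk)))


def read_window(path, begin, end, level):
    """Return the binned values covering 1-based positions begin..end (inclusive).

    The byte offsets follow directly from the window and the zoom level, so
    no index is needed; only the requested slice of the mapping is read.
    """
    size = bin_size(level)
    first, last = (begin - 1) // size, (end - 1) // size
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[first * VALUE.size:(last + 1) * VALUE.size]
    return [v for (v,) in VALUE.iter_unpack(raw)]


# Hypothetical lane of 64 nucleotides, binned at zoom level 2 (4 nt per bin).
values = [float(i % 8) for i in range(64)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lane_l2.bin")
    write_level(path, values, level=2)
    window = read_window(path, begin=5, end=12, level=2)
```

Because every bin has a fixed width in bytes, locating a window is a matter of arithmetic on the begin and end positions, which avoids the index construction that made a table-per-nucleotide database impractical.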
Table 1. GeneWiz browser server options.
- The unique identifier for the atlas
- Feature type (e.g. CDS, rRNA, tRNA) when returning annotations
- Data field to return
- Begin of window
- End of window
- Enable zlib compression of output
- Return the genome length
- Return aggregate data for window/genome
- Return data values provided field, window and zoom level
- Return colors provided two or three-step ranges
- Return nucleotides provided the window
- Return annotations (used together with option ‘ft’)
These options (Table 1) can be combined in a single URL. For example, one could request all numerical data for field f=dnap0, at zoom level l=5, from position b=1 to e=37,473 bp, using compression, z=true (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi?d=AL111168GENOMEatlas&m=d&f=dnap0&b=1&e=37473&l=5&z=true). The field names and their configurations are described in the XML record, which can be downloaded from the web (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/fetchxml.cgi?AL111168GENOMEatlas). Further examples are provided in the supplemental data section.
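The request above can be assembled programmatically. The sketch below only builds the URL and sends nothing. The parameter names (`d`, `m`, `f`, `b`, `e`, `l`, `z`) and the base address are taken from the example URL, whereas the helper name `data_url` and the choice to omit `z` when compression is not wanted are assumptions.

```python
from urllib.parse import urlencode

# Base address as it appears in the example URL in the text.
BASE = "http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi"


def data_url(atlas_id, field, begin, end, level, compress=True):
    """Build a request URL for numerical lane data (hypothetical helper).

    m=d is the mode used in the example for returning data values. Leaving
    out z when compress is False is an assumption of this sketch.
    """
    params = [
        ("d", atlas_id),  # atlas identifier
        ("m", "d"),       # mode: return data values
        ("f", field),     # data field
        ("b", begin),     # begin of window
        ("e", end),       # end of window
        ("l", level),     # zoom level
    ]
    if compress:
        params.append(("z", "true"))
    return BASE + "?" + urlencode(params)


url = data_url("AL111168GENOMEatlas", "dnap0", 1, 37473, 5)
```

With these arguments the helper reproduces the example URL given in the text. If the server is still reachable, the result could be fetched with any HTTP client.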
The GeneWiz browser workflow and data displayed
When submitting a job via the web interface, the request is assigned a job identifier, under which all data lanes and configurations are kept. After the job has been processed the user may alter lane order, colors, ranges, and append various types of marks to the plot. The layout of a given browser instance is governed by an XML file, located on the server. When generating the graphical representation of the genome, the client Java program will make requests to the server to acquire aggregated values, such as the averages, standard deviations, minima, and maxima as well as lane data and annotations.
The genome sequences stored in the CBS Genome Atlas Database are synchronized with NCBI Entrez genome projects and have been pre-processed for all of the eight standard atlases mentioned above. This allows the user to select from currently 1,636 pre-binned replicons from 864 prokaryotic sequencing projects, searchable by replicon name, GenBank accession number, or organism name (http://www.cbs.dtu.dk/services/gwBrowser/precalc/).
Evaluation of re-sequencing quality
Sequencing details of three bacterial genomes, two of which were re-sequenced using 454-Titanium and one with Illumina GA technology.
E. coli K12 MG1655
C. jejuni NCTC11168
S. typhi Ty2
Illumina GA II
Avg. read length (std. dev.)
Accession and original Reference
Genome homology: Comparing multiple Burkholderia species
The SIDD atlas: Annotation of regulatory elements
Visualization of the multidimensional information that is represented by a single genome sequence remains complex. An indispensable property of a genome visualization tool is that it must be zoomable, so that information can be interpreted at varying scales. Two recently published methods, DNAPlotter and Genome Projector, both enable the user to build circular plots of numerical data related to genes as well as graphs of numerical data pertaining to the nucleotides. These tools create static graphics and allow only proportional zooming, which makes the plot hard to interpret when zooming in too far. Both of these tools allow for visualization of individual genomes, but do not allow easy comparison across multiple genomes. With new genome sequences becoming available so readily, it is essential to be able to quickly compare other genomes to a reference.
A number of other tools approach genome visualization from different angles: GenomeDiagram and Circos are command line programs generating publication quality static images and vector graphics. Although these tools allow comparison of other genomes, are flexible and allow visualization of numerical data, they lack an interactive layer.
The GeneWiz browser described here uses disproportional zooming to overcome this. From a technical perspective, the choice of programming language for writing graphical browsers is of importance. There are obvious advantages of providing platform-independent Java software like that of the GeneWiz browser, but often this is at the cost of performance. Nevertheless, our tool demonstrates the usefulness of a genome browser that relies on interactive, true disproportional zooming to visualize annotated genes and features as well as numerical data provided at single nucleotide resolution. By building a comprehensive tool that is both scalable and flexible, we have shown how different types of genomic data can be integrated into a single, easily navigated graphic that can be annotated further by the user.
P.F.H. wrote the paper and composed the web interfaces, as well as most parts of the server back end. H.H.S. wrote the C code of the data binning and retrieval software and contributed to the Java Applet; E.R. wrote the majority of the Java Applet code and formulated the XML configurations. T.T.B. provided source data and analysis of C. jejuni and E. coli sequencing reads and C.J.B. assisted in writing the paper (paragraphs on SIDD energy). D.W.U. assisted in writing the paper, supervised the project and provided ideas for figures and analysis. All authors have read and made corrections to the manuscript.
This work is funded in part by grants from the Danish Center for Scientific Computing, NSF Research Grant DBI-0416764, The Danish Research Council grant 26-06-0349, and the EU EMBRACE Network of Excellence, contract number LSHG-CT-2004-512092. We thank Mark Driscoll and Marcel Margulies from 454 Life Sciences for providing the data for C. jejuni and E. coli and Julian Parkhill at the Sanger Institute for providing the S. typhi sequencing data. We also thank Dr. Trudy Wassenaar and Dr. Lars Juhl Jensen for making suggestions to the manuscript.
- Harrington ED, Singh AH, Doerks T, Letunic I, von Mering C, Jensen LJ, Raes J, Bork P. Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proc Natl Acad Sci USA 2007; 104:13913–13918. doi:10.1073/pnas.0702636104
- Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C et al. Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 2002; 319:1257–1265. doi:10.1016/S0022-2836(02)00379-0
- Friedberg I. Automated protein function prediction--the genomic challenge. Brief Bioinform 2006; 7:225. doi:10.1093/bib/bbl004
- Jensen LJ, Friis C, Ussery DW. Three views of microbial genomes. Res Microbiol 1999; 150:773–777. doi:10.1016/S0923-2508(99)00116-3
- Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW. A DNA structural atlas for Escherichia coli. J Mol Biol 2000; 299:907–930. doi:10.1006/jmbi.2000.3787
- Hall N. Advanced sequencing technologies and their wider impact in microbiology. J Exp Biol 2007; 210:1518–1525. doi:10.1242/jeb.001370
- Holt RA, Jones SJ. The new paradigm of flow cell sequencing. Genome Res 2008; 18:839–846. doi:10.1101/gr.073262.107
- Käller M, Lundeberg J, Ahmadian A. Arrayed identification of DNA signatures. Expert Rev Mol Diagn 2007; 7:65–76. doi:10.1586/1473718.104.22.168
- Gupta PK. Single-molecule DNA sequencing technologies for future genomics research. Trends Biotechnol 2008; 26:602–611. doi:10.1016/j.tibtech.2008.07.003
- Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008; 26:1135–1145. doi:10.1038/nbt1486
- Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem N, Stromberg MP et al. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res 2008; 18:1638–1642. doi:10.1101/gr.077776.108
- Lin F, Schröder H, Schmidt B. Solving the Bottleneck Problem in Bioinformatics Computing: An Architectural Perspective. J VLSI Signal Process 2007; 48:185–188. doi:10.1007/s11265-007-0088-z
- Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive misassembly. Genome Biol 2008; 9:R55. doi:10.1186/gb-2008-9-3-r55
- Tolstrup N, Rouzé P, Brunak S. A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites. Nucleic Acids Res 1997; 25:3159–3163. doi:10.1093/nar/25.15.3159
- Hallin PF, Binnewies TT, Ussery DW. The genome BLASTatlas-a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 2008; 4:363–371. doi:10.1039/b717118h
- Bolshoy A, McNamara P, Harrington RE, Trifonov EN. Curved DNA without A-A: experimental estimation of all 16 DNA wedge angles. Proc Natl Acad Sci USA 1991; 88:2312–2316. doi:10.1073/pnas.88.6.2312
- Brukner I, Sanchez R, Suck D, Pongor S. Sequence-dependent bending propensity of DNA as revealed by DNase I: parameters for trinucleotides. EMBO J 1995; 14:1812–1818.
- van Noort V, Worning P, Ussery DW, Rosche WA, Sinden RR. Strand misalignments lead to quasipalindrome correction. Trends Genet 2003; 19:365–369. doi:10.1016/S0168-9525(03)00136-7
- Olson WK, Gorin AA, Lu XJ, Hock LM, Zhurkin VB. DNA sequence-dependent deformability deduced from protein-DNA crystal complexes. Proc Natl Acad Sci USA 1998; 95:11163–11168. doi:10.1073/pnas.95.19.11163
- Ornstein RL, Rein R, Breen DL, MacElroy RD. An optimized potential function for the calculation of nucleic acid interaction energies. I-Base stacking. Biopolymers 1978; 17:2341–2360. doi:10.1002/bip.1978.360171005
- Satchwell SC, Drew HR, Travers AA. Sequence periodicities in chicken nucleosome core DNA. J Mol Biol 1986; 191:659–675. doi:10.1016/0022-2836(86)90452-3
- Ussery D, Soumpasis DM, Brunak S, Staerfeldt HH, Worning P, Krogh A. Bias of purine stretches in sequenced chromosomes. Comput Chem 2002; 26:531–541. doi:10.1016/S0097-8485(02)00013-X
- Wang H, Benham CJ. Superhelical destabilization in regulatory regions of stress response genes. PLOS Comput Biol 2008; 4:e17. doi:10.1371/journal.pcbi.0040017
- Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N, Cayting P, Harrison P, Gerstein M. Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res 2007; 35:D55–D60. doi:10.1093/nar/gkl851
- Hallin PF, Ussery DW. CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 2004; 20:3682–3686. doi:10.1093/bioinformatics/bth423
- Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF et al. The complete genome sequence of Escherichia coli K-12. Science 1997; 277:1453–1462. doi:10.1126/science.277.5331.1453
- Parkhill J, Wren BW, Mungall K, Ketley JM, Churcher C, Basham D, Chillingworth T, Davies RM, Feltwell T, Holroyd S et al. The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature 2000; 403:665–668. doi:10.1038/35001088
- Deng W, Liou SR, Plunkett G, Mayhew GF, Rose DJ, Burland V, Kodoyianni V, Schwartz DC, Blattner FR. Comparative genomics of Salmonella enterica serovar Typhi strains Ty2 and CT18. J Bacteriol 2003; 185:2330–2337. doi:10.1128/JB.185.7.2330-2337.2003
- Brett PJ, DeShazer D, Woods DE. Burkholderia thailandensis sp. nov., a Burkholderia pseudomallei-like species. Int J Syst Bacteriol 1998; 48:317–320.
- Smith MD, Angus BJ, Wuthiekanun V, White NJ. Arabinose assimilation defines a nonvirulent biotype of Burkholderia pseudomallei. Infect Immun 1997; 65:4319–4321.
- Ong C, Ooi CH, Wang D, Chong H, Ng KC, Rodrigues F, Lee MA, Tan P. Patterns of large-scale genomic variation in virulent and avirulent Burkholderia species. Genome Res 2004; 14:2295–2307. doi:10.1101/gr.1608904
- Hirvonen CA, Ross W, Wozniak CE, Marasco E, Anthony JR, Aiyar SE, Newburn VH, Gourse RL. Contributions of UP elements and the transcription factor FIS to expression from the seven rrn P1 promoters in Escherichia coli. J Bacteriol 2001; 183:6305–6314. doi:10.1128/JB.183.21.6305-6314.2001
- Ross W, Salomon J, Holmes WM, Gourse RL. Activation of Escherichia coli leuV transcription by FIS. J Bacteriol 1999; 181:3864–3868.
- Wang H, Noordewier M, Benham CJ. Stress-induced DNA duplex destabilization (SIDD) in the E. coli genome: SIDD sites are closely associated with promoters. Genome Res 2004; 14:1575–1584. doi:10.1101/gr.2080004
- Bauer BF, Kar EG, Elford RM, Holmes WM. Sequence determinants for promoter strength in the leuV operon of Escherichia coli. Gene 1988; 63:123–134. doi:10.1016/0378-1119(88)90551-3
- Carver T, Thomson N, Bleasby A, Berriman M, Parkhill J. DNAPlotter: circular and linear interactive genome visualization. Bioinformatics 2009; 25:119–120. doi:10.1093/bioinformatics/btn578
- Arakawa K, Tamaki S, Kono N, Kido N, Ikegami K, Ogawa R, Tomita M. Genome Projector: zoomable genome map with multiple views. BMC Bioinformatics 2009; 10:31. doi:10.1186/1471-2105-10-31
- Pritchard L, White JA, Birch PR, Toth IK. GenomeDiagram: a python package for the visualization of large-scale genomic data. Bioinformatics 2006; 22:616–617. doi:10.1093/bioinformatics/btk021
- Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA. Circos: an information aesthetic for comparative genomics. Genome Res 2009; 19:1639–1645. doi:10.1101/gr.092759.109