Solving the Problem: Genome Annotation Standards before the Data Deluge
© The Author(s) 2011
Published: 15 October 2011
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
Annotation Issues in Genome Records
Even before the first genome sequence for a cellular organism was completed in 1995, it was recognized that the functional content encoded by and annotated on nucleotide records represented both a blessing and a curse [1–3]. With the complete genome sequence obtained and annotated, a full understanding of the biology of an organism was thought to be within reach. However, deposition of an annotated record into the sequence archives, excepting the rare occasion when a record is updated, meant that the archival record represented a snapshot in time of both the sequence and annotation. Scientists have sought to address the annotation issue by creating curated databases, developing computational tools for the assessment of annotation, and publishing a variety of solutions in numerous papers [4,5].
Throughout the sequencing era, continuous reassessment of annotations based on new evidence led to improved annotations on a number of sequences, even though the process is recognized as being time-intensive [6,7]. With the exponential increase in sequence data, annotation updates have become increasingly unlikely events. Errors in annotation impact downstream analyses. Errors that affect the location of annotated features or that result in a missed genomic feature greatly impact the evolutionary studies and biological understanding of an organism, whereas mistakes in functional annotation lead to subsequent problems in the analyses of pathways, systems, and metabolic processes. The presence of inaccurate annotation in biological databases introduces a hidden cost to researchers that is amplified by the amount of data being produced. For prokaryotic organisms, as of August 10, 2010, there were 1,218 complete and more than 1,400 draft genomes that had been sequenced and released publicly. The Genome Project database and other online efforts to catalog genome sequencing initiatives list thousands of additional sequence projects that have been initiated but for which sequence data has not yet been released [9,10]. Investigators relying on the complete genome set consisting of sequenced and closed replicon molecules and annotations as a gold standard are becoming increasingly affected by the size of the dataset even without having to take into account the presence of erroneous annotation. As rapidly decreasing sequencing costs for next generation sequencing are producing unprecedented levels of data and errors that can easily inflate in size and propagate throughout many datasets, it is essential that steps be taken to address these issues [8,12].
A large body of literature devoted to describing annotation problems is available ([13,14] and references within). Errors that plague genome annotations range from simple spelling mistakes that may affect a few records, to incorrectly tuned parameters in automatic annotation pipelines that can affect thousands of genes. Discrepancies can impact the genomic coordinates of a feature, or the function ascribed to a feature such as the protein or gene name, or both. The commonly used Gene Ontology annotations are also subject to errors. As our understanding of genome biology and evolution has improved, a number of methods have been developed to assess annotation quality. Typically, several pieces of evidence are combined in order to assign confidence levels to a particular annotation or to predict new functions. In some cases these methods have led investigators to target a specific function for experimental validation after the prediction was made, a process that both validated the prediction method and provided improved and experimentally determined annotations, such as in the detection of the GGDEF and EAL domains as a major part of prokaryotic regulation [17–19]. Some of these methods include sequence similarity, phylogenomic or genomic context, metabolic reconstruction to determine pathway holes, comparative genomics, and in many cases a combination of all of the above (reviewed in ). A number of tools have been developed to predict annotations based on curated and experimental data. Curated model organism databases or datasets for specific molecules such as transfer RNAs, ribosomal RNAs, or other non-coding RNAs have been developed along with tools to predict their presence in a novel sequence [21–24].
Several large-scale curated databases have been created at large centers, such as at EBI and NCBI. NCBI initiated the Reference Sequence database to create a curated non-redundant set of sequences derived from original submissions to INSDC. The sequences include genomic DNA, transcripts, and proteins, and the annotations may consist of submitter-derived, curated, or computational predictions. One major resource for improving functional annotation is the NCBI Protein Clusters database (ProtClustDB), which consists of cliques of related proteins. A subset of clusters is curated and used as a source of functional annotation in the annotation pipeline as well as to incrementally update RefSeq records (see below). RefSeq records are also updated from model organism databases such as those for E. coli K-12 or FlyBase. The UniProt Knowledgebase (UniProtKB) provided by the UniProt consortium is an expertly curated database, a central access point for integrated protein information with cross-references to multiple sources. The Genome Reviews portal, which provided a comprehensive and up-to-date set of genomes, has now been incorporated into Ensembl Genomes [28,29]. Ongoing collaboration between NCBI and EBI ensures that annotation will continue to be curated and improved in all databases.
RefSeq is committed to ensuring that all current and future RefSeq prokaryotic records meet the minimal standards presented in this article. However, high throughput next generation sequencing increasingly results in a large number of non-reference sequences populating the databases with the expectation that there could be tens of thousands of genomes available for all prokaryotes. Community acceptance of a set of minimal annotation standards puts the burden on all genome submitters to provide quality annotation especially for those complete genomes that are often considered gold standard records for sequencing and annotation such as Escherichia coli K-12 MG1655.
The Need for Standards
Standards and guidelines facilitate the submission, retrieval, exchange, and analysis of data. Both the format and the content of data can be standardized (syntactic and semantic standardization, respectively). Syntactic standardization is easier to implement and enforce. The format and representation of genomic records has long been established and is not discussed in this article. Semantic standardization is more difficult. Standardization of the genomic content and annotation will facilitate analyses at the functional and systems levels. In other words, the biology will be easier to understand and to place in an evolutionary context, which will have a real impact on how researchers approach scientific studies.
Table 1. Databases, tools, and resources for genomes and annotation.
NCBI Genome Annotation Workshop
All information from this publication, the Annotation Workshop, and future announcements will be made available
Difference between Archive and Curated Databases
GenBank, RefSeq, TPA and UniProt: What’s in a Name?
International Nucleotide Sequence Database Collaboration
INSDC Feature Table
Feature table document
DNA Databank of Japan
European Nucleotide Archive
Automated Annotation providers
NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP)
Intended for use during the annotation of prokaryotic genomes in preparation for submission to GenBank; capable of annotating complete genomes as well as WGS genomes
JCVI Annotation Service
Anyone with a prokaryotic genome sequence in need of annotation may submit it to the JCVI Annotation Service completely free-of-charge
IGS Annotation Engine
A free resource for genomics researchers and educators bringing advanced bioinformatics tools to the lab bench and the classroom.
KAAS - KEGG Automatic Annotation Server
KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST comparisons against the manually curated KEGG GENES database with resulting KO (KEGG Orthology) assignments and automatically generated KEGG pathways
RAST (Rapid Annotation using Subsystem Technology) is a fully automated service for annotating bacterial and archaeal genomes — provides high quality genome annotations for these genomes across the whole phylogenetic tree
Expert Review Data Submission: Microbial Genomes & Management
Annotation Cleanup, Analyses, and Validation Tools
NCBI Submission Check Tool
For the validation of genome submissions to GenBank — utilizes a series of self-consistency checks as well as comparison of submitted annotations to computed annotations — web-based and downloadable versions available
NCBI Sequin Validation
Sequin is a standalone tool for submitting and updating sequences
Command-line tool for automating the preparation of sequence records for GenBank
NCBI Discrepancy report
Evaluation of ASN.1 files for annotation discrepancies-part of Sequin, available separately as downloadable command line version, and part of tbl2asn
Broad’s Gene Pidgin (formerly BioName)
JCVI’s Protein Naming Utility
GenBank Bacterial Genome Submission Guidelines
UniProt’s Protein Naming Guidelines
UniProt’s prokaryotic-specific protein naming guidelines — adopted by INSDC
GSC Structured Format
Accepted structured format for genome metadata including SOPs
Insertion sequence finder, nomenclature, and registry
Transposon nomenclature and registry
Enzyme Commission Numbers
Official NC-IUBMB site
ENZYME is a repository of information relative to the nomenclature of enzymes.
Functional Annotation/Protein Families
Clusters of orthologous groups - no longer actively curated
Cliques of related proteins (curated and uncurated) for multiple organism groups including prokaryotes and viruses
NCBI Cluster Comparison Tool
Protein family comparison for functional annotation
NCBI Cluster Comparison Tool - Core Mode
Protein family core comparison for functional annotation
List of Core Clusters
Protein family core list
System, based on manual protein annotation, that identifies and semi-automatically annotates proteins that are part of well-conserved families or subfamilies in prokaryotes and plastids
KEGG Orthology Groups
Manually defined ortholog groups that correspond to KEGG pathway nodes and BRITE hierarchy nodes
Protein families based on Hidden Markov Models
Database dedicated to the collection and classification of mobile genetic elements
E. coli CCDS Project
Comparison of annotation for model E. coli K-12 MG1655
Milestones from all three workshops include: 1) the E. coli CCDS project (ECCDS), 2) a publication detailing the differences between archival and curated databases, 3) a locus_tag registry, and 4) release of a set of annotation assessment tools. Specific proposals on problems of genome annotation were generated by a number of working groups and focused on the following issues: 1) standard operating procedures, 2) structured evidence, 3) structural annotation, 4) pseudogenes, 5) protein naming guidelines, 6) comparison of functional annotation, and 7) viral annotation. Several of these proposals were submitted as guidelines and standards to be approved by INSDC, while others are already accepted. Some of the proposals include reports and data sources that are available online (Table 1). The outcomes of each are summarized below.
The human genome CCDS project, an active collaboration between EBI, NCBI, Sanger, and UCSC, was established to create a core set of consistently annotated protein coding genes. This project has now grown to include the mouse genome, and there are considerations for expanding this to other eukaryotic organisms. Using this project as a model, the E. coli consensus CDS project was established to reconcile the annotation differences for the model organism E. coli K-12 MG1655, which was first sequenced in 1997 (GenBank Accession Number U00096). An updated annotation snapshot was released in 2006, and numerous curated and archival databases contain annotation for this organism. Of those, the ones actively contributing to the ECCDS project include GenBank, RefSeq, EcoGene, EcoCyc, and UniProt [54–56]. Consistent annotation has been established between EcoGene, GenBank, and RefSeq, with all three synchronizing the annotation several times a year. Reconciliation of this consistent annotation set with the EcoCyc and UniProtKB/Swiss-Prot databases is an ongoing process that has resulted in improved annotations in all five databases, benefiting not only E. coli researchers but also the entire field of prokaryotic genomics (Table 1).
Differences between Archival and Curated Databases
Archival and curated databases serve different needs for the genomic and bioinformatics communities, but there is still confusion about the exact roles of all of these databases in the representation of genome sequencing data. A short article (“GenBank, RefSeq, TPA and UniProt: What’s in a Name?”) clarifying these issues was authored by NCBI and published in the ASM journal Microbe and is also available online at NCBI (Table 1). The article discussed the differences between the archival databases (GenBank), curated databases such as RefSeq and UniProtKB/Swiss-Prot, and Third Party Annotation (TPA), and helped researchers to understand the exact role of each database and how sequences and annotations are handled in each. Archival databases such as GenBank contain primary submissions and redundant sequences whereas the TPA database provides the ability for peer reviewed and published information to be used to update the information in the primary archives. RefSeq and UniProt have been described above. These resources constitute a major part of the dataflow for the annotation, submission, retrieval, and analysis of genomic records.
Locus_tags are systematic identifiers used for the enumeration of annotated genes, even in cases where the genes have no known function. ASM journal editors had noticed an increased use of locus_tags to refer to genes in the scientific literature, both in the primary genome sequencing paper and in subsequent publications describing specific genes and functions. However, as these identifiers were assigned by individual investigators and research labs, there were increasing instances of the same locus_tag being used to describe different, unrelated genes in different organisms. Hence the utility of a unique identifier was being lost, and the use of locus_tags in a scientific article to identify particular genes was resulting in confusion. The solution was to create a locus_tag registry in conjunction with the Genome Project (soon to be BioProject) database. Prefixes consisting of alphanumeric characters that met the standards could be registered along with a genome project submission (Table 1). The assignment of a unique locus_tag prefix to each genome ensures that each gene feature in the dataset of all genome records can be correctly identified.
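The registry idea can be illustrated with a minimal sketch. The class name `LocusTagRegistry`, the prefix rule (3 to 12 alphanumeric characters beginning with a letter), the underscore separator, and the example prefix and project identifiers are assumptions made for illustration; they do not describe how the actual registry is implemented.

```python
import re

# Assumed format rule for illustration only:
# 3-12 alphanumeric characters, the first of which is a letter.
PREFIX_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}")


class LocusTagRegistry:
    """Toy registry that ties each locus_tag prefix to exactly one genome project."""

    def __init__(self):
        self._owner = {}  # prefix -> project identifier

    def register(self, prefix, project_id):
        """Register a prefix, refusing malformed or already-claimed prefixes."""
        if not PREFIX_PATTERN.fullmatch(prefix):
            raise ValueError(f"prefix {prefix!r} does not meet the assumed format rule")
        owner = self._owner.get(prefix)
        if owner is not None and owner != project_id:
            raise ValueError(f"prefix {prefix!r} is already registered to {owner}")
        self._owner[prefix] = project_id

    def locus_tag(self, prefix, number, width=5):
        """Build a systematic identifier (prefix, underscore, zero-padded number)."""
        if prefix not in self._owner:
            raise KeyError(f"prefix {prefix!r} is not registered")
        return f"{prefix}_{number:0{width}d}"


registry = LocusTagRegistry()
registry.register("ABC", "project-1")      # hypothetical prefix and project
first_gene = registry.locus_tag("ABC", 1)  # "ABC_00001"
```

Because a second project cannot claim `ABC`, every identifier built from that prefix points to a gene in one genome only, which is the property the registry is meant to restore.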
Annotation Assessment Tools
NCBI committed to producing additional annotation assessment tools to help submitters find problems with genome annotations (Table 1). These tools are used during the submission process to GenBank and in the Prokaryotic Genome Automatic Annotation Pipeline, and are also available separately. They include: 1) the Discrepancy Report, which performs internal consistency checks without the use of external databases and is available in Sequin, as part of the tbl2asn tool, or as a stand-alone command-line tool, and 2) the subcheck/frameshift tool, which incorporates sequence searches in external databases during annotation assessment in order to find potentially frameshifted genes and other annotation issues and is available via the web or as a command-line tool. NCBI encourages submitters to utilize these tools prior to submission to aid in the identification and correction of annotation discrepancies. A new annotation report that lists quantitative annotation measures and provides comparison with multiple organisms is also available and is detailed below.
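As a rough illustration of what an internal consistency check involves, the sketch below flags a few discrepancies that can be detected from the annotation alone, without external databases. It is not the NCBI Discrepancy Report; the `Cds` record, the particular checks, and the message wording are assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Cds:
    """Hypothetical, simplified coding-region record."""
    locus_tag: str
    start: int          # 1-based, inclusive
    end: int            # 1-based, inclusive
    product: str
    pseudo: bool = False


def internal_consistency_report(features, replicon_length):
    """Return discrepancy messages for a list of Cds records on one replicon.

    Features that span the origin of a circular replicon are not handled here.
    """
    messages = []

    # The same locus_tag must not be used for more than one feature.
    tag_counts = Counter(f.locus_tag for f in features)
    for tag, count in tag_counts.items():
        if count > 1:
            messages.append(f"{tag}: locus_tag used {count} times")

    for f in features:
        length = f.end - f.start + 1
        if f.start < 1 or f.end > replicon_length or length <= 0:
            messages.append(f"{f.locus_tag}: coordinates fall outside the replicon")
        elif length % 3 != 0 and not f.pseudo:
            messages.append(f"{f.locus_tag}: CDS length {length} is not a multiple of 3")
        if not f.product.strip():
            messages.append(f"{f.locus_tag}: missing product name")
    return messages
```

Checks of the second kind described above, such as frameshift detection, need sequence searches against external databases and are outside the scope of this sketch.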
Capturing Annotation Methods and Information Sources
The results of genome annotation processes are deposited along with sequence records in the archival databases. The combination of methods and information sources that were used in the creation of a particular genome annotation are usually detailed in a publication. With increasing numbers of genomes being deposited that do not have an associated scientific publication, it is of paramount importance that there is a process to capture the methods and databases used in creating a set of annotated features.
Standard Operating Procedures
Standard Operating Procedures (SOPs) in the context of genome annotation should: 1) document the specific processes used to generate annotations, 2) provide enough detail to replicate the process, 3) list the inputs and outputs, 4) reference any external tools, and 5) describe how the outputs of software packages are interpreted, filtered, or combined. The concept of SOPs, along with an example using the NCBI prokaryotic genome automatic annotation pipeline (PGAAP), has been detailed elsewhere. The Genomic Standards Consortium (GSC), which has set forth a structured format to capture genome metadata, provides optional fields to link to an online accessible SOP via a digital object identifier (DOI) or other mechanism. INSDC has agreed to adopt this structured format for genome metadata, thus providing the capability to document SOPs and link them to each genome record, with the metadata appearing in the COMMENT section. An example record with structured metadata can be found in GenBank Accession Number CP002903 (although the annotation SOP is not yet provided for this particular genome). All submitters are encouraged to use this structured format to capture genome metadata.
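One way to picture the five SOP requirements is as a simple record with a completeness check. The sketch below is only an illustration: the class `AnnotationSOP`, its field names, and the example values are hypothetical and do not correspond to the GSC structured format or to any INSDC field.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationSOP:
    """Hypothetical container for the elements an annotation SOP should cover."""
    process_steps: list = field(default_factory=list)         # 1) and 2): processes, in replicable detail
    inputs: list = field(default_factory=list)                # 3) inputs
    outputs: list = field(default_factory=list)               # 3) outputs
    external_tools: dict = field(default_factory=dict)        # 4) tool name -> version
    interpretation_rules: list = field(default_factory=list)  # 5) how tool outputs are filtered or combined
    link: str = ""                                            # optional DOI or other link to the online SOP

    def missing_elements(self):
        """Return the names of required elements that are still empty."""
        required = {
            "process_steps": self.process_steps,
            "inputs": self.inputs,
            "outputs": self.outputs,
            "external_tools": self.external_tools,
            "interpretation_rules": self.interpretation_rules,
        }
        return [name for name, value in required.items() if not value]


draft = AnnotationSOP(
    process_steps=["predict genes", "assign product names"],
    inputs=["assembled replicon sequences"],
    external_tools={"example-gene-caller": "1.0"},  # hypothetical tool and version
)
gaps = draft.missing_elements()  # ["outputs", "interpretation_rules"]
```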
Structured standards evidence in annotation
SOPs describe the processes used to make an annotation decision including a list of information sources which may include sequence, structure, domain databases, or protein family resources. Since many of these bioinformatics sources are large databases with many records, it is essential to note the exact record from which an annotation is derived, thus providing a one-to-one or many-to-one link from annotation sources to the novel predicted annotation in a new genome. The source becomes a vital reference that facilitates analysis and comparison and the link to a particular record provides a trail through which annotation updates or problems can be addressed.
A variety of evidence or confidence-based systems are currently used. The Evidence Viewer at NCBI displays the sequences that provide evidence for the sequence of a particular gene model or mRNA. The RefSeq status key provides varying levels of confidence to a particular annotation based on the level of manual review a particular annotation has received. The curated Pseudomonas aeruginosa database incorporates evidence levels for functional assignments. UniProt has developed an evidence attribution system which attaches an evidence tag to each data item in a UniProtKB entry, identifying its source(s) and/or the methods used to generate it. Users can easily identify information added during the manual curation process, imported from other databases, or added by automatic annotation procedures. In addition, UniProt has developed the protein existence concept, which provides the level of evidence available for the existence of a protein. The Gene Ontology (GO) system provides evidence for function, component, and process and is one of the better known systems used in annotation today. However, GO cannot be used for all features on a genome, nor are all genome sequencing centers and large-scale institutes routinely using GO or any of the other ontologies, and similar issues arise with all of the above-mentioned evidence systems.
The INSDC flatfile is a commonly used format. It provides the capability to annotate many features such as genes, protein-binding sites, or ribosomal RNAs. For each feature there is a set of mandatory and optional qualifiers (Table 1) that provide detailed information in a structured format for that particular feature, for example the gene name, the protein binding the DNA, or the ribosomal RNA product. The flatfile format is reviewed every year by the member databases, and proposed changes are discussed before acceptance.
Summary of structured evidence for INSDC feature annotation
free text describing the experiment
Non experimental structured format
structured format of TYPE + EVIDENCE_BASIS (type includes “non experimental”, “similar to”, “profile”, or “alignment”, evidence basis can include algorithm with version, or database with accession.version)
support for annotated coordinates
support for description including function
support for existence of feature in this organism
PMID or DOI
publication describing experimental evidence
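To make the TYPE + EVIDENCE_BASIS layout concrete, the sketch below assembles a structured evidence string of that shape. The helper name `build_inference`, the category labels, the colon separators, and the example database and accession are assumptions for illustration; the INSDC feature table document (Table 1) remains the authoritative definition of the qualifier syntax and of the permitted type values.

```python
# Type values as abbreviated in the summary above; the official vocabulary
# in the feature table document is more detailed.
ALLOWED_TYPES = ("non experimental", "similar to", "profile", "alignment")

# Assumed labels for the three kinds of support listed above
# (coordinates, description including function, existence of the feature).
ALLOWED_CATEGORIES = ("COORDINATES", "DESCRIPTION", "EXISTENCE")


def build_inference(evidence_type, basis_name="", basis_id="", category=""):
    """Assemble '[CATEGORY:]TYPE[:BASIS_NAME[:BASIS_ID]]'.

    basis_name is an algorithm or database; basis_id is the algorithm
    version or the database accession.version.
    """
    if evidence_type not in ALLOWED_TYPES:
        raise ValueError(f"unknown evidence type: {evidence_type!r}")
    if category and category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if basis_id and not basis_name:
        raise ValueError("an identifier needs the algorithm or database it belongs to")

    parts = [category] if category else []
    parts.append(evidence_type)
    if basis_name:
        parts.append(basis_name)
        if basis_id:
            parts.append(basis_id)
    return ":".join(parts)


# Hypothetical database name and accession.version, for illustration only.
evidence = build_inference("similar to", "ExampleDB", "ABC12345.1", category="DESCRIPTION")
# evidence == "DESCRIPTION:similar to:ExampleDB:ABC12345.1"
```

Recording the exact accession and version in this way provides the link back to a single source record that the text describes as essential for tracking later updates or problems.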
Structural annotation and gene calling standards, validation (reports and outcomes)
Selected annotation report examples
Organism name
No. of replicons
No. of proteins
No. of RNAs
No. of amino acids with tRNA
No. of hypothetical proteins
Avg. protein length (aa)
Min. protein length (aa)
Short proteins [%]
Percent standard start codon [%]
Escherichia coli str. K-12 substr. MG1655 (1)
Bacillus subtilis subsp. subtilis str. 168 (1)
Candidatus Carsonella ruddii PV (1)
Candidatus Hodgkinia cicadicola Dsem (1)
Streptomyces bingchenggensis BCW-1 (1)
Rickettsia rickettsii str. Iowa (1)
Clostridium tetani E88 (1)
Anaeromyxobacter dehalogenans 2CP-C (1)
Propionibacterium freudenreichii subsp. shermanii CIRM-BIA1 (1)
Lactobacillus salivarius CECT 5713 (1)
Haloarcula marismortui ATCC 43049 (2)
Photobacterium profundum SS9 (2)
Haliangium ochraceum DSM 14365 (1)
Nostoc sp. PCC 7120 (1)
Vibrio harveyi ATCC BAA-1116 (2)
Sorangium cellulosum ‘So ce 56’ (1)
Rhizobium leguminosarum bv. viciae 3841 (1)
Mycobacterium leprae Br4923 (1)
Neisseria gonorrhoeae NCCP11945 (1)
Although genome streamlining can impact these measures (for example, many genomes from the Prochlorococcus genus exhibit increased coding density), there are other factors at play [64,67,68]. This is more clearly seen when closely related genomes are compared as in a heatmap. Selected annotation measures for the gammaproteobacteria are compared in a heatmap in Figure 2. In several cases, increases or decreases in physical (length, GC content) or derived measures are due to biological causes. For example, gammaproteobacterial endosymbionts such as Buchnera spp. exhibit reduced genome size and decreased GC content [70,71]. In other cases a particular strain or set of strains exhibits skewed annotation measures as compared to other genomes of the same species. For example, one particular Salmonella genome (Salmonella enterica subsp. enterica serovar Paratyphi B str. SPB7) exhibits an increased coding density, ratio of short proteins, and number of hypothetical proteins along with a decreased average protein length. In other cases subclusters of a particular species are formed due to potentially erroneous annotations, such as the three Yersinia pestis genomes that cluster separately from other Y. pestis strains due to skews in annotation that were derived from the same pipeline. In still other cases, substrains do not cluster together because the annotations were derived from different annotation pipelines, as is the case for E. coli BL21, where three isolates were sequenced and annotated by three different research groups. Evolutionary events that result in altered annotations in a particular organism are significant and aid our understanding of the biology of not only that particular organism but of related organisms. Annotation differences due to the utilization of different methods and sources skew these results and the conclusions that are drawn from them.
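The kinds of measures compared above can be computed directly from an annotated protein set. The following sketch is illustrative only: the `Protein` record, the 100 aa cutoff for a "short" protein, the choice of ATG as the only standard start codon, and the simplified coding density (which ignores overlapping genes and RNA genes) are assumptions, not the definitions used in the annotation report.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Protein:
    """Hypothetical, simplified protein annotation."""
    length_aa: int
    product: str
    start_codon: str


def annotation_measures(proteins, genome_length_bp,
                        short_cutoff_aa=100, standard_starts=("ATG",)):
    """Compute a few summary measures for one annotated genome."""
    if not proteins:
        raise ValueError("no proteins annotated")
    n = len(proteins)
    lengths = [p.length_aa for p in proteins]
    unknown_names = ("hypothetical protein", "uncharacterized protein")
    hypothetical = sum(p.product.strip().lower() in unknown_names for p in proteins)
    short = sum(length < short_cutoff_aa for length in lengths)
    standard = sum(p.start_codon.upper() in standard_starts for p in proteins)
    # Each CDS spans its amino acid codons plus one stop codon.
    coding_bp = sum(3 * (length + 1) for length in lengths)
    return {
        "proteins": n,
        "hypothetical_proteins": hypothetical,
        "avg_length_aa": mean(lengths),
        "min_length_aa": min(lengths),
        "short_proteins_pct": 100 * short / n,
        "standard_start_pct": 100 * standard / n,
        "coding_density_pct": 100 * coding_bp / genome_length_bp,
    }
```

Computing the same dictionary for every genome in a group yields a matrix of genomes by measures, which is the kind of input a heatmap comparison needs.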
Researchers are encouraged to update their annotations on archival records to meet the minimal standards and to correct any annotation discrepancies. Systems are being developed at NCBI to check newly submitted genomes for compliance with minimal standards and reports will be provided to submitters for quality assurance. Genomic records where the minimal standards cannot be met for real biological reasons will have explanatory comments added to the record.
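A compliance check of this kind might look like the sketch below, which tests a genome for a full set of ribosomal RNAs, transfer RNAs, and core proteins. It is not the NCBI system: the function name, the assumption that a full rRNA set means 5S, 16S, and 23S, the requirement of a tRNA for each of the 20 standard amino acids, and the short example list of core protein names are all illustrative choices.

```python
REQUIRED_RRNAS = {"5S", "16S", "23S"}                # assumed "full set"
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")   # one-letter codes

# Small illustrative subset of core functions, not the complete core list.
EXAMPLE_CORE_PROTEINS = {
    "30S ribosomal protein S8",
    "50S ribosomal protein L2",
    "elongation factor P",
    "translation initiation factor IF-1",
}


def check_minimal_standards(rrnas, trna_amino_acids, protein_names,
                            core_proteins=EXAMPLE_CORE_PROTEINS):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    missing_rrnas = REQUIRED_RRNAS - set(rrnas)
    if missing_rrnas:
        problems.append("missing rRNA: " + ", ".join(sorted(missing_rrnas)))

    missing_trnas = STANDARD_AMINO_ACIDS - set(trna_amino_acids)
    if missing_trnas:
        problems.append("no tRNA for: " + ", ".join(sorted(missing_trnas)))

    missing_core = set(core_proteins) - set(protein_names)
    if missing_core:
        problems.append("missing core protein: " + ", ".join(sorted(missing_core)))

    return problems
```

A genome that fails a check for a genuine biological reason, such as a reduced endosymbiont genome, would be handled by an explanatory comment on the record rather than by changing the annotation.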
Pseudogene Identification, Nomenclature, and Annotation
Pseudogene annotation strategies and outcomes
How to Annotate
no translation; product name is in note, associated feature (CDS, tRNA, rRNA, etc.) will be annotated
normal gene annotated, potential pseudogene status in note
no CDS feature, not documented as a pseudogene, not trackable as protein vs. RNA-coding
Frameshifted gene and sequence IS correct
combine intervals into a single gene with /pseudo
no translation; product name is in note
Frameshifted gene and sequence MAY be correct
keep both and add a note to each CDS
two separate coding regions and two protein translations
Frameshifted gene and there are sequence ERRORS
/exception=“annotated by transcript or proteomic data” AND (/experiment OR /inference)
experimental evidence defining the evidence that translation is correct and/or inference pointing to Accession Number with correct translation
protein sequence imported-translation does not match nucleotide
Frameshifted gene and there are sequence ERRORS
locations altered for ‘correct’ location
all protein deflines prefaced with “LOW-QUALITY PROTEIN:”
Region of similarity
misc_feature denoting location of region of similarity
no gene, no locus_tag, not systematically enumerated
Potential unresolvable problems
note explaining the issue
no change in annotation
Split/interrupted gene in the case of an insertion (ex. transposon insertion)
could be either a single interval, or a split interval, annotation depends on consequence of insertion
no standards for split genes, locations do not match regions of similarity
Functional annotation results include guidelines on protein naming as well as a project to compare different protein naming resources in an effort to converge towards a consistent set of protein names by utilizing common guidelines.
Functional Annotation - Protein Naming Guidelines
Establishing protein naming standards has been a keystone of various curation efforts. In particular, this issue recognizes the protein name as the lowest common denominator of information exchange. The protein name is what is used in BLAST definition lines, which many users utilize as the sole information source. Ontologies were discussed but were not considered a priority. Ensuring up-to-date and well formatted protein names aids functional comparison and reliable hypotheses can be generated based on a set of consistent names, while the converse is true for badly formed names. UniProt had established publicly available naming guidelines that were modified during discussions and a set of prokaryotic-specific naming guidelines was adopted. The guidelines provide a basis for efficient and effective protein naming that is being used in the curation of both UniProt and RefSeq annotations. It is expected that all genomes submitted to INSDC will also follow these guidelines. A separate publication will detail the UniProt naming guidelines which are currently available online (Table 1). In addition, there is a general functional naming guideline that is applicable to protein names for all organisms (Table 1).
One particular issue of protein naming is the issue of specific names for proteins that have unknown or uncertain functional assignments. The final accepted resolution is that only two synonymous names will be acceptable: “hypothetical protein” or “uncharacterized protein”. Names such as “conserved hypothetical protein”, “novel protein”, or “protein of unknown function” are no longer acceptable in genome submissions.
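The resolution on names for proteins of unknown function lends itself to a simple automated check. In the sketch below, the function name and the choice to rewrite disallowed names to "hypothetical protein" (rather than to "uncharacterized protein") are assumptions for illustration; the two accepted names and the three disallowed examples are taken from the text above.

```python
ACCEPTED_UNKNOWN_NAMES = ("hypothetical protein", "uncharacterized protein")

# Examples of names that are no longer acceptable in genome submissions.
DISALLOWED_UNKNOWN_NAMES = (
    "conserved hypothetical protein",
    "novel protein",
    "protein of unknown function",
)


def normalize_unknown_name(name, preferred="hypothetical protein"):
    """Map disallowed 'unknown function' names to an accepted name.

    Names that are not recognized as 'unknown function' names are returned
    unchanged.
    """
    if preferred not in ACCEPTED_UNKNOWN_NAMES:
        raise ValueError(f"{preferred!r} is not an accepted name")
    cleaned = " ".join(name.split()).lower()  # collapse whitespace, ignore case
    if cleaned in ACCEPTED_UNKNOWN_NAMES:
        return cleaned
    if cleaned in DISALLOWED_UNKNOWN_NAMES:
        return preferred
    return name


renamed = normalize_unknown_name("Conserved  hypothetical protein")
# renamed == "hypothetical protein"
```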
Comparison of functional annotation sources
Numerous resources are used in the annotation of protein functions and names, and there are two established models for curation. Either a model organism database has been established for particularly important or well-studied organisms, or a set of protein families with similar function has been curated. One of the earliest examples of the latter was the Clusters of Orthologous Groups developed at NCBI, which is no longer actively curated. Since that time extensive work has been done by at least four separate groups: JCVI has produced the TIGRFAM set of protein families, with a subset identified as equivalogs with the same function; UniProt has produced the High-quality Automated and Manual Annotation of microbial and chloroplast Proteomes (HAMAP); the Kyoto Encyclopedia of Genes and Genomes (KEGG) provides orthology groups (KO) that use NCBI Reference Sequences; and NCBI maintains the Protein Clusters database (ProtClustDB), which includes prokaryotic, viral, and selected eukaryotic organism groups [46,47,49,77]. The TIGRFAMs and HAMAP projects contain only curated families, whereas KEGG and ProtClustDB have both curated and uncurated clusters. In 2009 NCBI and JCVI collaborated on an initiative to compare the functional names derived from TIGRFAMs with NCBI’s curated protein clusters. The comparison results led to improvements in both databases (data not shown). A comparison of protein family annotation from all four databases is available online (Table 1).
Core proteins added to RefSeq genomes1
Number of additions3
30S ribosomal protein S8
30S ribosomal protein S11
30S ribosomal protein S14
30S ribosomal protein S15
30S ribosomal protein S19
50S ribosomal protein L2
50S ribosomal protein L11
50S ribosomal protein L23
50S ribosomal protein L29
elongation factor P
translation initiation factor IF-1
The core set establishes the initial set for functional name comparison for the 61 functions and 191 clusters. Comparison to TIGRFAM, HAMAP, and KEGG resulted in mapping to 127, 99, and 77 families (or subfamilies), respectively. A total of 122 of the 191 clusters have mappings to all other sources. Of those, only 26 have identical curated names. Multi-way comparison shows that most non-identical names are synonymous, except in a few cases. Examples include the tRNA synthetases, which almost always have identical names but in a few cases are named as the ligase rather than the synthetase. An example is ‘tryptophanyl-tRNA synthetase’, which in some instances is named ‘tryptophan—tRNA ligase’, the accepted NC-IUB (Nomenclature Committee of the International Union of Biochemistry) name for Enzyme Commission number 6.1.1.2 (Table 1). Pairwise comparison of ProtClustDB clusters with the other protein family sources shows two things: 1) a number of protein family resources are missing curated core functions, or these families mapped below threshold levels, and 2) there are substantially higher numbers of identically curated protein names in two- and three-way comparisons. All four databases have agreed to resolve differences and to work to incorporate the UniProt guidelines into the curated functional names. As these resources are heavily used in genome annotation pipelines, improvements to these records will improve annotations in many genomes and set a standard for other resources. Additional protein family resources that agree to the same goals are encouraged to participate and are welcome to contact us. InterPro, for example, is another database that integrates information from a variety of source databases, and its ongoing effort was acknowledged at the workshop.
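The multi-way comparison described above reduces to two counts: how many clusters are mapped in every source, and how many of those carry identical curated names. The sketch below illustrates that bookkeeping in Python under stated assumptions: the cluster identifiers, the input layout, and the function name `summarize_name_agreement` are hypothetical, and the toy names are invented for illustration rather than taken from any of the four databases. It tests only for identical names after trivial normalization; deciding whether two different names are synonymous still requires curation.

```python
SOURCES = ("ProtClustDB", "TIGRFAM", "HAMAP", "KEGG")

# Toy input: cluster id -> {source: curated name}. All values are
# illustrative assumptions, not real database content.
TOY_CLUSTER_NAMES = {
    "cluster_A": {
        "ProtClustDB": "elongation factor P",
        "TIGRFAM": "Elongation factor P",
        "HAMAP": "elongation factor P",
        "KEGG": "elongation factor P",
    },
    "cluster_B": {
        "ProtClustDB": "tryptophanyl-tRNA synthetase",
        "TIGRFAM": "tryptophanyl-tRNA synthetase",
        "HAMAP": "tryptophan--tRNA ligase",
        "KEGG": "tryptophanyl-tRNA synthetase",
    },
    "cluster_C": {  # no KEGG mapping
        "ProtClustDB": "30S ribosomal protein S8",
        "TIGRFAM": "30S ribosomal protein S8",
        "HAMAP": "30S ribosomal protein S8",
    },
}


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(name.lower().split())


def summarize_name_agreement(cluster_names):
    """Return (clusters mapped in all sources, subset with identical names)."""
    mapped_everywhere = [
        cluster
        for cluster, names in cluster_names.items()
        if all(names.get(source) for source in SOURCES)
    ]
    identical = [
        cluster
        for cluster in mapped_everywhere
        if len({normalize(cluster_names[cluster][s]) for s in SOURCES}) == 1
    ]
    return mapped_everywhere, identical


mapped, identical = summarize_name_agreement(TOY_CLUSTER_NAMES)
print(len(mapped), "mapped in all sources;", len(identical), "with identical names")
```

In the toy data, the synthetase/ligase pair is counted as non-identical even though the names are synonymous, which is the kind of case the text describes.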
Viral/phage annotation standards
Viral annotation standards were discussed for the first time at the 2010 annotation workshop. A set of proposals was published separately; it synthesizes many of the ideas presented above regarding annotation, the capture of experimental data and metadata, and genome classification, all in the context of viral genomes.
Minimal annotation standards and guidelines accepted at the 2010 NCBI genome annotation workshop
1. A complete prokaryotic genome should have:
a. a set of ribosomal RNAs (at least one each of 5S, 16S, 23S)
b. a set of tRNAs (at least one for each amino acid)
c. protein-coding genes at expected density (not all named ‘hypothetical protein’ and all core genes annotated)
2. Annotations should follow INSDC submission guidelines:
Annotation standards should follow feature table format and submission guidelines (GenBank/ENA/DDBJ - Table 1)
a. prior to genome submission, a submitted BioProject record with a registered locus_tag prefix is required, and the genome record should contain the BioProject ID. All proper features should have genes and locus_tags
b. the genome submission should be valid according to feature table documentation and follow the standards
3. Methodologies and SOPs (Standard Operating Procedures):
Information about SOPs and additional meta data can be provided in a structured comment with more specific information about experimental or inference support provided on annotated features (see Table 2).
4. Exceptions:
Exceptions (unusual annotations, annotations not within expected ranges - see Table 1) should be documented on the genome record and strong supporting evidence should be provided.
5. Pseudogenes:
Annotated pseudogenes should follow the accepted formats (see Table 4).
6. Additional/enriched annotations:
Additional (enriched) annotations should follow INSDC guidelines, and be documented as above (SOPs and evidence).
7. Catalog of reputable annotation guidelines, software, and pipelines:
This non-exhaustive list of reliable software, sources, and databases for the production of microbial genome annotation is a useful community resource that aids in producing high quality genome annotation (Table 1).
8. Validation checks and annotation measures:
Validation checks should be done prior to the submission of a new genome record. NCBI has already provided numerous tools to validate and ensure correctness of annotation and additional checks and reports will be put in place to ensure minimal standards are met (see Table 1).
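Item 1 of the minimal standards above can be checked mechanically before submission. The following Python sketch is one hedged illustration, not an NCBI validation tool: the feature representation (a list of dictionaries with `type` and `product` keys), the product strings of the form "16S ribosomal RNA" and "tRNA-Ala", and the function name `check_minimal_standards` are all assumptions made for this example. It does not assess gene density or whether all core genes are annotated.

```python
REQUIRED_RRNAS = {"5S", "16S", "23S"}

# The 20 standard amino acids, in three-letter code.
STANDARD_AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


def check_minimal_standards(features):
    """Return a list of problems; an empty list means item 1 checks passed.

    `features` is a hypothetical list of dicts such as
    {"type": "rRNA", "product": "16S ribosomal RNA"}.
    """
    rrna_found, trna_found, cds_products = set(), set(), []
    for feature in features:
        kind, product = feature["type"], feature["product"]
        if kind == "rRNA":
            rrna_found.add(product.split()[0])  # "16S ribosomal RNA" -> "16S"
        elif kind == "tRNA" and product.startswith("tRNA-"):
            trna_found.add(product[len("tRNA-"):])  # "tRNA-Ala" -> "Ala"
        elif kind == "CDS":
            cds_products.append(product.lower())

    problems = []
    missing_rrna = REQUIRED_RRNAS - rrna_found
    if missing_rrna:
        problems.append("missing rRNA: " + ", ".join(sorted(missing_rrna)))
    missing_trna = STANDARD_AMINO_ACIDS - trna_found
    if missing_trna:
        problems.append("missing tRNA for: " + ", ".join(sorted(missing_trna)))
    if not cds_products:
        problems.append("no protein-coding genes annotated")
    elif all(p == "hypothetical protein" for p in cds_products):
        problems.append("all proteins are named 'hypothetical protein'")
    return problems
```

Genomes that legitimately fall outside these expectations, such as highly reduced symbiont genomes, would be reported here as well; under the standards above they are handled as documented exceptions rather than as errors.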
The authors would like to thank the J. Craig Venter Institute for hosting the workshop and especially Tanja Davidsen and Ramana Madupu for help in the organization before, during, and after the workshop. Funding for the open access charge was provided by the Intramural Research Program of the National Institutes of Health; National Library of Medicine.
- Bork P, Ouzounis C, Sander C, Scharf M, Schneider R, Sonnhammer E. Comprehensive sequence analysis of the 182 predicted open reading frames of yeast chromosome III. Protein Sci 1992; 1:1677–1690. doi:10.1002/pro.5560011216
- Bork P, Ouzounis C, Sander C, Scharf M, Schneider R, Sonnhammer E. What’s in a genome? Nature 1992; 358:287. doi:10.1038/358287a0
- Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995; 269:496–512. doi:10.1126/science.7542800
- Madupu R, Brinkac LM, Harrow J, Wilming LG, Bohme U, Lamesch P, Hannick LI. Meeting report: a workshop on Best Practices in Genome Annotation. Database (Oxford) 2010; 2010:baq001.
- White O, Kyrpides N. Meeting Report: Towards a Critical Assessment of Functional Annotation Experiment (CAFAE) for bacterial genome annotation. Stand Genomic Sci 2010; 3:240–242. doi:10.4056/sigs.1323436
- Ouzounis CA, Karp PD. The past, present and future of genome-wide re-annotation. Genome Biol 2002; 3(2):COMMENT2001.
- Ouzounis C, Bork P, Casari G, Sander C. New protein functions in yeast chromosome VIII. Protein Sci 1995; 4:2424–2428. doi:10.1002/pro.5560041121
- Kyrpides NC. Fifteen years of microbial genomics: meeting the challenges and fulfilling the dream. Nat Biotechnol 2009; 27:627–632. doi:10.1038/nbt.1552
- Klimke W, Tatusova T. Microbial Genomes at NCBI. Apweiler NMaR, editor. New York: Nova Science Publishers, Inc.; 2006.
- Liolios K, Chen IM, Mavromatis K, Tavernarakis N, Hugenholtz P, Markowitz VM, Kyrpides NC. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2010; 38(Database issue):D346–D354.
- Fraser CM, Eisen JA, Nelson KE, Paulsen IT, Salzberg SL. The value of complete microbial genome sequencing (you get what you pay for). J Bacteriol 2002; 184(23):6403–6405; discussion 6405.
- Metzker ML. Sequencing technologies — the next generation. Nat Rev Genet 2010; 11:31–46. doi:10.1038/nrg2626
- Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLOS Comput Biol 2009; 5:e1000605. doi:10.1371/journal.pcbi.1000605
- Dall’Olio GM, Bertranpetit J, Laayouni H. The annotation and the usage of scientific databases could be improved with public issue tracker software. Database (Oxford) 2010; 2010:baq035. doi:10.1093/database/baq035
- Ussery DW, Hallin PF. Genome Update: annotation quality in sequenced microbial genomes. Microbiology 2004; 150:2015–2017. doi:10.1099/mic.0.27338-0
- Andorf C, Dobbs D, Honavar V. Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach. BMC Bioinformatics 2007; 8:284. doi:10.1186/1471-2105-8-284
- Galperin MY, Nikolskaya AN, Koonin EV. Novel domains of the prokaryotic two-component signal transduction systems. FEMS Microbiol Lett 2001; 203:11–21. doi:10.1111/j.1574-6968.2001.tb10814.x
- Pei J, Grishin NV. GGDEF domain is homologous to adenylyl cyclase. Proteins 2001; 42:210–216. doi:10.1002/1097-0134(20010201)42:2<210::AID-PROT80>3.0.CO;2-8
- Römling U, Gomelsky M, Galperin MY. C-di-GMP: the dawning of a novel bacterial signalling system. Mol Microbiol 2005; 57:629–639. doi:10.1111/j.1365-2958.2005.04697.x
- Rentzsch R, Orengo CA. Protein function prediction—the power of multiplicity. Trends Biotechnol 2009; 27:210–219. doi:10.1016/j.tibtech.2009.01.002
- Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997; 25:955–964. doi:10.1093/nar/25.5.955
- Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 2007; 35:3100–3108. doi:10.1093/nar/gkm160
- Glasner JD, Rusch M, Liss P, Plunkett G, III, Cabot EL, Darling A, Anderson BD, Infield-Harm P, Gilson MC, Perna NT. ASAP: a resource for annotating, curating, comparing, and disseminating genomic data. Nucleic Acids Res 2006; 34(Database issue):D41–D45. doi:10.1093/nar/gkj164
- Greene JM, Collins F, Lefkowitz EJ, Roos D, Scheuermann RH, Sobral B, Stevens R, White O, Di Francesco V. National Institute of Allergy and Infectious Diseases bioinformatics resource centers: new assets for pathogen informatics. Infect Immun 2007; 75:3212–3219. doi:10.1128/IAI.00105-07
- Pruitt KD, Tatusova T, Klimke W, Maglott DR. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res 2009; 37(Database issue):D32–D36. doi:10.1093/nar/gkn721
- Klimke W, Agarwala R, Badretdin A, Chetvernin S, Ciufo S, Fedorov B, Kiryutin B, O’Neill K, Resch W, Resenchuk S, et al. The National Center for Biotechnology Information’s Protein Clusters Database. Nucleic Acids Res 2009; 37(Database issue):D216–D223. doi:10.1093/nar/gkn734
- The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res 2009; 37(Database issue):D169–D174. doi:10.1093/nar/gkn664
- Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I, et al. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2004; 33(Database issue):D297–D302. doi:10.1093/nar/gki039
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, et al. Ensembl 2011. Nucleic Acids Res 2011; 39(Database issue):D800–D806.
- Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001; 29:365–371. doi:10.1038/ng1201-365
- Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, Tatusova T, Thomson N, Allen MJ, Angiuoli SV, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 2008; 26:541–547. doi:10.1038/nbt1360
- Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, et al. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 2008; 26:889–896. doi:10.1038/nbt.1411
- Gaudet P, Bairoch A, Field D, Sansone SA, Taylor C, Attwood TK, Bateman A, Blake JA, Bult CJ, Cherry JM, et al. Towards BioDBcore: a community-defined information specification for biological databases. Nucleic Acids Res 2011; 39(Database issue):D7–D10. doi:10.1093/nar/gkq1173
- Quackenbush J. Data reporting standards: making the things we use better. Genome Med 2009; 1:111. doi:10.1186/gm111
- Kaminuma E, Mashima J, Kodama Y, Gojobori T, Ogasawara O, Okubo K, Takagi T, Nakamura Y. DDBJ launches a new archive database with analytical tools for next-generation sequence data. Nucleic Acids Res 2010; 38(Database issue):D33–D38. doi:10.1093/nar/gkp847
- Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tarraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R, et al. The European Nucleotide Archive. Nucleic Acids Res 2011; 39(Database issue):D28–D31. doi:10.1093/nar/gkq967
- Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 2007; 35(Web Server issue):W182–W185.
- Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008; 9:75. doi:10.1186/1471-2164-9-75
- JGI website. http://www.jgi.doe.gov
- Goll J, Montgomery R, Brinkac LM, Schobel S, Harkins DM, Sebastian Y, Shrivastava S, Durkin S, Sutton G. The Protein Naming Utility: a rules database for protein nomenclature. Nucleic Acids Res 2010; 38(Database issue):D336–D339. doi:10.1093/nar/gkp958
- Antonov I, Borodovsky M. Genetack: frameshift identification in protein-coding sequences by the Viterbi algorithm. J Bioinform Comput Biol 2010; 8:535–551. doi:10.1142/S0219720010004847
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2011; 39(Database issue):D38–D51. doi:10.1093/nar/gkq1172
- Riley M, Abe T, Arnaud MB, Berlyn MK, Blattner FR, Chaudhuri RR, Glasner JD, Horiuchi T, Keseler IM, Kosuge T, et al. Escherichia coli K-12: a cooperatively developed annotation snapshot—2005. Nucleic Acids Res 2006; 34:1–9. doi:10.1093/nar/gkj405
- Siguier P, Perochon J, Lestrade L, Mahillon J, Chandler M. ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res 2006; 34(Database issue):D32–D36. doi:10.1093/nar/gkj014
- Roberts AP, Chandler M, Courvalin P, Guedon G, Mullany P, Pembroke T, Rood JI, Smith CJ, Summers AO, Tsuda M, et al. Revised nomenclature for transposable genetic elements. Plasmid 2008; 60:167–173. doi:10.1016/j.plasmid.2008.08.001
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003; 4:41. doi:10.1186/1471-2105-4-41
- Lima T, Auchincloss AH, Coudert E, Keller G, Michoud K, Rivoire C, Bulliard V, de Castro E, Lachaize C, Baratin D, et al. HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot. Nucleic Acids Res 2009; 37(Database issue):D471–D478. doi:10.1093/nar/gkn661
- Aoki-Kinoshita KF, Kanehisa M. Gene annotation and pathway mapping in KEGG. Methods Mol Biol 2007; 396:71–91. doi:10.1007/978-1-59745-515-2_6
- Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res 2007; 35(Database issue):D260–D264. doi:10.1093/nar/gkl1043
- Leplae R, Lima-Mendez G, Toussaint A. ACLAME: a CLAssification of Mobile genetic Elements, update 2010. Nucleic Acids Res 2010; 38(Database issue):D57–D61. doi:10.1093/nar/gkp938
- Genome Annotation Workshop NCBI. http://www.ncbi.nlm.nih.gov/genomes/AnnotationWorkshop.html
- Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, et al. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Genome Res 2009; 19:1316–1323. doi:10.1101/gr.080531.108
- Blattner FR, Plunkett G, III, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, et al. The complete genome sequence of Escherichia coli K-12. Science 1997; 277:1453–1462. doi:10.1126/science.277.5331.1453
- Keseler IM, Collado-Vides J, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, Muniz-Rascado L, Bonavides-Martinez C, Paley S, Krummenacker M, Altman T, et al. EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res 2011; 39(Database issue):D583–D590. doi:10.1093/nar/gkq1143
- Rudd KE. EcoGene: a genome sequence database for Escherichia coli K-12. Nucleic Acids Res 2000; 28:60–64. doi:10.1093/nar/28.1.60
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. GenBank. Nucleic Acids Res 2011; 39(Database issue):D32–D37. doi:10.1093/nar/gkq1079
- BioProject. http://www.ncbi.nlm.nih.gov/genomeprj
- Angiuoli SV, Gussman A, Klimke W, Cochrane G, Field D, Garrity G, Kodira CD, Kyrpides N, Madupu R, Markowitz V, et al. Toward an online repository of Standard Operating Procedures (SOPs) for (meta)genomic annotation. OMICS 2008; 12:137–141. doi:10.1089/omi.2008.0017
- Winsor GL, Van Rossum T, Lo R, Khaira B, Whiteside MD, Hancock RE, Brinkman FS. Pseudomonas Genome Database: facilitating user-friendly, comprehensive comparisons of microbial genomes. Nucleic Acids Res 2009; 37(Database issue):D483–D488. doi:10.1093/nar/gkn861
- The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res 2010; 38(Database issue):D331–D335.
- Gil R, Silva FJ, Pereto J, Moya A. Determination of the core of a minimal bacterial gene set. Microbiol Mol Biol Rev 2004; 68:518–537. doi:10.1128/MMBR.68.3.518-537.2004
- Harris JK, Kelley ST, Spiegelman GB, Pace NR. The genetic core of the universal ancestor. Genome Res 2003; 13:407–412. doi:10.1101/gr.652803
- Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA. The relationship of protein conservation and sequence length. BMC Evol Biol 2002; 2:20. doi:10.1186/1471-2148-2-20
- Giovannoni SJ, Tripp HJ, Givan S, Podar M, Vergin KL, Baptista D, Bibbs L, Eads J, Richardson TH, Noordewier M, et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science 2005; 309:1242–1245. doi:10.1126/science.1114057
- Nakabachi A, Yamashita A, Toh H, Ishikawa H, Dunbar HE, Moran NA, Hattori M. The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science 2006; 314:267. doi:10.1126/science.1134196
- McCutcheon JP, McDonald BR, Moran NA. Origin of an alternative genetic code in the extremely small and GC-rich genome of a bacterial symbiont. PLoS Genet 2009; 5:e1000565. doi:10.1371/journal.pgen.1000565
- Dufresne A, Garczarek L, Partensky F. Accelerated evolution associated with genome reduction in a free-living prokaryote. Genome Biol 2005; 6:R14. doi:10.1186/gb-2005-6-2-r14
- Rocap G, Larimer FW, Lamerdin J, Malfatti S, Chain P, Ahlgren NA, Arellano A, Coleman M, Hauser L, Hess WR, et al. Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation. Nature 2003; 424:1042–1047. doi:10.1038/nature01947
- Willenbrock H, Binnewies TT, Hallin PF, Ussery DW. Genome update: 2D clustering of bacterial genomes. Microbiology 2005; 151:333–336. doi:10.1099/mic.0.27811-0
- Moran NA, McLaughlin HJ, Sorek R. The dynamics and time scale of ongoing genomic erosion in symbiotic bacteria. Science 2009; 323:379–382. doi:10.1126/science.1167140
- Shigenobu S, Watanabe H, Hattori M, Sakaki Y, Ishikawa H. Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 2000; 407:81–86. doi:10.1038/35024074
- Shen X, Wang Q, Xia L, Zhu X, Zhang Z, Liang Y, Cai H, Zhang E, Wei J, Chen C, et al. Complete genome sequences of Yersinia pestis from natural foci in China. J Bacteriol 2010; 192:3551–3552. doi:10.1128/JB.00340-10
- Jeong H, Barbe V, Lee CH, Vallenet D, Yu DS, Choi SH, Couloux A, Lee SW, Yoon SH, Cattolico L, et al. Genome sequences of Escherichia coli B strains REL606 and BL21(DE3). J Mol Biol 2009; 394:644–652. doi:10.1016/j.jmb.2009.09.052
- Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N, Cayting P, Harrison P, Gerstein M. Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res 2007; 35(Database issue):D55–D60. doi:10.1093/nar/gkl851
- Liu Y, Harrison PM, Kunin V, Gerstein M. Comprehensive analysis of pseudogenes in prokaryotes: widespread gene decay and failure of putative horizontally transferred genes. Genome Biol 2004; 5:R64. doi:10.1186/gb-2004-5-9-r64
- Kuo CH, Ochman H. The extinction dynamics of bacterial pseudogenes. PLoS Genet 2010; 6. doi:10.1371/journal.pgen.1001050
- Okuda S, Yamada T, Hamajima M, Itoh M, Katayama T, Bork P, Goto S, Kanehisa M. KEGG Atlas mapping for global analysis of metabolic pathways. Nucleic Acids Res 2008; 36(Web Server issue):W423–W426.
- Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res 2008; 36:6688–6719. doi:10.1093/nar/gkn668
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pangenome”. Proc Natl Acad Sci USA 2005; 102:13950–13955. doi:10.1073/pnas.0506758102
- Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, et al. InterPro: the integrative protein signature database. Nucleic Acids Res 2009; 37(Database issue):D211–D215. doi:10.1093/nar/gkn785
- Brister JR, Bao Y, Kuiken C, Lefkowitz EJ, Le Mercier P, Leplae R, Madupu R, Scheuermann RH, Schobel S, Seto D, et al. Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop. Viruses 2010; 2:2258–2268. doi:10.3390/v2102258
- Roberts RJ, Chang YC, Hu Z, Rachlin JN, Anton BP, Pokrzywa RM, Choi HP, Faller LL, Guleria J, Housman G, et al. COMBREX: a project to accelerate the functional annotation of prokaryotic genomes. Nucleic Acids Res 2011; 39(Database issue):D11–D14. doi:10.1093/nar/gkq1168