Quantifying Protein Function Specificity in the Gene Ontology
© The Author(s) 2010
Published: 30 April 2010
Quantitative or numerical metrics of protein function specificity made possible by the Gene Ontology are useful in that they enable development of distance or similarity measures between protein functions. Here we describe how to calculate four measures of function specificity for GO terms: 1) number of ancestor terms; 2) number of offspring terms; 3) proportion of terms; and 4) Information Content (IC). We discuss the relationship between the metrics and the strengths and weaknesses of each.
Keywords: protein annotation, protein function, function specificity
Genomic sciences and biological understanding can be greatly enriched by quantitative comparisons between the descriptions of protein functions [1–5]. To achieve this, numerical descriptions of protein function specificity must be defined. This is now possible using the Gene Ontology (GO) [6,7]. The GO is a standardized description of protein function structured as a hierarchy of “parent-child” relationships, formally called a directed acyclic graph (DAG). DAGs have long been used in computer science as a mathematical formalism for describing complex objects. Modeling protein function as a DAG provides a means of defining protein functions, and the relationships between them, more precisely than traditional natural language descriptions, which are information-rich but not readily amenable to computation.
The use of the GO provides a conception of function specificity that has immediate implications in the automated annotation of proteins [4,6]. Millions of proteins in public databases have their functions inferred from proteins with similar sequences. The meaningful transfer of those functions is made possible in part by the standardized organization of the GO. The GO is organized so that as one traverses away from the root node, function definitions become narrower; examples of broad functional terms are “catalytic activity” (GO:0003824) or “transporter activity” (GO:0005215), while narrower functions would be “adenylate cyclase activity” (GO:0004016) or “peptidoglycan transporter activity” (GO:0015647). Quantifying the path from broad to narrow function is ambiguous, however: path lengths vary and edges carry no weights, which makes a meaningful numeric interpretation of function specificity problematic. This ambiguity can be addressed by considering various aspects of the DAG structure of GO. Each node in GO (i.e., GO term) is assigned a function, and measurements such as the number of ancestor or offspring nodes for that term can be used to give a numeric assessment of that term’s specificity. This paper discusses various methods created to improve the precision of assigning and comparing the specificity of GO terms, along with the strengths and weaknesses of each method.
The methods described here utilize the R programming language with the Bioconductor R package installed and a dataset of associations between gene or protein identifiers and GO terms, e.g. the gene2go file available at ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/. All methods described here are available for free.
Number of Ancestors
Number of Offspring
The interpretation of this measure is that a higher OffspN indicates more specific function. This keeps the same “higher is better” convention used by other GO specificity metrics such as GO ancestors. Pseudocode for this calculation is also exemplified in Figure 2. Note, however, that the neighbors of a node would include only its child nodes, not its parents.
Step 1: Count the number of proteins assigned to the term.
Step 2: Count the number of proteins assigned to all offspring of the term.
Step 3: Add the counts from Steps 1 and 2 and divide by the total number of proteins in the data set.
Step 1: Count the proteins assigned to GO:Term_3 (one).
Step 2: Count the proteins assigned to the offspring of GO:Term_3 (three).
Step 3: Calculate p(t) for GO:Term_3, in this case 4/11. Its IC is therefore 1.5.
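The three steps can be sketched in a few lines of Python. This is an illustration only, not the R/Bioconductor code used in this paper. It assumes IC is defined as −log2 p(t), which reproduces the value of about 1.5 in the worked example. The offspring term names and their individual counts are hypothetical, chosen only so that GO:Term_3 has one directly assigned protein, three proteins among its offspring, and 11 proteins in the data set.

```python
import math

# Hypothetical toy data: only the totals (1 direct, 3 in offspring, 11 overall)
# come from the worked example; the offspring names and split are made up.
direct_counts = {"GO:Term_3": 1, "GO:Term_5": 2, "GO:Term_6": 1}
offspring = {"GO:Term_3": ["GO:Term_5", "GO:Term_6"]}
total_proteins = 11


def term_probability(term, direct_counts, offspring, total):
    """Steps 1-3: (proteins on the term + proteins on its offspring) / total."""
    own = direct_counts.get(term, 0)                     # Step 1
    below = sum(direct_counts.get(child, 0)              # Step 2
                for child in offspring.get(term, []))
    # Step 3. Adding the counts assumes no protein is annotated to both the
    # term and one of its offspring; otherwise it would be counted twice.
    return (own + below) / total


def information_content(p):
    """IC under the assumed definition -log2 p(t)."""
    return -math.log2(p)


p_t = term_probability("GO:Term_3", direct_counts, offspring, total_proteins)
ic = information_content(p_t)
print(f"p(t) = {p_t:.3f}, IC = {ic:.2f}")
```

Under the same assumed definition, a term whose p(t) equals 1, such as the root, receives an IC of 0.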
The appealing part of IC is that it implicitly accounts for the hierarchical structure of the GO. The root node, “molecular function” (GO:0003674), has an IC of 0.0, which provides a natural baseline for the most non-specific function. As the GO is traversed away from the root node, one generally expects the IC of terms to increase.
Calculating GO Ancestors/Offspring
 “all” “GO:0005515” “GO:0003674” “GO:0005488”
This returns the lone offspring of GO:0005518, the term “collagen V binding” (GO:0070052).
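The same ancestor and offspring lookups can be illustrated on a small hand-built graph. The Python sketch below is not the Bioconductor code used in this paper. Its child-to-parent edges are entered by hand, on the assumption that the four identifiers printed above are the ancestors of GO:0005518, and are not read from the GO itself.

```python
# Hand-entered child -> parents edges, for illustration only (an assumption,
# not loaded from the GO). "all" is the artificial top node seen in the output.
parents = {
    "GO:0070052": ["GO:0005518"],
    "GO:0005518": ["GO:0005515"],
    "GO:0005515": ["GO:0005488"],
    "GO:0005488": ["GO:0003674"],
    "GO:0003674": ["all"],
}


def reachable(term, edges):
    """Return every node reachable from `term` by following `edges`."""
    seen = set()
    stack = list(edges.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def ancestors(term, parents):
    """All terms above `term` in the DAG."""
    return reachable(term, parents)


def offspring(term, parents):
    """All terms below `term`: invert the edges, then walk downward."""
    children = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, []).append(child)
    return reachable(term, children)


anc = ancestors("GO:0005518", parents)
off = offspring("GO:0005518", parents)
print(sorted(anc), len(anc))
print(sorted(off), len(off))
```

The sizes of the two sets are the raw counts on which the ancestor and offspring measures are built.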
Calculating Information Content
Table 1. Examples of GO terms and their specificity metrics (surviving row labels: Max Specificity Value, Min Specificity Value).
The GO proportion is useful in that it considers the implications of both the number of ancestors and the number of offspring. It is also on a 0–1 scale, which allows a readily accessible min-max interpretation. However, the concerns with this proportion measure are similar to those pertaining to GO ancestors and GO offspring.
The most distinctive aspect of IC is that it is calculated from data. This strength doubles as a caveat: because IC is data dependent, the IC of a particular GO term may fluctuate from data set to data set. In our experience, however, the IC calculation is generally robust for most terms, especially common ones, as their probability of occurrence in a data set changes little. Very scarce GO terms can be affected more strongly if more instances of the term are added to a data set, although we have found this to be a rare event. In addition, if a GO term does not occur in a data set, its IC cannot be calculated. This is undesirable, especially if that GO term is of interest to the research project at hand. Finally, the occurrence of GO terms in a data set might reflect bias in the way proteins are annotated, which may not be representative of the natural state of function specificity; that is, a higher IC may not truly indicate more specific function.
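The data dependence of IC can be made concrete with a small numerical sketch. The Python below assumes IC = −log2 p(t) and uses hypothetical counts; it is an illustration rather than an analysis of any real data set. It adds one annotated protein to a rare term and to a common term, and shows that a term absent from the data receives no IC at all.

```python
import math


def information_content(term_count, total):
    """Return -log2(term_count / total), or None when the term is absent."""
    if term_count == 0:
        return None  # log of zero is undefined, so no IC can be assigned
    return -math.log2(term_count / total)


# Hypothetical counts: one extra annotated protein is added in each case.
rare_before = information_content(1, 1000)
rare_after = information_content(2, 1001)
common_before = information_content(500, 1000)
common_after = information_content(501, 1001)
absent = information_content(0, 1000)

print(f"rare term:   {rare_before:.2f} -> {rare_after:.2f}")
print(f"common term: {common_before:.2f} -> {common_after:.2f}")
print("absent term:", absent)
```

With these made-up counts, the rare term's IC falls by roughly one unit after a single added annotation, while the common term's IC is essentially unchanged.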
There is also some “divergence” between the various specificity metrics for some GO terms. Concrete examples of where the specificity measures may diverge can be seen in Table 1. For instance, the GO term “interacting selectively and non-covalently with ATP, adenosine 5′-triphosphate” (GO:0005524) has the maximum value of the GO offspring measurement (OffspN=9.02) and a GO proportion of 1.0, both indications of very specific function. Its IC, however, is only 6.64, an indication of relatively non-specific function. This exemplifies a case where two of the measures indicate specific function while another indicates only moderately specific function. A reversal of sorts can be seen with the GO term “the action of a molecule that contributes to the structural integrity of a cuticle” (GO:0042302). This GO term is very specific by IC (15.18), yet OffspN (7.23) indicates middling specificity, and GO proportion (0.29) and GO ancestors (2) indicate low specificity. The GO term “the function of binding to a specific DNA sequence in order to modulate transcription” (GO:0003700) can be seen as a “middle ground”, as all of its metrics indicate relatively moderate specificity.
These numbers provide evidence that each of these measurements has information to give about function specificity. Consider that GO offspring, which takes into account only the structure of the GO beneath a particular term, is a product of abstracted biological knowledge as it currently exists. This knowledge appears to be represented somewhat idiosyncratically across different types of functions. In contrast, IC considers only the current distribution of GO terms in a database and may be biased by the experimental methodologies used to annotate proteins. No metric is perfect; each has strengths and weaknesses, and considering all of them provides a more holistic view of a GO term and enables specificity comparisons across terms.
Directed Acyclic Graph
- Kolker E, Purvine S, Galperin MY, Stolyar S, Goodlett DR, Nesvizhskii AI, Keller A, Xie T, Eng JK, Yi E, et al. Initial Proteome Analysis of Model Microorganism Haemophilus influenzae Strain Rd KW20. J Bacteriol 2003; 185:4593–4602. doi:10.1128/JB.185.15.4593-4602.2003
- Raghunathan A, Price ND, Galperin MY, Makarova KS, Purvine S, Picone AF, Cherny T, Xie T, Reilly TJ, Munson R, et al. In Silico Metabolic Model and Protein Expression of Haemophilus influenzae Strain Rd KW20 in Rich Medium. OMICS 2004; 8:25–41. doi:10.1089/153623104773547471
- Kolker E, Makarova KS, Shabalina S, Picone AF, Purvine S, Holzman T, Cherny T, Armbruster D, Munson RS, Kolesov G, et al. Identification and functional analysis of ‘hypothetical’ genes expressed in Haemophilus influenzae. Nucleic Acids Res 2004; 32:2353–2361. doi:10.1093/nar/gkh555
- Kolker E, Picone AF, Galperin MY, Romine MF, Higdon R, Makarova KS, Kolker N, Anderson GA, Qiu X, Auberry KJ, et al. Global profiling of Shewanella oneidensis MR-1: Expression of hypothetical genes and improved functional annotations. Proc Natl Acad Sci USA 2005; 102:2099–2104. doi:10.1073/pnas.0409111102
- Louie B, Higdon R, Kolker E. A Statistical Model of Protein Sequence Similarity and Function Similarity Reveals Overly-Specific Function Predictions. PLoS ONE 2009; 4:e7546. doi:10.1371/journal.pone.0007546
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000; 25:25–29. doi:10.1038/75556
- The Gene Ontology Consortium. The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res 2010; 38:D331–D335. doi:10.1093/nar/gkp1018
- Chang S. Data structures and algorithms. University of Pittsburgh, USA: World Scientific; 2003.
- Lord PW, Stevens RD, Brass A, Goble CA. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003; 19:1275–1283. doi:10.1093/bioinformatics/btg153
- Pesquita C, Faria D, Bastos H, Ferreira AE, Falcão AO, Couto FM. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 2008; 9:S4–S4. doi:10.1186/1471-2105-9-S5-S4
- Gentleman RC, Carey V, Bates D, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004; 5:R80. doi:10.1186/gb-2004-5-10-r80