The Metadata Coverage Index (MCI): A standardized metric for quantifying database metadata richness
© The Author(s) 2012
Published: 30 July 2012
Variability in the extent of the descriptions of data (‘metadata’) held in public repositories forces users to assess the quality of records individually, which rapidly becomes impractical. Scoring records on the richness of their description provides a simple, objective proxy measure for quality that enables filtering to support downstream analysis. Pivotally, such scores should spur improvements. Here, we introduce such a measure: the ‘Metadata Coverage Index’ (MCI), the percentage of available fields actually filled in a record or description. MCI scores can be calculated across a database, for individual records or for their component parts (e.g., fields of interest). There are many potential uses for this simple metric: for example, to filter, rank or search for records; to assess the metadata availability of an ad hoc collection; to determine the frequency with which fields in a particular record type are filled, especially with respect to standards compliance; to assess the utility of specific tools and resources, and of data capture practice more generally; to prioritize records for further curation; to serve as performance metrics for funded projects; or to quantify the value added by curation. Here we demonstrate the utility of MCI scores using metadata from the Genomes Online Database (GOLD), including records compliant with the ‘Minimum Information about a Genome Sequence’ (MIGS) standard developed by the Genomic Standards Consortium. We discuss challenges and address further applications of MCI scores: to show improvements in annotation quality over time, to inform the work of standards bodies and repository providers on the usability and popularity of their products, and to assess and credit the work of curators. Such an index provides a step towards putting metadata capture practices and, in the future, standards compliance, into a quantitative and objective framework.
“If you cannot measure it, you cannot improve it.”
As the size, number and complexity of bioscience data sets in the public domain continue to grow, appropriate contextualizing information becomes indispensable. Such ‘halos’ of information are referred to as metadata and include information on how data were collected, processed and analyzed, the nature and state of the biological sample used, and the research context. Nowhere is this more relevant than in high-throughput studies using new technologies, where the rate at which data sets are produced is becoming almost unmanageable under current public provision. We are now at a critical stage at which we need to quantify the value of such contextual information.
Metadata considered critical to data interpretation are often referred to as ‘minimum information’ (MI), and this concept has been expressed in various ‘MI checklists’ covering a range of data types, including transcriptomics, proteomics, metabolomics and genomics. MI checklists specify the contextual information that should be reported to ensure that studies are (in principle) reproducible and can be compared or combined in an appropriately informed manner in downstream analyses. Given the increasing number of such specifications, it behooves the data-sharing community to develop methods to quantify the degree of compliance of databases, individual records or ad hoc collections, in order to highlight challenging-to-acquire components of specifications or to quantify improvements in metadata reporting or database content (for example, through curation).
Here we introduce the first simple metric for evaluating the ‘richness’ of the metadata of any given database (or its compliance with a given standard), together with a straightforward method to calculate it. The ‘Metadata Coverage Index’ (MCI) is the number of fields in a record for which information is provided, expressed as a percentage of the total fields available. An MCI is no guarantee of quality; however, given that automated assessment of the semantic content of metadata remains challenging, and that even the correct use of controlled vocabulary terms cannot currently serve as a general solution, we are prepared to assume that most annotation adds value to the overall data set, and that an MCI is therefore a realizable proxy for the hypothetical ‘Metadata Quality Index’ of a dataset.
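As a concrete illustration (not part of the original GOLD tooling; the field names and values below are hypothetical), the basic calculation reduces to a few lines of Python:

```python
def mci(record, fields):
    """Metadata Coverage Index: the percentage of the given fields
    that hold a non-empty value in this record."""
    if not fields:
        raise ValueError("field list must not be empty")
    filled = sum(1 for f in fields if record.get(f) not in (None, "", []))
    return 100.0 * filled / len(fields)

# A hypothetical record with 3 of 4 fields filled scores 75.0
record = {
    "NCBI TAXON ID": "562",
    "GOLD Genus": "Escherichia",
    "Isolation site": "",        # present but empty: counts as unfilled
    "Host Name": "Homo sapiens",
}
print(mci(record, ["NCBI TAXON ID", "GOLD Genus", "Isolation site", "Host Name"]))
```

The same function serves record-level and database-level scoring: averaging it over all records of a collection (or concatenating the counts) yields the collection-wide MCI.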
An MCI score represents arbitrarily complex contextual information as a simple numerical value. MCI scores can be calculated for individual fields or across collections and databases. While it is clear that some types of metadata carry more value than others, we have deliberately not attempted to model distributions of value across database schemata or MI specifications, in order to preserve the generality of this simplest expression of the metric. The weighting of fields according to local or consensus value could be the focus of future work to generate derived versions of the MCI that reflect those weightings (i.e., that depend on extended validation rules).
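Such a weighted variant is not defined in this work, but a minimal sketch shows what a derived score might look like, assuming illustrative weights chosen by a curator:

```python
def weighted_mci(record, weights):
    """Derived MCI in which each filled field contributes its weight
    rather than a flat count of 1 (the weights here are illustrative)."""
    total = sum(weights.values())
    filled = sum(w for field, w in weights.items()
                 if record.get(field) not in (None, ""))
    return 100.0 * filled / total

# Hypothetical weighting: location metadata valued twice as much
# as a free-text comment field.
weights = {"Isolation site": 2.0, "Latitude/Longitude": 2.0, "Comments": 1.0}
record = {"Isolation site": "soil", "Comments": "archived strain"}
print(round(weighted_mci(record, weights), 1))  # 3.0 of 5.0 weight filled
```

The unweighted MCI is recovered by setting every weight to 1, which is why the simple form generalizes cleanly.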
To illustrate the calculation of this metric and the usefulness of the concept, we use the MCI to profile the Genomes Online Database (GOLD) and to evaluate attempted compliance (i.e., fields filled) with the ‘Minimum Information about a Genome Sequence’ (MIGS) checklist, part of the MIxS standard from the Genomic Standards Consortium (GSC).
Materials and Methods
Spreadsheets containing information for genomes from the Genomic Encyclopedia of Bacteria and Archaea (GEBA) and Human Microbiome Project (HMP) studies, as well as for all genome projects available in GOLD, were obtained from the GOLD database.
Calculation of MCI scores with the MCI Calculator
For users: addition of MCI scores to the GOLD database
MCI scores were calculated for all records in GOLD, added to the GOLDCARD pages and offered for use through the GOLD search interface. Thus, MCI scores can now be used to search and sort GOLD records; for example, to retrieve only those records scoring above a certain MCI threshold.
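This kind of threshold query can be mimicked on any exported record set. The sketch below (hypothetical records and field names, not the GOLD implementation) ranks records by MCI and keeps those above a cut-off:

```python
def mci(record, fields):
    """Percentage of the given fields filled in this record."""
    filled = sum(1 for f in fields if record.get(f) not in (None, ""))
    return 100.0 * filled / len(fields)

def filter_by_mci(records, fields, threshold):
    """Return records whose MCI meets the threshold, richest first."""
    scored = [(mci(r, fields), r) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for score, r in scored if score >= threshold]

fields = ["taxon_id", "genus", "isolation_site"]
records = [
    {"taxon_id": "1", "genus": "Bacillus", "isolation_site": "soil"},  # 100
    {"taxon_id": "2", "genus": "Vibrio"},                              # ~67
    {"taxon_id": "3"},                                                 # ~33
]
print(len(filter_by_mci(records, fields, 50)))  # 2 records pass
```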
Calculating MCI scores and comparison of metadata fields
Table 1 (excerpt). The list of all selected metadata fields in GOLD (columns 2 and 6 of the full table).¹

GOLD Metadata Field          | GOLD Metadata Field
GOLD STAMP ID                | HMP FINISHING GOAL²
NCBI TAXON ID                | HMP ISOLATION BODY SITE²
HMP NCBI SUBMISSION STATUS²  | HMP PROJECT STATUS²
SEQUENCING STATUS LINK       | GENE CALLING METHOD
HMP ISOLATE SOURCE²          | BODY SAMPLE SUBSITES
HMP ISOLATION COMMENTS²      | NCBI PROJECT ID
NUMBER OF READS              | STRAIN INFO ID
HMP ISOLATION COMMENTS²      | HMP ISOLATION BODY SUBSITE²
SHORT READ ARCHIVE ID        | HOST TAXON ID
ISOLATION PUBMED ID          | IN IMG DATABASE
BODY SAMPLE SITES            | SYMBIONT TAXON ID
NCBI ARCHIVE ID              |

¹ Only a subset of the 113 selected fields is reproduced here.
² HMP-specific field.
There are five fields with an MCI score of 100 (fields 1–5 in Table 1). These are the fields filled for all genome projects in GOLD: essential fields for project registration in the database. Seven more fields have an MCI score greater than 99 (fields 6–13): again, essential fields for project registration; where their data are missing, this is most likely due to an error, and such records should be flagged for attention. Some of the listed fields appear redundant (e.g., field 6 versus 14, or 10 versus 13), but the number of records associated with each explains the duplication. For example, GOLD implements a field named ‘GOLD Genus’ (field 6) in addition to the genus information provided by the NCBI Taxonomy (field 14), because genus information is usually more readily available at the time of project registration with GOLD than it is through the NCBI Taxonomy; the same holds for phyla. The MCI score for the field ‘NCBI BioProject ID’ is 75%, which implies that 25% of the projects in GOLD are not yet registered with the NCBI BioProject collection. Forty-two percent of projects have ‘Host Name’ information, reflecting the fraction of genome projects associated with a specific host organism. Seventy-four percent of the projects in GOLD have an ‘update’ date (field 24 in Table 1), suggesting that the majority of projects have been revisited for curation at least once after their creation in the database.
Overall, approximately two thirds of the 113 selected GOLD fields have an MCI score below 50 (fields 33–113); the MCI score across all 113 fields is 34.6. Ten of those fields apply only to projects that are part of the HMP study and were excluded from subsequent comparisons across datasets. Twelve fields belong to the MIGS checklist recommended by the GSC (highlighted in Table 1). The position of the MIGS fields in the overall list of 113 GOLD fields makes clear that these are not the most frequently filled metadata fields across all projects: only two MIGS fields are among the top ten GOLD fields, and only six make the top fifty. While the MIGS fields were never likely to be the most populated (for example, data for ‘Isolation site’ and ‘Latitude/Longitude’ are frequently unavailable, even though they are among the most important metadata fields), their overall position in the list nonetheless suggests that a revision of the checklist may be necessary.
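Rankings like those in Table 1 come from per-field rather than per-record scores. A sketch of that calculation (field names hypothetical):

```python
def field_mci(records, fields):
    """Per-field MCI: for each field, the percentage of records
    in which that field is filled."""
    n = len(records)
    return {
        f: 100.0 * sum(1 for r in records if r.get(f) not in (None, "")) / n
        for f in fields
    }

records = [
    {"taxon_id": "1", "genus": "Bacillus", "isolation_site": "soil"},
    {"taxon_id": "2", "genus": "Vibrio"},
    {"taxon_id": "3"},
]
scores = field_mci(records, ["taxon_id", "genus", "isolation_site"])
# Sort fields from most to least frequently filled, as in Table 1
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```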
MCI score comparison of data sets
Table 2. Comparison of MCI scores from the GOLD database.¹ (Project collections, such as ‘All Projects’, are compared over selected field groups; columns include the project list, field group, fields per record and MCI score.)
We created nine distinct project collections from GOLD (the ‘Project list’ column in Table 2) and organized them into separate comparison groups, enabling the richness of various slices of the full database to be compared. Each comparison is meaningful only within its own group. For example, the ‘GEBA’ collection comprises 256 genome projects, all part of the GEBA study; ‘Complete’ refers to the 2,040 complete genome projects available in GOLD; and ‘HMP’ refers to the 2,096 projects selected for sequencing under the HMP study. The collection ‘All projects’ encompasses the 13,790 isolate genome projects currently available in GOLD, while ‘Archaea’, ‘Bacteria’ and ‘Eukarya’ correspond to the respective phylogenetic subgroups. Each comparison group is characterized by the specific number and type of fields selected for it; this is essential in order to select fields applicable to all projects within a list. Accordingly, all HMP-related fields were excluded from the total set used in this study, creating a set of 103 fields that apply to all project lists (the CORE group). In a similar manner, the ten HMP-specific fields compose the HMP group, while the 12 MIGS fields comprise the MIGS group (all shown in the ‘Field group’ column of Table 2).
Comparing the GEBA collection against the complete-genome, HMP and all-projects lists, using the core 103 metadata fields (group A in Table 2), reveals that GEBA has the best-curated project metadata, with the highest MCI score (54.18%). This reflects the emphasis given to the collection and curation of metadata for this project, and suggests a formal role for the MCI as a performance metric. The availability of SIGS-compliant genome reports for all completed GEBA genomes certainly played a pivotal role in providing a well curated and standardized source of key metadata for those projects. In terms of metadata coverage across phylogenetic groups within the GOLD dataset (group B in Table 2), the archaeal and bacterial subsets had higher MCI scores than the eukaryotes, reflecting the more detailed curation of microbial genome projects for GOLD. Likewise, MCI scores restricted to the MIGS fields were also relatively high, with the GEBA list reaching 68% metadata coverage (group C in Table 2), almost 10% more than the average complete genome. Finally, within the HMP project list, the HMP-specific fields have a high MCI score of 70% (group D in Table 2).
Improvements in MCI scores over time
MCI scores can be used to compare collections and to quantify incremental increases in the richness of metadata over time. To illustrate this, we compared the information contained in the GOLD database in 2008, 2010 and 2012. The 2008 publication of GOLD reported a list of 45 metadata fields and the number of projects associated with those fields, while the 2010 publication reported 105 variables and the number of projects for which information was available. We selected a common set of 33 fields across the three sets and calculated MCI scores for those (group E in Table 2). The results revealed that the overall MCI score has remained stable at around 60%, even though the total number of records has doubled every two years. This raises the question of whether more recent submitters have tended to report more metadata, which would indicate increased acceptance of the value of appropriate metadata. However, since the majority of the data available from GOLD are not provided by submitters but rather collected and curated within the database, this question is hard to address accurately with these data.
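Restricting each snapshot to a shared field set, as done for group E, can be sketched as follows (the snapshot labels and records below are illustrative, not GOLD data):

```python
def snapshot_mci(snapshots, common_fields):
    """Database-wide MCI per snapshot, computed over the same field
    set so that the scores are comparable across releases."""
    out = {}
    for label, records in snapshots.items():
        filled = sum(1 for rec in records for f in common_fields
                     if rec.get(f) not in (None, ""))
        out[label] = 100.0 * filled / (len(records) * len(common_fields))
    return out

snapshots = {
    "2008": [{"a": "x"}, {"a": "x", "b": "y"}],        # 3 of 4 cells filled
    "2010": [{"a": "x"}, {"b": "y"}, {}, {"a": "x"}],  # 3 of 8 cells filled
}
print(snapshot_mci(snapshots, ["a", "b"]))
```

Because the denominator uses only the common fields, a snapshot with many newly introduced fields is not penalized for its larger schema.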
Calculating MCI Scores for Records and Fields
The list of the genome projects in GOLD with the top 10 MCI scores:
- Streptococcus bovis ATCC 700338
- Mycobacterium parascrofulaceum ATCC BAA-614
- Ensifer medicae WSM419
- Rhizobium leguminosarum bv. trifolii WSM2304
- Anaerofustis stercorihominis DSM 17244
- Anaerotruncus colihominis DSM 17241
- Clostridium hiranonis TO-931, DSM 13275
- Clostridium scindens ATCC 35704
- Rhizobium leguminosarum bv. trifolii WSM1325
- Bacteroides stercoris ATCC 43183
MCI scores, as defined here, take into account only the simple presence or absence of values. It is clearly important to ensure that those values are valid (for example, not uninformative ‘placeholders’ entered into required fields by reluctant data submitters, or otherwise inappropriate information). Likewise, sheer quantity of metadata is not necessarily optimal, and care must be taken in both generating and interpreting MCI scores in a manner appropriate to the interpretation of the data at hand. MCI scores are best used when the exact variables in the total list of expected fields are well defined and transparent to the user (ideally, selected from a minimum standard).
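One simple guard, not part of the GOLD calculation described here, would be to treat common placeholder strings as unfilled (the placeholder list below is an assumption):

```python
# Illustrative placeholder strings; a real list would be tuned
# to the conventions of the database being scored.
PLACEHOLDERS = {"n/a", "na", "none", "unknown", "missing", "-", "tbd"}

def is_informative(value):
    """Reject empty values and common placeholder strings so that
    they do not inflate an MCI score."""
    if value is None:
        return False
    text = str(value).strip().lower()
    return bool(text) and text not in PLACEHOLDERS

print(is_informative("soil"), is_informative("N/A"), is_informative("  "))
```

Swapping this predicate in for the bare presence test changes nothing else in the MCI calculation.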
MCI scores will ideally be used to make targeted improvements to databases. They could also be used to track the evolution of databases and their contents over time; for example, to signal significant updates in content even when the total number of entries remains the same, to report progress to funders, or to reward the work of curators who contribute the relevant information. Methods that help define the pivotal contributions of curators, and reward their efforts to the wider community, are needed.
MCI scores could be further refined in several ways: for example, to include only fields matching certain criteria (e.g., string, number, regular-expression-compliant, or curated versus calculated values), or only those using terms from recognized ontologies. This would be particularly useful for judging compliance with a standard such as MIGS: since free text is not allowed, formal validation could be performed using, for example, GCDML (for genomics) or the ISA-Tab (multi-omic) format. MCI scores could also be broken down to cover ‘required’ and ‘optional’ fields separately.
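A validated variant along these lines would replace the presence test with a per-field check. The validators below are hypothetical examples for illustration, not part of MIGS or GCDML:

```python
import re

# Hypothetical per-field validators: a field counts as filled only
# if its value passes the corresponding pattern.
VALIDATORS = {
    "NCBI TAXON ID": re.compile(r"^\d+$"),
    "Latitude/Longitude": re.compile(r"^-?\d+(\.\d+)?[,/]\s*-?\d+(\.\d+)?$"),
}

def validated_mci(record, validators):
    """MCI over validated fields only: mere presence is not enough."""
    ok = sum(1 for field, pattern in validators.items()
             if record.get(field) and pattern.match(str(record[field])))
    return 100.0 * ok / len(validators)

record = {"NCBI TAXON ID": "562", "Latitude/Longitude": "near the river"}
print(validated_mci(record, VALIDATORS))  # only the taxon ID validates
```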
Further refinement of MCI scores would require more thorough validation of metadata, making maximum use of mappings between minimal information requirements, recommended terminologies and any formats used. New community efforts are laying the basis for such a multi-dimensional validation process: data standardization efforts such as the ISA Commons offer common metadata-tracking frameworks that can underpin and facilitate the development of improved validation methods.
Where databases such as PRIDE allow free use of controlled vocabularies to extend records (i.e., user-defined fields), the list of identifiable fields may appear disproportionately large (each term used becomes a field, yielding a very sparse matrix). The MCI requires adaptation for use with such data structures, but even in its basic form it can be useful in determining whether one or more core (minimum) sets of metadata can be identified (subsets of the data with MCI scores well above average).
When calculating MCI scores, it is important to consider that databases may also contain markedly different subsets (for example, delineated by technique or taxon); appropriate partitioning of records before calculation would address this.
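Partitioning before scoring can be sketched as a simple group-by (the partition key used here, a ‘domain’ field, is hypothetical):

```python
from collections import defaultdict

def mci_by_partition(records, fields, key):
    """Compute a separate database-level MCI for each subset of the
    records, e.g. partitioned by taxonomic domain or by technique."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    return {
        name: 100.0 * sum(1 for r in recs for f in fields
                          if r.get(f) not in (None, "")) /
              (len(recs) * len(fields))
        for name, recs in groups.items()
    }

records = [
    {"domain": "Bacteria", "a": "x", "b": "y"},
    {"domain": "Bacteria", "a": "x"},
    {"domain": "Eukarya"},
]
print(mci_by_partition(records, ["a", "b"], key=lambda r: r["domain"]))
```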
In summary, the MCI scores individual records according to the completeness of their metadata and of their component fields, providing valuable insight into the provenance, value and cost of those records. As such, it serves as an objective, quantifiable metric for metadata capture and highlights the scholarly work required to develop curated collections. We look forward to other databases adopting MCI scores, as this will also enable quantitative comparison across these resources.
We would like to thank Kristin Tennessen (JGI) for help with the figures. This work was funded by NERC grant NE/D01252X/1 to DF. KL, IP, BN, and NCK were supported by the Office of Science of the US Department of Energy under contract DE-AC02-05CH11231 and together with OW by the US National Institutes of Health Data Analysis and Coordination Center contract U01-HG004866. The support of Ioanna Bozionelou is especially acknowledged.
- Field D, Sansone SA, Collis A, Booth T, Dukes P, Gregurick SK, Kennedy K, Kolar P, Kolker E, Maxon M, et al. ’Omics Data Sharing. Science 2009; 326:234–236. http://dx.doi.org/10.1126/science.1180598
- Taylor CF, Field D, Sansone SA, Aerts J, Apweiler R, Ashburner M, Ball CA, Binz PA, Bogue M, Booth T, et al. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 2008; 26:889–896. http://dx.doi.org/10.1038/nbt.1411
- Pagani I, Liolios K, Jansson J, Chen IM, Smirnova T, Nosrat B, Markowitz VM, Kyrpides NC. The Genomes On Line Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2012; 40:D571–D579. http://dx.doi.org/10.1093/nar/gkr1100
- Field D, Garrity G, Gray T, Morrison N, Selengut J, Sterk P, Tatusova T, Thomson N, Allen MJ, Angiuoli SV, et al. The minimum information about a genome sequence (MIGS) specification. Nat Biotechnol 2008; 26:541–547. http://dx.doi.org/10.1038/nbt1360
- Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G, et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol 2011; 29:415–420. http://dx.doi.org/10.1038/nbt.1823
- Field D, Amaral-Zettler L, Cochrane G, Cole JR, Dawyndt P, Garrity GM, Gilbert J, Glöckner FO, Hirschman L, Karsch-Mizrachi I, et al. The Genomic Standards Consortium (GSC). PLoS Biol 2011; 9:e1001088. http://dx.doi.org/10.1371/journal.pbio.1001088
- Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, Ivanova NN, Kunin V, Goodwin L, Wu M, Tindall BJ, et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 2009; 462:1056–1060. http://dx.doi.org/10.1038/nature08656
- Peterson J, Garges S, Giovanni M, McInnes P, Wang L, Schloss JA, Bonazzi V, McEwen JE, Wetterstrand KA, Deal C, et al. The NIH Human Microbiome Project. Genome Res 2009; 19:2317–2323. http://dx.doi.org/10.1101/gr.096651.109
- Genomes On Line Database MCI Calculator. http://genomesonline.org/SetupMCICalculator.msi
- Garrity GM, Field D, Kyrpides NC. Standards in Genomic Sciences. Stand Genomic Sci 2009; 1:1–2. http://dx.doi.org/10.4056/sigs.34251
- Liolios K, Chen IM, Mavromatis K, Tavernarakis N, Hugenholtz P, Markowitz V, Kyrpides NC. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2010; 38:D346–D354. http://dx.doi.org/10.1093/nar/gkp848
- Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC. The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 2008; 36:D475–D479. http://dx.doi.org/10.1093/nar/gkm884
- GEBA-RNB. Available at: http://genome.jgi-psf.org/programs/bacteria-archaea/GEBA-RNB.jsf
- Kottmann R, Gray T, Murphy S, Kagan L, Kravitz S, Lombardof T, Field D, Glockner FO, Genomic Standards Consortium. A standard MIGS/MIMS compliant XML Schema: toward the development of the Genomic Contextual Data Markup Language (GCDML). OMICS 2008; 12:115–121. http://dx.doi.org/10.1089/omi.2008.0A10
- Rocca-Serra P, Brandizi M, Maguire E, Sklyar N, Taylor C, Begley K, Field D, Harris S, Hide W, Hofmann O, et al. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics 2010; 26:2354–2356. http://dx.doi.org/10.1093/bioinformatics/btq415
- Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, et al. Toward interoperable bioscience data. Nat Genet 2012; 44:121–126. http://dx.doi.org/10.1038/ng.1054
- Jones P, Côté RG, Cho SY, Klie S, Martens L, Quinn AF, Thorneycroft D, Hermjakob H. PRIDE: new developments and new datasets. Nucleic Acids Res 2008; 36:D878–D883. http://dx.doi.org/10.1093/nar/gkm1021
- Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St Pierre S, et al. Big data: The future of biocuration. Nature 2008; 455:47–50. http://dx.doi.org/10.1038/455047a