A Genomic Encyclopedia of the Root Nodule Bacteria: assessing genetic diversity through a systematic biogeographic survey

Root nodule bacteria (RNB) are free-living soil bacteria, belonging to diverse genera within the Alphaproteobacteria and Betaproteobacteria, that have the capacity to form nitrogen-fixing symbioses with legumes. The symbiosis is specific and is governed by signaling molecules produced by both host and bacteria. Sequencing of several model RNB genomes has provided valuable insights into the genetic basis of symbiosis. However, the small number of sequenced RNB genomes available does not currently reflect the phylogenetic diversity of RNB, or the variety of mechanisms that lead to symbiosis in different legume hosts. This prevents a broad understanding of symbiotic interactions and the factors that govern the biogeography of host-microbe symbioses. Here, we outline a proposal to expand the number of sequenced RNB strains, which aims to capture this phylogenetic and biogeographic diversity. Through the Vavilov centers of diversity (Proposal ID: 231) and GEBA-RNB (Proposal ID: 882) projects we will sequence 107 RNB strains, isolated from diverse legume hosts in various geographic locations around the world. The nominated strains belong to nine of the 16 currently validly described RNB genera. They include 13 type strains, as well as elite inoculant strains of high commercial importance. These projects will strongly support systematic sequence-based studies of RNB and contribute to our understanding of the effects of biogeography on the evolution of different species of RNB, as well as the mechanisms that determine the specificity and effectiveness of nodulation and symbiotic nitrogen fixation by RNB with diverse legume hosts.


Introduction
The importance of the research
Legumes, with around 20,000 species and over 700 genera, are the third largest flowering plant family and are found on all continents (except Antarctica). They are major components of most of the world's vegetation types and have important roles in agriculture as both pastures and pulses [1,2]. Most legumes are able to form dinitrogen-fixing symbioses with soil bacteria, collectively known as root nodule bacteria (RNB) or rhizobia. RNB infection elicits the organogenesis of a unique structure, the nodule, which forms on the root (or less commonly, the stem) of the host plant. The mode of infection and the morphology and structure of the resulting nodule vary within the different legume tribes and have phylogenetic significance [3,4]. Following infection, RNB migrate to the nodule primordium, are endocytosed within the host cell and differentiate into N2-fixing bacteroids.
The availability of utilizable nitrogen is the critical determinant for plant productivity. Legume-RNB symbiotic nitrogen fixation (SNF) is a vital source of N in both natural and agricultural ecosystems. Based on different estimates, the total annual input of biologically fixed N ranges from 139 to 175 million tons, 35 to 44 million tons of which is attributed to RNB-legume associations growing on arable land, with those in permanent pastures accounting for another 45 million tons of N. N2-fixation by legume pastures and crops provides 65% of the N currently utilized in agricultural production [5,6]. The economic value of legumes on the farm is estimated at $30 billion annually, including $22 billion in the value of legume crops and $8 billion in the value of N2-fixation. Increasing the efficiency of the legume-RNB symbiosis has been projected to have an annual US benefit of $1,067 million, while transferring SNF technology to cereals and totally eliminating chemical N fertilization of the major crops would have an annual US benefit of $4,484 million [7].
Incorporating SNF in agricultural systems also reduces energy consumption, compared with systems that rely on chemical N-input. Every ton of manufactured N-fertilizer requires 873 m3 of natural gas and ultimately releases ~2 tons of CO2 into the air [8]. Furthermore, >50% of US N-fertilizer is imported, which further increases the energy cost of chemical N fertilizer. SNF has the potential to reduce the application of manufactured N-fertilizer by ~160 million tons pa, equating to a reduction of 270 million tons of coal or equivalent fossil fuel consumed in the production process. As well as energy cost savings, this reduces CO2 greenhouse gas emissions. Legume- and forage-based rotations also reduce CO2 emission by maintaining high levels of soil organic matter, thus enhancing both soil fertility and carbon storage in soil [9]. There are additional significant environmental costs to the use of N fertilizer: agriculturally based increases in reactive N are substantial and widespread, and lead to losses of biological diversity, compromised air and water quality, and threats to human health [10]. Microbial nitrification and denitrification of soil N are major contributors to emissions of the potent greenhouse gas and air pollutant, nitrous oxide, from agricultural soils [5]. Emission of N2O is in direct proportion to the amount of fertilizer applied. In addition, fertilizer N not recovered by the crop rapidly enters surface and groundwater pools, leading to drinking water contamination, and eutrophication and hypoxia in aquatic ecosystems [8].
The global increase in population is predicted to double demand for agricultural production by 2050 [11]. To meet this demand without incurring the high and unsustainable costs associated with the increased use of chemical N-fertilizer, the N2-fixing potential of the legume-RNB symbiosis must be maximized. Achieving this target will require a greater understanding of the molecular mechanisms that govern specificity and effectiveness of N2-fixation in diverse RNB-legume symbioses.
Genome sequencing of RNB strains has revolutionized our understanding of the bacterial functional genomics that underpin symbiotic interactions and N2-fixation. However, previous RNB sequencing projects have not reflected the phylogenetic and biogeographic diversity of RNB or the variety of mechanisms that lead to symbiosis in different legume hosts. As a result, the insights gained into SNF have been limited to a small group of symbioses and there has not yet been a systematic effort to remedy this narrow focus.
Here, we outline proposals for two sequencing projects to be undertaken at the DoE Joint Genome Institute that aim to expand the number of sequenced RNB strains in order to capture this phylogenetic and biogeographic diversity. Through the Vavilov centers of diversity (Proposal ID: 231) and GEBA-RNB (Proposal ID: 882) projects we will sequence 107 RNB strains isolated from diverse legume hosts in various geographic locations in over 30 countries around the world. The sequenced strains belong to nine of the 16 validly described RNB genera and have been isolated from 69 different legume species, representing 39 taxonomically diverse genera, growing in diverse biomes. These proposals will provide unprecedented perspectives on the evolution, ecology and biogeography of legume-RNB symbioses, as no rhizobial sequencing project so far has attempted to relate extensive genomic characterization of RNB strains to comprehensive metadata and thereby identify correlations between the genomes of rhizobial strains, their symbiotic associations with specific legume hosts, and the environmental parameters of their habitats.

Selection of target organisms
The proposed RNB genome sequencing projects were designed with two different but complementary objectives in mind. In the "Analysis of the clover, pea/bean and lupin microsymbiont genetic pool by studying isolates from distinct Vavilov centres of diversity" project (Proposal ID: 231), the nominated RNB included clover, pea/Vicia and lupin-nodulating strains, chosen because their hosts are of high commercial importance [12]. The legumes originate from six distinct Vavilov centers of diversity: the Mediterranean basin, high-altitude temperate Europe, North America, South America, highland central Africa and southern Africa [13]. The rhizobial associations in these centers have phenological and geographic specificity for nodulation and nitrogen fixation [14,15]. A detailed analysis of strains representing the six centers of diversity will enable the investigation of the evolution and biodiversity of symbioses from a geographic and phenological viewpoint.
The GEBA-RNB project falls under the umbrella of the Genomic Encyclopedia of Bacteria and Archaea (GEBA) family of projects. The original GEBA project [16] sequenced and analysed the genomes of Bacteria and Archaea species selected to maximize phylogenetic coverage. RNB are polyphyletic, belonging to diverse genera of the Alphaproteobacteria and Betaproteobacteria; currently, 16 genera and over 100 species have been validly described (ICSP Subcommittee on the taxonomy of Rhizobium and Agrobacterium). Existing RNB sequencing programs have tended to focus on particular organisms or on RNB isolated from specific hosts. The GEBA-RNB project was therefore designed as a systematic genome sequencing project to capture RNB phylogenetic and symbiotic diversity. RNB strains were selected on the basis of (i) phylogenetic diversity, (ii) legume host diversity, (iii) economic importance and (iv) biogeographic origin. Strains were also required to have comprehensive metadata records and well characterized phenotypes, in particular relating to symbiotic effectiveness. In addition, the phylogenetic divergence of strains from previously sequenced isolates was taken into account.
The map in Figure 1 shows collection sites of strains selected for sequencing. Table 1 lists the strains nominated for sequencing, their country of origin and original host. Extensive metadata is available for all strains and was used to guide strain selection; proposed strains display a wide range of host specificities (from strictly specific to highly promiscuous) and SNF efficiency. The RNB were collected from sites that spanned a broad range of soils and climates (e.g. neutral, acidic or alkaline soil, tropical, arid or temperate climate). These strains differ in their physiological attributes (ability to recycle hydrogen, rhizobitoxine production, salt and acid tolerance, heavy metal resistance, methylotrophy) and some of them display unusual genetic features (unique genotype based on multilocus sequence typing, nodulation phenotype, atypical organization of symbiosis islands or identical symbiosis islands in different genetic backgrounds).

Organism growth and nucleic acid isolation
The international consortium, which consists of more than 34 experts in the field from 15 different countries, together with culture collection centers in Australia and Belgium, will grow the 107 different RNB strains. Quality control will be performed on all samples before the DNA is shipped to the JGI. All samples from consortium members based in the US will be sent to Dr Peter van Berkum in Washington DC, and all other samples will be quality controlled at the Centre for Rhizobium Studies, Murdoch University in Australia, before shipping to the JGI. Scientists at the Centre for Rhizobium Studies have extensive experience in producing high quality DNA, a skill acquired through a long collaboration with the JGI, as evidenced by collaborative publications [17][18][19][20][21][22].

Sequencing approach
Most RNB strains are characterized by multipartite genomes that range in size from 5 to 10 Mb, with an average G + C content of 60-65%. We propose drafting the 107 RNB genomes using Illumina, PacBio or Roche sequencing platforms. All genomes will be completed to at least the stage of high quality draft. As most RNB strains carry their symbiotic genes on plasmids or within mobile islands that can be integrated in different sites on the chromosome, accurate scaffolding information is important for separation of chromosomal and plasmid-borne genes of interest.
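The G + C percentage quoted above is a simple base-composition statistic that can be computed for each replicon and for the genome as a whole. The following is a minimal sketch, not part of the JGI pipeline; the function names and the toy sequences are illustrative assumptions, and ambiguous bases such as N are simply ignored.

```python
def gc_percent(sequence: str) -> float:
    """Return G+C content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


def genome_gc_percent(replicons: dict[str, str]) -> float:
    """Length-weighted G+C percentage over all replicons of a multipartite genome."""
    # Concatenating the replicons weights each one by its number of bases.
    return gc_percent("".join(replicons.values()))


# Toy example (hypothetical sequences, far shorter than a real replicon).
toy_genome = {
    "chromosome": "GCGCGCATGC",
    "plasmid": "ATGCGCNNGG",
}
per_replicon = {name: gc_percent(seq) for name, seq in toy_genome.items()}
whole_genome = genome_gc_percent(toy_genome)
print(per_replicon, round(whole_genome, 2))
```

Reporting the value per replicon as well as for the whole genome is useful for multipartite genomes, because plasmids and symbiosis islands can differ in base composition from the chromosome.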

Annotation and comparative analysis
The microbial genome annotation pipeline at the JGI has been scaled to handle hundreds of microbial genomes per month [23][24][25].

Publication of analyzed genomes
As many genomes as possible that are of publication quality will be published in Standards in Genomic Sciences [26,27].

The scientific questions we expect to answer
The genome sequences of RNB generated in this project will be used to identify the core genomes of different RNB species, as well as dispensable parts of species pangenomes and their distribution between strains from different locales and/or plant hosts. Symbiotically relevant sets of genes such as those participating in adhesion, biosynthesis of nodulation factors, SNF, energy metabolism and exopolysaccharide biosynthesis will be characterized in detail. This will include the genes' evolutionary histories and genome dynamics, such as localization on plasmids or within genomic islands and relation to mobile genetic elements. Statistical analyses will be performed in order to identify genes and gene sets that correlate with host specificity, nodulation and SNF efficiency and with various environmental metadata such as edaphic and climatic constraints. Within RNB strains of the same species, but from different environmental sites and/or legume hosts, genes that are under selective pressure will be identified and characterized by analysis of synonymous and nonsynonymous substitution rates. These analyses will be informed by the comprehensive metadata that are available for each strain, including data on the strains' collection site, host specificity, nodulation and SNF efficiency. Considerable efforts have been devoted to sourcing strains from different geographical locations in order to improve legume productivity across a range of environments, and the project takes advantage of the particularly well characterized RNB that have been sourced from several culture collections around the globe. Biogeographic considerations are particularly relevant to the RNB as their survival and persistence as soil saprophytes is dictated by environmental and edaphic constraints such as temperature, salinity, pH, and soil moisture and clay content [28]. 
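The partition of a species pangenome into core and dispensable parts, as described above, can be illustrated with set operations on gene presence/absence data. The sketch below is only an illustration under simplifying assumptions: genes are assumed to have already been clustered into families, the strain names and gene assignments are hypothetical, and "core" is taken strictly as present in every strain.

```python
def core_and_accessory(strain_genes: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Split the pangenome into core genes (in all strains) and accessory genes."""
    if not strain_genes:
        raise ValueError("at least one strain is required")
    gene_sets = list(strain_genes.values())
    pangenome = set().union(*gene_sets)       # every gene family seen in any strain
    core = set.intersection(*gene_sets)       # gene families shared by all strains
    return core, pangenome - core


def accessory_distribution(
    strain_genes: dict[str, set[str]], accessory: set[str]
) -> dict[str, list[str]]:
    """Map each accessory gene family to the sorted list of strains carrying it."""
    return {
        gene: sorted(s for s, genes in strain_genes.items() if gene in genes)
        for gene in sorted(accessory)
    }


# Hypothetical presence/absence data for three strains of one species.
strains = {
    "strain_A": {"recA", "nifH", "nodA", "hupL"},
    "strain_B": {"recA", "nifH", "nodA"},
    "strain_C": {"recA", "nifH", "rtxA"},
}
core, accessory = core_and_accessory(strains)
distribution = accessory_distribution(strains, accessory)
print(sorted(core), distribution)
```

The resulting distribution table is the kind of input that could then be compared against strain metadata such as host or collection site, although the statistical tests used for that comparison are not specified here.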
This project will support systematic sequence-based studies of the RNB and contribute to our understanding of the biogeographic effects on the evolution of different rhizobial species, as well as the mechanisms determining the specificity and efficiency of nodulation and N2-fixation by RNB.
The relevance of the project to problems of societal importance
Symbiotic nitrogen fixation by RNB is a significant asset for world agricultural productivity, farming economy and environmental sustainability. Large-scale agricultural use of highly effective N2-fixing legumes will be critical for sustainable food production for livestock and humans. Increased incorporation of SNF into agricultural systems reduces the requirement for inputs of economically and environmentally costly nitrogenous fertilizer. Currently, ~1-2% of the world's annual energy supply is used in the Haber-Bosch process to manufacture chemical N, at a cost of $US 6.8 billion pa. In addition, SNF significantly reduces greenhouse gas emissions compared to intensive agriculture practice, which requires large inputs of chemical N. SNF also benefits the environment by helping to reduce dry-land salinity, increase soil fertility, promote carbon sequestration and prevent eutrophication of waterways. Recent publications have also emphasized the importance of providing renewable sources of biofuels [29,30], and a detailed understanding of endosymbionts and SNF will aid this quest. Pongamia pinnata, for example, is a leguminous tree that is important for the biofuel industry and is nodulated by a Bradyrhizobium strain [31] that has been included for sequencing in this proposal.
Apart from their economic importance, RNB also represent a uniquely tractable biological system that can offer insights into the shared genetic mechanisms between fungal and bacterial root endosymbioses [32] and between intracellular pathogens and endocytosed RNB microsymbionts. The latter have been shown to share similar host-adapted strategies in their infection processes and adaptation to growth within the cytoplasm of a eukaryotic host [33,34]. An understanding of these mechanisms will facilitate the quest to extend N2-fixation to cereals, a goal which is being vigorously pursued and which has been described as essential for future sustainable food production [35].

Conclusion
The legume-RNB symbiosis is one of the best-studied associations between microbes and eukaryotes, due to the economic and ecological importance of symbiotic nitrogen fixation. Targeting RNB for sequencing on the basis of, firstly, phylogenetic diversity and, secondly, isolation from taxonomically distinct host legumes growing in diverse biomes offers significant benefits. Previous RNB sequencing projects have tended to focus on a narrow range of model organisms. By setting a goal of maximizing the phylogenetic diversity of sequenced RNB strains, these projects, in keeping with the other members of the GEBA family of projects, aid the development of a phylogenetically balanced genomic representation of the microbial tree of life and allow for the large-scale discovery of novel rhizobial genes and functions. The chosen RNB strains are available to the global research community and are stored in culture collections that are dedicated to long-term storage and distribution. A wealth of experimental data and metadata is available for each strain, which will inform analyses to identify genes and gene sets that correlate with rhizobial adaptation to diverse biomes, to the nodule environments found in taxonomically distinct legume hosts and to the effectiveness of nitrogen fixation within these nodules. Moreover, the legume-RNB symbiosis is an excellent model system to study plant-bacterial associations, including symbiotic signaling, cell differentiation and the mechanisms of endocytosis. The sequenced RNB genomes will not only provide a greater understanding of legume-RNB associations, but can also be used to gain insights into the evolution of N2-fixing symbioses and microbe-eukaryote interactions.