Genome sequence of the soil bacterium Corynebacterium callunae type strain DSM 20147T

Corynebacterium callunae DSM 20147T is a member of the genus Corynebacterium, which contains Gram-positive, non-spore-forming bacteria with a high G + C content. C. callunae was isolated during a screening for L-glutamic acid-producing bacteria and belongs to the aerobic, non-haemolytic corynebacteria. As this is a type strain in a subgroup of industrially relevant bacteria, for many of which complete genome sequences are also available, knowledge of its complete genome sequence might enable genome comparisons to identify production-relevant genetic loci. This project, describing the 2.84 Mbp long chromosome and the two plasmids, pCC1 (4.11 kbp) and pCC2 (85.02 kbp), with their 2,647 protein-coding and 82 RNA genes, will aid the Genomic Encyclopedia of Bacteria and Archaea project.


Introduction
Strain DSM 20147T is the type strain in a subgroup of industrially relevant bacteria. It was originally isolated during a screening for L-glutamic acid-producing microorganisms and was classified as a member of the genus Corynebacterium [1]. This genus comprises Gram-positive bacteria with a high G + C content. It currently contains 126 validly published members (species and subspecies), 4 of which are synonyms of other species within the genus, 27 of which were later reclassified as members of 7 other genera, and 1 of which was abolished in an erratum [2][3][4][5][6][7][8][9][10][11]. The remaining 93 were isolated from diverse environments such as soil, sea, or ripening cheese, but also from human clinical samples and animals.
Within this diverse genus, C. callunae has been found to be a producer of L-glutamic acid, like one of the most prominent representatives of the corynebacteria, C. glutamicum [1]. The biological context of this species is, unfortunately, essentially unknown, as it was first described in a patent application [1] that mentions neither the geographic location nor the exact habitat of the strain. Based on the name and the habitats of its close relatives C. glutamicum, C. deserti, and C. efficiens, the most likely habitat of C. callunae is soil around heather plants. While the biotechnological uses and capabilities of this subgroup within the genus Corynebacterium have been studied in detail, especially for C. glutamicum, the ability of all these strains to secrete considerable amounts of L-glutamic acid is still not well understood in the context of the environment.
C. callunae DSM 20147T harbors two cryptic plasmids: pCC1 (4,109 bp), which encodes a Rep protein that shows similarity to those of the corynebacterial plasmids pAG3 and pBL1, and pCC2 (85,023 bp), whose Rep protein has possible orthologs in many other corynebacteria. Aside from this, DSM 20147T is an alkaline-tolerant bacterium that grows well at pH 5.0-9.0 (optimum pH 6-8) [1]. Here we present a summary classification and a set of features for C. callunae DSM 20147T, together with a description of the genomic sequencing and annotation.

Classification and features
A representative genomic 16S rRNA sequence of C. callunae DSM 20147T was compared against the Ribosomal Database Project database [12], confirming the initial taxonomic classification. C. callunae shows the highest similarity to C. glutamicum and C. deserti (97% each). Figure 1 shows the phylogenetic neighborhood of C. callunae in a 16S rRNA-based tree. C. callunae forms a subgroup that also contains the species C. glutamicum ATCC 13032T, C. deserti GIMN1.010T, and C. efficiens YS-314T.
C. callunae DSM 20147T is a Gram-positive, rod-shaped bacterium that is 1-2 μm long and 0.4-0.6 μm wide (Figure 2). It is described as non-motile [1], which is consistent with a complete lack of genes associated with 'cell motility' (functional category N in the COGs table). Growth of DSM 20147T was shown at temperatures of 25-37°C, with optimal L-glutamic acid production at 25-35°C [1]. Carbon sources utilized by strain DSM 20147T include dextrose, fructose, galactose, inulin, inositol, maltose, mannitol, mannose, raffinose, salicin, sucrose and trehalose [1]. DSM 20147T tested positive for citrate, catalase and urease, but shows no nitrate reduction activity [1]. Details on the chemotaxonomy are largely missing, but can be inferred from the close relatives C. glutamicum, C. efficiens, and C. deserti [3]. Based on these relatives, meso-diaminopimelic acid is expected to be the major diamino acid of the cell wall, with arabinose and galactose as the main sugars (chemotype IV). Short-chain mycolic acids (32-36 carbon atoms) are also certain to be present, as all the necessary genes were found. The major cellular fatty acids are expected to be hexadecanoic acid (C16:0, 40-50%) and octadecenoic acid (C18:1 ω9c, 40-50%), with small amounts of octadecanoic acid (C18:0, ~1%) and possibly others. MK-9(H2) is thought to be the major menaquinone, although MK-8(H2) might also be present in significant amounts. Phosphatidylinositol, diphosphatidylglycerol, and phosphatidylglycerol, as well as their glycosides, are expected to be the main components of the polar lipids (Table 1).

Genome project history
Due to its phylogenetic position close to industrially relevant species of the genus Corynebacterium, C. callunae was selected for sequencing as part of a project to define production-relevant loci in corynebacteria. Although not part of the GEBA project, sequencing of the type strain will nonetheless aid the GEBA effort. The genome project is deposited in the Genomes OnLine Database [28] and the complete genome sequence is deposited in GenBank. Sequencing, finishing and annotation were performed at the CeBiTec. A summary of the project information is shown in Table 2.
Figure 1 Phylogenetic tree highlighting the position of C. callunae relative to type strains of other species within the genus Corynebacterium. Species with at least one publicly available genome sequence (not necessarily the type strain) are highlighted in bold face. The tree is based on sequences aligned by the RDP aligner and utilizes the Jukes-Cantor corrected distance model to construct a distance matrix based on alignment model positions without alignment inserts, using a minimum comparable position of 200. The tree is built with RDP Tree Builder, which utilizes Weighbor [13] with an alphabet size of 4 and length size of 1000. The building of the tree also involves a bootstrapping process repeated 100 times to generate a majority consensus tree [14]. Rhodococcus equi (X80614) was used as an outgroup.
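The Jukes-Cantor correction named in the Figure 1 caption can be illustrated with a minimal sketch. This is the textbook formula, d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites; it is not the RDP Tree Builder implementation, and the function names and gap handling below are assumptions made for illustration.

```python
import math


def observed_difference(seq_a: str, seq_b: str) -> float:
    """Proportion of mismatches over aligned columns where neither sequence has a gap.

    Assumption: both sequences come from the same alignment and '-' marks a gap.
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed difference p (alphabet size 4)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# A pair of sequences with 97% identity has p = 0.03; the corrected
# distance is slightly larger than p because of unobserved multiple hits.
d_97 = jukes_cantor_distance(0.03)
```

For p = 0.03 the corrected distance is about 0.0306, so at this level of divergence the correction changes the distance only marginally.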
Growth conditions and DNA isolation
C. callunae DSM 20147T was grown aerobically in CASO bouillon (Carl Roth GmbH, Karlsruhe, Germany) at 30°C. DNA was isolated from ~10^8 cells using the protocol described by Tauch et al. [29].

Genome sequencing and assembly
Two libraries were prepared: a WGS library using the Illumina-Compatible Nextera DNA Sample Prep Kit (Epicentre, WI, USA) and a 6 k mate-pair library using the Nextera Mate Pair Sample Preparation Kit, both according to the manufacturer's protocol. Both libraries were sequenced in a 2 × 250 bp paired-read run on the MiSeq platform, yielding 1,747,266 total reads and providing 99.51× coverage of the genome. Reads were assembled using the Newbler assembler v2.8 (Roche). The initial Newbler assembly consisted of 29 contigs in four scaffolds. Analysis of the four scaffolds revealed two to be extrachromosomal elements (plasmids pCC1 and pCC2), one to make up the chromosome, and the remaining one to contain the seven copies of the RRN operon.
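As a rough illustration of how a coverage figure relates to read count and genome size, the sketch below computes mean coverage as sequenced bases divided by genome length. The read count, the 250 bp maximum read length, the 99.51× value, and the total genome size of 2,928,683 bp are taken from the text; using the total genome size as the denominator is an assumption, as is everything about how the reported value was actually derived.

```python
READS = 1_747_266          # total reads reported for the MiSeq run
MAX_READ_LEN_BP = 250      # nominal read length of the 2 x 250 bp run
GENOME_SIZE_BP = 2_928_683 # chromosome plus both plasmids (assumed denominator)
REPORTED_COVERAGE = 99.51


def mean_coverage(total_bases: float, genome_size_bp: int) -> float:
    """Mean coverage = sequenced bases / genome length."""
    return total_bases / genome_size_bp


# Upper bound if every read contributed its full 250 bases.
nominal_coverage = mean_coverage(READS * MAX_READ_LEN_BP, GENOME_SIZE_BP)

# Mean usable bases per read that would reproduce the reported coverage.
implied_mean_read_len = REPORTED_COVERAGE * GENOME_SIZE_BP / READS

print(f"nominal upper bound: {nominal_coverage:.1f}x")
print(f"implied mean bases per read: {implied_mean_read_len:.1f}")
```

Under these assumptions the nominal upper bound is higher than the reported coverage, which would be expected if reads were trimmed or not all bases entered the assembly; the source does not state which.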

Genome properties
The genome (2,928,683 bp in total) includes one circular chromosome of 2,839,551 bp (52.39% G + C content) and two plasmids of 4,109 bp (54.42% G + C content) and 85,023 bp (54.38% G + C content; Figure 3). For the chromosome and plasmids, a total of 2,729 genes were predicted, 2,647 of which are protein-coding genes. Of these, 2,085 (76.40% of all genes) were assigned a putative function; the remaining ones were annotated as hypothetical proteins. 1,937 protein-coding genes belong to 314 paralogous families in this genome, corresponding to a gene content redundancy of 41.52%. The properties and
The total is based on either the size of the genome in base pairs or the total number of genes in the annotated genome.
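The replicon sizes and gene counts above can be cross-checked with a few lines of arithmetic. The sketch below assumes a chromosome length of 2,839,551 bp, the value consistent with the stated genome total and plasmid sizes; all variable names are illustrative.

```python
CHROMOSOME_BP = 2_839_551   # assumed: the value consistent with the stated total
PCC1_BP = 4_109
PCC2_BP = 85_023

TOTAL_GENES = 2_729
PROTEIN_CODING = 2_647
WITH_PUTATIVE_FUNCTION = 2_085

# Genome size is the sum of all three replicons.
total_bp = CHROMOSOME_BP + PCC1_BP + PCC2_BP

# Genes that are not protein coding are RNA genes.
rna_genes = TOTAL_GENES - PROTEIN_CODING

# The 76.40% figure is reproduced when the denominator is the total
# gene count rather than the protein-coding gene count.
pct_of_all_genes = 100 * WITH_PUTATIVE_FUNCTION / TOTAL_GENES
pct_of_protein_coding = 100 * WITH_PUTATIVE_FUNCTION / PROTEIN_CODING

print(total_bp, rna_genes, round(pct_of_all_genes, 2), round(pct_of_protein_coding, 2))
```

The sum reproduces the 2,928,683 bp genome size and the 82 RNA genes given in the abstract, and the functional-assignment percentage matches only when expressed relative to all 2,729 genes.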

Insights from the genome sequence
The complete genome sequence of C. callunae has already been mined for biotechnological purposes: it was used to determine the core genome of the C. glutamicum-C. efficiens-C. callunae subgroup and thereby to define the chassis genome for C. glutamicum [46]. Comparison of the three genomes using EDGAR [47] reveals that the core genome of this group comprises just 1,873 genes. The number of genes found only in C. callunae is also relatively small (366), especially when compared to the number of singletons found in the other two (926 and 773 in C. glutamicum and C. efficiens, respectively; Figure 4). As C. callunae was shown to produce L-glutamate in an amount comparable to C. glutamicum, it might be considered a potential candidate for future genome reduction efforts, since its chromosome is already considerably smaller than those of C. glutamicum and C. efficiens (2.84 Mbp versus 3.21 Mbp and 3.15 Mbp, respectively). This future approach is aided by the observation that many of the singletons are clustered in just three regions (I: H924_2045-H924_02230, 37 genes, 25.2 kbp; II: H924_03630-H924_03880, 50 genes, 52.5 kbp; III: H924_07070-H924_07380, 61 genes, 48.2 kbp), which together constitute ~4.4% of the genome size. As at least regions II and III are likely prophages, loss of these regions should be neutral or even beneficial, as demonstrated for C. glutamicum [48].
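The share of the genome taken up by the three singleton-rich regions can be checked with a short calculation. The region sizes and gene counts are taken from the text; the choice of denominators (chromosome length of 2,839.551 kbp versus the total genome of 2,928.683 kbp) is an assumption, since the text does not say which one was used.

```python
# Region label -> (number of genes, size in kbp), as listed in the text.
REGIONS = {
    "I": (37, 25.2),
    "II": (50, 52.5),
    "III": (61, 48.2),
}

CHROMOSOME_KBP = 2_839.551    # assumed denominator 1: chromosome only
TOTAL_GENOME_KBP = 2_928.683  # assumed denominator 2: chromosome plus plasmids

region_genes = sum(genes for genes, _ in REGIONS.values())
region_kbp = sum(size for _, size in REGIONS.values())

pct_of_chromosome = 100 * region_kbp / CHROMOSOME_KBP
pct_of_total = 100 * region_kbp / TOTAL_GENOME_KBP

print(f"{region_genes} genes, {region_kbp:.1f} kbp")
print(f"{pct_of_chromosome:.2f}% of chromosome, {pct_of_total:.2f}% of total genome")
```

The three regions add up to 148 genes and 125.9 kbp. Relative to the chromosome this is about 4.4%, in line with the stated figure; relative to the total genome including plasmids it is about 4.3%.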
One central prerequisite for future rational strain development is the genetic accessibility of the prospective strain. Knowledge of the complete genome sequence of C. callunae helps to overcome at least two of the main obstacles: the construction of plasmids usable as vectors and the removal of elements that hinder DNA transfer. For the former, knowledge of the sequences of the two plasmids pCC1 and pCC2 allows the use of plasmid-tagging approaches via a counter-selectable marker [49] to cure them, should conventional approaches like heat-shock curing fail. Once the strain is cured, the sequences of the plasmids help to identify the minimal gene set necessary for replication in order to build shuttle vectors, as demonstrated for pCC1 [50]. For the latter, the genome sequence helps to identify restriction-modification systems. A preliminary analysis revealed the presence of at least 4 such systems, one of which is located in the potential prophage region II. Removal of such systems has been shown to significantly enhance the stability of introduced foreign DNA, thus facilitating genetic engineering approaches [48].
Figure 4 Venn diagram depicting the number of genes shared between C. callunae, C. glutamicum, and C. efficiens. EDGAR [47] was used to determine the core genome shared among the three species and the singletons unique to each of them.
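The core genome and singleton counts shown in the Venn diagram of Figure 4 follow from simple set logic once genes have been assigned to ortholog groups. The sketch below shows only that partitioning step on toy data; it is not EDGAR's method (the orthology assignment itself is the hard part and is not reproduced here), and the group identifiers and function name are hypothetical.

```python
from typing import Dict, Set, Tuple


def venn_partition(genomes: Dict[str, Set[str]]) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Split ortholog groups into the core (present in all genomes)
    and singletons (present in exactly one genome)."""
    core = set.intersection(*genomes.values())
    singletons = {}
    for name, groups in genomes.items():
        # Union of the groups found in every other genome.
        others = set().union(*(g for n, g in genomes.items() if n != name))
        singletons[name] = groups - others
    return core, singletons


# Toy ortholog-group assignments (hypothetical identifiers, not real data).
toy = {
    "C. callunae": {"og1", "og2", "og3", "og6"},
    "C. glutamicum": {"og1", "og2", "og4", "og5"},
    "C. efficiens": {"og1", "og3", "og4", "og7"},
}

core, singletons = venn_partition(toy)
print(sorted(core))
print({name: sorted(groups) for name, groups in singletons.items()})
```

In this toy input only og1 is shared by all three genomes, and each genome keeps exactly one group of its own; groups shared by exactly two genomes fall into the pairwise overlaps of the diagram and are neither core nor singletons.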

Conclusion
The complete genome sequence of C. callunae is the third genome sequence of the C. glutamicum-C. deserti-C. efficiens-C. callunae subgroup of L-glutamic acid-producing corynebacteria within the genus Corynebacterium.
Knowledge of the complete genome sequence has already contributed to identifying the core genome of this group. With a size of 2.84 Mbp and a total of 2,647 protein-coding genes, the genome of C. callunae is by far the smallest within this group. Therefore, this bacterium might be an ideal choice for future development of a platform strain, as the otherwise high degree of similarity of its genome content to the well-studied C. glutamicum would allow an easy transfer of knowledge to the new host. Furthermore, knowledge of the complete genome sequence also facilitates the identification of possible targets to improve the accessibility to genetic engineering (like restriction-modification systems) and to enhance genome stability (like phages and transposases).