Non-contiguous finished genome sequence and description of Clostridium saudii sp. nov

Clostridium saudii strain JCC T sp. nov. is the type strain of C. saudii sp. nov., a new species within the genus Clostridium. This strain, whose genome is described here, was isolated from a fecal sample collected from an obese 24-year-old man (body mass index 52 kg/m²) living in Jeddah, Saudi Arabia. C. saudii is a Gram-positive, anaerobic bacillus. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 3,653,762 bp long genome contains 3,452 protein-coding and 53 RNA genes, including 4 rRNA genes.


Introduction
Clostridium saudii strain JCC T (=CSUR P697 = DSM 27835) is the type strain of C. saudii sp. nov. This bacterium is a Gram-positive, anaerobic, spore-forming, indole-negative bacillus that was isolated from the stool sample of an obese 24-year-old Saudi individual as part of a culturomics study, as previously reported [1][2][3].
The current prokaryote species classification method, known as polyphasic taxonomy, is based on a combination of genomic and phenotypic properties [4]. The usual parameters used to delineate a bacterial species include 16S rDNA sequence identity and phylogeny [2,3], genomic G + C content diversity and DNA-DNA hybridization (DDH) [4,5]. Nevertheless, these parameters have limitations, notably because the cutoff values vary dramatically between species and genera [6]. The introduction of high-throughput sequencing techniques has made genomic data available for many bacterial species [7]. To date, more than 4,000 bacterial genomes have been published and approximately 15,000 genome projects are anticipated to be completed in the near future [5]. We recently proposed a new method (taxono-genomics), which integrates genomic information in the taxonomic framework, combining phenotypic characteristics, including MALDI-TOF MS spectra, and genomic analysis.
The genus Clostridium was first created in 1880 [39] and consists of obligate anaerobic rod-shaped bacilli able to produce endospores [39]. To date, more than 200 species have been described (http://www.bacterio.cict.fr/c/clostridium.html). Members of the genus Clostridium are mostly environmental bacteria or associated with the commensal digestive flora of mammals. However, C. botulinum, C. difficile and C. tetani are major human pathogens [39].

Classification and features
A stool sample was collected from an obese 24-year-old male Saudi volunteer patient from Jeddah. The patient gave informed and signed consent, and the agreement of the local Ethical Committee of the King Abdulaziz University, King Fahd Medical Research Centre, Saudi Arabia, and of the local ethics committee of the IFR48 (Marseille, France) was obtained under agreement numbers 014-CEGMR-2-ETH-P and 09-022, respectively. The fecal specimen was preserved at −80°C after collection and sent to Marseille. C. saudii strain JCC T (Table 1) was isolated in July 2013 by anaerobic cultivation on 5% sheep blood-enriched Columbia agar (BioMerieux, Marcy l'Etoile, France) after a 5-day preincubation in a blood culture bottle with rumen fluid. This strain exhibited a 98.3% 16S rRNA gene sequence similarity with Clostridium disporicum (Y18176) (Figure 1). This value was lower than the 98.7% 16S rRNA gene sequence threshold recommended by Stackebrandt and Ebers to delineate a new species without carrying out DNA-DNA hybridization [3] and was in the 78.4 to 98.9% range of 16S rRNA identity values observed among 41 Clostridium species with validly published names [40].
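As a minimal illustration of the threshold comparison above, the following Python sketch computes percent identity from a pair of pre-aligned 16S rRNA gene sequences and checks it against the 98.7% cutoff. The function names and the gap-handling rule (columns with a gap in either sequence are skipped) are assumptions made for illustration; they are not taken from the pipeline used in this study.

```python
# Threshold cited in the text (Stackebrandt and Ebers), in percent.
SPECIES_THRESHOLD_PCT = 98.7


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Assumption: both inputs come from the same pairwise alignment, so they
    have equal length and use '-' as the gap character.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def below_species_threshold(similarity_pct: float,
                            threshold_pct: float = SPECIES_THRESHOLD_PCT) -> bool:
    """True when the similarity falls below the species-delineation cutoff."""
    return similarity_pct < threshold_pct


if __name__ == "__main__":
    # 98.3% is the similarity to C. disporicum reported in the text.
    print(below_species_threshold(98.3))
```

With the reported 98.3% similarity, `below_species_threshold` returns `True`, which matches the argument made in the text that DNA-DNA hybridization is not required.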
For the growth of C. saudii we tested four temperatures (25, 30, 37 and 45°C). Growth occurred between 25 and 37°C, with optimal growth at 37°C, 24 hours after inoculation; no growth occurred at 45°C. Colonies on 5% sheep blood-enriched Columbia agar (BioMerieux) were translucent light grey and about 0.2 to 0.3 mm in diameter. Growth of the strain was tested on the same agar under anaerobic and microaerophilic conditions using GENbag anaer and GENbag microaer systems, respectively (BioMerieux), and in aerobic conditions, with or without 5% CO2. Growth was observed only under anaerobic conditions. Gram staining showed Gram-positive rods able to form spores (Figure 1) and the motility test was positive. Cells grown on agar exhibit a mean diameter of 1 μm and a mean length of 1.22 μm in electron microscopy (Figure 2, Figure 3).
C. saudii did not have catalase or oxidase activity (Table 2). On an API Rapid ID 32A strip (BioMerieux), C. saudii presented positive reactions for α-galactosidase, …

Classification and evidence codes:
Phylum: Firmicutes, TAS [42][43][44]
Class: Clostridia, TAS [45,46]
Order: Clostridiales, TAS [47,48]
Family: Clostridiaceae, TAS [47,49]
Genus: Clostridium, IDA [47,50,51]
Species: …

Evidence codes - IDA: Inferred from Direct Assay; TAS: Traceable Author Statement (i.e., a direct report exists in the literature); NAS: Non-traceable Author Statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [52]. If the evidence is IDA, then the property was directly observed for a live isolate by one of the authors or an expert mentioned in the acknowledgements.
Matrix-assisted laser-desorption/ionization time-of-flight (MALDI-TOF) MS protein analysis was carried out as previously described [53]. Briefly, a pipette tip was used to pick one isolated bacterial colony from a culture agar plate and spread it as a thin film on an MTP 384 MALDI-TOF target plate (Bruker Daltonics, Leipzig, Germany). Twelve distinct deposits from twelve isolated colonies were made for C. saudii JCC T. Each smear was overlaid with 2 μL of matrix solution (saturated solution of alpha-cyano-4-hydroxycinnamic acid) in 50% acetonitrile, 2.5% trifluoroacetic acid, and allowed to dry for 5 minutes. Measurements were performed with a Microflex spectrometer (Bruker). Spectra were recorded in the positive linear mode for the mass range of 2,000 to 20,000 Da (parameter settings: ion source 1 (IS1), 20 kV; IS2, 18.5 kV; lens, 7 kV). A spectrum was obtained after 675 shots with variable laser power. The time of acquisition was between 30 seconds and 1 minute per spot. The twelve JCC T spectra were imported into the MALDI BioTyper software (version 2.0, Bruker) and analyzed by standard pattern matching (with default parameter settings) against the main spectra of 3,769 bacteria, including 228 spectra from 96 Clostridium species. The method of identification included the m/z from 3,000 to 15,000 Da. For every spectrum, a maximum of 100 peaks were compared with spectra in the database. The resulting score determined whether the tested species could be identified: a score ≥ 2 with a validly published species enabled identification at the species level, a score ≥ 1.7 but < 2 enabled identification at the genus level, and a score < 1.7 did not enable any identification. No significant MALDI-TOF score was obtained for strain JCC T against the Bruker database, suggesting that our isolate was not a member of a known species. We added the spectrum from strain JCC T to our database (Figure 4).
Finally, the gel view showed the spectral differences with other members of the genus Clostridium (Figure 5).
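The score interpretation rules stated above for MALDI-TOF identification can be written as a small decision function. This is only a sketch of the thresholds quoted in the text; the function name and the returned labels are hypothetical and do not correspond to any BioTyper software interface.

```python
def interpret_maldi_score(score: float) -> str:
    """Map a pattern-matching score to an identification level.

    Thresholds follow the text: >= 2 species level, >= 1.7 and < 2 genus
    level, < 1.7 no identification. The labels are illustrative only.
    """
    if score >= 2.0:
        return "species"
    if score >= 1.7:
        return "genus"
    return "none"


if __name__ == "__main__":
    for s in (2.3, 1.85, 1.2):
        print(s, interpret_maldi_score(s))
```

Under these rules, a strain for which every database match scores below 1.7, as reported for strain JCC T, falls in the "none" category.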

Genome project history
The organism was selected for sequencing on the basis of its phylogenetic position and 16S rRNA similarity to members of the genus Clostridium, and is part of a study of the human digestive flora aiming at isolating all bacterial species in human feces [1]. It was the 101st genome of a Clostridium species and the first genome of C. saudii sp. nov. The GenBank accession number is HG726039 and the assembly consists of 104 contigs. Table 2 shows the project information and its association with MIGS version 2.0 compliance [54].

Growth conditions and DNA isolation
C. saudii sp. nov., strain JCC T (=CSUR P697 = DSM 27835) was grown anaerobically on 5% sheep blood-enriched Columbia agar (BioMerieux) at 37°C. Bacteria grown on three Petri dishes were harvested and resuspended in 4×100 μL of TE buffer. Then, 200 μL of this suspension was diluted in 1 mL TE buffer for lysis treatment that included a 30-minute incubation with 2.5 μg/μL lysozyme at 37°C, followed by an overnight incubation with 20 μg/μL proteinase K at 37°C. Extracted DNA was then purified using 3 successive phenol-chloroform extractions and ethanol precipitation at −20°C overnight. After centrifugation, the DNA was resuspended in 160 μL TE buffer. The yield and concentration were measured with the Quant-it Picogreen kit (Invitrogen) on the Genios-Tecan fluorometer.

Genome sequencing and assembly
Genomic DNA of C. saudii was sequenced on a MiSeq instrument (Illumina Inc, San Diego, CA, USA) with 2 applications: paired end and mate pair. The paired-end and mate-pair libraries were barcoded in order to be mixed with 14 other genomic projects prepared with the Nextera XT DNA sample prep kit (Illumina) and 11 other projects prepared with the Nextera Mate Pair sample prep kit (Illumina), respectively. The gDNA was quantified by a Qubit assay with the high sensitivity kit (Life Technologies, Carlsbad, CA, USA) at 36.6 ng/μL and dilution was performed such that 1 ng of each genome was used to prepare the paired-end library. The "tagmentation" step fragmented and tagged the DNA with a mean size of 1.5 kb. Then limited-cycle PCR amplification (12 cycles) completed the tag adapters and introduced dual-index barcodes. After purification on AMPure XP beads (Beckman Coulter Inc, Fullerton, CA, USA), the libraries were normalized on specific beads according to the Nextera XT protocol (Illumina). Normalized libraries were pooled into a single library for sequencing on the MiSeq. The pooled single-strand library was loaded onto the reagent cartridge and then onto the instrument along with the flow cell. Automated cluster generation and paired-end sequencing with dual index reads were performed in a single 39-hour run with a 2×250 bp read length. Total information of 5.3 Gb was obtained from a 574 K/mm² cluster density with 95.4% (11,188,000) of the clusters passing quality control filters. Within this run, the index representation for C. saudii was determined to be 6.9%. The 710,425 reads were filtered according to the read qualities.

Production of: Alkaline phosphatase - Na Na Na -+ Na

The mate-pair library was prepared with 1 μg of genomic DNA using the Nextera mate pair Illumina guide. The genomic DNA sample was simultaneously fragmented and tagged with a mate-pair junction adapter. The profile of the fragmentation was validated on an Agilent 2100 BioAnalyzer (Agilent Technologies Inc, Santa Clara, CA, USA) with a DNA 7500 labchip. The DNA fragments ranged in size from 1.4 kb up to 10 kb with a mean size of 5 kb. No size selection was performed and 600 ng of tagmented fragments were circularized. The circularized DNA was mechanically sheared to small fragments with a mean size of 625 bp on the Covaris device S2 in microtubes (Covaris, Woburn, MA, USA). The library profile was visualized on a High Sensitivity Bioanalyzer LabChip (Agilent Technologies Inc, Santa Clara, CA, USA). The libraries were normalized at 2 nM and pooled. After a denaturation step and dilution at 10 pM, the pool of libraries was loaded onto the reagent cartridge and then onto the instrument along with the flow cell. Automated cluster generation and sequencing were performed in a single 42-hour run with a 2×250 bp read length.
Total information of 3.2 Gb was obtained from a 690 K/mm² cluster density with 95.4% (13,264,000) of the clusters passing quality control filters. Within this run, the index representation for C. saudii was determined to be 8.2%. The 1,037,710 reads were filtered according to the read qualities.

Genome annotation
Open Reading Frames (ORFs) were predicted using Prodigal [55] with default parameters. However, the predicted ORFs were excluded if they spanned a sequencing gap region. The predicted bacterial protein sequences were searched against the GenBank [56] and Clusters of Orthologous Groups (COG) databases using BLASTP. The tRNAs and rRNAs were predicted using the tRNAscan-SE [57] and RNAmmer [58] tools, respectively. Lipoprotein signal peptides and numbers of transmembrane helices were predicted using SignalP [59] and TMHMM [60], respectively. Mobile genetic elements were predicted using PHAST [61] and RAST [62]. ORFans were identified if their BLASTP E-value was lower than 1e-03 for alignment length greater than 80 amino acids. If alignment lengths were smaller than 80 amino acids, we used an E-value of 1e-05. Such parameter thresholds have already been used in previous works to define ORFans. Artemis [63] and DNA Plotter [64] were used for data management and visualization of genomic features, respectively. The Mauve alignment tool (version 2.3.1) was used for multiple genomic sequence alignment [65]. To estimate the Average Genome Identity of Orthologous Sequences (AGIOS) [7] at the genome level between C. saudii and another 9 members of the Clostridium genus, orthologous proteins were detected using Proteinortho [66], and we compared genomes two by two and determined the mean percentage of nucleotide sequence identity among orthologous ORFs using BLASTn.

Figure 5 Gel view comparing spectra from Clostridium saudii strain JCC T, Clostridium tertium, Clostridium sartagoforme, Clostridium baratii, Clostridium beijerinckii, Clostridium botulinum, Clostridium carboxidivorans and Clostridium paraputrificum. The gel view presents the raw spectra of loaded spectrum files as a pseudo-electrophoretic gel. The x-axis records the m/z value. The left y-axis displays the running spectrum number originating from subsequent spectra loading. The peak intensity is expressed by a grey scale scheme code. The grey scale bar on the right y-axis indicates the relation between the shade of grey a peak is displayed with and the peak intensity in arbitrary units.

Species are listed on the left. The numbers of orthologous proteins shared between genomes (above diagonal), average percentage similarity of nucleotides corresponding to orthologous proteins shared between genomes (below diagonal) and the numbers of proteins per genome (bold). C. sma = C. saudii, C. bej = C. beijerinckii, C. bot = C. botulinum, C. car = C. carboxidivorans, C. cel = C. celatum, C. dak = C. dakarense, C. dif = C. difficile, C. per = C. perfringens, C. par = C. paraputrificum, C. sen = C. senegalense.
A total of 2,144 genes (61.10%) were assigned a putative function. One hundred and twenty-eight genes were identified as ORFans (3.65%) and the remaining genes were annotated as hypothetical proteins. The properties and statistics of the genome are summarized in Tables 4 and 5.
The distribution of genes into COGs functional categories is presented in Table 6.

Genome comparison of C. saudii with 9 other Clostridium genomes
We compared the genome of C. saudii strain JCC T with those of C. beijerinckii strain NCIMB 8052, C. botulinum …

Figure 6 Graphical circular map of the chromosome. From outside to the center: Genes on the forward strand (colored by COG categories), genes on the reverse strand (colored by COG categories), RNA genes (tRNAs green, rRNAs red), GC content, and GC skew.

The total is based on either the size of the genome in base pairs or the total number of protein coding genes in the annotated genome.

The distribution of genes into COG categories was similar in all 10 compared genomes, except for the unique presence of cytoskeleton-associated proteins in C. difficile (Figure 7). In addition, C. saudii shared 1479, 1181, 1034, 1779, 1100, 1037, 1554, 1351, and 1137 orthologous genes with C. beijerinckii, C. botulinum, C. carboxidivorans, C. celatum, C. dakarense, C. difficile, C. perfringens, C. paraputrificum and C. senegalense, respectively. Among compared genomes, AGIOS values ranged from 68.54% between C. carboxidivorans and C. paraputrificum to 79.95% between C. botulinum and C. perfringens. When C. saudii was compared to other species, AGIOS values ranged from 69.57% with C. difficile to 81.95% with C. celatum (Table 7).
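The AGIOS values above are, per the annotation methods, the mean percentage of nucleotide identity among orthologous ORFs for each pair of genomes. The sketch below shows only that final averaging step, assuming the orthologous pairs and their per-pair identities have already been obtained (for example from Proteinortho and BLASTn output). The function names, the input layout and the toy numbers are hypothetical and are not data from this study.

```python
from statistics import mean
from typing import Dict, Iterable, Tuple


def agios(identities_pct: Iterable[float]) -> float:
    """Mean nucleotide identity (%) over the orthologous ORF pairs of two genomes."""
    values = list(identities_pct)
    if not values:
        raise ValueError("at least one orthologous pair is required")
    return mean(values)


def pairwise_agios(
    ortholog_identities: Dict[Tuple[str, str], Iterable[float]]
) -> Dict[Tuple[str, str], Tuple[int, float]]:
    """For each genome pair, return (number of orthologous pairs, AGIOS in %)."""
    summary = {}
    for pair, identities in ortholog_identities.items():
        values = list(identities)
        summary[pair] = (len(values), agios(values))
    return summary


if __name__ == "__main__":
    # Toy identities for illustration only; not values from the study.
    toy = {("genome_A", "genome_B"): [80.0, 82.0, 84.0]}
    print(pairwise_agios(toy))
```

Reporting the number of orthologous pairs next to each mean mirrors the layout of the comparison table, where ortholog counts and average identities occupy the two halves of the matrix.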

Conclusion
On the basis of phenotypic, phylogenetic and genomic analyses, we formally propose the creation of Clostridium saudii sp. nov. that contains the strain JCC T. This bacterial strain was isolated in Marseille, France.
The G + C content of the genome is 28%. The 16S rRNA and genome sequences are deposited in GenBank under accession numbers HG726039 and CBYM00000000, respectively. The type strain is JCC T (=CSUR P697 = DSM 27835).
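For readers who wish to recompute the G + C content from the deposited contigs, a minimal sketch follows. It assumes the sequence has already been read into a plain string and that ambiguous bases (such as N) are excluded from the denominator; both the function name and that convention are choices made here for illustration.

```python
def gc_content(sequence: str) -> float:
    """G + C percentage over unambiguous bases (A, C, G, T) of a sequence."""
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / unambiguous


if __name__ == "__main__":
    # Toy sequence for illustration only.
    print(gc_content("ATGCATAT"))
```

For a multi-contig assembly, the contig sequences can be concatenated before calling `gc_content`, which weights each contig by its length.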