Codon-usage optimization in the prokaryotic tree of life: How synonymous codons are differentially selected in sequence domains with different expression levels and degrees of conservation

Prokaryote genomes exhibit a wide range of GC contents and codon usages, both resulting from an interaction between mutational bias and natural selection. In order to investigate the basis underlying specific codon changes, we performed a comprehensive analysis of 29-different prokaryote families. The analysis of core-gene sets with increasing ancestries in each family lineage revealed that the codon usages became progressively more adapted to the tRNA pools. While, as previously reported, highly-expressed genes presented the more optimized codon usage, the singletons contained the less selectively-favored codons. Results showed that usually codons with the highest translational adaptation were preferentially enriched. In agreement with previous reports, a C-bias in 2- to 3-fold codons, and a U-bias in 4-fold codons occurred in all families, irrespective of the global genomic-GC content. Furthermore, the U-biases suggested that U3-mRNA–U34-tRNA interactions were responsible for a prominent codon optimization in both the more ancestral core and the highly expressed genes. A comparative analysis of sequences that encode conserved-(cr) or variable-(vr) translated products, with each one being under high- (HEP) and low- (LEP) expression levels, demonstrated that the efficiency was more relevant (by a factor of 2) than accuracy to modelling codon usage. Finally, analysis of the third position of codons (GC3) revealed that, in genomes of global-GC contents higher than 35-40%, selection favored a GC3 increase; whereas in genomes with very low-GC contents, a decrease in GC3 occurred. A comprehensive final model is presented where all patterns of codon usage variations are condensed in five-distinct behavioral groups. IMPORTANCE The prokaryotic genomes—the current heritage of the more ancient life forms on earth— are comprised of diverse gene sets; all characterized by varied origins, ancestries, and spatial-temporal–expression patterns. Such genetic diversity has for a long time raised the question of how cells shape their coding strategies to optimize protein demands (i.e., product abundance) and accuracy (i.e., translation fidelity) through the use of the same genetic code in genomes with GC-contents that range from less than 20 to over 80%. In this work, we present evidence on how codon usage is adjusted in the prokaryote tree of life, and on how specific biases have operated to improve translation. Through the use of proteome data, we characterized conserved and variable sequence domains in genes of either high- or low-expression level, and quantitated the relative weight of efficiency and accuracy—as well as their interaction—in shaping codon usage in prokaryotes.


24
The prokaryotic genomes-the current heritage of the more ancient life forms on earth-25 are comprised of diverse gene sets; all characterized by varied origins, ancestries, and 26 spatial-temporal-expression patterns. Such genetic diversity has for a long time raised 27 the question of how cells shape their coding strategies to optimize protein demands (i.e., 28 product abundance) and accuracy (i.e., translation fidelity) through the use of the same 29 genetic code in genomes with GC-contents that range from less than 20 to over 80%. In 30 this work, we present evidence on how codon usage is adjusted in the prokaryote tree of 31 life, and on how specific biases have operated to improve translation. Through the use of 32 proteome data, we characterized conserved and variable sequence domains in genes of 33 either high-or low-expression level, and quantitated the relative weight of efficiency and 34 accuracy-as well as their interaction-in shaping codon usage in prokaryotes.

37
The wide range of GC contents exhibited by prokaryote genomes-i. e., from less 38 than 20 to 80%-are believed to be primarily caused by interspecies differences in 39 mutational processes that operate on both the coding and the noncoding regions (1-6).

40
Since prokaryote genomes consist mainly of coding regions that tightly reflect the 41 genomic GC content, mutational bias is a main force that shapes the codon usage of the 42 majority of the genes (7, 8). Thus, understanding how selection is coupled to mutational 43 processes to model codon usage under such diverse GC contents is an essential issue 44 (9-11). Recent evidence suggests that prokaryotic genomes with intermediate-to-high 45 GC contents are affected by mutations that are universally biased in favor of AT 46 replacements (12, 13). That process is counterbalanced by selection-based constraints 47 that, in turn, increase the GC content and fine-tune codon usage-i. e., the so-called 48 mutation-selection-drift model (14)(15)(16). Intragenomic codon-usage heterogeneities, 49 however, are always present among different gene sets-i. e., between core genes that 50 are shared throughout a given lineage, and singletons (unique accessory genes) that are 51 taxa-and/or strain-specific (17,18). Furthermore, in a multipartite genome, the linkage 52 between the physical patterns of heterogeneity in codon usage and the replicon location 53 of the different core genes has also been recently demonstrated (19). The analysis of 54 intragenomic codon usage heterogeneities by different authors (20, 21) has served to 55 identify at least the following three distinctive gene groups: The first comprises the 56 majority of the coding sequences that are associated with the so-called typical codon 57 usage, while the second consists of the putative highly expressed (PHE) genes involving 58 codon usages that are the best adapted to the translational machinery (20,(22)(23)(24)(25)(26). The 59 third contains genes that encode the accessory information including the singletons 60 (unique genes) that are present in mobile genetic elements as well as in the more stable

75
The analysis of an extensive number of genes with different functions, 76 ubiquitousness, and degrees of phylogenetic conservation has demonstrated that codon 77 usage is related to gene-expression level (33,42,43), the degree of conservation (18, 78 32, 44, 45), the genomic location-i. e., chromosome, chromid, plasmidome (19, 46, 79 47)-and different features such as codon ramps, and mRNA secondary structure, 80 among others (48-50). Current evidence indicates that accessory genes involve atypical 81 codon usages (21,47,51,52) compared to the more conserved (ancestral) core genes in 82 a given lineage. These latter genes, for their part, exhibit adaptational variations in codon 83 usages ranging from typical to more biased as the one observed in genes that 84 correspond to highly abundant proteins which are coded by PHE (53). Moreover, that 85 core genes may also exhibit remarkable codon-usage heterogeneities has been recently 86 demonstrated (19).

87
In the work reported here after examining 29 different prokaryote families, we 88 performed a consolidated analysis aimed at characterizing the specific intragenomic 89 codon variations that lead to differences in codon usage between gene sets with diverse 90 expression levels and degrees of conservation in a given lineage. The evaluation of 6 intragenic regions with different coding characteristics-compared to strategies based on 92 the global analyses of complete genes-enabled the recognition of different patterns of 93 codon usages within a message to be translated. Thus the questions emerged of (i) 94 whether the codon-usage patterns associated with highly expressed amino-acid 95 sequences (i. e., affecting efficiency) were the same as those associated with genes 96 encoding highly conserved sequences (i. e., affecting accuracy), and (ii) whether the 97 requirements for translation efficiency and accuracy were fully independent or whether 98 those two types of demands interacted. The results have indicated how, even in 99 organisms with quite different GC contents, alterations in specific codons are associated 100 with a selective adaptation of the more ancestral genes compared to the adaptation of 101 those genes that are newer in the phylogeny. Through an independent analysis of 102 sequences associated with variable or conserved regions having different expression 103 levels (i. e., low versus high), we were able to identify the specific codon usages 104 associated with translation efficiency and accuracy as well as quantitatively estimate their 105 relative relevance to codon usage.

108
Prokaryote families selected for analysis in this work and identification of core 109 genes and singletons

110
We screened the EDGAR public-projects database (54, 55) available at 111 https://edgar.computational.bio.uni-giessen.de/cgi-bin/edgar_login.cgi, chose several 112 prokaryote families that included at least 20 complete genomes each, and finally selected 113 27 bacterial and 2 archaeal families (Table S1_a, tab 1). A specific core-gene set was 114 defined as a group of genes whose orthologs are present in a given set of species under 115 investigation. For each of the families selected sequential core-gene sets with increasing 116 ancestry (C1 through Cn) were calculated. To that end, first the phylogenetic tree for 117 each family was extracted from EDGAR and one species per family chosen as a 118 reference. Next, the different core-gene sets were obtained by incorporating into the 119 analysis new species having sequentially increasing phylogenetic distances from the 120 reference strain (accordingly, by following the tree from the branches to the root). Table   121 S1_a-c, lists the phylogenetic trees used for these calculations as well as the particular 122 species that were included in each core-gene set (C1 to Cn) for the different prokaryote 123 families. The phylogenetic trees were edited with the Figtree (56) and Inkskape programs 124 (TEAM-Inkscape). At least six core-gene sets differing in size from ca. 50 to 100 genes 125 each were calculated per family. In each prokaryote family, the most ancestral core-gene 126 set (Cn) consisted of 100 to 500 orthologs.

131
For each of the selected reference genomes, we obtained a set of genes 132 encoding ribosomal proteins and tRNA synthetases (24, 57). Table S2, tab 1 itemizes the 133 PHE genes whose orthologs were obtained and analyzed in each reference genome.

140
Assuming that orthologs have comparable expression levels within the same-or closely 141 related-species and using the PaxDB data from the above indicated 6 strains, we

177
The RSCUs (60) of all individual genes from a given genome were calculated by 178 CodonW with DNA sequences as input data (61) and then used to perform the 59-

208
The estimated sets of S ij weights were used to calculate the modal tRNA-adaptation index 209 (m-tAI) for different species and gene sets (i. e., core and PHE genes plus singletons) as 210 a measure of their efficiency in being recognized by the intracellular tRNA pool. The m-211 tAIs were calculated from the previously generated modal sequences by means of the 212 tAI_Modal_g.sh script from the CUBES package.

215
Ancestry-dependent codon-usage bias as revealed by the analysis of core genes 216 from diverse prokaryotic families 217 López et al. (19) have recently demonstrated that, in a model proteobacterium, 218 the more ancestral the core genes were the better adapted their codon usages were to 219 the translational machinery. In order to investigate if such correlation was associated with 220 a general phenomenon in different prokaryote taxa, we assembled different core-gene 221 sets that progressed deeper into the phylogenies of 27 Gram-negative and -positive 222 eubacterial families spanning the phyla Proteobacteria, Actinobacteria, Firmicutes, and

223
Bacteroidetes along with 2 archaeal families from the phylum Euryarchaeota. Table S1_a 224 (tab 1) itemizes for each taxon the number of genes in each gene set from the most

287
S3, left panels a1, a3, a5, a6, a8, a10, and a12 to a19). In the reference strains from 288 these prokaryote families, the PHE genes (red dashed lines) were always associated with 289 higher m-tAI values than those of the core-gene sets from the same genome.

306
(2). As we have seen in the two previous sections, core genes adjust their codon usages 307 in the direction of the PHE genes ( Fig. 1 and Fig. S1, left panels) in order to improve

311
show that changes in GC3 in genomes from Groups A to D each follows a distinctive 312 pattern from singletons-to-Ci-to-PHE. Whereas in genomes that belong to Group A 313 (overall GC content lower than ca. 35%) the GC3 decreases from singletons to Ci to PHE 314 (cf. Fig. 2, panel b1), in the genomes included in Group C the GC3 either increases from 315 singletons to Ci-to PHE (cf. Fig. 2   the Ci series (with i varying from 1 to n) followed by a later decrease from the Cn values 319 down to those of the PHE genes (cf. Fig. 2, panel b2). Those changes in the Group-B 320 genomes were reflected in pronounced forward and backward movements in the position 321 of the core genes in the CA plots, first from singletons to Ci and then from Cn to the PHE 322 genes (cf. Fig. S1, organisms in Group B). A similar biphasic pattern in the CA plots could 323 also be recognized, though softened, in certain species that were included in Group C or 324 15 even Group D where the PHE genes did not evidence a decrease in GC3 levels when 325 compared to those of the core genes. The genomes in Group D had extremely high 326 global GC contents and had GC3 values in all their core-gene sets (C1 to Cn) that were 327 comparable-though slightly higher-than the corresponding values in their PHE genes.

328
In the next section we will describe how individual codons for a given amino acid are 329 selected in the more ancestral core-gene sets.

359
In the instance of the 2-fold purine-ending codons-that is GAA and GAG for Glu,

360
AAA and AAG for Lys, and CAA and CAG for Gln-we observed that the codons with G

381
A different codon usage bias-in a pattern not found in the 2-/3-fold-degenerate 382 amino acids-was observed in codons encoded by 4-fold-degenerate amino acids (Val,

393
All the codon adaptations that we have described in this section referring to core 394 genes proved to be more prominent in the PHE genes, whose triplets were even better    (Table S3 worksheet 472 "distances", and tree in Fig. 4), changes in codon usage derived from differences in gene-473 expression levels-i. e., the efficiency in terms of the distance from the LEP to the HEP-  (15) to drive all these changes is not yet fully understood.

488
The codon-usage biases described here were associated with the four different patterns 489 of GC3 changes summarized in the schemes presented in Fig. 7  extremely low GC content-such as those belonging to Group-A-the observed AT3 504 enrichment from C1 to Cn to PHE (Fig. 7, right panel) should mainly result from selection.

505
According to Hildebrand et al. (15), the mutational processes in very low-GC organisms 22 favor a GC3 enrichment. That the core and PHE genes in bacteria that belonged to 507 Group A had been selected to bear lower GC3 values than singletons in order to improve 508 translation in view of the previous pattern of increasing GC content was remarkable, with 509 this circumstance being a result of the above-mentioned enrichment in NNA and NNU 510 triplets compared to their proportion in the synonymous codons (Fig. 7, right panel). In

511
Group-B genomes, the biphasic pattern observed from singletons to PHE genes could be 512 explained by an initial increase in GC3-rich codons from singletons to core genes, 513 followed by a later U bias from core genes to PHE genes. That initial GC3 enrichment 514 followed by a U3 increase was sufficient to explain the basis of the previously reported 515 "rabbit head" distribution of codon usages that characterizes most prokaryote genomes 516 (21,67). What should be also especially noted is that the PHE genes separated from the 517 Cn (in both the CA, and the GC3 plots) because of a much more intense U bias likely 518 associated with the difference in expression levels between the two gene sets. In the 519 type-C genomes, in which the GC3 always increased, the absence of a strong U bias 520 from the Cn to the PHE genes led to a less pronounced-i. e., more linear-"rabbit-head"

539
The results presented here together with previous evidence from other authors 540 have enabled a comprehensive analysis of the principal basis underlying the changes 541 associated with the optimization of codon usage in prokaryotes in different gene sets and 542 in organisms with different GC contents. As stated previously, the overall codon usage is 543 known to be constrained by genome-wide mutational processes (7,8,10) that are 544 considered as a main force in shaping the global GC content. The intragenomic codon 545 usage will concurrently become accommodated through selection-driven processes, as 546 has also been extensively reported (35,42,48,68). In order to further our knowledge of

591
Average coordinates (centroids) for different gene sets (i.e. singletons in blue, C1 to Cn 592 in a gradient from blue to red, and PHE in red) were projected on the CA space. In C1 to 593 Cn the higher number denote a more ancestral core-gene set within the phylogeny. Table   594 S1_a (tab 1) lists the prokaryote species that were used to construct each Ci gene set by