Pangenomics reveal diversification of enzyme families and niche specialization in globally abundant SAR202 bacteria

It has been hypothesized that abundant heterotrophic ocean bacterioplankton in the SAR202 clade of the phylum Chloroflexi evolved specialized metabolism for the oxidation of organic compounds that are resistant to microbial degradation via common metabolic pathways. Expansions of paralogous enzymes were reported and implicated in hypothetical metabolism involving monooxygenase and dioxygenase enzymes. In the metabolic schemes proposed, the paralogs serve the purpose of diversifying the range of organic molecules that cells can utilize. To further explore this question, we reconstructed SAR202 single amplified genomes and metagenome-assembled genomes from locations around the world, including the deepest ocean trenches. In analyses of 122 SAR202 genomes that included six subclades spanning SAR202 diversity, we observed additional evidence of paralog expansions that correlated with evolutionary history, and further evidence of metabolic specialization. Consistent with previous reports, families of flavin-dependent monooxygenases were observed mainly in the Group III SAR202, in the proposed class Monstramaria, and expansions of dioxygenase enzymes were prevalent in Group IV. We found that Group I SAR202 encode expansions of racemases in the enolase superfamily, which we propose evolved for the degradation of compounds that resist biological oxidation because of chiral complexity. Supporting the conclusion that the paralog expansions indicate metabolic specialization, fragment recruitment and fluorescence in situ hybridization with phylogenetic probes showed that SAR202 subclades are indigenous to different ocean depths and geographical regions. Surprisingly, some of the subclades were abundant in surface waters and contained rhodopsin genes, altering our understanding of the ecological role of SAR202 in stratified water columns.

Importance
The oceans contain an estimated 662 Pg C of dissolved organic carbon (DOC). Information about microbial interactions with this vast resource is limited, despite broad recognition that DOM turnover has a major impact on the global carbon cycle. To explain patterns in the genomes of marine bacteria we propose hypothetical metabolic pathways for the oxidation of organic molecules that are resistant to oxidation via common pathways. The hypothetical schemes we propose suggest new metabolism and classes of compounds that could be important for understanding the distribution of organic carbon throughout the biosphere. These genome-based schemes will remain hypothetical until evidence from experimental cell biology can be gathered to test them, but until then they provide a perspective that directs our attention to the biochemistry of resistant DOM metabolism. Our findings also fundamentally change our understanding of the ecology of SAR202, showing that metabolically diverse variants of these cells occupy niches spanning all depths, and are not relegated to the dark ocean.

thousands of years (2) and is distributed throughout the water column, but is the main DOM type in the bathypelagic realm (>1000 m). Here we use the term semi-labile DOM (SLDOM) to encompass molecules that span a broad range of intermediate stabilities in the environment, including compounds that are often referred to as recalcitrant (3). Two general hypotheses put forward to explain SLDOM and RDOM are the intrinsic stability hypothesis, which postulates that DOM stability is due to molecular structures that are resistant to enzymatic cleavage (8), and the molecular diversity hypothesis, which predicts that extreme dilution of compounds can render them unusable by heterotrophs (4). Here, in genomes of the SAR202 clade of marine bacteria, we explore metabolic diversity related to both the intrinsic stability hypothesis and the molecular diversity hypothesis.
The first reports on SAR202 used molecular data to demonstrate that their relative abundance increases dramatically at the transition between the euphotic and aphotic zones of the oceans (5). Microbes adapted to dark ocean regions (mesopelagic, 200-1000 m; bathypelagic, 1000-4000 m; abyssopelagic, 4000-6000 m; hadalpelagic, 6000-11,000 m) exploit environments where the most abundant energy resources are SLDOM. These compounds mainly are remnants from primary production in the epipelagic, which is attenuated in transit through food webs. In the dark oceans, low levels of primary production also occur locally, fueled by chemoautotrophy (6). The Microbial Carbon Pump (MCP) is a conceptual framework that captures these features of food webs, and recognizes that, in the process of transformation, a fraction of labile DOM is chemically altered to forms that resist or escape microbial degradation (7).
SAR202 are the most abundant lineage of bacteria in the deep oceans. This clade diversified approximately 2 billion years ago, forming six subclades, referred to as "Groups I-VI" (9, 10). Early work showed that they constitute, on average, about 10% of total bacterioplankton throughout the mesopelagic of the Sargasso Sea, Central Pacific Ocean, and Eastern Pacific coastal waters (11). A subsequent study revealed that they constitute up to 5% of the total bacterioplankton community in the epipelagic and up to 30% in the meso- and bathypelagic zones in parts of the Atlantic Ocean (12
abundances of the top 50 most abundant COG categories (Fig. 2A). The heatmap revealed five major expansions of paralogous gene families, and many other less prominent expansions. The distributions of these groups of paralogs across the major SAR202 subclades are shown in Fig. 2B. Members of COG4948, the enolase superfamily, were mainly found in Group I and Group II (Fig. 2B); COG2141, the SAR202 FMNO paralogs, were found mainly in Groups II and III; and COG4638, ring-hydroxylating dioxygenase paralogs, were found in Group IV, as reported previously (16).
A correlation matrix of the top 50 most abundant COG categories showed that the expansions of the five major paralog families discussed above are linked to broad shifts in metabolism (Fig. 3). For example, COG3391, COG4102, and COG5267 are all uncharacterized conserved proteins. COG0747, COG0601, and COG1173 are components involved in dipeptide transport. We interpret these patterns as evidence that the ancient paralog expansions described above accompanied metabolic reorganization and specialization in the SAR202 subclades.

The diversification of flavin-dependent monooxygenases in Group III
An expansion and radiation of diverse FMNO members in Group III SAR202 was previously reported (10). We found further support for this conclusion in this broader analysis of SAR202 diversity, and also observed elevated numbers of FMNO paralogs in Groups II and IV. The number of paralogous FMNO copies ranged from 1 to 114, with members of Group IIIa encoding the highest numbers and the greatest relative abundances, up to 4% when normalized to the total number of resolved genes (Fig. 2B). FMNOs were also present in other SAR202 subgroups, at lower copy numbers. Group I genomes encode the fewest copies of FMNOs; in some genomes this number approaches zero. The five most abundant FMNOs were annotated as: alkanal monooxygenase alpha chain (23% of all annotations); limonene 1,2-monooxygenase (21%); phthiodiolone/phenolphthiodiolone dimycocerosates ketoreductase (13.9%); F420-dependent glucose-6-phosphate dehydrogenase (13.7%); and alkanesulfonate monooxygenase (7.2%).
Because automatic annotation can sometimes fail to assign proper function to the genes, we built a maximum likelihood (ML) phylogenetic tree of all extant FMNOs identified in databases to better visualize the functional diversity of the FMNOs (Fig. 4A
racemase (6.8%); and L-Ala-D/L-Glu epimerase (5.4%).
The numbers of enolase paralogs in Group I ranged from 4 to 75 (1.3 to 3.5% of total genes found in each subclade); other SAR202 clades appear to encode very few copies of this enzyme (Fig. 2B), with the exception of Group II SAR202, which encode both FMNO and enolase paralogs, in roughly equal abundances (Fig. 2B). Enzymes of the enolase superfamily catalyze mechanistically diverse reactions such as racemizations, epimerizations, β-eliminations of hydroxyl or amino groups, and cycloisomerizations, but all the known reactions they catalyze involve abstraction of an α-proton from carbons adjacent to carboxylic acid groups and stabilization of the enolate anion intermediate through a divalent metal ion, usually Mg2+ (24, 25).
Muconate cycloisomerases were also detected in SAR202, although they constitute a small fraction of the enolases found. They belong to the muconate lactonizing enzyme (MLE) family and are involved in breaking down lignin-derived aromatic compounds, catechols, and protocatechuate to produce intermediates that are used in the citric acid cycle (26, 27).
It is worth noting that, although Group I members predominantly encode a large diversity of enolase family enzymes, some Group III members also encode a few of these genes, the majority of which are mandelate racemases (Fig. 2B
interconvert between (R)-mandelate and (S)-mandelate, the latter of which is the first compound in the mandelate and hydroxy-mandelate degradation pathways (28). We postulate that the expansion of diverse enolase superfamily paralogs in Groups I and II is an adaptation to metabolize organic compounds that are recalcitrant to oxidation because of chiral complexity. In the discussion section, we further explore the ramifications of these observations.

Sulfatases in Group I and II members
Sulfatases in SAR202 were first reported in a study on dead zones in the Gulf of Mexico (14). We also detected a large number of genes belonging to COG3119 (AslA, Arylsulfatase A) and related enzymes classified in inorganic ion transport and metabolism, predominantly in Group I and II bins (Fig. 2B). Arylsulfatases and choline sulfatases can hydrolyze sulfated polysaccharides such as fucoidan produced by marine eukaryotes (algae or fungi). These enzymes are expressed intracellularly by a species of marine fungus (29), and are also found in marine Rhodobacteraceae that are mutualists of marine eukaryotes (30). Marine brown algae, such as Macrocystis, are known to produce fucoidans, which consist of α-L-fucosyl monomers (31). We speculate that SAR202 Groups I and II could be utilizing arylsulfatases to break down similar sulfated polysaccharides produced by the algae in the upper water column.

Ring-hydroxylating dioxygenases in Group IV, a molecular arsenal to break down aromatic compounds
One of the enzyme families that seems to be disproportionately expanded in SAR202 belongs to COG4638, annotated as "phenylpropionate dioxygenases or related ring-hydroxylating dioxygenases, large terminal subunit". Enzymes belonging to the ring-hydroxylating dioxygenase (RHD) family occur as monomers of subunits alpha and beta (α2β2 or α3β3) (32). The α subunit of RHDs contains a Rieske [2Fe-2S] center that transfers electrons to iron at the active site, while the β subunit is thought to play a structural role in the enzyme complex (32). Members of SAR202 Group IV harbor a large number of these RHDs, ranging from 1 to 62 paralogous copies for the α subunit (COG4638) and 1 to 3 for the β subunit (COG5517). Given that there are more α than β subunits, it appears that most of the RHDs in Group IV function as monomeric RHDs.
Of the 365 RHD subunits found in SAR202, 136 copies came from Group IV. OSU_TB11, a Group IV SAR202, encodes the highest relative abundance of RHDs, at 50 copies (2.64% of all genes in its genome) (Fig. 2B)
While the vast majority of the RHDs are annotated as "phthalate 4,5-dioxygenases", it is unlikely that phthalates are common substrates in the ocean. Most of the Group IV SAGs and MAGs were recovered from euphotic zone samples; all bins originated from ≤ 200 m depth. We speculate that these enzymes are used to metabolize other mono- or polycyclic aromatic compounds that are mainly released by phytoplankton, providing Group IV SAR202 with energy and carbon.
A recent paper showed that some of the SAR202 members encode large numbers of RHDs in their genomes, which were likely acquired by horizontal gene transfer (HGT), and speculated that they play a role in the catabolism of resistant DOM of terrestrial origin (16). We found Group IV MAGs containing copies of RHDs predominantly in samples from coastal regions of the Indian Ocean and Red Sea, and the Southern Ocean, near Antarctica (Fig. S1).

Rhodopsins in epipelagic Group I and II SAR202
Twenty-eight genomes, all from samples obtained from water depths shallower than 150 m, encoded proteorhodopsins, one of which was a heliorhodopsin. Most of the type-1 rhodopsins were found in members of Group Ia, Ib, Ic, and Group II, which we report are prevalent in the euphotic zone. The single heliorhodopsin, which was found in a Group II genome, is related to a recently described group of heliorhodopsins (35). Using the backbone tree from that study (35), the SAR202 type-1 rhodopsins were placed close to previously known proteorhodopsins and the sole heliorhodopsin was placed deep within the newly described heliorhodopsins (Fig. S2 and S3).

Depth stratification and biogeography indicate niche specialization is correlated with expansions of paralogous gene superfamilies in SAR202
Group I genomes, including those that encoded rhodopsins, were mostly isolated from the epipelagic (0-200 m), whereas the Group III members were mainly retrieved from the mesopelagic (200-1000 m) (Fig. 2). We further analyzed a variety of data types and found that the major SAR202 Groups have different depth ranges (Fig. 5). Vertical gradients of light (PAR), inorganic nutrients, and organic matter quality and quantity in the oceanic water column establish specialized nutritional niches. The vertical stratification of SAR202 groups, together with the evidence described above for metabolic specialization, suggests that SAR202 diversified to specialize in resources that vary across the water column.

Fragment recruitment analyses
Metagenome fragment recruitment showed that Group I members are most abundant in the epipelagic (from surface to 200 m); Group III recruited more reads from meso-, bathy-, abysso- and hadalpelagic samples, and Group II recruited reads from the surface through the mesopelagic (Fig. 6, S4, and S5). In TARA Oceans metagenomes, Group I members, most notably Ib, were relatively more abundant in the epipelagic (
of Group IIIb, however, appear to be more abundant in the upper water columns and less so in the deeper zones in two metagenome datasets (Fig. 6 and S4).
Group II members seem to occupy transitional zones between those occupied by Group I and Group III members (for example, 270-600 m in the Indian Ocean, 250 m in the North Atlantic Ocean, and 40-450 m in the North Pacific Ocean). However, the zones occupied by Group II members also seem to largely overlap with those of both Group I and Group III members (Fig. 6
are found in wider depth ranges, with one found to be quite abundant in the deepest water samples in all three trenches (Fig. 6).

Group I, II and III Fluorescence in Situ Hybridization Profiles
The first group-specific oligonucleotide probes for SAR202 Groups I, II and III were developed and used to count cells throughout the BATS water column to 4000 m in July 2017 (Fig. 5). All three groups were detected in significant numbers throughout the water column, summing to about 5% of total bacteria near the surface and up to 10% at 4000 m. Group I SAR202 cell numbers peaked in the epipelagic and dropped off sharply below the euphotic zone (100 m), whereas both Group II and III had a broader distribution across the epipelagic, peaking sharply within the upper mesopelagic zone at ~250 m, as reported previously. When plotted as relative abundance (lower panels, Fig. 5), the direct cell count data were consistent with the observations from metagenome recruitment, which are also presented in relative units.

SAR202 FMNO gene relative abundance is correlated with depth
The relative abundances of all TARA FMNO genes (Fig. S8C) and of SAR202-specific FMNOs were correlated with depth (Fig. 7C), with a Pearson r value for the latter of 0.87 (P = 9.6e-75). These results suggest that FMNOs are functionally more important in the deeper oceans.
Because it appeared that FMNOs are abundant in SAR202 members originating from the bathy- and abyssopelagic, we checked whether the relative abundances of FMNOs in SAR202 genomes correlated with depth. Fig. S6A shows a significant positive correlation between FMNO relative abundance and depth, and Fig. S6B shows a weak but significant negative correlation between enolase abundance and depth. These data indicate that FMNOs are most abundant in SAR202 cells from deep waters, whereas the enolases are more abundant in shallow water ecotypes.
The analysis in Fig. 7D tests the prediction that molecules differing by the addition of a single oxygen atom, as expected from the chemical mechanism of FMNO enzymes, should be more abundant in the deep ocean. In the plot, the ratio between the number of m/z observations that differ in mass by one oxygen and the number that differ in mass by one carbon increases dramatically below the epipelagic. In the model we presented previously, cells are presumed to enzymatically modify resistant DOM compounds, channeling some to catabolism, while exporting from the cell molecules that cannot be further degraded (10).
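The ratio described for Fig. 7D can be illustrated with a minimal sketch. It assumes a plain list of observed m/z values for one sample, singly charged ions (so that an m/z difference equals a mass difference), and a matching tolerance of 0.001; the function names, the tolerance, and the pair-counting approach are illustrative assumptions and are not taken from the analysis itself.

```python
from bisect import bisect_left, bisect_right

# Monoisotopic masses in Da. With singly charged ions (an assumption here),
# an m/z difference equals a mass difference.
OXYGEN = 15.994915
CARBON = 12.0


def count_pairs(mz_values, delta, tol=0.001):
    """Count peak pairs whose m/z values differ by delta, within +/- tol."""
    peaks = sorted(mz_values)
    pairs = 0
    for mz in peaks:
        # Binary search for partner peaks in [mz + delta - tol, mz + delta + tol].
        lo = bisect_left(peaks, mz + delta - tol)
        hi = bisect_right(peaks, mz + delta + tol)
        pairs += hi - lo
    return pairs


def oxygen_to_carbon_ratio(mz_values, tol=0.001):
    """Ratio of one-oxygen pairs to one-carbon pairs; NaN if no carbon pairs."""
    carbon_pairs = count_pairs(mz_values, CARBON, tol)
    if carbon_pairs == 0:
        return float("nan")
    return count_pairs(mz_values, OXYGEN, tol) / carbon_pairs
```

Computing this ratio per sample and plotting it against sampling depth would give a profile of the kind the text describes, under the assumptions stated above.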

Enolase abundances show weak correlation with depth
Because enolases appear to be a notable feature of SAR202 SAGs and MAGs from the upper water column, we assessed whether relative enolase abundances were also correlated with depth. Fig. S6B shows that there is a slight negative correlation between the % abundance of enolase genes in MAGs and SAGs and the depth they were recovered from, but SAR202 enolases in the TARA Oceans metagenomic data show a somewhat positive correlation with depth (Pearson r value of 0.6, P = 1.4e-25) (Fig. S7). This was surprising because, based on the genomic data, we reasoned that the enolases might be involved in breaking down more labile compounds found in the upper water column, and we therefore expected higher abundances of enolases in samples from the upper water column. One reason for this discrepancy could be biased sampling of MAGs from TARA Oceans metagenome samples. We selected 43 TARA samples to re-assemble based on SAR202 abundances; some samples from deeper regions that we did not assemble could harbor uncharacterized SAR202 subgroups that encode a large number of enolases.
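The depth correlations reported here and in the preceding section are Pearson coefficients. As a reminder of what that statistic measures, a minimal pure-Python sketch follows. The function name is an illustrative assumption, and the sketch does not reproduce the P-value calculation or any data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

In this setting, `xs` would hold sample depths and `ys` the relative abundance of a gene family in each sample; a positive value means abundance tends to rise with depth and a negative value means it tends to fall.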

Discussion
Pangenome analysis confirmed earlier reports and uncovered further evidence of ancient expansions of paralogous enzymes in the SAR202 clade (Fig. 2B, 4A, 4B). The paralogous gene families were correlated with deep branches in the SAR202 genome tree, which divide the clade into six subgroups. Metagenome analyses, and cell counts made with FISH probes, showed that several of the SAR202 groups are vertically stratified through the water column, suggesting niche specialization (Fig. 6). Collectively, these patterns amount to strong evidence that the early evolutionary radiation of SAR202 into subgroups was accompanied by metabolic specialization and expansion into different ocean niches.
It is striking that the major paralog expansions in SAR202 suggest three different metabolic strategies, each potentially targeting a different class of semi-labile DOM compounds. In the hypothetical schemes we developed, the evolutionary diversification of paralogous enzyme families was driven by selection favoring substrate range expansion. We found support for this scheme in evidence that these gene lineages arose early in evolution. While deep internal nodes for these genes in tree topologies could result from the recruitment of paralogs by horizontal gene transfer, the rarity of near gene neighbors across the tree-of-life favors the explanation that most of the paralog diversity arose within SAR202 by gene duplication during evolution. If this interpretation is correct, it implies that much of the functional diversity in two major enzyme families, the alkanal monooxygenases within the FMNO superfamily and mandelate racemases within the racemase superfamily, may have originated within SAR202. This is apparently not the case for the Group IV dioxygenases, for which there is evidence of acquisition by HGT (16).
Surprisingly, given the reputation of SAR202 as deep ocean microbes, the ecological data we gathered revealed that Group I SAR202 are mainly epipelagic, and harbor large and diverse families of enolase paralogs. We interpret this proliferation of enolase superfamily paralogs as evidence that these organisms have evolved to metabolize organic matter that is resistant to oxidation because of chiral complexity. Enolase superfamily enzymes remove the α-proton from carboxylic acids to form enolic intermediates, which can rotate on the axis of the double bond of the intermediate, with stereochemical consequences (24). These enzymes catalyze racemizations, β-eliminations of water, β-eliminations of ammonia, and cycloisomerizations. Chemical oceanographers have recognized a role for molecular chirality in diagenesis, reporting that the ratio of D- to L-aspartic acid uptake by prokaryotic plankton increases by two to three orders of magnitude between surface and deep mesopelagic waters in the North Atlantic (36). This has been interpreted as evidence that mesopelagic prokaryotic plankton are using bacterial cell wall-derived organic matter because the bacterial peptidoglycan layer is the only major biotic source of significant amounts of D-amino acids in the ocean (37). However, information about D-amino acid utilization by marine microbes remains limited (38).
The possibility that SAR202 harness paralogous enzymes of the enolase superfamily to metabolize compounds that are resistant because of chirality is a powerful concept. We propose that chiral complexity defines a class of resistant compounds, and that enolases are an innovation that makes this DOM accessible to degradation by reducing the number of enzymes needed to degrade it. The number of stereoisomers of a compound increases as 2^n, where n is the number of chiral centers. Thus, a single compound with three chiral centers (2^3 = 8 stereoisomers) might in principle require eight enzymes to recognize all stereoisomers. However, if the three chiral centers were racemized by enolases, then only four enzymes would be required: one degradative enzyme and one enzyme to racemize each of the chiral centers. Spontaneous racemization might play a role in increasing the chiral complexity of DOM and thereby transitioning it to more resistant forms, but it might also originate in biological complexity, much of which is unexplored. The role for enolases that we propose evokes the molecular diversity hypothesis by speculating there is a relationship between the complexity of DOM and its resistance to degradation. Most often, the molecular diversity hypothesis is used to explain the relationship between the dilution of DOM and its susceptibility to degradation.
We speculate that Group I SAR202 are specialized to harvest a fraction of DOM
The positions and separation of the subclades in trees, and the diversity of the enzymes involved, suggest this evolution occurred early in SAR202 history. Close examination of Fig. 6 shows that there are more finely structured patterns of congruence between tree topologies and depth range than the broad patterns we focus our discussion on. For example, some lineages of Group Ia were consistently observed in the bathypelagic, and some Group II near the surface. It is apparent that more complex relationships between ecology, evolution and metabolism remain to be explored in SAR202.
This study confirmed previous reports of expansions of FMNO enzymes in Group III genomes recovered from the deepest ocean regions (10)
The genome-enabled hypotheses we propose will be challenging to test, but nonetheless should be studied because the organic carbon pool in question is so large. Deep-ocean regions beyond the reach of sunlight contain an estimated 662 Pg of DOC (1), which ranges in quality between LDOM and RDOM (3, 42). If our hypotheses are correct, this pool would be much larger if cells had not evolved strategies to oxidize many forms of resistant DOM. In principle, the modern RDOM pool would become much smaller if contemporary cells evolved mechanisms to oxidize it, with catastrophic consequences for the environment.
The complexity of DOM presents many challenges to proving these hypotheses. Thus far, DOM chemical structures have not been resolved with sufficient accuracy to support a detailed accounting of compounds and corresponding pathways of microbial catabolism. An example of these problems is the issue of chemical enantiomers, which have identical empirical formulas, making them perhaps the most difficult challenge. In brainstorming these challenges, we encountered one success (Fig. 7D)
(Nextera XT) as described previously (46). The number of amplification cycles used to construct these libraries was 17, except in the case of AD AD-812-D07, for which 12 cycles were used.

Genome assemblies, binning, and annotation
Illumina library preparation, sequencing, de novo assembly and QC of SAGs AC-409-J13, AC-647-N09, AC-647-P02 and AD-493-K16 were performed by SCGC, as previously described (43). For the remaining six SAGs, raw sequences were first quality trimmed using the Trimmomatic tool (47). Four SAGs were assembled individually using SPAdes assembler version 3.9.0 (48) with the "--careful" and "--sc" flags. Due to cross-contamination present in a second batch of 6 SAGs sequenced, they were co-assembled using metaSPAdes, then CONCOCT was used to separate the contigs from each SAG into respective bins. CheckM analysis of the bins showed that contamination levels in each identified bin were very low (below 0.2%). The 6 SAGs are from very divergent clades, so they can be easily separated by a differential coverage binning approach.

Raw sequences from 17 metagenomic samples from the Bermuda Atlantic Time-series Study (BATS) and 43 metagenomic samples from the TARA Oceans expedition were quality trimmed using Trimmomatic and individually assembled using metaSPAdes version 3.9.0 (49). The 43 TARA Oceans metagenomes chosen contain at least 1% relative SAR202 abundance based on metagenomic tag (miTAG) sequence data (50) (Supplemental Table 2).
All metagenomic contigs larger than 1.5 kbp were separated using metabat (51) to gather potential SAR202 bins. Metabat requires the use of multiple samples to calculate the contig abundance profile across samples. For TARA Oceans metagenomes, in order to generate abundance profiles, contigs were mapped against a minimum of 10 TARA Oceans metagenome samples chosen randomly (including the sample from which the contigs were assembled) using BBmap (http://sourceforge.net/projects/bbmap/). For BATS metagenomes, BBmap was also used against all 17 metagenomes to generate contig abundance profiles. Identities of the resulting bins were checked for the presence of 16S rRNA gene sequences matching known SAR202 sequences from Silva database release 128. In cases where there were no 16S rRNA genes in the bins, concatenated ribosomal protein phylogenies were constructed to identify members of the SAR202 clade. A total of 26 MAGs from a recent study (23) were also included in the binning process. These were also metagenomic bins from TARA metagenomes that had been assembled with megahit. The bins used in this study are listed in Supplemental Table 1. We also checked the bins obtained by another study using the TARA metagenomes (21) to see if there were redundant genome bins in our assemblies.
After potentially novel SAR202 bins were identified, average nucleotide identities between all TARA genome bins were determined with the PyANI tool (https://github.com/widdowquinn/pyani), and a custom Python script, "osu_uniquefy_TARA_bins.py", was used to identify bins that share 99% ANI. When near-identical bins were matched, the more complete and less contaminated genome bin was retained. In cases where bins originated from the same TARA station, near-identical bins were combined and co-assembled with the Minimus2 tool (52) to improve genome completeness. Refinement of metagenomic bins was done using the Anvi'o tool (53) to identify any potentially contaminating contigs. Some genomic bins were discarded entirely if they contained too many multiple copies of single-copy genes that could not be separated by Anvi'o. Genome completeness and redundancy were estimated using the tool CheckM (54). Genomes at various levels of completion with less than 1.1% redundancy of single-copy marker genes and less than 5% contamination were included in further analyses.

All the SAGs and MAGs were annotated with Prokka version 1.11 (55) to assign functions. Coding sequences predicted by Prokka were also submitted to the GhostKOALA web server (56) to assign KEGG annotations to the predicted genes. In addition, Interproscan (database version 5.28-67.0) and eggNOG-Mapper (57) searches were carried out. Metagenome-assembled genomes (MAGs) and SAGs from previous studies were re-annotated together with the new genomes to keep the functional assignments consistent.
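The de-replication of near-identical bins described above can be sketched in Python as follows. This is an illustration only and not the contents of "osu_uniquefy_TARA_bins.py": the function `dereplicate`, the input layout, the single-linkage grouping, and the ranking rule (completeness first, lower contamination as tie-breaker) are all assumptions.

```python
from itertools import combinations


def dereplicate(bins, ani, threshold=99.0):
    """Group bins sharing >= threshold % ANI and keep one representative each.

    bins: dict mapping bin name -> (completeness %, contamination %).
    ani:  dict mapping frozenset({a, b}) -> percent ANI between bins a and b.
    """
    parent = {name: name for name in bins}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of bins whose ANI meets the threshold (single linkage).
    for a, b in combinations(bins, 2):
        if ani.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in bins:
        clusters.setdefault(find(name), []).append(name)

    # Assumed ranking: highest completeness, then lowest contamination.
    return sorted(
        max(members, key=lambda n: (bins[n][0], -bins[n][1]))
        for members in clusters.values()
    )


example_bins = {"A": (90.0, 1.0), "B": (80.0, 0.5), "C": (70.0, 2.0)}
example_ani = {
    frozenset(("A", "B")): 99.5,
    frozenset(("A", "C")): 85.0,
    frozenset(("B", "C")): 84.0,
}
kept = dereplicate(example_bins, example_ani)
```

In this toy input, bins A and B exceed the 99% cutoff and A is retained because it is more complete, while C is kept as its own representative.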

Metagenome fragment recruitment analyses
Recruitment of quality-trimmed metagenomic reads from three different metagenomic databases against the SAG and MAG contigs, masked to exclude ribosomal RNA-coding regions (16S, 23S, and 5S rRNA genes as predicted by barrnap), was done using FR-hit (58) with the following parameters: "-e 1e-5 -r 1 -c 80". These parameters allowed reads matching a given reference genome with a similarity score of 80% or higher to be counted as positive matches. The metagenomic samples used for fragment recruitment were: 17 samples from BATS, 43 samples from TARA, and 22 samples from ocean trenches (6 from the Japan, 9 from the Ogasawara, and 7 from the Mariana Trenches) (Supplemental Table 1). Recruitment was calculated as a percentage of quality-trimmed metagenomic reads aligned against a SAG or MAG genome size in base pairs, normalized by the total base pairs of reads in a given sample. The recruitment plot was made using the "osu_plot_recruitment_heatmap.py" Python script (see https://bitbucket.org/jimmysaw/sar202_pangenomics/src/master/).
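The recruitment normalization is described only in words, so the Python sketch below shows one possible reading of it: aligned bases are scaled by genome size and by the total bases sequenced in the sample, then expressed as a percentage. The function name `recruitment_percent`, its arguments, and the exact formula are assumptions and may differ from what "osu_plot_recruitment_heatmap.py" computes.

```python
def recruitment_percent(aligned_bp, genome_size_bp, sample_total_bp):
    """Length- and depth-normalized recruitment under the assumed formula.

    aligned_bp:      bases in reads that passed the identity cutoff.
    genome_size_bp:  size of the SAG or MAG in base pairs.
    sample_total_bp: total bases of quality-trimmed reads in the sample.
    """
    if genome_size_bp <= 0 or sample_total_bp <= 0:
        raise ValueError("genome size and sample size must be positive")
    # Assumed formula: percent of sample bases recruited, per genome base pair.
    return 100.0 * (aligned_bp / genome_size_bp) / sample_total_bp


value = recruitment_percent(
    aligned_bp=2_000_000,
    genome_size_bp=1_000_000,
    sample_total_bp=4_000_000_000,
)
```

Dividing by genome size keeps large genomes from appearing more abundant simply because they offer more sequence to recruit reads, and dividing by sample size makes deeply and shallowly sequenced samples comparable.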

Analysis of TARA Oceans metagenome SAR202 enzyme abundances
A custom Kraken (59)

[Figure 3 axis labels: COG identifiers; annotated groups: FMNOs, enolases, RHDs, dehydrogenases, arylsulfatases]

Figure 3: Correlations among top 50 most abundant COG functional categories, demonstrating that the major paralog expansions identified in Figure 2 are linked to other expanded families of proteins, indicating metabolic specialization.

Normalization of FMNO abundances was obtained by dividing total SAR202 FMNOs by total SAR202 single-copy genes found in each sample. (D) The ratio of observations of organic metabolites with mass : charge ratio (m/z) that differ in mass by one oxygen, to observations that differ in mass by one carbon, in FTICR-MS data from deep ocean marine DOM samples collected from the Western Atlantic. The stations ranged from 38° S (station 2) to 10° N (station 23). Across the full dataset, the most common m/z difference observed corresponds to one carbon atom of mass. The data show that transformations corresponding to the addition of a single oxygen atom, as would be catalyzed by a flavin-dependent monooxygenase, become relatively more frequent in the dark ocean. Of several patterns predicted from a previous study (10), this one alone showed a consistent trend.
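The FMNO normalization stated in the caption text above (total SAR202 FMNOs divided by total SAR202 single-copy genes in each sample) can be written as a short Python sketch. The function name `normalize_fmno`, the dictionary layout, and the sample names are hypothetical, and skipping samples without single-copy gene hits is an assumption.

```python
def normalize_fmno(fmno_counts, single_copy_counts):
    """Return sample -> FMNO count per SAR202 single-copy gene.

    fmno_counts:        dict of sample ID -> total SAR202 FMNO hits.
    single_copy_counts: dict of sample ID -> total SAR202 single-copy gene hits.
    """
    ratios = {}
    for sample, fmno in fmno_counts.items():
        scg = single_copy_counts.get(sample, 0)
        if scg == 0:
            # Ratio is undefined without single-copy gene hits; skip the sample.
            continue
        ratios[sample] = fmno / scg
    return ratios


ratios = normalize_fmno(
    {"station_A": 30, "station_B": 5},
    {"station_A": 10, "station_B": 0},
)
```

Using single-copy genes as the denominator approximates FMNO copies per SAR202 genome, so samples with different SAR202 abundances can be compared.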