Identification of the Bacterial Biosynthetic Gene Clusters of the Oral Microbiome Illuminates the Unexplored Social Language of Bacteria during Health and Disease

The healthy oral microbiome is symbiotic with the human host, importantly providing colonization resistance against potential pathogens. Dental caries and periodontitis are two of the world’s most common and costly chronic infectious diseases and are caused by a localized dysbiosis of the oral microbiome. Bacterially produced small molecules, often encoded by BGCs, are the primary communication media of bacterial communities and play a crucial, yet largely unknown, role in the transition from health to dysbiosis. This study provides a comprehensive mapping of the BGC repertoire of the human oral microbiome and identifies major differences in health compared to disease. Furthermore, BGC representation and expression is linked to the abundance of particular oral bacterial taxa in health versus dental caries and periodontitis. Overall, this study provides a significant insight into the chemical communication network of the healthy oral microbiome and how it devolves in the case of two prominent diseases.

the chemical communication network of the healthy oral microbiome and how it devolves in the case of two prominent diseases. KEYWORDS biosynthetic gene clusters, caries, genome mining, oral microbiome, periodontitis, small molecules T he human body is inhabited by rich and diverse bacterial communities, which are intimately linked to the health of the human host (1). Small molecules, which are often encoded by biosynthetic gene clusters (BGCs), are the primary means of communication in this microbial world. Recent studies suggest that the human microbiota has the potential to synthesize structurally diverse small molecules and that these small molecules serve as mediators in a variety of microbe-microbe and host-microbe interactions (2)(3)(4). These interactions include antibacterial activity (5), bacterial signaling (6), immune modulation (7), biofilm formation (8,9), host colonization (10), nutrient scavenging (11), and stress protection (12). Disruption of the finely tuned equilibrium of the bacterial ecosystems in the human microbiome, referred to as dysbiosis, is associated with a plethora of diseases. While the mechanistic underpinnings of a shift to a dysbiotic community remain poorly understood, there is little doubt that signaling via the small molecules produced by microbial BGCs plays a critical role in the transition to dysbiosis and associated pathogenesis (13,14).
The human oral cavity contains an assortment of ecological niches, and as such, harbors one of the most diverse microbial populations in the human body (1,15). Dental caries and periodontitis are two of the most common and costly chronic conditions afflicting humans and are the result of localized dysbiosis in the oral cavity (16)(17)(18)(19)(20). Unlike the rest of the human digestive tract, the oral cavity is consistently exposed to the exterior environment. Therefore, an indispensable portion of the first line of defense against invading pathogens is the colonization resistance provided by a healthy oral microbiome. Indeed, dysbiosis of the oral microbiome is not only directly linked to oral diseases but is also implicated in system-wide health (21), stressing the urgent need to unravel the underlying factors that shape and maintain a healthy human oral microbiome.
Elucidating the transmissions relayed by oral bacterial small molecules could lead to a deeper understanding of key ecological factors that set the stage for oral community succession, in health and pathogenesis. A large and growing body of literature suggests that the microbial composition and metabolic potential of the saliva and dental plaque varies significantly in healthy versus disease states (22)(23)(24)(25)(26)(27)(28). Therefore, we hypothesize that the abundance and expression of BGCs, which produce small molecules, may drive crucial bacterial interactions which contribute to health or disease. To explore this further, the biosynthetic capacity of 461 well-annotated oral bacterial genomes was investigated, and an enormous diversity of BGCs was revealed. In addition, sequence reads from 294 publicly available metagenomes and metatranscriptomes, which were associated with health, dental caries, or periodontitis, were mapped to these novel oral BGCs. This analysis identified 2,473 biosynthetic pathways which were differentially represented in health versus disease. In addition, the BGC content in salivary metagenomes obtained from 24 healthy children and 23 children with dental caries was analyzed. A Bayesian network approach was employed to identify both positive and inverse correlations between BGCs and bacterial taxa, which revealed differentially abundant signaling networks and species in health compared to dental caries. Overall, this study provides significant insight into the chemical communication network of the healthy oral microbiome and how it devolves in the case of dental caries and periodontitis.

RESULTS AND DISCUSSION
The human oral microbiome encodes thousands of diverse BGCs from an array of species. To explore the metabolic capacity of the human oral microbiome in depth, a comprehensive pipeline for mining bacterial genomes was established, utilizing antiSMASH infrastructure v4 (accessible at https://antismash.secondarymetabolites .org/) (29), including MultiGeneBlast (30). An oral bacterial genome sequence database was assembled to include a total of 461 well-curated and annotated bacterial genomes, representing 113 unique bacterial genera and 298 taxonomically unique species, as well as 72 taxa unclassified at the species level (see Table S1 posted at https://massive .ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1). A few genomes representing transient exogenous bacteria (e.g., legume symbiont Mesorhizobium loti, pathogenic Yersinia pestis), which also have oral taxon IDs in the Human Oral Microbiome Database (HOMD) were included in our analysis to represent an additional source of unexplored BGC diversity. However, our sequence read mapping exercises (both with regards to metagenomic and metatranscriptomic reads) showed that most of these BGCs were of less importance, indicating that the majority were either of low abundance or absent in the oral cavity (Table S2 posted at https://massive.ucsd.edu/ProteoSAFe/ status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1). Genomes were selected based on their completeness and level of annotation. A single genome sequence for each bacterial species was included to circumvent the overrepresentation of BGCs from bacteria with a high number of genome representatives. Indeed, in a previous bioinformatic study of 169 Streptococcus mutans genomes, ϳ1,000 putative BGCs were identified, revealing an incredible potential to produce small molecules within one bacterial species (31). Therefore, it should be noted that the estimated BGC diversity reported here is likely underestimated. Clearly, strain-level diversity is important to explore in future studies. However, this will require extensive genome sequencing, since to date, most oral bacterial species lack multiple reference genomes. By applying the genome-mining pipeline described above, a total of 4,915 BGCs of known and unknown types were identified (Table S1 posted at https://massive.ucsd.edu/ ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1). BGCs annotated as fatty acid synthases, which are often involved in primary metabolism, were excluded. Approximately 50% of the identified BGCs were of an unknown class, congruent with the observations of other efforts to identify BGCs (Table S1 posted at the above website) (2). The remaining 50% of BGCs (2,250) shared sequence similarities with an extensive range of previously characterized BGC classes, which is likely reflective of the high taxonomic diversity observed within the oral cavity compared to many other body sites (1) (Fig. 1A).
Of the BGCs of a known class, a substantial fraction (1,398 BGCs [62%]) were annotated as oligosaccharides, making it the most abundant class of BGCs in the oral cavity. Oligosaccharide pathways are widely distributed across bacterial phyla and are predominant in Firmicutes, Proteobacteria, Bacteroidetes, Actinobacteria, and Fusobacteria, with the highest number being identified in Firmicutes ( Fig. 1B and C). However, the high abundance of Firmicutes BGCs may reflect that Streptococcus genomes are highly sequenced compared to genomes of other oral taxa which may have added bias to the analysis. The ecological roles of oligosaccharides are underexplored, but studies show important functions such as capsule formation in virulence development (32) and attachment to surfaces, including neighboring bacterial species and host cells (33). Furthermore, diffusible oligosaccharides are known to display antibacterial activities (34); for example, a previous study showed that polysaccharide A from the human gut bacterium Bacteroides fragilis can modulate the gut mucosal immune response (35,36).
Another highly represented BGC class was ribosomally synthesized and posttranslationally modified peptides (RiPPs), for which 209 BGCs (9.3% of BGCs of a known class) were identified. RiPPs include molecules such as bacteriocins, lantipeptides, sactipeptides, cyanobactins, and proteusins (denoted as fluorescent green in Fig. 1). Of these RiPP types, bacteriocin-encoding BGCs were the most abundant, as they contributed ϳ75% of the total RiPP diversity. Interestingly, although bacteriocin-producing BGCs were abundant in the oral microbiome overall, they were depleted in all Bacteroidetes genomes (Fig. 1C). The role of RiPPs, such as the bacteriocins, demands further exploration, as they exhibit antagonistic activities against other microbes sharing the same ecological niche and influence competition for persistence between commensals and pathogens (37,38). Furthermore, multiple studies of genetic transformation in Streptococcus show that competence is tightly linked to bacteriocin production (39), which suggests that these molecules also play important roles in the horizontal transfer of genes and ultimately in niche differentiation and population structure changes.
BGCs encoding aryl polyene-like molecules in several Bacteroidetes and Proteobacteria genomes were identified (131 BGCs or 5.8% of BGCs of a known class). Aryl polyenes are predicted to function as protective agents against oxidative stress (40). However, only a few candidates have been experimentally characterized, leaving this group of small molecules highly underexplored. A diversity of nonribosomal peptide synthetases (NRPSs), polyketide synthase (PKS), and NRPS-PKS hybrid BGCs (ranging between 0.9% and 4.4% of BGCs of a known class) were identified, in line with a prior study, which classified BGCs in the human microbiome in multiple body habitats (2). In the same study, it was also observed that BGCs show biogeographic signatures within the oral cavity, which suggests that different social signaling skills are required to inhabit different ecological niches. NRPS and PKS compound classes are known for their antimicrobial activities and were previously characterized as possessing various nutrient scavenging, immunosuppressant, surfactant, and cytotoxic properties (41). BGCs of the terpene class were also identified (95 BGCs, 4.2% of BGCs of a known class). This diverse group of small molecules may also be of ecological and medicinal interest, since their activities have been reported as both anti-inflammatory and antimicrobial (42). The class "other" encompasses BGCs that fall outside the known categories of antiS-MASH annotation, includes rare classes found in only a few species, and constituted 9.4% of the total BGCs identified (Fig. 1A). Taken together, these results show that the oral microbiome encodes a vast and highly diverse array of small molecules that have largely unexplored, yet likely pivotal, roles in ecology and health.
Sequence similarity networks reveals distant and close homologies to wellstudied classes of BGCs. In order to assess the evolutionary relationships between conserved domains in the proteins encoded by BGCs, as well as to group BGCs of similar putative function to evaluate novelty, a sequence similarity network approach was applied (see Text S1 in the supplemental material). Briefly, the BGCs that were identified from the bacterial genomes using antiSMASH were aligned to the MIBiG repository (43) of 1,409 experimentally validated reference BGCs using the BiG-SCAPE algorithm (https://git.wageningenur.nl/medema-group/BiG-SCAPE). The resulting network comprised 4,242 nodes and 19,847 connecting edges, revealing both close and distant homology to characterized biosynthetic pathways (Fig. 2). Notably, a significant fraction of the previously unclassified BGCs did subnetwork with BGCs predicted to be of a known class, particularly the oligosaccharide, RiPP, and aryl polyene classes (Fig. 2). These data provide inferences as to the function of more than 100 previously unclassified novel BGCs.
The largest subnetwork, comprised of mainly oligosaccharide-encoding BGCs, showed no significant homology with any experimentally validated BGCs in the MIBiG repository (Fig. 2). This may be due to the fact that oligosaccharide-producing BGCs are at times categorized with primary metabolism, and not natural product-producing BGCs, as is the case in this study. The second-largest major subnetwork was comprised of primarily unclassified BGCs. These may encompass distinct chemical scaffolds and may represent a rich source of novel BGC types. The NRPS, PKS, NRPS-PKS hybrids, and a few terpene BGCs, grouped together, forming a subnetwork implying a set of common core domains involved in these biosynthetic assembly lines as described previously (4,41). The majority of NRPS, PKS, NRPS-PKS hybrids, and RiPPs (in particular thiopeptides and lantipeptides) showed strong associations with MIBiG reference BGC sequences. It should be noted that these are the most prevalent classes in the MIBiG repository (Table S1 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ 1ed182b005a74b94ac730f769b2f37b1). Currently, only four experimentally characterized aryl polyene BGCs exist in the MIBiG database; therefore, it was not surprising that none of the nodes in the aryl polyene cluster subnetworked with MIBiG reference BGCs. Given that aryl polyenes are thought to be the most abundant BGC class in the human microbiome (4), this indicates that this class of molecules is severely understudied ( Fig. 1; see also Table S1 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?task ϭ1ed182b005a74b94ac730f769b2f37b1). Several BGCs annotated as saccharides, other, unclassified, PKS and NRPS BGC types grouped with aryl polyene BGCs, which may represent novel hybrid classes of BGCs. Other small subnetworks include biosynthesis of terpene phenazine, homoserine lactone, alkaloid, siderophore, and ectoine. These subnetworks did not associate with MIBiG reference BGCs, indicating that they also await experimental validation. Our implemented analysis approach, using the MIBiG/BiG-SCAPE pipeline, is powerful with regard to predicting the functions of novel BGCs. The annotations we generated here provide deeper insights of which BGCs and compound classes are most likely to be identified in futures studies, due to knowledge of their closest neighbor's biochemical properties. The BGCs remaining with completely unknown functions represent exciting future challenges, which could be addressed by generating large-insert BGC expression libraries.
While antiSMASH and network analysis were employed for broad classification of BGCs into known classes, MultiGeneBlast was also utilized at the level of the entire gene cluster to further annotate BGCs in depth and identify homologs against the MIBiG repository (30). Using this approach, the 4,915 BGCs were classified into four major categories based upon the level of homology to known experimentally validated BGCs in the MIBiG repository. This categorization resulted in 1,146 (20%) BGCs closely homologous, 848 (15%) BGCs moderately homologous, and 2,221 (40%) BGCs distantly homologous to well-characterized BGCs (see Fig. S1 in the supplemental material). A total of 1,393 (ϳ25%) BGCs did not appear to have significant homology to BGCs in MIBiG, based upon the E value (see Materials and Methods for details). Such a detailed annotation of BGCs harbored by the human oral microbiome has not been accomplished previously.
Specific BGCs and bacterial taxa are associated with periodontitis and dental caries. We next systematically examined the differential representation of bacterial BGCs in saliva and dental plaque across 294 human subjects with good oral health, dental caries, or periodontitis. The data from 247 subjects were obtained from eight previous studies, which represented all publicly available metagenomes and metatranscriptomes associated with caries or periodontal disease compared to oral health at the time of this study (28 December 2017; Table S2 posted at https://massive.ucsd.edu/ ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1). Due to the limited number of publicly available deep sequencing libraries of high quality that represent the same sampling location within the oral cavity (i.e., supra-or subgingival plaque and saliva), we included libraries with mixed sample origin to represent healthy and disease states, respectively (Table S2 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp ?taskϭ1ed182b005a74b94ac730f769b2f37b1). This could have created discrepancies in our analysis, and differences that we present here may represent differences between sampling location and not between oral health states. In addition, DNAs from 47 saliva samples representing 23 children with caries and 24 healthy children were sequenced, and putative BGCs were identified. Nonsupervised exploratory ordination through Subnetworks representing major BGC classes, as determined by antiSMASH and BiG-SCAPE, are highlighted with different background colors to visualize BGCs as constellations within the biosynthetic landscape. Nodes (small circles) represent amino acid sequences of BGC domains and are colored by BGC class. Unfilled nodes represent reference BGCs from the MIBiG repository. Edges drawn between the nodes correspond to pairwise distances, computed by BiG-SCAPE as the weighted combination of the Jaccard, adjacency, and domain sequence similarity indices. For increased simplicity, only subclusters of unclassified and oligosaccharide BGCs with a minimum number of eight nodes are organized into given highlighted constellation.
PCoA revealed significant differences in the representation of BGCs between healthy and diseased subjects in five of the six metatranscriptome studies and six of the seven metagenome studies investigated (Fig. 3). The 1,804 BGCs which were differentially represented in health versus disease in the metagenomes and metatranscriptomes are summarized in Table 1.
The BGCs associated with disease in the metatranscriptome studies were related to the synthesis of a broad range of small molecule types. These types particularly included BGCs of the oligosaccharide, aryl polyene, terpene, bacteriocin, and NRPS classes (Fig. 4). BGCs encoding PKS, NRPS, and bacteriocins from Actinomyces, Rothia, and Corynebacterium had increased expression in subjects with caries, while BGCs encoding terpenes and aryl polyenes from Neisseria spp. and Proteobacteria had increased expression in healthy subjects (Fig. 4). Previous studies illustrated that aryl polyenes act as protective agents against oxidative stress and that terpenes function as anti-inflammatory agents (40). These findings are in line with previous findings that high levels of Actinomyces were associated with severe childhood caries (44). In the caries-associated samples, known cariogenic species belonging to the Streptococcus, Veillonella, and Lactobacillus genera (45) showed notable changes in bacteriocin and oligosaccharide BGC expression profiles (Fig. 4).
In periodontitis, a high number of differentially expressed BGCs (170 BGCs) were identified in community members belonging to the Bacteroidetes phylum. Several BGCs encoded by periodontal pathogens of the red and orange complexes (e.g., Porphyromonas gingivalis) were differentially expressed in health compared to periodontal disease. Known red complex species had increased expression of BGCs belonging to the aryl polyene, oligosaccharide, homoserine lactone, and resorcinol classes in diseased states. Neisseria spp. also showed interesting signatures, such as increased expression of BGCs belonging to the terpene, resorcinol, bacteriocin, and homoserine lactone classes (Fig. 4). These results suggest that Neisseria spp. and its small molecules may play a role in periodontitis, which is worth further explorations in future studies. Homologs to specific BGC products in the MIBiG database which displayed differential expression in health and disease are detailed in Fig. S2. Analysis of the metagenomic studies yielded trends similar to those detailed above (Fig. 5).
Next, a subset of differentially represented BGCs, which showed high levels of expression in either healthy or diseased states, was examined to determine whether they commonly occur across studies. The results were visualized as a binary occurrence matrix (Fig. S3). In all studies analyzed, only a minor fraction of the differential features (Ͻ10 BGCs) were shared between any two studies. The reason for this discrepancy may be due to both biological and technical variations such as differences in anatomical features at the collection site of dental plaque and what time of the day saliva samples were collected. Additionally, sequencing platforms, methodological variations in sample collection, nucleic acid extraction techniques (with or without DNA or RNA/rRNA removal methods), and use of different library preparation methods (see Table S2 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a7 4b94ac730f769b2f37b1 and Fig. S4) may also play important roles. Given the complexity of metagenomes and metatranscriptomes, these factors can certainly influence the sequence composition and lead to the differences we observed here. It is also important to note that due to the fact that BGCs can be horizontally transferred between different bacterial species, mRNA read mapping to BGC databases does not provide definite taxonomic classification and predictions of species-level activities.
Differential interactions between BGCs and oral taxa in caries and health. To examine possible relationships between BGCs and bacterial taxa during health and disease, a focused comparative analysis of the shotgun metagenomics data obtained in this study from saliva samples from healthy children and children with caries (Table S1 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a74b 94ac730f769b2f37b1) was performed. Children in the caries group had a minimum of two active caries lesions. No overall differences in diet or oral hygiene habits (brushing Manhattan distances between samples were visualized using the first three principal coordinates, and significance was tested by applying Mann-Whitney ranks on the most variance captured by the first principal coordinate. and flossing) were recorded between the groups. Also, gingiva index scores, which measure oral health were low for both groups (below 1.0) and not significantly different (P Ͼ 0.05, Table S1 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ 1ed182b005a74b94ac730f769b2f37b1), suggesting that both groups presented with good overall gingival health, except for cavities in the caries active group. Interactions between BGCs and microbial taxa were examined by employing cooccurrence network analysis using the SparCC algorithm, which has the benefit of limiting the number of spurious correlations identified due to species data being compositional (46). While positive correlations were more evident among taxon-taxon relationships (i.e., different taxa benefit from one another's presence), almost all significant correlations that were identified between specific BGCs and taxa were negative (Table S1 posted at https:// massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1 and Fig. S5 and S6). This suggests that antagonistic relationships, modulated through BGC-produced antimicrobial molecules, are highly significant to the ecology of the oral microbiome.
All BGCs that had significant correlations to oral taxa (a total of 36) were annotated as close homologs to previously characterized BGCs belonging to the PKS, NRPS, NRPS-PKS hybrid, oligosaccharide, and aryl polyene classes (Fig. S1). In the oral microbiomes derived from healthy children, the interaction network was dominated by negative correlations between oral taxa and BGCs producing glycopeptidolipids, capsular polysaccharides, as well as a homolog of flexirubin ( Fig. S5A and Table S1 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f 769b2f37b1). The glycopeptidolipids were encoded by the opportunistic pathogens Kytococcus sedentarius and Mycobacterium neoaurum and were primarily shown to vary inversely with the oral taxa Lactobacillus, Prevotella, Capnocytophaga, and Enterococcus ( Fig. S5A and Table S1 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?task ϭ1ed182b005a74b94ac730f769b2f37b1). The flexirubin homolog BGCs were encoded  (Fig. S5A). These associations are reminiscent of a previous study which observed similar macrolide-encoding BGCs widely distributed among oral bacterial genomes (2). These Aleti et al.
® macrolide structures were also reported to inhibit the growth of cariogenic streptococci (50). This collective evidence indicates that the isolation and characterization of bacillaene-and pristinamycin-like molecules in future studies may be key to understanding important health-protective mechanisms in the oral cavity. Finally, P. propionicum F0230a encoded a BGC with high sequence homology to a nonribosomal peptide pathway encoding the genotoxin colibactin (51). This BGC showed antagonistic associations with pathogenic genera: Haemophilus, Aggregatibacter, Parascardovia, Capnocytophaga, and Streptococcus (Table S1 posted at https://massive.ucsd.edu/ ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1). Most intriguingly, the number of significant correlations between BGCs and microbial taxa was dramatically reduced in the samples derived from children with caries ( Fig. S6A to C). This may indicate that in the oral cavities exhibiting disease, the well-documented colonization resistance of the oral microbiome may be impaired. Of the few significant correlations between BGCs and taxa within the interaction network of the caries-associated microbiome, the vast majority involved BGCs encoding RiPPs with close homology to nosiheptide and hygromycin BGCs (Fig. S5B; Table S1 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f 769b2f37b1). The nosiheptide-like BGC, encoded in the genome of Corynebacterium matruchotii, was the most predominant, with antagonistic interactions against ϳ90 taxa. These taxa included pathogens from the Klebsiella, Helicobacter, Filifactor, Haemophilus, Enterococcus, and Fusobacterium genera ( Fig. S5B; Table S1 posted at https:// massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1). The hygromycin-like BGC from P. propionicum negatively correlated with several pathogens belonging to the genera Lactobacillus, Neisseria, Klebsiella, Anaerococcus, and Pseudoramibacter. Interestingly, there were no significant correlations between S. mutans and BGCs in the caries-associated oral microbiomes, which suggests that during disease, the community lacks the ability to limit the abundance of this cariogenic pathogen. Taken together, these results show that in the oral microbiome, exclusion of particular taxa via antagonistic interactions, mediated by the products of BGCs, is widespread. Although such interactions were still present in the caries-associated oral microbiomes, there were fewer interactions. This underscores the importance of ecology and the role of BGC-produced small molecules in the balance between health and disease.
Homologs of BGC-produced small molecules are present in oral metabolomes associated with caries and health. To validate the production of small molecules from the differentially abundant BGCs that we identified in the analysis above (Fig. 3M), we performed untargeted liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis of the same saliva samples. Utilizing the Global Natural Products Social Molecular Networking (GNPS) (52) analysis platform, a mass spectral molecular network consisting of 1,369 mass spectral features grouped into 69 molecular families (two or more connected components of a graph) was obtained. Fifty matches were acquired between the query MS/MS spectra and characterized reference spectra from GNPS. To further enhance mass spectrometry annotations and to link annotations to known chemical structures encoded by BGCs, major chemical classes were putatively identified by integrating mass spectral molecular networking with in silico annotations and automated chemical classification approaches (53)(54)(55). This allowed identification of approximately 38% of the nodes in the mass spectral molecular network at the chemical class level. The most predominant chemical classes within the network were carboxylic acids and derivatives, prenol lipids, fatty acyls, and flavonoids (Fig. S7). Substructures associated with macrolides, terpenoids, and macrolactams were also identified. At the chemical class level, there were no distinct relative abundance patterns between the oral health-and disease-associated samples. This is in stark contrast to PCoA analysis of the 1,369 MS features, which showed clear separation of samples between healthy and diseased states (Fig. 6A) rather than by DMFT scores (Fig. 6B), dentition stage (Fig. 6C), race (Fig. 6D), and age (Fig. 6E), in agreement with the BGC abundance profiles (Fig. 3M). By employing a random forest importance model, 15 key metabolites, which were distinct between healthy and disease states (Fig. 6F), were identified. Of these, 12 were significantly more abundant in healthy subjects, while 3 were more abundant in the subjects with dental caries. Out of the three key metabolites that were significantly more abundant in the diseased subjects, two matches were obtained to lipid compounds from GNPS reference spectra resulting in a level 2 metabolite identification (56). These matches were N-nervonoyl-D-erythrosphingophosphorylcholine and 13-docosenamide. These molecules are likely to originate from the human host and warrant further investigation.
Using the in silico Network Annotation Propagation tool (NAP) (57), putative structural matches were obtained for 6 out of the 12 key metabolites that were more abundant in the healthy subjects, including terpenoids, phenylpropanoids, as well as fatty alcohols. It should be noted however, that one of the limitations of in silico annotation is the uncertainty around the correct structure among the predicted candidate structures. Results should therefore be interpreted with care, and an accurate prediction of the putative identity would require follow-up investigations, which is outside the scope of the present study. It should be also noted that both the genomic and metabolomic approaches employed identify putative homologs and not exact matches. Thus, using current techniques and databases, it is not possible to definitively determine whether the small molecules identified by LC-MS/MS were in fact produced by the specific BGCs predicted by antiSMASH. However, the LC-MS/MS analyses largely support the results of the genomic analyses by detecting classes of small molecules and homologs which were similar to those discovered by the complementary BGC genomics analyses.
Concluding remarks. This study significantly expands the number of identified BGCs encoded by bacteria of the human oral microbiome and designates putative products to many novel clusters. Representation and expression of the newly identified BGCs, as well as their relationship to the abundance of oral bacterial taxa were examined during health, dental caries, and periodontitis, revealing significant differences in microbial social ecology and communication among the three host outcomes. This work provides an atlas for further examination and experimental validation of the identified socio-chemical relationships and their role in the pathogenesis of dental caries and periodontal disease. A deeper elucidation of the social activities of the microbes residing in the oral cavity will significantly improve our understanding of the pathogenesis of oral (and extraoral) diseases and will guide development of improved therapeutic strategies to maintain oral health.

MATERIALS AND METHODS
The ethics statement is provided in Text S1 (Supplementary Materials and Methods section) in the supplemental material.
Study inclusion/exclusion criteria and collection of saliva. Subjects were included in the study if the subject was 3 years old or older and in good general health according to a medical history and clinical judgment of the clinical investigator, subjects had at least 12 teeth. Subjects in the caries group had a supragingival (above the gum line) plaque score of 1 or greater on a scale of 0 to 3. Subjects were excluded from the study if they had generalized rampant dental caries, chronic systemic disease, or medical conditions that would influence the ability to participate in the proposed study (i.e., cancer treatment, HIV, rheumatic conditions, history of oral candidiasis). Subjects were also excluded it they had open sores or ulceration in the mouth, radiation therapy to the head and neck region of the body, significantly reduced saliva production or had been treated by anti-inflammatory or antimicrobial therapy. Ethnic origin was mixed for healthy subjects (Hispanic, Asian, Caucasian, Caucasian/Asian), while children with caries were of Hispanic origin. For the latter group, no other ethnic group enrolled despite several attempts to identify interested families/participants. Children with both primary and mixed dentition stages were included (caries group: 18 children with mixed dentition and 6 with primary dentition; healthy group: 19 children with mixed dentition and 6 with primary dentition). To further enable classification of health status (here caries and healthy), a comprehensive oral examination of each subject was performed by a single calibrated pediatric dental resident, using a standard dental mirror, illuminated by artificial light. The visual inspection was aided by tactile inspection with a community periodontal index (CPI) probe when necessary. The number of teeth present was recorded, and dental caries status was recorded using decayed (d), missing due to decay (m), or filled (f) teeth (t) in primary and permanent dentitions (dmft/DMFT), according to the criteria proposed by the World Health Organization in 1997. Duplicate examinations were performed on five randomly selected subjects to assess intraexaminer reliability. Subjects were dichotomized into two groups: caries free (dmft/DMFT ϭ 0) and caries active (subjects with Ն2 active dentin lesions). If the subject qualified for the study, (s)he was asked to brush their teeth after breakfast and abstain from eating and drinking for 2 h prior to saliva collection in the morning. According to a previously developed protocol (58), to ensure that the mouth was free of food or foreign substances, each volunteer rinsed their mouth three times with water 10 min prior to saliva collection. Following the same protocol, we collected approximately 2 ml saliva from each subject by having the subject spit directly in a sterile 15-ml Falcon tube over a 20-min period. Whole saliva was immediately transferred to sterile 2-ml cryovial tubes and snap-frozen for long term storage at Ϫ80˚C. For more information, see Text S1 (Supplementary Materials and Methods section) in the supplemental material.
DNA extraction and metagenomic sequencing. For detailed protocols of DNA extraction and metagenomic sequencing, see Text S1 (Supplementary Materials and Methods section).
BGC identification and network analysis of known and putative oral BGCs. A list of 1,362 described and curated human oral taxa (18 September 2017) was obtained from the Human Oral Microbiome Database (HOMD) (55). A few genomes belonging to transient exogenous bacterial taxa, which also have oral taxon IDs in HOMD were also added to this list (e.g., legume symbiont Mesorhizobium loti, pathogenic Yersinia pestis). In order to identify small-molecule-and secondary metaboliteencoding BGCs in the genomes of bacterial taxa representative of a broad oral bacterial diversity, 461 complete and high-quality draft genomic sequences, annotated as dynamic and static, were obtained from the National Center of Biotechnology Information genome database (https://www.ncbi.nlm.nih .gov/genome), as well as from an in-house database (Table S1 posted at https://massive.ucsd.edu/ ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1). These sequences were concatenated into a major query-database and fed to antiSMASH (Antibiotics & Secondary Metabolite Analysis Shell, version 4.0) (29). Multiple nucleotide FASTA sequences from BGCs were constructed. We excluded a list of 320 previously described nonbiosynthetic genes commonly found in BGCs (2) (Table S1 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a74b94ac730f769b2f37b1) based on text within an attribute using advanced filter settings in CLC Workbench software v. 9. (CLCbio, Aahus, Denmark). The resulting data set contained a total of 192,283 gene sequences from 4,915 BGCs and can be downloaded from the MassIVE repository (ftp://massive.ucsd.edu/MSV000081832) with the accession ID MSV000081832. For more information, see Text S1 (Supplementary Materials and Methods section).
Comparison of BGCs with known biosynthetic pathways. A reference MIBiG database comprising multiple amino acid sequences for each BGC was constructed using MultiGeneBlast (30). To further compare BGCs (excluding the fatty acid synthase encoding BGCs) derived from oral bacterial genomes with those encoding the biosynthetic pathways for known compounds, we performed multigene homology searches using complete gene cluster sequences against the MIBiG database by using the stand-alone version of MultiGeneBlast (http://multigeneblast.sourceforge.net/) algorithm with default settings. Subsequently, for each queried BGC, we extracted information from the top hit (with the highest cumulative BLAST bit score) from an output of multiple BLAST hits using an in-house python script. For additional information, see Text S1 (Supplementary Materials and Methods section).  60), as well as our own study of metagenomes from saliva samples obtained from children with good dental health and children with dental caries was analyzed (sequence reads are accessible under BioProject accession no. PRJNA1234; Table S2 posted at https://massive.ucsd.edu/ProteoSAFe/status.jsp ?taskϭ1ed182b005a74b94ac730f769b2f37b1). For a detailed protocol, see Text S1 (Supplementary Materials and Methods section).
Differential abundance and expression analyses of BGCs. We employed a systematic workflow for analyzing abundance and expression profiles of the BGCs. Using SRA toolkit utilities, reads were extracted from metatranscriptome and metagenome shotgun sequenced libraries available via NCBI. For a detailed protocol, see Text S1 (Supplementary Materials and Methods section).
Principal coordinate analysis. The differences between samples from healthy versus diseased individuals was investigated by applying principal coordinates analysis (PCoA) on Manhattan distances generated on the DESeq2-normalized count file using the EMPeror (61) tool. For a detailed protocol, see Text S1 (Supplementary Materials and Methods section).
Correlation network analysis. The correlation network was constructed using the SparCC algorithm (46) python package (available at https://bitbucket.org/yonatanf/sparcc) to represent both coabundance and coexclusion networks between species and corresponding BGCs. For a detailed protocol, see Text S1 (Supplementary Materials and Methods section).
Experimental small molecule metabolite detection. Approximately 150 l of saliva was lyophilized, and ethyl acetate was added to extract nonpolar molecules. Samples were then vortexed, centrifuged to remove the cell debris, and submitted to untargeted LC-MS/MS analysis. For a detailed protocol, see Text S1 (Supplementary Materials and Methods section).
Mass spectral molecular networking. LC-MS/MS spectra obtained from 49 saliva samples (25 healthy subjects and 24 subjects with caries) were preprocessed for feature extraction using MZmine2 (62) and submitted to mass spectral molecular networking through GNPS (43). For a detailed protocol, see Text S1.
Putative chemical structure annotation. To putatively annotate chemical structures in our mass spectral molecular networks, we performed in silico structure annotation through Network Annotation Propagation (NAP) (57) both for [MϩH]ϩ and [MϩNa]ϩ adducts. For a detailed protocol, see Text S1.
Availability of data and material. Sequence data have been submitted to NCBI under BioProject ID PRJNA478018 with SRA accession no. SRP151559. Mass spectral files, LC-MS/MS metadata file, and Nucleotide FASTA sequences of the oral biosynthetic gene cluster collection are accessible from the MassIVE repository (ftp://massive.ucsd.edu/MSV000081832) with the accession ID MSV000081832. Tables S1 and S2 are available at https://massive.ucsd.edu/ProteoSAFe/status.jsp?taskϭ1ed182b005a74b 94ac730f769b2f37b1.