Transcriptome Remodeling Contributes to Epidemic Disease Caused by the Human Pathogen Streptococcus pyogenes

ABSTRACT For over a century, a fundamental objective in infection biology research has been to understand the molecular processes contributing to the origin and perpetuation of epidemics. Divergent hypotheses have emerged concerning the extent to which environmental events or pathogen evolution dominates in these processes. Remarkably few studies bear on this important issue. Based on population pathogenomic analysis of 1,200 Streptococcus pyogenes type emm89 infection isolates, we report that a series of horizontal gene transfer events produced a new pathogenic genotype with increased ability to cause infection, leading to an epidemic wave of disease on at least two continents. In the aggregate, these and other genetic changes substantially remodeled the transcriptomes of the evolved progeny, causing extensive differential expression of virulence genes and altered pathogen-host interaction, including enhanced immune evasion. Our findings delineate the precise molecular genetic changes that occurred and enhance our understanding of the evolutionary processes that contribute to the emergence and persistence of epidemically successful pathogen clones. The data have significant implications for understanding bacterial epidemics and for translational research efforts to blunt their detrimental effects.


IMPORTANCE
The confluence of studies of molecular events underlying pathogen strain emergence, evolutionary genetic processes mediating altered virulence, and epidemics is in its infancy. Although understanding these events is necessary to develop new or improved strategies to protect health, surprisingly few studies have addressed this issue, in particular, at the comprehensive population genomic level. Herein we establish that substantial remodeling of the transcriptome of the human-specific pathogen Streptococcus pyogenes by horizontal gene flow and other evolutionary genetic changes is a central factor in precipitating and perpetuating epidemic disease. The data unambiguously show that the key outcome of these molecular events is evolution of a new, more virulent pathogenic genotype. Our findings provide new understanding of epidemic disease.
In parallel with studies of bacterial population genetic structure, there has been interest in identifying the precise genomic changes that contribute to the emergence, numerical success, and epidemic behavior of members of some bacterial species. A major effort has been devoted to analysis of comprehensive, populationbased samples of the strict human pathogen S. pyogenes (commonly, group A streptococcus [GAS]) as a model pathogen (28)(29)(30)(31)(32)(33)(34)(35). S. pyogenes is endemic in humans worldwide and periodically causes epidemics of superficial (e.g., pharyngitis and impetigo) and invasive (e.g., necrotizing fasciitis, pneumonia, myositis) infections. Globally, the organism causes an estimated 711 million human infections and over 500,000 deaths annually (36). The species is genetically diverse, with more than 240 emm-types (typing based on sequence differences in the hypervariable aminoterminal portion of the emm gene encoding the antiphagocytic Emm virulence protein; http://www.cdc.gov/abcs/index.html), and approximately 650 multilocus sequence types (MLSTs) (http://spyogenes.mlst.net) described.
In the early 1980s, a dramatic increase in the frequency and severity of infections caused by S. pyogenes led to the recognition of a global pandemic caused by emm1 strains (37)(38)(39)(40)(41)(42)(43)(44). This pandemic afforded the opportunity to compare preepidemic and epidemic strains for potential bacterial factors contributing to this global health problem. To gain insight into the emergence, dissemination, and diversification of emm1 strains causing this pandemic, we sequenced the genome of 3,615 emm1 infection isolates (32). Phylogenetic analyses revealed that the pandemic emm1 strains that emerged are a genetically closely related clonal population that evolved from its most recent preepidemic progenitor in the early 1980s. The key genetic event underpinning the pandemic was acquisition by HGT and recombinational replacement of a 36 kb segment of the S. pyogenes core chromosome (i.e., that portion of the chromosome/genome that is largely conserved across emmtypes and not present on obvious mobile genetic elements such as phages and integrative-conjugative elements) that mediated enhanced production of toxins NAD ϩ glycohydrolase (SPN [S. pyogenes NADase]) and streptolysin O (SLO) (32). A subsequent study (35) showed that the striking upregulation of SPN and SLO production and altered virulence phenotype by members of the pandemic clone occurred as a consequence of only three single nucleotide polymorphisms (SNPs). Two are located in the Ϫ35 to Ϫ10 spacer region of the promoter sequence upstream of the nga-ifs-slo transcriptional unit and resulted in increased gene expression. The third, a nonsynonymous SNP in the nga gene, increases the activity of SPN, a secreted cytotoxin virulence factor (45). Additional evidence supporting the notion of upregulation of SPN and SLO as a contributing cause of S. pyogenes epidemic disease was found by sequence analysis of 1,125 emm89 genomes (35) obtained in comprehensive population-based surveillance studies conducted in the United States, Finland, and Iceland between 1995 and 2013. Among these emm89 strains, we identified three distinct phylogenetic clades (designated clade 1, clade 2, and clade 3). The current worldwide recent increase in the incidence of emm89 invasive infections corresponded temporally with the emergence and expansion of clade 3 strains upregulated in SPN and SLO production (35,46). Thus, progress is being made in understanding genomic alterations that are linked with increases in disease frequency and severity in some human pathogens. However, despite these advances, very little analogous work has been conducted to investigate global changes in gene expression that may contribute to the origin and perpetuation of bacterial epidemics. Similarly, there is a general lack of studies linking genome variation, transcriptional changes, and altered virulence in epidemic forms. The primary goal of this investigation was to study how genome variation linked with changes in transcriptome and altered virulence might contribute to the origin and perpetuation of bacterial epidemics, using the ongoing S. pyogenes emm89 epidemic as a convenient model system. We used comparative pathogenomics to dissect the precise molecular genetic events that have mediated the evolutionary origin and diversification of the epidemic emm89 strains. Unexpectedly, we found that a high frequency of HGT events has shaped the emm89 population genetic structure to a far greater extent than vertically inherited SNPs and short insertions and deletions (indels). Three main mechanisms that mediate HGT in bacteria have been described: conjugation, transduction, and transformation. Although S. pyogenes is not considered to be naturally competent, analysis of MLST data found S. pyogenes to have a level of recombination comparable to that of Streptococcus pneumoniae, a species that is naturally competent (47). The mechanism mediating the relatively high level of recombination detected in GAS is not known, but, given the prevalence of phage in GAS genomes, generalized transduction may play an important role. Global transcriptome sequencing (RNAseq) analysis was conducted on genetically representative preepidemic and epidemic emm89 strains to determine the extent to which the genomic changes causing altered gene expression may have contributed to the epidemic. We found that HGT is extensive in the emm89 population and has contributed disproportionately to the diversification of virulence factors and their expression. Nonsynonymous SNPs in major regulatory genes and other modest genetic changes have also led to transcriptome remodeling intimately linked with the origination and perpetuation of the epidemic. The results have significant implications for understanding epidemic bacterial disease and for translational research efforts designed to control or limit the detrimental effect of infectious agents. The overall strategy used as described here is of general utility and pertinence to the investigation of other pathogens.

RESULTS AND DISCUSSION
Population genetic structure and contribution of horizontal gene transfer (HGT). We studied 1,200 emm89 S. pyogenes strains, virtually all (n ϭ 1,198) cultured from patients with invasive infections that occurred between 1995 and 2014 ( Fig. 1; see also Table S1 in the supplemental material). The great majority of strains (n ϭ 1,180) were collected as part of comprehensive population-based studies conducted in the United States, Finland, and Iceland. The genomes of all 1,200 strains were sequenced to a mean 60-fold depth of coverage (range, 13-fold to 440-fold) using an Illumina paired-end strategy, and polymorphisms were identified. Inference of genetic relationships using core chromosomal SNPs revealed that these emm89 strains have a major population of 1,193 strains and a minor population of 7 substantially diver-gent genetic outlier strains ( Fig. 2A). Bayesian clustering showed the major emm89 population of 1,193 strains to comprise 3 primary clades (Fig. 2B). The genomes of 3 strains representing the genetic backgrounds of organisms assigned to the three primary clades (MGAS11027 of clade 1, MGAS23530 of clade 2, and MGAS27061 of clade 3) were closed and annotated (see Fig. S1). The epidemiological information available for the 1,200 strains revealed that clade 3 strains emerged and expanded rapidly in the United States, Finland, and Iceland, displacing the corresponding predecessor clade 1 and 2 strains in the populations studied (Fig. 1). These findings are consistent with the preliminary data that we recently reported (35).
The emm89 population genomic data revealed an unprecedented level of genetic diversity for strains of a single S. pyogenes emm-type. Comparison of the emm89 genome sequences with data available for 37 S. pyogenes genomes of 18 other emm-types (see Table S2 in the supplemental material) showed that the emm89 strains are the only emm-type to have two deeply rooted branches in the phylogenetic network (Fig. 3). Although we found evidence of recombination within the emm89 population, the random distribution of SNPs and the lack of sequence identity of the 7 minor population emm89 outlier strains with sequences of another GAS emm-type or MLST argue that these strains have not arisen through emm-type switching.
We identified extensive genomic diversity between and within the three primary emm89 clades. The mean genetic distance (MGD) among the 1,193 strains of the 3 clades was 610 SNPs in the core genome (Fig. 2B). In striking contrast, among 3,615 emm1 strains collected in 8 countries on two continents over 45 years (i.e., a collection 3 times larger, from a broader geographic region, and a period 2.5 times longer than those used for analysis of the emm89 sample), the MGD was only 106 core SNPs (32).
There was a nonrandom distribution of SNPs throughout the emm89 core genomes. Multiple regions had elevated SNP density, indicating HGT and core genomes with a mosaic evolutionary history ( Fig. 4; see also Fig. S1 in the supplemental material). Gubbins (genealogies unbiased by recombinations in nucleotide sequences) statistical analysis of SNP distribution (48) identified 2,316 regions of putative HGT with a mean size of 3,695 bp (range, 4 bp to 71,774 bp) at 526 loci around the genome. Because HGT can distort inferences of genetic relationships and evolutionary history, the phylogeny of the strains was reassessed using sequences filtered to exclude regions of recombination (Fig. 2C). This analysis greatly reduced the MGD (i.e., average pairwise core SNPs) among the 1,193 strains by 78% (from 610 to 134), a level similar to that found in 3,615 emm1 strains (32). The MGD from clade 1 to clade 2 and the MGD from clade 2 to clade 3 were reduced by 87% and 76%, respectively. The MGD between strains within each of the clades also was substantially reduced. The MGD strain-to-strain within clade 1 went from 226 to 100 (Ϫ56%), within clade 2 from 83 to 63 (Ϫ24%), and within clade 3 from 244 to 45 (Ϫ82%). Importantly, however, after exclusion of SNPs present in chromosomal segments associated with HGT events, 3 primary clades still remained among the 1,193 strains. Outgroup rooting with the genome of emm1 reference strain SF370 showed that the evolutionary pathway leading to the current emm89 epidemic lineage had clades branching in the sequence of clade 1 followed by clade 2 and then clade 3 (see Fig. S2 in the supplemental material). Clade 1 and clade 2 strains differed by 8 regions of HGT encompassing 171.1 kb or 10% of the genome, and clade 2 and clade 3 strains differed by 6 regions of HGT encompassing 15.3 kb or 0.9% of the genome ( Fig. 4 and Table 1). Seven of the 8 HGT regions differentiating clade 1 and clade 2 are most similar in sequence to regions in emm2 reference genome MGAS10270 (see Fig. S3). Of special note, 33 isolates in clade 3 differed from the 725 other clade 3 strains by one additional HGT. These strains, designated subclade 3D (SC-3D) (Fig. 2B), first occurred in the Finland sample in 2009 and have disproportionately increased in prevalence in recent years as a cause of bloodstream infections in that country (see Table S1 and Fig. S4).
HGT events are responsible for the bulk of the core sequence differences between the clades. The transferred sequences encompass multiple genes encoding many known secreted and cell surface-associated virulence factors, including the pilus/Tantigen adhesin, fibronectin-binding protein FbaB, the toxin pair NGA and SLO, internalin InlA, C5a peptidase ScpA, antiphagocytic M-like proteins Enn and Mrp, virulence regulators Mga and Ihk-Irr, immunogenic secreted protein Isp1, and the HasABC capsule synthesis enzymes (49). These HGT events have had important consequences. For example, clade 1 strains differ from clade 2 and 3 strains in pilus/T-antigen, and the clade 3 strains cannot produce capsule due to loss of the hasABC genes. Of note, different pilus types have been shown to vary in cell adherence and tissue tropism, and differences in the levels of production of capsule and SPN and SLO cytotoxins can alter virulence (35,49,50).
Consistent with SPN and SLO playing a key role in S. pyogenes strain emergence and enhanced fitness, each of the three clades has a distinct nga-ifs-slo region resulting from two independent HGT events. In addition, SC-3D strains differ from the other clade 3 strains due to HGT of a region encoding the SpyA and SpeJ virulence factors (49,(51)(52)(53). Inasmuch as these multiple HGT events involve regions encoding virulence factors, it is reasonable to hy- pothesize that many of these HGT events alter host-pathogen interactions.
Variation in gene content and phage genotype. HGT in bacteria can be mediated by mobile genetic elements (MGE), phages, and integrative-conjugative elements (ICEs). S. pyogenes phages commonly encode one or more secreted virulence factors such as streptococcal pyrogenic exotoxin superantigens and streptococcal phage DNases (54,55). S. pyogenes ICEs usually encode one or more factors mediating resistance to antibiotics such as tetracycline and macrolides (54). Horizontal acquisition of antibiotic resistance and novel virulence factor genes, mediated by ICEs and phages, has been associated with localized outbreaks and large epidemics of S. pyogenes infections (29). MGE content was investigated in 1,193 emm89 isolates relative to the combined gene content (Ͼ53,000 genes) of 30 GAS genomes of 18 emm-types (see Tables S1 and S2 in the supplemental material). This analysis identified 64 different profiles of MGE content (Fig. 5). ICEs were infrequent in the strain sample. The three most prevalent MGE content profiles, or phage genotypes (PGs), accounted for 72% of the strains (Fig. 5). These three phage genotypes (PG01, PG02, and PG03) correspond to the phage content of the reference genomes for each of the three primary clades (Fig. S1 in the supplemental material). With the exception of PG02 (defined as lack of prophages), most phage genotypes were confined to a single clade. The most prevalent (43%) PG in clade 1 was PG03 (phage 11027.1 encoding SpeC and Spd1 and phage 11027.2 encoding Sdn). Also prevalent were PG05 (13%) and PG06 (11%) strains, potentially derived from PG03 strains by phage loss. Most clade 2 strains are Isolates are colored by cluster as determined using Bayesian analysis of population structure (BAPS) as indicated in the hierarchy below the figure. Three major clades (C1, C2, and C3) are defined at the first level of clustering. Subclade 3D (SC3D), a recently emerged and expanding population of strains in Finland, is defined at the second level of clustering. Indicated for the inferred phylogenies are the mean genetic distances (MGDs), both inter-and intraclade, measured as differences in core chromosomal SNPs. The mean genetic distance among strains within clades is less than the MGD to strains of the nearest neighboring clade(s). Bootstrap analysis with 100 iterations gives 100% confidence for all of the clade-to-clade branches (i.e., C1-C2, C2-C3, and C3-SC3D). (C) Genetic relationships based on 8,989 SNPs identified among the major population of 1,193 strains, filtered to exclude horizontally acquired sites as inferred using Gubbins. Exclusion of sites attributed to horizontal gene transfer events collapses the MGD strain-to-strain both within and between the clades. The MGD within the clades remains less than the MGD to the nearest neighboring clade(s). Trees in panels B and C are illustrated at the same scale. Beres et al. PG02 (72%), having no phages. The abundance of PG02 strains representing 20% of the entire emm89 cohort is unusual in that, prior to our investigation, nearly all S. pyogenes genomes had been found to be polylysogenic (55). Most clade 3 strains are PG01 (62%), having phage 27061.1 encoding SpeC and Spd1, followed next in prevalence by PG02 (22%). Of note, although phages 11027.1 and 27061.1 are integrated at the same genomic locus and encode the same two secreted virulence factors, they are different phages (see Fig. S5). PG01 (presence of 27061.1) first occurred in our strain samples in 2003, a time that corresponds to the emergence of the epidemic clade 3 strains. However, the acquisition of 27061.1 by the emm89 population does not result in the epidemic clade 3 strains acquiring new phage-encoded virulence genes that were not already prevalent in the preepidemic clade 1 strains. This finding suggests that acquisition of phage-encoded virulence genes was not a key driver for the emergence of epidemic clade 3 organisms as has been speculated (46).
HGT and extensively remodeled global transcriptomes. One school of thought postulates that HGT events are similar to point mutations in that most of them are neutral, or nearly so, and have little effect on pathogen traits. The unexpected magnitude of HGT events in the study population (based on previous findings from analysis of other S. pyogenes emm-types) provided a unique opportunity to test the hypothesis that these HGT events have enhanced the virulence of the epidemic emm89 strains by remodeling of the global transcriptome. As a consequence of the greater technical difficulty and expense involved, global transcriptional variation has been far less studied than genomic variation in bacterial pathogens. Moreover, since the data corresponding to the groups of samples studied here were population based and comprehensive and included temporal-spatial information, we had the additional opportunity to assess the potential effect of transcriptome remodeling on strain emergence and dissemination. We used RNAseq to compare transcript variations at two growth points among genetically representative strains of clades 1, 2, and 3 (Fig. 6). These strains have the allelic variant of the major virulence regulators covRS, mga, and ropB that is most common to the clades they represent. These regulators lack known functionaltering polymorphisms that influence S. pyogenes gene expression and virulence (49,(56)(57)(58)(59)(60). The number of genes differentially expressed in stationary-phase growth exceeded the number in exponential-phase growth by approximately 3-fold in all of the clade-to-clade comparisons (Fig. 7A). A general finding was that the greater the genetic distance between strains was, the greater the number of genes significantly altered in transcription. The largest number of differentially expressed genes was recorded between strains MGAS11027 (clade 1) and MGAS23530 (clade 2), consistent with strains in these clades being separated by the greatest MGD (Fig. 2). Genes altered in transcript level by 1.5-fold or greater accounted for 14% and 36% of the genome at the exponential and stationary growth phases, respectively, in comparisons of MGAS11027 (clade 1) and MGAS23530 (clade 2) (see Table S3, section 1, in the supplemental material). Although genes (n ϭ 182) located within the eight distinct regions of HGT differentiating clade 1 and clade 2 comprise only 11% of the gene content, they accounted for 24% of the differentially expressed genes at exponential growth, a highly nonrandom occurrence (P Ͻ 0.0001). Importantly, genes encoding many key virulence factors had significantly different transcript levels, including the fibronectin/collagen/T-antigen (FCT) region pilin genes, nga-ifsslo, speG, ideS, ska, sclA, fba, enn, emm, mrp, and mga (49). Collectively, these findings demonstrate that the genome segments that had been horizontally acquired and retained on the evolutionary pathway leading from clade 1 to clade 2 strains have contributed disproportionately to remodeling the global transcriptome, including many virulence genes, and argue that they are likely not selectively neutral.
The genomic changes accruing in the molecular evolution of clade 2 to clade 3 are of considerable interest because they are associated with the emergence, dissemination, and recent rapid increase in the frequency of emm89 invasive infections recorded in many countries (46,(61)(62)(63)(64)(65). In contrast to the 11% of the gene content reshaped by HGT in the transition from clade 1 to clade 2, a more modest 1% was reshaped in the transition from clade 2 to clade 3. Despite this modest 1% change we found that in comparing the transcriptomes of clade 2 strain MGAS23530 with clade 3 strain MGAS26844, 4% and 11% of the genes were differentially expressed at the exponential and stationary growth phases, respectively ( Fig. 7A; see also Table S3, section 2, in the supplemental material). Genes located within regions of HGT were significantly overrepresented among the differentially expressed genes in exponential growth (P Ͻ 0.0001). Included among the 28 genes with significantly increased expression in exponential growth were the critical virulence genes nga-ifs-slo (Fig. 7B). To confirm that increased expression of nga and slo is a trait broadly common to clade 3 strains, we assessed the expression of these genes by quantitative PCR (qPCR) in 11 strains selected to represent the range of genetic and geographic diversity present in the emm89 major population ( Fig. 7E and F). The subclades represented by these 11 isolates encompass 1,120 (94%) of the 1,193 strains of the major emm89 population. All 5 of the clade 3 strains had significantly greater nga and slo expression than all 6 of the clade 1 and 2 strains (P Ͻ 0.001). This is consistent with previous findings for Nga NADase activity assessed for 27 strains of the cohort (50). Importantly, significantly increased transcription of nga-ifs-slo was asso-  Table 1. SNPs are nonrandomly distributed. Regions of elevated SNP density correspond to predicted horizontal gene transfer/recombination blocks. ciated with the emergence and epidemic increase in S. pyogenes emm1 invasive infections (32,34,35). Additional genetic changes that differentiate epidemic clade 3 strains from the most recent predecessor clade 2 strains are acquisition of phage 27061.1 encoding speC and spd1 and loss of the hasABC capsule synthesis genes. To explore the role these genetic changes have potentially played in contributing to the emergence of the epidemic clade 3 strains, we inspected transcript data for the speC and spd1 genes and hasABC virulence factor genes between the preepidemic (clade 1 and clade 2) and epidemic (clade 3) emm89 representative strains. Transcript levels of speC and spd1 were significantly greater for the preepidemic clade 1 MGAS11027 strain than for the epidemic clade 3 MGAS26844 strain at both phases of growth assessed (Fig. 7C). The finding of significantly lower levels of speC and spd1 transcripts in the genetically representative epidemic clade 3 strain further argues that presence of these virulence factors in the clade 3 lineage is unlikely to have conferred a fitness advantage relative to clade 1 strains and therefore is an unlikely mechanism for the emergence of the epidemic clone and displacement of the predecessor clade 1 and clade 2 strains (46). Similarly, although the epidemic clade 3 strains are incapable of producing the antiphagocytic hyaluronic acid (HA) capsule due to HGT-mediated loss of the hasABC genes, the transcript data indicate that this gene loss was likely not responsible for a significant decrease in capsule production between the clade 2 and 3 strains. We found that transcription of hasABC was very  weak in clade 2 strain MGAS23530 under both growth conditions assessed (Fig. 7D), arguing that capsule production was already negligible before the HGT-mediated loss of the hasABC genes by the clade 3 lineage. Capsule production was strong only for clade 1 strain MGAS11027 at the exponential growth phase. We next investigated the molecular basis for the differences in capsule production using all strains of clades 1 and 2. Sequence variation in the hasABC promoter has been reported to alter transcription and capsule production (66). Inspection of the genome sequence data, coupled with Sanger sequencing of the hasABC promoter for all clade 1 and 2 strains, identified two major variants (see Fig. S6A in the supplemental material). These promoter variants corresponded to strong clade 1 strain MGAS11027 and weak clade 2 strain MGAS23530 hasABC transcription. Whereas the two promoter variants are equally represented among clade 1 strains, the vast majority (88.5%) of clade 2 strains had the weak transcription variant (see Fig. S6B and C in the supplemental material). Expression of hasA correlated perfectly with the promoter variant (Fig. 7G), which is consistent with results of HA production assays previously reported for 27 strains of the cohort (50). Importantly, hasA transcript levels for strains with the weak promoter variant were not significantly different from those of the clade 3 strains that lack the hasABC genes. Thus, the evolution of clade 3 from a clade 2 progenitor strain likely involved a transition from very little capsule production to no capsule production. This again argues that loss of the hasABC genes by the clade 3 lineage is unlikely to confer a fitness advantage relative to the clade 1 and 2 strains and is therefore an unlikely mechanism for the epidemic emergence and displacement of the predecessor lineages. Whereas some S. pyogenes outbreaks have been associated with strains having a hyperencapsulation phenotype (33,67) we are unaware of a body of epidemiological data associating GAS epidemic outbreaks with a loss of capsule phenotype. To summarize, the global transcriptome data comparing the preepidemic and epidemic strains show that neither production of phage-encoded virulence factors SpeC and Spd1 nor lack of production of the antiphagocytic HA Transcription of hasABC was very weak for the clade 2 strain at both growth phases and was significantly less than for the clade 1 strain at the mid-exponential growth phase (P Ͻ 0.002). (E, F, and G) Relative transcript levels as measured by qPCR for nga, slo, and hasA, respectively. Differences in strain-to-strain expression were assessed by one-way ANOVA. Expression of nga and slo was significantly greater for all 5 clade 3 strains than for all 6 clade 1 and 2 strains (P Ͻ 0.001). All 3 clade 1 and 2 strains with hasABC weak/repressed promoter pattern A expressed significantly less hasA than all 3 clade 1 strains with strong/derepressed promoter pattern B (P Ͻ 0.001). Levels of expression of hasA were not significantly different among all 3 strains with weak/repressed promoter pattern A and all 5 genetically acapsular clade 3 strains. RB, recombination block; RPKM, reads per kilobase per million reads mapped.

ME ES h a s B h a s C h a s A h a s B h a s
capsule is a characteristic unique to the emergent clade 3 strains relative to the predecessor clade 1 and 2 strains and therefore does not correspond to the epidemic increase in invasive infections.
The very recent emergence of SC-3D strains is temporally associated with a single HGT event in which SC-3D strains acquired an 18 kb sequence that includes 21 genes, including genes encoding the secreted virulence proteins SpyA, a C3-like ADPribosyltransferase, and SpeJ, a pyrogenic exotoxin superantigen (49,(51)(52)(53). On the basis of the nearly identical sequences, this 18 kb region likely was acquired from an epidemic emm1 clone donor. Differentially expressed genes accounted for 2% and 11% of the genome at the exponential and stationary growth phases, respectively, in comparisons of the transcriptomes of strain MGAS26844 (clade 3) and MGAS27520 (SC-3D) ( Fig. 7A; see also Table S3, section 3, in the supplemental material). This was the lowest number of differentially expressed genes among the four genetically representative strains studied, consistent with SC-3D strains being a recently emerged closely genetically related subset of the epidemic clade 3 strains.
Further transcriptome remodeling and epidemic perpetuation. Discovery of significant alteration of transcriptomes caused by HGT events, and the role in emergence and dissemination of clade 3 organisms, led us to investigate the hypothesis that additional transcriptome remodeling contributed to perpetuating the emm89 epidemic. We tested this hypothesis by focusing on SC-3D strains, because these organisms disproportionately increased in frequency in Finland starting from 2013 ( Fig. 2A; see also Fig. S4 and Table S1 in the supplemental material). Given the relatively modest number of genes differentially expressed between MGAS26844 (clade 3) and MGAS27520 (subclade 3D), we interrogated the genome data for candidate polymorphisms that may further alter the transcriptome and potentially influence pathogen behavior. Analysis of the genome sequences of the 33 SC-3D strains found unique single amino acid replacements in gene regulators CovR (S130N) and LiaS (K214R). These polymorphisms were prevalent among the SC-3D strains; 11 strains had the CovR (S130N) change, and 6 strains had the LiaS (K214R) change (see Fig. S4). In contrast, none of the other 1,183 emm89 or 3,615 emm1 strains (32) studied had these polymorphisms. The branching of the strains with these mutations in the inferred phylogeny and their absence in other S. pyogenes strains indicate identity by descent rather than identity by independent mutation (i.e., commonality by evolutionary convergence).
Repeated recovery of clonal progeny with either the CovR (S130N) or LiaS (K214R) polymorphisms from invasive episodes has not been reported previously and thus was unexpected. Because relatively little is known about liaS in S. pyogenes, we elected to study the LiaS (K214R) polymorphism in more detail. Consistent with our altered-transcriptome hypothesis, RNAseq analysis showed that the transcriptome of strain MGAS27710 LiaS (K214R) differed from that of SC-3D LiaS wild-type strain MGAS27520, including significant changes in expression of several virulence genes (data not presented). However, as these two strains are not isogenic, the extent to which the altered transcription was due to the LiaS (K214R) polymorphism could not be assessed. To address this issue, we constructed a LiaS (K214R) isogenic mutant from parental strain MGAS27556 and conducted RNAseq analysis. We found that, compared to the wild-type parental strain, the LiaS (K214R) isogenic mutant had 127 and 70 differentially expressed genes in exponential-phase growth and stationary-phase growth, respectively (see Table S3, section 4, in the supplemental material). Virulence genes significantly increased in expression by the LiaS (K214R) isogenic mutant included all 9 genes of the streptolysin S biosynthesis operon (sagABCDEFGHI) in exponential phase and speG encoding streptococcal pyrogenic exotoxin G in stationary phase.
The capacity of the CovR (S130N) and LiaS (K214R) naturally occurring mutant strains to repeatedly cause serious infections means that they can effectively spread between hosts and implies that they are not attenuated in the ability to survive in the upper respiratory tract, the more common S. pyogenes niche. Consistent with this idea, we found that the naturally occurring mutant strains had an enhanced ability to survive in human saliva ex vivo relative to SC-3D wild-type strain MGAS27520 (Fig. 8H). These results contrast with data showing that strains with other covR or covS (covR/S) mutations have reduced survival in human saliva relative to wild-type strains (68).
Comparative strain virulence. The epidemiological, comparative genomic, and transcriptome data demonstrate that clade 1, 2, and 3 organisms are genotypically and phenotypically distinct and strongly suggest differences in virulence. To test this hypothesis, the three genetically distinct reference strains for each clade were compared in mouse and nonhuman primate models of necrotizing fasciitis (NF) (69)(70)(71). Epidemic clade 3 reference strain MGAS26844 was significantly more lethal and caused significantly greater tissue damage in the mouse NF infection model than the two preepidemic reference strains ( Fig. 8A and B). Moreover, relative to clade 1 reference strain MGAS11027, epidemic clade 3 strain MGAS26844 caused significantly larger lesions with greater tissue damage in a nonhuman primate model of NF (Fig. 8C to E).
Concluding comment. We have used S. pyogenes as a model pathogen for studying the evolutionary genomics of epidemic disease and the molecular basis of bacterial pathogenesis. The organism is a strict human pathogen, causes abundant infections worldwide, and has a relatively small genome (~1.8 Mb). In addition to its propensity to cause epidemic waves, the availability of comprehensive, population-based strain collections from many countries, coupled with the fact that humans are its only natural host, means that the history of underlying events that generate genomic diversity is not obscured by molecular processes occurring in nonhuman hosts or environmental reservoirs. These factors afford considerable advantages in the use of S. pyogenes as a model system compared to many other pathogenic bacteria such as E. coli, S. enterica, and S. aureus.
The primary goal of our study was to determine if genomic changes linked with the origin and perpetuation of human epidemic disease have remodeled global gene expression and altered virulence in the model pathogen S. pyogenes. We were especially interested in determining the effect, if any, of horizontally acquired genome segments on global gene expression and virulence of the progeny strains. Despite the importance of bacterial pathogens in human and veterinary health, remarkably few studies have addressed how transcriptome remodeling contributes to the origin and perpetuation of epidemics. Zhou et al. (26) studied diversity in 149 genomes of S. enterica serovar Paratyphi A and used the resulting data to speculate that most recent increases in frequencies of bacterial diseases are due to environmental changes rather than to the novel evolution of pathogenic bacteria. In essence, it was suggested that many epidemics and pandemics of bacterial disease in humans did not involve recent evolution of particularly virulent organisms but instead reflected chance environmental events. A similar conclusion was reached in studies of other pathogens, for example, Yersinia pestis, S. enterica serovar Agona, Mycobacterium tuberculosis, Mycobacterium leprae, and Shigella sonnei (17,25). Although this may be the case for some pathogens, on the basis of the full-genome data from 4,815 strains, human patient information (33), analysis of isogenic mutant strains, RNAseq studies, and experimental animal infection, we arrive at a fundamentally different conclusion for emm89 and emm1 S. pyogenes, organisms that have caused epidemics involving tens of millions of human infections in the last 30 years. In particular, our results unambiguously show that newly emerged clones causing epidemic disease are more virulent than previously circulating precursor organisms. For clarity, we consider all steps in pathogen-host interaction to potentially contribute to the virulence phenotype, including survival and proliferation after initial contact with the host through invasion of deeper tissues and spread to new hosts. Conclusions about molecular pathogenesis and virulence based solely or predominantly on population genomic analyses of a convenience sample of strains and resulting inferences are not likely to fully reflect the biology of pathogen and host interaction. This issue may be especially problematic if only one or a few nucleotide changes significantly alter virulence.
We believe that our findings have important implications for bacterial pathogens that must successfully circumvent host defenses, at both the individual level and the population level. Our analysis demonstrated that, among the various emm89 clades and subclades, considerable variation exists among global transcrip-tomes, in both the spectrum of genes expressed and their magnitude of expression. This means that, in essence, many different antigen, toxin, and virulence factor profiles can be and are being displayed to host populations as a function of individual strain genotype and not necessarily of emm-type. In the absence of one or a small number of conserved antigens mediating protective immunity, regardless of the microbe, significant variations in antigen repertoire have implications for vaccine research, formulation, and deployment.
Many elegant studies of the population genomics of bacterial pathogens have been published over the last decade (4, 5, 10-18, 23, 25, 26, 32, 72-76). There is a small but emerging literature bearing on the impact of regulatory plasticity in bacterial evolution and fitness (77)(78)(79)(80)(81). However, there has been very little work designed to integrate microbial population genomics, molecular pathogenesis processes, microbial emergence, transcriptome remodeling, and virulence. Our findings suggest that this could be a fruitful area of research for other microbial pathogens. The resulting data are likely to have significant implications for understanding bacterial epidemics and for translational research efforts to blunt their detrimental effects.

MATERIALS AND METHODS
Further details of the materials and methods used are described in Text S1 in the supplemental material.
Bacterial strains. We studied 1,200 GAS emm89 strains, including 1,198 strains causing invasive infections and two from pharyngitis patients (see Table S1 in the supplemental material). The vast majority of the (H) Viability of naturally occurring variant strains MGAS28980 CovR (S130N) and MGAS27552 LiaS (K214R) in human saliva persisted for 2 and 4 weeks longer, respectively, than that of wild-type strain MGAS27520. No growth, Ͻ10 CFU/ml for a 1:10 dilution. IM, intramuscularly; NHP, nonhuman primate. strains (n ϭ 1,178) were collected as part of comprehensive populationbased public health surveillance of GAS invasive infections conducted in the United States, Finland, and Iceland between 1995 and 2014. The remaining emm89 strains were recovered from invasive disease cases in Ontario, Canada, and from a pharyngitis case in Italy. A subset of this population has been previously studied, and preliminary genetic findings have been presented (35,48).
Genome sequencing. Isolation of chromosomal DNA, generation of paired-end libraries, and multiplexed sequencing were accomplished as described previously (32,35) using Illumina (San Diego, CA) instruments (HiSeq2500, MiSeq, and NextSeq). Whole-genome sequencing data for the 1,200 isolates studied were deposited in the NCBI Sequence Read Archive.
Reference genome assembly, annotation, and polymorphism discovery. The bioinformatics tools used for assembling and annotating the reference genomes and for identifying and analyzing polymorphisms in the population studied are described in Text S1 in the supplemental material. Complete genome sequences for genetically representative strains MGAS11027, MGAS23530, and MGAS27061 were deposited in the NCBI GenBank database. MGAS11027, MGAS23530, and MGAS27061 were deposited in the BEIR strain repository.
Phylogenetic inference and population structure. The bioinformatics tools used for sequence alignments, detection, and filtering of HGT polymorphisms, for clustering and phylogenetic inference, and for analysis of the population structure are described in Text S1 in the supplemental material.
Gene content and mobile genetic element analysis. The known GAS pangenome core and accessory gene content was determined based on 30 complete genomes of 18 different emm-types (see Table S2 in the supplemental material) as described in Text S1 in the supplemental material. Among the 53,336 coding sequences (CDSs) of the 30 genomes, PanOCT identified 3,338 ortholog clusters, culled by BLAST reciprocal-best-hit analysis to 2,835 on the basis of the criterion of no two clusters sharing Ͼ95% amino acid identity. A GAS pseudo-pangenome sequence of~3 Mbp was generated by concatenating onto the emm89 MGAS23530 reference genome all accessory gene content not already present in the genome, starting with emm89 strains MGAS11027 and MGAS27061, and then the remaining 27 genomes by emm-type (i.e., emm1, emm2, emm3, etc.). Based on mapping of the emm89 reference genome sequencing reads to the GAS-30 pangenome, an RPKM (reads per kilobase of transcript per million reads mapped) value of Ͼ50 corresponded to gene presence. A phage was called present if a minimum of 80% of its gene content represented in the GAS-30 pangenome was found to be present. Reads not mapping to the GAS-30 pangenome were assembled de novo using SPAdes. Resultant contigs with greater than 100 nucleotides were queried against the NCBI nonredundant database using BLAST to determine their nature.
Construction of isogenic mutant strains. The construction of the liaS isogenic mutant strain was accomplished by allelic exchange as previously described (35). Briefly, MGAS27556 LiaS (K214R) was generated by introducing the liaS A641G SNP into wild-type strain MGAS27556, using DNA amplified from strain MGAS27552, a clinical isolate with a naturally occurring liaS A641G SNP (i.e., LiaS K214R substitution) as the template. Successful introduction of the desired SNP and the absence of spontaneous spurious mutations were confirmed in candidate isogenic mutants by whole-genome sequencing. Primers, plasmids, and restriction enzymes used in the construction are listed in Text S1 in the supplemental material.
Transcriptome sequencing and expression analysis .Whole-genome transcriptional analysis was conducted for strains genetically representative of the clades and subclades studied using RNAseq as previously described (35,82) with minor modifications. Briefly, RNA was isolated from triplicate cultures grown in Todd-Hewitt broth supplemented with yeast extract (THY). Multiplexed libraries were subjected to single-end sequence analysis (50 bp) to a high depth value (~10 million reads/sample) with an Illumina HiSeq2500 instrument. RNAseq reads were mapped to the genome of the most closely related emm89 reference strain (for example, clade 3 strains were mapped to the genome of reference strain MGAS27061). Use of multiple reference sequences was critical, as the use of a single common reference did not permit accurate quantitative read mapping to the divergent sequences in the regions of HGT. RNAseq data were normalized, and genes statistically differently expressed following Benjamini-Hochberg correction at a minimum 1.5-fold change in mean transcript level were identified using the bioinformatics tools provided in Text S1 in the supplemental material. RNAseq transcriptome data were deposited in the NCBI Gene Expression Omnibus database. Expression levels of the key virulence genes nga, slo, and hasA were assessed by quantitative real-time PCR in triplicate for 11 strains genetically representative of the most abundant subclades of the major population using primers, probes, and protocols previously described (50). The significance of strain-to-strain differences in expression was assessed by one-way analysis of variance (ANOVA).
Experimental animal infections. The virulence of serotype emm89 reference strains MGAS11027, MGAS23530, and MGAS26844 was assessed using mouse and nonhuman primate models of necrotizing fasciitis (32,(69)(70)(71). These strains have a wild-type (i.e., the most commonly occurring) allele for all major transcription regulators, including covR/S, ropB, and mga. All animal experiments were approved by the Institutional Animal Care and Use Committee of Houston Methodist Research Institute.
Accession numbers. Whole-genome sequencing data for the 1,200 isolates studied were deposited in the NCBI Sequence Read Archive under accession number SRP059971. Complete genome sequences for genetically representative strains MGAS11027, MGAS23530, and MGAS27061 were deposited in the NCBI GenBank database under accession numbers CP013838, CP013839, and CP013840, respectively. MGAS11027, MGAS23530, and MGAS27061 were deposited in the BEIR strain repository under accession numbers NR-33707, NR-33706, and NR-50285, respectively. RNAseq transcriptome data were deposited in the NCBI Gene Expression Omnibus database under accession number GSE76816.