Comparative Population Genomics Analysis of the Mammalian Fungal Pathogen Pneumocystis

ABSTRACT Pneumocystis species are opportunistic mammalian pathogens that cause severe pneumonia in immunocompromised individuals. These fungi are highly host specific and uncultivable in vitro. Human Pneumocystis infections present major challenges because of a limited therapeutic arsenal and the rise of drug resistance. To investigate the diversity and demographic history of natural populations of Pneumocystis infecting humans, rats, and mice, we performed whole-genome and large-scale multilocus sequencing of infected tissues collected in various geographic locations. Here, we detected reduced levels of recombination and variations in historical demography, which shape the global population structures. We report estimates of evolutionary rates, levels of genetic diversity, and population sizes. Molecular clock estimates indicate that Pneumocystis species diverged before their hosts, while the asynchronous timing of population declines suggests host shifts. Our results have uncovered complex patterns of genetic variation influenced by multiple factors that shaped the adaptation of Pneumocystis populations during their spread across mammals.

explored almost exclusively in P. jirovecii in the context of studying drug resistance and the epidemiology of PCP. While some studies have reported a strong local population structure mediated by the geographic distribution of susceptible individuals or clinical characteristics (8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20), other have suggested no or very limited global population structuring (21,22). Methodological differences and the absence of whole-genome sequences make it hard to reconcile these findings. Most studies use a combination of multiple loci (typically Ͻ6), referred to as a MLG (multilocus genotype). These MLGs are often temporally stable (4 to 9 years) within the same geographic locations (23), which supports the idea of a lack of recombination among these MLGs. Others have not observed temporal clusters in MLGs and interpret this as evidence for recombination among MLGs (19). The few population structure studies in P. carinii suggest a lack of apparent structure and genetic diversity (24). Population structures of other Pneumocystis species are largely unknown.
The concept of mitotic recombination in Pneumocystis species is supported by the presence of all the genes required for homologous recombination (5,25). Expression of subtelomeric antigenic genes is also likely mediated by telomeric recombination (26,27). However, the frequency of recombination across the genomes is unknown. Coalescent methods are now widely used to infer population-scaled recombination rates directly from genome-wide single nucleotide polymorphism (SNP) data in a population (28). Unfortunately, key metrics such as mutation rates or population sizes are unknown for Pneumocystis.
Multiclonal infections of Pneumocystis are frequent (29), but their impact on transmission or clinical severity is unknown. Analysis of a small number of genomic regions of P. jirovecii has suggested that genetic diversity is low (30). However, genetic marker selection is of concern because most of these studies have used nonneutral loci, which can bias descriptions of population genetic structures. No equivalent analysis has been described in species infecting hosts other than humans, which calls into question the generality of these observations.
To investigate the diversity and demographic history of natural populations of Pneumocystis infecting humans, rats, and mice, we performed whole-genome and large-scale multilocus sequence typing (MLST) of 93 infected tissues collected in North America, Europe, and China. Here, we describe the genomic variation and population structures within each species.

RESULTS
Genomic variation and population structure. We sequenced the whole genome of 52 strains of three distinct species (P. jirovecii, P. carinii, and P. murina) collected from North America and Denmark from 1983 to 2017 (see Table S1A in the supplemental material). P. jirovecii samples (n ϭ 33) were collected from 26 patients. P. carinii samples (n ϭ 8) were collected from rats in two animal facilities located in Bethesda (MD, USA) and Indianapolis (IN, USA). P. murina samples (n ϭ 17) were collected from infected mice from two facilities located in Bethesda (MD, USA) and New Orleans (LA, USA). We combined these data with published whole-genome sequences from six P. jirovecii isolates collected in Lausanne, Switzerland (3,27), one P. carinii isolate from Lausanne (3), and four P. carinii isolates from Cincinnati (OH, USA) and Bethesda (MD, USA) (5, 31) (Table S1B). Median coverage depths ranged from 0.1-fold to 1,387-fold, with 20% of the isolates having median coverage depths of greater than 5-fold. To identify genetic variants, we mapped reads to reference genomes (5). A total of 99% of the genomes could be mapped, and we applied stringent filtering to remove low (Ͻ5-fold)-coverage variants. We identified 57,764 SNPs in P. jirovecii, 56,771 in P. carinii, and 33,737 in P. murina (Table 1). Variants identified in genomic regions enriched with repeated subtelomeric gene families such as major surface glycoproteins (MSGs;~6% of genomes) were excluded for analyses of infrapopulations.
Preliminary investigation of 17 P. murina isolates indicated the presence of clonal populations (see Fig. S1 in the supplemental material). As most of these isolates were from a limited number of animal facilities, they probably derived from isolated repro-ducing populations. A maximum likelihood phylogenetic tree based on pairwise SNP differences generated for P. jirovecii strains provides no evidence of geographic clustering (Fig. 1).
The average level of pairwise diversity within 42 P. jirovecii strains was 1.3 ϫ 10 Ϫ2 substitutions per site, which is higher than previously reported for a comparison of two isolates (5), 5.4 ϫ 10 Ϫ3 substitutions/site for 14 P. carinii strains and 6.5 ϫ 10 Ϫ4 for 17 P. murina strains. These values are similar to the diversity levels observed within the yeasts Saccharomyces cerevisiae (5.7 ϫ 10 Ϫ3 ) (32) and Schizosaccharomyces pombe (3.0 ϫ 10 Ϫ3 ) (33). The reduced levels of polymorphisms in P. carinii and P. murina are characterized by high frequencies of rare alleles (negative Tajima's D), increased high frequencies of derived alleles (negative Fay and Wu's H), and stronger linkage disequilibrium (LD) in P. carinii (Fig. 2).
To describe the relatedness among strains in each species, we focused on SNPs located in the nuclear genomes. Pneumocystis species are predicted to be haploid and self-fertile (homothallic) (7,16) and present substantial chromosomal shuffling among species (5). These traits are considered strong suppressors of recombination (34). Estimates of population structure can be biased by the presence of large inversions as these genomic regions are inherited without recombination (35). To explore the population structure, we selected sets of SNPs in P. jirovecii (n ϭ 3,515), P. carinii (n ϭ 6,696), and P. murina (n ϭ 29,692) that were close to linkage equilibrium (pairwise r 2 , Ͻ0.5) and were randomly distributed across the genomes (Monte Carlo permutation test P ϭ 0.0019; Fig. S2). These SNPs are more appropriate for population genetic models than SNPs in linkage disequilibrium, since their use assumes widespread recombination and no linkage between markers. In P. jirovecii, a principal-component analysis (PCA) of these SNPs indicated no evidence of clustering of strains by geography ( Fig. S3). The mean Fst value (representing the proportion of between-population genetic variance for two populations) for comparisons between strains from the United States and Europe is 0.039, which indicates some population divergence. A similarly weak population structure was obtained with P. carinii strains, although nearly all samples were from diverse geographic regions of the United States (Fig. S3). Although no geographic clustering was possible for P. murina strains because of their common sampling origin, we found no evidence of clustering by collection dates (Fig. S3). Further phylogenetic and population genetic analysis (Hudson index) using P. jirovecii internal transcribed spacer 1 (ITS1) sequences extracted from our samples combined with published sequences from outbreaks in different places around the world revealed no evidence of clustering by geography (see Text S1 in the supplemental material; see also Fig. S4A and B at https://doi.org/10.5281/zenodo.1215631). This analysis was not conducted with P. carinii and P. murina because of the relatively low number of samples with distinct geographic origins. Pneumocystis species show moderate clustering of strains as determined by local geography and the distribution of susceptible host populations. Investigation of P. jirovecii population structures has provided evidence of both weak global population structures and local geographic clusters (20,22,36). To assess the relationship in our samples, we applied unsupervised genetic clustering methods NGSAdmix (37) and fastStructure (38), which do not account for the geographic origins of the strains and rely on different algorithms (maximum likelihood and Bayesian) to detect population genetic partitioning patterns. No evidence of statistically significant clusters was detected in any of three species (see Fig. S5 at https://doi.org/10.5281/zenodo.1215631), which suggests a lack of population partitioning by geography and is consistent with a previous report (22).
Recombination landscape. Meiotic recombination creates diversity influencing both natural population dynamics and selection. The genomic extent of linkage disequilibrium (LD), which represents the nonrandom association of alleles at physically distinct genomic loci, is expected to be reduced by recombination. LD is negatively correlated with recombination rates and thus provides information about the recom-FIG 1 Phylogenomic analysis reveals a lack of global population structuring in Pneumocystis jirovecii. Each symbol represents a single P. jirovecii isolate. The color-coding scheme indicates the region (city or state and country) in which the isolate was obtained. There is no clustering by location but instead an intermixing of isolates from diverse geographic locations, consistent with a lack of geographic structure. The phylogenetic tree was estimated from genome-wide SNP differences of Pneumocystis jirovecii isolates sequenced in this project (832,669 segregating sites).
Population Genomics Analysis of Pneumocystis species ® bination frequency and population structure. We analyzed pairwise LD for segregating sites in P. jirovecii (n ϭ 16,159), P. carinii (n ϭ 22,904), and P. murina (n ϭ 3,284). These data sets included exclusively biallelic SNPs, and variants located in multicopy major surface glycoproteins were excluded. The disequilibrium decayed to half of its maximum value at~16 kb and fell below 0.2 after 100 kb in two species (P. jirovecii and P. carinii) (Fig. 3), which suggests low levels of recombination and rare outcrossing events. This remained true for P. jirovecii even when the analysis was restricted to the United States isolates (data not shown). We could not obtain reliable estimates of LD decay for P. murina because the r 2 values could not be fitted to Hill's decay function (39). In comparison, the LD fell to half of its maximum value at~3 kb for S. cerevisiae ) data show that these statistics are skewed toward low-frequency variants in P. carinii and, to a lesser extent, in P. murina but not in P. jirovecii (negative Tajima's D values), which suggests that these species experienced different demographic events. (D) Box plots of squared correlations (r 2 ) between pairs of SNPs indicate strong signals of linkage disequilibrium (LD) in P. jirovecii and P. carinii. Signs above the bars indicate significant differences between species determined by the Mann-Whitney U test (***, P value of Ͻ2.2 ϫ 10 Ϫ16 ). The boxes indicate medians and interquartile ranges, whiskers indicate 95% values, and additional points in each box plot represent outliers. (32) and fell below 0.2 at 67 kb for the selfing anther smut fungal pathogen Microbotryum (40).
For P. carinii, we estimated a mean population scale recombination rate of ϭ 8.4 ϫ 10 Ϫ3 event per base pair per generation, assuming an effective population size (Ne) of 7,500 individuals (Ne values are derived from BEAST analyses as described in Text S1) and a generation length of 4.5 days (41) (~80 mitotic events per year). In P. jirovecii, we obtained a mean value of 3.1 ϫ 10 Ϫ4 , assuming a Ne level of 46,250 individuals before population decline and a generation time of 4.5 days. Of note, the generation length for P. jirovecii is not known. The reported generation times for other Pneumocystis species range from 1.7 to 10.5 days (41).
Within-individual infection complexity. Pneumocystis jirovecii pneumonia cases are often caused by multiclonal infections (29). In viruses, multiclonal infections increase the likelihood of recombination among individuals of one species that occur in the same host individual (infrapopulation) (42). To investigate the genetic complexity of our isolates, we used genome-wide read sequence alignments to estimate the multiplicity of infections (MOI) of each isolate. The MOI here refers to the number of different parasite genotypes coinfecting a single individual. Only samples with a genome-wide sequencing coverage level of Ͼ5-fold were selected, and genomic regions, including those corresponding to multicopy surface glycoprotein families, were excluded. Overall, 63% of all isolates had genetically mixed infections, with a median of 2 populations per sample (see Text S1; see also Because many samples were not amenable to direct genome sequencing due to their low Pneumocystis DNA content, we extracted panels of polymorphic sites from whole-genome-sequencing (WGS) data (2,886 SNPs in 114 loci in P. jirovecii, 1,874 SNPs in 93 loci in P. carinii, and 817 SNPs in 35 loci in P. murina). We used PCR to capture these SNPs in 22, 8, and 13 samples for each species, respectively, and sequenced the FIG 3 Decay of linkage disequilibrium (LD) as a function of distance. LD levels measured by the square of the correlation coefficient between two markers (r 2 ) were calculated for all pairs of biallelic SNPs within a 1-kb genomic window and averaged. LD decay levels were below 0.2 at 123 and 114 kb for Pneumocystis jirovecii and P. carinii, respectively (shown as gray vertical lines). The low rates of LD decay suggest low rates of recombination in Pneumocystis, which would be at least 2-fold lower than Saccharomyces cerevisiae recombination rates (32).
Population Genomics Analysis of Pneumocystis species ® PCR products by Illumina sequencing (Table S1C and Text S1). SNP allele frequencies were used to infer the number of strains present in each sample.
We first utilized the internal transcribed spacer 1 (ITS1) locus in P. jirovecii for MLST studies because this is the only locus for which sequences are available from multiple outbreaks around the world. The ITS1 locus is in a single copy per genome in Pneumocystis species (43), has an evolutionary rate similar to that of other fungi (44), and has been extensively used for epidemiological studies. In analyzing 48 ITS1 sequences from 13 patients, we detected three major haplotypes present in most samples (Fig. S4A). Haplotypes clustered more frequently with sequences from different individuals instead of clustering exclusively with those from the same individual, suggesting recurrent introductions of organisms or introductions of multiple populations at one time. To compare these results to other data sets, we merged these 48 sequences with 141 published ITS1 sequences (Fig. S4B). Analysis of nucleotide differences revealed a low level of genetic diversity ( ϭ 0.0052; Tajima's D, Ϫ2.62) and no recombination at this locus (P ϭ 0.35). Our 48 ITS1 sequences fell into two categories: 81% were nearly identical to published sequences, whereas 19% were novel sequences shared by different patients in this study. We found that some strains clustered independently of geography, which, together with the high frequency of multiple infections, suggests high transmission efficiency within host populations.
Neighborhood connectivity analysis of 56 networks (with each network corresponding to a distinct locus in addition to ITS1) that included 292 sequences confirmed that infections by multiple genetically distinct populations are more frequent than clonal expansions (78.4% versus 21.6%). Similar findings were obtained with P. carinii (26 networks, 140 haplotypes in 12 rats, 68.5% with multiclonal infections) and P. murina (28 networks, 223 haplotypes in 13 mice, 92.3% with multiclonal infections). Recombination was detected in~3% of loci (pairwise homoplasy index test P Ͻ 0.005). The overrepresentation of multiple introductions over clonal expansions strongly suggests that these haplotypes are widespread in the environment.
Estimation of the evolutionary rates. Pneumocystis species are globally distributed, but their evolutionary rate is unknown. To accurately estimate evolutionary rates, we sequenced 10 full-length P. jirovecii mitochondrial genomes and used five that were previously reported (3,45). The samples were collected from 1986 to 2013 from different geographic locations in the United States, Europe, and China (Table S1D). Fitting root-to-tip regression data indicated a diffuse positive relationship between the sampling dates (years) and the expected number of nucleotide substitutions (r 2 ϭ 0.41), demonstrating the presence of a temporal signal suitable for phylogenetic molecular clock analysis (Fig. 4) (a detailed protocol is presented in Text S1). Mantel tests as implemented in Murray's R scripts (46) showed no evidence for confounding factors of temporal and genetic structures (P ϭ 0.23). Bayes factors favor relaxed over strict clocks (log BF ϭ 5.4), which indicate variable genetic rates over the mitogenomes. We used the BEAST framework (47) to account for variations in population size and relaxation in molecular clock data. We found an estimated rate of evolution of 5.9 ϫ 10 Ϫ5 substitutions per site per year.
Changes in past population sizes. We evaluated different population modes and applied the extended Bayesian skyline model implemented in BEAST. To infer the demographic history, different coalescent models of effective population size (Ne) as well as molecular clocks were tested using only samples for which collection dates were available (Text S1). Loci encoding MSG sequences were excluded. Our data sets include 36 P. jirovecii isolates collected from 1983 to 2017 (65 nonrecombinant nuclear gene loci and 15 full-length mitogenomes) and 11 P. carinii isolates collected from 1994 to 2010 (34 nonrecombinant nuclear gene loci). P. murina isolates were not investigated because the sampling was limited to a single animal facility. We evaluated two parametric models (constant size and exponential growth) and one nonparametric model (extended Bayesian skyline plot [EBSP]) using our estimated evolutionary rate of 5.9 ϫ 10 Ϫ5 substitutions/site per year. Initial analysis supported the extended Bayesian skyline model against the null hypothesis of constant size and exponential models (log BF ϭ 3.69). We used the EBSP model and strict clocks to reconstruct historical demographics. In both P. jirovecii and P. carinii, the EBSPs rejected a constant size model because the 95% highest posterior density (HPD) excluded the value 0. Population size change tests for P. jirovecii indicated significant signals of reduction of effective population size, which started about 400,000 years ago (Fig. 5). A decline in population size was observed in P. carinii populations, with a severe bottleneck around 16,000 years ago.
Dating the species divergence. To place these events in the context of speciation, we estimated the timing of species divergence using a time-calibrated phylogeny. To improve the resolution of the tree, we first sequenced the transcriptome of the Pneumocystis species infecting rhesus macaques, which we have designated in the manuscript as P. macacae (formerly referred to as Pneumocystis carinii f. sp. macacae [NCBI taxon identifier 112250]) (see Text S1 in the supplemental material). We built two nuclear gene data sets. Data set 1 contains exclusively Pneumocystis and related Taphrinomycotina fungi (43 orthologs), and data set 2 contains 10 genes conserved in Pneumocystis and their respective hosts despite an estimated divergence time of~1.2 billion years (48). The sole purpose of data set 2 was to infer the divergence times of Pneumocystis species and their hosts in the same analysis so that the species estimates would be comparable. The low number of genes in data set 1 was due to the high fragmentation of the low-coverage P. macacae transcriptome assembly. We used Population Genomics Analysis of Pneumocystis species ® fossil-calibrated nodes as priors instead of evolutionary rates because of differences in the rates of heterogeneity across sites among distantly related species.
The divergence times for the two data sets were similar, although some discrepancies were observed (see Fig. S7A and B at https://doi.org/10.5281/zenodo.1215631). The well-supported maximum clade credibility coalescent tree (posterior probability of 1) indicated a median value of 123 million years ago (Mya) (95% confidence interval [CI], 104.4 to 145.5) for an initial emergence of Pneumocystis (Fig. 6), which is consistent with previous estimates (49). Our dated phylogenetic trees, based on conserved genes in both Pneumocystis and their hosts, indicated that P. jirovecii and P. macacae diverged 65 Mya ago (95% CI, 51.4 to 72.5) and humans and macaques~15 Mya ago (95% CI, 13 to 17), which suggested that P. jirovecii and P. macacae may have diverged before their respective hosts. P. jirovecii populations experienced a decline (0.4 Mya; Fig. 5),   (Fig. 5).

DISCUSSION
The genetic diversity of a species is strongly correlated with life history traits such as recombination rates, mating systems, gene densities, generation times, and mutation rates (50). Our results suggest that recombination levels are low for Pneumocystis, which would be consistent with the presence of recombination reducing factors such as selfing and asexuality (34) and, in theory, should lead to reduced genetic diversity. However, we found that genetic diversity in P. jirovecii and P. carinii is similar to that of free-living yeasts, despite the fact that Pneumocystis species appear to reside almost exclusively in the host lungs, which is where diversity must develop, rather than in the external environment. One possible explanation is that outcrossing events (mating Population Genomics Analysis of Pneumocystis species ® between nonrelatives) occur but might be limited to inbred subpopulations. This would seem to be supported by the fact that we detected~3% of intralocus recombination among MLST data from multiple patients. Another possibility is that there is a selective advantage in maintaining diversity (standing genetic variation [51]), for example, to avoid detection by the host immune system. Alternatively, high rates of coinfection with genetically different parasite populations could theoretically increase genetic diversity via competitive interactions among them (52), which would be consistent with our findings that most Pneumocystis infections are multiclonal. We tend to favor the last possibility, although a combination of multiple scenarios is possible. The transmission and population structures of Pneumocystis are likely influenced by the constant movement of human populations around the globe, which has potentially contributed to mixing of genetically distinct Pneumocystis populations following the advent of widespread global travel. In this context, immunocompetent individuals are powerful agents of dissemination.
Low levels of recombination and strong population structure represent a challenge for most population genetic models. Gold standard summary statistics such as Tajima's D depend on recombination rates (53). We applied the recommended methodology in this situation by combining different neutrality tests (54) and by separately analyzing regions with sufficient levels of recombination.
We did not attempt to detect positive selection (i.e., selective sweeps) because current methods to detect such events are sensitive to strong LD and assume widespread recombination (55). Demographic variations (e.g., bottlenecks and founder effects) also impair the detection of positive selection due to their confounding nature. A reliable long-term culturing method would ideally be available to investigate the genome-wide impact of selection in these populations.
We found diffuse positive relationships between collection dates and genetic distances. Several factors may explain this distortion in the regression. First, root-to-tip regression assumes that all branches evolve at a constant rate (strict clock) and that there is no population structure (56). We found that relaxed molecular clocks are more appropriate to our mitogenome data. A weak geographic structure may exist even if it is not statistically significant (Mantel test P ϭ 0.23). Regression correlations were improved by analyzing P. jirovecii mitogenomes from North America and Europe separately, although the P values were not significant (data not shown). This initial finding prompted us to consider an alternative strategy incorporating variations in evolution rates across branches as well as population size. The use of an extremely short sampling period (57) or oversampling of closely related sequences (imbalanced trees) (58) could also lead to biased estimates. These factors should not significantly impact our results because our samples were collected over 27 years from different continents and contained a significant amount of genetic divergence. Nonetheless, our estimation of evolutionary rates presents some caveats: (i) the calculated rates are based on P. jirovecii mitogenomes, but they might be different in other species; (ii) the P. jirovecii samples were from individuals and were probably collected following an antifungal treatment, which could affect the evolutionary rates. For instance, positively selected mutations in dihydrofolate synthase have been associated with sulfa/sulfone prophylaxis (59).
Our estimates of species divergence strongly suggest that Pneumocystis species diverge before their hosts. This means, for example, that P. jirovecii, which is uniquely able to infect humans, was probably infecting another (possibly extinct) species before the divergence of human and macaque lineages. This suggests that Pneumocystis species do not strictly cospeciate along with their hosts and that host shifts likely occurred at some point in their evolution. The P. jirovecii populations seem to have been constant for~15 Mya, possibly since their separation from P. macacae, and suddenly to have declined at~0.4 Mya, which roughly coincides with the emergence of Homo sapiens species 0.4 to 0.7 Mya ago (60). Given our relatively small size sampling (Ͻ50 individuals), it is possible that our Bayesian demographic reconstructions lack power to capture recent population expansions, as shown by simulations (61). We speculate that this sudden population decline could indicate a host shift. The continuous population declines of Pneumocystis might indicate that these species are heading toward extinction as a consequence of the activity of mechanisms such as Muller's ratchet (62). The strict host species specificity of Pneumocystis is consistent with this scenario because extreme host specialization of parasites often results in extinction (63).
Pneumocystis organisms exhibit diversity in karyotype patterns among species (64,65). Unfortunately, the karyotype variability cannot be addressed in the context of the current study since karyotypes are not available for the samples that were used in this study. For karyotyping, fresh samples need to be processed in a very short period of time in a manner that minimizes degradation of DNA, and we utilized frozen samples that in a number of cases had been stored for many years and thus could not be used for karyotyping.
P. wakefieldiae is another species that infects rats and which is almost always found associated with P. carinii. None of the rat samples that we used contained P. wakefieldiae (verified at the sequence level); therefore, we cannot comment on the evolutionary profile of P. wakefieldiae or on its potential interaction with P. carinii.
The lack of a culture system and the difficulty in obtaining sufficiently pure Pneumocystis DNA are major obstacles in the research field. Bronchoalveolar lavage (BAL) fluid samples, which represent the most accessible source for P. jirovecii DNA, are not suitable for WGS without enrichment. The use of whole-genome amplification methods is not ideal but is necessary to obtain sufficient sequence data. By scanning the sequencing depth in our samples, we found no evidence of preferential amplification or reduction of particular sequences. We minimized these effects by applying stringent bioinformatics filtering to produce a comprehensive catalog of polymorphisms that reasonably capture genetic variation patterns.
Conclusion. Our results demonstrate the feasibility of comparative population genome analyses of uncultivable organisms directly obtained from host tissues even with low levels of infection. Our results have uncovered complex patterns of genetic variation influenced by multiple factors that ultimately shape the adaptation of Pneumocystis populations during their spread across mammalian hosts. These results provide a dynamic view of the evolution of Pneumocystis populations and improve our understanding of its transmission.

MATERIALS AND METHODS
Sampling. Animal and human subject experimentation guidelines of the National Institutes of Health were followed in the conduct of these studies. P. jirovecii samples were obtained from infected autopsy lung tissues or BAL fluid pellets; some samples had been utilized in previous studies (3,5). P. murinainfected lung samples were obtained from CD40 ligand knockout mice (66), and P. carinii-infected lung samples were obtained from immunosuppressed Sprague-Dawley male rats. Rhesus macaque lung samples were obtained from a simian immunodeficiency virus (SIV)-infected animal with PCP.
Nucleic acid extraction and sequencing. Genomic DNA was extracted using a MasterPure yeast DNA purification kit (Epicentre). Total RNA was extracted using an RNeasy minikit (Qiagen).
Samples with low Pneumocystis DNA content (mostly BAL fluid samples) were enriched by the use of a NEBNext microbiome DNA enrichment kit (New England Biolabs, Inc.) that selectively removes CpG-methylated host DNA and preserves the microbially diverse populations. Enriched samples were purified by ethanol purification. Five microliters of each DNA was amplified in a 50-vol reaction using an Illustra GenomiPhi DNA V3 DNA amplification kit (GE Healthcare, United Kingdom). Purified samples were bar-coded, pooled, and sequenced using an Illumina HiSeq platform with a Nextera library preparation kit. P. jirovecii full-length mitogenome DNA was extracted, sequenced as described previously (45), and annotated using MFannot (https://github.com/BFL-lab/Mfannot).
The linkage disequilibrium was calculated as r 2 , which indicates the square of the correlation coefficient between two SNPs, using VCFtools and excluding singletons and multiallelic SNPs. The values corresponding to the mean LD for each nonoverlapping 1-kb window within the genomes were averaged. The estimate decay of LD with distance was determined by fitting the observed r 2 values to the decay function (39) in R. The population-scaled recombination rates ( ϭ 2 Ner [where Ne represents the effective population size and r is the recombination rate]) were calculated using whole-genome scaffolds and excluding small chromosomes enriched with MSGs using the INTERVAL program in LDhat v.2.2 (76). Data sets for use with the INTERVAL program were extracted from VCF files using VCFtools. We excluded multiallelic sites and sites with minor-frequency allele values of Ͻ0.1 or with a proportion of missing data of Ͼ0.5. INTERVAL's Markov chain Monte Carlo scheme was run for 1e6 iterations, with a block penalty of 5, a sampling step performed every 5,000 iterations, and a burn-in phase of 100. Likelihood look-up tables were generated from precomputed tables using LDhat's lkgen program (https://github.com/auton1/LDhat) and adjusted for Pneumocystis total nucleotide diversity estimated from 10-kb nonoverlapping windows. Results were summarized using Stat Interface LDhat with a 10% burn-in value.
Species divergence dating. Nuclear gene sequences in the following categories were downloaded from NCBI (last accessed January 2018): fungi (P. jirovecii, P. carinii, P. murina, Schizosaccharomyces pombe, and Taphrina deformans) and mammals (Homo sapiens, Macaca mulatta, Rattus norvegicus, and Mus musculus). One-to-one orthology was inferred using a reciprocal best BLASTN hit with an E value of 10 Ϫ10 as the threshold (77). Two data sets were constructed. The first data set included only fungi, and the second data set included Pneumocystis species and their mammalian hosts. Data set 1 included 43 single-copy orthologs, and data set 2 included 10 genes. Multiple alignments were generated using MUSCLE (78), manually inspected for inconsistencies in Jalview (79), and concatenated. The best-fitting substitution model was the GTRϩG model as estimated by jModel v.2.1.7 (80) using the corrected Akaike information criterion. Divergence times were estimated using BEAST2 v.2.4.6 (47). For data set 1, we applied an internal calibration point corresponding to the separation of S. pombe and T. deformans (median, 467 MYA) (81) as a constraint following a lognormal prior distribution. For data set 2, we applied calibration points to the human-macaque node (median divergence, 28 MYA) (82) and to the mouse-rat node (16 MYA) (83). BEAST inputs were prepared using BEAUTi with a strict clock model, and a calibrated Yule model was used to estimate the divergence times and the credibility intervals. Trees were summarized using TreeAnnotator (http://beast.bio.ed.ac.uk/treeannotator) and visualized using FigTree v.1.4.2 (http://tree.bio.ed.ac.uk/software/figtree) to obtain the means and 95% higher posterior densities (HPD).
Data availability. All sequence data have been linked to NCBI Umbrella project PRJNA385300. The mitochondrial genomes have been deposited into GenBank under accession codes MH010437 to MH010446. Data sets used for molecular clocks and phylodynamic analyses are available at https://doi .org/10.5281/zenodo.1215631.