A New Lineage of Cryptococcus gattii (VGV) Discovered in the Central Zambezian Miombo Woodlands

Cryptococcus gattii is an environmental pathogen that causes severe systemic infection in immunocompetent individuals more often than in immunocompromised humans. Over the past 2 decades, researchers have shown that C. gattii falls within four genetically distinct major lineages. By combining field work from an understudied ecological region (the Central Miombo Woodlands of Zambia, Africa), genome sequencing and assemblies, phylogenetic and population genetic analyses, and phenotypic characterization (morphology, histopathological, drug-sensitivity, survival experiments), we discovered a hitherto unknown lineage, which we name VGV (variety gattii five). The discovery of a new lineage from an understudied ecological region has far-reaching implications for the study and understanding of fungal pathogens and diseases they cause.

C ryptococcosis is a severe fungal infection responsible for high levels of mortality and morbidity worldwide (1). The etiological agents are two species complexes of the genus Cryptococcus: Cryptococcus neoformans and C. gattii. While the first described cases of clinical cryptococcosis due to these two distinct species complexes were reported in the mid-1890s under the names Saccharomyces hominis (2) and S. subcutaneous tumefaciens (3), respectively, clinical Cryptococcus isolates have been taxonomically treated as a single species (C. neoformans) for more than 100 years (4). Heterogeneity among cryptococcosis-causing yeast isolates became increasingly apparent from the middle of the 20th century onward and led to the recognition of four serotypes (A, B, C, and D) on the basis of the antigenic determinant of capsular polysaccharide (5,6). Subsequent discovery of two distinct sexual cycles produced by the isolates of serotypes A and D versus serotypes B and C (7,8) and phylogenetic analysis using various gene sequences (9)(10)(11) confirmed these complexes to be genetically divergent enough to be considered separate species. Thus, in 2002, the isolates of serotypes B and C were formally classified as C. gattii (12) while C. neoformans includes all serotype A and D strains (13).
Over the past 2 decades, population structure analysis of the two species using molecular typing methods such as PCR fingerprinting (14), amplified fragment length polymorphism (AFLP) analysis (15), and multilocus sequencing (16) has demonstrated that both species contain genetically diverse lineages that qualify them to be considered two species complexes, and those species complexes have been further subdivided into numerous molecular types (14,17). To date, four major lineages, denoted VGI/AFLP4, VGII/AFLP6, VGIII/AFLP5, and VGIV/AFLP7, are recognized for C. gattii. Recently, a fifth genotype was described on the basis of a single strain but with several different designations, including clade B (based on multilocus sequence typing [MLST]), VGIIIc/VGIV, and C. decagattii (17,18). Elevation of these five lineages to separate species has been proposed previously (17). However, such taxonomic treatment is currently controversial, mainly due to the lack of clear biological differences between the lineages and of a clear consensus with respect to the limits and numbers of the putative species boundaries. As such, the various C. gattii lineages are collectively considered the "C. gattii species complex" (18).
In this paper, we describe the discovery of a new lineage/molecular type within the C. gattii species complex, which we designate VGV (variety gattii five). The six VGV isolates were identified among 32 C. gattii isolates recovered from soil, animal dung, and tree bark samples collected in Zambia by Vanhove et al. in 2013 (19). In this paper, we characterize genomic and phenotypic features of the VGV molecular type. Additionally, we present a new, improved genome assembly and gene sets for C. decagattii (17) which we confirmed for the first time to represent a separate lineage and which we therefore name "VGVI" for consistency with the other lineages. distribution of private alleles (i.e., alleles not shared between lineages) ( Fig. 3a and b), maximum likelihood phylogenetic reconstruction (Fig. 2a), Wright's fixation index (F ST ) (see Fig. S4 in the supplemental material) or NeighborNet Network data (Fig. 2b). Additional population genetics analyses confirmed low levels of genetic exchange between the six well-resolved C. gattii lineages. For example, principal-component analysis (PCA) resolved distinct grouping for the lineages, with the first component (PC1) separating VGII from all other lineages, forming distinct clusters for VGIII and VGVI on PC2 (Fig. 2c). The projection of PC3 and PC4 further allowed identification of distinct tight clusters for the VGI, VGIV, and VGV lineages (Fig. 2d).
The new VGV lineage is represented by six isolates falling within two distinct subclades (A and B). Clade A comprises three VGV isolates (MF5, MF13, and MF54) that were recovered from soil and animal dung sampled in hyrax middens, from which we also identified VGI and VGII isolates (Fig. 1a). Clade B comprises a further three VGV isolates: two (MF34 and MF51) that were recovered approximately 345 km away from clade A and a third (MF56) that was recovered approximately 430 km away from the other clade B isolates. Clade B isolates were recovered from both a tree hole and hyrax   (Table S1). Analysis of the relative proportions of shared private alleles for the C. gattii lineages ( Fig. 3a and b) indicates that VGII shared the fewest alleles (Ͻ4.6 kb total; Ͻ0.2 SNPs per kb) ( Fig. 2 and Fig. 3a) with any of the other lineages, reflecting its greater divergence. The newly discovered VGV lineage shared fewer alleles with VGVI (0.24 per kb) and VGIII (0.29 per kb) than with VGI (1.18 per kb) and VGIV (2.53 per kb). The lineages that shared the most private alleles were VGVI and its closest relative, VGIII (92 kb total; 5.37 per kb), which account for an average of 12% of all SNPs (based on alignments to VGII) found in isolates from those lineages. (b) Alleles not shared between C. gattii lineages (i.e., private alleles) (SNPs per kilobase). (c) Nucleotide diversity () within each lineage against the number of isolates representing each lineage. (d) Admixture K optimization based on cross-validation error. (e) Unsupervised ADMIXTURE clustering analysis of all isolates at K ϭ 9. (f) Lineage-specific gene counts and lineage-specific gene loss counts. The tree topology is based on the core ortholog RAxML tree, setting equal branch lengths, and the numbers of multilineage-specific gene gains and losses are shown above internal nodes. Nearly 1 in 10 nucleotides in the C. gattii genome has an alternative allele across the six lineages (1.55 ϫ 10 6 sites; 9.01% of the C. gattii genome). Indeed, Ͼ180 kb of these unique/private alleles were identified for each lineage, including for VGV, which had 220-kb private alleles (12.75 per kb) (Fig. 3b). VGI is the most distinct in terms of both the highest count of private alleles (378 kb/21.93 SNPs per kb) and its nucleotide diversity () (Fig. 3c), which is reflected in the three distinct subclades of VGI isolates in the whole-genome phylogeny ( Fig. 2a and b). Conversely, the three VGVI isolates are thought to be derived from a single isolate recovered from a patient in Mexico and subsequently distributed to different laboratories where they have been renamed and sequenced (14,17,20,21). Its few clonal differences are illustrated by its low nucleotide diversity () (Fig. 3c).
Unsupervised model-based clustering identified highly structured ancestry components enriched in each of the lineages. The clustering solution with the lowest crossvalidation error value (K ϭ 9) grouped the VGV isolates into a single genetically homogenous group (Fig. 3d and e; see also Fig. S1) while identifying four unique components within the VGII lineage. Of these, subclades VGIIx and VGIIb share small proportions of ancestry with other defined VGII subclades. For example, VGIIb is inferred to share ancestry with other VGII subclades (isolates Ram5 and B8554) and other lineages (B7394 has alleles from VGIV, and B7735 has alleles from VGV). Conversely, none of the isolates in VGIIa and VGIIc have a demonstrable admixture with other subclades or lineages, both being formed by single unique ancestry components. VGIII isolate B8212 (a clinical isolate collected in Oregon, USA, in 2007 [22]) is also modelled as sharing ancestry with VGVI.
Finer-scale clustering was performed by considering patterns of genome-wide haplotype sharing in fineSTRUCTURE (23). Here, VGV isolates forming a separate cluster with greater haplotype similarity to isolates from VGI, VGIII, and VGIV than VGII (Fig. S2). While the haplotype sharing patterns overwhelmingly accorded with each lineage being genetically distinct, a notable exception was VGIII isolate B8212, which shares haplotypes with VGIV and VGVI (also in accordance with model-based clustering), perhaps owing to a small amount of genetic exchange with one or both of those lineages. As also observed using ADMIXTURE-based clustering (Fig. 3e), two isolates from VGII, B7394 and B7735, were also genetically distinct and were assigned to their own cluster which was most closely related to isolates from subclade VGIIb.
All six VGV isolates were haploid, with no evidence for aneuploidy based on allele frequencies and depth of coverage (Fig. S3). However, we did observe copy number variation (CNV) between the three VGVI isolates derived from a single clinical isolate from Mexico (14,17,20,21). Specifically, isolate CBS11687 acquired an ϳ200-kb duplication of supercontig 5 (sc5) (position 1040000 through the end of the supercontig). Separately, isolate WM1804 had a smaller, 40-kb duplication on sc21 (positions 150,000 to 190,000). Isolate WM1802 had neither CNV. In terms of base changes, the three VGVI isolates (WM1802, WM1804, and CBS11687) differed by only 419 SNPs, with the lowest number found between WM1804 and CBS11687 (n ϭ 126) and the highest number found between WM1802 and CBS11687 (n ϭ 315). These genetic differences may have occurred as a result of microevolution during or following passaging or cryopreserving, although large CNVs are common in C. gattii (24,25). All of the newly isolated VGI (n ϭ 7) and VGVA (n ϭ 3) samples from Zambia had a small, Ͻ10-kb duplication within supercontig 6 of the R265 genome (kb position 400 to position 410). This genomic region encodes a single 87-amino-acid (aa) protein that is conserved in C. neoformans and C. gattii but has no functional annotation (by PFAM, GO terms, KEGG-EC, TMHMM, or SigP4).
The results from our phylogenetic and population genetic analyses are in line with previous work (26), indicating that lineages within the C. gattii species complex have remained largely genetically isolated since their divergence. Pairwise lineage calculations of , representing Weir's formulation of Wright's fixation index (F ST ), suggest very low levels of genetic exchange between the lineages (Fig. S4), which is also reflected in analyses of genetic structure ( Fig. 2 and 3; see also Fig. S1 and S2). Both depth-of-coverage plots and F ST nonoverlapping sliding 10-kb window plots across the mating type locus (MAT) at the start of supercontig 18 demonstrate that all VGIV and VGV isolates included in this study were MAT␣ (the reference genome of R265 is MAT␣; high depth of coverage and values of Ͼ0.98 across the MAT loci). In contrast, for VGI, VGII, VGIII, and VGVI, MATa isolates were included in our panel.
Genome assembly and analysis of VGV reveals C. gattii lineage-specific differences. We assembled and annotated a nearly complete genome for the newly discovered C. gattii VGV lineage (isolate MF34) using both Oxford Nanopore and Illumina sequencing reads. The resultant assembly consisted of 15 contigs corresponding to the 14 chromosomes; the single break in one chromosome corresponds to the ribosomal DNA (rDNA) region. Other than underrepresenting rDNA genes, this assembly provides a complete representation of the genome, with telomeric repeats (TTAGGG) present at 28 contig ends. Gene annotation revealed 6,322 predicted protein-coding genes, which is similar to the data from the seven other representative C. gattii isolates with publicly available complete genomes (26,27) representing the four previously known major lineages (ranging from 6,092 to 6,763), as well as C. neoformans H99 (28) (n ϭ 6,962) ( Fig. S5 and S6).
To establish the evolution of protein-coding genes in C. gattii, we compared the gene content data for two representative annotated genomes per lineage where possible (no second annotated reference genomes were available for VGIV, VGV, and VGVI), identifying 4,565 single-copy core orthologs that are shared among the five lineages of C. gattii and C. neoformans (ϳ74% of Cryptococcus genes) ( Table 2). For VGVI, we sequenced and assembled the WM1802 isolate, obtaining a similar genome length (17.42 Mb) and a similar count of protein-coding genes (n ϭ 6,092). For VGII, we included the updated VGII R265 PacBio assembly in our panel of genomes (29) ( Table 2). Detection orthology analysis performed between just the two R265 assemblies identified 91% of genes in 1:1 orthology (n ϭ 5,642), ϳ4% of genes unique to the new assembly (n ϭ 252), and ϳ6% of genes in paralogous clusters (n ϭ 364). The previous VGII R265 assembly had 635 genes that were not called in the new assembly, likely representing a difference in the annotation protocol. Analysis of core eukaryotic genes (CEGs) and BUSCO revealed a high level of completeness of gene sets and increased completeness in the new annotation (Fig. S6). Furthermore, all assemblies generated using long-read sequencing technology assembled into 14 scaffolds/supercontigs, suggesting that all Cryptococcus lineages/species have conserved numbers of chromosomes.
Ortholog amino acid differences within and between lineages were consistent with results from our phylogenetic and population genetic analyses. VGV MF34 had the highest amino acid sequence similarity to VGIV IND107 (53,000 amino acid differences ϭ 97.55% similarity), which was observed in both alignment-based and orthologbased phylogenies ( Fig. 2 and Fig. 4). The most similar interlineage orthologs were those of VGIII and VGVI (49,500 predicted amino acid differences ϭ 97.71% similarity) (Table S1), while the most distinct pairwise comparisons were those between C. gattii a Asterisks (*) indicate genome assemblies that are newly described in this paper. All others have been described previously (26). and C. neoformans (between 205,000 and 218,000 amino acid changes; ϳ90% protein similarity). Overall, synteny is conserved within C. gattii (26), though with notable differences between some lineages. For example, VGV has a single 171-kb inversion on supercontig 7 (positions 544,906 to 716,249) compared with the middle of VGIV IND107 supercontig 7 and the middle of VGIII CA1280 supercontig 5 (Fig. 4). VGVI also has some syntenic differences from its closest relative, VGIII ( Fig. 2) (see also Fig. 3 and Fig. 4). For example, approximately half of VGVI supercontig 5 is syntenic for the start of VGIII (CA1873) supercontig 16, while the second half of VGVI supercontig 5 is syntenic for a middle region of VGIII supercontig 1, indicative of a chromosomal translocation. Further improvements and additional genome assemblies should establish the full number and genetic impact of lineage-specific genomic rearrangements.
Lineage-specific genes and multilineage-specific genes (found in two or more lineages) were identified in each of the lineages ( Fig. 3f; see also Fig. S7). Many of these lineage-specific genes (223/605; 37%) were previously identified from a panel of genome assemblies without the addition of VGV and VGVI (26). A further 53/605 (9%) of the newly detected lineage-specific genes were previously categorized as multilineage-specific genes. The lineage-specific genes in newly sequenced lineages (VGV and VGVI) include 74 genes that were unique to VGV and 49 genes that were uniquely absent in VGV. Genes unique to VGV include those encoding two sugar transporters (D1P53_002216 and D1P53_002944), an alcohol dehydrogenase (D1P53_004471), and an aldehyde dehydrogenase (D1P53_006242). Conversely, eight transmembrane proteins and a single uncharacterized secreted protein were uniquely missing in VGV. All of the genes involved in the ergosterol biosynthesis pathway were present in single-copy To the right is a synteny plot, visualizing regions that span three or more orthologs between any two species as a connected gray line. Supercontig numbers are shown above each genome axis if longer than 400 kb, where a plus sign (ϩ) represents the forward orientation and a minus sign (-) represents the negative orientation.

Farrer et al.
® form in VGV. The VGVI WM1802 genome includes 80 genes that are unique and lacks 127 genes that are uniquely absent. Among the unique genes in WM1802, 14 are predicted to be involved in transport and include genes encoding three monosaccharide transporters, one hexose transporter, one cadmium ion transporter, and one monocarboxylic acid transporter.
Predictably, C. neoformans VNI H99 had the greatest number of lineage-specific/ absent genes, with 578 unique genes and 28 absent genes. These included 47 genes predicted to encode transmembrane proteins (including four sugar transporters, five major facilitator superfamily [MFS] transporters, and a caffeine resistance transporter), and 34 secreted proteins. Fewer genes were uniquely absent in C. neoformans, which included genes encoding eukaryotic translation initiation factor 3 subunit B, an ACC oxidase, a copper amine, an allantoate permease of the major facilitator, and a 3-hydroxyacyl-dehydrogenase with oxidoreductase activity. Full details of all lineagespecific genes are provided in Table S1.
Phenotypic characteristics of VGV. All six isolates that belonged to the new VGV lineage based on whole-genome sequencing (Table 1) were first incorrectly identified as VGIV based on the URA5 restriction fragment length polymorphism (RFLP) banding pattern. Unlike most of VGIV, all six VGV isolates were serotype B. These were further characterized as MAT␣ and as melanin and urease positive (Fig. S8) and grew well at 37°C (Fig. 5A). However, the VGV strains grew slightly more slowly on canavanine glycine bromothymol blue (CGB) agar (Fig. S8) (Table 1). Hence, the positive blue/green color development on CGB agar took longer in VGV than in the control strains (Fig. S8). Since VGV is genetically closest to VGIV ( Fig. 2a and b; see also Fig. S2), two serotype C VGIV strains were used as control isolates for further phenotypic comparisons (WM779 isolated from a cheetah in South Africa [16] and MF46 isolated from Miombo tree bark in Zambia near Hyrax middens) ( Table 1). The size and morphology of VGV yeast cells were typical for Cryptococcus and indistinguishable from those of cells of the control strains (Fig. 5B). Two distinct patterns of capsule formation were found among the six VGV isolates grown in yeast extract-peptone-dextrose (YEPD) broth (Fig. 5B). The isolates recovered from clade A (MF5, MF13, and MF54) produced thinner (Յ1-m-thick) capsules than those recovered from clade B (MF34, MF51, and MF56), which produced thick (2-to-4-m-thick) capsules similar to those seen with the VGIV control strains.
The VGV isolates and the control strains of VGIV manifested unusually high resistance toward fluconazole (FLC), particularly given that they were sampled from an environmental niche. The three isolates of VGV clade A were more resistant to FLC, with drug MICs of Ն128 g/ml, than the three isolates in clade B, which showed drug MICs of 24 to 64 g/ml. All six VGV isolates showed a drug MIC of 0.0625 g/ml for 5-fluorocytocine (5-FC), a value similar to that shown by WM779. The drug MIC of MF46 for 5-FC, 4 g/ml, was unusually high. The VGV amphotericin B MIC ranged between 0.5 and 1 g/ml, higher than the levels measured for the control strains, which showed drug MICs below 0.5 g/ml (Fig. 5C).
To explore the relative pathogenicities among VGV subclades we selected two isolates from clade A (MF5 and MF13) and two isolates from clade B (MF34 and MF51) for inoculation in mouse models. Mice infected by all four isolates survived for 70 days, while all the mice infected by WM779, a virulent serotype C control isolate, succumbed to infection within 30 days postinfection (Fig. 5D). VGIV environmental isolate MF46 (serotype C) caused no death in the mouse model. Fungal loads in the lungs of VGV-infected mice were substantially lower than those seen with the mice infected with WM779 and slightly lower than those seen with the mice infected with MF46. Brain fungal loads of mice infected with the VGV strains were also low to negligible. The control isolates of VGIV showed little neurotropism (Fig. 5E). Histopathological analysis of the lungs demonstrated significant pathology in WM779-infected mice, with yeast found throughout the lung together with notable disruption of lung tissue. In many locations, extensive leukocyte recruitment was evident in areas of concentrated infection (Fig. 6).
Histopathological analysis of the VGV isolates displayed substantially lower pulmonary yeast levels. That said, mice infected by MF46, MF34, and MF51 had higher levels of yeast than those infected by MF5 and MF13 as shown by both CFU counts (Fig. 5E) and the results of histopathological analysis (Fig. 6), in which MF51 was shown to represent clade B. The lung histopathology of the mice infected by MF34 was similar to that of the mice infected by MF51 (data not shown). Notable for its absence, leukocyte infiltration was mostly low or absent from sites of VGV infection. MF13 showed some leukocyte infiltration to a subset of infectious foci (Fig. 6).
Identification of VGV by URA5 RFLP. The patterns of URA5 RFLP, obtained by double digestion with Sau96I and HhaI, have been widely used to identify the lineage/ molecular type in both C. neoformans and C. gattii species complexes (14). The URA5 RFLP of clade A isolates obtained by Sau96I/HhaI digest showed a pattern identical with that of VGIV (Fig. S9A). Those of clade B, however, produced an additional 1.3-kb amplicon which was absent from clade A and from all of the other VG molecular type reference isolates. This 1.3-kb amplicon was present even in uncut DNA of clade B isolates ( Fig. S9A and B), but its nature is not known at this juncture. Since the URA5 RFLP patterns of VGIV and VGV were not conclusively different, we compared the URA5 gene sequence of VGV isolate MF34 to that of VGIV reference strain WM779 to identify possible restriction enzymes that can clearly distinguish the two lineages. This led to our identifying two highly discriminatory restriction enzymes: StuI and EarI. The expected sizes (measured in base pairs) of the URA5 gene fragments resulting from StuI digestion are as follows: 221 bp, 237 bp, and 322 bp in VGIV and 237 bp and 543 bp in VGV. The EarI digestions produced 247-bp and 501-bp fragments in VGIV and 247-bp and 300-bp fragments in VGV. We compared URA5 RFLP data from 17 VGIV isolates (Table S1) with the data from 6 VGV isolates obtained by StuI or EarI digestion, and the results are shown in Fig. S9C and D.
Type strain of VGV. We have designated MF34 (clade B isolate) as the type strain of VGV. MF34 was isolated from a tree hole located in Mutinondo (latitude Ϫ12.45, longitude 31.29), Central Zambezian Miombo Woodlands (Table 1). Its genome has been assembled and annotated to near-completion (15 scaffolds with N 50 ϭ 1.3 Mb and telomeric repeats at 28 contig ends). MF34 is serotype B and MAT␣ and causes mild pneumonia in C57BL/6 mice with negligible neurotropism.

DISCUSSION
Over the past decade, increased sampling worldwide and whole-genome sequencing (WGS) methods have uncovered greater genetic diversity of important pathogens, including the C. neoformans and C. gattii species complexes. For example, sampling from Botswana revealed the existence of the C. neoformans VNB lineage (30), which itself has recently been shown to be deeply split into two genetically isolated lineages, VNBI and VNBII (31). Thus far, VGVI is the only lineage that exists as a single genotype since the three isolates previously designated C. decagattii appear to have been isolated from the same patient (21). The previously identified lineages of Cryptococcus have recently been designated separate taxonomic species based on phylogenetic species recognition criteria (17). While we agree that Cryptococcus contains a number of genetically diverse and monophylectic clades that can be viewed as species under an evolutionary species concept (32), we have previously argued that it is premature to give each clade a separate taxonomic name at this juncture (18, 33). One notable concern raised by Kwon-Chung et al. (18) was that the proposed seven-species taxonomy (33) was likely to be unstable due to incomplete knowledge of the true extent of Cryptococcus diversity worldwide. Our discovery of C. gattii VGV from the Miombo woodlands of Zambia clearly shows that we have not yet achieved a full understanding of the global biodiversity of Cryptococcus and that further exploration will likely yield additional phylogenetic species. Until we have a more accurate consensus on the true numbers of Cryptococcus lineages, we propose that the names "VN" and "VG" serve as practical "Zip code" identifiers within C. neoformans and C. gattii, offering a convenient way to describe newly discovered lineages or recombinants without introducing unwanted nomenclatural instability and confusion.
Our discovery of C. gattii VGV in samples from hyrax-associated environments suggests an association with these mammals. Hyrax are small herbivores that are most closely related to elephants (Proboscidea) and sea cows (Sirenia) and are characterized by the behavior of defecating in communal latrines, usually located in crevices in rocky kopjes, over many generations (34). These locations are often sheltered in rocky caves and droppings are likely to accumulate for upwards of 50,000 years, in some cases forming a stable paleoenvironmental hot spot of urea-rich nitrogenous material (35). Cryptococcus has a pronounced trophism for urea as a nutritive substrate, and pigeon guano is known to support prolific growth of C. neoformans and (to a lesser extent) C. gattii (36). Our finding that hyrax middens are hotspots of Cryptococcus diversity suggests that their ecological stability in landscapes that are low in nitrogen availability may lead to them being important arenas for the evolution of Cryptococcus and that they will likely be fertile ground for further discovery of diversity within this genus.
Fungal association with small mammals may suggest adaptations that confer pathogenicity, known as the "endozoan, small-mammal reservoir" hypothesis (37), and deserves to be explored further following our findings of an association of Cryptococcus with hyrax. Accordingly, alongside further study of potential mammalian reservoirs, the search for VGV clinical isolates is also needed in order to understand the true virulence potential of VGV and whether it can spill over into humans. Murine models have shown that environmental isolates are less virulent than clinical isolates of the same molecular type in both the C. gattii and C. neoformans species complexes, suggesting that polymorphic virulence factors exist (38,39). However, despite its great genetic distance from all other lineages, the new VGV lineage is not clearly distinguishable from others by existing methods such as serotyping or the routinely used approach of RFLP analysis employing Sau96I and HhaI URA5 digestion (14). Thus, it is possible that previously analyzed isolates belonging to VGV may have been misidentified using non-WGS methods. The most likely candidates for the search of clinical VGV are VGIV serotype B isolates recovered from patients. Geographically, the most likely place to find the VGV clinical isolates appear to be in sub-Saharan Africa since the current panel of isolates were found in the Zambian environment within an ecoregion that includes Tanzania, Burundi, Democratic Republic of the Congo, Angola, and Malawi.
Previous work has shown that most isolates of the C. gattii species complex cause pulmonary infection in a murine model with low neurotropism (20,40,41). The four VGV isolates tested here were less neurotropic than the MF46 VGIV isolate that was collected from the same Zambian environment, and all the examined Zambian envi-ronmental isolates were significantly less virulent than a VGIV control strain, WM779. It remains to be shown whether the differences in neurotropism are due to lineagespecific genes or to alleles in VGV. As previous work has shown in C. neoformans (42), capsule size difference manifested by clades A and B in vitro was unrelated to virulence in mice.
Although serotypes have not yet been conclusively linked to virulence in Cryptococcus, they remain important for strain identification. The majority of C. gattii isolates tested to date have been of serotype B, except for a subset of VGIII and the majority of VGIV isolates, which are serotype C. The six VGV isolates are also all serotype B-but, due to the lower growth rate on CGB agar, the CGB reaction was weaker than that seen with other isolates of serotype B or serotype C. It took 24 h longer for VGV than for the other VG isolates (VGI to VGIV) to turn the medium dark blue. As the six VGV isolates were all serotype B whereas the majority of VGIV isolates (the most closely related lineage) reported thus far have been serotype C, it is possible that VGV may also occur in serotype C. Additional environmental sampling of VGV is therefore necessary to establish the dominant serotype, since the current sample size of six is insufficient for definitive conclusions.
Surprisingly, five of the six VGV isolates and the two control VGIV isolates were highly resistant to fluconazole (MIC of Ͼ64 g/ml), a commonly used antifungal drug. The three isolates of VGV clade A were more resistant to FLC than the isolates of VGV clade B. Although the C. gattii species complex was previously known to be more resistant to FLC on average than C. neoformans (43), such high resistance to FLC in environmental isolates is notable and was not previously reported (44). All of the VGV isolates had identical nucleotide sequences for ERG11 and AFR1, demonstrating that the resistant isolates did represent not a result of genetic differences in the target or transporter of FLC. However, innate fungal resistance to FLC can be due to multiple factors besides the ERG11 gene or efflux pumps and the mechanism(s) of FLC resistance in VGV remains a subject for future investigation. Why environmental VGV isolates should have such high resistance to azoles is unclear, as, given the relatively pristine environments from which they were recovered, it is unlikely that they had come into contact with agrichemicals. More likely, fluconazole resistance is a pleiotrophic effect that has evolved as a consequence of exposure to xenobiotics other than azoles. Further investigations into the evolution of FLC resistance in VGV may take on additional importance as clinical cases due to VGV are a distinct possibility in the Sub-Saharan regions where 12% of the members of the Zambian population are living with the HIV (https://www.unaids.org/en/regionscountries/countries/zambia).
In this paper, we present a nearly complete genome assembly for VGV type strain MF34. The MF34 genome allowed us to conclusively establish that VGV is a lineage of C. gattii that is separate and distinct from any previously identified and that it is not the result of hybridization, as has been seen for other divergent isolates (31). Indeed, while both Cryptococcus species complexes appear to have a conserved chromosome number of 14 based on the current panels of assembled and annotated genomes available, intrachromosomal and interchromosomal rearrangements as well as large CNVs appear to be common. This chromosomal variation may provide the genetic basis for phenotypic variation and may act as a genetic barrier to recombination between more highly divergent isolates such as those from separate lineages. At the within-lineage level, there are also a number of unique and uniquely lost "lineage-specific" genes which may contribute to phenotypic differences between lineages. However, it should be noted that many of the main phenotypes routinely measured, including virulence in animal models, growth rates, and ability to cause pulmonary versus central nervous system (CNS) infections, appear to differ as much within as between lineages.
One line of future inquiry that may help to explain this phenotypic diversity may come from the characterization of further transcriptional differences. For example, VGII has been shown previously to upregulate many of the ergosterol genes during coincubation with bone-marrow derived macrophages (45) and it will be important to determine whether other traits exist which differentiate the lineages of Cryptococcus.
Further, it will be important to examine whether similar lineage-specific differences underpin VGV's increased FLC resistance and whether clinically relevant traits such as drug resistance are linked to the environment within which these isolates have evolved. Ultimately, our report testifies to the deep reservoir of diversity that exists within Cryptococcus, which, despite decades of research into this genus, still harbors abundant surprises.

MATERIALS AND METHODS
Library preparation and sequencing of Zambian isolates. Environmental sampling took place in January and September of 2013. Samples were collected using Transwab Amies swabs (MWE; code MW170) and sterilized 30-ml screw-cap glass bottles. Amies liquid transport swabs were taken from tree bark (n ϭ 20), soil (n ϭ 19), and cracks in granite kopjes or droppings from rock Hyrax middens (n ϭ 16). Samples were collected and processed according to previously established protocols (46,47), and the samples were kept at 4°C before being processed on niger seed agar. All samples were collected under license from the Zambian Wildlife Authority (ZAWA).
Single colonies purified from the original isolation media were maintained under cryopreservation conditions at -80°C at Imperial College in London since 2013. The isolates were has been changed to revived on YPD agar (yeast extract 1%, peptone 1%, dextrose 2%) and incubated at 30°C before use. Genomic DNA was isolated with the cetyltrimethylammonium bromide (CTAB) extraction method as described previously (48) with modifications. Paired-end libraries (150 bp) were prepared and sequenced using an Illumina HiSeq 4000 platform by Novogene (Davis, CA). Two Oxford Nanopore libraries of isolate MF34 were constructed from genomic DNA using a one-dimensional (1D) library construction kit (model no. SQK-LSK109). A total of 243,660 reads with an N 50 of 9,827 were generated on a FLO-MIN106 flow cell using a Minion. Reads were base called using Albacore v2.3.1. This resulted in 923,997,900 total bases (ϳ46ϫ coverage).
Genome assembly and annotation. For isolate MF34, a hybrid assembly of Oxford Nanopore long reads and Illumina short reads was generated. An initial assembly of the Oxford reads was generated using Canu v1.5 (49) with parameter genomeSize ϭ 20,000,000. The assembly was inspected for the presence of telomeric repeat (TTAGGG) at contig ends; for two contig ends missing a telomeric repeat, contigs were extended by aligning unassembled Canu contigs to these ends using NUCmer v3.1 (50). Base-called reads were then aligned to the contigs using the Burrows-Wheeler transform algorithm (BWA mem) (51) with flag "-x ont2d" and the alignments used for polishing with nanopolish (52). Two rounds of Pilon v1.13 (53) correction were performed using Illumina BWA read alignments (51). Paired Illumina sequences of C. decagattii (VGVI) were assembled and scaffolded using SPAdes v3.1.1 (54) with k-mer lengths of 21, 33, 55, and 77. A summary of statistics for the assembly is provided in Table 2. Reads were aligned to the assembly with BWA v0.7.4-r385 mem (51), and Pilon v1.12 (53) was further used to improve the assembly. Scaffolds that were less than 1 kb were removed. The genome assembly has been submitted to NCBI (see below for project accession number).
The C. gattii VGV MF34 and VGVI WM1802 genomes were annotated using Genemark (55), BLASTx against Swiss-Prot (56) and KEGG (57), and HMMER hmmscan (58) against PFAM (59). We ran tRNAscan (60) and RNAmmer (61) to identify non-protein-coding genes. Gene predictions were checked for a variety of issues, including overlapping of noncoding genes, overlapping of coding genes, and the presence of in-frame stops. Genes were named according to evidence from BLASTx and HMMER in the following order of precedence: (i) Swiss-Prot (56), (ii) TIGRfam (62), and (iii) KEGG (57) (where BLASTx hits must meet the 70% identity and 70% overlap criteria to be considered a good hit and for the name to be applied). Otherwise, genes were classified as hypothetical proteins.
Read alignment and variant identification. The 36 newly sequenced isolates from this study were compared to an additional 65 isolates that had been sequenced and described in previous studies (20,26,38,66,67). These additional isolates were obtained from the NCBI Sequence read archive (SRA) and converted from SRA format to FASTQ using SRA Toolkit version 2.3.3-4. Illumina reads were aligned to the C. gattii VGII R265 reference genome assembly using Burrows-Wheeler Aligner (BWA) v0.7.4-r385 mem (51) with default parameters and were converted to sorted BAM format using SAMtools v0.1.9 (r783) (68).
Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11 (69) was used to call both variant and reference nucleotides from the 101 alignments (as previously described [24]). Briefly, the Picard tools AddOr-ReplaceReadGroups, MarkDuplicates, CreateSequenceDictionary, and ReorderSam were used to preprocess the alignments (http://broadinstitute.github.io/picard/). GATK RealignerTarget-Creator and IndelRealigner were then used to resolve misaligned reads close to indels. Next, GATK Unified Genotyper (with the haploid Genotyper ploidy setting) was run with both SNP and indel genotype likelihood models (GLM). We also ran Base Recalibrator and PrintReads for base quality score recalibration on those initial sites for GLM SNP. We then recalled variants with Unified Genotyper with the parameter "-output-_mode EMIT_ALL_SITES." We merged and sorted all of the calls and then ran Variant Filtration with the parameters "QD Ͻ 2.0, FS Ͼ 60.0, MQ Ͻ 40.0." Next, we removed any base that had a minimum genotype quality of below 50, a minimum proportion of alternative alleles (AD) of 80%, or a minimum depth of 10. Finally, we removed any positions that were called by both GLMs (i.e., incompatible indels and SNPs), any marked "LowQual" by GATK, any nested indels, and any sites that did not include a PASS flag.
Phylogenetic and population genetic analysis. The variants identified from the 101 alignments were filtered for positions that were homozygous (reference or SNP) and polymorphic in at least one isolate (Fig. 2), resulting in alignment of 1,517,353 nuclear sites and 970 mitochondrial sites. A FASTA file of these positions was created and converted into PHYLIP format, and a phylogenetic tree was generated using RAxML v7.7.8 (70) with 1,000 bootstrap replications. RAxML was run with the generalized time-reversible (GTR) and category (CAT) rate approximation, with final evaluation of the tree performed using GTR plus gamma-distributed rates. The same sites were analyzed using the NeighborNet Network of SplitsTree v4.14.6 (71).
A multisample variant call format (VCF) corresponding to all 101 genomes was made with VCFtools (72) and converted to ped and map file formats for use in PLINK v1.90 (73). Unsupervised ADMIXTURE (74) was run on a moderately linkage disequilibrium (LD)-pruned alignment for values of K between 1 and 15. A value of K ϭ 9 provided the lowest cross-validation error (Fig. 3d and e; see also Fig. S1). To explore finer patterns of population structure among our sampled lineages, we applied a technique designed to characterize patterns of haplotype sharing between panels of "donor" and "recipient" haplotypes within a recombining population. We ran Chromopainter v2 (23) to infer, at each position in a recipient isolate's genome, the donor to which it is most closely ancestrally related relative to all others in the data set. To do this, we assumed a uniform recombination rate of 1.5 M (morgans)/megabase based on the genome-wide recombination rate previously estimated in C. neoformans (75) and with Chromopainter's switch and mutation rate parameters estimated using 10 runs of expectationmaximization (Ϫn 190.29; ϪM 0.0011). We then ran Chromopainter in linked mode using the haploid switch (-j) under an "all-versus-all" framework, painting all samples using all others to produce a pairwise coancestry matrix describing the amount of DNA matches of each isolate to every other under the copying model.
Haplotype-based clustering was then implemented in fineSTRUCTURE (23) with an estimated normalization parameter of c ϭ 0.51, sampling cluster assignments every 10,000 iterations for 1 ϫ 10 6 Markov chain Monte Carlo (MCMC) iterations after 1 ϫ 10 6 initial burn-in steps. We then performed an additional 1 ϫ 10 5 hill-climbing iterations beginning with the MCMC sample with the highest posterior probability. This classified our data into 34 clusters (Fig. S2).
For the C. neoformans VNI H99-rooted C. gattii tree, we identified 1:1 orthologs for each of the nine isolates with Orthofinder v2.1.2 (76) using the Synima pipeline (77). We aligned orthologs with MUSCLE v3.8.31 (78), extracted the coding DNA sequences (CDSs) in a codon context, trimmed to the smallest contiguous sequence, and then concatenated alignments. In total, we aligned 2.16 Mb of transcripts for each genome. Prottest v3.4 (79) was used to determine the best-fitting amino acid transition model (JTT) according to Bayesian information criterion. The final tree was produced using RAxML v7.7.8 (70) and the CAT rate approximation and WAG amino acid replacement matrix with 1,000 bootstrap replicates. Synima (77) was used to visualize synteny between the genomes. The same pipeline was used to compare the previous and updated R265 genomes.
Phenotypic analysis. To determine the growth rate of C. gattii VGV, cells of all six VGV isolates were inoculated in YEPD broth and incubated at 30°C on a shaker (200 rpm) for 18 h. Cells were washed with sterile phosphate-buffered saline (PBS), and 2 ϫ 10 5 cells/ml were resuspended in PBS. Three-microliter aliquots of 10-fold serial dilutions were spotted onto YEPD agar and incubated at 30°C and 37°C. For biological confirmation of the species, isolates were inoculated on canavanine glycine bromothymol blue (CGB) agar (80) for species-specific CGB reactions and on Christensen's urea agar (Sigma) and norepinephrine agar (81) for urease and melanin production reactions, respectively, and were incubated at 30°C for 48 h. India ink staining of the cells grown on YEPD broth for 24 h at 30°C was used for microscopic observation of the cell and polysaccharide capsule size. The reference strains used were WM148 or H99 (serotype A, VNI), WM626 (serotype A, VNII), WM179 (serotype B, VGI), WM178, R265 and R272 (serotype B, VGII), WM161 (serotype B, VGIII), and WM779 (serotype C, VGIV) (14). The mating type of each isolate was determined by PCR using primers specific to STE12␣ and STE20a (82).
Determination of MIC for antifungal antibiotics. MICs for fluconazole (FLC), 5-fluorocytosine (5FC), and amphotericin B were determined using Etest strips according to the Etest technical guide (AB Biodisk, Solna, Sweden), with slight modification. Fungal cells were grown in 5 ml of YEPD at 30°C for 18 h. Harvested cells were diluted in sterile saline solution to an optical density at 600 nm (OD 600 ) of 0.05 and plated on yeast nitrogen base (YNB) agar plates. Etest strips were placed at the center of the plates and incubated at 30°C for 72 h. The susceptibility endpoint was read at the first growth inhibition ellipse. The concentration ranges tested were as follows: for FLC, 0.016 to 256 g/ml; for both 5-FC and amphotericin B, 0.002 to 32 g/ml.
URA5 gene RFLP. The URA5 gene of each isolate was amplified from genomic DNA by PCR to identify the molecular type using 50 ng of two primers: URA5 (5=-ATGTCCTCCCAAGCCCTCGACTCCG-3=) and SJ01 (5=-TTAAG ACCTCTGAACACCGTACTC-3=). Reactions were carried out in a total volume of 50 l as previously described (14). PCR was performed for 40 cycles at 94°C for 2 min of initial denaturation, 30 s of denaturation at 94°C, 30 s of annealing at 55°C, and 2 min of extension at 72°C. The reactions were completed by performing a final extension step for 10 min at 72°C. PCR products were analyzed by 1% agarose gel electrophoresis, and 5 l of PCR products was double digested with Sau96I (10 U/l) and HhaI (20 U/l) for 3 h at 37°C. Then, digested samples were separated by 3% agarose gel electrophoresis at 80 V for 5 h. The RFLP patterns of URA5 gene were analyzed using well-defined bands in the gel images by comparing them with the patterns obtained from the standard reference strains.
Restriction enzyme analysis of the URA5 gene to distinguish VGV from VGIV. We found that the URA5 RFLP banding patterns (14) of VGV and VGIV were not clearly distinguishable using Sau96I and HhaI (Fig. S9A). We compared the DNA sequences of the URA5 gene from MF34 (VGV) and WM779 (VGIV) and found two restriction enzymes, StuI and EarI, that can be used to distinguish the two molecular types based on URA5 RFLP. Three microliters of URA5 PCR products was digested with StuI (10 U/l) or EarI (20 U/l) (New England BioLabs Inc.) at 37°C for 4 h, and restriction fragments were separated by electrophoresis in 3% agarose Tris-acetate-EDTA (TAE) gels at 80 V for 5 h. Standard reference strains for molecular typing were used as controls.
Virulence in mice. The virulence of four VGV isolates, two from clade A and two from clade B, was assessed using 7-to-8-week-old female C57BL/6 mice (Taconic Farms). Isolates to be tested in mice were inoculated in YEPD broth and incubated overnight, washed twice, and diluted in PBS to 2.5 ϫ 10 5 cells/ml. Mice (14 per isolate) were inoculated with 20 l of cell suspension (5 ϫ 10 3 cells/mouse) by pharyngeal aspiration. Eight mice were used for each isolate for determination of the survival rate, and six mice each were used for the analyses of fungal burden and histopathology at the indicated time points. Mice were monitored twice per day, and differences in survival were determined using GraphPad Prism, version 7 (GraphPad Software, San Diego, CA).
To assess the organ fungal burden, lungs and brains of four mice from each infected group were inspected. The mice infected with WM779 started to die on day 25 postinfection, and the lungs and brains were harvested immediately from the dead mice on day 25. Mice infected with other isolates were euthanized on day 60, and organs were harvested. Harvested lungs and brains were homogenized in 7 ml and 2 ml of sterile water, respectively, and 5-l aliquots of 10-fold serial dilutions were plated on YEPD agar and incubated at 30°C for 48 h. Fungal colonies were counted and the tissue fungal burden was analyzed using GraphPad Prism, version 7 (GraphPad Software, San Diego, CA).
Histopathological analysis. For histopathological analysis, organs of infected mice from each group were harvested at 10 and 20 days postinoculation and fixed in 3.7% buffered formalin and embedded in paraffin. Sections were stained with hematoxylin and eosin (H&E) or alcian blue/periodic acid-Schiff (AB/PAS) at Histoserv Inc.
Ethics statement. The Institutional Animal Care and Use Committee of the National Institute of Allergy and Infectious Diseases approved all animal studies (approval no. LCIM-5E). Studies were performed in accordance with recommendations of the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health.
Data availability. The raw sequence and genome assembly of VGV MF34 are available in NCBI under BioProject accession no. PRJNA487802, and the culture has been deposited at the American Type Culture Collection (accession number pending). Raw sequence data were submitted to the NCBI Sequence Read Archive under BioProject identifiers (ID) PRJNA476154 (all C. gattii non-VGV isolates) and PRJNA480403 (all C. gattii VGV isolates).