Selective Whole-Genome Amplification Is a Robust Method That Enables Scalable Whole-Genome Sequencing of Plasmodium vivax from Unprocessed Clinical Samples

ABSTRACT Whole-genome sequencing (WGS) of microbial pathogens from clinical samples is a highly sensitive tool used to gain a deeper understanding of the biology, epidemiology, and drug resistance mechanisms of many infections. However, WGS of organisms which exhibit low densities in their hosts is challenging due to high levels of host genomic DNA (gDNA), which leads to very low coverage of the microbial genome. WGS of Plasmodium vivax, the most widely distributed form of malaria, is especially difficult because of low parasite densities and the lack of an ex vivo culture system. Current techniques used to enrich P. vivax DNA from clinical samples require significant resources or are not consistently effective. Here, we demonstrate that selective whole-genome amplification (SWGA) can enrich P. vivax gDNA from unprocessed human blood samples and dried blood spots for high-quality WGS, allowing genetic characterization of isolates that would otherwise have been prohibitively expensive or impossible to sequence. We achieved an average genome coverage of 24×, with up to 95% of the P. vivax core genome covered by ≥5 reads. The single-nucleotide polymorphism (SNP) characteristics and drug resistance mutations seen were consistent with those of other P. vivax sequences from a similar region in Peru, demonstrating that SWGA produces high-quality sequences for downstream analysis. SWGA is a robust tool that will enable efficient, cost-effective WGS of P. vivax isolates from clinical samples that can be applied to other neglected microbial pathogens.

IMPORTANCE Malaria is a disease caused by Plasmodium parasites that caused 214 million symptomatic cases and 438,000 deaths in 2015. Plasmodium vivax is the most widely distributed species, causing the majority of malaria infections outside sub-Saharan Africa. Whole-genome sequencing (WGS) of Plasmodium parasites from clinical samples has revealed important insights into the epidemiology and mechanisms of drug resistance of malaria. However, WGS of P. vivax is challenging due to low parasite levels in humans and the lack of a routine system to culture the parasites. Selective whole-genome amplification (SWGA) preferentially amplifies the genomes of pathogens from mixtures of target and host gDNA. Here, we demonstrate reference genome. We selected the top 10,000 primers of each length, yielding a total of 70,000 primers for further analysis. We filtered these primers based on characteristics such as melting temperature (18 to 32°C), ability to homodimerize (no greater than 3 consecutive matches), binding frequencies on the human genome and Sal-1 genome (less frequent than once every 500,000 bp and more frequent than once every 50,000 bp, respectively), and infrequent binding of the human mitochondrial genome (less than 4 binding sites). Next, we removed primers predicted to bind the Sal-1 subtelomeres. These filters resulted in a pool of 222 primers. We separated these primers into 6 sets that were predicted not to form heterodimers and identified the top set (pvset1) of 10 primers using a selection algorithm described previously (22). gDNA from an unprocessed P. vivax-infected whole-blood sample (MRL2) was subjected to SWGA with primer set pvset1 prior to shotgun sequencing (see Fig. S1 in the supplemental material). SWGA significantly increased the percentage of reads that mapped to the P. vivax Sal-1 reference genome, from 0.7% to 73.5% (see Fig. S1A), and improved the genome coverage obtained from~80 million bp of sequencing from 1.5% to 58% (see Fig. S1B).
We observed that the genome coverage obtained per base pair sequenced was lower than that achieved with SWGA of P. falciparum gDNA using the same primer set design methods (22). Visual inspection of the P. vivax genome coverage from samples subjected to SWGA revealed that coverage gaps were typically in regions with comparatively higher GC content ( Fig. 2A). The P. falciparum genome is extremely AT rich, with only 19.4% of bases consisting of G's or C's, while the GC content for the P. vivax genome is 42.3% (9). The P. vivax genome also has an isochore structure: internal chromosomal areas have a high GC content and subtelomeres and centromeres have lower GC contents (25). Since phi29 DNA polymerase pauses more frequently during strand displacement and primer extension in regions with a high GC content of DNA (26), we hypothesized that differences in base composition could explain the more uneven amplification of the P. vivax genome compared to that of P. falciparum.
We thus designed primer sets specifically targeting regions of the P. vivax Sal-1 reference genome with high GC content and poor coverage using the swga program (https://github.com/eclarke/swga; unpublished data), a program that identifies and scores SWGA primer sets (see Fig. S2 in the supplemental material). Primers were designed to bind regions of the P. vivax Sal-1 genome that had even AT/GC composition, were longer than 195,000 bp, and had low sequence coverage when amplified with pvset1. We identified 1,939 primer sets (consisting of up to 15 primers) with minimal human genome binding and maximal P. vivax genome binding and scored them based on evenness of binding, as well as mean distance between primer binding sites in the foreground and background genomes. The primer set with the best score, pvset1920, was chosen for subsequent testing. SWGA of an unprocessed human blood sample with pvset1920 yielded overall superior P. vivax genome coverage compared to that of SWGA with pvset1 ( Fig. 3). Visual inspection of these post-SWGA P. vivax sequences revealed that pvset1920 achieved improved coverage particularly in regions with high GC content (Fig. 2B), with troughs in coverage in genomic regions of lower GC content, which include the centromeres and subtelomeres.
Having developed a method that worked well for SWGA of P. vivax gDNA from whole blood, we tested whether the method could also be applied to gDNA extracted from dried blood spot samples. Dried blood spots are a common method of storing patient and parasite DNA that utilizes a smaller volume of blood and does not require immediate cold storage. DNA extracted from dried blood spots can have variable quality depending on the method of collection and storage (27). SWGA has been used to enrich P. falciparum DNA from dried blood spots for WGS (23), with an average of 48.1% Ϯ 3.5% of the genome covered at Ն5ϫ for samples with an average parasite density of 73,601 parasites/l Ϯ 19,399 (1.5% parasitemia). Since P. vivax clinical samples generally have lower parasite densities, we wondered if it would be feasible to obtain significant genome coverage on P. vivax from dried blood spots with SWGA. We extracted DNA from blood spots obtained from symptomatic patients in Peru and performed SWGA with pvset1920, achieving 73% and 42% genome coverage at 1ϫ on initial testing for samples with a high (sample C, 56,790 parasites/l; 1.1%) and low (sample L, 2,572 parasites/l; 0.05%) level of parasitemia, respectively (see Fig. S3 in the supplemental material).

FIG 2
Plasmodium vivax chromosomal coverage following SWGA using primer set pvset1 (A) or pvset1920 (B). The base compositions of chromosomes 2 and 6 were visualized in Geneious (version 9.1) using the P. vivax Sal-1 reference genome; green and blue lines represent percentages of AT and GC content, respectively, plotted for 25-bp windows across the chromosome (scale shown above the graph). Shown in blue and red below are the corresponding MiSeq read coverage depths using primer sets pvset1 and pvset1920, respectively. Coverage plots were generated using IGVTools (version 2.3.40) and are shown on a log scale with maximum read depth indicated in the upper left corner of the plot.
We finally tested whether an enzymatic digest to remove contaminating human DNA could further improve P. vivax genome coverage. Modification-dependent restriction endonucleases (MDREs), such as MspJI and FspEI, which specifically recognize cytosine C-5 methylation or hydroxymethylation (28), have been used to selectively degrade human DNA in P. falciparum clinical samples. Enzyme digest of DNA extracted from clinical samples with Ͼ80% human contamination has previously been shown to enrich P. falciparum DNA~ninefold for more efficient WGS (29). However, when we performed a digest with MspJI and FspEI enzymes on gDNA extracted from whole blood obtained from patients with P. vivax infection, we observed either no change or markedly decreased genome coverage in the 5 enzyme-digested samples (see Fig. S4 in the supplemental material).
SWGA and WGS of P. vivax from patient samples. To test the utility of SWGA for variant calling and population genetics analysis of P. vivax from unprocessed clinical blood samples, we used primer set pvset1920 to perform SWGA on P. vivax gDNA from 18 whole-blood and 4 dried blood spot samples collected from symptomatic patients with P. vivax infection in Peru. Since the whole-blood samples had not been leukocyte filtered, they had significant contamination with human DNA, with less than 1.5% of reads mapping to the Sal-1 reference genome in unamplified samples (not shown). For all samples, SWGA significantly increased the proportion of reads that mapped to the P. vivax Sal-1 reference genome, resulting in higher genome coverage and higher percentages of callable total-and core-genome regions (covered by Ն5 reads) (Table 1). Comparison of the SWGA-amplified samples to 10 leukocyte-filtered samples from a field study in Peru which were sequenced to a similar depth (1.5 Ϯ 0.2 billion bp sequenced for SWGA samples versus 1.5 Ϯ 0.5 billion bp for leukocyte-filtered samples) showed that SWGA yields a twofold increase in the percentage of sequencing reads that map to the P. vivax genome and an average 5ϫ P. vivax core-genome coverage of 60.1% Ϯ 26.0%, compared to 43.7% Ϯ 41.4% for leukocyte-filtered samples (10). For the 4 dried blood spot samples, we achieved an average 5ϫ core-genome coverage of 54.0% Ϯ 34.6%. There was a trend toward improved mean coverage and percentage of the genome callable in samples with higher parasite densities (see Fig. S5 in the supplemental material). This is consistent with previous SWGA results for P. falciparum (22) and results from other P. vivax-enriching methods, such as hybrid selection (16). Samples subjected to SWGA yielded similar Ն5ϫ genome coverage per sequenced base pair compared to that obtained by direct sequencing of a leukocyte-filtered patient sample (Fig. 4A). The percentage of the 14 large chromosomes of P. vivax considered callable for samples that underwent SWGA fell within the range of that obtained by direct sequencing of leukocyte-filtered samples (Fig. 4B). Additionally, post-SWGA sequences yielded similar levels of mean base quality when normalized across 100-bp windows of various %GC contents in the reference genome compared to the results for leukocyte-filtered samples (Fig. 5).

Variant analysis.
To examine the utility of post-SWGA sequences for variant analysis, we called 45,821 single-nucleotide polymorphisms (SNPs) from the wholeblood and dried blood spot samples that were subjected to SWGA. For the whole-blood samples, an average of 14,463 SNPs was identified per sample, which is consistent with prior studies of P. vivax field isolates (10). Compared to the results for leukocyte-filtered samples, SNP characteristics such as SNP rate, transition-to-transversion (Ti/Tv) ratio, and nonsynonymous-to-synonymous ratio were nearly identical in the samples that underwent SWGA (Table 2).
In addition, the proportions of SNPs that were exonic, intronic, intergenic, or at 5= and 3= untranslated regions were similar between samples prepared using the two methods of P. vivax enrichment. We also detected SNPs in several known drug resistance genes previously detected in samples from Peru (10) and Colombia (8) in the whole-blood and dried blood spot samples (Table 3; see also Table S1 in the supplemental material), further validating the utility of sequences derived from SWGA for variant calling. This includes several intronic mutations around a putative chloroquine resistance transporter gene (pvcrt), in addition to coding mutations in the dihydrofolate reductase (pvdhfr), multidrug resistance protein 1 (pvmdr1), multidrug resistance protein 2 (pvmrp2), and dihydropteroate synthetase (dhps) genes.
We also compared sample clonality estimates of post-SWGA sequences to microsatellite analyses on the same unamplified samples. We estimated the clonality of the 6 post-SWGA sequences with the highest coverage using the F ws statistic, a measure of within-host diversity previously used to characterize multiplicity of infection in Plasmodium falciparum patient samples (Table 4) (5,30). An F ws score of Ն0.95 indicates low within-host diversity and infection with a single parasite, while an F ws score of Յ0.70 is suggestive of a multiclonal infection. Microsatellite analysis on these same 6 unamplified samples indicated that all were clonal, except for sample 9, where the presence of 2 microsatellite markers at more than one position suggested that it could be a multiclonal sample. However, for all 6 post-SWGA sequences, the F ws score was Ն0.95, suggesting that Finally, we constructed a neighbor-joining tree using core-genome SNPs to determine the relatedness of our samples to one another and to other P. vivax isolates from  (Fig. 6). In this tree, our samples, which were from the Iquitos region of Peru, clustered with one another and were most closely related to leukocyte-filtered samples from another region of Peru (10). They also exhibited the expected degree of relatedness to previously published P. vivax sequences derived from other leukocyte-filtered (15), hybrid-selected (17), and monkey-adapted (19) clinical samples from diverse areas of the world, further validating the use of post-SWGA sequences from downstream analyses.

DISCUSSION
In this study, we validate SWGA as a cost-effective, robust method to enrich P. vivax gDNA from unprocessed whole-blood and dried blood spot clinical samples to improve the efficiency and decrease the cost of subsequent WGS. This is a method that can be applied to clinical samples infected with other malaria species, such as P. malariae, P. ovale curtisi, and P. ovale wallikeri, where parasite densities are low, and where there is no routine ex vivo culture (31), though species-specific primer sets would likely be required. SWGA utilizes readily available reagents, does not require processing at the time of sample collection, and can be performed in a simple, overnight reaction. While several methods have been used successfully to enrich P. vivax gDNA for WGS from infected whole-blood samples, most are resource and labor intensive. Short-term ex vivo culture of P. vivax isolates or adaptation to growth in monkeys produces a large amount of P. vivax DNA, but these methods require significant resources. Single-cell sequencing allows for highly sensitive dissection of multiclonal samples; however, this approach requires cryopreserved samples and specialized laboratory equipment (20). While leukocyte filtration is cost-effective and efficient, it is not always possible to perform at field sites with limited infrastructure, because samples require refrigeration  Selective Whole-Genome Amplification Of Plasmodium vivax ® within 6 hours to minimize white blood cell lysis and reduce irreversible contamination from human DNA. Hybrid selection is less labor intensive, but the production of the RNA baits used for capture requires either large amounts of P. vivax Sal-1 DNA or costly commercially synthesized RNA bait. Using SWGA, we achieved a higher-than-average callable P. vivax genome than was obtained for leukocyte-depleted clinical samples sequenced at a similar depth. SWGA generally yielded the highest genome coverage for clinical samples with the highest parasite densities, consistent with our experience with P. falciparum (22). For the 12 samples with parasite densities of Ͼ5,000 parasites/l (0.1% parasitemia), we were able to call, on average, 71.5% of the core genome, compared to 37% for the 6 samples with parasite densities of Ͻ5,000 parasites/l. Increased sequencing effort is needed to obtain maximal genome coverage for samples with lower parasite densities (see Table S2 and Fig. S6 in the supplemental material). In these cases, the low genome coverage is likely the result of stochastic amplification of a small number of starting P. vivax genomes, which leads to very deep coverage of some genomic regions and little or no coverage of others (22). If maximal genome coverage is desired, sequential SWGA reactions with pvset1920 followed by pvset1 increase coverage slightly (3 to 5% 1ϫ genome coverage) (see Fig. S7). Additionally, performing multiple independent SWGA reactions on a sample and combining the sequencing reads can improve genome coverage (4 to 12% 1ϫ genome coverage) (see Fig. S8). Since multiple rounds of SWGA or pooling the products from multiple reactions increases the workload and expenses for a small improvement in genome coverage, we opted for a single amplification reaction with pvset1920 for our samples. However, these protocol modifications may be useful if high genome coverage is needed from samples with low parasitemia.
One potential limitation of SWGA of P. vivax from clinical samples is the ability to amplify all clones in a multiclonal sample. One of our samples appeared to be multiclonal by microsatellite analysis of the unamplified gDNA, and yet, F ws analysis of the post-SWGA sequence suggested that the sample was comprised of a single clone. It is possible that in this case, a majority clone was amplified preferentially over minority clones. Another possibility is that the microsatellite marker assessment may overestimate clonality. Analyses of additional multiclonal samples will be necessary to address this question. Another important limitation of SWGA is that copy number variant (CNV) detection is not possible on post-SWGA sequences. The uneven distribution of primertargeted motifs in the target genome results in peaks and troughs in mean genome coverage that can confound CNV detection methods. Finally, SWGA requires long strands of gDNA for efficient amplification of the target genome and is unlikely to work well on degraded or ancient DNA samples.
Whole-genome analysis has the potential to reveal much about the biology and epidemiology of P. vivax infections. For example, comparison of recurrent infections using WGS can help distinguish relapse due to reactivation of hypnozoites from reinfection or drug resistance, an epidemiological distinction of public health importance. SWGA enables high-quality and cost-effective WGS of P. vivax from unprocessed blood samples that would otherwise be impossible or prohibitively expensive to sequence. Advanced technologies like SWGA will greatly facilitate future P. vivax whole-genome sequencing projects, thereby improving our ability to understand and combat the most widespread form of malaria.

MATERIALS AND METHODS
Patient sample collection and preparation. The P. vivax-infected DNA sample used for initial testing of selective amplification primer sets (MRL2) was provided by the Malaria Reference Laboratory of the London School of Hygiene and Tropical Medicine, London, United Kingdom. The six dried blood spot samples used in this study were collected from patients with symptomatic P. vivax infection in Peru. Selective Whole-Genome Amplification Of Plasmodium vivax ® Parasitemia was quantified with real-time PCR, and gDNA was extracted using a QIAamp DNA blood minikit (Qiagen).
Eighteen whole-blood samples used for additional testing and further sequencing analysis were derived from whole-blood samples collected from patients with symptomatic P. vivax infections from two sites around Iquitos, Peru, during a study conducted by U.S. Naval Medical Research Unit No. 6 (32). Thick blood smears were examined to identify the parasite species and to determine the level of parasitemia. Parasite density was calculated by counting the number of asexual parasites per 200 white blood cells in the thick smear (Assuming an average of 6,000 white blood cells per l). Each blood smear was examined by two microscopists independently and by a third microscopist in the event of a discrepancy. The final parasite density was calculated as the average of the density readings from the two concordant microscopists. Whole-blood samples were collected in the field using EDTA-containing Vacutainer tubes. Samples were frozen and transported to the central laboratory until further processing. DNA was isolated from thawed whole blood using a QIAamp DNA blood minikit (Qiagen) following the manufacturer's recommendations and as described elsewhere (33). Samples were subsequently resuspended in Tris-EDTA (TE) buffer, and gDNA was quantified using a Qubit 2.0 fluorometer.
Selective whole-genome amplification. Thirty to 70 ng of input DNA was added to a 50-l reaction mixture containing 3.5 M SWGA primers, 30 U phi29 DNA polymerase enzyme (New England Biolabs), 1ϫ phi29 buffer (New England Biolabs), 4 mM dNTPs (Roche), 1% bovine serum albumin, and water. The reaction was carried out on a thermocycler with cycling conditions consisting of a ramp down from 35°C to 30°C (10 min per degree), 16 h at 30°C, 10 min at 65°C, and hold at 4°C. The samples were diluted 1:1 with DNase-free, RNase-free water and purified with AMPure XP beads (Beckman-Coulter) at a 1:1 ratio according to the manufacturer's protocol. When a second round of selective amplification was performed, the reaction mixture contained 100 to 200 ng of the AMPure XP-purified product from the first reaction.
Methylation digest. One hundred twenty-five to 500 ng of gDNA extracted from P. vivax-infected whole-blood samples was digested with 5 units of FspEI (New England Biolabs) and 5 units of MspJI (New England Biolabs) enzymes in a 30-l reaction mixture. A mock digest with an identical amount of gDNA and no enzymes was run in parallel. Samples were digested for 2 h at 37°C and then heat inactivated at 80°C for 15 min. For coverage analysis, rarefaction analysis was performed on Illumina MiSeq sequencing reads derived from samples digested with MspJI or FspEI (digest and SWGA) or mock digested with no enzyme prior to SWGA with primer set pvset1920. The coverage obtained by mapping 200,000 MiSeq sequencing reads (~25 million bp of sequencing depth) was compared between digested and mockdigested samples. P values were obtained using the two-tailed t test.
Whole-genome sequencing. The SWGA products and unamplified DNA used for primer set testing were sequenced on an Illumina MiSeq using a modified Nextera library preparation method. For HiSeq runs, next-generation-sequencing libraries of SWGA products were prepared using the Nextera XT DNA preparation kit (Illumina) according to the manufacturer's protocol. These samples were pooled and clustered on a HiSeq 2500 (Illumina) in Rapid Run mode with 100-bp paired-end reads. For the coverage analysis of leukocytefiltered samples, fastq files from a prior study of P. vivax field samples (10) were used.
Raw fastq files were aligned to the Sal-1 reference genome (PlasmoDB version 13, http://plasmodb.org/ common/downloads/release-13.0/PvivaxSal1/fasta/data/) using the Burroughs-Wheeler Aligner (version 0.7.8) (34) and SAMtools (version 0.1.19) (35,36) in the Platypus pipeline, as previously described (37). Picard (version 2.0.1) was used to remove unmapped reads, and the Genome Analysis Toolkit (GATK) (38) was used to realign the sequences around the indels. Picard's CollectGcBiasMetrics tool was used to generate the GC bias plots. GATK's DepthOfCoverage tool was used to determine the percentages of the total and core genomes covered by Ն5 reads, mean coverage, and coverage over the core genome. The coordinates of the P. vivax core genome, which excludes subtelomeric and hypervariable regions with significantly higher read mapping errors, was obtained from a recent analysis of hundreds of P. vivax sequences from clinical isolates (15).
Variant calling and analysis. We followed the GATK's best practices to call variants (42,43). The aligned sequences were run through GATK's HaplotypeCaller in "reference confidence" mode to create genomic variant call format (GVCF) files for each sample. This reference confidence model highlights areas of the genome that are likely to have variation and produces a comprehensive record of genotype likelihoods and annotations for each site. The samples were joint genotyped using the GenotypeGVCFs tool.
Variants were further filtered based on quality scores and sequencing bias statistics based on default parameters from GATK. SNPs were filtered out if they met any of the following criteria: quality depth (QD) of Ͻ2.0, mapping quality (MQ) of Ͻ50.0, Phred-scaled P value using Fisher's exact test to detect strand bias (FS) of Ͼ60.0, symmetric odds ratio (SOR) of Ͼ4.0, Z score from Wilcoxon rank-sum test of alternative versus reference read-mapping qualities (MQRankSum) of ϽϪ12.5, and ReadPosRankSum (RPRS) of ϽϪ8.0. Variants were annotated using snpeff (version 4.2) (44).
F ws of samples with the highest genome coverage was estimated using moimix (https://doi.org/ 10.5281/zenodo.58257), a package available through R. The package calculates the F ws statistic using the equation F ws ϭ 1 Ϫ (H w /H s ), where H w is the within-host heterozygosity and H s is the population-level heterozygosity (5,30). The core P. vivax genome, as defined by Pearson et al. (15), was used for core-genome analysis. For microsatellite genotyping, five neutral microsatellite loci of significant variability in the Peruvian Amazon were typed in a previous study (32). If there was more than one marker at any given locus, the sample was considered multiclonal, per prior genotyping studies (45)(46)(47).
A neighbor-joining tree was constructed using SNPs from the core P. vivax genome from sequences obtained in this study, along with sequences from previously published studies that are available in the NCBI Short Read Archive. The Mdio samples were from a previous study conducted by our laboratory in Peru (10), and the rest of the sequences were obtained from two recent large-scale P. vivax sequencing studies (15,17). In order to assess the phylogenetic relationships of sequenced isolates, we constructed a multiple sequence alignment from filtered SNPs called in GATK using an in-house Perl script. This alignment was used as the input for maximum-likelihood phylogenetic analysis in the randomized accelerated maximum-likelihood program (RAxML) (48) with 500 pseudoreplicates using the generalized time-reversible model, and the resulting tree was visualized in Dendroscope (49).
Ethics approval and consent to participate. The sample from Malaria Research Laboratories (MRL) was an anonymized DNA sample previously collected by the MRL and provided under the MRL's remit to undertake epidemiological surveillance relevant to imported malaria in the United Kingdom. The protocol for the collection of field samples was approved by the Institutional Review Board of the U.S. Naval Medical Research Center (Protocol NMRCD.2005.0005) and the National Institutes of Health of Peru (protocol 009-2004) in compliance with all applicable federal regulations governing the protection of human subjects. All adult subjects provided written informed consent, and all children 8 to 17 years old provided verbal assent to participate in the study.
Accession number(s). The P. vivax genome Illumina sequencing reads of the 22 samples used for variant analysis in this study are available in the National Center for Biotechnology Information's Sequence Read Archive with the study accession number SRP095853.