Global Transcriptome Analysis Identifies a Diagnostic Signature for Early Disseminated Lyme Disease and Its Resolution

Lyme disease (LD), caused by Borrelia burgdorferi, is the most common tick-borne infectious disease in the United States. We examined gene expression patterns in the blood of individuals with early disseminated LD at the time of diagnosis (acute) and also at approximately 1 month and 6 months following antibiotic treatment. A distinct acute LD profile was observed that was sustained during early convalescence (1 month) but returned to control levels 6 months after treatment. Using a computer learning algorithm, we identified sets of 20 classifier genes that discriminate LD from other bacterial and viral infections. In addition, these novel LD biomarkers are highly accurate in distinguishing patients with acute LD from healthy subjects and in discriminating between individuals with active and resolved infection. This computational approach offers the potential for more accurate diagnosis of early disseminated Lyme disease. It may also allow improved monitoring of treatment efficacy and disease resolution.

first principal component (x axis) accounts for 37.7% of the variability in the data and, with few exceptions, clearly separates the healthy donor and late convalescent LD blood samples from the acute and early convalescent LD blood samples. No further separation of samples within each of these groups occurs when the second (y axis) or third (z axis) principal component is applied.
Significant differentially expressed transcripts (DETs) were defined as those having a P value of Ͻ0.05 and at least a 2-fold change in expression at any time point relative to the healthy donor group. A total of 335 DETs, representing 233 unique genes, were identified (see Table S1 in the supplemental material). The greatest number of DETs (241 total; 187 induced, 54 repressed) was observed in the acute phase blood samples of the LD subjects (Fig. 2). The 1-month convalescent phase samples contained 142 DETs (142 total; 84 induced, 58 repressed); most of these (92; 65%) were also differentially expressed during acute LD. Only 56 DETs (56 total; 45 induced, 11 repressed) were identified in 6-month convalescent-phase samples; of these, an overwhelming majority (51; 91%) were unique to this group. A list of the DETs with the greatest change in expression (at least 2.5-fold) is provided in Table 2, along with the corresponding fold change values for each time point.
In order to visualize temporal gene expression changes occurring during different disease states, a profile plot was generated using the normalized intensity values of the 335 DETs. Healthy donors displayed a relatively broad range in intensity values (Fig. 3); this likely reflects normal variation in gene expression in the population (27,28). The range of normalized intensities appeared to be more restricted in the acute LD samples relative to samples from the healthy controls, likely reflecting a common response to B. burgdorferi infection among subjects. Consistent with the Venn diagrams, the profiles for acute LD and 1-month convalescent LD samples were found to be strikingly similar; however, the intensity of many of the transcripts was slightly reduced in the 1-month convalescent samples. Importantly, expression intensities for the 6-month convalescent LD samples showed greater variability in general, as was observed in healthy donors (Fig. 3). Interestingly, at 6 months convalescence, the expression levels of some transcripts that had been repressed during acute LD exceeded values observed in healthy controls. This may indicate a "rebound effect" as immune cells returned to homeostasis following clearance of the infection. Numerous genes involved in innate immune mechanisms are differentially expressed during acute and early convalescent LD but not during late convalescence. To further identify transcriptional patterns characteristic of disease states, the 335 DETs were used for unsupervised hierarchical clustering. As shown in Fig. 4, samples separated into two main clusters. Consistent with the principal-component analysis, all healthy donor and late-convalescent-phase samples clustered together (group A), while the majority of the acute-phase (22 of 28) and early-convalescentphase (20 of 27) samples from LD subjects comprised a second group (group B). Four

FIG 2
Venn diagram depicting common and unique patterns of differential gene expression among Lyme disease patients during acute LD and at 1 month or 6 months after the initiation of an appropriate antibiotic regimen. Venn diagrams were generated using a total of 335 DETs that had a fold change of at least 2, with P value of Ͻ0.05, relative to healthy controls. DETs for acute, 1-, and 6-month samples are represented by colored ellipses. The sizes of the ellipses are adjusted for the number of DETs in each group.
of the remaining six acute LD samples formed a small subcluster immediately adjacent to group B. One acute LD sample was distinctly separated from the other acute LD samples; this sample had been collected from the only LD subject who did not have serologic evidence of B. burgdorferi infection at any time point during the course of the study and was culture negative from skin and blood; the diagnosis of LD was based solely on the presence of MEM.  Table S1 in the supplemental material). Cluster 1 (54 genes, 76 probe sets) and cluster 3 (38 genes, 45 probe sets) contained genes that were strongly or moderately induced in the 42 acute and early convalescent LD samples in group B relative to healthy controls. However, increased expression of these genes was not observed in the six acute LD subjects that clustered in group A. Cluster 1 was characterized by genes involved in innate immune processes (Table S1). Significant gene ontology (GO) terms associated with cluster 1 included platelet alpha granule (P ϭ 3.15E-08), wound healing (P ϭ 3.28E-04), blood coagulation (P ϭ 0.001), hemostasis (P ϭ 0.001), and response to stress (P ϭ 0.005). Cluster 3 featured genes involved in fatty acid catabolism (Table S1). Significant GO terms included carnitine O-palmitoyltransferase activity (P ϭ 2.72E-04), choline kinase activity (P ϭ 2.72E-04), ethanolamine kinase activity (P ϭ 2.72E-04), and intracellular lipid transport (P ϭ 5.32E-04).
The majority of acute LD subjects showed a significant induction of genes in cluster 4. This result contrasted with that for clusters 1 and 3, where different responses were observed for the acute LD subjects in group A and group B. Of the 69 transcripts in cluster 4, 28 (41%) are involved in innate immune cell functions, including pathogen recognition, phagocytosis, neutrophil activation, chemotaxis and cell migration, and inflammation. The most highly induced transcript encodes DEFA1/DEFA1B/DEFA3 (defensin, alpha 1/defensin, alpha 1B/defensin, alpha 3, neutrophil specific), microbicidal proteins of neutrophil granules that effectively kill B. burgdorferi in vitro (29) ( Table 2). With the single exception of DEFA1/DEFA1B/DEFA3, which was upregulated at all time points, genes in cluster 4 were significantly induced only during acute and early convalescent LD and returned to levels observed in the healthy donors within 6 months ( Table 2).
Cluster 2 contained 22 genes (26 probe sets) that, with three exceptions, were not significantly changed during acute or early convalescent LD but were significantly induced in the late convalescent LD (6 months) subjects. Cluster 5 consisted of transcripts for 50 genes that were significantly repressed in the majority of acute and early convalescent LD patients relative to healthy subjects. Significant GO terms for these genes included immune system process (P ϭ 4.98E-06), response to wounding (P ϭ 1.21E-05), and cell migration (P ϭ 2.92E-04).
Interferon-regulated genes characterize the response to acute disseminated B. burgdorferi infection. Interferome (http://www.interferome.org/interferome/home .jspx), a database of interferon (IFN)-regulated genes (30), was employed to analyze the genes dysregulated during acute LD. The following parameters were applied to the analysis: human (species), hematopoietic/immune (system), and blood (organ). Totals of 106 of 131 (81%) induced genes (encoded by 187 transcripts) and 25 of 30 (83%) of the repressed genes (encoded by 54 transcripts) were identified as interferon regulated. These included 32 of the 40 genes with the greatest expression changes ( Table 2).
Normalization of transcriptome following treatment is concordant with resolution of symptoms. LD subjects were questioned regarding symptoms at each visit. Symptoms that had existed due to a preexisting condition were not included in the FIG 4 Hierarchical clustering distinguishes between disease states. Heat map with the dendrogram resulting from unsupervised hierarchical clustering performed using 335 transcripts (representing 233 genes) that were differentially expressed (at least a 2-fold change, with a P value of Ͻ0.05) relative to healthy controls. The values shown are normalized intensities relative to the mean. Red or blue indicates high or low expression, respectively, of the normalized intensities relative to the mean. The heat map displays five distinct clusters, three containing induced genes and two containing repressed genes. Boldfacing indicates genes that were later identified as classifiers for disease states (Tables 4 and 5). A list of the top 40 genes with greatest changes in LD subjects is presented in Table 2, and all dysregulated genes are provided in Table S1 in the supplemental material.
Identification and validation of predictor genes. One major limitation of serological tests is the inability to detect infection prior to the appearance of antibodies. A predictive model was developed based on application of the random forest algorithm to the 2004 most highly variable genes in three data sets (acute LD, 6-month convalescent LD, and healthy controls). In the first comparison, the capability of this model to correctly distinguish between subjects with acute LD and healthy controls was determined and the top 20 genes with the highest random forest importance levels were identified (Table 4). Hierarchical clustering using only these 20 genes accurately separated acute LD subjects and healthy controls into two distinct clusters (Fig. 5A). Moreover, this 20-gene classifier set correctly distinguished subjects with acute LD from healthy donors with 100% sensitivity and 96% accuracy (correct predictions/test set   (Fig. 6). In comparison, only 22/27 of these subjects tested positive by ELISA for B. burgdorferi-specific antibodies at the initial visit, resulting in 81% sensitivity for the serology-based test. Four of the five patients who were seronegative by ELISA at the initial visit seroconverted by the time of the second visit. Another major limitation of most serologic diagnostic tests is the inability to distinguish between active and prior infection as circulating antibodies are present long after the pathogen is cleared. Application of the random forest algorithm to samples from LD subjects at baseline and at 6-month convalescence resulted in a separate distinct set of 20 classifier genes (Table 5). Hierarchical clustering of samples using this unique 20-gene classifier set correctly categorized the preponderance of samples from these two groups (Fig. 5B). In addition, acute LD could be discriminated from 6-month convalescent subjects with 100% sensitivity and 97% accuracy (Fig. 6).
Validation of the specificity of the classifier gene set was performed by applying the prediction model to a published microarray data set generated using peripheral blood mononuclear cells (PBMCs) from patients with acute infections caused by common bacterial and viral pathogens: Staphylococcus aureus, Streptococcus pneumoniae, Esch-  Table 4). (B) A second unique set of 20 genes (shown on the right and in Table 5) having the highest random forest importance levels when comparing acute LD subjects (orange) and 6-month convalescent LD subjects (green) was used for hierarchical clustering of samples from these groups.
Diagnostic Gene Signature for Lyme Disease ® erichia coli, or influenza A virus (17). First, the top 10% of genes with the greatest variance were selected. Next, iterations (n ϭ 10) of the random forest algorithm were run to identify the top 20 genes associated with each infectious agent that had the highest importance levels ( Table 6). Using these 20-gene classifier sets, random forest analysis correctly identified patients with specific infections with prediction accuracies of 100% (influenza A virus), 98% (B. burgdorferi), 95% (S. pneumoniae and S. aureus), and 94% (E. coli). Comparison of the 20-gene sets revealed that acute infections due to E. coli, S. aureus, and S. pneumoniae shared multiple classifiers; the greatest number (eight) of shared classifiers was between E. coli and S. aureus infections ( Table 6). The gene lists were analyzed for IFN-responsive genes using Interferome as described above. The only classifier sets that contained more than one IFN-regulated gene were those for B. burgdorferi (n ϭ 15) and influenza A (n ϭ 6) ( Table 6). Remarkably, however, all 20 classifier genes for acute infection with B. burgdorferi were unique to that organism; none was shared with any of the other bacterial infections or with infection due to influenza A.

DISCUSSION
In this study, multiple approaches were used to identify a peripheral blood signature that would enable reliable detection of early disseminated LD at a time point when  standard serologic testing may be suboptimally sensitive. A 20-gene classifier set that correctly distinguished subjects with acute LD from healthy donors with 96% accuracy, 100% sensitivity, and 90% specificity was identified. A second major limitation of antibody-based tests is the inability to differentiate between acute infection and resolved infection (after antibiotic treatment) due to specific circulating antibodies that may persist for years after the microbe has been eliminated. The identified 20-gene classifier set was able to discriminate acute LD from 6-month convalescent subjects with 97% accuracy, 100% sensitivity, and 90% specificity. Notably, gene expression changes corresponded to reported symptoms. The greatest number of genes with altered expression was present in the acute LD group; symptoms were reported by 82% of all acute LD subjects in this study and by 93% of the 28 subjects whose blood was analyzed for gene expression. In contrast, return of the gene expression profile to that observed in the healthy donors corresponded with resolution of symptoms: only one 6-month LD convalescent subject (9%) reported having any symptom. Thus, the identified classifier set has the potential for serving as a test for disease resolution.
The algorithm used to generate the classifier gene set for acute B. burgdorferi infection was applied to published microarray data sets for PBMCs collected from patients with acute infections caused by three common bacterial pathogens or by influenza A virus. Importantly, all 20 classifier genes for acute B. burgdorferi infection were completely unique and were not associated with any of these four pathogens. Therefore, the gene classifier sets described here not only demonstrated high sensitivity for acute LD relative to healthy donors and convalescent LD patients, but the 20-gene classifier set for acute LD distinguished B. burgdorferi infection from the other tested bacterial or viral infections with 100% specificity.
In sharp contrast to the gene classifiers for the other three bacterial pathogens, the classifier gene sets for B. burgdorferi and influenza A infection were both characterized by an IFN-regulated signature, although the individual genes comprising each set were unique. IFI27 (interferon alpha inducible protein 27) is the classifier gene for influenza A that has the highest random forest importance value. IFI27 has been described in a separate study as a novel single-gene biomarker in patient blood that was able to discriminate, with 88% diagnostic accuracy and 90% specificity, between influenza virus-and bacterium-associated respiratory infections (31). We have previously demonstrated that B. burgdorferi induces numerous IFN-regulated genes in skin at the site a Genes are listed in order of random forest analysis importance level (highest to lowest). *, interferon-regulated gene. Genes that appear on the classifier list for more than one infectious agent are designated in boldface.
Diagnostic Gene Signature for Lyme Disease ® of an EM lesion (32), many of which were also dysregulated in Lyme disease patient PBMCs in the present study. Of note, the 20-gene classifier set for B. burgdorferi infection included 15 IFN-regulated genes; five were also significantly induced in EM skin biopsy specimens from patients with disseminated Lyme disease (32). Several of these genes encode proteins involved in pathogen recognition and phagocytosis, and antigen processing, including: the Fc gamma receptors FCGR1A and FCGR1B (Fc fragment of IgG, high-affinity 1a and 1b, receptor [CD64]), TNFSF10 (tumor necrosis factor [ligand] superfamily, member 10), and PSMB8 (proteasome subunit beta 8).
Interestingly, the classifier gene sets for infections caused by each of the other three bacterial pathogens evaluated were nearly devoid of IFN-regulated genes, with none associated with E. coli infection and one IFN-regulated gene each associated with S. aureus and S. pneumoniae infections. Collectively, these results confirm and extend our previous observation that B. burgdorferi elicits an IFN-dominated transcriptional signature during early infection, a sharp distinction from the immunological footprints generated by the other bacterial pathogens examined. In addition, the 20-gene classifier set clearly distinguishes B. burgdorferi infection from that caused by influenza A, although both pathogens potently stimulate the interferon signaling pathway (33). Bouquet and colleagues also examined the transcriptional profile in PBMCs of LD patients with EM before antibiotic treatment, 3 weeks later, and then 6 months after the completion of antibiotic therapy (34). There is general consensus between Bouquet et al. and a major finding of the present study: acute infection with B. burgdorferi elicits a distinct gene expression profile in patient blood that persists for at least 3 weeks after infection. However, in contrast to Bouquet et al., we observed that the majority of differentially regulated genes return to healthy donor levels by 6 months posttreatment. There are several differences between the two studies that might explain the discrepancies in the findings. The most significant difference may be in the patient population under investigation. The present study was restricted to subjects with definitive early disseminated LD. A total of 95% of enrolled LD subjects had either MEM (67%) and/or positive blood culture for B. burgdorferi (74%); the remaining subject had facial palsy, a sign of disseminated LD. The inclusion criteria of Bouquet et al. were less stringent and consisted of a physician-documented EM of Ͼ5 cm with at least one concurrent nonspecific symptom (headache, fever, chills, fatigue, and/or new muscle or joint pains). Cultivation of B. burgdorferi from any clinical samples was not reported, and only 43% of LD subjects had MEM. Of the 29 subjects with LD in the Bouquet et al. study, 8 did not seroconvert, and 1 was not tested. In addition to the enrollment criteria, the definition for altered gene expression differed between the studies; Bouquet et al. used a 1.5-fold change cutoff compared to the 2-fold change in the present study. Significantly, in the present study, random forest analysis was employed to build predictive models. Classifier gene sets that could separately distinguish healthy controls from patients with acute disseminated infection, and between such patients and those with resolved infection, were identified.
It is important to note the limitations of the current investigation. It was not completely longitudinal and included a relatively small sample size for the 6-month visit. This was primarily due to the fact that 6-month samples were not collected during the first 2 years of the study; the 6-month convalescent time point was added when it became apparent that transcript levels had not returned to normal by 1 month posttreatment. In addition, some study subjects were lost to follow-up, and some RNA samples did not meet the quality requirements for microarray hybridization. A sample size of 10, however, has proven to be sufficient for rigorous statistical comparison with earlier time points and with healthy donors in other studies (18). Since only one of the 10 subjects reported having any symptoms at 6 months, the small sample pool was insufficient for identifying potential transcriptome alterations associated with persisting symptoms. Another limitation is the specific focus on patients with definitive evidence of disseminated infection. An optimal diagnostic test for LD should be able to detect infection at its earliest stages, when B. burgdorferi is still localized to the skin. Current studies are under way to test the sensitivity of the diagnostic biomarker set using Petzke et al. samples from subjects with EM, but without evidence of dissemination. It should also be noted, that the use of published data sets rather than prospectively collected samples (as in Table 6) could potentially lead to artifacts in the cross-comparisons.
In conclusion, we report the development, using gene expression data, of an efficient computational framework to generate a 20-gene classifier set that detects disseminated B. burgdorferi infection with high sensitivity and specificity. This unique classifier set may have a critical advantage over current serologic tests in that it accurately discriminated between active and resolved infection. This computational approach offers the potential for more accurate diagnosis of early disseminated Lyme disease. It may also allow improved monitoring of treatment efficacy and disease resolution.

MATERIALS AND METHODS
Study subjects. All subjects were adult volunteers of at least 18 years of age and provided written informed consent prior to sample collection, in accordance with the study protocol approved by the Institutional Review Board of New York Medical College (NYMC). Healthy donors were recruited from NYMC staff, excluding members of the investigators' laboratories, and met the following inclusion criteria: no history of LD, no receipt of a Lyme disease vaccine, no evidence of a current infectious disease, not pregnant, and no usage of an immunosuppressive medication. Patients were recruited from the Lyme Disease Diagnostic Center of NYMC during the summer seasons of 2005 to 2006 and 2010 to 2013. Blood samples were collected at the time of diagnosis (acute LD) and at approximately 1 and 6 months after the initiation of a recommended course of antibiotics (7). Serologic testing of LD subjects for antibodies to B. burgdorferi was performed by a whole-cell sonicate ELISA. Serologic testing of healthy controls for antibodies to B. burgdorferi was performed once by IgG immunoblot. Analysis was restricted to samples collected from individuals with objective evidence of dissemination, most often based on the presence of multiple erythema migrans (MEM) skin lesions and/or the cultivation of B. burgdorferi from blood, as previously described (35).
Blood collection and RNA isolation. Venous blood was collected directly into BD-Vacutainer CPT tubes (Becton Dickinson, Franklin Lakes, NJ). PBMCs were isolated by centrifugation, according to the manufacturer's protocol, no later than 3 h after blood collection. PBMCs were washed with Hanks' balanced salt solution without calcium, magnesium, or phenol red (Gibco-BRL, Grand Island, NY), and RNA was isolated immediately thereafter under RNase-free conditions using the PureScript total RNA isolation kit (Gentra, Minneapolis, MN) or the Ambion ToTALLY RNA isolation kit (Life Technologies, Grand Island, NY), according to the manufacturers' instructions. Contaminating DNA was removed using the DNA-free kit (Ambion, Austin, TX). RNA was eluted in 20 l RNase/DNase-free water and stored at -80°C after the addition of 32 U of RNase inhibitor (Promega, Madison, WI). RNA integrity was assessed by electrophoresis using an Agilent Bioanalyzer 2100 (Agilent, Palo Alto, CA) prior to cDNA synthesis for microarray hybridization. Samples having an RNA integrity number below 6 were excluded from further analysis.
Microarray hybridization. Between 5 and 20 ng of total RNA from each PBMC sample was used to generate high-fidelity cDNA using an Ovation RNA amplification system (NuGEN Technologies, Inc., San Carlos, CA) according to the manufacturer's protocol. The amplified cDNA was fragmented to 50 to 100 nucleotides, labeled with biotin, and hybridized to the Affymetrix GeneChip.HG-U219 high-density oligonucleotide array (Affymetrix, Santa Clara, CA). After hybridization, the arrays were stained with streptavidin-phycoerythrin and washed in an Affymetrix fluidics module using standard Affymetrix protocols. The detection and quantitation of target hybridization was performed using a GeneArray Scanner 3000 (Affymetrix). All procedures were performed at the Bionomics Research and Technology Center, Rutgers University, Piscataway, NJ.
Microarray data analysis. Microarray data were analyzed using GeneSpring GX14.9 software (Agilent Technologies, Santa Clara, CA). Raw expression values in CEL file format were normalized by robust multiarray analysis (RMA) and quantile normalization, filtered to include only those with intensity values above the 20th percentile, and baseline transformed to the median of all samples. Statistical analysis was performed using one-way analysis of variance with Benjamini-Hochberg multiple testing correction to reduce false positives (36). Differentially expressed transcripts, defined as those having a P value of Ͻ0.05 and a fold change of at least 2 relative to the healthy donor group, were subjected to hierarchical clustering and principal-component analysis.
Predictive modeling. A generic predictive modeling framework was developed and applied to two comparisons: acute LD (n ϭ 28) versus healthy donors (n ϭ 21) and acute LD versus 6-month convalescent LD (n ϭ 10). In the first step, the distribution of the gene expression variance across all experimental groups was computed, and genes with variance at or above the 90th percentile were identified. This threshold is a parameter of the framework and can be appropriately set based on the variance distribution in a considered cohort of samples. In the second step, expression data containing the top 10% of variance in each experimental group were subjected to iterations (n ϭ 50) of random forest analysis, a well-established machine learning algorithm (37). An importance value for each gene was generated following each iteration of random forest analysis, and a final importance value for each gene was computed by averaging the importance values across all 50 iterations. Averaged importance values were used to rank all top selected genes. Finally, for each experiment, leave-one-out predictive modeling Diagnostic Gene Signature for Lyme Disease ® was performed, as well as tested using incrementally expanding sets of the most significant genes (top 20 through top 2004), to assess the changes in accuracy performance across different sets of predictors.
Comparison of classifier genes for LD and other infectious diseases. Microarray-based transcriptome data set GSE6269, containing gene expression profiles from PBMCs from patients with acute infections due to Escherichia coli, Staphylococcus aureus, Streptococcus pneumoniae, or influenza A virus (17) was downloaded from the GEO database and subjected to random forest analysis using the same framework and parameters that were applied to the LD data.
Data availability. The transcriptome data obtained in this study have been submitted to the Gene Expression Omnibus (GEO) data repository under accession number GSE145974.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. TABLE S1, DOCX file, 0.04 MB.