Quantitative Framework for Model Evaluation in Microbiology Research Using Pseudomonas aeruginosa and Cystic Fibrosis Infection as a Test Case

Laboratory models have become a cornerstone of modern microbiology. However, the accuracy of even the most commonly used models has never been evaluated. Here, we propose a quantitative framework based on gene expression data to evaluate model performance and apply it to models of Pseudomonas aeruginosa cystic fibrosis lung infection. We discovered that these models captured different aspects of P. aeruginosa infection physiology, and we identify which functional categories are and are not captured by each model. These methods will provide researchers with a solid basis to choose among laboratory models depending on the scientific question of interest and will help improve existing experimental models.

IMPORTANCE Laboratory models have become a cornerstone of modern microbiology. However, the accuracy of even the most commonly used models has never been evaluated. Here, we propose a quantitative framework based on gene expression data to evaluate model performance and apply it to models of Pseudomonas aeruginosa cystic fibrosis lung infection. We discovered that these models captured different aspects of P. aeruginosa infection physiology, and we identify which functional categories are and are not captured by each model. These methods will provide researchers with a solid basis to choose among laboratory models depending on the scientific question of interest and will help improve existing experimental models.
KEYWORDS Pseudomonas aeruginosa, model, transcriptomics, cystic fibrosis, infection F or over a century, microbiologists have relied on laboratory models to study pathogenic bacteria (1,2). Due to obvious ethical prohibitions on human experimentation, laboratory infection models have become a cornerstone in bacterial pathogen research. These models range in complexity from standard laboratory media, to in vitro models specifically designed to mimic infection, to the most complex class of models, animal hosts.
There are typically a range of laboratory models used to study any given infection type. These model systems are selected based on a balance between each model's perceived strengths and limitations. For instance, in vitro models are often inexpensive and highly controllable, whereas animal models are thought to capture important aspects of human pathogenesis, such as host immunity and tissue structure, which can be difficult to reproduce in vitro (2). The accuracy of a laboratory model is likely shaped by several factors, including the bacterial genotype used (3), the chemical and physical environment the bacteria are exposed to (4), and even the mode of inoculation (5). Although most researchers are aware of these considerations, there is no clear framework for deciding which model best addresses a given research question. Until recently there have been insufficient data on bacterial behavior and physiology in clinical infections to effectively evaluate laboratory model performance, and beyond this limitation, there has been no formalized framework to do so. The lack of a systematic framework for model selection has left researchers to rely on intuition or ad hoc rationalizations for selecting their model. We address this issue by proposing a framework to evaluate the accuracy of human infection models using RNA sequencing (RNA-seq) data. RNA-seq provides a snapshot of pathogen gene expression, giving a rare glimpse of bacterial behavior and physiology in a natural, unmanipulated human infection environment. The present study was motivated by a desire to evaluate the accuracy of models used to study Pseudomonas aeruginosa infection of the lungs of cystic fibrosis (CF) patients. CF is a recessive genetic disease caused by mutations in the gene encoding the cystic fibrosis transmembrane conductance regulator (CFTR), an ion channel that conducts chloride and bicarbonate across epithelial cell membranes. Mutations in CFTR result in accumulation of viscous mucus (sputum) in an individual's lungs, which is subsequently colonized by P. aeruginosa and other bacteria (6,7). After colonization, P. aeruginosa can use lung sputum as a carbon and energy source (8)(9)(10), allowing it to grow to high densities and persist throughout a patient's life.
One of the challenges to studying CF infection biology has been the lack of robust in vivo and in vitro model systems. Numerous murine models have been developed to study P. aeruginosa lung infection, many of which use mice with wild-type CFTR (11)(12)(13)(14)(15)(16). Models using mice with mutant CFTR have also been developed (17), although these models generally show high-level resistance to P. aeruginosa infection, which is likely due to specific aspects of mouse lung physiology (18). Several in vitro models have also been developed, including one in which P. aeruginosa is inoculated on the apical surface of CFTR mutant human airway epithelial cells that have been differentiated at the air-liquid interface (19). While clearly a simplified model compared to the mouse, this model has been used to study coinfection (20) and has the advantage of studying the host-pathogen interaction while maintaining experimental versatility. Other in vitro model systems do not include host cells but are instead meant to mimic the chemical and physical aspects of expectorated CF sputum. These models include a defined synthetic CF sputum medium (SCFM2) that both mimics the chemistry and viscosity of CF sputum (9,10,21,22). P. aeruginosa grown in synthetic sputum has a similar gene expression signature to P. aeruginosa grown in human CF sputum in vitro (22), and P. aeruginosa requires similar genes to grow in SCFM2 as it does in expectorated human CF sputum (23).
Though the relevance of the models outlined above has been rationalized for the study of CF infection, it is likely that these models do not capture all aspects of P. aeruginosa physiology in the CF lung. To further explore this possibility, we analyzed P. aeruginosa transcriptomes from human CF sputum samples taken immediately following expectoration from two CF clinics, one in Copenhagen, Denmark, and one in Atlanta, GA. We propose a computational framework that uses these transcriptomic data, along with transcriptomic data from CF infection models, to assess the overall model accuracy for reproducing P. aeruginosa CF lung physiology. In addition to assessing overall accuracy, we used gene expression to infer P. aeruginosa biological functions that are and are not reproduced in each model. Our results revealed that SCFM2 and an in vitro CF epithelial cell model mimicked the P. aeruginosa transcriptome in expectorated human sputum better than other models tested, including a mouse lung infection model currently used to study CF lung infections. Although the models differed in overall accuracy, we found that each model reproduces particular P. aeruginosa functions present in CF lung infections and, further, that some functions were not reproduced by any models we tested. The framework we propose is a step toward a grounded, evidence-based approach for selection of an infection model based on the function(s) of interest and for identifying strategies for model improvement.
between human and in vitro transcriptomes occurs primarily along the first principal component, which accounts for approximately 32% of the overall variance in the transcriptomes. Genes contributing most to the differences in the first principal component included several genes with higher expression in the CF sputum transcriptomes than in in vitro ones, such as llda (L-lactate dehydrogenase), genes involved in alginate production (algA and alg8), and a heme uptake receptor (phuR). In addition, genes with higher expression in vitro also contributed significantly to the first principal component included an aerotaxis methyl-accepting chemotaxis protein (aer2), elastase (lasB), and a gene encoding a quinolone signal response protein (pqsE).
While sputum samples from the United States and Denmark clinics were shown to be similar in Fig. 1, a PCA conducted only among the sputum samples demonstrates a separation between transcriptomes from the two clinics (see Fig. S2 in the supplemental material). Consistent with this separation, differential expression analysis indicates several differences in P. aeruginosa gene expression between the two clinic populations (see Data Set S2 in the supplemental material). The genes with the largest expression differences between clinics had no reads mapping for many samples, such as the AraC-like transcriptional regulator vqsM that had no reads mapping to it in any sputum transcriptomes except for three samples from the U.S. clinic. Because it is difficult to determine whether these differences were caused by strain differences in gene content rather than differences in regulation, we restricted the differential expression analysis to the 3,225 genes that had at least one read mapping to them in all CF sputum transcriptomes. The most differentially expressed genes with this approach were often not consistently different between the clinics. For instance, the gene most differentially expressed between the two clinics was the autoinducer synthase lasI (expressed at ϳ26-fold-higher levels in the Denmark samples, P ϭ 9.5 ϫ 10 Ϫ5 ), which encodes a protein that produces a quorum sensing signal in P. aeruginosa. However, the magnitude of this effect is primarily due to relatively high expression in just two of the seven Denmark samples. Other highly differentially expressed genes include dctA (C4dicarboxylate transport protein, higher in the Denmark sputum samples), lptG (lipopolysaccharide export system permease protein LptG, higher in the Denmark sputum samples) and flgI (flagella P-ring protein precursor, higher in the U.S. sputum samples). We also found that the beta-lactamase precursor ampC was expressed at 6.7-foldhigher levels among the Denmark samples. P. aeruginosa metabolism in CF sputum. One of our major interests is microbial metabolism during infection, and the 20 CF sputum transcriptomes provided an opportunity to examine P. aeruginosa metabolism in the CF lung. To accomplish this, we compared the CF sputum transcriptomes to a well-characterized laboratory environment in which cellular metabolism is well understood. Specifically, we compared the sputum transcriptomes to transcriptomes of the reference strain PAO1 grown planktonically with vigorous shaking to mid-logarithmic phase in a well-defined MOPSbuffered medium with succinate as a sole carbon source (4). The sputum samples showed several indications of lower oxygen levels compared to growth in MOPSsuccinate, including higher expression of denitrification operons nor, nir, nar, nos, and nap. In addition the sputum transcriptomes displayed relatively high expression of genes encoding the high-affinity cyanide insensitive terminal oxidase (cioAB). However, genes encoding the low-affinity cytochrome o ubiquinol oxidase (cyoABCDE), which is not critical for growth in low oxygen concentrations (Ͼ2%), were also highly upregulated in sputum. Decreased expression of genes encoding enzymes for several decarboxylation steps of the tricarboxylic acid cycle (via sucA, sucC, and lpd), together with increases in aceA and glcB expression, suggest an increased flux in carbon through the glyoxylate shunt among sputum samples. Also, consistent with previous work, we observed higher expression of lldA in sputum samples, which is involved in L-lactate catabolism, as well as greater expression of zinc uptake and transport genes (znuB, znuC, and zur) (24).
To identify additional differences in metabolic activity between the MOPS-succinate and sputum transcriptomes, we performed gene set enrichment analysis (GSEA) with Pseudocyc gene annotations (25) (see Materials and Methods for details). Compared to those of the MOPS-succinate samples, the sputum sample transcriptomes showed enrichment of genes with lower expression in sputum among pathways involved in the synthesis of several amino acids and intermediate products including threonine (P ϭ 3.5 ϫ 10 Ϫ3 ), homoserine (P ϭ 2.9 ϫ 10 Ϫ3 ), leucine (P ϭ 1.7 ϫ 10 Ϫ2 ), isoleucine (P ϭ 1.9 ϫ 10 Ϫ2 ), histidine (P ϭ 7.0 ϫ 10 Ϫ3 ), and glutamate (P ϭ 2.9 ϫ 10 Ϫ3 ). We also performed an enrichment analysis using TIGRFAM "function" categories and discovered that genes encoding TonB dependent receptors were enriched for greater expression in the sputum, including siderophore receptors (P ϭ 1.2 ϫ 10 Ϫ3 ) (consistent with an increased expression of genes involved in synthesis and regulation of the siderophores pyoverdine and pyochelin), as well as heme/hemoglobin/transferrin/lactoferrin receptors (P ϭ 4.2 ϫ 10 Ϫ3 ).
In addition to differences in metabolic activity between the sputum samples and MOPS-succinate samples, the DNA-damage stress response regulator lexA and genes encoding reactive oxygen species-scavenging enzymes (ahpB, ahpC, ahpF, katB, sodM, and ohr) were expressed at substantially higher levels in the sputum samples. Consistent with previous work (3), genes for choline and L-carnitine degradation to the osmoprotectant glycine betaine (opuCD, betA, betB, and cdhC) were highly expressed in sputum, as well as the transcriptional repressor betI, but gbt, which is required to use choline and glycine betaine as a carbon source, was expressed less in sputum than in MOPS-succinate. This result may indicate that glycine betaine is primarily used as an osmoprotectant rather than for carbon and nitrogen acquisition (3).
Framework to evaluate infection models. While differential expression analysis was useful in understanding basic differences in gene expression between CF sputum and specific laboratory conditions such as MOPS-succinate, our goal was to develop an explicit framework to evaluate how well different aspects of any experimental model system mimic the "target" system (P. aeruginosa in human CF sputum in this case). An ideal evaluation framework should have the following characteristics: (i) it should provide a simple biological interpretation that allows for a straightforward comparison of models, (ii) it should be able to determine both the genome-wide accuracy of the experimental model and the model's accuracy for any functional category of interest, and (iii) it should not be inherently dependent on the number of available samples (as with differential expression analysis, where additional samples lead to increased statistical power and thus more genes being called as differentially expressed).
We propose a framework based on the number of standard deviations in normalized expression for each gene the model is from the mean expression among target transcriptomes. In the first step, we calculate the mean and standard deviation of normalized read counts for each gene among target transcriptomes. Then, for each model system, we average the expression levels of each gene among the replicates and calculate a z-score, which is the number of standard deviations in expression that the model is above the mean observed among the target transcriptomes. We use the absolute value of each gene's z-score as an indication of how similarly the gene is expressed between the model and target. From this perspective, one can ask for any model system, what fraction of a study organism's genes are within two standard deviations (for example) of the mean expression in the target transcriptomes or, similarly, one can ask how many standard deviations from the target transcriptome mean are required in order to include 95% (for example) of the organism's genes. These perspectives are complementary; however, in practice we find the former to be more intuitive and here define an "accuracy score" based on it. A model's accuracy score (AS) is the fraction of the organism's genes that are within a specified number of standard deviations from the target. For example, if a model has an AS 2 of 80%, then the expression of 80% of the model's genes fall within two standard deviations of the means of the gene among the target transcriptomes. An AS 2 of 90% would indicate a more accurate model because 90% of the model's genes are expressed at levels within two standard deviations of the target's means.
We first used this framework to provide a genome-wide evaluation of the accuracy of P. aeruginosa strain PAO1 grown planktonically in MOPS-succinate as a model for CF lung infection. Again, we began with MOPS-succinate since it provides a well-defined growth condition for our initial comparisons. Figure 2A shows the percentage of genes whose average normalized expression among replicates is within each standard deviation cutoff of interest. The AS 2 of PAO1 in MOPS-succinate is approximately 82% (meaning 82% of the PAO1 genome has expression in MOPS-succinate that is within two standard deviations of the expression in the sputum samples). As a control, we also calculated the performance of resampled human CF sputum P. aeruginosa transcriptomes by randomly choosing pairs of the 20 CF sputum transcriptomes, averaging the normalized expression for each gene and then calculating the standard deviations from the mean expression in these genes calculated using all clinical sputum samples except the chosen two being treated as the "model." We repeated this process 100 times to obtain a mean value and a 95% confidence interval ( Fig. 2A). The average AS 2 for these resampled human CF sputum transcriptomes was 97%, indicating that randomly chosen human CF sputum transcriptomes performed significantly better than the MOPS-succinate transcriptomes (P ϭ 1.6 ϫ 10 Ϫ5 [t test]).
This approach can be applied to any functional category, pathway, or individual gene of interest. To obtain functional resolution of the accuracy assessment, we used the TIGRFAM functional gene annotation database, wherein annotated genes are assigned to have at least one "main role," "sub role," and "function," forming a hierarchy of increasing specificity. We first focused on the accuracy of metabolic genes in PAO1 growing in MOPS-succinate (Fig. 2B). This analysis demonstrates that MOPS-succinate mimics P. aeruginosa gene expression in sputum for some metabolic functions such as "biosynthesis of cofactors, prosthetic groups, and carriers," with an AS 2 of approximately 89%, whereas others such as amino acid biosynthesis (AS 2 ϭ 71%) are poorly mimicked. The latter aligns well with our differential expression and enrichment analysis above, which indicated that several amino acid biosynthetic pathways were differentially regulated between CF sputum and MOPS-succinate.
We then took a more expansive view, calculating the model performance for the genes within every TIGRFAM category (Fig. 2C). We added an additional, overarching level to the standard TIGRFAM hierarchy called "meta roles," as previously described (26). By calculating these values for different levels of the TIGRFAM hierarchy, we could determine whether categories were influenced by a minority of subcategories or whether genes across these categories scored similarly. For instance, we identified amino acid biosynthesis as a poor performing category (AS 2 ϭ 71%), and Fig. 2C shows that the sub roles "pyruvate family" (AS 2 ϭ 55%) and "histidine family" (AS 2 ϭ 50%) reduce the overall "amino acid biosynthesis" score substantially, whereas the "aromatic amino acid family" performs better (AS 2 ϭ 80%). Similarly, though the "protein synthesis" category scores poorly overall (AS 2 ϭ 58%), this poor performance is predominantly restricted to genes involved in the synthesis and modification of ribosomal proteins.
Accuracy of several models used to study P. aeruginosa lung infection. We then applied this framework to several additional experimental models: lysogeny broth (LB), a mouse pneumonia model (13), synthetic sputum medium (SCFM2) (4), and the in vitro CFTR ΔF508 CFBE41omutant polarized airway epithelial cell model (27). In order to reduce the impact of strain differences on the results and assess how well a common laboratory strain captures P. aeruginosa physiology in the CF lung, we only compared data from laboratory systems inoculated with the reference strain PAO1. It has previously been shown that PAO1 strains can differ both genotypically and phenotypically (28,29); although the MOPS-succinate (4), SCFM2, airway epithelial cell model, and LB (30) experiments used PAO1-UW (31, 32), the mouse experiments used a PAO1 strain from the Vasil lab (13). Replicates for all models clustered closely according to the z-score across the P. aeruginosa genome, indicating that these models have high reproducibility (Fig. S3).
We began by comparing the genome-wide accuracy for all five experimental models (Fig. 3). The two models that were explicitly designed to mimic CF lung infection, SCFM2 and the CF airway epithelial cell model, had the highest raw AS 2 values (86 and Percentages of PAO1 genes (x axis) within different numbers of standard deviations of the mean expression in human CF sputum (y axis). For each gene, the mean and standard deviation of normalized read counts among the sputum samples were calculated. The mean expression for each gene in MOPS-succinate was then determined, and a z-score (the number of standard deviations each gene is from the mean expression in CF sputum) was calculated. For reference, we performed the same procedure on 100 randomly resampled pairs of human CF sputum samples to provide a robust assessment of the variance in these samples (sputum resampled). The mean and 95% confidence interval for each of these resampled values for each gene are shown. Any genes with values over 10 standard deviations from the sputum mean are not shown. (B) An accuracy metric was calculated for PAO1 grown in MOPS-succinate for all TIGRFAM "metabolism" meta roles. (C) AS 2 for each TIGRFAM meta role, main role, and sub role category for PAO1 grown in MOPS-succinate. The color in the middle represents the AS 2 for all PAO1 genes (those with or without TIGRFAM designations). The next level out from the middle of the circle contains "meta roles," the next contains "main roles," and the outermost layer contains "sub roles." The area of each category is proportional to the number of genes in that category. 84%, respectively) (Fig. 3B). These models were followed by MOPS-succinate (82%), then the mouse pneumonia model (81%), and finally LB (80%) (Fig. 3B). However, only SCFM2 showed statistically significant improvement over other models: LB (P ϭ 0.015), the mouse pneumonia model (P ϭ 0.018), and MOPS-succinate (P ϭ 0.073) (pairwise t test using a Bonferroni adjustment for multiple tests). All models performed worse than the resampled human CF sputum (Fig. 3A), indicating that there is still room for improvement for every model. The z-scores for all genes across all models tested is available in Data Set S2.
Because SCFM2 was specifically designed to mimic the metabolism of P. aeruginosa growth in CF sputum, we focused on the accuracy of metabolic genes in PAO1 growing in SCFM2 (Fig. 4A). We also evaluated the other explicit CF model, the CF airway epithelial cell model, wherein the bacteria acquire nutrients partially from the airway epithelial cells (Fig. 4B). The metabolic categories that SCFM2 captured best were "purines, pyrimidines, nucleosides, and nucleotides," which the CF airway epithelial cell model captures considerably worse. On the other hand, SCFM2 performed worst at mimicking fatty acid and phospholipid metabolism, a category for which the CF airway epithelial cell model performed markedly better. As before, we then expanded our view to all TIGRFAM categories by calculating the AS 2 of genes within every TIGRFAM category ( Fig. 4C and D). Though SCFM2 captures "transport and binding proteins" well overall (AS 2 ϭ 82%), some subcategories, such as "porins" are poorly mimicked (AS 2 ϭ 62%) (Fig. 4C). Porins were captured better in the CF airway epithelial cell model (AS 2 ϭ 70%), but genes involved in "synthesis and modification of ribosomal  Fig. 2A. For reference, we performed the same procedure on 100 randomly resampled pairs of human CF sputum samples to provide a robust assessment of the variance in these samples (sputum resampled). The mean and 95% confidence interval of these resampled values for each gene is shown. Any genes with values over 12 standard deviations from the sputum mean are not shown. (B) Table containing the accuracy scores (AS 2 ) and number P. aeruginosa genes in each model not within two standard deviations of the mean in CF sputum (genes missed). The genes are divided into "all genes," "known," and "unknown." "Unknown" refers to genes that have a TIGRFAM "main role" with either no category designation, have a "main role" annotated as "unknown function," or are not annotated in the TIGRFAM database. We calculated pairwise t tests between sample types using genome-wide AS 2 scores for individual replicates in each sample type, with a Bonferroni adjustment for multiple tests. The most significant comparisons between model types were for SCFM2 compared to LB (P ϭ 0.015), the mouse pneumonia model (P ϭ 0.018), and MOPS-succinate (P ϭ 0.073). All other model pairs had adjusted P values of Ͼ0.2.
Cornforth et al. proteins" performed worse than in SCFM2 (AS 2 of 58% versus 83%). To ensure that our analysis was not biased by sequencing depth, we repeated the basic analysis with all samples resampled down to 100,000 reads, which did not qualitatively affect our results (Fig. S4). Improvements in accuracy by combining models. Figure 3 shows that, as expected, no tested model perfectly mimics the gene expression of P. aeruginosa CF sputum infections. An obvious question is whether each model misses the same genes, or whether each model has different limitations. This is a critical question since it will determine whether a CF infection researcher can study nearly any gene of interest by selecting the appropriate PAO1 model. To answer this question, we assessed the number of genes in each model not within two standard deviations of the mean in CF sputum, individually and in each possible combination (Fig. 5). In combination, the SCFM2 and acute mouse model outperforms all other pairs of models; only 358 genes are not mimicked by either the SCFM2 or the acute mouse model (compared to LB and the acute mouse model for instance which failed to capture 692 genes). There were 211 genes that were missed by every model we studied. These "elusive genes" include several genes whose expression is known to change via mutation that are common in P. aeruginosa lung-adapted strains, including genes involved in alginate production, pilus biosynthesis, and multidrug efflux.
Since it is unsurprising that some of these genes are poorly captured by the strain PAO1 which lacks these common CF lung mutations, we next tested whether a CF  Fig. 2A). (C and D) AS 2 for each TIGRFAM meta role, main role, and sub role category for SCFM2 (C) and the CF airway epithelial cell infection model (D). The color in the middle represents the AS 2 for all PAO1 genes (those with or without TIGRFAM designations). The next level out from the middle of the circle contains "meta roles," the next contains "main roles," and the outermost layer contains "sub roles." The area of each category is proportional to the number of genes in that category.

P. aeruginosa Infection Model Assessment
® clinical strain, LESB58-SED21, which is well adapted to the CF lung, would perform better than PAO1 when grown in SCFM2. Interestingly, expression of 51 of the elusive 211 genes, including genes involved in the processes mentioned above (alginate production, pilus biosynthesis, and multidrug efflux), was within two standard deviations from sputum transcriptome means when LESB58-SED21 was grown in SCFM2 (Data Set S2). Further, the LESB58-SED21-in-SCFM2 model captured more genes than all models evaluated in Fig. 3 (AS 2 ϭ 89%, missing 579 genes shared with PAO1 [Data Set S2]). This result indicates that a CF clinical strain mimics certain features of P. aeruginosa CF lung infection better than PAO1, although it should be noted that PAO1 grown in model systems mimics the CF sputum gene expression of the majority of P. aeruginosa genes.
Technical considerations of the framework. A few details of our model evaluation framework are worth further exploration. First, we have normalized the read counts with the "variance stabilizing transformation" (VST) that is implemented in DESeq2, but other normalization methods are commonly used. We repeated the analysis with two other common normalizations, the "regularized log" (rlog) transformation also implemented in DESEq2, as well as the "trimmed mean of M values" (TMM) method as  implemented in edgeR (33) (Fig. S5). The results were similar in all approaches; however, the standard deviations of sputum sample gene expression were greater using the TMM method, and though the qualitative results were similar (e.g., the SCFM2 and the CF airway epithelial cell model performed best), there was also a greater separation between the accuracy of sample types. We repeated the same process for metabolic subcategories of the MOPS-succinate model shown in Fig. 2B. Again, the qualitative results were similar for the three normalizations: "amino acid biosynthesis" performed poorly, and "biosynthesis of cofactors, prosthetic groups, and carriers" performed well (Fig. S5). As before, there was a greater spread among categories with the TMM normalization than with the two DESeq2 normalizations. Thus, while the normalization method influences model accuracy scores, the qualitative outcomes were very similar. A second interesting consideration is whether genes that fell within a chosen standard deviation threshold actually had similar expression between the model and the sputum samples or whether they fell within this margin due to high variance among the sputum samples. To address this issue, we reanalyzed the data, restricting analysis to genes with low noise (with a low coefficient of variation, or the standard deviation divided by the mean, among the sputum samples). We found a small but statistically significant negative correlation between the coefficient of variation of expression for all genes among the clinical samples and number of standard deviations they are from the mean (r ϭ Ϫ0.05, P ϭ 2.2 ϫ 10 Ϫ4 ). Consistent with this finding, restricting the assessment to genes with low noise did decrease the genome-wide accuracy scores across the models, but it had a negligible effect on the relative accuracy scores of models compared to each other (Fig. S6). This result is likely because genes with noisy expression among the sputum samples increase the accuracy scores similarly for all models.

DISCUSSION
Despite microbiology's heavy reliance on laboratory models, their accuracy has not been systematically evaluated. As a result, models are typically selected based on a researcher's intuition, a laboratory's expertise, or on limited experimental evidence, rather than on a solid biological rationale. Taking advantage of recent innovations in RNA-seq and sample preparation procedures that enable sequencing of human CF lung infection samples (24), we have begun to address this gap by focusing on a set of models used to study P. aeruginosa CF lung infections. By comparing 20 P. aeruginosa transcriptomes from human CF sputum, collected from clinics in the United States and Denmark and preserved immediately after expectoration, to several transcriptomes from commonly used laboratory models, we found that all models differed from the CF infections in important ways (Fig. 1). We propose a framework based on the deviation in expression among P. aeruginosa genes between model systems and human CF infection to provide an easily interpretable gauge of model performance (Fig. 2). Different models excelled at mimicking distinct biological functions in CF sputum (Fig. 3), and thus by combining the models we were able to accurately represent the expression of over 96% of P. aeruginosa PAO1 genes (Fig. 4). However, there were 211 genes that could not be captured by laboratory models using PAO1, but many of these genes could be captured when using a clinical strain.
Our initial survey of laboratory CF models already provides several useful insights. Surprisingly, gene expression in all models, even those such as LB that were not specifically designed to mimic P. aeruginosa CF infection, were similar overall to that in the CF lung. Indeed, over 80% of genes expressed in all models were within two standard deviations of the mean in CF sputum (Fig. 3B). These data support the notion that growing the lab strain PAO1 planktonically in LB is a viable model system for studying many aspects of CF lung infection and that many P. aeruginosa genes do not vary significantly in expression regardless of the genotype and growth environment. Also, somewhat surprisingly, the murine model performed no better than the in vitro models we tested. Murine infection models have become the gold standard for laboratory models because they are thought to approximate the chemical and physical environment of human infections, especially in terms of the host immune response (34). However, overall P. aeruginosa CF physiology is better captured by SCFM2 than by the murine lung infection model we tested (Fig. 3). Although it is clear that SCFM2 performs better overall than the mouse model by our accuracy score approach (Fig. 3B), there are functional categories for which the mouse model outperforms SCFM2; for instance, "porins" are more accurately represented in the mouse model than in SCFM2. Thus, while assessing overall accuracy scores is important, it is critical to also assess functional categories, pathways, and genes since it is likely that even the best overall models will not be superior in all cases. It is also important to point out that the mouse pneumonia model we evaluated here may be more accurate for non-CF pulmonary infections because it is an acute, rather than chronic, infection model. This acute murine model, when combined with the SCFM2 model, had expression within two standard deviations of the clinical sample means for all but 358 P. aeruginosa PAO1 genes (Fig. 5), better than any other model pairing.
Our work also identified 211 genes that were not expressed similarly to CF sputum when P. aeruginosa PAO1 was grown in any of the tested models. The expression of many of these "elusive genes" is known to change due to mutations accumulated during chronic lung infection, including genes involved in twitching motility, alginate production, and multidrug efflux. The fact that 51 of the 211 "elusive" genes (including most genes in these categories) were captured using the CF clinical isolate LESB58-SED21 grown in SCFM2 indicates that laboratory CF models may be improved for specific functional categories merely by using clinical strains rather than standard reference strains. It should be pointed out that LESB58-SED21 is a Liverpool epidemic strain from the United Kingdom (35) and was not an infecting strain in the CF sputum samples used in this study. Thus, it is not necessary to use a strain collected from the same clinic as the sputum samples in order to mimic CF lung-adapted gene expression profiles. Finally, while it is clear that using a clinical strain can be advantageous for studying specific functions such as mucoidy, this again does not seem critical for many aspects of P. aeruginosa physiology since the LESB58-SED21 strain in SCFM2 did only somewhat better than PAO1 in SCFM2 (missing 579 versus 681 genes shared by the two strains, P Ͼ 0.05 [t test]).
Our framework also provides strategies in addition to using CF-adapted strains for improving model systems. For example, SCFM2 performed poorly in the "polyamine biosynthesis" sub role, with some of the genes involved in biosynthesis of the polyamine spermidine expressed higher in SCFM2 than in CF sputum (speD and speE). Since SCFM2 does not contain spermidine, one can hypothesize that the addition of spermidine to SCFM2 would result in reduced expression of genes in the "polyamine biosynthesis" sub role, thus yielding a more accurate model. While this approach may currently be most useful for genes that respond to known regulatory cues, such as genes encoding well-understood biosynthetic and catabolic processes, we anticipate it will be useful for genes of unknown function as additional transcriptomic data become available to inform approaches such as gene interaction networks and functional annotations.
Of course, as we and others have proposed (36)(37)(38), interactions between microbes may also be an important modulator of P. aeruginosa gene expression in the CF lung. Although there is no definitive evidence of microbial interactions in the human CF lung, our approach will also be useful to determine whether the presence of commonly cooccurring microbes can improve the accuracy of P. aeruginosa model systems. In particular, we hypothesize that the addition of other microbes to our models will allow many of the elusive genes to be better mimicked, ultimately providing evidence for interactions in the CF lung. Finally, the addition of human cells such as neutrophils to the in vitro model systems may provide a step forward in defining the signals and cues that drive P. aeruginosa gene expression in the CF lung.
As more transcriptomes become available for each model, we will be able to assess not only the mean accuracy score of any set of genes but also the distribution of these scores, as sometimes the mean will miss interesting features of transcriptome variation. For simplicity, we have used the z-score of normalized counts, which is based on the normal distribution, but other distributions are also viable. For instance, the t-distribution is typically used in place of a normal distribution when the number of samples is small; however, we used z-scores because the number of available CF sputum sample transcriptomes is quickly growing, and we also wanted to avoid the accuracy score explicitly depending on the number of available transcriptomes. Another distribution commonly used in gene expression analysis is the negative binomial (39), but we felt the normal distribution was more appropriate as an initial framework. Also, we have focused on genes whose expression in the models fall within two standard deviations (AS 2 ) of the mean in the clinical population primarily out of convention, since two standard deviations for each gene encompasses expression in ϳ95% of the CF sputum samples. However, there may be instances in which it is valuable to be more stringent. For example, if a gene or function is mimicked by a large number of model systems, one could further explore the models using more stringent criteria such as AS 1.5 (although this would have a false-negative rate of approximately 13%). Ultimately, the ease with which our framework can be adapted provides researchers with the ability to rapidly and quantitatively compare multiple models for any trait of interest. Lastly, we expect that other approaches, including proteomics, metabolomics, or microscopy, will eventually be integrated into a comprehensive model evaluation approach.
Conclusions, caveats, and future directions. We have proposed a simple computational framework that can be used to aid experimentalists in selecting laboratory infection models. For example, if one were studying bacterial metabolism in relation to CF infection, SCFM2 is a better model than the airway epithelial cell model ( Fig. 4A and  B). However, if one were specifically studying fatty acid and phospholipid metabolism, then the epithelial cell model may be a better choice. Similarly, our results suggest that using clinical isolates may be the only way to accurately reproduce the gene expression profiles for some genes.
We focused on CF lung infection models because of the availability of clinical samples and developed models; however, we did not conduct an exhaustive characterization of CF infection models. Such an evaluation would require systematically sweeping a range of important experimental variables, including the strain or host genotype, coinoculated microbial community, the physiology of the bacteria before inoculation, and the time point after inoculation that the sample is taken. Any of these factors may impact bacterial physiology and behavior, and we consider each separate perturbation to be a different experimental model. For instance, here, we focused on in vitro samples collected during mid-logarithmic growth, but LB medium samples taken at 7 h versus 10 h may have distinct gene expression signatures, and so we consider these different models. It is also important to note that different laboratories may conduct in vitro experiments slightly differently even when using the "same" experimental system and protocol. Further, differences in library prep processes can potentially impact accuracy score calculations and comparisons between models. Some library prep kits are stranded, and others are nonstranded; in the present study, we compared models without using strand information because not all samples were prepped with stranded RNA-seq library prep kits. Lastly, downstream analysis, including the software and normalization method used, may also impact the accuracy score of a system. All of these issues must be controlled for before definitive conclusions can be reached about the superiority of one model over another for a particular biological question.
Since Robert Koch's first use of guinea pigs as a model for TB infection, microbiology has relied on a range of laboratory models (40). However, there is no system to comprehensively evaluate a model's accuracy. The framework we propose here is a step toward a general model evaluation framework that is applicable to any microbial model and will only become more powerful as functions for unknown genes are discovered. Our approach can also easily be extended in the future to community-wide functionality in polymicrobial communities, rather than simply the functions of its individual members.

MATERIALS AND METHODS
Data. SRA accessions for the raw reads from all analyzed sequencing files are provided in Data Set S1 in the supplemental material. Data that have not been published have been uploaded under accession number PRJNA576508. Expectorated CF sputum samples for this study were collected in RNAlater from Emory-Children's Center for Cystic Fibrosis and Airways Disease Research by the Cystic Fibrosis Biospecimen Laboratory as previously described by our group (24) with IRB approval (Georgia Tech approval H18220).
RNA extraction and preparation of sequencing libraries for RNA-seq. In vitro and human samples were prepared as previously described (24) with a few modifications for the human samples. For the human sputum samples, expectorated sputum was collected from adult patients that were clinically stable and immediately added to RNAlater and stored at 4°C overnight and then at -80°C. Samples in RNAlater were thawed on ice and centrifuged at 4°C for 30 min at 10,000 ϫ g. RNAlater was removed from the sample, and the sputum was transferred to bead-beating tubes containing a mixture of large and small beads (2-mm zirconia and 0.1-mm zirconia/silica, respectively). In vitro cultures stored in RNAlater were pelleted, resuspended in 1 ml of RNA-Bee (AMS Biotechnology), and transferred to bead-beating tubes. Samples were resuspended in RNase and DNase-free TE buffer (Acros Organics) and lysozyme (1 mg/ml, final concentration) and lysostaphin (0.17 mg/ml, final concentration) were added to each sample. Samples were incubated at 37°C for 30 min to enzymatically lyse cells. RNA-Bee was added to each sample, and samples were lysed mechanically by bead beating three times for 30 s, placing the tubes on ice for Ն1 min between each homogenization. Portions (200 l) of chloroform per 1 ml of RNA-bee were added, and the tubes were shaken vigorously for 30 s and then incubated on ice for 5 min or overnight to allow phases to partition. The samples were centrifuged at 12,000 ϫ g for 15 min at 4°C to separate the aqueous and organic phases. The aqueous phase from each tube was transferred to a new microcentrifuge tube to which 0.5 ml isopropanol per 1 ml of RNA-Bee was added in addition to 20 g of linear acrylamide, and the tubes were incubated at -80°C overnight. Samples were thawed on ice and centrifuged at 12,000 ϫ g for 30 min at 4°C. Pellets were washed with 1 ml of 75% ethanol, air dried for 5 min, and resuspended in 100 l of RNase-free water. The RNA concentration for each sample was determined with a NanoDrop spectrophotometer (Thermo Fisher Scientific). rRNA was depleted using a RiboZero Gold bacteria kit (Illumina) for the in vitro samples and a RiboZero Gold epidemiology kit (Illumina) for the human samples and purified by ethanol precipitation using linear acrylamide to help precipitate the RNA. The depleted RNA was fragmented for 2 min with the NEBNext Magnesium RNA fragmentation module kit and cDNA libraries were prepared using the NEBNext multiplex small RNA library prep kit (New England Biolabs) according to the manufacturer's instructions. Libraries were sequenced at the Molecular Evolution Core at the Georgia Institute of Technology on an Illumina NextSeq500 using 75-bp single-end runs.
Bioinformatic analyses. RNA-seq reads were trimmed using Cutadapt 1.13 (51), using a minimum read length threshold of 25 bases. The non-P. aeruginosa "decoy" species were identified from previous work using CLARK 1.2.3 with an abundance cutoff 2% in at least one human sample (see Data Set S1 in the supplemental material) (24,41). We built a metagenome using these species by downloading at least one genome from each of the 53 species identified, in addition to S. epidermidis, from the National Center for Biotechnology Information as previously described (24). We expect that these non-P. aeruginosa reads map to the similar decoy species better than to P. aeruginosa. For all samples, reads were mapped to this metagenome using Bowtie 2.2.6 with the default parameters for end-to-end alignment (42). We removed the reads that mapped to non-P. aeruginosa species from our trimmed reads files using Seqtk (43) and mapped the remaining reads to PAO1 (NC_002516.2, NCBI Assembly: GCF_000006765.1, gff-spec-v1.21). Reads were counted for using Rsubread 1.26.1 using default options and an SAF input file composed of all genes without decimal points in their locus tags (44). Because not all models had data with strand information available, comparisons were conducted without strand specificity; however, this did not qualitatively impact results compared to strand-specific analysis for differential expression and enrichment analyses. For the PCA in Fig. 1, we used a subset of 2,606 genes such that each had least one read mapping to it for all the samples (Data Set S2). For PCA and calculation of model accuracy, DESeq2's variance stabilizing transformation function was used (v1.20.0) (39). Differential expression between the Denmark and U.S. clinics (Data Set S2) was also determined using DESeq2. All figures were created using ggplot2 (45). The R package fgsea was used for enrichment analysis using the stats score -log(P value) ϫ sign(log 2 FC) and nperm ϭ 1,000 (46). The sunburst plots in Fig. 2 and 4 were generated using R package ggsunburst (47). The UpSet plot in Fig. 5 was prepared by altering the UpSetR R package (48).
Mammalian cell culture. Immortalized homozygous CFTR ΔF508 CFBE41ohuman bronchial epithelial cells (obtained from J. P. Clancy, Cincinnati Children's Hospital) were maintained in a humidified incubator at 37°C and 5% CO 2 in minimal essential Eagle medium (MEM) containing phenol red (Gibco) supplemented with 10% fetal bovine serum (Gemini Bio-Products), 0.5 g/ml Plasmocin prophylactic (InvivoGen), 2 mM L-glutamine, 5 U/ml penicillin, and 5 g/ml streptomycin (Sigma) (19). CFBE41ohuman bronchial epithelial cells are not on the commonly misidentified list, and cells were tested quarterly for mycoplasma using a Southern Biotech mycoplasma detection kit. CFBE41oepithelial cells were seeded at near confluence on transwell permeable-membrane supports (Costar). After attachment and confluence, CFBE41oepithelial cells were differentiated at air-liquid interface for 1 week (49).

Infection of differentiated respiratory epithelium.
For bacterial biofilm growth on biotic surfaces, CFBE41ohuman bronchial epithelial cells were inoculated in duplicate with P. aeruginosa that was prewashed in MEM lacking phenol red supplemented with 2 mM L-glutamine at a multiplicity of infection of approximately 25. A strain of P. aeruginosa PAO1 carrying a multiple copy plasmid that constitutively expresses gfp (pSMC21) was used (19). After 1 h of attachment, nonattached bacteria were removed, and the apical medium was adjusted to 0.4% L-arginine. After an additional 5 h, biofilms were processed for RNA. P. aeruginosa strain PAO1 RNA was isolated from biofilms grown for 6 h on differentiated CFBE41ocells by phenol-chloroform extraction using RNA-Bee and 0.1-mm zirconia/silica beads in a BeadBeater (BioSpec Products) (50). RNA was precipitated with isopropanol and linear acrylamide, and RNA pellets were washed by ethanol precipitation (50). RNA was treated with Turbo DNase (Ambion) and purified by an RNA Clean and Concentrator (Zymo Research). RNA concentration was measured by using a NanoDrop apparatus. DNA removal was confirmed by 260/280 and 260/230 ratios and by PCR for the rplU gene. RNA integrity was determined by agarose gel electrophoresis and visualization of 5S, 16S, 18S, 23S, and 28S bands. RNA-seq library preparation and sequencing were performed by the Health Sciences Sequencing Core at Children's Hospital of Pittsburgh. RNA concentration and integrity was confirmed by fluorometric quantification (Qubit) and Tapestation analysis (Agilent). RNA was rRNA depleted using Ribo-Zero Epidemiology, and sequencing libraries were prepared using a Truseq stranded total RNA kit (Illumina). Single-end sequencing was performed on a NextSeq 500. For CFBE41ohuman bronchial epithelial cell-P. aeruginosa biofilm coculture samples, approximately 75 million 75-bp reads were obtained.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.