Critical Relevance of Stochastic Effects on Low-Bacterial-Biomass 16S rRNA Gene Analysis

DNA contamination from external sources (reagents, environment, operator, etc.) has long been assumed to be the main cause of spurious signals that appear under low-bacterial-biomass conditions. Here, we demonstrate that contamination can be separated from another, random signal generated during low-biomass-sample sequencing. This stochastic noise is not reproduced between technical replicates; however, results for any one replicate taken alone could look like a microbial community different from the controls. Using this information, we investigated respiratory samples from healthy humans and determined the narrow range of bacterial biomass where samples transition from producing reproducible microbial sequences to ones dominated by noise. We present a rigorous approach to studies involving low-bacterial-biomass samples to detect this source of noise and provide a framework for deciding if a sample is likely to be dominated by noise. We anticipate that this work will facilitate increased reproducibility in the characterization of potentially important low-biomass communities.

of bacterial biomass where samples transition from producing reproducible microbial sequences to ones dominated by noise. We present a rigorous approach to studies involving low-bacterial-biomass samples to detect this source of noise and provide a framework for deciding if a sample is likely to be dominated by noise. We anticipate that this work will facilitate increased reproducibility in the characterization of potentially important low-biomass communities. KEYWORDS 16S rRNA gene, contamination, exhaled breath condensate, low biomass, lung microbiome, sequencing noise, next-generation sequencing T he advent of next-generation sequencing sparked a keen interest in investigating the microbial communities of sample types from locations or environments that were previously considered sterile or too low in microbial biomass to characterize accurately (1)(2)(3)(4)(5)(6)(7)(8)(9). Many of these investigations have fundamentally changed our definition of a sterile site and have demonstrated the large impact that low-bacterialbiomass communities can have, for example, on a human host. However, these studies also have raised critical issues and controversy due to irreproducible results (10)(11)(12)(13)(14). The identification of DNA contamination from the laboratory and in reagents used for DNA isolation and sequencing (15) helped greatly in trying to detangle spurious signal that appeared out of nowhere. Nevertheless, results in studies of low-bacterial-biomass samples that cannot be reproduced remain common, despite consideration of proper quality control samples (10,13,14). Other studies have considered the effect of barcode or index switching, whereby mistakes in reading the identifier for a sequencing group cause the sequence to be attributed to a different group (16)(17)(18). Although this effect has largely been studied on the Illumina HiSeq, the effect in amplicon sequencing does not appear to be large (18). A very recent study identified well-to-well contamination as a major contributor to variation; however, even after repeated assays using multiple thermocyclers, handling robots, and separate locations, no consistency could be found in the variation associated with any given library preparation method (18). This finding suggests that another factor is also at play.
The goal of this study was to test the hypothesis that a stochastic sequencing noise is generated during sequencing on the MiSeq platform-and, likely, other platforms sharing that sequencing design-when sequencing input is below a critical threshold. Stochastic sequencing noise can lead to erroneous and irreproducible conclusions when interpreted as authentic sequences within a given sample. We tested this hypothesis using both laboratory-generated serial dilutions of DNA extracted from Pseudomonas aeruginosa strain PAO1 and authentic respiratory samples collected from healthy human subjects.
For the purposes of this study, we define the "signal" of the sequencer as classifiable sequences shared between technical replicates and "noise" as sequences that can be classified but which lack consistency between technical replicates. In contrast, "contamination," such as reagent contamination, is defined as a signal shared between replicates that is also present in experimental controls. Thus, contamination can be detected only with the appropriate controls. It is also worth noting, however, that even in cases where controls fail to amplify, the consistency, or lack thereof, between sample replicates still provides critical information as to the balance of signal and noise within the sample.

RESULTS
Identifying sequencing noise beyond lab and reagent contamination. We began by modeling how different types of contamination effects could be visualized using the similarity of different kinds of replicates (Fig. 1A). Figure 1A presents an idealized model of contamination effects on a single sample purified using 3 separate DNA isolation kits and 3 technical replicates for each of those kits. Contamination within each kit is assumed to be 100% different from contamination in all other kits. The distribution of a hypothetical similarity score is plotted along the x axis for similarity FIG 1 Serial dilutions of P. aeruginosa DNA purified using three separate DNA isolation kits and sequenced in quadruplicate: effects of low biomass on results. (A) An idealized model of contamination effects on a single sample purified using 3 separate DNA isolation kits and 3 technical replicates for each of those kits. Contamination within each kit is assumed to be 100% different from that of reagents in all other kits. The distribution of a hypothetical similarity score is plotted along the x axis for similarity between technical replicates (in red) and between kits (in blue). Where overlap occurs, the curves appear purple. The first column (Continued on next page) Stochastic Effects on Low-Biomass Gene Sequencing ® between technical replicates (in red) and between kits (in blue). Where overlap occurs, the curves appear purple. The first column depicts the condition of high concentrations of DNA, the second column depicts low concentrations of DNA where kit or reagent contamination is dominant, and the third column depicts low concentrations of DNA where random noise dominates. The critical takeaway from this modeling is that all specific forms of DNA contamination should result in technical replicates looking highly similar. We tested this by sequencing the 16S rRNA gene from serial dilutions of DNA isolated from a pure bacterial culture of a single bacterial strain, in this case, Pseudomonas aeruginosa PAO1. DNA was isolated using three separate DNA isolation kits, and each dilution was sequenced using four technical replicates. Consistent with previous observations, as the concentration of input DNA decreased, the bulk of the community membership was composed of reads belonging to operational taxonomic units (OTUs) unrelated to P. aeruginosa and thus spurious (Fig. 1B). For example, at a dilution of 1:100,000, only 28.65% Ϯ 4.23% (mean Ϯ standard error of the mean [SEM]) of reads comprised the OTU corresponding to PAO1. To assess the consistency of such noninput sequencing signals, we compared the Bray-Curtis distances for each dilution between isolation kits (i.e., interreplicate distances) and within technical replicates (i.e., intrareplicate). We plotted the results for each dilution set as a density distribution estimate of the intra-and interreplicate Bray-Curtis values (Fig. 1C). Again, in the model where contamination (in this case, kit contamination) is the major driver of sequencing noise, the two peaks should stay together at high concentrations but separate with dilution. Additionally, intrareplicate distance should stay low (technical replicates should look similar). As expected, the distribution of the interreplicate distances (in blue) increased, centering on higher values (i.e., the kits became more dissimilar from each other) as the DNA was diluted. However, contrary to expectation, the intrareplicate distances (in red) also increased as the DNA was diluted and overlapped the interreplicate distances. This finding shows that the replicates from a single kit do not look like themselves, indicating a stochastic source of variation.
Reagent contamination is well documented but was not observed in our initial experiments, so we next tested reagents from seven separate DNA isolation kits by carrying out the DNA isolation process on reagents alone. These controls were sequenced in triplicate, the resulting OTUs were plotted as a heat map (Fig. 1D), and the intrareplicate Bray-Curtis distances were determined (Fig. 1E). Two of the seven controls demonstrated low intrareplicate Bray-Curtis distances and communities consistent with reagent contamination; however, the remaining five controls demonstrated diverse communities that had little or no similarity between technical replicates. Collectively, these data demonstrate that sequencing of dilute samples or reagents can result in a nonreplicable, stochastic, positive signal (sequencing noise) even in the absence of detectable DNA contamination within the control reagents.
Effect of sequencing noise on samples ranging from high to low bacterial biomass. We next examined a group of clinical samples from healthy human volunteers (Table 1) to determine how sequencing noise might affect result interpretation. Respiratory samples were chosen because they span the range from high to ultralow bacterial biomass. To that end, we analyzed samples from the following sources: oral rinse, posterior nasopharyngeal swab, bronchoalveolar lavage (BAL) fluid from right (R) and left (L) lungs, a sample containing an equal mix of RBAL and LBAL (CBAL), and an exhaled breath condensate (EBC) sample. EBC is of considerable interest because, as a noninvasive sample, it could greatly expand the type of microbiological questions that We extracted DNA from both samples and controls (including experimental controls [e.g., sterile saline passed through the bronchoscope before bronchoscopy] and isolation controls [e.g., purified DNA from DNA isolation reagents]) as described in Materials and Methods, and from this we amplified the V4 region of the bacterial 16S rRNA gene in triplicate using a dual-barcoding strategy and following a low-biomass protocol (19). Finally, we sequenced each sample using an Illumina MiSeq instrument.
First, we examined whether biological samples were different from their respective controls. In each of these comparisons, the samples were found by permutational multivariate analysis of variance (MANOVA) (adonis) to contain significantly different communities than the controls (oral rinse versus saline, P ϭ 0.001; nasal swab versus control swabs, P ϭ 0.001; right BAL fluid versus scope prewash, P ϭ 0.001; left BAL fluid versus scope prewash, P ϭ 0.001; CBAL fluid versus scope prewash, P ϭ 0.001; EBC versus EBC control, P ϭ 0.003). In the model where the only sources of signal were either genuine or contamination, such results suggest that the signal in every instance should be considered a real community. However, when the similarity or dissimilarity of technical replicates (as measured by intrareplicate Bray-Curtis distance) was examined, it became clear that the communities in certain sample types (e.g., oral) were more reproducible than others (e.g., BAL fluid) and that some (e.g., EBC) were as irreproducible as reagent controls ( Fig. 2A). This finding suggested a relationship between the number of 16S rRNA gene copies in each sample (Fig. 2B) and reproducibility. When this was explicitly tested (Fig. 2C), it became clear that not only was there a definite relationship between bacterial biomass and sample reproducibility but also that one can predict whether a sample will need technical replication to separate signal from noise. We found that there exists a very narrow range (10 3 to 10 4 16S rRNA gene copies per sample) within which samples went from being reproducible between technical replicates to instead being irreproducible and dominated by noise. The majority of control samples fell within or below this range (Fig. 2B).
Finally, we applied what we had learned to the question of whether EBC can be used as an accurate replacement for BAL fluid in determination of the lung microbiota. The  combined BAL fluid (CBAL) and the EBC were separable from their controls ( Fig. 3A and  B); traditionally, this result would suffice to consider each to have a detectable microbiome that could be compared. However, as has been shown, the intrareplicate distances of technical replicates suggest that noise can play a large role in these samples. To visualize the effect of this noise on the communities in question, we plotted each replicate set against the other two for both CBAL and EBC ( Fig. 3C and D).
A sample composed of all specific signal and no noise would be expected to have a perfect correlation between all OTUs detected within each technical replicate (i.e., a diagonal line across a 3D box of the 3 replicates); in contrast, a sample with only noise would have no correlation (i.e., data distributed along the axes of the 3D box).
Examination of the CBAL sample (Fig. 3C) shows that, while noise is present, a consistently reproducible signal exists across all the CBAL samples. The same cannot be said for the EBC samples, as very few points appear off the axes (Fig. 3D). In this situation, the standard method of taking the mean of the abundance of community members would be an inaccurate representation of the community due to the preponderance of outliers. A more accurate method to assess the reproducible portion of the community across multiple samples is to take the mean of replicate medians, which deemphasizes the stochastic noise between samples. We applied this approach to the EBC and CBAL samples and plotted the results according to the average rank abundance of the CBAL samples (Fig. 3E). The EBC samples both lacked consistent community structure and did not resemble the CBAL samples (Fig. 3E, top). In contrast, across all CBAL samples, we reproducibly observed 30 to 40% of the BAL fluid community to be made up of Prevotella spp. and Veillonella spp., with smaller but real abundances of other community members (Fig. 3E, bottom). This is an important and validating finding, as Prevotella spp. and Veillonella spp. have been found to be dominant community members in most lung microbiome studies. Neither sample set appeared to be significantly affected by reagent contamination (Fig. 3E, insets).

DISCUSSION
Collectively, our data analyzing both a defined bacterial community and real human clinical samples demonstrate that (i) stochastic noise occurs in real-world samples of low bacterial biomass; (ii) below a total of 10 4 copies of 16S rRNA gene in a sample (as determined by droplet digital PCR), one needs to worry about noise dominating real sequences; (iii) critical examination of the composition of technical replicates can be used to separate signal from noise; and (iv) EBC is not a satisfactory sample for sampling lower respiratory bacterial microbiota from healthy individuals using the sequencing method that we tested.
These data extend the now-foundational study by Salter and colleagues (15), which demonstrated that when low-biomass samples are being sequenced, failure to sequence relevant controls simultaneously can cause even the most carefully designed study to be skewed by DNA contamination within reagents. The current study demonstrates that stochastic sequencing noise is an additional critical element to consider when low-biomass samples are being sequenced, as it can produce results that can be easily mistaken for real signal. We show that quantifying 16S rRNA gene copy number before sequencing can determine which samples will be most susceptible to being dominated by noise. To that end, we recommend that any sample containing Ͻ10 5 16S rRNA gene copies/sample be treated as likely to be significantly affected by noise. Furthermore, because sequencing of technical replicates allows one to set up a necessary baseline (20), replicate sequencing of vulnerable samples can efficiently allow one to differentiate real signal from noise and can aid in detecting contaminants.
Replicate sequencing also introduces an element of precision into an analysis that frequently possesses no true negative controls. Sequencing techniques cannot prove that a sample for microbiome analysis is negative (i.e., contains absolutely no microbial DNA). However, one can say within precision of x number of replicates (e.g., three) that no consistent signal was found. Using this approach, which must obviously balance Stochastic Effects on Low-Biomass Gene Sequencing degree of precision versus cost, we concluded that EBC is a poor sample for 16S rRNA gene assessment. We base this conclusion not on an assertion that it was negative or that it did not differ from its controls but on the fact that within the number of replicates that we ran on each sample, we were unable to find a consistent signal. This approach should be applied to other low-biomass sample types that remain controversial, such as peripheral blood and placenta (21).
Our data suggest that the origin of this stochastic noise is related to the sequencer itself, most likely due to underloading of the flow cell and very low cluster densities. While future studies will be needed to uncover the specific problems within the system, it is not hard to imagine a situation where the weak signal produced from very low cluster densities could result in an increase in the sensitivity (gain) of the detectors, causing an increase in detector noise and cross talk between channels. Likely added to these problems is the sum of many sample-side low-frequency events, such as PCR error (22), index switching (17), and chimera formation (22), each of which likely is magnified in the absence of a strong signal. Indeed, while this study employed state-of-the-art methods for error reduction for sample processing, it seems very likely that the error associated with the above-mentioned factors is different, and likely higher, under low-biomass conditions (23). However, we do not believe that differences in these sources of error alone can explain the extent of the irreproducible differences seen in the low-biomass context. Changes in the overall error rate could be monitored by following replicate dilutions of a mock community, but this cannot separate the factors contributing to the errors. Changes in the frequency of index switching could potentially be monitored by using a combination of sample replication using standard dual indexing barcodes compared to the unique dual index barcodes (17). A recent study identified well-to-well contamination as a likely contributor to the unexpected sequencing results (18); however, the extent of the effect varied greatly depending on library preparation method, run, and location, so it is unlikely that well-to-well crossover would be the only noncontamination cause of sequencing noise. Regardless of the cause, the methods for identifying and dealing with noise described here should improve reproducibility. An interesting corollary to the idea that the noise is associated with the sequencer is that running the same samples on a sequencing platform based on a different underlying technology, such as Oxford Nanopore Technologies MinION, would result in a completely different behavior from sequencing controls and low-biomass amplicons. In summary, low-biomass sequencing experiments push the lower limits of sequencers' capacity to detect signal accurately, so it should be unsurprising that they exceed the threshold of noise that exists in every detection system.
Conclusions. We have identified a stochastic sequencing noise which occurs in low-biomass samples and which is experimentally separable from reagent contamination. Additionally, we identified the range of bacterial DNA concentrations beyond which technical replicates, although expensive, are likely to aid in separating signal from noise and hence improve confidence in results. Finally, we tested these methods on respiratory samples and used our knowledge of sequencing noise to inform our decision that EBC is a poor sample to assess the lung microbiota.

MATERIALS AND METHODS
Pseudomonas aeruginosa DNA dilutions. We obtained cells for isolation of bacterial DNA by growing P. aeruginosa strain PAO1 in a 125-ml disposable plastic Erlenmeyer flask containing 20 ml LB broth, which was agitated overnight on an orbital shaker at 125 rpm in a 37°C incubator. Three 1-ml aliquots, collected in 2.0-ml dolphin-nose microcentrifuge tubes, were spun down in a microcentrifuge at a relative centrifugal force (rcf) of 18,000 at 4°C for 30 min, after which the supernatant was discarded. Each tube was then subjected to DNA extraction using a separate DNeasy kit from Qiagen (MD, USA) according to manufacturer protocols with minor modifications (19) and with the addition, after resuspension of the bacterial pellet in lysis buffer, of a 30-s bead-beating step (using garnet beads [Qiagen, MD, USA]). Once DNA was obtained (referred to as neat), it underwent five serial dilutions (1:10 each) in kit elution buffer. For sequencing, each of the samples from each of the dilutions was sequenced in quadruplicate.
Clinical samples and controls. We recruited 20 healthy volunteers to provide EBC, oral wash, posterior nasopharyngeal swab, and BAL fluid samples for microbial analysis. Participants had a mean age of 53 Ϯ 15 years, were 50% female, were 20% black and 80% white, and were all of non-Hispanic ethnicity (Table 1). They were a mixture of persons who had never smoked, former smokers, and current smokers. With the exception of two participants with undiagnosed moderate airflow obstruction, they were without evidence of lung disease, and none was taking any inhaled medications. Prior to bronchoscopy, all participants underwent a complete history and physical examination by a pulmonologist, spirometry, chest imaging, prospective collection of medication history, and complete blood count with differential, coagulation studies, and chemistry panel.
On the morning of the bronchoscopy visit, with the participants having taken nothing by mouth for at least 8 h, EBC samples were obtained using the RTube collection system from Respiratory Research, Inc. (Charlottesville, NC), according to the manufacturer's instructions. Consistent with American Thoracic Society recommendations (24), participants wore a nose clip and were coached to breath naturally without hyperventilation for approximately 20 min through a one-way valve into a cooled chamber that condenses and collects up to 2 ml of expired vapors, aerosols, and moisture.
Next, before any other procedures, participants provided an oral rinse specimen, as described previously (25). They then underwent bronchoscopy under moderate conscious sedation, using an orally inserted fiber-optic bronchoscope. We performed bilateral BAL in the right middle lobe and lingula, followed by collection of a posterior nasopharyngeal swab, as described in detail elsewhere (2,25,26), but omitting protected specimen brushing. We also created a sample called combined BAL (CBAL) fluid from an equal mix of left-lung and right-lung BAL fluid samples. We collected control samples specific to the experimental samples wherever possible: sterile saline for the oral washes, sterile saline passed as a prewash through the bronchoscope for the BAL fluid specimens, unused swabs for nasal swabs, and EBC control samples obtained by passing 1 ml of sterile double-distilled water over a fresh RTube and collecting the water. Isolation controls were generated by carrying out the DNA isolation procedure without the addition of any sample.
DNA isolation. All liquid samples were processed as described previously (19). Briefly, we aliquoted samples into dolphin-nose Eppendorf tubes and spun them at an rcf of 18,000 at 4°C in a microcentrifuge for 30 min to pellet the cells (bacterial and host). We then extracted DNA from the pellet using the Qiagen DNeasy kits with the addition of bead beating. Brushes or swabs were placed directly in the bead beating tubes, and we carried out the remaining steps of DNA isolation according to manufacturer protocols with minor modifications (19). DNA extraction controls were used for DNA contamination within extraction reagents. In addition to extraction controls that related to the kits used for DNA isolation for the clinical samples, extraction controls were generated from seven total separate DNA isolation kits to examine reagent contamination across multiple lots.
Library preparation and sequencing. We prepared DNA libraries by amplifying the V4 region of the bacterial 16S rRNA gene using a low-bacterial-biomass protocol as described previously (19) with a dual-indexing strategy (27). Sequencing was performed on a MiSeq instrument (Illumina, San Diego, CA) at the Microbial Systems Molecular Biology Laboratory at the University of Michigan. Samples were sequenced in three runs: one comprised the dilution replicates and extraction control experiments (Fig. 1), and the other two runs were composed of the clinical samples. Where physically possible, all samples and replicates from an individual were run on the same plate.
Bacterial load quantification. Bacterial DNA was quantified using a QX200 Droplet digital PCR system (Bio-Rad, Hercules, CA). Quantitation was performed as described previously (28).
Sequence processing and analysis. Fastq files were obtained, and the sequences were processed using the open source software mothur v.1.36 according to the MiSeq SOP minor alterations. We used the SILVA bacterial database for alignment and binned operational taxonomic units (OTUs) at 97% similarity. We generated taxonomies using the RDP taxonomy. After generation of the .shared file and the .cons.taxonomy file, these files were imported into R for final analysis. The R packages that we used for analysis relied heavily upon base-R, vegan (29), ComplexHeatmap (30), ggplot2 (31), dplyr (32), tidyr (33), mvabund (34), and RColorBrewer (35). Additionally, we used Prism 8 (GraphPad Software, San Diego, CA) to generate figures. Final figures were assembled in Photoshop CS5 (Adobe Inc.). Analysis of the controls (Fig. 1C and Fig. S1 to S9 in the supplemental material) indicated that three OTUs were likely contaminants. These OTUs were removed from all clinical samples to eliminate elements that could artificially make samples look more similar to one another. To ensure that removal of the OTUs did not remove all viable counts from a sample, the count totals before and after removal were compared; an average of roughly 90% (ϳ19,000 reads/sample) of the reads were retained. It should be noted that how to deal with backgrounds is a complex issue given the compositional nature of microbiome data and should be evaluated with care.
Data availability. Sequences have been deposited in the SRA under accession numbers PRJNA549253 and PRJNA552077. Code and other files related to the project can be found at https:// github.com/Tetrakis/low-biomass-noise.