**DOI:**10.1128/mBio.00324-13

## ABSTRACT

The site-to-site variability in species composition, known as β-diversity, is crucial to understanding spatiotemporal patterns of species diversity and the mechanisms controlling community composition and structure. However, quantifying β-diversity in microbial ecology using sequencing-based technologies is a great challenge because of a high number of sequencing errors, bias, and poor reproducibility and quantification. Herein, based on general sampling theory, a mathematical framework is first developed for simulating the effects of random sampling processes on quantifying β-diversity when the community size is known or unknown. Also, using an analogous ball example under Poisson sampling with limited sampling efforts, the developed mathematical framework can exactly predict the low reproducibility among technically replicate samples from the same community of a certain species abundance distribution, which provides explicit evidences of random sampling processes as the main factor causing high percentages of technical variations. In addition, the predicted values under Poisson random sampling were highly consistent with the observed low percentages of operational taxonomic unit (OTU) overlap (<30% and <20% for two and three tags, respectively, based on both Jaccard and Bray-Curtis dissimilarity indexes), further supporting the hypothesis that the poor reproducibility among technical replicates is due to the artifacts associated with random sampling processes. Finally, a mathematical framework was developed for predicting sampling efforts to achieve a desired overlap among replicate samples. Our modeling simulations predict that several orders of magnitude more sequencing efforts are needed to achieve desired high technical reproducibility. These results suggest that great caution needs to be taken in quantifying and interpreting β-diversity for microbial community analysis using next-generation sequencing technologies.

**IMPORTANCE** Due to the vast diversity and uncultivated status of the majority of microorganisms, microbial detection, characterization, and quantitation are of great challenge. Although large-scale metagenome sequencing technology such as PCR-based amplicon sequencing has revolutionized the studies of microbial communities, it suffers from several inherent drawbacks, such as a high number of sequencing errors, biases, poor quantitation, and very high percentages of technical variations, which could greatly overestimate microbial biodiversity. Based on general sampling theory, this study provided the first explicit evidence to demonstrate the importance of random sampling processes in estimating microbial β-diversity, which has not been adequately recognized and addressed in microbial ecology. Since most ecological studies are involved in random sampling, the conclusions learned from this study should also be applicable to other ecological studies in general. In summary, the results presented in this study should have important implications for examining microbial biodiversity to address both basic theoretical and applied management questions.

## Introduction

Microorganisms appear to be the most diverse group of life presently known, inhabiting almost every imaginable environment on Earth (1). They play integral and unique roles in ecosystem functions and biogeochemical cycling of carbon (C), nitrogen (N), sulfur (S), phosphorus (P), and various metals. Understanding the structure, functions, interactions, stability, and adaptations of microbial populations/communities is crucial for basic science discovery (2, 3), biotechnology (4), agriculture (5), energy (6), the environment (7), and human health (8). However, due to their extremely high diversity and as-yet-uncultivated status, characterizing microbial diversity and establishing the linkages between microbial diversity and ecosystem function are very challenging (9). Understanding the mechanisms controlling microbial diversity and functions is even more difficult. The recent advances in metagenomics, which has emerged as a cutting-edge 21st century science (10), and associated metagenomics technologies such as high-throughput sequencing and microarrays (10, 11) provide revolutionary tools to address these challenges. Large-scale high-throughput sequencing-based metagenomics is providing unprecedented views of the taxonomic diversity, metabolic potential, and ecological roles of microbial communities in various habitats (12–32). Various studies clearly demonstrate that large-scale sequencing approaches are powerful in studying microbial community diversity and activity (23, 33–43).

β-Diversity, the site-to-site variability in species composition, is crucial in understanding patterns of species diversity across various spatial and temporal scales. It can provide insight on the mechanisms controlling community compositions and structure (44). β-Diversity is widely used for investigating the mechanisms controlling biodiversity (45–47) and the responses of biological communities to environmental changes (45). In microbial ecology, high-throughput metagenomics sequencing and associated technologies are the major tools for examining the site-to-site variability in species composition and its response to environmental changes (48, 49). However, the amplicon-based approach has shown limited reproducibility, especially when examining low-abundance taxa (50). For instance, based on the overlap of operational taxonomic units (OTUs), a very low reproducibility (<20% between two technical replicates and <10% among three technical replicates) was obtained (50), which is far away from the theoretical expectation of 100% overlap among technical replicates. Similar results were recently obtained with mock communities (51) and human microbiomes (13).

The high numbers of variations in technical replicates are most likely due to the sampling artifacts associated with random sampling processes (50, 52), because many steps in the pyrotag-based sequencing analysis are associated with random sampling, e.g., PCR amplification of target genes, ligation of amplified PCR products to sequencing adaptors, emulsion and immobilization of beads, and bead deposition (50). Since microbial communities under natural settings are extremely complex and generally have abundance curves with very long tails, that is, large portions of OTUs exist in extremely low abundance, the probability of sampling such rare OTUs in a sampling event is low. The chances of resampling them are even lower, especially with limited sampling efforts (i.e., percentages of total individuals sampled in a community). It is expected that the severity of such sampling artifacts on community comparison is dependent on community complexity and sampling efforts. As the complexity of a microbial community increases, such an artifact will become more severe. Increasing sampling efforts will help to ameliorate such a problem (50). However, there is no theoretical foundation to support such a speculation.

In this study, we hypothesize that a random sampling process is the main cause for high numbers of variations among technical replicates. To test this hypothesis, the main objective of this study is to provide a theoretical foundation on understanding such sampling artifacts associated with a random sampling process. We first developed a theoretical framework to simulate the random sampling processes based on general sampling theory and to predict sampling efforts for achieving a desired reproducibility. We then illustrated the sampling artifacts associated with random sampling processes using an analogous example. We also examined whether the developed framework could be used to predict the low percentage of overlap of OTUs among technical replicates. Our results indicated that high numbers of variations in technical replicates were due to the artifacts associated with random sampling processes.

Mathematical framework. (i) Sampling individuals from a large regional community.The pattern of species abundances across different spatial-temporal scales is a central issue in ecology. However, it is generally impossible to directly measure the abundance of all species at ecologically relevant scales. Thus, it is important to understand the relationship between the underlying species abundance distribution of a large regional community and the observed abundance distribution in a small sample from the large regional community. Various statistical sampling theories are developed to describe the relationships of the species abundance between the sampled community and the large regional community (53). Here, we will use general sampling theory to simulate and predict the sampling artifacts associated with random sampling processes.

Assuming that a species (or OTU) occurs in a large regional community, the number of its individuals that exist in a small sample of the large community is dependent on the total abundance of that species in the large community, the size of the sample, and the spatial distribution of the individuals (54). In this study, we assume that all individuals in the large regional community are located randomly in space. Let *N* represent the total number of individuals (e.g., 16S rRNA gene sequences) and *n* be the number of different species (e.g., OTUs), each with abundances *x*_{1}, *x*_{2}, …, *x*_{n}. If an individual is randomly sampled from the community, the probability of it belonging to the *i*^{th} species is *x*_{i}*/N*. If *m* individuals are randomly sampled from this large community, the expected number of individuals from the *i*^{th} species is *mx*_{i}*/N*. With replacement, then, the probability of the *i*^{th} species with abundance *x*_{i} to be encountered by *k* individuals in the sample is given by the following binomial function expression (55, 56):

To estimate the probability that at least one individual of the *i*^{th} species is present in the sample, we generally calculate the probability that the *i*^{th} species is absent in the sample, i.e., *k* = 0. If *k* = 0, then the probability of the *i*^{th} species with abundance *x*_{i} in a community to be absent in the sample is expressed as

Let *x*_{i}/*N* equal *ax*_{i}/*m*, where *a = m/N*, the sampling ratio. Then, equation 2 is approximated to the exponential form of the Poisson distribution:

According to the Poisson distribution, the probability of the *i*^{th} species with abundance *x*_{i} in the community to have at least one individual in the sample can be expressed as follows (54):
*a* of the large regional community can be expressed as
_{a}(*m*) is the observed species abundance distribution in a sample with *m* individuals sampled. φ(*x*) represents the species abundance distribution with abundance *x* in the large regional community, in which θ is the vector of parameters (57).

(ii) Expected species overlap among samples with the size of the large community known.Given a large community with a known underlying species abundance distribution, φ(*x*), the expected proportion of species in common between samples can be theoretically predicted. In the case in which the species abundances in these samples are perfectly correlated, the expected number of the species shared between two samples is given by (54)
*a*_{1}, *a*_{2}, and *a*_{3} are sample ratios from samples 1, 2 and 3, and *a*_{1} = *m*_{1}/*N*,*l* = 1, 2, 3. *m*_{1} is the number of individuals (i.e., 16S rRNA gene sequences in this study) obtained from the *l*^{th} sample.

Various similarity metrics are used for assessing overlap among different samples (58). In this study, the two popular similarity metrics, Jaccard’s incidence-based and Bray-Curtis’s abundance-based methods, are used. Based on the Jaccard similarity index, the proportion of species (i.e., OTUs in this study) overlap between two samples

The proportion of species overlap among three samples

There are seven commonly used continuous species abundance distributions, including lognormal, exponential, gamma, truncated hyperbolic, continuous log series (54), inverse gamma, and inverse Gaussian distribution. When the total number of individuals of the regional community (*N*) is known, the Jaccard similarity-based explicit expression functions of the proportion of species overlap between two or three samples are summarized in Tables S1A and S1B in the supplemental material.

### Table S1

Copyright © 2013 Zhou et al.This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 Unported license, which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original author and source are credited.

(iii) Expected species overlap among samples with the size of the large community unknown.Using equations 8 and 9 to estimate the percentages of species overlap between two or three samples requires information about the total individuals (*N*) of the communities examined. However, in most cases, the number of total individuals in a community is unknown. In the following, we will consider the situation in which *N* is unknown.

Since most of the species abundance distributions are scale invariant under a Poisson sampling process (53, 57, 59, 60), the expected sample abundance distribution can be obtained by rescaling the community abundance distribution (*x*) to the sample abundance distribution (*y*): *y = px*, where *p* is the proportion of the community sampled. Thus, equation 8 can be rewritten as
*N* is unknown, which are the rescaled parameters and functions of *p*. In the case of two random samples from the same community with parameters θ, we set *p* = *a*_{1} + *a*_{2} and then obtain *a*_{1}/(*a*_{1} + *a*_{2}) and *a*_{2}/(*a*_{1} + *a*_{2}). Let *m*_{1} and *m*_{2} be the total number of individuals observed in sample 1 and sample 2, respectively. Now we can substitute *a*_{1} = *m*_{1}/*N* and *a*_{2} = *m*_{2}/*N* into and*m*_{1} and *m*_{2}, i.e., *m*_{1}/(*m*_{1} + *m*_{2}) and *m*_{2}/(*m*_{1} + *m*_{2}). θ* is the vector of parameters of the species abundance distribution in the samples. Consequently, the expected species overlap can be obtained by fitting data to the parameters θ* without knowing the total number of individuals of the community, *N*.

Similarly, in the case of three samples randomly sampled from one community, the parameter *p* can be set to the sum of sampling ratios, *p* = *a*_{1} + *a*_{2} + *a*_{3}. Then, equation 9 can be rewritten as
*a*_{1}/(*a*_{1} + *a*_{2} + *a*_{3}) = *m*_{1}/(*m*_{1} + *m*_{2} + *m*_{3}), *a*_{2}/(*a*_{1} + *a*_{2} + *a*_{3}) = *m*_{2}/(*m*_{1} + *m*_{2} + *m*_{3}), *a*_{3}/(*a*_{1} + *a*_{2+} + *a*_{3}) = *m*_{3}/(*m*_{1} + *m*_{2} + *m*_{3}).

Now, according to equations 10 and 11, the species overlap for two and three samples can be estimated based on sample abundance *y* rather than community abundance *x*, so that there is no need to know the community size *N*. When the total number of individuals of the regional community is unknown, the Jaccard similarity-based explicit expression functions of the proportion of species overlap between two or among three samples are summarized in Tables S1C and S1D in the supplemental material.

(iv) Predicting sampling efforts for achieving a desired overlap among replicate samples.One important question in practice is what levels of sampling efforts are needed for achieving desired species overlap when the number of total individuals of the community is unknown. In the following, we will address this practical question by assuming that and _{m̅2̅′̅} individuals are needed for sampling to achieve a desired overlap between two samples. To simplify the situation, let *A* = *N*. Based on equation 13, when *N* is known, the predicted overlap between two samples is given by

Similarly, when *N* is unknown, we can take the transformation of the abundance variable *y* = *px* = (*a*_{1} + *a*_{2})*x*. By comparing equations 8 and 10, the relationship of the expected number of the shared species when *N* is known or unknown is given by
*A** as the expression of the number of sequences, *A** = *m*′/(*m*_{1} + *m*_{2}), the predicted overlap model is given by

Similarly, the predicted overlap model for three samples is given by

By solving equations 15 and 16, the sampling efforts *m*′ can be estimated based on sample abundance *y* without knowing the community size *N*. The Jaccard similarity-based explicit expression functions of the proportion of species overlap between two or among three samples for predicting sampling efforts are summarized in Tables S1C and S1D in the supplemental material.

## RESULTS

Simulation with an analogous example.To better illustrate the effects of random sampling processes on the OTU overlap among technical replicates (3), we use an analogous example by randomly sampling balls from three jars containing the same number of balls of different colors (Fig. 1). Assume that three identical jars contain *N* balls of *n* different colors. Their abundance distributions vary among different colors of balls but are identical among these jars. Here, individual balls are equivalent to individual 16S rRNA gene sequences, while the types of balls with different colors are equivalent to individual OTUs. To simplify the situation, we assume that the same numbers of balls, *m*, are randomly sampled from these three jars, yielding samples 1, 2, and 3 (Fig. 1). Theoretically, if all balls from the whole jars are sampled (i.e., *m* = *N*), 100% ball overlap will be expected among these samples. However, in reality, the percentages of ball overlap will be less than 100%, because they depend on the sampling efforts, ball abundance distribution, and complexity of the community. The differences of the percentages of ball overlap between the theoretically predicted and the observed values among different sampling events are entirely due to random sampling processes, because there is no difference in the ball compositions and abundances among these three jars.

With the assumption that the ball abundance distributions in these three jars follow any of the five continuous species abundance distributions as listed in Table S1A in the supplemental material, we simulated the effects of the random sampling processes on the ball overlap based on Jaccard and Bray-Curtis indexes for five different distributions: exponential, gamma, lognormal, inverse Gaussian, and inverse gamma. Under each specific ball abundance distribution, the average observed overlap through simulations of 100 repeated samplings (see Materials and Methods for details) was calculated and compared to the theoretically predicted overlap between two samples based on equation 8 or 10, when *N* is known or unknown, respectively. Similar analyses were carried out for comparing ball overlaps among three samples based on equation 9 or 11. Although the simulation results vary considerably with the parameters selected, the following generalizations can be drawn. First, no significant differences of Jaccard similarities were observed between the theoretically predicted and the observed overlap values for all five species abundance distributions (Table 1 and Fig. 2; see also Fig. S1 and S2 in the supplemental material), indicating that the theoretically predicted percentages of ball overlap match well to the observed percentages of ball overlap for all ball abundance distributions examined. Second, there were no significant differences of the predicted overlap percentages derived from equation 8 with known *N* and equation 10 with unknown *N* for two samples or from equation 9 with known *N* and equation 11 with unknown *N* for three samples (see Table S2 in the supplemental material), suggesting that accurate predictions of the overlap percentages between two or among three samples can be obtained when the size of the regional communities is unknown. In addition, low ball overlap percentages were observed with low sampling efforts under different ball abundance distributions (Fig. 2; see also Fig. S1 and S2). For instance, under exponential distribution, when 1% of the community was sampled, 50% overlap between two samples (Fig. 2A) and 34% overlap among three samples (Fig. 2B) were obtained. As the sampling efforts increase, the ball overlap among samples increases under different ball abundance distributions (Fig. 2; see also Fig. S1 and S2). For example, under exponential distribution, when 10% of the community was sampled, 91% ball overlap between two samples (Fig. 2A) and 82% among three samples were obtained (Fig. 2B). All of the above results suggested that accurate predictions of the overlap could be obtained between two samples with equation 8 or 10 and among three samples with equation 9 or 11.

### Figure S1

*a*

_{1}=

*a*

_{2}) were selected from the same community with 10

^{6}individuals and 10

^{4}types of balls with different colors. In each sampling event, we calculated the theoretically predicted Jaccard overlap (blue line) by equation 8, predicted overlaps (red line) by equation 10, and the average Jaccard observed overlap (dots) of 100 repeated simulations. Distribution parameters are set as follows: (A) Gamma distribution, α = 5, β = 20; (B) Lognormal distribution, μ = 3.912, σ = 1.2; (C) Inverse Gaussian distribution, μ′ = 100, λ′ = 100; (D) Inverse gamma distribution, α′ = 6, β′ = 500. Download Figure S1, DOCX file, 0.4 MB.

Copyright © 2013 Zhou et al.This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 Unported license, which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original author and source are credited.

### Figure S2

Copyright © 2013 Zhou et al.This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 Unported license, which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original author and source are credited.

### Table S2

### Figure S3

Because the general explicit form of the ball overlap for quantitative similarity index could not be derived, numerical simulations were performed (see Materials and Methods) for all five ball abundance distributions (see Fig. S3 in the supplemental material) for comparisons based on Jaccard similarity, all of the numerical simulations indicated that low percentages of overlap were also observed among samples with low sampling efforts. For instance, under exponential distribution, when 1% of the community was sampled, 53% of ball overlap based on the Bray-Curtis index was obtained between two samples and 35% among three samples. All of these results, based on both Jaccard and Bray-Curtis similarities, indicate that substantially high numbers of variations can be obtained among communities of identical compositions and abundance distributions when sampling effort is low. Such variation is entirely due to random sampling processes.

Empirical examples. In our previous studies, the composition and structure of 24 microbial communities from a long-term global change experimental site in Oklahoma were analyzed using the amplicon-based sequencing detection approach (49). Each community was amplified with 2 or 3 bar-coded primers, followed by sequencing using both forward and reverse primers. Thus, three types of datasets were available: sequences from forward and reverse primers and combined sequences. Since the sequences of individual technical replicates within an experimental plot are derived from the same community, conceptually, they should obey the same species abundance distribution. Thus, the sequences from all technical replicates within a plot (a community) were pooled for fitting species abundance models described in Table S1A in the supplemental material. The best-fit model and associated parameters for each plot (i.e., community) were shown in Table S3A for the soil communities amplified with two tags and in Table S3B for the soil communities amplified with three tags. Either the exponential abundance distribution or the inverse gamma distribution is the preferred OTU abundance distribution for all soil samples (see Tables S3A and S3B).

### Table S3

Once the OTU abundance distribution models are determined for all communities examined, the predicted overlap percentages in terms of Jaccard and Bray-Curtis similarities were calculated based on the formula provided in Tables S1C and S1D in the supplemental material. The results were listed in Tables S4A and S4B for the communities amplified with two tags and in Tables S5A and S5B for those with three tags. Overall, no significant differences were observed between observed and predicted Jaccard overlaps (Table 2).

### Table S4

### Table S5

Since Bray-Curtis overlap is a quantitative similarity index, it is generally higher than the incidence-based overlap (see Tables S5A and S5B in the supplemental material). Overall, no significant differences were observed between observed and predicted Bray-Curtis overlap percentages (Table 2). All of the above-reported results indicated that the predicted overlap percentages match well to the observed results. Therefore, the variations among technical replicates can be best explained by the artifacts associated with random sampling processes.

Predictions of sampling efforts for the desired species overlap.Based on equations 15 and 16, on the explicit expression functions presented in Tables S1C and S1D, and on sample abundance and sequence information for various tags in Tables S3A and S3B, sampling efforts for achieving various degrees of OTU overlaps between two (see Table S6A) and among three technical replicates (see Table S6B) were predicted. To achieve 20% OTU overlap for two tags, an average of 2,367 sequences are needed (see Table S6A; Fig. 2A, 2UC), which is consistent with the number of sequences obtained in this experiment. To achieve 90% overlap between two technical replicates, an average of 60,900 sequences are needed (see Table S6A). For instance, for the community of 2UC, a total of 71,400 of the combined sequences would be needed to achieve 90% OTU overlap between two technical replicates (Fig. 3A), which is about 32 times more sequence reads needed than we sequenced previously (see Table S4 to S6).

### Table S6

Much more sequencing effort is needed to achieve desired OTU overlaps among three technical replicates than between two technical replicates. To have 10% OTU overlap for three tags, an average of 3,310 sequences are needed, which is consistent with the number of sequences obtained in this experiment. To reach 90% overlap between three technical replicates, an average of 63,770 sequences are needed (see Table S6B). For example, for the community of 1UC, about 60,500 sequences are required to obtain 90% of OTU overlap (Fig. 3B). The current sampling efforts (2,018 sequences) are far less than the desired 90% OTU overlap. Our results also suggested that most of the work published, especially with soils, is severely undersampling if the goal is to determine significant changes in β-diversity among sampling sites.

## DISCUSSION

One of the major technical challenges for the amplicon-based sequencing detection approach is low reproducibility (50), which is a central issue in comparative studies (61). This issue has recently been examined intensively, but it is still a matter of debate (13, 31, 51). Results from several recent studies supported (13, 50, 51, 62), disputed (31, 63–65) or both supported and disputed (66) our previous observations. A number of factors can contribute to such divergent observations, e.g., the complexity of the systems examined (50, 51, 64), differences in sequencing depths (50, 62, 63), and/or variations in sequencing and sequence preprocessing approaches (51, 67). Since most of these factors act with each other, isolating individual factors influencing sequencing reproducibility is extremely difficult, especially when natural communities of unknown diversity background are examined. Using artificial communities of known diversity, Pinto and Raskin (51) provided explicit evidences for the poor reproducibility of the amplicon sequencing-based detection approach, even with the simple community and relatively deep sequencing. Thus, poor reproducibility is a problem inherent in the amplicon sequencing-based detection approach (50).

Such poor reproducibility among technical replicates could result from PCR amplification biases (67–72), sequencing errors (51, 67, 71, 72), and/or the artifacts associated with random sampling processes during sample preparation and sequencing (50, 52). In this study, using analogous ball examples, we showed that very low percentages of overlap were observed among replicate samples from the community with identical ball types and numbers when the sampling effort was low, which is very consistent with what we observed experimentally (50). In addition, under different OTU abundance distributions, the obtained overlap percentages among two or three technical replicates were very consistent with the theoretical predicted values under the assumption of random sampling processes. The simulation results presented in this study provided explicit evidences of the contributions of random sampling processes to the high numbers of variations observed among technical replicate samples. It should be noted that spurious OTUs due to sequencing errors could also contribute to such technical variations, but they will be indistinguishable from rare abundant OTUs which can be detected only sporadically in replicate samples (51).

The low reproducibility of the amplicon sequencing-based detection approach associated with random sampling processes raises a concern of comparing the β-diversity of microbial communities across different samples for amplicon sequencing. Although the inherent high numbers of variations associated with random sampling processes have less effect on α-diversity, it is problematic in estimating β-diversity (50, 51), which presents a significant challenge for comparative studies across different spatial and temporal scales. However, the degrees of such effects on estimating β-diversity are dependent on the complexity of the community examined and sampling efforts. In general, as the complexity increases, such problems will be more severe (50) and greater sampling efforts will be needed (50, 62). Nevertheless, great caution needs to be taken for estimating community based on OTUs when the sequences examined are not enough to represent community diversity (51, 62).

Determining the patterns and distribution of species abundance is a central issue in ecology (53), because they are important in studying both basic ecological theory and applied biodiversity conservation. However, direct measurement of species abundance in ecologically relevant scales is difficult, if not impossible. Instead, the distribution of species abundance in ecology is generally inferred based on limited samples under Poisson sampling. Therefore, the artifacts associated with random sampling processes observed in this study for microbial communities should also be applicable to the other ecological studies in plants and animals, although such problems could be less severe in macroecology. However, to the best of our knowledge, such an issue has not been addressed in the ecological literature, but it is of critical importance in studying species distribution, especially species-area relationships (52, 73). Along with our previous efforts (50, 52), this study clearly demonstrates the importance of random sampling processes in estimating microbial biodiversity, especially β-diversity. The general conclusions learned from this study should be applicable to other ecological studies in general.

In conclusion, the factors causing variations in β-diversity are among the most important but poorly understood issues in ecology, because they are the key mechanisms influencing global variation in biodiversity. Next-generation sequencing technologies, such as PCR amplicon sequencing-based detection approaches, have been rapidly used for characterizing microbial biodiversity, but they also suffer from several inherent drawbacks, such as a high number of sequencing errors, biases, and poor reproducibility and quantitation. Through mathematical modeling and simulation based on general sampling theory, this study provides explicit evidences of random sampling processes as the main factor causing high percentages of technical variations and develops a framework for predicting sampling efforts for achieving the desired technical reproducibility. Since most ecological studies are involved in random sampling, the artifacts associated with random sampling processes observed in microbial ecology should also be applicable to macroecology, although such problems could be less severe there. Because such artifacts greatly overestimate β-diversity, great caution should be taken when the amplicon sequencing-based detection is used for drawing quantitative conclusions about β-diversity. Increasing sampling efforts and/or the number of sample replicates (both technical and biological) should be the most effective ways to ameliorate technical reproducibility for drawing more reliable quantitative conclusions, but how to balance the sampling efforts and number of samples analyzed per sequencing run is dependent on biological questions and objectives, as well as the complexity and similarity of communities examined.

## MATERIALS AND METHODS

The details for all materials and methods used in this study are provided in the supplemental material (see Text S1). Briefly, based on general sampling theory (54–56), a mathematical framework was first developed under 7 different OTU abundance distributions for simulating the effects of random sampling processes on quantifying β-diversity when the community size is known or unknown. Second, an analogous ball example was used to explicitly illustrate the effects of random sampling processes on the OTU overlap among technical replicates. Third, the theoretical models were fitted with the empirical experimental sequencing data from our previous studies (50), in which a total of 24 soil communities from a long-term climate change experiment facility (74) were sequenced with 60 tags. An average of 1,121 ± 390 OTUs were obtained for each tag based on the combined samples. In addition, a χ^{2} test was employed to determine whether the predicted OTU overlaps were consistent with the observed values.

### Text S1

## ACKNOWLEDGMENTS

This work has been supported through contract DE-AC02-05CH11231 (as part of ENIGMA, a Scientific Focus Area) and contract DE-SC0004601, by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Genomics: GTL Foundational Science, by the Biological Systems Research on the Role of Microbial Communities in Carbon Cycling Program (DE-SC0004601), by the U.S. National Science Foundation under contract NSF EF-1065844, and by the State Key Joint Laboratory of Environment Simulation and Pollution Control (grant 11Z03ESPCT) at Tsinghua University.

## FOOTNOTES

- Received 2 May 2013
- Accepted 7 May 2013
- Published 11 June 2013

- Copyright © 2013 Zhou et al.