Identifying Small Proteins by Ribosome Profiling with Stalled Initiation Complexes.

Proteins comprised of 50 or fewer amino acids have been shown to interact with and modulate the functions of larger proteins in a range of organisms. Despite the possible importance of small proteins, the true prevalence and capabilities of these regulators remain unknown as the small size of the proteins places serious limitations on their identification, purification, and characterization. Here, we present a ribosome profiling approach with stalled initiation complexes that led to the identification of 38 new small proteins.

KEYWORDS Ribo-seq, small protein, alternate ORFs, antisense, genome annotation, leader peptide
Protein-protein interactions play an essential role in a variety of cellular processes, such as signal transduction and gene regulation. Small proteins, here considered to be 50 amino acids or fewer and encoded by small open reading frames (smORFs), have been shown to interact with and modulate the functions of larger proteins (reviewed in references 1 to 3). These regulators have been identified in organisms spanning the phylogenetic tree of life, and important roles have been characterized for small proteins in bacteria and eukaryotes. In Escherichia coli, for example, the absence of the 49-amino-acid protein AcrZ renders cells more susceptible to specific antibiotics (4), and cells lacking the 31-amino-acid protein MgtS are sensitive to low magnesium concentrations (5,6). In humans and other mammals, the small proteins myoregulin, sarcolipin, and phospholamban regulate muscle activity by affecting calcium transport (7,8).
Despite the possible importance of small proteins, the true numbers of these regulators remain unknown, as their small size limits their identification. ORF-finding algorithms traditionally employ a size limit for the scoring of genes (9) and apply a penalty for overlapping other ORFs (10). Their small size also often prevents these proteins from being accurately detected with protein gels, as they may run in the dye front and be poorly bound by SDS or protein dye (11). Traditional methods of purification are also biased against small proteins (12,13), which have insufficient charge to bind ion-exchange columns and insufficient size to interact with non-reverse-phase hydrophobic columns or be retained during dialysis. Additionally, small membrane proteins can bind nonspecifically to many column matrices due to their hydrophobicity. Finally, the few peptide fragments generated by proteolysis of small proteins limit detection by shotgun proteomics (14). These challenges have stifled the detection of this class of proteins by standard methods.
As the importance of small proteins is being recognized, more focused searches for these proteins are being carried out (reviewed in reference 15). Early genome-wide studies in E. coli utilized conservation of intergenic DNA sequences and the strength of ribosome binding sites as a starting point for finding new small genes (16,17). Similar approaches have been applied in eukaryotic organisms (18,19), though the computational methods are more difficult, as both the increased size of the genome and the use of alternative splicing can mask small protein genes. In addition to the smORFs found in intergenic regions, there is growing recognition that transcripts can encode proteins in more than one ORF in the same region (reviewed in references 20 and 21); these alternative ORFs (altORFs) generally code for smaller proteins than the originally annotated ORF, with some reported altORF-encoded proteins as small as 14 amino acids (22). Despite the success of computational methods in identifying new smORFs, it is likely that many small proteins have been missed (false-negative results). Conversely, it is critical that the synthesis of predicted small proteins be verified to avoid false-positive results.
Integration of data from large transcriptome analyses can improve the success of computational searches for smORFs. Ribosome profiling, a method that involves deep sequencing of ribosome-protected mRNA fragments, reveals the position of ribosomes throughout the transcriptome, clarifying which smORFs are translated under the conditions examined (reviewed in reference 23). This approach has led to the identification of many small proteins (24-28), but again, there are limitations. Signals corresponding to altORFs encoded inside the confines of other genes can be swamped by the signal of the annotated gene. Another issue is that ribosome binding to an RNA does not prove that it leads to the production of a polypeptide (29). In eukaryotes, several signatures of profiling data that argue for translation are the presence of strong start and stop codon peaks, as well as three-nucleotide periodicity arising from the translocation of ribosomes one codon at a time (30). In bacteria, however, these signatures are weaker and more variable due to the lower resolution of the method, further complicating the discrimination of which transcripts are translated and which are merely bound to the ribosome.
Although peaks in ribosome density at start and stop codons are the most useful in identifying new ORFs, the vast majority of ribosome-protected footprints in profiling data correspond to elongating ribosomes. In eukaryotes, the antibiotics harringtonine and lactimidomycin have been used to trap newly initiated 80S ribosomes at start codons and identify initiation sites (31,32); elongating ribosomes are not inhibited by these antibiotics and continue elongation, terminating normally at stop codons. However, these compounds do not work in bacteria. Mori and coworkers (33) found that treating E. coli cultures with tetracycline, an antibiotic that blocks tRNA binding in the ribosomal A site, leads to the accumulation of ribosome density at start codons. Using ribosome profiling of tetracycline-treated cells, they were able to reannotate the N termini of many known ORFs and discover candidate smORFs in intergenic regions (33). However, tetracycline traps ribosomes imperfectly at start codons. Only half the ribosomes on genes map to their start sites, blurring the signal. One promising alternative, Onc112, prevents initiation complexes from entering into the elongation phase (34,35). Another promising substitute, retapamulin, a small molecule member of the pleuromutilin class, was previously shown to have a similar ability to specifically inhibit the first steps of elongation (36). The recent application of retapamulin in profiling experiments showed strong ribosome density at known start codons and little density attributable to elongating ribosomes; these data allowed the identification of start codons of altORFs within the coding sequences of other genes (37).
Here we present a strategy for identifying small protein genes in E. coli by combining traditional ribosome elongation data with information about initiation sites gleaned from profiling experiments conducted with Onc112 and retapamulin. We sought to verify the synthesis of a subset of the predicted small proteins by assays to detect tagged derivatives and observed expression of 38 of the 41 genes tested. These results demonstrate that ribosome profiling with stalled initiation complexes provides a high-confidence prediction of new small proteins in bacteria. Finally, the presence and location of these new smORFs reveal the density of information encoded by bacterial genomes.

RESULTS
Onc112 traps ribosomes at start codons but does not interfere with elongating ribosomes. The identification of initiation sites in eukaryotes has been aided by the use of antibiotics that enrich for ribosome density at start codons in ribosome profiling experiments. Since such antibiotics have not been available for bacteria, we tested a promising candidate, Onc112, a proline-rich antimicrobial peptide (PrAMP) that binds in the exit tunnel and blocks aminoacyl-tRNA binding in the ribosomal A site (34,38). Toeprinting analyses showed that Onc112 traps ribosomes at start codons, blocking elongation (35). We hypothesized that Onc112 should be selective for newly initiated 70S complexes because elongating ribosomes contain a nascent polypeptide that should prevent antibiotic binding. To test this possibility, we performed ribosome profiling on an untreated E. coli culture, as well as one treated with 50 µM Onc112 for 10 min. As shown in Fig. 1a, ribosome density on the highly expressed lpp gene is spread across the coding sequence in the untreated sample but is found almost exclusively at or near the start codon in the Onc112-treated sample. This effect holds genome-wide; in plots of ribosome density averaged over thousands of genes aligned at their start codons, a strong peak appears at the start codon, while there is little or no density attributable to elongating ribosomes within coding sequences (Fig. 1b). These data show that like harringtonine and lactimidomycin in eukaryotes, Onc112 specifically traps ribosomes at start codons, while allowing elongating ribosomes to complete protein synthesis and terminate normally.
Ribosome profiling signals for Onc112 and retapamulin are similar. A recent study used retapamulin and ribosome profiling to identify sites of noncanonical initiation within annotated ORFs (37). Like Onc112, retapamulin traps newly initiated 70S ribosomes at start codons while allowing elongating ribosomes to complete protein synthesis, such that ribosome density is strongly enriched at start codons (Fig. 1a). In the Vázquez-Laslop and Mankin study (37), 12.5 µg/ml retapamulin was added to a culture 5 min prior to harvesting the cells; ribosome profiling was then performed following the standard protocol (39). Since the gene encoding the efflux pump TolC was deleted in the strain assayed (BW25113), 12.5 µg/ml retapamulin corresponds to approximately 100 times the MIC. To compare the effects of Onc112 and retapamulin treatment, we calculated the intensity of start codon peaks on annotated ORFs, finding 3,020 genes with any detected ribosome density in the start codon region in both samples. There is a strong correlation between start peak intensity in these two antibiotic-treated samples (Spearman's r = 0.83), arguing that both methods capture initiating ribosomes in a reproducible way (Fig. 1c).
Subtle differences may arise from variations in gene expression due to the different culture conditions; the retapamulin-treated sample was cultured in LB media, whereas the Onc112-treated sample was obtained from a culture in complete synthetic MOPS media (see Fig. S1 in the supplemental material). However, the primary relevant difference in the sample preparation is that our protocol with Onc112 includes an additional step not found in the standard protocol: the samples are pelleted over a sucrose cushion prior to nuclease treatment. This step depletes tRNAs from the ribosomal A site, allowing nucleases to cleave within the ribosome, thus shortening ribosome footprints. As a result, while the distance from the P-site codon to the 3′ boundary of the ribosome is reliably 15 nucleotides (nt) in the retapamulin-treated library, it is more variable in our Onc112-treated library. Most often, the peak is 6 to 10 nt downstream of the start codon; it is 7 nt in the lpp example (Fig. 1a). This difference is useful in annotating novel initiation sites: a start codon 6 to 10 nt upstream of density in the Onc112 data and 15 nt upstream of density in the retapamulin data has a high chance of being a bona fide start site and not a sequencing artifact.
Onc112 and retapamulin can be used to identify putative translated smORFs. Given the ability of Onc112 and retapamulin to trap ribosomes at start codons at most annotated ORFs, we combined the information from these data sets to create a high-fidelity screening method for identifying new smORFs likely to be translated (Fig. 2a). We first generated a list of 160,995 smORFs of eight codons or longer whose start codons (AUG, GUG, or UUG) are 18 nt or more away from annotated coding regions (either protein-coding sequences or functional RNA genes).
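The enumeration step described above can be sketched in a few lines of Python. This is only an illustration and not the pipeline used in the study: the function names, the 0-based coordinates, the counting of codons without the stop codon, and the optional upper length limit are all assumptions made for the example.

```python
# Illustrative sketch only; names and conventions are hypothetical.
STARTS = {"ATG", "GTG", "TTG"}   # DNA equivalents of AUG, GUG, and UUG
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_smorfs(genome, min_codons=8, max_codons=50):
    """Yield (start, strand, n_codons) for every candidate smORF.

    `start` is the 0-based forward-strand coordinate of the first nt of
    the start codon. Codons are counted from the start codon up to, but
    not including, the stop codon (an assumption of this sketch).
    """
    n = len(genome)
    for strand, seq in (("+", genome), ("-", revcomp(genome))):
        for i in range(n - 2):
            if seq[i:i + 3] not in STARTS:
                continue
            # Walk in frame until the first stop codon.
            for j in range(i + 3, n - 2, 3):
                if seq[j:j + 3] in STOPS:
                    codons = (j - i) // 3
                    if min_codons <= codons <= max_codons:
                        pos = i if strand == "+" else n - 1 - i
                        yield pos, strand, codons
                    break


def distance_to_annotations(pos, annotated):
    """Smallest gap (nt) between `pos` and any inclusive (start, end)
    interval; 0 if `pos` lies inside an annotated region."""
    best = float("inf")
    for s, e in annotated:
        if s <= pos <= e:
            return 0
        best = min(best, s - pos if pos < s else pos - e)
    return best
```

A candidate would then be kept when `distance_to_annotations(start, annotated) >= 18`, mirroring the 18-nt exclusion rule in the text.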
We computed the ribosome density associated with each start site, including ribosome footprints 0 to 18 nt downstream of the first nt in the start codon. A broad window (18 nt) was used because the distance from the 3′ end of footprints to the start codon can vary depending on the sequence context and the manner in which libraries were prepared, and we wanted to capture all the relevant footprints. A plot of the cumulative distribution function (CDF) of these data is shown in Fig. 2b (left y axis); the y value reflects the percentage of predicted smORFs that have a start peak less than or equal to the x value. This plot shows that ~96% of the putative smORFs have no associated density at their start sites (x = 0). This means that ~4% have start peaks greater than zero. Only 0.25% of the predicted smORFs had more than 5 reads per million mapped reads (rpm), as delineated by the broken line in Fig. 2b. Thus, the vast majority of the putative smORFs likely are not translated.
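As a rough illustration of the start-peak calculation, the sketch below sums footprint counts in the 0-to-18-nt window and normalizes them to reads per million. It assumes that footprints have already been assigned to the position of their 3′ ends and stored in a dictionary keyed by (strand, position); that data layout and the function names are hypothetical.

```python
def start_peak_rpm(counts, start, strand, total_mapped, window=18):
    """Reads with 3' ends 0..window nt downstream of `start`, in rpm.

    `counts` maps (strand, position) to a raw read count (hypothetical
    layout). "Downstream" runs toward lower coordinates on the minus
    strand.
    """
    step = 1 if strand == "+" else -1
    raw = sum(counts.get((strand, start + step * d), 0)
              for d in range(window + 1))
    return raw * 1e6 / total_mapped


def fraction_at_or_below(values, x):
    """Empirical CDF: fraction of start peaks less than or equal to x."""
    return sum(v <= x for v in values) / len(values)
```

Evaluating `fraction_at_or_below(peaks, 0)` and `1 - fraction_at_or_below(peaks, 5)` on the full candidate list would give the two kinds of percentages quoted in the paragraph above.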
To calibrate our method for identifying new candidate smORFs, we examined the ribosome density on the start codons of smORFs previously shown to encode proteins. The test set included two different groups. The first group was comprised of 44 small proteins annotated initially together with small proteins identified by sequence conservation and strong matches to ribosome binding site models (16). The ribosome density after Onc112 or retapamulin treatment varied by 4 orders of magnitude (see Table S1 in the supplemental material): ~80% of this group had detected signal at start sites and ~60% of known smORFs had start peaks above 5 rpm (Fig. 2b, right y axis). The second group of proteins had less conservation and weaker matches to ribosome binding site models but were shown to be synthesized as tagged derivatives in a recent study (17). Of the 36 proteins in the second set, ~70% showed signal but only 20% had Onc112 or retapamulin reads above 5 rpm at the start site (Table S1), possibly due to the lower level of expression of these smORFs. Given that the majority of the small proteins in the first set of annotated smORFs have start peaks 5 rpm or higher (Fig. 2b), we used this threshold to eliminate false-positive results in our list of putative smORFs; 412 novel smORFs above this threshold were selected for further consideration.
FIG 2 Legend (Continued)
on the y axis less than or equal to the ribosome density near the start site (x axis) compared with candidate smORFs (n = 160,995) (left y axis, red and orange). Candidates with an average of >5 rpm were selected for further screening (broken line). (c) The proper spacing of ribosome density at start codons in treated samples helps to identify bona fide small protein-coding genes such as ORF22/yqgH. (d) In cases where several start codons could explain the ribosome density, spacing helps determine the correct site. ORF9/yhiY likely initiates with the second AUG codon of the three shown. (e) Many candidates were rejected because the start site does not align properly with the density observed.
An important caveat in treating cells with Onc112 and retapamulin is that these antibiotics could enhance ribosome density on some initiation sites that are not normally used. The antibiotics dramatically increase the concentration of free 30S and 50S subunits given that they allow elongating ribosomes to complete protein synthesis and be recycled but block entry into the elongation cycle. The recycled subunits are free to initiate at less optimal start codons, where they will be trapped by the antibiotics. To remove these false-positive results, we used traditional ribosome profiling data (from untreated cells) to capture elongating ribosomes along the entire ORF. Of the 412 smORFs with strong start codon peaks, 116 had traditional ribosome profiling density above 8 reads per kilobase per million mapped reads (rpkm).
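The two-threshold screen can be written compactly. The sketch below is an illustration under stated assumptions: rpkm is computed with the standard definition (reads per kilobase of ORF per million mapped reads), both cutoffs are treated as strict inequalities, and the function names are hypothetical.

```python
def rpkm(reads, length_nt, total_mapped):
    """Reads per kilobase of ORF per million mapped reads."""
    return reads * 1e9 / (length_nt * total_mapped)


def passes_screen(start_rpm, orf_rpkm, min_start_rpm=5.0, min_rpkm=8.0):
    """Keep a candidate only if it has both a strong start peak in the
    antibiotic-treated data and elongating-ribosome density in the
    untreated data. Strict inequalities are an assumption."""
    return start_rpm > min_start_rpm and orf_rpkm > min_rpkm
```

For example, 12 reads on a 60-nt smORF in a library of 10 million mapped reads correspond to 20 rpkm, which would pass the second cutoff.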
We next examined the 116 most promising candidates on a genome browser. In our screen for ribosome density at initiation sites (Fig. 2a), we summed the reads from 0 to 18 nt downstream of the first nucleotide in the start codon, an intentionally broad range. In our visual inspection, we searched for retapamulin peaks ~15 nt and Onc112 peaks 6 to 10 nt downstream of the first nucleotide in the start codon as seen for lpp (Fig. 1a). The same spacing is observed for the most promising candidates (e.g., ORF22/yqgH in Fig. 2c). In some cases of multiple possible start codons, we were able to readily predict the correct start based on distance (e.g., ORF9/yhiY [Fig. 2d]). For most of the candidates that were rejected, the predicted start site did not align with the Onc112 or retapamulin ribosome density (Fig. 2e). Another source of false-positive results was smORFs close to highly translated genes, such as ribosomal proteins, where the noise is high enough to pass the cutoff for start peaks and normal profiling density (data not shown). Based on these criteria, 67 candidates were rejected, leaving 49 candidates. Visual inspection proved helpful in refining the data, but in the future, our algorithms can be further developed to incorporate additional criteria for large-scale screens for candidate smORFs.
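The spacing criterion applied by eye could, in principle, be automated along the following lines. This is a hypothetical sketch and not a description of the authors' software: the per-position count dictionaries, the function names, and the ±1-nt tolerance around the 15-nt retapamulin offset are assumptions.

```python
def peak_offset(counts, start, strand, window=18):
    """Offset (nt downstream of `start`) of the tallest position in the
    window, or None if the window holds no reads."""
    step = 1 if strand == "+" else -1
    profile = [counts.get((strand, start + step * d), 0)
               for d in range(window + 1)]
    best = max(profile)
    return profile.index(best) if best > 0 else None


def spacing_ok(onc_counts, ret_counts, start, strand,
               onc_range=(6, 10), ret_target=15, ret_tol=1):
    """True if the Onc112 peak lies 6 to 10 nt and the retapamulin peak
    about 15 nt downstream of the first nt of the start codon."""
    onc = peak_offset(onc_counts, start, strand)
    ret = peak_offset(ret_counts, start, strand)
    if onc is None or ret is None:
        return False
    return (onc_range[0] <= onc <= onc_range[1]
            and abs(ret - ret_target) <= ret_tol)
```

Applied to several in-frame start codons near one peak, a check like this would single out the codon whose distance to both peaks fits, as in the ORF9/yhiY case.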
We also inspected 50 additional smORFs with strong start peaks (>5 rpm) for which we were unable to calculate rpkm values for elongating ribosomes because the smORFs overlap an annotated gene and the ribosome density cannot be assigned to one gene or the other. Upon inspection, 36 of these smORFs were rejected due to incorrect start site selection or high levels of noise due to adjacent highly translated genes, leaving 14 of interest. In addition to these 14 candidates and the 49 discussed above, another three were discovered as the correct start sites for candidates that were rejected, and two more were discovered in a preliminary screen using similar cutoffs but a different collection of traditional ribosome profiling data.
Together, this workflow yielded 68 candidate smORFs with high start codon peaks and some level of traditional ribosome profiling data, including both independent genes and altORFs (Table S3). Initially, the smORFs were assigned numbers but were renamed if we obtained evidence of small protein synthesis (see below). As expected, the majority of these candidates start with AUG codons (n = 50), although GUG (n = 9) and UUG (n = 9) codons were also observed. A histogram of the predicted protein lengths is shown in Fig. S2: the majority of the predicted small proteins are 40 amino acids or fewer, although the analysis also identified seven candidates that were longer than 50 residues. A few of the candidates are overlapping in that they correspond to different possible start codons in the same frame.
The majority of predicted small proteins are synthesized. To validate that the corresponding small proteins are synthesized, a sequential peptide affinity (SPA) tag (comprised of the 3× FLAG tag and calmodulin binding protein, adding 8 kDa [41]) was integrated into the chromosome upstream of the stop codon of the 38 putative smORF genes with the highest ribosome density in the presence of the inhibitors and deemed the strongest candidates by the visual inspection. The tag allowed immunoblot analysis on the basis of the 3× FLAG epitope (Fig. 3). While the exposure needed to detect the small proteins varied significantly (as reflected in different levels of the background bands), 36 of the 38 tagged small proteins were detected in cells grown to exponential or stationary phase in LB at 37°C, conditions comparable to those used in the ribosome profiling experiments. The inability to detect the remaining two chromosomally tagged smORFs (ORF24 and ORF56, as well as ORF33 as shown below) could stem from these smORFs yielding false-positive results in the screen or from the degradation of the tagged derivatives. Nonetheless, we have observed the expression of the majority of the predicted genes, validating the predictive capability of utilizing multiple ribosome profiling data sets.
Several previously detected small proteins are expressed only under very specific growth conditions (17,42). As shown in Fig. 3, we observe that 10 newly detected small proteins are present at >2-fold higher levels in exponential phase and four are present at >2-fold higher levels in stationary phase. The majority of the small proteins appear at roughly equal levels during both of these growth phases but may be induced under other conditions.
FIG 3 Western analysis confirms synthesis of 95% of predicted small proteins tested. E. coli MG1655 strains with chromosomally tagged, putative smORFs were grown to exponential (E) and stationary (S) phase in rich media (LB). Gel samples were prepared to load equivalent numbers of cells based on OD600. Immunoblot analysis was conducted against the 3× FLAG motif included in the SPA tag using HRP-conjugated, anti-FLAG antibodies. Wild-type MG1655 was included as a negative control. Blots requiring a longer exposure to show tagged proteins have more background bands. Bands corresponding to small proteins are marked with an asterisk.
The levels of tagged small proteins span a wide range. As indicated above, the ability to detect the small proteins varied. To directly compare the overall levels of the proteins, both among themselves and with previously identified small proteins, we analyzed stationary-phase samples of several examples of each group of proteins (Fig. 4). Among the newly identified proteins, the levels of YnfU are highest, but these levels fall between the levels of the characterized multidrug efflux pump regulator AcrZ (diluted fivefold in Fig. 4a) and the uncharacterized protein YoaK, which, respectively, are among the better- and worse-expressed small proteins identified in initial searches (16). The levels of the remaining small proteins cover a wide range, as is seen when comparing the samples loaded on two different gels as a reference, YsgD in Fig. 4a and b and YthB in Fig. 4b and c. These blots also show that most of the other newly identified small proteins are expressed at levels below the level of YoaK under the conditions tested.
We also compared the levels of the new small proteins to five (YnaM, YnfS, YbgU, YddY, and YmjE) of the 36 small proteins identified more recently (17). Three of the proteins (YnaM, YnfS, and YbgU) are observed at levels comparable to most of the newly identified small proteins, while two (YddY and YmjE) are more comparable to the least-abundant small proteins identified in this study (Fig. 4c). It is interesting to note that YnaM, which had no ribosome density at start codons in the presence of Onc112 or retapamulin, was detected at higher levels than most of the newly detected small proteins, while YnfS, which has strong start peaks in both antibiotic-treated samples, was detected at lower levels.
Some small proteins are encoded antisense to genes encoding expressed proteins. Given that antisense transcription in bacteria frequently is a means of gene silencing (reviewed in references 43 and 44), we were surprised to note that eight of the newly detected proteins are encoded antisense to annotated protein-coding genes (Table 1). Additionally, one predicted smORF, yoaM, could not be tagged as it is found antisense to the operon of the essential nrdA and nrdB genes (encoding ribonucleoside-diphosphate reductase 1) (Fig. 5a). To test for expression of YoaM, we generated a translational fusion at the lacZ locus. Consistent with translation of this antisense-encoded small protein, we detect higher β-galactosidase expression for the yoaM-lacZ fusion than for an out-of-frame control fusion (Fig. 5e). Given that a clear transcriptional start was noted 174 nucleotides upstream of the YoaM start codon (45), it is possible that the synthesis of this protein is under posttranscriptional regulation.
We wanted to determine whether annotated proteins and the newly identified small proteins encoded by transcripts on opposing strands are both synthesized. We therefore introduced chromosomal SPA tags upstream of the stop codons of the previously annotated genes yqgC (antisense to yqgG) (Fig. 5b), yghE (antisense to yqhJ) (Fig. 5c), and waaL (antisense to yibX and yibY) (Fig. 5d). For YqgC (a protein of unknown function), the gene does not have any associated ribosome density in either treated or untreated cells, and the corresponding tagged protein is not observed under these conditions (Fig. 5f). YghE (another protein of unknown function), while detected, appears to be present at lower levels than YqhJ (Fig. 5g), consistent with its low levels of normal ribosome density (not visible at the scale used in Fig. 5c). WaaL (an O-antigen ligase) was clearly detected under the same growth conditions as YibX and YibX-S (Fig. 5h). We suggest the appearance of a smear for WaaL may be due to bound oligosaccharide substrates. In general, our results confirm that proteins can be encoded by both strands of the same region of DNA and expressed under the same growth conditions.
FIG 4 Legend (Continued)
Fig. 3 (black) were compared to each other and to similarly prepared samples of previously detected small proteins (gray) with the same chromosomal tag (17,42). Immunoblot analysis for cells grown to stationary phase was conducted as described in the legend to Fig. 3 with E. coli MG1655 as a negative control. All samples are in the MG1655 background and equally loaded, except for AcrZ, where the sample was diluted 1:5. Ponceau S staining for the same region is shown below each immunoblot.
YibX is translated as two isoforms. The yibX gene was also interesting as the profiling data suggested translation could initiate from two different start codons. While most bacterial ORFs encode a single protein, there are some examples where different isoforms of the same protein are generated by different translation starts in the same frame, as has been found for the E. coli proteins ClpB, IF-2, and MrcB (46-48). Frequently, the longer polypeptide is expressed at higher levels than the shorter isoform. A broad peak near the start codon for the ribosome profiling data suggests that several small proteins are potentially translated as different isoforms. Although most of the potential isoforms vary by only a few codons and would be indistinguishable on immunoblots, the YibX alternative start sites lead to proteins of substantially different sizes. The stronger signal corresponds to the 24-amino-acid (aa) YibX-S protein, while a second signal at a GTG codon upstream and in frame with YibX-S yields an 80-aa protein, adding ~6.1 kDa (Fig. 5h). Both bands are detected in Fig. 3 and 5, but in contrast to other known primary isoforms, the 80-amino-acid protein is detected at lower levels than the shorter isoform. A second protein for which there are possible isoforms is YqhJ, which shows two bands in Fig. 5g. YqhJ initiates at a GTG codon and is 19 residues long; initiation at a downstream TTG codon would yield a 13-residue protein (Table S3). Ribosome density in Onc112- and retapamulin-treated samples is consistent with both of these initiation sites being used (data not shown).
Multiple smORFs are encoded by different, overlapping frames. There are a growing number of bacterial examples where more than one protein is encoded in the same region in different frames, as has been found for rzoD encoded within rzpD, which are homologous to the rz/rz1 lysis cassette of bacteriophage λ (49,50). A similar gene arrangement of nested start codons and substantial overlap is also found for two sets of newly identified small proteins: YhgO/YhgP (Fig. 6a) and YriA/YriB (Fig. 6b). Additionally, the smORFs encoding two other new proteins, YbgV and MgtT, overlap the 3′ ends of the previously identified smORFs ybgU (Fig. 6c) and mgtS (Fig. 6d), respectively. We sought to compare the levels of the paired small proteins under the same conditions by assaying cells with one or the other smORF tagged (Fig. 6e to h).
Although there generally appears to be limited correlation between ribosome density and observed protein levels, for each of these pairs, the small protein corresponding to the smORF with the higher ribosome density with either Onc112 or retapamulin treatment (YhgP, YriB, YbgV, and MgtS) was present at higher levels. Perhaps there is a better correlation between ribosome density and observed protein levels for cotranscribed genes.
smORFs overlap the 5′ ends of larger protein-coding genes. The genes of several new small proteins detected by immunoblot analysis (Fig. 3) were found to overlap the 5′ ends of annotated larger genes in a different frame, including baxL-baxA, evgL-evgA, and argL-argF. Two additional smORFs predicted by ribosome profiling, ORF33 and pssL, also overlap the 5′ end of the neighboring gene in a different frame, but we were unable to SPA tag these predicted proteins because the downstream genes, accD (acetyl-CoA carboxyltransferase subunit β) (Fig. 7a) and pssA (phosphatidylserine synthase) (Fig. 7b), are essential. To investigate the expression of ORF33 and PssL, the 5′ UTR and the first few codons of the smORFs were translationally fused to lacZ on the chromosome (40). While there was no measurable β-galactosidase activity for the ORF33-lacZ fusion (Fig. 7f), there was clear expression of the pssL-lacZ fusion, which was diminished by the introduction of a stop codon at the start codon position (Fig. 7g).

FIG 5 Legend (Continued)
chromosomal fusions of the 5′ UTR and initial codons of yoaM fused to lacZ as well as out-of-frame control fusion (e), which were grown in rich media (LB) with 0.2% arabinose. (f to h) Protein levels for chromosomally SPA-tagged yqgC and yqgG (f), yghE and yqhJ (g) and waaL and yibX (h) genes. Gel samples were prepared from MG1655 strains grown to exponential (E) and stationary (S) phase in LB. Immunoblot analysis was conducted as described in the legend to Fig. 3 with MG1655 as a negative control. Bands corresponding to small proteins are marked with an asterisk, and bands corresponding to antisense-encoded larger proteins are marked with two asterisks.
Weaver et al.
FIG 6 Legend (Continued)
ybgU/ybgV (c), and mgtS/mgtT (d), with previously identified small protein genes in gray, newly identified small protein genes in blue, and small RNA gene mgrR in green. (e to h) Levels of corresponding proteins. Gel samples were prepared from MG1655 strains grown to exponential (E) and stationary (S) phase in LB. Immunoblot analysis was conducted as described in the legend to Fig. 3 with MG1655 as a negative control. Bands corresponding to small proteins are marked with an asterisk.

These results indicate that although we could not construct a pssL-SPA fusion at the endogenous location of the genome, the protein is translated.

Role of smORFs regulating expression of larger protein encoded downstream.
Given other examples where smORFs overlapping downstream genes serve as leader peptides involved in modulating the translation of the larger gene (51; reviewed in reference 52), we next sought to investigate whether translation of the smORFs overlapping larger ORFs described above affects translation of the downstream ORF. To test this, the entire 5′ UTR, including the smORF together with the first codons of the downstream gene, was fused to lacZ at the endogenous lacZ locus. We also generated a second version of these constructs by introducing amber or ochre stop codons into the smORF as a replacement for the start codon. If translation of the two ORFs is coupled, the stop codon, which blocks the expression of the upstream smORF, should impact translation of the downstream gene. In the case of the ORF33-accD pair, for which we did not see any expression of ORF33, the stop codon had no impact on accD-lacZ expression (Fig. 7f). In contrast, introduction of a stop codon into pssL led to a 30% decrease in the expression of the pssA-lacZ fusion (Fig. 7g), while introduction of a stop codon into yoaL, a recently identified smORF (17), led to strongly decreased expression of yoaE-lacZ (Fig. 7h). An increase in the expression of the downstream gene is observed when stop codons are introduced into baxL (Fig. 7i) and argL (Fig. 7j). Together, these results indicate that translation of these upstream smORFs may be playing a regulatory role.

DISCUSSION
Fundamentally, the challenge of identifying expressed small proteins stems from the great number of putative smORFs, with ~161,000 possible smORFs in intergenic regions of E. coli alone. The key question is how best to identify and validate candidate smORFs in a manner that prevents the annotation of uncorroborated genes. Rather than relying solely on bioinformatic approaches, as has been done previously, we demonstrated that an approach that utilizes multiple ribosome profiling data sets can identify translated smORFs with a high degree of accuracy. The expression of 36 of these smORFs was verified by immunoblot analysis of the chromosomally tagged genes, and the expression of two other genes that could not be tagged at the endogenous loci was observed as chromosomal lacZ fusions. We noted a number of interesting gene arrangements, including small proteins encoded on the strand opposite larger, annotated proteins, as well as smORFs in the 5′ UTRs of known genes.
Limitations of approach. While we were able to identify many new small proteins, we are cognizant of some limitations. One important caveat is that start codon peak intensity in profiling experiments is not a truly quantitative measure of initiation rates, given that reads at a single site are prone to sequence-specific artifacts (53). Examination of the profiling data of previously identified smORFs illustrates this limitation, as there is not a strong correlation between ribosome density in the presence of the initiation complex inhibitors and the band intensity observed by immunoblot analysis. While the degradation of some tagged small proteins may explain ribosome density without corresponding protein bands, other smORFs yield strong bands without any sequencing reads in the profiling experiments. Determining the factors that contribute to the perceived mismatch between ribosome density and observed protein levels would allow for a more accurate prediction of expression. It also must be considered that, although only occurring for a short duration, treatment with Onc112 or retapamulin represents stress on the bacteria that can cause changes in the expression profile.

codons of ORF33 (f) and pssL (g) fused to lacZ. β-Galactosidase activity was assayed for cells carrying lacZ chromosomal fusions to the 5′ UTR and initial codons of the downstream gene with a wild-type start codon for the upstream smORF or with a stop codon replacing the start codon (f to j). For all β-galactosidase assays, cells were grown in LB with 0.2% arabinose.

One other major limitation regarding the general application of this approach is that the microbes must be susceptible to these initiation complex inhibitors. Retapamulin is a member of the pleuromutilin class of antibiotics that show activity against a broad spectrum of Gram-positive bacteria, though some derivatives show activity against Gram-negative bacteria as well (54,55). To increase susceptibility to retapamulin, the group of Vázquez-Laslop and Mankin (37) used a tolC mutant strain of E. coli, an approach that may need to be employed in other bacteria. Onc112, a member of the PrAMP family of peptide antibiotics, is actively transported into Gram-negative bacteria by proteins such as the SbmA transporter (56). It may be possible to extend the range of compounds like Onc112 by exogenous expression of transporters such as SbmA in bacteria that otherwise lack them.
Advantages of approach. Despite the possible limitations, the ability to identify start codons through ribosome profiling with inhibitors is a powerful approach with broad applications. As shown here, translated smORFs are more prevalent than previously believed and are found in contexts that would be difficult to distinguish by other methods, including bioinformatic approaches that have been successfully employed previously (16,17). While traditional ribosome profiling can guide the prediction (57,58) or, in conjunction with experiments to verify protein synthesis, even support the annotation of intergenic smORFs in bacteria (27), ribosome profiling with stalled initiation complexes allows for the identification of protein-coding sequences in contexts that are generally ignored, including within or overlapping other genes as shown here and by the group of Vázquez-Laslop and Mankin (37). These new, internal altORFs may represent new classes of functional and regulatory proteins that comprise an ever-expanding proteome.
Interestingly, we noted relatively poor overlap between our predicted smORFs and those reported in the other ribosome profiling studies (27,33), suggesting that many small proteins remain to be discovered. Of the 328 smORFs predicted by Mori and coworkers (33) in intergenic regions in E. coli based on ribosome enrichment at start codons after treatment with tetracycline, only 20 overlap with our list of 68 likely candidates (Table S3). The fact that Onc112 and retapamulin are more specific than tetracycline for newly initiated ribosomes, providing higher resolution for start codon identification, may partially explain the limited overlap with our predicted smORFs. We also looked for overlap between our 68 likely candidates and the 130 smORFs predicted in Salmonella enterica in a recent study using traditional ribosome profiling (27). Only one exact match and three close matches were found between these related species.
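The extent of the overlap between the two E. coli candidate lists can be made concrete with a small calculation. The following Python sketch uses only the counts quoted above (20 shared smORFs, 68 likely candidates, 328 predictions from reference 33); the function name and the use of a Jaccard index are illustrative assumptions and not part of the original analysis, and the Jaccard value is only meaningful if the two lists are treated as directly comparable sets.

```python
def overlap_summary(n_shared: int, n_a: int, n_b: int) -> dict:
    """Summarize the overlap between two candidate lists.

    n_shared: candidates present in both lists
    n_a, n_b: sizes of the two lists
    The Jaccard index is an illustrative choice, not a statistic
    reported in the text.
    """
    if n_shared > min(n_a, n_b):
        raise ValueError("shared count cannot exceed either list size")
    union = n_a + n_b - n_shared
    return {
        "fraction_of_a": n_shared / n_a,  # share of list A also found in B
        "fraction_of_b": n_shared / n_b,  # share of list B also found in A
        "jaccard": n_shared / union,      # shared / union of both lists
    }


# Counts taken from the text: 20 shared, 68 likely candidates, 328 predictions.
summary = overlap_summary(n_shared=20, n_a=68, n_b=328)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

With these counts, roughly 29% of the 68 likely candidates appear among the 328 tetracycline-based predictions, whereas only about 6% of those predictions appear in the candidate list.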
In addition to facilitating the identification of new smORFs, the profiling data with inhibitors provide valuable information about known ORFs and suggest the need to reannotate some genes (see Table S1 in the supplemental material). One example is the smORF ymiA, which is annotated both as beginning with MLISDGDYMRLAMPSGNQEP (59) and as beginning with the third methionine at MPSGNQEP (16) but likely initiates with MRLAMPSGNQEP (Fig. S3). Another example is the ymdG protein. Although it is annotated as 40 residues, our data show that a later start codon is used and that the smORF is only eight codons long (Fig. S3). Finally, yoaL, which was herein examined for function as a leader peptide (Fig. 7), was originally annotated as initiating on a methionine 13 codons upstream of its likely start site (17) (Fig. S3). Our data provide the first experimental evidence of where translation begins in these three smORFs, but we cannot rule out the possibility that alternate start sites are used under different biological conditions.
Small protein function. Many of the small proteins are expected to have functions that involve binding to other, larger proteins. However, the primary structures of the small proteins are often too short for bioinformatic tools to identify motifs or domains that may offer insights into their functions in the cell. Of the newly identified small proteins, only YnfU, which is encoded within the Qin prophage region of the E. coli genome, had an identifiable motif. The protein contains a pair of zinc knuckles, a motif with two copies of the CPXC sequence that together chelate a zinc ion (reviewed in reference 60). Homology modeling of YnfU using PSIPRED (61) also revealed a moderate match to the zinc-binding domain of PA0128, a protein of unknown function from Pseudomonas aeruginosa.
Although motif identification is often not available for smORFs, multiple previously identified proteins were predicted to contain transmembrane helices and were later experimentally shown to localize to the cellular membrane (16). When we examined the sequences of the new smORFs using the Phobius or ExPASy TMpred algorithms (62,63), none of the newly identified proteins were predicted to contain transmembrane helices. This analysis shows that the skew toward hydrophobic α-helices overall is not as strong as observed for the small E. coli proteins identified in the first systematic search for these proteins (16). In general, the next challenge will be to determine functions for the large numbers of newly identified proteins.
Four new smORFs and one previously annotated smORF were examined for possible roles as leader peptides, as these small protein genes overlap the downstream coding sequences of larger proteins in alternate frames (Fig. 7). For each of the expressed genes, either an increase or decrease in the translation of the downstream gene was observed when the upstream smORF was not translated. For genes where expression decreases, this drop may stem from a loss of translational coupling from the upstream gene, while for genes with improved expression, translation of the smORF may impede translation of the downstream gene. It is interesting to note that a mutation (pssR1) that leads to increased expression of pssA mapped to the anti-Shine-Dalgarno sequence of the 16S rRNA encoded by rrnC (64). Further characterization will be required to distinguish smORFs that are simply translated in operons versus those that specifically serve to control the translation of downstream genes and to elucidate the regulatory mechanisms.
Complex gene organization. Beyond the expanded presence of smORFs as possible upstream leaders of other genes, our analysis also pointed to other forms of complex gene organization. We found several smORFs that overlap other new or known smORFs. We also discovered small proteins encoded antisense to larger proteins, as well as at least one small protein that is translated as two isoforms. We hypothesize that the pairs of bacterial genes encoded in overlapping regions have related functions.
Since we think we have not yet identified the complete set of small protein genes, we suggest that antisense genes and translational regulation by upstream smORFs may be far more prevalent than currently thought. Full annotation of translated regions of the chromosome will be required to obtain a more comprehensive picture of cellular regulation. Additionally, more complete annotation of translation will provide a better understanding of the roles of the many seemingly orphan transcription start sites observed in transcriptome data (45). The use of ribosome profiling with initiation complex inhibitors revealed 38 new protein-coding genes in E. coli, an organism already known to express nearly 100 small proteins. For less-well-characterized bacteria, the ability to define the small proteome accurately and in an unbiased manner opens new doors to uncovering the regulation that allows the growth and survival of these organisms.

MATERIALS AND METHODS
Onc112 ribosome profiling. A culture of E. coli MG1655 was grown overnight at 37°C in MOPS EZ Rich Defined media (Teknova) with 0.2% glucose, diluted 1:100 into 150 ml of fresh medium, and grown to an optical density at 600 nm (OD600) of 0.3. The culture was treated with 50 μM Onc112 for 10 min and harvested by rapid filtration and freezing in liquid nitrogen. Ribosome profiling libraries were prepared and sequenced as previously described (65) with the following modifications. Normally, the standard lysis buffer contains chloramphenicol to arrest translation in the lysate. We omitted chloramphenicol and added 1 M NaCl to the lysis buffer because we have found that high salt concentrations arrest translation better than chloramphenicol. Use of high-salt buffers necessitates a buffer exchange prior to nuclease digestion: 25 AU of RNA in the lysate was pelleted over a 1-ml sucrose cushion (20 mM Tris [pH 7.5], 500 mM NH4Cl, 0.5 mM EDTA, 1.1 M sucrose) using a TLA 100.3 rotor at 65,000 rpm for 2 h. Pellets were resuspended in 200 μl of the standard lysis buffer, and the RNA was digested with MNase following the standard protocol. We anticipate that the standard protocol for harvesting cells and preparing libraries would give equally good results after Onc112 or retapamulin treatment.
Analysis of ribosome profiling data. Raw reads were filtered and trimmed using Skewer v0.2.2. Reads were mapped uniquely to the E. coli MG1655 genome NC_000913.3 (allowing two mismatches) using Bowtie v0.12.7 after reads mapping to tRNA and rRNA were discarded. Ribosome density was assigned to the 3′ ends of reads. We identified novel open reading frames eight sense codons or longer starting with ATG, GTG, or TTG codons at least 18 nt away (on either side) from any annotated genes. For each potential site, the ribosome density in Onc112- or retapamulin-treated samples was summed 0 to 18 nt downstream of the first nucleotide in the start codon to calculate the initiation peak intensity. Note that a single peak of Onc112 or retapamulin density may correspond to multiple start codons in the 18-nt window; this redundancy was eliminated by inspecting the top candidates in a genome browser and looking for the optimal spacing as described in the text and Fig. 2. We also calculated rpkm values for normal ribosome profiling data for each candidate smORF unless any part of it comes within 15 nt of an annotated gene. The candidate smORFs and their scores are reported in Table S2 in the supplemental material. The retapamulin treatment data can be found at GSE122129 and the Onc112 treatment data can be found at GSE123675. Our code is available at https://github.com/greenlabjhmi.
Strain construction. All strains generated for this study are listed in Table S4 together with the sequences of the oligonucleotides used to construct the strains. smORFs were tagged on the chromosome following published procedures (66). In short, an SPA-kan cassette was inserted at the C-terminal end of each ORF using the Red recombination system in E. coli NM400 and moved into E. coli MG1655 by P1 transduction. All insertions were verified by sequencing.
Construction of the lacZ reporter strains followed a published procedure (40). Briefly, DNA including the 5′ UTR and several codons of each ORF, along with flanking homology regions, was transformed into E. coli PM1205, which utilizes the Red-mediated recombination system, and selected for sucrose resistance. All insertions were verified by sequencing.
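The candidate search and initiation peak scoring described under "Analysis of ribosome profiling data" can be sketched in Python as follows. This is a minimal illustration and not the code from the repository cited above: the function names, the half-open coordinate convention, the forward-strand-only scan, and the reading of "0 to 18 nt downstream" as an inclusive window are all assumptions. The thresholds (eight sense codons, 18-nt gap, 18-nt window) and the start codons come from the text; the stop codons are the standard three.

```python
from typing import Dict, Iterator, List, Tuple

START_CODONS = {"ATG", "GTG", "TTG"}   # start codons named in the text
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard stop codons
MIN_SENSE_CODONS = 8                   # minimum ORF length from the text
MIN_GAP_NT = 18                        # minimum distance to annotated genes
PEAK_WINDOW_NT = 18                    # window for summing initiation density


def find_candidate_orfs(seq: str) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for forward-strand ORFs, 0-based and half-open.

    start is the first nucleotide of the start codon; end is one past
    the stop codon. The reverse strand would be handled by running the
    same scan on the reverse complement (not shown).
    """
    seq = seq.upper()
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in START_CODONS:
            continue
        # Walk in frame until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_sense = (pos - start) // 3  # includes the start codon
                if n_sense >= MIN_SENSE_CODONS:
                    yield start, pos + 3
                break


def far_from_annotations(start: int, end: int,
                         annotated: List[Tuple[int, int]],
                         min_gap: int = MIN_GAP_NT) -> bool:
    """True if the ORF lies at least min_gap nt from every annotated gene."""
    return all(end + min_gap <= a_start or a_end + min_gap <= start
               for a_start, a_end in annotated)


def initiation_peak(density: Dict[int, float], start: int,
                    window: int = PEAK_WINDOW_NT) -> float:
    """Sum inhibitor-treated density from the start codon's first nt
    through `window` nt downstream (inclusive; an assumed convention).

    density maps genome position to reads assigned by their 3' ends.
    """
    return sum(density.get(start + offset, 0.0)
               for offset in range(window + 1))
```

In use, each interval from `find_candidate_orfs` that passes `far_from_annotations` would be scored with `initiation_peak`. Because several start codons can fall within one 18-nt window, the same peak can score more than one candidate, which is why the top candidates were inspected manually as described in the text.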
Immunoblot analysis. For all expression experiments, Luria broth (LB) was inoculated 1:200 with overnight culture of various strains and grown at 37°C. One milliliter of culture was taken during exponential growth (2 h postinoculation, OD600 of 0.5 to 0.7) and during stationary phase (3.5 h postinoculation, OD600 of 2.5 to 3). To normalize for total cells (number/density/count), the cell pellet collected for each sample was resuspended according to the OD600. Samples were analyzed by SDS-PAGE, transferred to nitrocellulose membranes, and blotted using anti-FLAG(M2)-HRP (Sigma).
Assays of β-galactosidase activity. For all experiments, LB with 0.2% arabinose was inoculated 1:200 with overnight culture of PM1205 strains carrying various lacZ fusions. These cultures were grown at 37°C for 2.25 h (OD600 of 0.75 to 1.0). Culture (10, 50, or 100 μl depending on the sample) was added directly to Z buffer (800-μl total volume) in 1.5-ml microcentrifuge tubes. SDS (0.00184%) and chloroform (3.5% vol/vol) were added, and samples were vortexed for 30 s. The samples were incubated at 28°C for 15 min before the addition of ortho-nitrophenyl-β-galactoside (ONPG) (0.875 mg/ml). Incubation at 28°C continued until a visible color change occurred, at which time sodium carbonate (353 mM) was added to quench the reaction. All reactions were quenched by 75 min, even if no color change was observed. Samples were centrifuged at maximum speed in a table-top microcentrifuge (~21,000 × g) for 2 min. The absorbance at 550 nm and 420 nm was measured for 1 ml of supernatant, and Miller units were calculated using the established formula (67).
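The Miller unit calculation can be sketched as below, assuming that the established formula is the standard Miller equation, 1000 × (A420 − 1.75 × A550) / (t × v × OD600), with t the reaction time in minutes and v the culture volume in milliliters. The function names and the example readings are hypothetical, and the helper for relative change is included only to illustrate how a wild-type fusion and its stop codon variant might be compared.

```python
def miller_units(a420: float, a550: float, od600: float,
                 minutes: float, volume_ml: float) -> float:
    """Standard Miller equation (assumed to be the formula cited).

    a420, a550: absorbance of the cleared supernatant
    od600:      optical density of the culture at sampling
    minutes:    time from ONPG addition to quenching
    volume_ml:  culture volume added to the assay (e.g. 0.05 for 50 ul)
    """
    if od600 <= 0 or minutes <= 0 or volume_ml <= 0:
        raise ValueError("od600, minutes, and volume_ml must be positive")
    # The 1.75 * A550 term corrects A420 for light scattering by debris.
    return 1000.0 * (a420 - 1.75 * a550) / (minutes * volume_ml * od600)


def relative_change(wild_type: float, mutant: float) -> float:
    """Fractional change of the mutant fusion relative to wild type."""
    if wild_type == 0:
        raise ValueError("wild-type activity must be nonzero")
    return (mutant - wild_type) / wild_type


# Hypothetical readings for a 50-ul sample quenched after 20 min.
units = miller_units(a420=0.5, a550=0.02, od600=0.8,
                     minutes=20, volume_ml=0.05)
print(f"{units:.2f} Miller units")
```

With this helper, a stop codon variant measuring 70 units against a wild-type fusion at 100 units gives a relative change of -0.30, the form in which a 30% decrease would be expressed.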

ACKNOWLEDGMENTS
We thank Nora Vázquez-Laslop and Alexander Mankin for sharing the retapamulin-treated ribosome profiling data prior to publication and Matthew Hemm for sharing strains prior to publication. We also thank N. Vázquez-Laslop, A. Mankin, P. Adams, and M. Wu Orr for comments on the manuscript.
Work in the laboratory of G.S. was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development. Work in the laboratory of A.R.B. was supported by grants from the National Institute of General Medical Sciences (GM110113 and GM105816).