Next-Generation High-Throughput Functional Annotation of Microbial Genomes

ABSTRACT Host infection by microbial pathogens cues global changes in microbial and host cell biology that facilitate microbial replication and disease. The complete maps of thousands of bacterial and viral genomes have recently been defined; however, the rate at which physiological or biochemical functions have been assigned to genes has greatly lagged. The National Institute of Allergy and Infectious Diseases (NIAID) addressed this gap by creating functional genomics centers dedicated to developing high-throughput approaches to assign gene function. These centers require broad-based and collaborative research programs to generate and integrate diverse data to achieve a comprehensive understanding of microbial pathogenesis. High-throughput functional genomics can lead to new therapeutics and better understanding of the next generation of emerging pathogens by rapidly defining new general mechanisms by which organisms cause disease and replicate in host tissues and by facilitating the rate at which functional data reach the scientific community.

M icrobial genome sequencing efforts continue at an increasing rate, resulting in an expanding catalogue of genes of unknown function for important pathogens (Fig. 1). In some cases, up to 40% of the annotated genes in fully sequenced microbial genomes have no known or predicted function. In many cases, these uncharacterized genes are highly conserved and are implicated in the pathogenic process through either their timing of expression or requirement for microbial replication. Such data indicate that these genes execute important, general biological functions. The interpretation of existing and emerging microbial genomics data will require the scientific community to uncover new paradigms hidden within these sequences. This point is highlighted by a recent publication by Hutchison et al. focused on building a minimal genome of Mycoplasma mycoides (1). The 473 genes in this genome include many required for known essential functions, such as DNA replication, transcription, and translation, but the genome also contains 149 genes (~31%) of unknown function. Likewise, the known coding capacity of DNA viruses has been doubled over the past few years, an achievement made possible by using novel high-throughput sequencing and proteomic methods. The discovery of these genes of unknown function not only promises to affect our understanding of microbial pathogenesis but also provides an undiscovered wealth of new therapeutic targets for antibiotic, antiviral, and vaccine development for improved global and economic health.
Unlike large-scale genome-sequencing or structural-genomics efforts, the functional annotation of uncharacterized genes is not well developed technologically, and therefore, the scientific community cannot rely on a well-defined, mature set of experimental approaches. Simple deletion or overexpression of an uncharacterized gene often fails to yield any discernible phenotype in standard laboratory contexts. In the same vein, a purely biochemical approach that relies on purification and in vitro biochemical characterization of uncharacterized genes often fails to yield fruitful functional data. Furthermore, while sequence-based bioinformatic analyses and computational models may provide clues about functional genetic interactions, such predictions are often limited by our current biological knowledge and databases. As a consequence, functional annotation is incorrect in many cases because unvalidated information is propagated across species. In cases where gene function is experimentally assigned and validated, the information is often not broadly propagated or may not apply to another organism. Therefore, a successful functional annotation endeavor requires a multipronged approach that involves open collaboration among scientists with expertise in genetics, bioinformatics, molecular biology, biochemistry, cell physiology, host-microbe interactions, and data management. Such a program is outside the scope of traditional funding mech- anisms and requires integrated teams of experts and new highthroughput experimental methodologies. The NIAID has implemented a program aimed at assigning functions to open reading frames (ORFs) and small noncoding RNAs (ncRNAs) that have been discovered by large-scale sequencing efforts to begin addressing this gap in our understanding of bacterial and viral gene function. This timely initiative from the NIAID is highly significant because it is the institute's first attempt to incorporate functional annotation into its genomics and advanced technologies program, which is focused on developing genomics, proteomics, and bioinformatics resources to advance our understanding of infectious and immune-mediated diseases. The NIAID Functional Genomics (FG) program builds on previous functional annotation efforts, such as COMBREX (2), and brings together multidisciplinary groups to work on shared goals. Its mission is to probe new biology in pathogenic bacteria and viruses by assigning functions to new genes. The FG program is unique in its approach in that it enables principal investigators to follow up on "omics" and phenotypic screening data with targeted experiments to establish gene function. Most pathogens are evolving constantly, and such an integrated approach will provide the database to allow us to work toward predicting new emerging infectious diseases and respond to new pandemics.
Projects across the NIAID FG program are driven by the idea that the assignment of gene function must be rooted in experimentation in vivo and in vitro. The challenge to the experimentalist is to develop systematic approaches to gene function discovery that have a reasonable chance of success over a large number of genes. FG Centers are taking a variety of approaches to uncover the functions of protein-coding genes and small ncRNAs. Some approaches, which perhaps will have the highest yield, require at least some prediction of function, while others presuppose no specific functional knowledge. Current state-of-the-art technologies, such as genome synthesis, next-generation nucleic acid sequencing, ribosomal profiling, high-resolution mass spectrometry, metabolomics, and molecular and systems modeling, enable high-throughput approaches to characterize genes of unknown function. Though these technologies are clearly valuable, they are most effectively leveraged when combined with functional assays, such as measurements of microbial permeability and antimicrobial susceptibility or assays of host response to infection, which have been scaled up through recent developments in robotics, whole-genome synthesis, and high-content cellular imaging. In this commentary, we provide an overview of our coordinated efforts that are aimed at filling the gaps in our understanding of microbial gene function. These efforts are restricted to a few organisms and technologies but could be attempted on a much larger scale. Such functional analyses could provide the scientific basis for rational approaches to the development of new therapeutic products to combat future and current gaps in the treatment of bacterial and viral infection.

IN VITRO AND BIOINFORMATIC APPROACHES
Improving the quality of genome maps. Most protein-coding genes are annotated by either their similarity to known genes or generalized guidelines for identifying translational start sites. Such predictions are often incorrect. For example, we have demonstrated that a protein highly homologous to a secreted bacterial amidase is actually an intracellular regulator of peptidoglycan synthesis (3). Moreover, predictive algorithms to define the locations of non-protein-coding genes (e.g., small ncRNAs, out-of-frame ORFs, etc.) are generally poor. Defining the boundaries of genes of unknown function is an important component of experimental functional annotation. We are using combinations of directed biochemical, computational, and next-generation-sequencing approaches to produce improved maps of gene boundaries in bacterial and viral pathogens. These efforts enable improved functional genetic and biochemical experiments.
Leveraging existing protein structure data. Molecular structure data can provide tremendous insight into the biochemical functions of proteins. A high-quality structural model enables one to map conserved residues on the molecule and to develop hypotheses regarding active-site chemistry, ligand binding sites, protein docking sites, etc. We are taking advantage of the large amount of structural data available in the PDB to build homologybased protein models. In addition, we are using hidden Markov Model (HMM)-based approaches to build predictive structural models in cases where homology in the PDB is low. Experimental structures and high-confidence structural models generated by the FG program are available to the community and have been used to develop and test specific functional hypotheses in vitro and in vivo and to define protein function (4).
High-throughput protein production for functional biochemical analysis. Structural models of proteins of unknown function are informing functional biochemical hypotheses that are being tested in vitro. Specifically, we are leveraging existing high-throughput structural genomics infrastructure to produce expression clones of proteins of unknown function. This approach is yielding milligram quantities of many targets of unknown function, which are being assayed for a range of biochemical activities and are being used to produce antibodies for cellular studies. One specific use for these purified proteins is in activitybased metabolomic profiling (5), an unbiased approach to finding substrates and products for putative enzymes.
Defining gene function by biochemical association. A wellestablished approach to define function is to test for physical associations of proteins and RNAs with other proteins or transcripts of known function. We are employing such biochemical association strategies. For protein-coding genes, we are measuring protein-protein associations using quantitative proteomic methods. For example, using a proteomics approach, we reported that Kaposi's sarcoma-associated herpesvirus (KSHV) viral interferon regulatory factor 1 (vIRF1) can bind the cellular interferonstimulated gene 15 (ISG15) E3 ligase, HERC5. Interaction of vIRF1 with HERC5 inhibits the conjugation of ISG15 to cellular proteins, thereby dampening the IFN response to the virus (6). Using a functional genomics screen, coupled with synthetic genome design, we demonstrated that the S glycoprotein genes of several severe acute respiratory syndrome coronavirus (SARS-CoV)-like bat coronaviruses can bind human receptors for entry and replicate efficiently in primary human airway epithelial cells and that they are resistant to existing SARS vaccines and immunotherapeutics (7,8). Our early results suggest that an efficient way to discover function using this approach is to target protein complexes of known function and define associated proteins of unknown function. These efforts have required us to develop new bioinformatics methods to understand the complex data produced by these experiments.
Prediction of ncRNA targets. ncRNAs represent a particular challenge in bacterial and virus research, as they can vary tremen-Perspective dously among different bacterial species and viruses and little is known about their functions. We have generated catalogues of the ncRNAs for the organisms (for example, in mycobacteria) (9) that we study and are applying large-scale methods to identify their targets. These methods start with bioinformatics, though the predictive algorithms are not yet particularly robust. In addition, we are applying cross-linking-, ligation-, and sequencing-based approaches to systematically link small ncRNAs of unknown function to their transcript targets. Thus, while many of our approaches to investigating ncRNAs are still at the developmental stage, defining catalogues of ncRNAs provides the community with a road map of targets for downstream studies and analyses.
We are also using next-generation-sequencing technologies, such as transcriptome sequencing (RNAseq) and selective 2=hydroxyl acylation analyzed by primer extension (SHAPE) analysis of RNA, to investigate RNA structure-function correlations. Paired with these analyses, we are investigating the RNA transcriptome to correlate conserved and unique RNA structure elements with pathogen virulence factors, identify previously uncharacterized and/or rare translation initiation sites, and associate protein structure elements with virus biology, pathogenesis, and host range.

CELLULAR AND IN VIVO APPROACHES
Genetic approaches: identifying phenotypes. We are using multiple strategies to identify phenotypes that are associated with the deletion or overexpression of target genes of unknown function. For bacterial pathogens, we have initially focused on genes that, when disrupted, produce measurable growth phenotypes or alterations in cellular barrier function under specified conditions, including antibiotic treatment. These assays are facilitated by wholegenome-mutant defined mutant libraries that allow more efficient high-throughput screening for specific phenotypes (10).
We are assaying the growth of mutant strains in axenic culture and in cell and animal infection models. The Biolog screening platform provides a high-throughput approach to phenotype identification. We are growing individual deletion strains in parallel with wild-type control strains under~2,000 defined conditions. Differences between strains identify specific medium conditions under which a particular gene is required for wild-type growth and have provided clues for target gene function (11). In a particular case, a functional genetic study of mutants harboring deletions of genes of unknown function is informing the development of a new live attenuated Brucella abortus vaccine strain (12). In addition, we are applying genome-wide screening approaches to find genes that interact genetically to modify phenotypes.
In our studies of viral pathogens, we have developed a highthroughput, multiarmed screening program to investigate both RNA and DNA virus-encoded host evasion functions. To accomplish this, we are applying modular cloning technology to easily shuttle candidate genes of unknown function into replicon expression and/or lentiviral vectors for alternate applications, such as the expression of toxic or otherwise challenging proteins and antibody generation. Using these screening assays, we are defining viral counterdefense mechanisms mediated by known, novel, and noncanonical viral ORFs and ncRNAs, whose products modulate the host innate immune response and cellular defense machinery to the virus's advantage. Screening assays include assessments of beta interferon (IFN-␤) antagonism, NF-B, Toll-like receptor, and apoptosis modulation, inflammasome signaling, cGAS-STING and p53 pathway modulation, cellular localization, global protein synthesis, and mTOR inactivation. For example, we have reported multiple herpesviral proteins that modulate the cGAS-STING DNA sensing pathway and Middle East respiratory syndrome coronavirus (MERS-CoV) and closely related bat MERSlike virus phosphodiesterase proteins that antagonize RNase L activation during infection (13)(14)(15). In parallel, we also seek to identify viral entry proteins that program efficient infections across multiple species (7,8). Our synthetic approach allows us to rapidly test hypothetical proteins-proteins whose expression has not yet been verified in the context of infection. Genes of interest are then deleted or mutated using reverse genetic platforms, and virus mutant growth is examined both in primary human targets, such as airway epithelial cells and various immune cells, and during infection.

Synergies across the four centers funded by the FG program.
Although the four centers funded by the FG program have different specific goals, we share approaches and are experiencing common challenges, which we are working to solve together. Shared solutions drive research forward. For example, each center includes a small RNA (sRNA) discovery project. There is no generally agreed upon approach to elucidating sRNA gene function. It has been useful to compare experiences and potential solutions across centers. sRNA discovery projects as a group are converging on approaches that may work generally in bacteria. In addition, the data management groups at the four centers share common priorities for public dissemination of data and were able as a group to begin the process of defining new priorities for updating capabilities for the PATRIC and VIPR BRC public resources. An example of this collaboration is the specification of the mode and format for transfer of transposon-sequencing (Tn-Seq) experimental data and Biolog phenotyping data to PATRIC, along with the computational tools to analyze these data (16). This effort will standardize the process of data transfer from independent centers and will define the format for public display of these and other data sets generated by FG Centers.
Another interaction is occurring with the Seattle Structural Genomics Center for Infectious Diseases and the Center for Structural Genomics of Infectious Diseases. Using validated overexpression platforms and novel uncharacterized and/or hypothetical ORFs, the collaboration is designed to determine the structures of high-priority candidate proteins that antagonize host antimicrobial defense pathways in the host. It is likely in the future that the data we generate will be used by the systems biology centers to further create more complex models of host-pathogen interactions.

CONCLUSIONS
As described above, we have used a multipronged investigation strategy to directly evaluate unknown and hypothetical genes from a diverse array of pathogens to characterize the biological functions encoded by these genes (Table 1). A particular strength of this approach is that this systematic workflow, which can be adapted to all pathogens with sequenced genomes, can ensure rational, directed, and rapid response times for vaccine and therapeutic design in answer to emerging and reemerging epidemics. The work of this program is defining a future blueprint to perform functional analysis of new pathogens as they emerge and to more rapidly respond to the need for knowledge of emerging organisms.