Observation

Protein Domains of Unknown Function Are Essential in Bacteria

Norman F. Goodacre, Dietlind L. Gerloff, Peter Uetz
Claire M. Fraser, Editor
Norman F. Goodacre
aGeorgetown University, Washington, DC, USA
Dietlind L. Gerloff
bFoundation for Applied Molecular Evolution, Gainesville, Florida, USA
Peter Uetz
cCenter for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, USA
Claire M. Fraser
University of Maryland, School of Medicine
Roles: Editor
DOI: 10.1128/mBio.00744-13

ABSTRACT

More than 20% of all protein domains are currently annotated as “domains of unknown function” (DUFs). About 2,700 DUFs are found in bacteria compared with just over 1,500 in eukaryotes. Over 800 DUFs are shared between bacteria and eukaryotes, and about 300 of these are also present in archaea. A total of 2,786 bacterial Pfam domains even occur in animals, including 320 DUFs. Evolutionary conservation suggests that many of these DUFs are important. Here we show that 355 essential proteins in 16 model bacterial species contain 238 DUFs, most of which represent single-domain proteins, clearly establishing the biological essentiality of DUFs. We suggest that experimental research should focus on conserved and essential DUFs (eDUFs) for functional analysis given their important function and wide taxonomic distribution, including bacterial pathogens.

IMPORTANCE The functional units of proteins are domains. Typically, each domain has a distinct structure and function. Genomes encode thousands of domains, and many of the domains have no known function (domains of unknown function [DUFs]). They are often ignored as of little relevance, given that many of them are found in only a few genomes. Here we show that many DUFs are essential DUFs (eDUFs) based on their presence in essential proteins. We also show that eDUFs are often essential even if they are found in relatively few genomes. However, in general, more common DUFs are more often essential than rare DUFs.

Observation

Most proteins are built of one or several domains that serve as the key mediators for their function(s). Given the ease of sequence acquisition today, the classic definition of a domain as an independently folding, and largely independent, tertiary structural unit is often replaced by a sequence-based “domain” concept, outside structural biology (1, 2). Segmenting proteins based on homology alone (3) is powerful because it does not require a representative with a known structure, and the initial predictions are largely automatable. Over time, structure determination can refine the domain boundaries. However, a large proportion of protein functional insights today are derived experimentally before three-dimensional (3D) structural information becomes available.

A variety of sequence-based domain collections exists; however, there is substantial overlap among databases (3). InterPro, which integrates Pfam as well as other sequence signatures, covers a large proportion of the protein sequences in the UniProt database and offers a good initial understanding of domain diversity (see Table S1A in the supplemental material). The Pfam database (if one includes Pfam B which contains automatically generated domain annotations) currently lists about 15,000 protein families (4). For example, the genome of Escherichia coli K-12 (MG1655) encodes 5,475 recognizable domains that are classified into 2,407 families in Pfam 26.0. The most highly represented domain in E. coli, the ATP-binding domain of the ABC transporter family (Pfam accession no. PF00005), is detected in 78 proteins with a total of 95 copies in the K-12 strain.

Table S1

(A) Primary resources used in this analysis. (B) InterPro statistics for all domains and DUFs across all kingdoms of life. (C) DUFs ranked by conservation in sequenced bacterial genomes. (D) Total protein counts for common DUFs in selected taxonomic groups. (E) Top 20 DUFs with experimentally determined 3D structures. (F) List of essential DUFs (eDUFs). (G) Top 50 DUFs in proteins with predicted function in six sample organisms. (H) Number of domains/DUFs in selected pathogens and model organisms. Table S1, XLS file, 0.8 MB.
Copyright © 2013 Goodacre et al.

This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 Unported license, which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original author and source are credited.

Sequence-based domain assignment requires detectable homology between several protein fragments. However, very few proteins or domains are universally conserved across all species. In 2010 (Pfam release 23.0), only 16% of all characterized domains were found in all kingdoms of life (but not necessarily in all species) (5). The number of recurrent domains by the sequence-based definition is about 3 orders of magnitude smaller than the number of species (thousands versus millions). A majority of these recurrent domains can be presumed to correspond to independently folding fragments that are more likely tractable in the laboratory than full-length proteins, especially in medium- or high-throughput experiments (6).

Domain assignments have become an effective starting point for studying and understanding molecular biology across the bewildering multitude of species. However, despite decades of research, more than 20% of all domains in the Pfam database, the ~3,600 so-called domains of unknown function (DUFs) (4), remain poorly understood (5). Pfam’s DUF families are composed entirely of functionally uncharacterized protein fragments at the time the curators assign them. New information about individual members may emerge before an assignment is next reconsidered. However, in most instances, DUFs are in need of further study before they can be as informative as other Pfam domains. Taxonomically, about 9% of the DUFs in Pfam release 23.0 spanned all domains of life (Bacteria, Archaea, and Eukarya), while nearly half (43%) had been detected only in bacteria. Another 19% were only found in eukaryotes, and 3% were restricted to archaea (5).

The importance of prioritizing DUFs has been recognized in various experimental and/or computational characterization efforts (4, 5, 7). Bateman et al. (5) discussed DUFs from a structural perspective without providing specific information or prioritization for experimental study. In contrast, Dessailly et al. (8) prioritized the most phylogenetically common domains for crystallization but did not focus on DUFs in their approach. While many conserved domains have been preferred targets in previous studies (7), there has been no global attempt to provide a priority list for bacterial proteins. Related projects such as CALIPHO (Computer and Laboratory Investigation of Proteins of Human Origin) (9) focus on the approximately 5,000 human proteins with unknown function. However, highly conserved proteins may yield insights into the biology of many processes and species. Here we examine DUFs from a microbiological perspective and focus on the prospects of targeting DUFs found in bacteria. Not only is sequence information from culturable and unculturable bacterial isolates increasing faster than for other taxonomic groups, but bacterial proteins are also more tractable by high-throughput experiments in the laboratory, not the least because of their availability as complete clone sets (10). The bacterial kingdom also makes a substantive contribution to human infectious disease burden and death (11), which calls for a better understanding of the protein complement of pathogenic species. Here we attempted to identify DUFs that should be rewarding targets for experimental analysis in bacteria, and bacterial pathogens in particular. We identified DUFs that are not only highly conserved but that are essential in at least one species. Many of these uncharacterized bacterial domains are also found in eukaryotes, hence experimental analysis of these prokaryotic representatives should also shed light on the biology of higher life forms.

Results. For the domain survey presented here, we focused on Pfam even though we have used several other databases (see Table S1A in the supplemental material). The phylogenetic diversity of domains was investigated using the NCBI taxonomy, iTOL, and the PATRIC database (see Methods for details).

(i) Phylogenetic diversity of DUFs. Domains of unknown function occur in large numbers in all kingdoms of life, ranging from just over 1,500 in eukaryotes to 2,704 in bacteria (see Table S1B and Fig. S1 in the supplemental material). However, DUFs represent a much greater proportion of domains in bacteria than they do in other kingdoms, with about one-third of all detected domains being DUFs. This is surprising, as a large number of DUFs are shared between bacteria and other domains of life (Fig. S1). There are nearly 900 DUFs in common between bacteria and eukaryotes. In fact, more than 300 DUFs are found in all three kingdoms of life (Fig. S1). It is noteworthy that over three times as many DUFs have been defined in bacteria as in plants, the kingdom with the next-highest DUF count, although the difference may be explained largely by the different numbers of completely sequenced genomes from the two kingdoms. According to InterPro (36.0) (12), 2,786 bacterial domains are also present in animals, including 320 DUFs (Table S1B).

Figure S1

Comparison of domain repertoires of the 4 superkingdoms of life. (Left) Annotated domains. (Right) Domains of unknown function. Not all overlaps shown as numbers. Download Figure S1, PDF file, 0.1 MB.

Among bacterial phyla, we observe a trend for larger phyla to have proportionally more DUFs, reflecting their larger genetic diversity (see Fig. S2A in the supplemental material). For example, 31% of proteobacterial domains are DUFs, while this fraction is only 25% for Actinobacteria and 21% for Spirochaetes. For these three phyla, the total numbers of domains annotated in Pfam 26.0 are 6,203, 4,029, and 2,966, respectively. This trend is evident despite the fact that the discovery of new domains and DUFs inevitably tapers off as more strains in a phylum are sequenced (Fig. S2B). It is also curious that DUFs tend to occur in relatively larger proteins in eukaryotes but relatively smaller proteins in bacteria (Fig. S3). The size distribution of DUF-containing proteins in bacteria appears skewed away from larger proteins compared to other bacterial proteins (Fig. S3, top left); therefore, this observation is not merely a reflection of generally different protein lengths in eukaryotes and bacteria. Moreover, the majority of eukaryotic proteins contain numbers of annotated domains that are comparable to those found in bacterial proteins (Fig. S3, right).

Figure S2

Phylogeny of DUFs. (A) Phylogenetic tree with the 16 most species-rich bacterial phyla and representatives from the 5 other kingdoms, denoted by color. Pie charts for each phylum/kingdom show the total number of domains (size of pie) and the relative fraction of DUFs/other domains (red and blue). The tree is from iTOL (27). (B) Domain and DUF counts versus phylum size (number of sequenced genomes). Download Figure S2, PDF file, 0.1 MB.

Figure S3

Protein length in bacteria and eukaryotes. Proteins from 1,540 bacterial and 290 eukaryotic reference proteomes (UniProt September 2012) were classified as DUF-containing or non-DUF-containing proteins. Download Figure S3, PDF file, 0.1 MB.

We have compiled 3,427 DUFs in Table S1C in the supplemental material, ranked by the number of fully sequenced bacterial genomes in which they are present. The first 24 DUFs are present in 500 or more species and usually in both eukaryotes and prokaryotes, as well as distributed over the great majority of bacterial phyla. While these protein domains are less common across archaea or fungi, most of them are present in more than 20% of all genomes. Distribution of the top 50 DUFs across taxa is rather variable, often as high as 80 to 90%, but occasionally as low as 15% of bacterial families are represented (where representation is defined as at least one genome in the family possessing the particular DUF). For example, the top-ranked DUF, DUF933, is present in 1,000 species represented by 1,495 completely sequenced genomes (Table S1C and S1D). In contrast, DUF177, ranked 5th, is missing in archaea and fungi and present in eukaryotes in only a few instances. Nevertheless, the domain is present in most bacteria, including 206 bacterial families and 859 completely sequenced bacterial species.

(ii) Structural representation of DUFs. Currently, structures of about 5,000 (36%) of the nearly 15,000 Pfam domains have been characterized, including 379 (10.5%) of the ~3,600 Pfam DUFs. A table of the top 20 most common DUFs (ranked by the number of sequenced bacterial genomes) for which a structure has been deposited in the Protein Data Bank (PDB) (13) is provided in Table S1E in the supplemental material.

(iii) Many DUFs are essential. Across the 16 bacterial species represented in the Database of Essential Genes (DEG) (14), more than 10,000 essential genes have been identified (including redundancies). We found 355 of these proteins to contain at least one of 238 different DUFs (see Table S1F in the supplemental material). While 83 of those proteins contain multiple domains, the remainder appear to contain only the DUF. This clearly establishes these DUFs as essential DUFs (eDUFs). All model organisms that have been analyzed this way contain eDUFs (Fig. 1). Although the total number of domains for these model organisms has slightly decreased over the past 5 years (Pfam v23 versus v26), the numbers of DUFs and eDUFs have markedly increased (from 282 and 77 to 359 and 89, respectively, in E. coli [data not shown]). We explain the substantial increase in DUF numbers by the dramatic increase in available genome sequences, which allowed new domains to be recognized by Pfam’s comparative approach. Interestingly, we found three domains that occur in both essential multidomain proteins and essential single-domain proteins, so these domains are likely to be essential in the multidomain configuration as well. DUF31 (Pfam accession no. PF01732) is a predicted peptidase domain that is also found in two essential Mycoplasma proteins together with another peptidase domain (Pfam accession no. PF00949). Similarly, DUF59 (Pfam accession no. PF01883) is found in a series of proteins of various functions in Mycobacterium and Caulobacter, but usually in combination with PF10609, an ATPase-like domain. Finally, DUF161 (Pfam accession no. PF02588) is found as an essential single-domain protein and as an essential multidomain protein in combination with DUF2179, another DUF (Pfam accession no. PF10035).

FIG 1 

Essential domains of unknown function (eDUFs) are common among bacteria. The table shows species for which essential genes have been determined. All numbers were derived using the reference proteome of either the DEG strain or a common (fully sequenced) strain. Different strains may have different numbers. Domains are all Pfam domains that are not DUFs, while eDUFs are a subset of DUFs. Many essential genes encode DUFs as their only domain. This table is based on Pfam v26 (2012). For a complete list of eDUFs, see Table S1F in the supplemental material.

Interestingly, there does not seem to be a strong correlation between phylogenetic conservation and essentiality (Fig. 2). While highly conserved DUFs are more likely to be essential, poorly conserved DUFs (as measured by the number of genomes they are found in) are still essential in many cases. Our data set contains essential proteins that contain both known and unknown domains, but surprisingly, the majority of essential proteins containing DUFs contains only the eDUF (see Table S1F in the supplemental material).

FIG 2 

Many essential domains of unknown function (eDUFs) are not highly conserved. Although eDUFs tend to be better conserved (as measured by the number of genomes they are encoded in), the correlation is weak. Even poorly conserved DUFs are often essential. The linear fit was performed using simple linear regression. The figure uses data from DEG version 8.5.

(iv) Functional clues from attributes of DUF-containing proteins. This study is not primarily concerned with protein function prediction but rather relies on existing database annotations. However, we tried to obtain rough estimates of what functions might be associated with specific DUFs, using a simple subtractive protocol on transferred annotations (see Methods). Briefly, we collected potential functional attributes for the DUF-containing proteins found in 10 model bacteria (1,786 of all 3,601 known DUFs). Then, we derived preliminary, speculative clues for 31 of the top 50 bacterial DUFs as a starting point for experimental research by comparing full-length protein annotations in UniProt with domain-specific curated annotations in the Pfam2GO list (15) for all known domains in each protein. Additionally, we extended our view by considering STRING database (version 9.0) (16) predictions. As expected for functionally uncharacterized families, most predicted attributes remain relatively general and include “functions” such as ATP binding or relate to rather broad biological processes such as transcription. The potential attributes of nine DUFs indicate an integral membrane subcellular location, which may partly explain why the functions of these domains have remained unknown, given the difficulty of studying membrane proteins. Many of the top 50 bacterial DUFs also have functional associations that point to metabolic pathways. Since deeper functional predictions are beyond the scope of this paper, we refrain from a more detailed discussion of the clues shown in Table S1G in the supplemental material and refer the reader to more specialized studies (17–20).

(v) Domains in bacterial pathogens and model organisms. Interestingly, all of the top 50 DUFs (by the number of sequenced bacterial genomes) are found in at least one functionally annotated protein in 13 model organisms, 10 bacterial and 3 eukaryotic (see Table S1H in the supplemental material). In 41 cases, a DUF is found in an annotated protein in more than one of these organisms. Seventeen of the top 50 DUFs occur in 32 proteins as the only identified domains, i.e., these proteins consist entirely of DUFs (data not shown). We speculate that these proteins may well be some of the most interesting targets for future research in this field.

We have compiled domain and DUF counts for 13 model organisms (including Homo sapiens) and important pathogens for which complete open reading frame (ORF) clone sets are available (see Table S1H in the supplemental material). Studies of these few selected organisms will allow researchers to extrapolate functional data to a large number of other genomes and organisms. We have also included Homo sapiens as the target of those pathogens and as a model for a higher eukaryote. As stated above, all species encode dozens, and more frequently, hundreds of DUFs awaiting functional characterization.

Discussion and conclusion. Independently of our study, the Protein Structure Initiative (PSI) has pointed out (8) that many of the domains currently without known function may be widespread in the tree of life, even if found predominantly in bacteria. This survey supports this view, as many of the most prevalent DUFs in bacteria are also found in animals, plants, and other taxa. Thus, studying bacterial DUFs is important for understanding not only microbiology but also molecular biology in general.

Many of the widespread DUFs must have important functions, even if they are not essential in standard mutant screens. For instance, DUF143, one of the most common DUFs, occurring in both bacteria and eukaryotes (but not in archaea), has been placed in the top 10 list of “unknown” proteins by Galperin and Koonin (21). Its deletion in E. coli showed no obvious phenotype (22). However, we recently showed that this protein is essential when cells are starved (7), a situation that is not commonly used in mutant screens in the laboratory. In fact, this function is probably conserved in all bacteria, although its role may be different in eukaryotes (where it is localized to mitochondria) (23, 24).

The functional analysis of DUFs will require concerted efforts, including crystallization, protein interaction screens, phenotyping of mutants, and more-specific functional assays. General predictions should also allow us to determine the experimental direction required to find the precise function of DUFs. For instance, DUFs predicted to be enzymes can be screened for potential substrates or activities while protein interaction domains need to be screened for interaction partners. We hope that our ranking list of DUFs will help the scientific community to find the most interesting, most important, and taxonomically most widespread DUFs to be identified and analyzed.

Methods. (i) Data sources. Domain, protein, and phylogenetic information for all kingdoms of life was obtained from the databases listed in Table S1A and Fig. S4A in the supplemental material. We specifically focused on the 1,540 bacterial, 290 eukaryotic, and 120 archaeal organisms with completely sequenced genomes represented in UniProt (version 2012_06) (25). Domains named DUFxxx (where xxx is the number of the DUF) or containing “unknown function” in the name were collected from Pfam (version 26.0) and make up the list of DUFs considered in this study. NCBI taxonomic identifiers associated with DUFs versus non-DUFs were obtained from UniProt. Identifiers for strains and species were mapped to higher taxa, particularly phyla and kingdoms, for analysis and visualization (Fig. S4B). Essentiality information was obtained from the Database of Essential Genes version 8.5 (14).
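The name-based selection of DUFs described above can be illustrated with a minimal sketch. It is not the script used in the study: the regular expression, the function name is_duf, and the record layout are assumptions, and the descriptions in the toy records are placeholders rather than real Pfam text. The accessions and DUF names are those mentioned in this article; the short name ABC_tran for PF00005 is assumed.

```python
import re

# Assumed pattern: "DUF" followed by its number (e.g. DUF933), or the
# phrase "unknown function" anywhere in the family name or description.
DUF_PATTERN = re.compile(r"DUF\d+|unknown function", re.IGNORECASE)


def is_duf(name, description=""):
    """Return True if a family looks like a DUF by its name or description."""
    return bool(DUF_PATTERN.search(f"{name} {description}"))


# Toy records: (accession, name, description). Descriptions are
# illustrative placeholders, not text taken from Pfam.
families = [
    ("PF00005", "ABC_tran", "ABC transporter ATP-binding domain"),
    ("PF01732", "DUF31", "placeholder description"),
    ("PF01883", "DUF59", "placeholder description"),
    ("PF02588", "DUF161", "placeholder description"),
    ("PF10035", "DUF2179", "placeholder description"),
]

# Keep only the accessions whose name or description marks them as DUFs.
duf_list = [acc for acc, name, desc in families if is_duf(name, desc)]
print(duf_list)
```

A text match of this kind is only as good as the naming convention: a family that has been renamed after characterization drops out of the list, while a characterized family whose description still mentions "unknown function" stays in.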

Figure S4

Survey of bacterial DUFs. (A) Protein databases used in this study and data integration process. All proteins in UniProt (Swiss-Prot and TrEMBL combined) belonging to organisms with completely sequenced genomes were extracted. These proteins were then marked as either a DUF-containing or non-DUF-containing protein, using annotation from Pfam and InterPro. UniProt taxonomic identifiers were mapped by using the NCBI database. All known bacterial pathogens were then extracted from the PATRIC database. Results of phylogenetic analyses of DUFs were mapped onto the tree of life (27). Functional predictions for DUFs were conducted, using UniProt and STRING, as well as Gene Ontology functional annotation. (B) Overview of methodology. Initially, all DUFs are extracted from Pfam searching for “DUF” and “unknown function” in the description. These DUFs are then separated into nonbacterial, bacterial, and overlapping categories. For the latter two categories, domain membership is broken down by phylum (bottom two cylinders) and also family (not shown). Download Figure S4, PDF file, 0.1 MB.

(ii) Phylogenetic analysis. DUF and all-domain lists were generated for all kingdoms and phyla. Phylogenetic membership for each protein was defined by strain-specific taxonomic identifiers assigned in UniProt. DUFs/domains found in proteins belonging to a particular (sequenced) bacterial strain were said to be present in the phylum/kingdom containing the strain. Strain-to-phylum mapping was performed according to the NCBI hierarchy (a summary sheet for this hierarchy can be found on the NCBI taxonomy site ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/). Domain and DUF representation among 1,123 pathogenic bacterial strains recorded in PATRIC (26) was also calculated. This was achieved by adding a filtering step in the script described above, by which only proteins belonging to these PATRIC strains were used to count domains/DUFs. The results were then ranked by prevalence among sequenced bacterial genomes. Subsequent analyses focused on the top 50 DUFs according to this ranking. Representation by total genome count, total bacterial pathogen count, total protein count, structure (PDB), and protein length was measured. A local version of the UniProt database, consisting of both Swiss-Prot and TrEMBL (UniProt releases from 3 October 2012 and 25 January 2012, respectively), was used. A bacterial pathogen was defined as a member of the 1,123 PATRIC bacterial strains that were linked to at least one disease (May 2012 release). For the PDB analysis, both Pfam-A.full and Pfam-A.seed of the Pfam database version 26.0 were used. Finally, data relating to domain and DUF counts for bacterial phyla were mapped onto pie chart data types on the iTOL website (27) using a definition file with one representative organism (selected somewhat arbitrarily) per phylum.
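The counting and ranking step above might be sketched as follows. This is an illustration under stated assumptions, not the original script: the function name rank_domains_by_genomes, the record layout, and the optional allowed_strains filter (standing in for the PATRIC filtering step) are hypothetical, and the strain identifiers 1 to 3 and their phylum assignments are toy values.

```python
from collections import defaultdict


def rank_domains_by_genomes(protein_records, strain_to_phylum, allowed_strains=None):
    """Rank domains by the number of distinct genomes (strains) encoding them.

    protein_records: iterable of (strain_id, list_of_domain_names), one per protein.
    strain_to_phylum: dict mapping strain_id to a phylum name.
    allowed_strains: optional set of strain ids; other strains are ignored.
    """
    genomes = defaultdict(set)  # domain -> strains in which it occurs
    phyla = defaultdict(set)    # domain -> phyla in which it occurs
    for strain_id, domains in protein_records:
        if allowed_strains is not None and strain_id not in allowed_strains:
            continue
        for domain in domains:
            genomes[domain].add(strain_id)
            if strain_id in strain_to_phylum:
                phyla[domain].add(strain_to_phylum[strain_id])
    # Most widespread first; ties are broken alphabetically for stable output.
    order = sorted(genomes, key=lambda d: (-len(genomes[d]), d))
    return [(d, len(genomes[d]), sorted(phyla[d])) for d in order]


# Toy input: strain ids and phylum assignments are made up for illustration.
records = [
    (1, ["DUF933", "DUF177"]),
    (1, ["DUF933"]),
    (2, ["DUF933"]),
    (3, ["DUF177"]),
    (3, ["DUF933"]),
]
strain_to_phylum = {1: "Proteobacteria", 2: "Actinobacteria", 3: "Proteobacteria"}

ranking = rank_domains_by_genomes(records, strain_to_phylum)
pathogen_ranking = rank_domains_by_genomes(records, strain_to_phylum, allowed_strains={2, 3})
print(ranking)
print(pathogen_ranking)
```

Counting distinct strains rather than proteins keeps a domain with many paralogs in one genome from outranking a domain that is present once in many genomes.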

(iii) Functional clues. For a sample of 13 model organisms (10 bacterial and 3 eukaryotic organisms), any proteins containing one or more of the top 50 DUFs (ranked again as described above) in UniProt with functional annotation were collected. For the same proteins, the functions of partner proteins recorded in the STRING database (version 9.0) (16), a resource of experimentally or highly confidently predicted interactions, were used as a second (indirect) source of functional attributes that might possibly be associated with these DUF-containing proteins. Only STRING partners with a score of at least 700 (of the maximum 1,000) were considered. Our specifically DUF-focused analyses used proteins from only the 10 bacterial model organisms. All Gene Ontology (GO) terms accompanying each of the (full-length) proteins in UniProt were collected; we then removed GO terms associated with any non-DUF domains according to the (largely manually curated) Pfam2GO mapping on the Gene Ontology Consortium website (http://www.geneontology.org/external2go/pfam2go) (28). Any remaining GO terms were considered to be functional clues for the DUF(s) in the protein. The coverage of the Pfam2GO file was limited (~4,000 domains or ~25% of Pfam). Therefore, to avoid ambiguity of GO term assignment, no inferences were drawn from proteins with non-DUF domains not in the mapping file. This inference protocol for DUF-associated GO terms is illustrated in Fig. S5 in the supplemental material.
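The subtractive step can be expressed compactly in code. The sketch below is one plausible reading of the protocol, not the authors' implementation: the function name duf_go_clues, the dictionary standing in for the Pfam2GO mapping, and the domain names are hypothetical, and GO terms are written as plain labels rather than GO identifiers.

```python
def duf_go_clues(protein_go_terms, domains, pfam2go, is_duf):
    """Return GO terms left over after subtracting those explained by non-DUF domains.

    Returns None when a non-DUF domain is missing from the mapping, because
    the leftover terms could not then be attributed to the DUF unambiguously.
    """
    non_duf = [d for d in domains if not is_duf(d)]
    if any(d not in pfam2go for d in non_duf):
        return None
    explained = set().union(*(pfam2go[d] for d in non_duf))
    return set(protein_go_terms) - explained


def starts_with_duf(domain):
    # Simplified stand-in for the DUF name test used in this toy example.
    return domain.startswith("DUF")


# Toy mapping and proteins; names and annotations are illustrative only.
pfam2go = {"ABC_tran": {"ATP binding"}}

clues = duf_go_clues({"ATP binding", "membrane"}, ["ABC_tran", "DUF_X"],
                     pfam2go, starts_with_duf)
skipped = duf_go_clues({"transcription"}, ["Unmapped_dom", "DUF_X"],
                       pfam2go, starts_with_duf)
single = duf_go_clues({"membrane"}, ["DUF_X"], pfam2go, starts_with_duf)
print(clues, skipped, single)
```

In the first toy protein the term "ATP binding" is attributed to the mapped domain, so only "membrane" remains as a clue for the DUF. The second protein is skipped because its non-DUF domain has no mapping entry.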

Figure S5

Mapping GO terms to DUFs. Strategies were applied to collect functional attributes that could be associated with particular domains of unknown function (DUFs) using UniProt and Pfam2GO annotation (A) or STRING (B). Download Figure S5, PDF file, 0.1 MB.

For the STRING-based contribution to the analysis, GO terms were collected from STRING version 9.0 for predicted functional partners of all proteins containing a DUF; no removal of non-DUF-specific GO terms was performed. GO terms found in at least 50% of all functional partners of all proteins with a particular DUF were included as hypothetical functions for that DUF, if they were not too general. To avoid overly general GO term functions (e.g., “molecular function” or “binding”), only GO terms at a depth greater than 3 in the GO hierarchy were included. This functional inference method is illustrated in Fig. S5 in the supplemental material.
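As a rough sketch of the STRING-based filter, the snippet below pools the partners of all proteins carrying one DUF, applies the score cutoff of 700, and keeps GO terms that reach the 50% frequency threshold and lie deeper than level 3. Several details are assumptions made for illustration: the function name `string_go_clues`, counting each partner once even if it interacts with several DUF-containing proteins, and a precomputed `go_depth` lookup (the text does not say how depth was measured in the GO graph).

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def string_go_clues(
    partners: Iterable[Tuple[str, int]],
    partner_go: Dict[str, Set[str]],
    go_depth: Dict[str, int],
    min_score: int = 700,
    min_fraction: float = 0.5,
    min_depth: int = 3,
) -> Set[str]:
    """Hypothetical functions for one DUF from its pooled STRING partners."""
    # Keep each sufficiently confident partner once (assumption: deduplicated).
    kept = {pid for pid, score in partners if score >= min_score}
    if not kept:
        return set()
    counts = Counter(term for pid in kept for term in partner_go.get(pid, set()))
    return {
        term
        for term, n in counts.items()
        if n / len(kept) >= min_fraction      # found in at least 50% of partners
        and go_depth.get(term, 0) > min_depth  # drop overly general terms
    }


# Toy data (assumptions): partner IDs, scores, GO labels, and depths.
partners = [("p1", 900), ("p2", 750), ("p3", 400), ("p4", 800)]
partner_go = {
    "p1": {"GO:translation", "GO:binding"},
    "p2": {"GO:translation"},
    "p3": {"GO:transport"},
    "p4": {"GO:transport", "GO:binding"},
}
go_depth = {"GO:translation": 5, "GO:binding": 1, "GO:transport": 6}

print(string_go_clues(partners, partner_go, go_depth))  # {'GO:translation'}
```

In this toy case "GO:binding" is frequent enough but too shallow, and "GO:transport" is deep enough but found in only one of the three retained partners, so only "GO:translation" survives.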

(iv) Essentiality analysis: eDUFs. The Database of Essential Genes (DEG) version 8.5 (last updated July 2013) was used to define essential proteins. Entrez GI numbers from DEG were mapped to UniProt accession numbers. UniProt was also used to provide a list of domains/DUFs for each DEG protein. Pfam annotation from a recent Pfam release (v26; September 2012) as well as an earlier release (v23; July 2008) was used to investigate how the numbers of essential DUFs change over time. The 355 DEG proteins with DUFs were analyzed to define essential DUFs. This combinatorial analysis was carried out using the following definitions for cases of essential and nonessential domains. Essential domains were defined using three cases: single-domain essential proteins, unique domains in multiple essential proteins (e.g., cases of the form A-B-C and C-D-E, where C is the inferred essential domain), and by comparison with nonessential proteins of similar domain membership (i.e., cases of the form A-B-C essential, where A and B are nonessential proteins). Nonessential domains were defined as those that are not present in any essential proteins (case 1) or those in essential proteins only when all other domains are essential (case 2). Because defining nonessential domains helps define essential domains by removing potentially essential domains from each protein’s domain composition, these 5 cases were identified iteratively until no further essential domains could be found. Finally, the DUFs among the essential domains were labeled eDUFs (see Table S1F in the supplemental material).
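The iterative procedure can be sketched as a fixed-point loop. The code below is one possible reading of the five cases, not the implementation used in the study. It assumes that proteins are given as sets of domain names, that a domain forming a nonessential protein on its own counts as nonessential, and that a "unique" domain is one that is the only domain shared by a pair of essential proteins. The function name `infer_essential_domains` and all data are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def infer_essential_domains(
    essential_proteins: Iterable[Iterable[str]],
    nonessential_proteins: Iterable[Iterable[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (essential domains, nonessential domains) at the fixed point."""
    ess: List[frozenset] = [frozenset(p) for p in essential_proteins]
    non: List[frozenset] = [frozenset(p) for p in nonessential_proteins]
    in_essential = set().union(*ess)
    essential: Set[str] = set()
    # Nonessential case 1: domains never seen in an essential protein.
    nonessential = set().union(*non) - in_essential
    # Assumption: a domain that alone forms a nonessential protein is nonessential.
    nonessential |= {d for p in non if len(p) == 1 for d in p}

    changed = True
    while changed:
        before = (len(essential), len(nonessential))
        # Strip known nonessential domains from each essential protein.
        reduced = [p - nonessential for p in ess]
        # Essential cases 1 and 3: exactly one candidate domain remains.
        for r in reduced:
            if len(r) == 1:
                essential |= r
        # Essential case 2: the only domain shared by two essential proteins.
        for a, b in combinations(reduced, 2):
            shared = a & b
            if len(shared) == 1:
                essential |= shared
        # Nonessential case 2: every co-occurring domain is already essential.
        for d in in_essential - essential - nonessential:
            others = [p - {d} for p in ess if d in p]
            if all(rest and rest <= essential for rest in others):
                nonessential.add(d)
        changed = (len(essential), len(nonessential)) != before
    return essential, nonessential


# Toy domain architectures (assumptions); "C" and "G" play the role of DUFs.
essential_set, nonessential_set = infer_essential_domains(
    [{"A", "B", "C"}, {"C", "D", "E"}, {"F"}, {"F", "G"}, {"H", "I"}],
    [{"H"}, {"J"}],
)
edufs = essential_set & {"C", "G"}
print(sorted(essential_set), sorted(nonessential_set), sorted(edufs))
```

Both sets only grow, so the loop stops once a full pass adds nothing. Domains such as A, B, D, and E in the toy data stay unclassified because none of the cases applies to them.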

ACKNOWLEDGMENTS

We acknowledge Ivica Letunic of EMBL Heidelberg for his assistance with iTOL.

P.U. conceived the study. N.G. and D.L.G. carried out the bioinformatics analysis. P.U., N.G., and D.L.G. wrote the manuscript.

FOOTNOTES

    • Received 4 September 2013
    • Accepted 21 November 2013
    • Published 31 December 2013

REFERENCES

  1. Kessel A, Ben-Tal N. 2011. Introduction to proteins. CRC Press, Boca Raton, FL.
  2. Sigrist CJ, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, Hulo N. 2010. PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res. 38:D161–D166.
  3. Mulder NJ, Kersey P, Pruess M, Apweiler R. 2008. In silico characterization of proteins: UniProt, InterPro and Integr8. Mol. Biotechnol. 38:165–177.
  4. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer EL, Eddy SR, Bateman A, Finn RD. 2012. The Pfam protein families database. Nucleic Acids Res. 40:D290–D301.
  5. Bateman A, Coggill P, Finn RD. 2010. DUFs: families in search of function. Acta Crystallogr. Sect. F Struct. Biol. Cryst. Commun. 66:1148–1152.
  6. Littler E. 2010. Combinatorial domain hunting: solving problems in protein expression. Drug Discov. Today 15:461–467.
  7. Häuser R, Pech M, Kijek J, Yamamoto H, Titz B, Naeve F, Tovchigrechko A, Yamamoto K, Szaflarski W, Takeuchi N, Stellberger T, Diefenbacher ME, Nierhaus KH, Uetz P. 2012. RsfA (YbeB) proteins are conserved ribosomal silencing factors. PLoS Genet. 8:e1002815. doi:10.1371/journal.pgen.1002815.
  8. Dessailly BH, Nair R, Jaroszewski L, Fajardo JE, Kouranov A, Lee D, Fiser A, Godzik A, Rost B, Orengo C. 2009. PSI-2: structural genomics to cover protein domain family space. Structure 17:869–881.
  9. Lane L, Argoud-Puy G, Britan A, Cusin I, Duek PD, Evalet O, Gateau A, Gaudet P, Gleizes A, Masselot A, Zwahlen C, Bairoch A. 2012. neXtProt: a knowledge platform for human proteins. Nucleic Acids Res. 40:D76–D83.
  10. Rajagopala SV, Yamamoto N, Zweifel AE, Nakamichi T, Huang HK, Mendez-Rios JD, Franca-Koh J, Boorgula MP, Fujita K, Suzuki K, Hu JC, Wanner BL, Mori H, Uetz P. 2010. The Escherichia coli K-12 ORFeome: a resource for comparative molecular microbiology. BMC Genomics 11:470. doi:10.1186/1471-2164-11-470.
  11. Fonkwo PN. 2008. Pricing infectious disease. The economic and health implications of infectious diseases. EMBO Rep. 9(Suppl 1):S13–S17.
  12. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, de Castro E, Coggill P, Corbett M, Das U, Daugherty L, Duquenne L, Finn RD, Fraser M, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, McMenamin C, Mi H, Mutowo-Muellenet P, Mulder N, Natale D, Orengo C, Pesseat S, Punta M, Quinn AF, Rivoire C, Sangrador-Vegas A, Selengut JD, Sigrist CJ, Scheremetjew M, Tate J, Thimmajanarthanan M, Thomas PD, Wu CH, Yeats C, Yong SY. 2012. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 40:D306–D312.
  13. Joosten RP, te Beek TA, Krieger E, Hekkelman ML, Hooft RW, Schneider R, Sander C, Vriend G. 2011. A series of PDB related databases for everyday needs. Nucleic Acids Res. 39:D411–D419.
  14. Zhang R, Lin Y. 2009. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 37:D455–D458.
  15. Schlicker A, Huthmacher C, Ramírez F, Lengauer T, Albrecht M. 2007. Functional evaluation of domain–domain interactions and human protein interaction networks. Bioinformatics 23:859–865.
  16. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C. 2011. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 39:D561–D568.
  17. Lopez D, Pazos F. 2009. Gene ontology functional annotations at the structural domain level. Proteins Struct. Funct. Bioinformatics 76:598–607.
  18. Fang H, Gough J. 2013. A domain-centric solution to functional genomics via dcGO Predictor. BMC Bioinformatics 14(Suppl 3):S9. doi:10.1186/1471-2105-14-S3-S9.
  19. Burge S, Kelly E, Lonsdale D, Mutowo-Muellenet P, McAnulla C, Mitchell A, Sangrador-Vegas A, Yong S-Y, Mulder N, Hunter S. 2012. Manual GO annotation of predictive protein signatures: the InterPro approach to GO curation. Database 2012:bar068.
  20. de Lima Morais DA, Fang H, Rackham OJ, Wilson D, Pethica R, Chothia C, Gough J. 2011. SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res. 39:D427–D434.
  21. Galperin MY, Koonin EV. 2010. From complete genome sequence to “complete” understanding? Trends Biotechnol. 28:398–406.
  22. Baba T, Ara T, Hasegawa M, Takai Y, Okumura Y, Baba M, Datsenko KA, Tomita M, Wanner BL, Mori H. 2006. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol. Syst. Biol. 2:2006.0008. doi:10.1038/msb4100050.
  23. Rorbach J, Gammage PA, Minczuk M. 2012. C7orf30 is necessary for biogenesis of the large subunit of the mitochondrial ribosome. Nucleic Acids Res. 40:4097–4109.
  24. Wanschers BF, Szklarczyk R, Pajak A, van den Brand MA, Gloerich J, Rodenburg RJ, Lightowlers RN, Nijtmans LG, Huynen MA. 2012. C7orf30 specifically associates with the large subunit of the mitochondrial ribosome and is involved in translation. Nucleic Acids Res. 40:4040–4051.
  25. UniProt Consortium. 2012. Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 40:D71–D75. doi:10.1093/nar/gkr981.
  26. Gillespie JJ, Wattam AR, Cammer SA, Gabbard JL, Shukla MP, Dalay O, Driscoll T, Hix D, Mane SP, Mao C, Nordberg EK, Scott M, Schulman JR, Snyder EE, Sullivan DE, Wang C, Warren A, Williams KP, Xue T, Yoo HS, Zhang C, Zhang Y, Will R, Kenyon RW, Sobral BW. 2011. PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species. Infect. Immun. 79:4286–4298.
  27. Letunic I, Bork P. 2007. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23:127–128.
  28. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25:25–29.
Protein Domains of Unknown Function Are Essential in Bacteria
Norman F. Goodacre, Dietlind L. Gerloff, Peter Uetz
mBio Dec 2013, 5 (1) e00744-13; DOI: 10.1128/mBio.00744-13


Online ISSN: 2150-7511