The CRISPR spacer space is dominated by sequences from the species-specific mobilome

CRISPR-Cas is the prokaryotic adaptive immunity system that stores memory of past encounters with foreign DNA in spacers that are inserted between direct repeats in CRISPR arrays 1,2. For only a small fraction of the spacers are homologous sequences, termed protospacers, detectable in viral, plasmid or microbial genomes 3,4. The rest of the spacers remain the CRISPR “dark matter”. We performed a comprehensive analysis of the spacers from all CRISPR-cas loci identified in bacterial and archaeal genomes, and found that, depending on the CRISPR-Cas subtype and the prokaryotic phylum, protospacers were detectable for 1 to about 19% of the spacers (∼7% global average). Among the detected protospacers, the majority, typically 80 to 90%, originate from viral genomes, and among the rest, the most common sources are genes that are integrated in microbial chromosomes but are involved in plasmid conjugation or replication. Thus, almost all spacers with identifiable protospacers target mobile genetic elements (MGE). The GC-content, as well as the dinucleotide and tetranucleotide compositions, of microbial genomes, their spacer complements, and the cognate viral genomes show a nearly perfect correlation and are almost identical. Given the near absence of self-targeting spacers, these findings are most compatible with the possibility that the spacers, including the dark matter, are derived almost completely from the species-specific microbial mobilomes.

One of the burning open questions in the CRISPR area is the origin of the bulk of the spacers. For a small fraction of the spacers, protospacers have been reported, often in viral and plasmid genomes, but the overwhelming majority of the spacers remain without a match 3,4,11-15. To gain insight into the origin of this "dark matter", we performed comprehensive searches of the current genomic and metagenomic sequence databases using all identifiable spacer sequences from complete bacterial and archaeal genomes as queries. To this end, a computational pipeline was developed that identified all CRISPR arrays from complete and partial bacterial and archaeal [...] These searches yielded 2,981 spacer matches (protospacers) in viral sequences and 23,385 matches in prokaryotic sequences. We then examined the provenance of the detected protospacers across the diversity of the CRISPR-Cas systems and the prokaryotic phyla. In general agreement with previous analyses, which, however, were performed on much smaller genomic data sets, protospacers were identified for ~7% of the spacers, with the fractions for different CRISPR-Cas subtypes ranging from 1 to 19% (Table 1). The fraction of detected protospacers was typically higher for type I and II CRISPR-Cas systems, in which it spans the entire range, than for type III, where this fraction was uniformly low, at 1 to 2% (Table 1). A similar range was detected for the fraction of spacers with matches across the bacterial and archaeal phyla (Table 2), but substantial deviations from the global average of ~7% are notable in several phyla. In particular, anomalously high fractions of spacers with matches were detected in Spirochaetia, Fusobacteria and γ-Proteobacteria.
In sharp contrast, the CRISPR arrays in archaea, especially hyperthermophiles, had a low fraction of matching spacers, with none at all detected in Thermococci and Thermoplasmata; furthermore, the only phylum of hyperthermophilic bacteria for which a large number of CRISPR arrays was identified also had only 1% of matching spacers (Table 2). A multiple regression analysis shows that both the assignment to a CRISPR subtype and the classification into an archaeal or bacterial phylum make substantial and largely independent contributions to the variation of the fraction of spacers with detectable matches; jointly, the two factors explain about 75% of the variance of that fraction (see Supplementary text 1). The paucity of spacer matches in hyperthermophiles is puzzling because all these organisms possess CRISPR-cas loci (as opposed to only a minority among mesophiles) 16, with the implication that CRISPR activity is essential for the survival of these organisms. The lack of recognizable spacers could be due to under-sampling of the respective viromes and/or to preferential utilization of partially matching spacers by the CRISPR-Cas systems of thermophiles. Generally, the aspects of the biology of different groups of prokaryotes that might determine the activity of the CRISPR-Cas systems, and hence the fraction of spacers with matches, remain to be explored.

The CRISPR-Cas spacers have been demonstrated to insert in a polarized fashion, mostly at the beginning of arrays, adjacent to the leader sequence (although in some cases, internal insertion has been observed as well), resulting in unidirectional growth of the array that, however, [...]

Where do the ~93% of the spacers that comprise the dark matter of CRISPR arrays come from?

In an attempt to gain insight into the origin of these spacers, we compared the nucleotide [...]. Given the wide range of GC-content covered, from ~20 to ~70%, and the nearly indistinguishable features of the three sets of sequences, these observations strongly suggest that they all come from a single, intermixing, species-specific sequence pool.
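For illustration, a compositional comparison of this kind can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the pipeline used in this work: the function names (`gc_content`, `kmer_frequencies`, `pearson`) and the toy sequences are hypothetical, and only unambiguous A/C/G/T bases are counted.

```python
from collections import Counter
from itertools import product
from math import sqrt


def gc_content(seq):
    """Fraction of G+C among unambiguous (A/C/G/T) bases."""
    counts = Counter(seq.upper())
    acgt = sum(counts[b] for b in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (counts["G"] + counts["C"]) / acgt


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k k-mers, in lexicographic order.

    k=2 gives dinucleotide and k=4 tetranucleotide composition.
    K-mers containing ambiguous bases are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence contains no unambiguous k-mers")
    return [counts[m] / total for m in kmers]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant vector")
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    # Toy sequences (hypothetical): a "host" fragment and its concatenated spacers.
    host = "ATGCGCGCTTAGGCCGATCGATCGGCTA"
    spacers = "GCGCTTAGGCCGATCGGC"
    print(gc_content(host), gc_content(spacers))
    print(pearson(kmer_frequencies(host, 2), kmer_frequencies(spacers, 2)))
```

In practice, the same three functions would be applied to each genome, to its concatenated spacer complement, and to the cognate viral sequences, and the resulting values compared across species.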

Bacteriophage genomes are generally considered to have a lower GC-content than the host genomes, such that prophages form AT-rich genomic islands 23, which seems to be at odds with the near perfect correlation we observed. To investigate this discrepancy, we compared the GC-content of phage and host genomes for several bacteria for which numerous phages have been characterized; all available phage genomes were included in this analysis, regardless of whether corresponding spacers were detected. In most cases, there was indeed a considerable AT-bias in phages, but numerous phage genomes had the same composition as the host and spacers (Figure 4). Conceivably, the spacers come from the most abundant phages that match the hosts in GC-content.

We further investigated the provenance of the dark matter spacers using an alternative approach. [...]

In the present dissection of the CRISPR (proto)spacer space, we made two principal observations. First, the spacers with detectable protospacer matches that persist in CRISPR arrays originate (almost) exclusively from genomes of mobile elements, mostly viruses, but also plasmids. This is not an unexpected finding, being compatible with multiple previous [...]

CRISPR-Cas types and subtypes were assigned to CRISPR arrays using previously described procedures 16,33. All ORFs within 10 kb upstream and downstream of an array were annotated using RPS-BLAST 34 with 30,953 protein profiles (from the COG, pfam, and cd collections) from the NCBI CDD database 35 and 217 custom CRISPR-Cas protein profiles 33. In cases of multiple CRISPR-Cas systems present in an examined locus, the annotation of the first detected variant was used to annotate the array.

Given the frequent misidentification of CRISPR arrays (Supplementary text 3), a filtering procedure for "orphan" CRISPR arrays (i.e., arrays that are not associated with cas genes) was applied.
A set of repeats from CRISPR arrays identified within typical CRISPR-cas loci was collected, and these were assumed to represent bona fide CRISPR (positive set). A BLASTN 36 search was performed for all repeats from orphan CRISPR arrays against the positive set, and BLAST hits were collected that showed at least 90% identity and 90% coverage with repeats from the positive set. All arrays that did not produce such hits against the positive set were discarded. The resulting 42,352 CRISPR arrays were used for further analysis. [...]

The results obtained with this classification procedure were compared to those obtained with PhiSpy 38, a commonly used prophage finder tool (default parameters), for the protospacer matches identified in the 4,961 completely assembled genomes. Of the 1,240 spacer matches in complete genomes, 999 hits were identified as (pro)virus-targeting by the ad hoc procedure described above. Using PhiSpy, 902 spacers were mapped to proviruses, of which 819 overlapped with the set of 999 viral matches detected by the ad hoc method, indicating high consistency of the predictions of the two approaches.
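The degree of agreement between the two sets of predictions can be made explicit with a short calculation. The following Python sketch uses only the counts reported above (999, 902 and 819); the helper name `overlap_stats` and the choice of summary measures (fraction of each set shared, and the Jaccard index) are illustrative assumptions, not statistics reported in this work. It also assumes that the ad hoc hits and the PhiSpy-mapped spacers are counted in the same units.

```python
def overlap_stats(n_a, n_b, n_shared):
    """Summarize the agreement between two prediction sets from their sizes.

    n_a, n_b  -- sizes of the two sets
    n_shared  -- number of predictions present in both sets
    """
    if n_shared > min(n_a, n_b):
        raise ValueError("shared count cannot exceed either set size")
    union = n_a + n_b - n_shared
    return {
        "shared_fraction_of_a": n_shared / n_a,
        "shared_fraction_of_b": n_shared / n_b,
        "union": union,
        "jaccard": n_shared / union,
    }


# Counts taken from the text: ad hoc procedure (a) versus PhiSpy (b).
stats = overlap_stats(n_a=999, n_b=902, n_shared=819)

if __name__ == "__main__":
    for name, value in stats.items():
        print(name, round(value, 3))
```

With these counts, about 82% of the ad hoc viral matches are also found by PhiSpy, and about 91% of the PhiSpy matches are also found by the ad hoc procedure.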

Detection of Protospacers
The distribution of protospacers across CRISPR-Cas types and subtypes was obtained from the unique spacer set. In cases when a unique spacer was identified in CRISPR arrays from different subtypes, only one instance was counted. The same procedure was applied to estimate the distribution of protospacers among the bacterial and archaeal phyla.

Figure 5B
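The counting rule for unique spacers described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions: the record layout `(spacer_sequence, group, has_protospacer)` and the function name are hypothetical, and the sketch assigns a spacer shared between groups to the first group in which it is encountered, a tie-breaking rule that the text does not specify.

```python
from collections import defaultdict


def protospacer_fraction_by_group(records):
    """Fraction of unique spacers with a protospacer match, per group.

    records -- iterable of (spacer_sequence, group, has_protospacer),
               where group is a CRISPR-Cas subtype or a phylum.
    A spacer sequence seen in several groups is counted only once,
    in the first group where it appears (assumption of this sketch).
    """
    seen = set()
    totals = defaultdict(int)
    matched = defaultdict(int)
    for spacer, group, has_protospacer in records:
        if spacer in seen:
            continue  # count each unique spacer only once
        seen.add(spacer)
        totals[group] += 1
        if has_protospacer:
            matched[group] += 1
    return {group: matched[group] / totals[group] for group in totals}


# Toy records (hypothetical sequences and assignments).
toy_records = [
    ("ACGTACGT", "I-E", True),
    ("ACGTACGT", "I-F", True),   # same spacer in another subtype: skipped
    ("TTGGCCAA", "I-E", False),
    ("GGGGCCCC", "III-A", False),
]
fractions = protospacer_fraction_by_group(toy_records)
```

The same function applies unchanged when `group` holds the phylum rather than the subtype.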