In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. They are a straightforward modification to the earliest heuristic-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search.,[1] alignment,[2] assembly,[3] and metagenomics.[4] They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.
Due to a number of functional and evolutionary constraints, nucleic acid sequences between individuals tend to be highly conserved, with the typical difference between two human genomes estimated on the order of 0.6% (or around 20 million base pairs).[5] Identification of highly similar regions in the genome may indicate functional importance, as mutations in these areas that would result in cessation of function or loss of regulatory ability would be evolutionary unfavorable. More observed differences between two sequences may arise as a result of stochastic sequencing errors. Similarly, when performing assembly of a previously characterized genome, an attempt is made to align the newly sequenced DNA fragments to the existing genome sequence.
In both cases, it is useful to be able to directly compare nucleic acid sequences. Since the sequences are not expected to be exactly identical, however, it is beneficial to focus on smaller subsequences that are more likely to be locally identical. Spaced seeds allow for even more permissive local matching by allowing certain base pairs (defined by the pattern of the specific spaced seed) to mismatch without penalty, thus allowing algorithms that use the general "hit-extend" strategy of alignment to explore additional potential matches that would be otherwise ignored.
As a simple example, consider the following DNA sequences:
CTAAGTCACGUpon visual inspection, it's easy to see that there is a mismatch between the two sequences at the fifth and six base positions (in bold, above). However, the sequences still share 80% sequence similarity. The mismatches may be due to a real (biological) change or a sequencing error. In a non-spaced model, this putative match would be ignored if a seed size greater than 4 is specified. But a spaced seed of
1111001111
The concept of spaced seeds has been explored in literature under different names. One of the early uses was in sequence homology where the FLASH algorithm from 1993 referred to it as "non-contiguous sub-sequences of tokens" that were generated from all combinations of positions within a sequence window.[6] In 1995, a similar concept was used in approximate string matching where "gapped tuples" of positions in a sequence were explored to identify common substrings between a large text and a query.[7] The term "shape" was used in a 2001 paper to describe gapped q-grams where it refers to a set of relevant positions in a substring[8] and soon after in 2002, PatternHunter introduced "spaced model" which was proposed as an improvement upon the consecutive seeds used in BLAST, which was ultimately adopted by newer versions of it. Finally, in 2003, PatternHunter II settled on the term "spaced seed" to refer to the approach used in PatternHunter[9]
Popular alignment algorithms such as BLAST and MegaBLAST use a non-spaced model, where the entire length of the seed is made of exact matches.[10] Thus, any mismatching base pair along the length of the seed will result in the program ignoring the potential hit. In a spaced model (such as PatternHunter), the matches are not necessarily consecutive.
More formally, the difference in a spaced seed model as compared to a non-spaced model is the relative positioning (otherwise known as weight,
k
M
k
11111
1101101
The predicted number of hits can be calculated from PatternHunter using the following lemma:
(L-M+1)*pk
Where
L
M
p
k
The type of seed model used for sequence alignment can affect the processing time and memory usage when doing large-scale homology searches – two considerations that have been central in the development of modern homology search algorithms. It may also affect the sensitivity. Using spaced seed models has been demonstrated to allow for faster homology searches as seen with PatternHunter wherein homology searches were twenty times faster and used less memory than BLASTn (using a non-spaced model).
Most aligners first find candidate locations (or seeds) in the target sequence and then inspect those more closely to verify the alignment.[11] [2] [12] [13] Ideally, this first step would find all relevant locations in the target so sensitivity is prioritized but due to computational intensity, many popular algorithms (such as the earlier implementations of BLAST and FASTA) use heuristics to "short-cut" exploring all locations, ultimately missing many but running relatively quickly. One possible way to increase sensitivity, as done in the SHRiMP2[2] algorithm, is to use spaced seeds to allow for small differences between the query and the candidate locations so that somewhat more locations are identified as candidates. SHRiMP2 specifically uses multiple spaced seeds for this and requires multiple matches, increasing sensitivity as it allows different possible combinations of differences while maintaining speed comparable to original methods.
A variation of spaced seeds with a single contiguous gap has been used in de novo sequence assembly.[3] In this instance, the design has an equal number of ones at either end of the sequence with a run of zeroes in between. The reasoning behind this design is that in assemblers that utilize De Bruijn graphs, increasing k-mer size inflates memory usage, as k-mers are more likely to be unique. However, the most important parts of a k-mer are its ends, as they are what are used to extend sequences in a graph. Thus, to circumvent the problem with memory usage, the less-important middle part (covered by the gap) is ignored. This approach has the additional advantage, as in other uses of spaced seeds, of taking into account any sequencing errors that may have occurred in the gap area. It has been noted[3] that increasing the length of the gap also increases the uniqueness of k-mers in both E. coli and H. sapiens genomes.
A metagenomics study will commonly start with the high-throughput sequencing of a mixture of distinct species (e.g. from the human gut), yielding a set of sequences but with unknown origins. As such, one common goal is to identify which genome each sequence is phylogenetically most similar to. One approach could be to take k-mers from each sequence and see which, in a set of genomes, it has most sequence similarity with. Spaced seeds have been successfully utilized for this purpose by finding how many k-mers are found in each genome (hit number) and the total number of positions these k-mers cover (coverage).
An improvement to (single) spaced seeds was first demonstrated by PatternHunter in 2002 where a set of spaced seeds were used, in which a "hit" was called whenever one of the set matched. PatternHunter II, in 2003, demonstrated that this approach could offer higher sensitivity than BLAST while maintaining similar speed. Identifying an optimal set of spaced seeds is an NP-hard problem, however, even finding a "good" set of spaced seeds remains difficult, although several attempts have been made to computationally identify them.[14] [15] [16] Since the speed of the algorithm must decrease with an increasing number of spaced seeds, it makes sense to only consider multiple seeds when all offer some useful contribution. There is ongoing research on how to quickly calculate good multiple spaced seeds, as previous homology search software calculated and hard-coded their seeds – it would be advantageous to be able to calculate purpose-driven multiple spaced seeds instead.