seqtrie

Background

seqtrie is a collection of radix tree algorithms for finding similar strings, with a focus on biological sequence data. Here, “similar” usually means close under Hamming distance, Levenshtein distance, or semi-global alignment.

A radix tree, also called a compressed trie, is an efficient data structure for searching large sets of sequences. It stores shared prefixes once, so a search can skip whole branches instead of checking every sequence one by one.

For bounded-distance searches, this can be sub-linear in the number of stored sequences because the tree can prune whole branches.

Basic usage

One common job is a similarity join: find every pair of sequences within a fixed distance. The example below has 133,033 TCR beta CDR3 DNA sequences (Nolan et al. 2020).

data(covid_cdr3)
results <- dist_search(covid_cdr3, max_distance = 3,
                       nthreads = 8, tree_class = "StarTree")

Trie-based algorithms are a good fit for this kind of search. The tree structure lets seqtrie rule out large groups of targets without checking every sequence against every other sequence.

Tree classes

seqtrie has three tree classes.

dist_search() is a convenience wrapper around these classes. It can search for similarity between query and target, or within query when target is NULL.

Below is a benchmark applying dist_search to the covid_cdr3 dataset.

Understanding Radix Trees

Here is a small radix tree. We insert a few strings, erase one, and plot the result.

The useful part is the shared prefix. cargo, cart, carburetor, and carbuncle all start with car, so the tree stores car once. After that, it only stores the pieces where the strings split.

tree <- radix_tree()
insert(tree, c("cargo", "cart", "carburetor", "carbuncle", "bar", "zebra"))
erase(tree, "zebra")
# plot_tree requires igraph and ggplot2
set.seed(1); plot_tree(tree)

Hamming distance is similar to Levenshtein distance, but it does not allow insertions or deletions. Sequences must be the same length. Because of this restriction, Hamming distance is generally a lot faster.

Use max_fraction to set the distance threshold relative to query length. For example, max_fraction = 0.06 keeps matches within roughly 6% Hamming distance.

results <- align_search(tree, covid_cdr3, max_fraction = 0.035,
                        mode = "hamming", nthreads = 2)
results <- align_search(tree, covid_cdr3, max_fraction = 0.06,
                        mode = "hamming", nthreads = 2)
results <- align_search(tree, covid_cdr3, max_fraction = 0.15,
                        mode = "hamming", nthreads = 2)

Anchored alignment searches

An anchored alignment is a form of semi-global alignment. The query is anchored to the beginning of both the query and target sequences, but the end of either sequence can be unaligned.

tree <- radix_tree()
insert(tree, "CARTON")
insert(tree, "CAR")
insert(tree, "CARBON")
align_search(tree, "CART", max_distance = 0, mode = "anchored")
##   query target distance query_size target_size
## 1  CART    CAR        0          3           3
## 2  CART CARTON        0          4           4

With max_distance = 0, the query "CART" finds "CAR" and "CARTON" but not "CARBON". Anchored search also returns where the alignment ends in the query and target. This is useful when reads have variable length but start from the same genomic position or primer site.

Custom distance searches and affine gap alignment

seqtrie supports custom substitution costs and affine gap penalties.

tree <- radix_tree()
insert(tree, covid_cdr3)

# Define a custom substitution matrix. Use generate_cost_matrix for convenience.
cost_mat <- generate_cost_matrix("ACGT", match = 0, mismatch = 5)
print(cost_mat)
##   A C G T
## A 0 5 5 5
## C 5 0 5 5
## G 5 5 0 5
## T 5 5 5 0
# Set gap penalties via parameters (not in the matrix):
# - Linear gaps: set gap_cost only
# - Affine gaps: set both gap_cost and gap_open_cost

# Linear example
results_linear <- align_search(tree, covid_cdr3, max_distance = 8,
                               mode = "global",
                               cost_matrix = cost_mat,
                               gap_cost = 2,
                               nthreads = 2)

# Affine example
results_affine <- align_search(tree, covid_cdr3, max_distance = 8,
                               mode = "global",
                               cost_matrix = cost_mat,
                               gap_cost = 2,
                               gap_open_cost = 5,
                               nthreads = 2)

results_linear[results_linear$query != results_linear$target, , drop = FALSE]
##                                              query
## 130           TGTGCCAGCAGTGGGGGGAACACTGAAGCTTTCTTT
## 159  TGTGCCAGCAGCTCTAGCGGGGTGAGCACAGATACGCAGTATTTT
## 165  TGTGCCAGCAGCTCTACAGGGGTGAGCACAGATACGCAGTATTTT
## 203  TGTGCCAGTAGTCTCGGACAGGGCCTCTCCTACGAGCAGTACTTC
## 204  TGTGCCAGTAGTCTCGGACAGGGCCTCTCCTACGAGCAGTACTTC
## 258        TGTGCCAGCAGTTATTCGAACACCGGGGAGCTGTTTTTT
## 286  TGTGCCAGTAGTGTCGGACAGGGCCTCTCCTACGAGCAGTACTTC
## 287  TGTGCCAGTAGTGTCGGACAGGGCCTCTCCTACGAGCAGTACTTC
## 292        TGTGCCAGCAGCTCGGGACTGTCCTACGAGCAGTACTTC
## 293        TGTGCCAGCAGCTCGGGACTGTCCTACGAGCAGTACTTC
## 313  TGTGCCAGTAGTATGGGACAGGGGCTATCCTACGAGCAGTACTTC
## 329  TGTGCCAGTAGTATTGGACAGGGGACATCCTACGAGCAGTACTTC
## 330  TGTGCCAGTAGTATTGGACAGGGGACATCCTACGAGCAGTACTTC
## 342           TGTGCCAGCAGCTTCGCTAGCAATGAGCAGTTCTTC
## 345        TGTGCCAGCAGCTACAGGGGCTCCTACGAGCAGTACTTC
## 428           TGTGCCAGCAGCCCTTCGGGAAATGAGCAGTTCTTC
## 542           TGTGCCAGCAGCTTCGGGTACAATGAGCAGTTCTTC
## 544           TGTGCCAGCAGCTTCGGGTACAATGAGCAGTTCTTC
## 560           TGTGCCAGCAGCCCTACGGGGGGTGAAGCTTTCTTT
## 567        TGTGCCAGCAGTTACCGGGGAAGCACTGAAGCTTTCTTT
## 615           TGTGCCAGCAGTTTATACGGAGAGACCCAGTACTTC
## 635        TGTGCCAGTCAGCTTATCAACACCGGGGAGCTGTTTTTT
## 659           TGTGCCAGCAGCTCCTTTACCTACGAGCAGTACTTC
## 667  TGTGCCAGTAGTATTGGACAGGGACTCTCCTACGAGCAGTACTTC
## 687           TGTGCCAGCAGTCCGGGGAACACTGAAGCTTTCTTT
## 688           TGTGCCAGCAGTCCGGGGAACACTGAAGCTTTCTTT
## 728  TGTGCCAGCGGGACTAGCGGGGGGGCCCTAGATACGCAGTATTTT
## 760           TGTGCCAGCAGTTTTAGGGGCTACGAGCAGTACTTC
## 782           TGTGCCAGCAGTCTTAGTGGAGAGACCCAGTACTTC
## 794           TGTGCCAGCAGTTTTGCGGGTTACGAGCAGTACTTC
## 818           TGTGCCAGCAGCTTATATACCTACGAGCAGTACTTC
## 919           TGTGCCAGCAGCTCGGATTCCTACGAGCAGTACTTC
## 929        TGTGCCAGCAGCTCGGGGACCTCCTACGAGCAGTACTTC
## 930        TGTGCCAGCAGCTCGGGGACCTCCTACGAGCAGTACTTC
## 971  TGTGCCAGTAGTCTCGGGACAGGGTTCTCCTACGAGCAGTACTTC
## 1019 TGTGCCAGTAGTAAGGGACAGGGCCTCTCCTACGAGCAGTACTTC
## 1031 TGTGCCAGCAGGGCTAGCGGGGGGGCTCCAGATACGCAGTATTTT
## 1035             TGTGCCAGCAGCCTCGGGGGTGAAGCTTTCTTT
##                                             target distance
## 130           TGTGCCAGCAGTCCGGGGAACACTGAAGCTTTCTTT        8
## 159  TGTGCCAGCAGCTCTACAGGGGTGAGCACAGATACGCAGTATTTT        4
## 165  TGTGCCAGCAGCTCTAGCGGGGTGAGCACAGATACGCAGTATTTT        4
## 203  TGTGCCAGTAGTCTCGGGACAGGGTTCTCCTACGAGCAGTACTTC        8
## 204  TGTGCCAGTAGTGTCGGACAGGGCCTCTCCTACGAGCAGTACTTC        4
## 258        TGTGCCAGTCAGCTTATCAACACCGGGGAGCTGTTTTTT        8
## 286  TGTGCCAGTAGTAAGGGACAGGGCCTCTCCTACGAGCAGTACTTC        8
## 287  TGTGCCAGTAGTCTCGGACAGGGCCTCTCCTACGAGCAGTACTTC        4
## 292        TGTGCCAGCAGCTCGGGGACCTCCTACGAGCAGTACTTC        8
## 293           TGTGCCAGCAGCTCGGATTCCTACGAGCAGTACTTC        6
## 313  TGTGCCAGTAGTATTGGACAGGGGACATCCTACGAGCAGTACTTC        8
## 329  TGTGCCAGTAGTATTGGACAGGGACTCTCCTACGAGCAGTACTTC        8
## 330  TGTGCCAGTAGTATGGGACAGGGGCTATCCTACGAGCAGTACTTC        8
## 342           TGTGCCAGCAGCTTCGGGTACAATGAGCAGTTCTTC        8
## 345        TGTGCCAGCAGCTCGGGGACCTCCTACGAGCAGTACTTC        8
## 428           TGTGCCAGCAGCTTCGGGTACAATGAGCAGTTCTTC        8
## 542           TGTGCCAGCAGCCCTTCGGGAAATGAGCAGTTCTTC        8
## 544           TGTGCCAGCAGCTTCGCTAGCAATGAGCAGTTCTTC        8
## 560              TGTGCCAGCAGCCTCGGGGGTGAAGCTTTCTTT        6
## 567           TGTGCCAGCAGTCCGGGGAACACTGAAGCTTTCTTT        6
## 615           TGTGCCAGCAGTCTTAGTGGAGAGACCCAGTACTTC        8
## 635        TGTGCCAGCAGTTATTCGAACACCGGGGAGCTGTTTTTT        8
## 659           TGTGCCAGCAGCTTATATACCTACGAGCAGTACTTC        8
## 667  TGTGCCAGTAGTATTGGACAGGGGACATCCTACGAGCAGTACTTC        8
## 687        TGTGCCAGCAGTTACCGGGGAAGCACTGAAGCTTTCTTT        6
## 688           TGTGCCAGCAGTGGGGGGAACACTGAAGCTTTCTTT        8
## 728  TGTGCCAGCAGGGCTAGCGGGGGGGCTCCAGATACGCAGTATTTT        8
## 760           TGTGCCAGCAGTTTTGCGGGTTACGAGCAGTACTTC        8
## 782           TGTGCCAGCAGTTTATACGGAGAGACCCAGTACTTC        8
## 794           TGTGCCAGCAGTTTTAGGGGCTACGAGCAGTACTTC        8
## 818           TGTGCCAGCAGCTCCTTTACCTACGAGCAGTACTTC        8
## 919        TGTGCCAGCAGCTCGGGACTGTCCTACGAGCAGTACTTC        6
## 929        TGTGCCAGCAGCTACAGGGGCTCCTACGAGCAGTACTTC        8
## 930        TGTGCCAGCAGCTCGGGACTGTCCTACGAGCAGTACTTC        8
## 971  TGTGCCAGTAGTCTCGGACAGGGCCTCTCCTACGAGCAGTACTTC        8
## 1019 TGTGCCAGTAGTGTCGGACAGGGCCTCTCCTACGAGCAGTACTTC        8
## 1031 TGTGCCAGCGGGACTAGCGGGGGGGCCCTAGATACGCAGTATTTT        8
## 1035          TGTGCCAGCAGCCCTACGGGGGGTGAAGCTTTCTTT        6

StarTree for fixed DNA self-joins

star_tree is a fixed DNA-only search structure based on the Starcode all-pairs search strategy described by Zorita, Cuscó, and Filion (2015). It is a modified version of that strategy, adapted to operate over a radix trie rather than being a direct reimplementation.

This class follows two Starcode ideas:

The Starcode algorithm remains one of the fastest methods for self-similarity joins at small edit distances on DNA.

Unlike the other two classes, the sequences and search parameters are supplied at construction time, the self-join runs immediately, and the object does not support later insertion or deletion.

st <- star_tree(c("ACGT", "ACGA", "AAAA", "AAAT"),
                max_distance = 1,
                mismatch_cost = 1,
                gap_cost = 1,
                nthreads = 2)
result(st)
##   query target distance
## 1  AAAT   AAAA        1
## 2  ACGT   ACGA        1
# Search another query set using the same fixed costs and threshold.
align_search(st, c("ACGT", "AAAC"))
##   query target distance
## 1  AAAC   AAAA        1
## 2  AAAC   AAAT        1
## 3  ACGT   ACGA        1
## 4  ACGT   ACGT        0
# The same path is available through dist_search().
dist_search(c("ACGT", "ACGA", "AAAA", "AAAT"),
            max_distance = 1,
            tree_class = "StarTree")
##   query target distance
## 1  AAAT   AAAA        1
## 2  ACGT   ACGA        1

StarTree supports global/Levenshtein-style, anchored, and Hamming DNA alignment with A, C, G, T, and N. It does not support custom substitution matrices, affine gaps, or per-query distance thresholds. Alignment parameters are fixed when the tree is built; align_search() only accepts query, nthreads, and show_progress for star_tree objects.

StarTree anchored mode for fixed anchored self-joins

star_tree(..., mode = "anchored") is the fixed-tree equivalent for anchored alignment. It is intended for DNA sets where every sequence starts at the same biological position but may end at different positions. The construction-time self-join returns each unordered pair once, and additional query sets can be searched against the same fixed targets.

ast <- star_tree(c("ACGT", "ACG", "AAAA", "AA"),
                 max_distance = 1,
                 mode = "anchored",
                 mismatch_cost = 1,
                 gap_cost = 1,
                 nthreads = 2)
result(ast)
##   query target distance query_size target_size
## 1  AAAA     AA        0          2           2
## 2   ACG     AA        1          1           2
## 3  ACGT     AA        1          1           2
## 4  ACGT    ACG        0          3           3
align_search(ast, c("ACGT", "AA"))
##   query target distance query_size target_size
## 1    AA     AA        0          2           2
## 2    AA   AAAA        0          2           2
## 3    AA    ACG        1          2           1
## 4    AA   ACGT        1          2           1
## 5  ACGT     AA        1          1           2
## 6  ACGT    ACG        0          3           3
## 7  ACGT   ACGT        0          4           4
dist_search(c("ACGT", "ACG", "AAAA", "AA"),
            max_distance = 1,
            mode = "anchored",
            tree_class = "StarTree")
##   query target distance query_size target_size
## 1  AAAA     AA        0          2           2
## 2   ACG     AA        1          1           2
## 3  ACGT     AA        1          1           2
## 4  ACGT    ACG        0          3           3

Anchored StarTree mode supports scalar mismatch and linear gap costs through mismatch_cost and gap_cost. It does not support custom substitution matrices, affine gaps, or per-query distance thresholds.

StarTree Hamming mode for fixed same-length joins

star_tree(..., mode = "hamming") is a substitution-only fixed join: only sequences of the same length can match, and max_distance is the maximum number of mismatched positions. Because no insertions or deletions are considered, mismatch_cost and gap_cost do not apply, and it is typically much faster than global mode on the same data.

hst <- star_tree(c("ACGT", "ACGA", "TCGT", "ACG"),
                 max_distance = 1,
                 mode = "hamming",
                 nthreads = 2)
result(hst)
##   query target distance
## 1  ACGT   ACGA        1
## 2  TCGT   ACGT        1
align_search(hst, c("ACGT", "TTGT"))
##   query target distance
## 1  ACGT   ACGA        1
## 2  ACGT   ACGT        0
## 3  ACGT   TCGT        1
## 4  TTGT   TCGT        1
dist_search(c("ACGT", "ACGA", "TCGT", "ACG"),
            max_distance = 1,
            mode = "hamming",
            tree_class = "StarTree")
##   query target distance
## 1  ACGT   ACGA        1
## 2  TCGT   ACGT        1

N is treated as a regular base that mismatches every base, including another N, so two aligned N positions still count as a mismatch.

References