Suffix trees, suffix arrays, etc.

From MIT 6.851:

Given text upfront, preprocess text T and query pattern P. Goal - query in O(P). O(T) space.

Trie - rooted tree with child branches labeled with letters in Σ.

Task: find predecessor. Solved with tries, but the problem is how to represent the node in the trie. Array: too much memory. BST - too slow. Hash - doesn't preserve order, so doesn't solve the predecessor. There are tricks how to do it in O(P + logΣ).

Can use to sort strings: O(T + klgΣ).

Compressed trie - contract non-branching path into single edge. O(k) nodes.

Suffix trees (or suffix tries) - compressed trie of all |T| suffixes. T[i:] of T with $ appended.

Applications of suffix trees:

Search for P gives subtree whose leaves correspond to all occurrences of P in T.
- Which node representation to use?
  - Hashing => O(P) time. But not in order.
  - Trays => O(P + lgΣ) time.
LCP (T[i:], T[j:]) = LCA
All occurrences of T[i:j] = weighted level ancestor j-i of T[i:] leaf. O(lglgT).
Document retrieval: O(p + #docs) - all documents matching the pattern. With RMQ.

Suffix array - sort suffixes of T (just store their indexes) in O(T) space.

LCP - largest common prefix or neighbour prefixes.

You can build suffix tree using suffix array and LCPs.

O(T + sort(Σ)) construction:

Sort Σ.
Replace each letter by its rank in Σ.
Form T0=<(T[3i], T[3i+1], T[3i+2])> for i =0,1,... T1=<(T[3i+1], T[3i+2], T[3i+3])>. T2=...
Recurse on <T0, T1>
Radix sort of T2's suffixes. T2[i:] = T[3i+2] = <T[3i+2],T[3i+3:]>
Merge suffixes of T0 and T1 with suffixes of T2.

Exercises:

Build suffix tree.
Find substring.
Find longest repeated substring. Try wikipedia!

PreviousFibonacci Heaps NextTechnologies

Last updated 4 years ago

Was this helpful?