# Suffix trees, suffix arrays, etc.

From MIT 6.851:

* Video: <https://www.youtube.com/watch?v=NinWEPPrkDQ>
* <https://courses.csail.mit.edu/6.851/spring12/scribe/lec16.pdf>
* <https://courses.csail.mit.edu/6.851/spring12/lectures/L16.pdf>

Given text upfront, preprocess text T and query pattern P. Goal - query in O(P). O(T) space.

**Trie** - rooted tree with child branches labeled with letters in Σ.

**Task: find predecessor.** Solved with tries, but the problem is how to represent the node in the trie. Array: too much memory. BST - too slow. Hash - doesn't preserve order, so doesn't solve the predecessor. There are tricks how to do it in O(P + logΣ).

Can use to sort strings: O(T + klgΣ).

**Compressed trie** - contract non-branching path into single edge. O(k) nodes.

**Suffix trees** (or suffix tries) - compressed trie of all |T| suffixes. T\[i:] of T with $ appended.

Applications of suffix trees:

* Search for P gives subtree whose leaves correspond to all occurrences of P in T.
  * Which node representation to use?
    * Hashing => O(P) time. But not in order.
    * Trays => O(P + lgΣ) time.
* LCP (T\[i:], T\[j:]) = LCA
* All occurrences of T\[i:j] = weighted level ancestor j-i of T\[i:] leaf. O(lglgT).
* Document retrieval: O(p + #docs) - all documents matching the pattern. With RMQ.

**Suffix array** - sort suffixes of T (just store their indexes) in O(T) space.

LCP - largest common prefix or neighbour prefixes.

You can build suffix tree using suffix array and LCPs.

O(T + sort(Σ)) construction:

1. Sort Σ.
2. Replace each letter by its rank in Σ.
3. Form T0=<(T\[3i], T\[3i+1], T\[3i+2])> for i =0,1,... T1=<(T\[3i+1], T\[3i+2], T\[3i+3])>. T2=...
4. Recurse on \<T0, T1>
5. Radix sort of T2's suffixes. T2\[i:] = T\[3i+2] = \<T\[3i+2],T\[3i+3:]>
6. Merge suffixes of T0 and T1 with suffixes of T2.

Exercises:

* Build suffix tree.
* Find substring.
* Find longest repeated substring. Try wikipedia!
