Suffix trees, suffix arrays, etc.

From MIT 6.851:

Given text upfront, preprocess text T and query pattern P. Goal - query in O(P). O(T) space.

Trie - rooted tree with child branches labeled with letters in Σ.

Task: find predecessor. Solved with tries, but the problem is how to represent the node in the trie. Array: too much memory. BST - too slow. Hash - doesn't preserve order, so doesn't solve the predecessor. There are tricks how to do it in O(P + logΣ).

Can use to sort strings: O(T + klgΣ).

Compressed trie - contract non-branching path into single edge. O(k) nodes.

Suffix trees (or suffix tries) - compressed trie of all |T| suffixes. T[i:] of T with $ appended.

Applications of suffix trees:

  • Search for P gives subtree whose leaves correspond to all occurrences of P in T.

    • Which node representation to use?

      • Hashing => O(P) time. But not in order.

      • Trays => O(P + lgΣ) time.

  • LCP (T[i:], T[j:]) = LCA

  • All occurrences of T[i:j] = weighted level ancestor j-i of T[i:] leaf. O(lglgT).

  • Document retrieval: O(p + #docs) - all documents matching the pattern. With RMQ.

Suffix array - sort suffixes of T (just store their indexes) in O(T) space.

LCP - largest common prefix or neighbour prefixes.

You can build suffix tree using suffix array and LCPs.

O(T + sort(Σ)) construction:

  1. Sort Σ.

  2. Replace each letter by its rank in Σ.

  3. Form T0=<(T[3i], T[3i+1], T[3i+2])> for i =0,1,... T1=<(T[3i+1], T[3i+2], T[3i+3])>. T2=...

  4. Recurse on <T0, T1>

  5. Radix sort of T2's suffixes. T2[i:] = T[3i+2] = <T[3i+2],T[3i+3:]>

  6. Merge suffixes of T0 and T1 with suffixes of T2.

Exercises:

  • Build suffix tree.

  • Find substring.

  • Find longest repeated substring. Try wikipedia!

Last updated