Suffix trees, suffix arrays, etc.
From MIT 6.851:
Given text upfront, preprocess text T and query pattern P. Goal - query in O(P). O(T) space.
Trie - rooted tree with child branches labeled with letters in Σ.
Task: find predecessor. Solved with tries, but the problem is how to represent the node in the trie. Array: too much memory. BST - too slow. Hash - doesn't preserve order, so doesn't solve the predecessor. There are tricks how to do it in O(P + logΣ).
Can use to sort strings: O(T + klgΣ).
Compressed trie - contract non-branching path into single edge. O(k) nodes.
Suffix trees (or suffix tries) - compressed trie of all |T| suffixes. T[i:] of T with $ appended.
Applications of suffix trees:
Search for P gives subtree whose leaves correspond to all occurrences of P in T.
Which node representation to use?
Hashing => O(P) time. But not in order.
Trays => O(P + lgΣ) time.
LCP (T[i:], T[j:]) = LCA
All occurrences of T[i:j] = weighted level ancestor j-i of T[i:] leaf. O(lglgT).
Document retrieval: O(p + #docs) - all documents matching the pattern. With RMQ.
Suffix array - sort suffixes of T (just store their indexes) in O(T) space.
LCP - largest common prefix or neighbour prefixes.
You can build suffix tree using suffix array and LCPs.
O(T + sort(Σ)) construction:
Sort Σ.
Replace each letter by its rank in Σ.
Form T0=<(T[3i], T[3i+1], T[3i+2])> for i =0,1,... T1=<(T[3i+1], T[3i+2], T[3i+3])>. T2=...
Recurse on <T0, T1>
Radix sort of T2's suffixes. T2[i:] = T[3i+2] = <T[3i+2],T[3i+3:]>
Merge suffixes of T0 and T1 with suffixes of T2.
Exercises:
Build suffix tree.
Find substring.
Find longest repeated substring. Try wikipedia!
Last updated