📓
ainzzorl's wiki
  • Intro
  • Pet Projects
    • Good Code
    • LangFlipFlop
    • Soccer Adjectives
    • Word Highlighter
    • News Digest
    • Algorithms in Kotlin
    • rustorcli
  • Tech
    • Tech Articles
    • Algorithms
      • Distributed Hash Table (DHT)
      • RSA
      • Seam Carving
      • Fibonacci Heaps
      • Suffix trees, suffix arrays, etc.
    • Technologies
      • Threads vs Events
      • TLS
      • GPU
      • Web Sockets
      • OSI Model
    • Open Source
      • Homebrew
      • Standard Notes
      • LiChess
    • Bash
    • Raft
    • Quantum Computing
    • GFS
    • MapReduce
    • ZooKeeper
  • Courses
    • MIT 6.824: Distributed Systems
      • Primary-Backup Replication
      • Object Storage on CRAQ
      • Aurora
      • Cache Consistency: Frangipani
      • Takeways
      • Distributed Transactions
      • Midterm
      • Spanner
      • FaRM
      • Spark
      • Cache Consistency: Memcached at Facebook
    • Page 2
    • MIT 6.854/18.415: Advanced Algorithms + Stanford CS168: The Modern Algorithmic Toolbox
  • Math
    • Fourier Transform
    • Probabilities & Statistics
  • Places
    • Moscow
  • Books
    • Page 1
    • Page 1
    • Page 1
    • Page 1
    • Page 1
    • Tobol Mnogo Zvannyh - Ivanov
    • The Twelve Chairs/12 стульев - Ilf, Petrov
    • Beauty is a Wound - Eka Kurniawan
    • The Queen of Spades/Пиковая Дама - Pushkin
    • The Sirens of Titan - Kurt Vonnegut
    • Обитель - Захар Прилепин
    • The Faithful Executioner - Joel Harrington
    • City of Lies: Love, Sex, Death, and the Search for Truth in Tehran - Ramita Navai
    • June/Июль - Dmitry Bykov/Дмитрий Быков
    • East of Eden - John Steinbeck
    • Como Agua Para Chocolate/Like Water for Chocolate - Laura Esquivel Valdés
    • The Kukotski Enigma/Казус Кукоцкого - Lyudmila Ulitskaya/Людмила Улицкая
    • Ancillary Justice - Ann Leckie
    • Career of Evil - JK Rowling
    • The Signal and the Noise - Nate Silver
    • Don't Sleep, There are Snakes - Daniel Everett
    • Оправдание Острова - Eugene Vodolazkin
    • A Place Called Winter - Patrick Gale
    • 1491: New Revelations of the Americas Before Columbus - Charles C. Mann
    • Трудно Отпускает Антарктида - Vladimir Sanin
    • Klara and the Sun - Kazuo Ishiguro
    • The History of My Contemporary/История Моего Современника - Vladimir Korolenko
    • Life at the Speed of Light - Craig Venter
    • Misery - Stephen King
    • And Then There Were None - Agatha Christie
    • A Kim Jong-Il Production - Paul Fischer
  • Cooking
    • Marinated Mushrooms
    • Pea Soup
    • Fried Potato
    • Chimichurri
    • Komendantsky Sauce
    • Spicy mushroom marinara
    • Bruschetta
    • Solyanka with Mushrooms
    • Tofu Scramble
    • Bean Spaghetti
    • Salsa
    • Baked Mushrooms
    • Lentil Soup
    • Веганство в Москве
  • Misc Research
    • Recycling
    • Sailing Upwind
    • Viruses
  • Misc Reading
    • Harry Potter - Rowling Writing
  • Fitness
  • Languages
    • Spanish
    • Language Classification
  • Juggling
  • Life Advice
Powered by GitBook
On this page

Was this helpful?

  1. Tech
  2. Algorithms

Suffix trees, suffix arrays, etc.

PreviousFibonacci HeapsNextTechnologies

Last updated 3 years ago

Was this helpful?

From MIT 6.851:

  • Video:

Given text upfront, preprocess text T and query pattern P. Goal - query in O(P). O(T) space.

Trie - rooted tree with child branches labeled with letters in Σ.

Task: find predecessor. Solved with tries, but the problem is how to represent the node in the trie. Array: too much memory. BST - too slow. Hash - doesn't preserve order, so doesn't solve the predecessor. There are tricks how to do it in O(P + logΣ).

Can use to sort strings: O(T + klgΣ).

Compressed trie - contract non-branching path into single edge. O(k) nodes.

Suffix trees (or suffix tries) - compressed trie of all |T| suffixes. T[i:] of T with $ appended.

Applications of suffix trees:

  • Search for P gives subtree whose leaves correspond to all occurrences of P in T.

    • Which node representation to use?

      • Hashing => O(P) time. But not in order.

      • Trays => O(P + lgΣ) time.

  • LCP (T[i:], T[j:]) = LCA

  • All occurrences of T[i:j] = weighted level ancestor j-i of T[i:] leaf. O(lglgT).

  • Document retrieval: O(p + #docs) - all documents matching the pattern. With RMQ.

Suffix array - sort suffixes of T (just store their indexes) in O(T) space.

LCP - largest common prefix or neighbour prefixes.

You can build suffix tree using suffix array and LCPs.

O(T + sort(Σ)) construction:

  1. Sort Σ.

  2. Replace each letter by its rank in Σ.

  3. Form T0=<(T[3i], T[3i+1], T[3i+2])> for i =0,1,... T1=<(T[3i+1], T[3i+2], T[3i+3])>. T2=...

  4. Recurse on <T0, T1>

  5. Radix sort of T2's suffixes. T2[i:] = T[3i+2] = <T[3i+2],T[3i+3:]>

  6. Merge suffixes of T0 and T1 with suffixes of T2.

Exercises:

  • Build suffix tree.

  • Find substring.

  • Find longest repeated substring. Try wikipedia!

https://www.youtube.com/watch?v=NinWEPPrkDQ
https://courses.csail.mit.edu/6.851/spring12/scribe/lec16.pdf
https://courses.csail.mit.edu/6.851/spring12/lectures/L16.pdf