MIT 6.854/18.415: Advanced Algorithms + Stanford CS168: The Modern Algorithmic Toolbox

These two courses, MIT 6.854/18.415: Advanced Algorithms and Stanford CS168: The Modern Algorithmic Toolbox, overlap a lot, and I studied them together. Where they overlap, they complement each other nicely: the MIT course has video recordings of the lectures, while the Stanford course has more interesting (IMO) homeworks.
Videos.
My homeworks (just the coding questions).
I was familiar with some of the topics in the courses; I skipped the lectures where I thought I already knew the topic sufficiently well.

Hashing, Load Balancing

  • Recipe for a "universal" HF: the family [h_{a,b}(x) = ((a·x + b) mod p) mod n, a ≠ 0] works just as well as a completely random hash function (see the first sketch after this list). A family of hash functions H is 2-universal if, for any pair x ≠ y, Pr_{h∈H}[h(x) = h(y)] ≤ 1/n.
  • Power of 2 choices: if we distribute N tasks to N servers randomly, the max load on one server will be Θ(log N / log log N). If we instead pick 2 random servers and assign each task to the less loaded one, it's Θ(log log N), which is a lot better.
  • Consistent hashing (for caching): assign machines randomly to points on the [0, 1] interval. Each request is then hashed to a point too, and routed to the first machine to its right, wrapping around. The idea is that when we add or remove machines, we don't need to redistribute things too much.
    • Caches and pages are mapped to the same space. There are no "indexes" for caches or anything.
  • To answer queries on big data (streams) or get some statistics, we can sacrifice some accuracy for big gains in memory. It's lossy compression.
    • E.g. Bloom filters, count-min sketch (estimate frequency of element in a stream), algs to estimate number of distinct elements.
  • Count-min sketch - estimate the frequency of any element in a stream (see the second sketch after this list). Choose ℓ HFs, each with b buckets. Hash each element with each HF and increment the corresponding counters. Estimate a frequency by taking the min value over all of its counters.
    • A huge practical improvement comes from conservative updates: only increment a counter if it equals the current minimum among the element's buckets, i.e. don't bump counters that are already above the estimate. This never misses anything (still guaranteed to overestimate the frequency), but makes the guess a lot closer to the truth. At least this is what happened in the homework.
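
A minimal sketch of the h_{a,b} family from the first bullet; the prime p and the table size n here are arbitrary example choices:

```python
import random

# 2-universal family: h_{a,b}(x) = ((a*x + b) mod p) mod n, with a != 0.
# p must be a prime larger than any key we hash; 2**61 - 1 is one convenient choice.
P = 2**61 - 1

def make_universal_hash(n, p=P):
    """Draw one hash function from the family uniformly at random."""
    a = random.randrange(1, p)   # a != 0
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

h = make_universal_hash(n=100)
# Any two fixed keys collide with probability <= 1/100 over the random draw of h.
print(h(42), h(43))
```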
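
And a rough sketch of a count-min sketch with the conservative-update variant; the width and depth values are placeholders, and the row hashes reuse the same h_{a,b} family:

```python
import random

class CountMinSketch:
    """Count-min sketch: `depth` hash functions, `width` counters each.
    Estimates never underestimate the true frequency."""

    def __init__(self, width=1000, depth=5, conservative=True):
        self.width = width
        self.conservative = conservative
        self.counters = [[0] * width for _ in range(depth)]
        # One h_{a,b} hash per row: h(x) = ((a*x + b) mod p) mod width.
        p = 2**61 - 1
        self.hashes = [(random.randrange(1, p), random.randrange(p), p)
                       for _ in range(depth)]

    def _buckets(self, x):
        return [((a * x + b) % p) % self.width for (a, b, p) in self.hashes]

    def add(self, x):
        buckets = self._buckets(x)
        if self.conservative:
            # Conservative update: only counters equal to the current minimum
            # get raised, so counters grow as slowly as possible.
            est = min(self.counters[i][j] for i, j in enumerate(buckets))
            for i, j in enumerate(buckets):
                self.counters[i][j] = max(self.counters[i][j], est + 1)
        else:
            for i, j in enumerate(buckets):
                self.counters[i][j] += 1

    def estimate(self, x):
        return min(self.counters[i][j] for i, j in enumerate(self._buckets(x)))

cms = CountMinSketch()
for item in [1, 1, 2, 1, 3, 2]:
    cms.add(item)
print(cms.estimate(1))  # >= 3; usually exactly 3
```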

Data with Distances (Similarity Search, Nearest Neighbor, Dimension Reduction, LSH)

  • High data dimensionality is bad - even calculating the distance between rows becomes too expensive. But data can be reduced/embedded into a lower dimensionality rather easily (Johnson-Lindenstrauss). Basically, we "hash" each point by projecting it onto a random direction and use the result as a proxy for comparison; repeat X times (one coordinate per repetition) to boost confidence. See the first sketch after this list. JL only works for preserving Euclidean distances! If you have other metrics - not so much.
  • Dimensionality reduction is a kind of lossy compression. Like the count-min sketch.
  • Every exact algorithm for finding the nearest neighbour is exponential in the number of dimensions, in runtime or space, or both. High-dimensional spaces sort of lack geometry.
  • kd-trees - partition the space into a binary tree to find the nearest neighbour. Kinda like a Voronoi diagram in arbitrary dimensions. Exponential in the number of dimensions because we can't always discard one branch.
  • LSH - can quickly find a nearest neighbor (approximately, not exactly) by using hashes that are likely to collide for similar elements and unlikely to collide for distant ones.
    • Use k different hashes, each being locality sensitive. In preprocessing, hash each point with each hash and put it into the corresponding bucket (each hash has its own table of buckets).
    • Query: hash the query point and compare it to all elements in the buckets it lands in.
    • Can tune characteristics by changing the cardinality of each hash (bigger - fewer FPs) and number of hashes (bigger - fewer FNs).
    • Each hash is usually a concatenation of several elementary hashes, which lets us boost the size of each hash and reduce FPs.
    • Min-hash: estimate Jaccard similarity using random permutations. Pick a random permutation, then MinHash(A) = min(perm(A)). The probability that two sets get the same hash is exactly their Jaccard similarity. Repeat N times (that's why we need permutations - so we can hash the same set many times differently) to boost accuracy; see the second sketch after this list.
    • What does "min" mean? We just assign some value to each element - some random hash to ints. The idea is that this "min" gives us the Jaccard similarity: it is the same for two sets iff the element achieving it belongs to their intersection.
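
A minimal sketch of JL-style random projection with numpy; the target dimension k below is a placeholder, not a tuned value:

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project n points in d dimensions down to k dimensions.
    With k ~ O(log n / eps^2), all pairwise Euclidean distances are preserved
    up to a (1 +/- eps) factor with high probability (Johnson-Lindenstrauss)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # k random Gaussian directions; dividing by sqrt(k) keeps distances unbiased.
    R = rng.normal(size=(d, k)) / np.sqrt(k)
    return X @ R

X = np.random.default_rng(1).normal(size=(100, 10_000))  # 100 points in 10,000 dims
Y = random_projection(X, k=500)
# One pairwise distance before and after the projection - they should be close.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))
```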
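
And a rough MinHash sketch; truly random permutations of a huge universe are impractical, so (as is common) random hash functions stand in for the permutations here:

```python
import random

def minhash_signatures(sets, num_hashes=100, seed=0):
    """Return a signature (a list of num_hashes minima) for each input set of ints."""
    rng = random.Random(seed)
    p = 2**61 - 1
    # Each (a, b) pair plays the role of one random permutation of the universe.
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [[min((a * x + b) % p for x in s) for (a, b) in params] for s in sets]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of coordinates on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(0, 100))
B = set(range(50, 150))   # true Jaccard similarity = 50 / 150 = 1/3
sig_a, sig_b = minhash_signatures([A, B])
print(estimate_jaccard(sig_a, sig_b))  # roughly 0.33
```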

Max Flow, Min Cost

  • (Max flow) Any flow can be decomposed into s-t paths and cycles. Cycles can be ignored for the purpose of finding the max.
  • Max flow is equal to the min cut.
  • Ford-Fulkerson - iterate on the "residual" graph (arcs along which flow can still be added, including reverse arcs that undo existing flow). Find an s-t path that increases the flow (an augmenting path) and push flow along it; see the sketch after this list.
  • This is not guaranteed to converge if capacities are real numbers! If they are ints/rationals, it will finish, but it can be very slow, and the running time depends on the actual arc capacities - unless we are smart about how we pick the paths (shortest, max increment, ...).
  • Capacity scaling: like FF, but only look for augmenting paths using residual arcs with capacity ≥ D until there's no such path. Then divide D by 2, and repeat while D ≥ 1. Finishes in O(m²(1 + lg U)).
  • Min-cost flow - each edge has a capacity and a cost; we are looking for a max flow minimizing ∑_{(a,b)∈E} f(a, b)·c(a, b).
  • Can use max flow to find max matching in a bipartite graph.
  • Can use an idea similar to FF to find a min-cost perfect matching: construct a different type of residual graph, look for cycles of negative cost, and cancel them to improve the solution. It's called Klein's cycle-cancelling algorithm.
  • Goldberg-Tarjan: cancel negative cycles to find a min-cost max flow, each time choosing the negative-cost cycle C that minimizes c_H(C)/|C| (the mean cost of the cycle in the residual graph H).
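
A compact sketch of Ford-Fulkerson with BFS path selection (the Edmonds-Karp variant, one of the "smart" path choices mentioned above); representing the graph as a dense capacity matrix is just a convenience for this example:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Ford-Fulkerson with BFS (Edmonds-Karp). capacity[u][v] is the arc capacity."""
    n = len(capacity)
    # residual[u][v] = remaining capacity; pushing flow u->v adds capacity v->u.
    residual = [row[:] for row in capacity]
    flow = 0
    while True:
        # BFS for a shortest augmenting s-t path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if residual[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:          # no augmenting path left: the flow is maximal
            return flow
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck, v = float("inf"), t
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u
        v = t
        while v != s:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Tiny example: s = 0, t = 3, max flow = 4.
cap = [[0, 3, 2, 0],
       [0, 0, 1, 2],
       [0, 0, 0, 2],
       [0, 0, 0, 0]]
print(max_flow(cap, 0, 3))
```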

Linear-Algebraic Techniques: Understanding Principal Components Analysis

  • PCA (Principal Component Analysis): express m n-dimensional vectors as linear combinations of k n-dimensional vectors. Choose the "best" such vectors - the ones maximizing the variance of the projected data. Helps with visualizing and interpreting the data.
  • The principal components are the eigenvectors with the biggest eigenvalues of the covariance matrix X^T·X, i.e. the (mean-centered) data matrix multiplied on the left by its own transpose. Geometrically, these are the directions of the biggest stretch. See the sketch below.
  • Each principal component best "fits" the data (minimizes the average squared distance to the line) while being orthogonal to the previous components.
  • PCA is a dimensionality reduction technique that helps visualizing/interpreting/compressing the data.
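
A minimal numpy sketch of PCA as described above, via eigen-decomposition of the covariance matrix (practical code would more often use an SVD of the centered data):

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components (as rows) and the projected data.
    X has one n-dimensional data point per row."""
    Xc = X - X.mean(axis=0)                  # mean-center every coordinate
    cov = Xc.T @ Xc / len(Xc)                # covariance matrix, n x n
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix: eigh, ascending order
    components = eigvecs[:, ::-1][:, :k].T   # top-k eigenvectors = max-variance directions
    return components, Xc @ components.T

rng = np.random.default_rng(0)
# 200 points that mostly vary along the direction (1, 1, 0), plus a bit of noise.
X = rng.normal(size=(200, 1)) @ np.array([[1.0, 1.0, 0.0]]) + 0.1 * rng.normal(size=(200, 3))
components, projected = pca(X, k=1)
print(components[0])  # roughly +/- (0.71, 0.71, 0.00)
```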

Sampling and Estimation

  • Reservoir sampling - sample k elements uniformly from a stream of unknown size, in one pass (see the first sketch after this list).
  • Markov - "at most 10% of the population can have an income that is more than 10× the average income of the population"
  • Chebyshev - "the probability that a random variable is more than c standard deviations from its expectation, is at most 1/c²"
  • In general, to estimate the expectation of a 0/1 random variable to error ±ε, one needs roughly O(1/ε²) independent samples
  • Importance sampling - oversample the "important" subset (e.g. the heavy tail), then adjust weights to keep the estimator unbiased. Keeps the expectation but reduces the variance.
  • Good-Turing frequency estimation - "An estimate of the probability that the next word you read is a new word that you haven’t seen before, is the number of words that you have seen exactly once, divided by the total number of words that you have seen."
  • Markov process - initial state, set of states, transition probability. Markov property - the transition depends only on the current state; no memory.
  • Estimate the probability of being in a certain state after a certain number of steps either directly (multiply some matrices) or by random sampling - Markov Chain Monte Carlo (MCMC). See the second sketch after this list.
  • Example - PageRank's random walk.
  • Fundamental theorem of Markov chains - if there's a path from any state to any other (the chain is irreducible) and it's aperiodic, then the state probabilities converge to some distribution that is independent of time and of the initial state. It's called the stationary distribution.
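
A minimal sketch of one-pass reservoir sampling; the stream below is just a Python range standing in for data of unknown length:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from a stream of unknown length, in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            # Replace a random slot with probability k / (i + 1); every item seen
            # so far stays in the reservoir with probability exactly k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5, seed=42))
```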
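
And a small sketch contrasting the two approaches to the state-probability question above, for a made-up 3-state chain: exact computation via matrix powers versus estimation by sampling random walks:

```python
import numpy as np

# A made-up 3-state transition matrix: P[i][j] = Pr[next state = j | current state = i].
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

t = 20  # number of steps

# Exact: the distribution after t steps, starting from state 0, is row 0 of P^t.
print("exact:  ", np.linalg.matrix_power(P, t)[0])

# Sampling: simulate many t-step random walks from state 0 and count where they end.
rng = np.random.default_rng(0)
counts = np.zeros(3)
for _ in range(10_000):
    state = 0
    for _ in range(t):
        state = rng.choice(3, p=P[state])
    counts[state] += 1
print("sampled:", counts / counts.sum())

# This chain is irreducible and aperiodic, so by the fundamental theorem both
# numbers are already close to the stationary distribution pi, where pi = pi @ P.
```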

The Fourier Transform and Convolution

Fourier Transform