MIT 6.854/18.415: Advanced Algorithms + Stanford CS168: The Modern Algorithmic Toolbox
These two courses, MIT 6.854/18.415: Advanced Algorithms and Stanford CS168: The Modern Algorithmic Toolbox, overlap a lot, and I studied them together. Where they overlap, their materials complement each other: the MIT course has video recordings of lectures, while the Stanford course has more interesting (IMO) homeworks.
My homeworks (just the coding questions).
I was familiar with some of the topics in the courses; I skipped the lectures where I thought I already knew the topic sufficiently well.
Hashing, Load Balancing
Recipe for "universal" HF. Family of [ha, b(x) = (ax+b mod p) mod n, a != 0] works just as good as completely random universal HFs. A family of hash functions H is 2-universal if, for any pair x!=y,Pr[h∈H | h(x) =h(y)] ≤ 1/n
Power of 2 choices: if we distribute n tasks to n servers uniformly at random, the max load on a single server will be Θ(log n / log log n). If we instead pick 2 servers at random and assign the task to the less loaded one, the max load drops to Θ(log log n), which is a lot better.
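A quick simulation sketch of both strategies (parameters are my own choice):

```python
import random

def max_load(n, choices):
    """Throw n tasks at n servers; each task goes to the least loaded of `choices` random servers."""
    loads = [0] * n
    for _ in range(n):
        candidates = random.sample(range(n), choices)
        best = min(candidates, key=lambda i: loads[i])
        loads[best] += 1
    return max(loads)

n = 100_000
print("1 choice :", max_load(n, 1))   # grows like log n / log log n
print("2 choices:", max_load(n, 2))   # grows like log log n
```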
Consistent hashing (for caching): assign machines randomly to points on the [0, 1] interval. Each request is also hashed to a point and routed to the machine to the right of it, with wrapping. The idea is that when we add or remove machines, only a small fraction of requests has to be redistributed.
Caches and pages are mapped to the same space. There are no "indexes" for caches or anything.
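A toy ring implementation, assuming we map machine names and request keys to [0, 1) via MD5 (a sketch, not production code):

```python
import bisect
import hashlib

def point(key):
    """Map any string to a pseudo-random point in [0, 1)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) / 16 ** 32

class ConsistentHashRing:
    def __init__(self, machines):
        pairs = sorted((point(m), m) for m in machines)
        self.points = [p for p, _ in pairs]
        self.names = [m for _, m in pairs]

    def route(self, request_key):
        # First machine to the right of the request's point, wrapping around.
        i = bisect.bisect_right(self.points, point(request_key))
        return self.names[i % len(self.names)]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
print(ring.route("page-1"), ring.route("page-2"))
```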
To answer queries on big data (streams) or get some statistics, we can sacrifice some accuracy for big gains in memory. It's lossy compression.
E.g. Bloom filters, count-min sketch (estimate frequency of element in a stream), algs to estimate number of distinct elements.
Count-min sketch - estimate the frequency of any element in a stream. Choose l hash functions, each mapping into its own row of b counters. For each incoming element, hash it with every hash function and increment the corresponding counters. To estimate a frequency, take the min of the element's counters across all rows.
A huge practical improvement comes from conservative updates: only increment the counters that currently equal the minimum among the element's counters, leaving the larger ones alone. This still doesn't miss anything (the sketch is still guaranteed to overestimate the frequency), but makes the estimate a lot closer to the truth. At least this is what happened in the homework.
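A sketch of a count-min sketch with an optional conservative-update mode (the hash construction and sizes are my own choices):

```python
import random

class CountMinSketch:
    def __init__(self, rows=5, buckets=1000, conservative=True):
        self.counts = [[0] * buckets for _ in range(rows)]
        self.buckets = buckets
        self.conservative = conservative
        self.p = 2_147_483_647  # a large prime
        # One (a, b) pair per row for h(x) = ((a*hash(x) + b) mod p) mod buckets.
        self.hashes = [(random.randrange(1, self.p), random.randrange(self.p))
                       for _ in range(rows)]

    def _cells(self, x):
        return [(r, ((a * hash(x) + b) % self.p) % self.buckets)
                for r, (a, b) in enumerate(self.hashes)]

    def add(self, x):
        cells = self._cells(x)
        if self.conservative:
            # Conservative update: only bump counters equal to the current min.
            m = min(self.counts[r][c] for r, c in cells)
            for r, c in cells:
                if self.counts[r][c] == m:
                    self.counts[r][c] += 1
        else:
            for r, c in cells:
                self.counts[r][c] += 1

    def estimate(self, x):
        return min(self.counts[r][c] for r, c in self._cells(x))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # >= 3, usually exactly 3
```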
Data with Distances (Similarity Search, Nearest Neighbor, Dimension Reduction, LSH)
High data dimensionality is bad - even computing distances between rows gets expensive. But data can be embedded into a lower dimension rather easily (Johnson-Lindenstrauss): project each point onto a few random directions and use the projections as a proxy for comparison; repeating with more random directions boosts the accuracy. JL only preserves Euclidean distances! If you have other metrics - not so much.
Dimensionality reduction is a kind of lossy compression. Like count-min sketch.
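A sketch of a JL-style Gaussian random projection with numpy (the dimensions are my own example values):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, k):
    """Project the n-dimensional rows of X down to k dimensions."""
    n = X.shape[1]
    # Random Gaussian directions, scaled so Euclidean distances are
    # preserved in expectation.
    R = rng.normal(size=(n, k)) / np.sqrt(k)
    return X @ R

X = rng.normal(size=(100, 1000))   # 100 points in 1000 dimensions
Y = random_projection(X, 50)       # same points in 50 dimensions

orig = np.linalg.norm(X[0] - X[1])
proj = np.linalg.norm(Y[0] - Y[1])
print(orig, proj)                  # close, up to a small relative error
```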
Every exact algorithm for nearest neighbor is exponential in the dimension, in runtime or space, or both. Higher-dimensional spaces sort of lack geometry.
kd-trees - partition space into a binary tree to find the nearest neighbor. Kinda like a Voronoi diagram in arbitrary dimensions. Exponential in the dimension, because we can't always discard one of the branches.
LSH - can quickly find a nearest neighbor (approximately, though) by using hashes that are likely to collide for similar elements and unlikely to collide for distant ones.
Use k different hashes, each being locality sensitive. In preprocessing, hash each point with each hash and put it into the corresponding bucket (each hash has its own set of buckets).
Query: hash the query point with each hash and compare it to all elements in the matching buckets.
Can tune characteristics by changing the cardinality of each hash (bigger - fewer FPs) and number of hashes (bigger - fewer FNs).
Each hash is usually a concatenation of several elementary hashes; this boosts the selectivity of each hash and reduces FPs.
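A sketch using random hyperplanes (the locality-sensitive family for cosine similarity) as the elementary hashes; the table and bit counts are my own choices:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

class HyperplaneLSH:
    """LSH for cosine similarity: each table hashes a point to k sign bits."""

    def __init__(self, dim, n_tables=10, bits_per_table=8):
        self.planes = [rng.normal(size=(bits_per_table, dim))
                       for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _key(self, planes, x):
        return tuple((planes @ x > 0).astype(int))

    def add(self, idx, x):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, x)].append(idx)

    def query(self, x):
        candidates = set()
        for planes, table in zip(self.planes, self.tables):
            candidates.update(table.get(self._key(planes, x), []))
        return candidates

points = rng.normal(size=(1000, 50))
lsh = HyperplaneLSH(dim=50)
for i, p in enumerate(points):
    lsh.add(i, p)
# Query with a slightly perturbed copy of point 0: point 0 should
# almost certainly be among the candidates.
print(0 in lsh.query(points[0] + 0.01 * rng.normal(size=50)))
```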
Min-hash: estimate Jaccard similarity using random permutations. Pick a random permutation of the universe, then MinHash(A) = min(perm(A)). The probability that two sets get the same hash is exactly their Jaccard similarity. Repeat N times with independent permutations (that's why we need permutations - so we can hash the same sets many times, differently each time) to boost accuracy.
What does "min" mean? The permutation just assigns a random rank (an integer) to each element. This "min" gives us Jaccard similarity because two sets share the same minimum iff the element with the smallest rank in A ∪ B belongs to A ∩ B.
Max Flow, Min Cost
(Max flow) Any flow can be decomposed into s-t paths and cycles. Cycles can be ignored for the purpose of finding the max.
Max flow is equal to the min cut.
Ford-Fulkerson - iterate on the "residual" graph (arcs along which flow can still be added, including reverse arcs that undo existing flow). Find an augmenting s-t path and push flow along it.
This is not guaranteed to converge if capacities are real numbers! If they are integers/rationals, it will finish, but it can be very slow, and the runtime depends on the exact arc capacities - unless we are smart about how we pick the paths (shortest, adding the max increment, ...).
Capacity scaling: like FF, but only use residual arcs with capacity ≥ D until there's no augmenting path. Then divide D by 2, and repeat while D ≥ 1. Finishes in O(m²(1 + lg U)).
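A minimal sketch of Ford-Fulkerson with BFS-chosen augmenting paths (i.e. the Edmonds-Karp variant); the graph representation is my own choice:

```python
from collections import deque, defaultdict

def max_flow(capacity, source, sink):
    """capacity: dict of dicts, capacity[u][v] = arc capacity."""
    # Residual capacities; reverse arcs start at 0.
    residual = defaultdict(lambda: defaultdict(int))
    for u in capacity:
        for v, c in capacity[u].items():
            residual[u][v] += c
            residual[v][u] += 0

    flow = 0
    while True:
        # BFS for an s-t path with leftover capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Reconstruct the path, find the bottleneck, push flow,
        # and update forward/reverse residual arcs.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

graph = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
print(max_flow(graph, "s", "t"))  # 5
```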
Min-cost flow - each edge has a capacity and a cost; we are looking for a max flow minimizing ∑_{(a,b)∈E} f(a,b)·c(a,b).
Can use max flow to find max matching in a bipartite graph.
Can use idea similar to FF to find min-cost perfect matching. Construct a different type of residual graph, look for cycles of negative weight, improve. It's called Klein's Cycle Cancelling Algorithm.
Goldberg-Tarjan: apply negative cycles to find a min-cost max-flow by choosing the negative-cost cycle that minimizes c_H(C)/|C| (the mean cost of the cycle).
Linear-Algebraic Techniques: Understanding Principal Components Analysis
PCA (Principal Component Analysis): express m n-dimensional vectors as linear combinations of k n-dimensional vectors. Choose the "best" such vectors - the ones maximizing the variance of the projected data. Helps with visualizing and interpreting the data.
The principal components are the eigenvectors with the biggest eigenvalues of XᵀX, the mean-centered data matrix multiplied on the left by its own transpose (i.e. the covariance matrix, up to scaling). Geometrically, these are the directions of the biggest stretch.
Each principal component best "fits" the data (minimize avg squared distance to the line) while orthogonal to previous components.
PCA is a dimensionality reduction technique that helps visualizing/interpreting/compressing the data.
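A sketch of PCA via eigendecomposition of the covariance matrix, using numpy (the synthetic data is just for illustration):

```python
import numpy as np

def pca(X, k):
    """Top-k principal components via eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                 # mean-center the data
    cov = Xc.T @ Xc / len(Xc)               # covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1][:k]   # largest eigenvalues first
    components = eigvecs[:, order]          # n x k
    return Xc @ components, components      # projected data, directions

rng = np.random.default_rng(0)
# Synthetic 2-D data stretched mostly along one direction.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.5], [0.5, 1.0]])
projected, components = pca(X, 1)
print(components.T)   # roughly the direction of the biggest stretch
```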
Sampling and Estimation
Reservoir sampling - sample k elements uniformly from a stream of unknown size. In one pass.
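A sketch of the classic one-pass reservoir algorithm (Algorithm R):

```python
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from a stream of unknown length, in one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly random element of the reservoir.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))
```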
Markov - "at most 10% of the population can have an income that is more than 10× the average income of the population"
Chebyshev - "the probability that a random variable is more than c standard deviations from its expectation, is at most 1/c2"
In general, to estimate the expectation of a 0/1 random variable to error ±ε, one needs roughly O(1/ε2) independent samples
Importance sampling - oversample the "important" subset (e.g. the heavy tail), then reweight the samples to keep the estimator unbiased. Keeps the expectation, but reduces the variance.
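A sketch estimating a rare tail probability of a standard normal; the proposal distribution and constants are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def normal_pdf(x, mean):
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

# Goal: estimate P(X > 4) for X ~ N(0, 1) -- a rare tail event (~3.2e-5).
# Naive Monte Carlo: almost no samples land in the tail.
naive = (rng.normal(size=n) > 4).mean()

# Importance sampling: sample from N(4, 1), which oversamples the tail,
# then reweight each sample by p(x)/q(x) so the estimator stays unbiased.
x = rng.normal(loc=4.0, size=n)
weights = normal_pdf(x, 0.0) / normal_pdf(x, 4.0)
importance = ((x > 4) * weights).mean()

print("naive:", naive, "importance:", importance)  # true value ≈ 3.17e-05
```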
Good-Turing frequency estimation - "An estimate of the probability that the next word you read is a new word that you haven’t seen before, is the number of words that you have seen exactly once, divided by the total number of words that you have seen."
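The estimate itself is a one-liner; a small sketch:

```python
from collections import Counter

def good_turing_new_word_probability(words):
    """P(next word is unseen) ≈ (# words seen exactly once) / (total words seen)."""
    counts = Counter(words)
    seen_once = sum(1 for c in counts.values() if c == 1)
    return seen_once / len(words)

text = "the cat sat on the mat with another cat".split()
print(good_turing_new_word_probability(text))  # 5 singletons / 9 words ≈ 0.56
```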
Markov process - initial state, set of states, transition probability. Markov property - the transition depends only on the current state; no memory.
Estimate probability of being in certain state after certain number of steps either directly (multiply some matrices) or by random sampling - Markov Chain Monte Carlo (MCMC).
Example - PageRank's random walk.
Fundamental Theorem of Markov Chains - if there's a path from any state to any other (irreducible) and the chain is aperiodic, then the state probabilities converge to some distribution, independent of time and of the initial state. It's called the stationary distribution.
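A small sketch of the direct computation (repeated multiplication by the transition matrix), with a made-up 3-state chain, showing convergence to the same distribution from two different starting states:

```python
import numpy as np

# Transition matrix: P[i][j] = Pr(next = j | current = i).
P = np.array([[0.9, 0.1, 0.0],
              [0.4, 0.4, 0.2],
              [0.1, 0.3, 0.6]])

# The distribution after t steps is pi_0 @ P^t.
pi = np.array([1.0, 0.0, 0.0])    # start in state 0
for _ in range(1000):
    pi = pi @ P
print(pi)                          # converges to the stationary distribution

# Same answer from a different starting state (independence of the start).
pi2 = np.array([0.0, 0.0, 1.0])
for _ in range(1000):
    pi2 = pi2 @ P
print(pi2)                         # ≈ the same vector
```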
The Fourier Transform and Convolution