# MIT 6.854/18.415: Advanced Algorithms + Stanford CS168: The Modern Algorithmic Toolbox

These two courses, [MIT 6.854/18.415: Advanced Algorithms](http://people.csail.mit.edu/moitra/854.html) and [Stanford CS168: The Modern Algorithmic Toolbox,](http://web.stanford.edu/class/cs168/index.html) overlap a lot, and I studied them together. Where they overlap, they complement each other well: the MIT course has video recordings of the lectures, but the Stanford course has more interesting (IMO) homeworks.

[Videos](https://www.youtube.com/playlist?list=PL6ogFv-ieghdoGKGg2Bik3Gl1glBTEu8c).

[My homeworks](https://github.com/ainzzorl/stanford-cs168-modern-algorithmic-toolbox) (just the coding questions).

I was familiar with some of the topics in the courses; I skipped the lectures where I thought I already knew the topic sufficiently well.

## Hashing, Load Balancing

* <http://people.csail.mit.edu/moitra/docs/6854lec1.pdf>

* <http://people.csail.mit.edu/moitra/docs/6854lec2.pdf>

* <http://people.csail.mit.edu/moitra/docs/6854lec3.pdf>

* <http://people.csail.mit.edu/moitra/docs/6854lec4.pdf>

* <http://web.stanford.edu/class/cs168/l/l1.pdf>

* <https://web.stanford.edu/class/cs168/l/l2.pdf>

* [Akamai Paper](https://www.akamai.com/uk/en/multimedia/documents/technical-publication/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf)

* Recipe for a "universal" hash family: the family h\_{a,b}(x) = ((ax + b) mod p) mod n, with p prime and a != 0, works just as well as a completely random hash function. A family of hash functions H is 2-universal if, for any pair x != y, Pr\[h ∈ H | h(x) = h(y)] ≤ 1/n.
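This recipe can be sketched in Python (function and parameter names are mine, not from the lectures):

```python
import random

def make_universal_hash(p, n):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod n from the 2-universal family.

    p must be a prime larger than any key; a is drawn from 1..p-1 (a != 0),
    b from 0..p-1.
    """
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

h = make_universal_hash(p=2**31 - 1, n=100)
```

Each call to `make_universal_hash` samples a fresh function; collisions between any fixed pair of keys then happen with probability at most 1/n over that random choice.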

* Power of 2 choices: if we distribute n tasks to n servers randomly, the max load on one server will be Θ(log n / log log n). If we instead pick 2 random servers and assign to the less loaded one, it's Θ(log log n), which is a lot better.
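A quick simulation sketch of this (the balls-and-bins setup and names are mine):

```python
import random

def max_load(n, choices, seed=0):
    """Throw n balls into n bins. Each ball inspects `choices` random bins
    and goes into the least loaded of them; return the maximum bin load."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        picks = [rng.randrange(n) for _ in range(choices)]
        best = min(picks, key=lambda i: bins[i])
        bins[best] += 1
    return max(bins)

one_choice = max_load(100_000, choices=1)
two_choices = max_load(100_000, choices=2)
```

With a single random choice the max load grows like log n / log log n; with two choices it should come out noticeably smaller.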

* Consistent hashing: assign machines randomly to points on the \[0, 1] interval. Each request is also hashed to a point and routed to the machine on the right, with wrapping. The idea is that when we add or remove machines, we don't need to redistribute things too much.
  * Caches and pages are mapped to the same space. There are no "indexes" for caches or anything.
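A minimal sketch of the ring (class and helper names are mine; SHA-256 stands in for the random assignment to \[0, 1)):

```python
import hashlib
from bisect import bisect_right

def point(key):
    """Map any string to a pseudo-random point in [0, 1)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ConsistentHashRing:
    def __init__(self, machines):
        # Machines and requests are hashed into the same [0, 1) space.
        self.ring = sorted((point(m), m) for m in machines)

    def route(self, request_key):
        # Route to the first machine to the right of the request's point,
        # wrapping around the end of the interval.
        i = bisect_right(self.ring, (point(request_key), ""))
        return self.ring[i % len(self.ring)][1]
```

Adding a machine inserts one new point on the ring, so only the keys in the arc just before it move; everything else keeps its old machine.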

* To answer queries on big data (streams) or get some statistics, we can sacrifice some accuracy for big gains in memory. It's lossy compression.
  * E.g. Bloom filters, count-min sketch (estimate frequency of element in a stream), algs to estimate number of distinct elements.

* Count-min sketch - estimate the frequency of any element in a stream. Choose l hash functions, each mapping to b buckets. Hash each element with each function and increment the corresponding counters. Estimate an element's frequency as the minimum of its counters across all hashes.
  * A huge practical improvement can be achieved by conservative updates: only increment an element's counters that equal the current minimum across its buckets. This never misses anything (still guaranteed to overestimate the frequency), but makes the estimate a lot closer to the truth. At least this is what happened in the homework.
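A sketch covering both update rules (dimensions and names are mine, not from the course):

```python
import random

class CountMinSketch:
    def __init__(self, buckets, num_hashes, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        self.p = 2**61 - 1  # prime for the h_{a,b} hash family
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(num_hashes)]
        self.counts = [[0] * buckets for _ in range(num_hashes)]

    def _cells(self, x):
        key = hash(x)
        return [(row, ((a * key + b) % self.p) % self.buckets)
                for row, (a, b) in enumerate(self.hashes)]

    def add(self, x, conservative=True):
        cells = self._cells(x)
        smallest = min(self.counts[r][c] for r, c in cells)
        for r, c in cells:
            if conservative:
                # Only bump counters equal to the current minimum;
                # the estimate (the min) still goes up by one.
                if self.counts[r][c] == smallest:
                    self.counts[r][c] = smallest + 1
            else:
                self.counts[r][c] += 1

    def estimate(self, x):
        return min(self.counts[r][c] for r, c in self._cells(x))
```

With the conservative rule, counters that are already above the minimum were inflated by collisions, so leaving them alone keeps the overestimate guarantee while shrinking the error.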

## Data with Distances (Similarity Search, Nearest Neighbor, Dimension Reduction, LSH)

* <http://people.csail.mit.edu/moitra/docs/6854lec4.pdf>

* <http://people.csail.mit.edu/moitra/docs/6854lec5.pdf>

* <http://people.csail.mit.edu/moitra/docs/6854lec6.pdf>

* <https://web.stanford.edu/class/cs168/l/l2.pdf>

* <https://web.stanford.edu/class/cs168/l/l3.pdf>

* <https://web.stanford.edu/class/cs168/l/l4.pdf>

* High data dimensionality is bad - even calculating the distance between rows is too expensive. But the data can be reduced/embedded into a lower dimension rather easily (Johnson-Lindenstrauss). Basically, we "hash" (estimate) each point and use the result as a proxy for comparison. Repeat X times to boost confidence. JL works only for preserving Euclidean distances! If you have other metrics - not so much.
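A pure-Python sketch of the random-projection version of JL (the sizes and names are my choices):

```python
import math
import random

def jl_project(points, k, seed=0):
    """Project d-dimensional points down to k dimensions by taking inner
    products with k random Gaussian directions, scaled by 1/sqrt(k).
    Pairwise Euclidean distances are preserved up to (1 +/- eps) with high
    probability once k = O(log n / eps^2)."""
    rng = random.Random(seed)
    d = len(points[0])
    directions = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    return [[sum(g * x for g, x in zip(direction, p)) / math.sqrt(k)
             for direction in directions]
            for p in points]
```

Each projected coordinate is one "hash" of the point; averaging k of them is what boosts the confidence.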

* Dimensionality reduction is a kind of lossy compression, like the count-min sketch.

* Every exact algorithm for finding the nearest neighbor is exponential in the number of dimensions - in runtime, space, or both. High-dimensional spaces sort of lack geometry.

* kd-trees - partition space into a binary tree to find the nearest neighbor. Kind of like a Voronoi diagram in arbitrary dimensions. Exponential in the number of dimensions because we can't always discard one branch.

* LSH (locality-sensitive hashing) - can quickly find a nearest neighbor (not precisely, though) by using hashes that are likely to collide for similar elements and unlikely to collide for distant ones.
  * Use k different hashes, each being locality sensitive. In preprocessing, hash each point with each hash and put it into the corresponding bucket (each hash has its own bucket table).
  * Query: hash the query point, compare it to all elements in all matching buckets.
  * Can tune characteristics by changing the cardinality of each hash (bigger - fewer FPs) and number of hashes (bigger - fewer FNs).
  * Each hash is usually a combination of hashes, this allows us to boost size of each hash and reduce FPs.
  * Min-hash: estimate Jaccard similarity by using random permutations. Apply a random permutation, then MinHash(A) = min(perm(A)). The probability of two sets having the same hash is exactly their Jaccard similarity. Repeat N times (that's why we need permutations - so we can hash the same thing many times differently) to boost accuracy.
  * What does "min" mean? We just assign some values to the elements - some random hash to ints. The idea is that this "min" gives us Jaccard similarity: it is the same for two sets iff the minimum element belongs to their intersection.
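A sketch using the h\_{a,b} family in place of full random permutations (a standard substitution; the parameter names are mine):

```python
import random

P = 2**61 - 1  # prime modulus for the hash family

def make_minhash_coeffs(num_hashes, seed=0):
    """One (a, b) pair per simulated 'permutation'."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_hashes)]

def minhash_signature(items, coeffs):
    """One signature coordinate per hash: the minimum of h_{a,b} over the set."""
    return [min((a * hash(x) + b) % P for x in items) for a, b in coeffs]

def estimate_jaccard(sig_a, sig_b):
    # Coordinates agree with probability equal to the Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

With 200 hashes the standard error of the estimate is around 0.03, so the sets below (true Jaccard 1/3) come out close to 0.33.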

## Max Flow, Min Cost

* <http://people.csail.mit.edu/moitra/docs/6854lec7.pdf>

* <http://people.csail.mit.edu/moitra/docs/6854lec8.pdf>

* (Max flow) Any flow can be decomposed into s-t paths and cycles. Cycles can be ignored for the purpose of finding the max.

* Max flow is equal to the min cut.

* Ford-Fulkerson - iterate in the "residual" graph (arcs that can be added, including reversing). Find path increasing the flow and add it.

* This is not guaranteed to converge if capacities are real numbers! If they are integers/rationals, it will terminate, but can be very slow, depending on the exact arc capacities - unless we are smart about how we pick the augmenting paths (shortest, maximum increment, ...).
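A sketch of the "shortest path" variant (Edmonds-Karp), with the graph represented as a dict of dicts (the representation and names are my choices):

```python
from collections import deque

def max_flow(capacity, s, t):
    """Ford-Fulkerson with BFS (Edmonds-Karp): always augment along a
    shortest path in the residual graph, which bounds the number of
    augmentations by O(V * E) regardless of the capacities."""
    # Residual graph: forward arcs start at full capacity, reverse arcs at 0.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    total = 0
    while True:
        # BFS for a shortest s-t path using only positive-capacity arcs.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return total  # no augmenting path left: the flow is maximal
        # Push the bottleneck amount along the path, updating reverse arcs.
        path = []
        v = t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck
```

The reverse arcs are what let a later augmenting path "undo" flow pushed earlier.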

* Capacity Scaling: like Ford-Fulkerson, but only look for edges in the residual graph with capacity ≥ D until there's no path. Then divide D by 2, and repeat while D ≥ 1. Finishes in O(m²(1 + lg U)).

* Min-cost flow - each edge has a capacity and a cost; we are looking for a max flow minimizing ∑\_{(a,b)∈E} f(a, b)·c(a, b).

* Can use max flow to find max matching in a bipartite graph.

* Can use an idea similar to FF to find a min-cost perfect matching: construct a different type of residual graph, look for cycles of negative weight, improve. It's called Klein's Cycle Cancelling Algorithm.

* Goldberg-Tarjan: apply negative cycles to find the min-cost max-flow by choosing the negative-cost cycle that minimizes c(C)/|C| (the mean cost of the cycle).

## Linear-Algebraic Techniques: Understanding Principal Components Analysis

* <http://web.stanford.edu/class/cs168/l/l7.pdf>

* <http://web.stanford.edu/class/cs168/l/l8.pdf>

* PCA (Principal Component Analysis): express m n-dimensional vectors as linear combinations of k n-dimensional vectors. Choose the "best" such vectors - the ones maximizing the variance. Helps with visualizing and interpreting the data.

* The principal components are the eigenvectors with the biggest eigenvalues of AᵀA, where A is the centered data matrix - i.e., of the covariance matrix. Geometrically, these are the directions of the biggest stretch.

* Each principal component best "fits" the data (minimizing the average squared distance to the line) while being orthogonal to the previous components.

* PCA is a dimensionality reduction technique that helps with visualizing, interpreting, and compressing the data.
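A small numpy sketch of the eigendecomposition view (the synthetic data and names are mine): points spread along the (1, 1) direction, whose top component should come out close to (1, 1)/√2.

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components (as columns) and the data
    projected onto them, via an eigendecomposition of the covariance matrix."""
    A = X - X.mean(axis=0)                           # center the data first
    cov = A.T @ A / len(A)                           # covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending eigenvalues
    components = eigenvectors[:, ::-1][:, :k]        # biggest eigenvalues first
    return components, A @ components

rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
# Signal along (1, 1) plus a little orthogonal noise.
X = t @ np.array([[1.0, 1.0]]) + 0.05 * rng.normal(size=(500, 2))
components, projected = pca(X, k=1)
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, hence the column reversal to pick the directions of biggest stretch.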

## Sampling and Estimation

* <http://web.stanford.edu/class/cs168/l/l13.pdf>

* <http://web.stanford.edu/class/cs168/l/l14.pdf>

* Reservoir sampling - sample k elements uniformly from a stream of unknown size. In one pass.
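A sketch of the standard algorithm (function name is mine):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform sample of k items from a stream of unknown length, in one pass.
    Item number i (0-indexed) enters the reservoir with probability k/(i+1),
    evicting a uniformly random current member."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = rng.randrange(i + 1)  # uniform in [0, i]
            if j < k:
                reservoir[j] = x
    return reservoir
```

A short induction shows every item ends up in the reservoir with probability exactly k/N, where N is the final stream length.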

* Markov - "at most 10% of the population can have an income that is more than 10× the average income of the population"

* Chebyshev - "the probability that a random variable is more than c standard deviations from its expectation, is at most 1/c²"

* In general, to estimate the expectation of a 0/1 random variable to error ±ε, one needs roughly O(1/ε²) independent samples

* Importance sampling - oversample the "important" subset (e.g. the heavy tail), then adjust the weights to keep the estimator unbiased. Keeps the expectation, but reduces the variance

* Good-Turing frequency estimation - "An estimate of the probability that the next word you read is a new word that you haven’t seen before, is the number of words that you have seen exactly once, divided by the total number of words that you have seen."
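The quoted estimate is a one-liner (function name is mine):

```python
from collections import Counter

def good_turing_unseen(words):
    """Good-Turing estimate of the probability that the next word is new:
    (# distinct words seen exactly once) / (total words seen)."""
    counts = Counter(words)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(words)
```

For example, after seeing "a a b c", the words "b" and "c" are singletons, so the estimated chance the next word is new is 2/4.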

* Markov process - initial state, set of states, transition probabilities. Markov property - the next transition depends only on the current state; no memory.

* Estimate the probability of being in a certain state after a certain number of steps either directly (multiply some matrices) or by random sampling - Markov Chain Monte Carlo (MCMC).

* Example - PageRank's random walk.

* Fundamental Theorem of Markov Chains - if there's a path from any state to any other (irreducible), and the chain is aperiodic, then the probabilities converge to a distribution independent of time and of the initial state, called the stationary distribution.
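A sketch of the direct "multiply some matrices" approach: push any starting distribution through the chain until it stops changing (the example chain is mine):

```python
def stationary_distribution(P, iterations=500):
    """Repeatedly apply the transition matrix P (row-stochastic, as nested
    lists) to a uniform starting distribution. For an irreducible, aperiodic
    chain this converges to the unique stationary distribution."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist
```

For the two-state chain below, the balance equation 0.1·π₀ = 0.5·π₁ gives π = (5/6, 1/6), and the iteration converges there quickly.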

### The Fourier Transform and Convolution

* <http://web.stanford.edu/class/cs168/l/l15.pdf>

[Fourier Transform](https://wiki.ainzzorl.lol/math-1/fourier-transform)
