Statistical Methods for Machine Learning

Larry Wasserman’s course

review

A review of Larry Wasserman’s course on Statistical Methods for Machine Learning, covering key concepts and lectures on various topics in machine learning and statistics.

Published

Tuesday, June 10, 2025

Keywords

course review

Review

lesson notes

Statistics versus ML
Concentration
Probability
Basic Statistics
Minimaxity
Regression
Classification

Theorem 1 (Hoeffding’s inequality) If Z_1, Z_2, \ldots, Z_n are iid with mean \mu and P(a \le Z_i \le b) = 1, then for any \epsilon > 0, P(|Z_n − \mu| > \epsilon) \le 2e^{-2n\epsilon^2/(b-a)^2}, where Z_n = \frac{1}{n}\sum_{i=1}^n Z_i.

Definition 1 (VC Dimension) Let A be a class of sets. If F is a finite set, let s(A, F) be the number of subset of F ‘picked out’ by A.

Define the growth function: s_n(A) = \sup_{|F|=n} s(A, F).

Note that s_n(A) \le 2^n.

The VC dimension of a class of set A is VC(A) = \sup\{n : s_n(A) = 2^n\}. \qquad \tag{1}

If the VC dimension is finite, then there is a phase transition in the growth function from exponential to polynomial:

Theorem 2 (Sauer’s Theorem) Suppose that A has finite VC dimension d. Then, \forall n \ge d, s(A, n) \le( \frac{en}{d})^d. \qquad

Given data Z_1, \ldots, Z_n \sim P. The empirical measure Pn is Pn(A) = \frac{1}{n}\sum_{i=1}^n I(Z_i \in A).

Theorem 3 (Vapnik and Chervonenkis) Let A be a class of sets. For any t > \sqrt{2/n}, P(\sup_{A \in A} |P_n(A) − P(A)| > t) \le 4 s(A, 2n) e^{-nt^2/8} (4) and hence, with probability at least 1 − \delta, \sup_{A \in A} |P_n(A) − P(A)| \le \sqrt{8/n \log( 4 s(A, 2n)/\delta)} (5). Hence, if A has finite VC dimension d then \sup_{A \in A} |P_n(A) − P(A)| \le \sqrt{8/n (\log( 4/\delta)+ d \log(n e/d))} \qquad

Theorem 4 Bernstein’s inequality is a more refined inequality than Hoeffding’s inequality. It is especially useful when the variance of Y is small.

Suppose that Y_1, \ldots, Y_n are iid with mean \mu, \text{Var}(Y_i) \le \sigma^2 and |Y_i| \le M. Then P(|Y - \mu| > \epsilon) \le 2 \exp\{- n \epsilon^2 / (2\sigma^2 + 2M \epsilon / 3)\} \qquad

It follows that P(|Y - \mu| > t \sqrt{n} \epsilon + \epsilon \sigma^2 (1 - c)) \le e^{-t} for small enough \epsilon and c.