📖 All of Statistics – Oren Bochman’s Blog

There are lies, damned lies, and statistics. - Mark Twain

After I became a data scientist I wanted to squeeze more insights out of the data and to develop better models from it. I kept going back to my statistics and probability textbook.

A horror story of statistics

But that textbook was all theorems and no … practical advice. The author had been a friendly professor. He told us that we should buy his book as it would cover the material in the course. I was broke, but I scraped together the money to buy the book. About three weeks into the course he became the department head for CS and Mathematics and never taught us again. He assigned a visiting postdoc to teach us. She was smart and sweet, but she spoke only French, not a word of English or Hebrew, and the book was in Hebrew. I got by reading the book, while most of the class failed the first and second exams and had to retake it. I got a mediocre grade and had to do my military service.

Years later I wanted to understand why such a wonderfully printed book was so hard to read, so I looked at its single-page bibliography. It was a very short list. The first title that seemed relevant was a graduate-level textbook which stated in the intro “The aim of this book is to present a fundamental proof to the central limit theorems. The material is written so as to get the reader to the proof in the fastest way possible.” I looked at the table of contents and realized I had a trimmed translation of this text with a few appendixes for material from other courses. OK, there was some stats: the central limit theorem, the law of large numbers, and definitions of expectation and variance, as well as a plethora of other results scattered here and there.

The real disaster for me was that I desperately needed some statistics for my second and third year courses in Physics, but the course came a year too late. I eventually taught myself statistics, but I forever felt like an imposter in the field, even after tutoring friends and fellow students on the subject.

I eventually took another class on statistics some years later. There were no proofs, and the teacher was an emeritus professor charged with calculating academic bibliometrics like h-scores for all the academics in the university. He was shocked that every time he came up with a new concept I kept answering his questions without thinking, while my classmates seemed to be having a hard time. I had developed an uncanny intuition for the subject… But in reality I was just as shocked as he was.

So I keep the highlighted copy of the book around, at least until I get my money’s worth :-) but I ended up taking courses online to get to more advanced material.

Eventually I came across this book, which does what I had set out to do: it outlines the more practical theoretical results alongside the techniques of statistics, with exercises that give you opportunities to practice them so you can really start using them on a daily basis! And last but not least, it has code examples to use.

TL;DR - Too Long; Didn’t Read about Statistics

This is a no-nonsense introduction to statistics.
It isn’t a complete treatment, as the title suggests. However, when it came out it was more up to date than what you might expect to be taught in most statistics courses.

I recommend the chapters on inequalities and causality.

Here is a lighthearted Deep Dive into the book:

Glossary

This book uses lots of big terms, so let’s break them down so we can understand them better.

Sample Space (\Omega): The set of all possible outcomes of an experiment.
Outcome (\omega): A point or element in the sample space.
Event (A): A subset of the sample space \Omega.
Complement of A (A^c): The event that A does not occur.
Union (A \cup B): The event that A or B (or both) occur.
Intersection (A \cap B): The event that A and B both occur. Sometimes written as AB or (A,B). For a sequence of sets A_1, A_2, \ldots, \bigcap_{i=1}^{\infty} A_i = \{ \omega \in \Omega : \omega \in A_i \text{ for all } i \}.
Set Difference (A - B): The set of outcomes \omega such that \omega \in A and \omega \notin B.
Subset (A \subset B): Every element of A is also contained in B. Equivalently, B \supset A.
|A|: The number of elements in a finite set A.
Probability (P): A probability measure defined on a \sigma-algebra.
\sigma-algebra (or \sigma-field) (\mathcal{A}): A class of subsets of \Omega that satisfies three conditions: (i) \emptyset \in \mathcal{A}, (ii) if A_1, A_2, \ldots \in \mathcal{A} then \bigcup_{i=1}^{\infty} A_i \in \mathcal{A}, and (iii) A \in \mathcal{A} implies that A^c \in \mathcal{A}. The sets in \mathcal{A} are said to be measurable.
Measurable Space (\Omega, \mathcal{A}): A sample space \Omega together with a \sigma-algebra \mathcal{A}.
Probability Space (\Omega, \mathcal{A}, P): A measurable space (\Omega, \mathcal{A}) together with a probability measure P defined on \mathcal{A}.
Borel \sigma-field: The smallest \sigma-field that contains all the open subsets of the real line.
Bayes’ Theorem: A theorem that relates the conditional and marginal probabilities of events. For a partition A_1, \ldots, A_k of \Omega with P(A_i) > 0 and an event B with P(B) > 0, P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{k} P(B|A_j)P(A_j)}. P(A_i) is the prior probability, and P(A_i|B) is the posterior probability.
Estimation (Statistics): Using data to estimate an unknown quantity. Point estimation refers to providing a single “best guess” of some quantity of interest.
Learning (Computer Science): Using data to estimate an unknown quantity, for example finding a good classifier.
Classification (Statistics): Supervised learning, predicting a discrete Y from X.
Supervised Learning (Computer Science): Predicting a discrete Y from X.
Data (Statistics): Training sample (X_1, Y_1), \ldots, (X_n, Y_n).
Training Sample (Computer Science): (X_1, Y_1), \ldots, (X_n, Y_n).
Covariates (Statistics): The X_i’s.
Features (Computer Science): The X_i’s.
Classifier (Statistics): A hypothesis, a map from covariates to outcomes.
Hypothesis (Computer Science): A map h: X \rightarrow Y (a classifier).
Hypothesis (Statistics): A subset of a parameter space \Theta.
Confidence Interval: An interval C_n that contains an unknown quantity with a given frequency. Requires P_\theta(\theta \in C_n) \geq 1 - \alpha for all \theta \in \Theta.
Bayes Net: A directed acyclic graph representing a multivariate distribution with given conditional independence relations.
Bayesian Inference: Statistical methods for using data to update beliefs.
Frequentist Inference: Statistical methods with guaranteed frequency behavior.
Large Deviation Bounds: Uniform bounds on the probability of errors.
PAC Learning: Probably Approximately Correct learning, related to uniform bounds on probability of errors.
Point Estimation: Providing a single “best guess” of some quantity of interest, such as a parameter, cdf F, pdf f, regression function r, or a prediction.
Confidence Sets: A random set C_n, computed from the data, that contains the true value of a parameter with a specified coverage probability 1 - \alpha.
Hypothesis Testing: A procedure to decide between two or more competing statements about a population.
Parametric Model: A statistical model where the set of possible distributions is indexed by a finite number of parameters.
Nonparametric Model: A statistical model where the set of possible distributions is not indexed by a finite number of parameters.
CDF (Cumulative Distribution Function) (F_X(x)): For a random variable X, the probability that X takes on a value less than or equal to x, i.e. F_X(x) = P(X \leq x).
PDF (Probability Density Function) (f_X(x)): For continuous random variables, P(a \leq X \leq b) = \int_a^b f_X(x) dx.
IID (Independent and Identically Distributed) Samples: Random variables X_1, \ldots, X_n that are drawn from the same distribution and are independent of each other.
Statistic: A function T(X_n) of the data.
Sufficient Statistic: A statistic that contains all the information in the data about the parameter of interest.
MLE (Maximum Likelihood Estimator): An estimator that maximizes the likelihood function, which is the probability of observing the data given the parameters.
Bayes Estimator: An estimator that minimizes the Bayes risk.
Wald Test: A statistical test for hypothesis testing about parameters based on the asymptotic normality of estimators.
p-value: The probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.
Bootstrap: A resampling method used for statistical inference, such as estimating variance and constructing confidence intervals.
Nonparametric Curve Estimation: Methods for estimating functions (like density or regression functions) without assuming a specific parametric form.
Classification Rule (Classifier): A rule or algorithm that assigns an object to one of several predefined categories based on its features.
Bayes Classifier: The classification rule that minimizes the probability of error. For binary classification (0 or 1), h^*(x) = 1 if P(Y=1|X=x) > P(Y=0|X=x), and 0 otherwise. Equivalently, h^*(x) = 1 if \pi f_1(x) > (1-\pi) f_0(x), and 0 otherwise.
LDA (Linear Discriminant Analysis): A classification method where the decision boundary between classes is a linear function of the features.
Logistic Regression: A statistical model that uses a logistic function to model the probability of a binary outcome.
Support Vector Machines (SVM): A class of linear classifiers that aim to find the hyperplane that maximizes the margin between two classes. Can be extended to non-linear classification using kernel functions.
Kernelization: A technique used in SVM and other methods to implicitly map data into a higher-dimensional space to allow for non-linear decision boundaries.
Boosting: An ensemble learning technique that combines multiple weak classifiers to create a strong classifier. Can be thought of as a bias reduction technique.
Bagging: An ensemble learning technique that reduces the variance of a classifier by averaging predictions from multiple classifiers trained on different subsets of the data.
Markov Chain: A sequence of random variables where the future state depends only on the current state, not on the sequence of events that preceded it.
Stationary Distribution (\pi): A probability distribution for a Markov chain that remains unchanged in time. \pi P = \pi, where P is the transition matrix.
Detailed Balance: A condition for a Markov chain with stationary distribution \pi: \pi_i p_{ij} = p_{ji} \pi_j for all states i and j. Detailed balance guarantees that \pi is a stationary distribution.
MCMC (Markov Chain Monte Carlo): A class of algorithms for sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution. Examples include Metropolis-Hastings and Gibbs sampling.
Graphical Models: Statistical models that use graphs to represent the conditional dependence structure between random variables. Can be directed (DAGs) or undirected.
DAG (Directed Acyclic Graph): A directed graph with no directed cycles, used to represent probabilistic relationships and conditional independencies.
Undirected Graph: A graph where the edges between nodes have no direction, used to represent symmetric relationships and conditional independencies.
Conditional Independence: A concept where the independence between two random variables holds given a third variable. Represented using notation like X \perp Y | Z. In graphical models, this can be determined by d-separation in DAGs and graph separation in undirected graphs.
Clique: A set of variables in a graph that are all adjacent to each other.
Potential: Any positive function defined on a clique in an undirected graphical model.
Log-Linear Models: Statistical models used to analyze categorical data by modeling the logarithm of the expected frequencies as a linear combination of parameters, often related to fitting undirected graphical models.
Nonparametric Regression: Methods for estimating the relationship between a response variable and one or more predictor variables without assuming a specific parametric form for the regression function.
Kernel Density Estimation: A nonparametric method for estimating the probability density function of a random variable.
Smoothing: Techniques used to estimate an underlying function from noisy data, often used in nonparametric curve estimation.
Orthogonal Functions: A set of functions that are mutually orthogonal with respect to some inner product, used as basis functions in smoothing and density estimation.
Wavelets: A set of orthogonal functions that are localized in both time and frequency, used in smoothing, density estimation, and regression.
Simulation: Using computer-generated random numbers to approximate the properties of a statistical model or to sample from a complex distribution.
Bayes Risk: The expected value of the loss function when the parameter is considered a random variable with a prior distribution.
Minimax Rule: A decision rule that minimizes the maximum possible risk over all possible values of the parameter.
Admissibility: An estimator \delta_1 is admissible if no other estimator dominates it. An estimator \delta_2 dominates \delta_1 if R(\theta, \delta_2) \leq R(\theta, \delta_1) for all \theta and R(\theta, \delta_2) < R(\theta, \delta_1) for at least one \theta, where R is the risk function.
Causal Inference: The process of drawing conclusions about causal relationships from data, often involving concepts like potential outcomes and directed acyclic graphs.
Potential Outcomes (C_0, C_1): Random variables representing the outcome if a subject is not treated (X=0) and the outcome if the subject is treated (X=1).
Average Causal Effect: The average difference in potential outcomes.
Average Treatment Effect: Another term for the average causal effect.
d-separated: In a directed acyclic graph, two nodes A and B are d-separated by a set of nodes C if all paths between A and B are blocked by C. A path is blocked by C if it contains a collider (V-structure) such that neither the collider nor any of its descendants is in C, or a non-collider (chain or fork) that is in C.
d-connected: If two nodes are not d-separated.
Markov Equivalent (DAGs): Two DAGs are Markov equivalent if and only if they have the same skeleton (undirected graph obtained by ignoring the directions of the edges) and the same unshielded colliders (V-structures where the parents are not directly connected).
Pairwise Markov Graph: An undirected graph where an edge is omitted between two variables if they are independent given all other variables.
Clique (Maximal): A clique is maximal if it is not possible to include another variable and still be a clique.
Hierarchical Model (Log-Linear): A log-linear model where if a higher-order interaction term is included, then all lower-order terms involving subsets of those variables are also included.
Graphical Model (Log-Linear): A hierarchical log-linear model where the included interaction terms correspond to the cliques of an undirected graph, and the excluded terms correspond to conditional independencies implied by the graph.
Generator (Log-Linear Models): A concise way to represent a hierarchical log-linear model by specifying the highest-order interaction terms to be included. All lower-order terms are then automatically included.
Bias-Variance Tradeoff: The property of a set of statistical models whereby models with a lower bias tend to have a higher variance, and vice versa. Important in model selection and curve estimation.
Cross-Validation: A model evaluation technique where the data is divided into subsets, and the model is trained on some subsets and evaluated on the remaining subsets to estimate its performance on unseen data.
Error Rate (Classification): The proportion of misclassified instances.
Bayes Rule (Classification): The classification rule that achieves the minimum possible error rate.
Discriminant Function (LDA): A function that assigns a score to each class for a given observation, and the observation is classified to the class with the highest score.
Decision Boundary: The surface in the feature space that separates different classes according to a classification rule.
Gini Index: A measure of impurity used in decision tree algorithms.
Hyperplane (SVM): A flat affine subspace of dimension p-1 in a p-dimensional space, used by linear classifiers to separate classes.
Margin (SVM): The distance between the hyperplane and the closest data points from each class.
Kernel (SVM): A function that defines a similarity measure between data points and allows SVMs to learn non-linear decision boundaries by implicitly mapping the data to a higher-dimensional space.
Random Walk Metropolis-Hastings: A specific type of Metropolis-Hastings algorithm where the proposal distribution is centered at the current state.
Gibbs Sampling: An MCMC algorithm where each variable is sampled conditionally on the current values of all other variables.
Posterior Mean: The mean of the posterior distribution in Bayesian inference.
Posterior Interval: An interval that contains the true value of a parameter with a certain probability according to the posterior distribution.
Frequentist Approach: Statistical methods that interpret probability as a long-run frequency of events.
Maximum Likelihood (Frequentist): A method of estimating the parameters of a statistical model by finding the parameter values that maximize the likelihood of observing the data.
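
The Stationary Distribution and Detailed Balance entries above can be checked numerically. What follows is a minimal sketch, assuming a hypothetical two-state transition matrix whose values are chosen purely for illustration (they are not from the book). The helper names `step` and `satisfies_detailed_balance` are also made up for this example.

```python
# Hypothetical 2-state transition matrix (illustrative values); rows sum to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

# Candidate stationary distribution for this particular chain.
pi = [5 / 6, 1 / 6]

def step(dist, P):
    """One step of the chain: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def satisfies_detailed_balance(pi, P, tol=1e-12):
    """Check pi_i * p_ij == pi_j * p_ji for every pair of states."""
    n = len(P)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) <= tol
               for i in range(n) for j in range(n))

pi_next = step(pi, P)                      # should reproduce pi (up to rounding)
print(pi_next)
print(satisfies_detailed_balance(pi, P))
```

Here detailed balance holds, which is why one step of the chain leaves \pi unchanged.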

Part I: Probability

Here is an outline of the chapters in Part I:

1 Probability
- 1.1 Introduction
  - Probability is defined as a mathematical language for quantifying uncertainty.
  - This chapter introduces the basic concepts underlying probability theory.
  - It begins with the sample space, which is the set of possible outcomes.
- 1.2 Sample Spaces and Events
  - The sample space \Omega is the set of possible outcomes of an experiment.
  - Points \omega in \Omega are called sample outcomes, realizations, or elements.
  - Subsets of \Omega are called Events.
  - Example 1.1: Tossing a coin twice gives \Omega = \{HH, HT, TH, TT\}. The event that the first toss is heads is A = \{HH, HT\}.
- 1.3 Probability
  - This section introduces the definition and axioms of probability.
  - Definition 1.5: A probability measure P is a function that assigns a number P(A) \in [0, 1] to each event A \subset \Omega and satisfies three axioms:
    - Axiom 1: P(A) \geq 0 for all events A.
    - Axiom 2: P(\Omega) = 1.
    - Axiom 3: If A_1, A_2, \ldots are disjoint events, then P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i).
- 1.4 Probability on Finite Sample Spaces
  - This section discusses how to calculate probabilities when the sample space contains a finite number of equally likely outcomes.
  - If \Omega = \{ \omega_1, \ldots, \omega_n \}, then P(A) = \frac{|A|}{|\Omega|} for any event A \subset \Omega.
- 1.5 Independent Events
  - Defines independent events: A and B are independent if P(A \cap B) = P(A)P(B).
- 1.6 Conditional Probability
  - Introduces conditional probability: for P(B) > 0, P(A|B) = \frac{P(A \cap B)}{P(B)}.
- 1.7 Bayes’ Theorem
  - Theorem 1.17 (Bayes’ Theorem) is presented: Let A_1, \ldots, A_k be a partition of \Omega such that P(A_i) > 0 for each i. If P(B) > 0, then for each i = 1, \ldots, k, P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{k} P(B|A_j)P(A_j)}.
  - Remark 1.18: P(A_i) is called the prior probability of A_i, and P(A_i|B) is called the posterior probability of A_i.
  - Example 1.19: An email categorization example is provided, calculating the probability that an email containing the word “free” is spam using Bayes’ Theorem.
- 1.8 Bibliographic Remarks
  - Several books are mentioned as further reading:
  - DeGroot and Schervish (2002),
  - Grimmett and Stirzaker (1982),
  - Karr (1993),
  - Billingsley (1979), and
  - Breiman (1992).
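
Bayes’ Theorem from section 1.7 is easy to try out in code. The sketch below applies the formula to an email-categorization setup in the spirit of Example 1.19. The three categories, the priors, and the likelihoods are illustrative assumptions and may not match the book’s exact numbers, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a partition A_1..A_k.

    priors[i]      = P(A_i)
    likelihoods[i] = P(B | A_i)
    Returns the list of posteriors P(A_i | B).
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(B), by the law of total probability
    return [j / evidence for j in joint]

# Assumed values: categories are spam, low priority, high priority.
priors = [0.7, 0.2, 0.1]
likelihoods = [0.9, 0.01, 0.01]  # P(email contains "free" | category)

post = posterior(priors, likelihoods)
print(round(post[0], 3))  # → 0.995
```

With these assumed inputs, seeing the word “free” moves the spam probability from the prior 0.7 to a posterior of about 0.995.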
2 Random Variables
- 2.1 Introduction
  - A random variable is a mapping that assigns a real number X(\omega) to each outcome \omega in the sample space.
- 2.2 Distribution Functions and Probability Functions
  - Covers the cumulative distribution function (CDF) F_X(x) = P(X \leq x).
  - Discusses probability functions f_X(x) = P(X=x) for discrete random variables.
  - Example 2.6 provides a probability function for flipping a coin twice: f_X(x) = \begin{cases} 1/4 & x = 0 \\ 1/2 & x = 1 \\ 1/4 & x = 2 \\ 0 & \text{otherwise} \end{cases}.
  - Figure 2.2 illustrates this probability function.
- 2.3 Some Important Discrete Random Variables
  - Point Mass Distribution.
  - The Discrete Uniform Distribution.
  - The Bernoulli Distribution.
  - The Binomial Distribution.
  - The Geometric Distribution.
  - The Poisson Distribution.
  - The text mentions that for continuous random variables, it’s more convenient to define things in terms of the pdf, as the probability of a continuous random variable taking a specific value is zero.
- 2.4 Some Important Continuous Random Variables
  - This section details important continuous distributions.
  - Normal Distribution: Denoted as X \sim N(\mu, \sigma^2), with density f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.
    - Standard Normal distribution Z \sim N(0, 1) has density \phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2} and CDF \Phi(z) = P(Z \leq z).
    - P(a < X < b) = \Phi(\frac{b - \mu}{\sigma}) - \Phi(\frac{a - \mu}{\sigma}) if X \sim N(\mu, \sigma^2).
    - Figure 2.4 shows the density of a standard Normal.
    - Example 2.17: If X \sim N(3, 5), find P(X > 1).
  - Uniform Distribution: X \sim Uniform(a, b) has density f(x) = \frac{1}{b-a} for a < x < b, and 0 otherwise.
    - Example 2.39 involves a Uniform(0, 1) distribution.
  - Exponential Distribution: X \sim Exp(\beta) has density f(x) = \frac{1}{\beta} e^{-x/\beta} for x > 0, \beta > 0, and 0 otherwise.
- 2.5 Multivariate Distributions
  - This section covers the joint distribution of two or more random variables.
  - Introduces the joint density f(x, y) for continuous random variables X and Y, which satisfies P(a < X < b, c < Y < d) = \int_a^b \int_c^d f(x, y) dy dx.
  - Marginal distributions can be obtained by integrating (or summing) the joint distribution over the other variables: f_X(x) = \int f(x, y) dy and f_Y(y) = \int f(x, y) dx.
  - Example 2.27 is mentioned in Example 2.38, where the marginal density f_Y(y) = y + (1/2) is used.
  - Example 2.38 involves finding a conditional probability P(X < 1/4 | Y = 1/3) using the conditional density f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.
- 2.6 Independence
  - Two random variables X and Y are independent if their joint distribution is the product of their marginal distributions: F_{X,Y}(x, y) = F_X(x) F_Y(y) or f_{X,Y}(x, y) = f_X(x) f_Y(y).
- 2.7 Conditional Probability
  - Extends conditional probability from events to random variables by defining conditional distributions and densities.
  - Example 2.38 shows the calculation of conditional probability for continuous random variables.
- 2.8 Bayes’ Theorem for Random Variables
  - Extends Bayes’ Theorem to relate conditional densities: f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) f_X(x)}{f_Y(y)}.
- 2.9 Multivariate Distributions and iid Samples
  - Covers the concept of multiple random variables considered together.
  - Introduces the idea of independent and identically distributed (iid) samples X_1, \ldots, X_n drawn from the same distribution.
- 2.10 Two Important Multivariate Distributions
  - Discusses specific multivariate distributions.
  - Multinomial Distribution: Introduced here and used again in Chapter 14.
  - Multivariate Normal Distribution: Also introduced here and used again in Chapter 14.
- 2.11 Transformations of Random Variables
  - Deals with finding the distribution of a function of one or more random variables (e.g., if Y = g(X), how to find F_Y or f_Y).
- 2.12 Transformations of Several Random Variables
  - Extends the transformation techniques to functions of multiple random variables.
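
Example 2.17 in section 2.4 asks for P(X > 1) when X \sim N(3, 5). The standardization formula P(a < X < b) = \Phi(\frac{b - \mu}{\sigma}) - \Phi(\frac{a - \mu}{\sigma}) can be evaluated with nothing more than the error function. This is a minimal sketch: `normal_cdf` is a hypothetical helper, not something the book provides.

```python
import math

def normal_cdf(x, mu=0.0, sigma2=1.0):
    """CDF of N(mu, sigma2), computed from the error function."""
    z = (x - mu) / math.sqrt(sigma2)   # standardize
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# X ~ N(3, 5): the second parameter is the variance, so sigma = sqrt(5).
p = 1.0 - normal_cdf(1.0, mu=3.0, sigma2=5.0)
print(round(p, 2))  # → 0.81
```

The easy mistake here is to plug in 5 as the standard deviation. In the N(\mu, \sigma^2) notation used in the outline it is the variance.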
3 Expectation
- 3.1 Expectation of a Random Variable
  - Introduces the concept of the expectation of a random variable.
  - The notation \int x dF(x) is used as a unifying notation for both discrete (\sum_x xf(x)) and continuous (\int xf(x)dx) random variables.
  - It is noted that the precise meaning of \int x dF(x) is discussed in real analysis courses.
  - The expectation E(X) is said to exist if \int |x|dF_X(x) < \infty. Otherwise, the expectation does not exist.
- 3.2 Properties of Expectations
  - Covers properties such as linearity of expectation, E(aX + bY) = aE(X) + bE(Y), and other fundamental rules.
- 3.3 Variance and Covariance
  - Defines variance and covariance: Var(X) = E[(X - E[X])^2] and Cov(X, Y) = E[(X - E[X])(Y - E[Y])].
- 3.4 Expectation and Variance of Important Random Variables
  - Provides the formulas for the expectation and variance of common probability distributions like Poisson, Normal, and Gamma.
- 3.5 Conditional Expectation
  - Introduces the concept of conditional expectation, E(X|Y=y) and E(X|Y).
- 3.6 Moment Generating Functions
  - Covers moment generating functions (MGFs), M_X(t) = E[e^{tX}], and their properties.
- 3.7 Appendix
  - This section might contain proofs or more technical details related to expectation, variance, covariance, conditional expectation, and moment generating functions.
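
The discrete form of the expectation, \sum_x x f(x), and the variance definition from section 3.3 can be checked on the two-coin-flip probability function from Example 2.6. The sketch below uses exact fractions. The helper `expectation` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction

# Probability function of X = number of heads in two fair coin flips.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * f(x) for a discrete X."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)
var = expectation(pmf, lambda x: (x - mean) ** 2)           # E[(X - E[X])^2]
var_shortcut = expectation(pmf, lambda x: x * x) - mean ** 2  # E[X^2] - (E[X])^2
print(mean, var, var_shortcut)  # → 1 1/2 1/2
```

Both variance formulas agree, as they must, and the values match the Binomial(2, 1/2) mean np = 1 and variance np(1 - p) = 1/2.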
4 Inequalities
- 4.1 Probability Inequalities
  - Introduces the concept that inequalities are useful for bounding quantities that might be hard to compute.
  - Mentions that inequalities will also be used in the theory of convergence (discussed in Chapter 5).
  - Markov’s Inequality: Provides an upper bound on the probability that a non-negative random variable exceeds a given value: P(X > t) \leq E(X)/t for t > 0. It is referred to as “Our first inequality”.
  - Chebyshev’s Inequality: Bounds the probability that a random variable deviates from its mean \mu by at least t: P(|X - \mu| \geq t) \leq \sigma^2/t^2, where \sigma^2 is the variance.
  - Jensen’s Inequality: This inequality relates the value of a convex function of the expected value of a random variable to the expected value of the convex function of the random variable.
  - Hoeffding’s Inequality: Mentioned as having a proof in the appendix. This inequality provides bounds on the probability that the sum of independent, bounded random variables deviates from its expected value.
  - Mill’s Inequality: Bounds the tails of a standard Normal Z: P(|Z| > t) \leq \sqrt{2/\pi} \, e^{-t^2/2} / t.
- 4.2 Inequalities For Expectations
  - Covers inequalities that bound or relate the expectations of random variables or of functions of random variables.
- 4.3 Bibliographic Remarks
  - This section will provide references to other resources for further reading on inequalities.
- 4.4 Appendix
  - Contains the Proof of Hoeffding’s Inequality.
  - Includes the proof of Theorem 4.4, which uses Markov’s inequality.
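
To get a feel for how much sharper Hoeffding’s inequality is than Chebyshev’s, it helps to compare both bounds with an exact tail probability. The sketch below assumes the special case of n iid Bernoulli(p) variables, where Chebyshev gives P(|\bar{X}_n - p| > \epsilon) \leq \frac{1}{4n\epsilon^2} and Hoeffding gives P(|\bar{X}_n - p| > \epsilon) \leq 2e^{-2n\epsilon^2}. The values n = 100, \epsilon = 0.2 and p = 0.5 are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def chebyshev_bound(n, eps):
    """Chebyshev bound for a Bernoulli mean, using Var(X_i) = p(1-p) <= 1/4."""
    return 1.0 / (4.0 * n * eps ** 2)

def hoeffding_bound(n, eps):
    """Hoeffding bound for the mean of n iid Bernoulli variables."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

def exact_tail(n, p, eps):
    """Exact P(|Xbar_n - p| > eps), summed from the Binomial(n, p) pmf."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if abs(k / n - p) > eps)

n, eps = 100, 0.2
print(chebyshev_bound(n, eps))   # roughly 0.0625
print(hoeffding_bound(n, eps))   # roughly 0.00067
print(exact_tail(n, 0.5, eps))   # smaller still
```

Neither bound needs to know p, which is what makes them useful. The price is that both overshoot the exact tail, Chebyshev by a much wider margin.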
5 Convergence of Random Variables
- 5.1 Introduction
  - This section likely introduces the concept of convergence for sequences of random variables.
- 5.2 Types of Convergence
  - This section will detail different modes of convergence for random variables. These typically include:
    - Convergence in probability: P(|X_n - X| > \epsilon) \to 0 as n \to \infty for all \epsilon > 0.
    - Almost sure convergence (or convergence with probability 1): P(\lim_{n \to \infty} X_n = X) = 1.
    - Convergence in r-th mean: E(|X_n - X|^r) \to 0 as n \to \infty for some r > 0 (commonly r=1 for convergence in mean and r=2 for convergence in mean square).
    - Convergence in distribution (or weak convergence): F_{X_n}(x) \to F_X(x) as n \to \infty for all x at which F_X is continuous, where F_{X_n} and F_X are the cumulative distribution functions of X_n and X, respectively.
- 5.3 The Law of Large Numbers
  - This section will likely cover fundamental theorems related to the convergence of sample averages to the population mean. This usually includes:
    - The Weak Law of Large Numbers (WLLN): For a sequence of independent and identically distributed (iid) random variables X_1, X_2, ..., X_n with mean \mu, the sample mean \bar{X}_n = (1/n) \sum_{i=1}^n X_i converges in probability to \mu.
    - The Strong Law of Large Numbers (SLLN): Under similar conditions, the sample mean \bar{X}_n converges almost surely to \mu.
- 5.4 The Central Limit Theorem
  - This crucial theorem describes the limiting distribution of the standardized sample mean. It typically states that for a sequence of iid random variables X_1, X_2, ..., X_n with mean \mu and variance \sigma^2 > 0, the standardized sample mean \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} converges in distribution to a standard Normal distribution N(0, 1).
- 5.5 The Delta Method
  - The delta method is a technique for approximating the distribution of a function of a random variable when the limiting distribution of the random variable is known (usually Normal). If \sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2) and g is a differentiable function with g'(\theta) \neq 0, then \sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{d} N(0, (g'(\theta))^2 \sigma^2). This is also mentioned in the context of parametric inference in Chapter 9.
- 5.6 Bibliographic Remarks
  - This section will provide references to other resources for further reading on the topic of convergence of random variables.
- 5.7 Appendix
  - This section might contain proofs or additional technical details related to the concepts discussed in the chapter.
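
A small simulation can illustrate the Central Limit Theorem statement in section 5.4. This is a sketch under assumptions that are not in the book’s outline: the Uniform(0, 1) draws, n = 50, the 4000 replications, and the fixed seed are all arbitrary choices, and `standardized_mean` is a hypothetical helper.

```python
import math
import random

def standardized_mean(n, rng):
    """sqrt(n) * (Xbar_n - mu) / sigma for n iid Uniform(0, 1) draws."""
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)  # mean and sd of Uniform(0, 1)
    xbar = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

rng = random.Random(0)   # fixed seed so the run is repeatable
n, reps = 50, 4000
z = [standardized_mean(n, rng) for _ in range(reps)]

# Under N(0, 1), about 95% of the mass lies in [-1.96, 1.96].
coverage = sum(abs(v) <= 1.96 for v in z) / reps
print(coverage)
```

If the Normal approximation is working, the printed fraction should land close to 0.95, even though the individual draws are uniform rather than Normal.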

Part II: Statistical Inference

6 Models, Statistical Inference and Learning
- 6.1 Introduction
  - Discusses the increasing availability of data and the need for statistical tools to analyze it.
  - Highlights the importance of understanding basic probability and mathematical statistics for data analysis, even when using advanced tools.
  - Mentions the book’s aim to cover a broad range of topics quickly, including modern concepts often in follow-up courses.
  - States that the book is for graduate or advanced undergraduate students in computer science, mathematics, statistics, and related disciplines, assuming knowledge of calculus and some linear algebra.
  - Figure 1 illustrates the relationship between probability (data generating process to observed data) and inference/data mining (observed data back to understanding the process).
- 6.2 Parametric and Nonparametric Models
  - Introduces the concept of a statistical model as a set of probability distributions that we assume our data comes from.
  - Defines a parametric model as a model that is indexed by a finite-dimensional parameter \theta belonging to a parameter space \Theta. It is written as F = \{f(x; \theta) : \theta \in \Theta\}.
  - Defines a nonparametric model as a model that is not indexed by a finite-dimensional parameter. The set of all continuous distribution functions is an example.
  - Provides an example of a parametric model: the set of all Normal distributions with mean \mu and variance \sigma^2, where \theta = (\mu, \sigma^2).
  - Explains that in parametric inference, the goal is to estimate \theta.
  - In nonparametric inference, the goal might be to estimate an unknown distribution function F or a regression function r(x) = E(Y|X=x) without assuming a specific parametric form.
  - Mentions frequentist and Bayesian approaches to statistical inference, noting that both will be covered, starting with frequentist inference.
  - Introduces the notation P_\theta(X \in A) = \int_A f(x; \theta)dx and E_\theta(r(X)) = \int r(x)f(x; \theta)dx for parametric models, emphasizing that \theta is fixed, not averaged over. V_\theta is used for variance.
- 6.3 Fundamental Concepts in Inference
  - States that many inferential problems fall into three categories: estimation, confidence sets, or hypothesis testing.
  - 6.3.1 Point Estimation
    - Defines point estimation as providing a single “best guess” of a quantity of interest.
    - The quantity can be a parameter \theta, a cdf F, a pdf f, a regression function r, or a future prediction Y.
  - 6.3.2 Confidence Sets
    - Explains that a confidence set C_n is a random set constructed from data such that it contains the true value of the parameter with a specified probability.
    - For a parameter \theta, a 1 - \alpha confidence set satisfies P_\theta(\theta \in C_n) \ge 1 - \alpha for all \theta \in \Theta, where 1 - \alpha is the coverage probability.
    - Discusses pointwise asymptotic confidence intervals, where lim inf_{n \to \infty} P_\theta(\theta \in C_n) \ge 1 - \alpha for all \theta \in \Theta.
    - Also mentions uniform asymptotic confidence intervals, where lim inf_{n \to \infty} inf_{\theta \in \Theta} P_\theta(\theta \in C_n) \ge 1 - \alpha.
  - 6.3.3 Hypothesis Testing
    - Describes hypothesis testing as a procedure for deciding between two competing claims about a population, the null hypothesis (H_0) and the alternative hypothesis (H_1).
    - Provides an example of testing the fairness of a coin: H_0 : p = 1/2 versus H_1 : p \ne 1/2, where p is the probability of heads.
    - Suggests using a test statistic, such as T = |\hat{p}_n - (1/2)|, to make a decision, where \hat{p}_n is the sample proportion of heads. The null hypothesis would be rejected if T is large.
- 6.4 Bibliographic Remarks
  - Lists several textbooks on statistical inference at elementary, intermediate, and advanced levels.
- 6.5 Appendix
  - Provides definitions for different types of confidence intervals: standard confidence interval, pointwise asymptotic confidence interval, and uniform asymptotic confidence interval.
7 Nonparametric Estimation of a CDF and Statistical Functionals
- 7.1 The Empirical Distribution Function
  - Introduces the empirical cumulative distribution function (ecdf) F̂_n(x) = (1/n) ∑_{i=1}^n I(X_i ≤ x).
  - Discusses the Glivenko-Cantelli theorem, which states that sup_x |F̂_n(x) − F(x)| → 0 almost surely.
  - Mentions the Dvoretzky-Kiefer-Wolfowitz inequality, which provides a bound on the probability that the supremum distance between F̂_n(x) and F(x) exceeds a certain value.
  - Example (Nerve Data) shows the empirical cdf for waiting times between nerve pulses. It mentions estimating the fraction of waiting times between .4 and .6 seconds using F̂_n(.6) - F̂_n(.4).
  - Figure 7.1 shows nerve data with the empirical distribution function and a 95 percent confidence band.
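To make the ecdf and the confidence band concrete, here is a minimal sketch. It assumes the usual band obtained from the Dvoretzky-Kiefer-Wolfowitz inequality, \hat{F}_n(x) ± \epsilon_n with \epsilon_n = \sqrt{\log(2/\alpha)/(2n)}, clipped to [0, 1]. The waiting times are made-up numbers, not the nerve data.

```python
import math

def ecdf(data):
    """Return F_hat(x) = (1/n) * #{i : X_i <= x} as a function of x."""
    xs = list(data)
    n = len(xs)
    def F_hat(x):
        return sum(1 for v in xs if v <= x) / n
    return F_hat

def dkw_band(F_hat, n, alpha=0.05):
    """Lower and upper band functions plus the half-width eps."""
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    def lower(x):
        return max(F_hat(x) - eps, 0.0)
    def upper(x):
        return min(F_hat(x) + eps, 1.0)
    return lower, upper, eps

waits = [0.21, 0.03, 0.05, 0.41, 0.58, 0.11, 0.47, 0.95, 0.33, 0.52]  # made-up values
F_hat = ecdf(waits)
lower, upper, eps = dkw_band(F_hat, len(waits))
frac_between = F_hat(0.6) - F_hat(0.4)   # estimated fraction of waits in (0.4, 0.6]
```

With only ten points the band is very wide (eps is about 0.43), which shows how slowly the band tightens, at rate 1/\sqrt{n}.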
- 7.2 Statistical Functionals
  - Defines a statistical functional as a map T(F) from a distribution function F to a real number.
  - Explains the plug-in principle: estimate T(F) by T(F̂_n).
  - Example (The Mean) illustrates the plug-in estimate for the mean \mu = E_F(X) = ∫ x dF(x), which is the sample mean \bar{X}_n = ∫ x dF̂_n(x) = (1/n) ∑_{i=1}^n X_i.
  - Example (The Variance) shows the plug-in estimate for the variance \sigma^2 = E_F(X − \mu)^2, which is \hat{\sigma}^2 = (1/n) ∑_{i=1}^n (X_i − \bar{X}_n)^2. The sample variance S_n^2 divides by n − 1 instead of n.
  - Example (The Median) defines the median m as F(m) = 1/2 and the plug-in estimate as the sample median F̂_n^{-1}(1/2).
  - Example (Trimmed Mean) defines the \alpha-trimmed mean and its plug-in estimate.
  - The pth sample quantile is defined as T(F̂_n) = F̂_n^{-1}(p).
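The plug-in principle amounts to treating the sample as if it were the whole population. A small sketch, assuming the convention F̂_n^{-1}(p) = the smallest x with F̂_n(x) ≥ p (other quantile conventions exist, for example averaging the two middle points for the median); the data are made up:

```python
import math

def plug_in_mean(data):
    return sum(data) / len(data)

def plug_in_variance(data):
    """(1/n) * sum (X_i - mean)^2: the variance of the empirical distribution."""
    m = plug_in_mean(data)
    return sum((x - m) ** 2 for x in data) / len(data)

def plug_in_quantile(data, p):
    """Smallest order statistic x with F_hat(x) >= p, for 0 < p <= 1."""
    xs = sorted(data)
    k = math.ceil(p * len(xs))      # at least k points must be <= x
    return xs[max(k, 1) - 1]

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # made-up sample
mean_hat = plug_in_mean(data)
var_hat = plug_in_variance(data)
median_hat = plug_in_quantile(data, 0.5)
```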
- 7.3 Standard Errors and Confidence Intervals
  - Discusses obtaining standard errors and confidence intervals for nonparametric estimates.
  - Mentions that formulas will be developed for parametric methods, but nonparametric settings require something else.
  - The next chapter will introduce the bootstrap for getting standard errors and confidence intervals.
  - Example (Plasma Cholesterol) uses histograms to compare plasma cholesterol levels between patients with and without heart disease. It raises the question of whether the mean cholesterol is different in the two groups.
8 The Bootstrap
- 8.1 Simulation
- 8.2 Bootstrap Variance Estimation
- 8.3 Bootstrap Confidence Intervals
- 8.4 Bibliographic Remarks
- 8.5 Appendix
  - 8.5.1 The Jackknife
  - 8.5.2 Justification For The Percentile Interval
    - The justification involves a monotone transformation U = m(T) such that U \sim N(\phi, c^2) where \phi = m(\theta).
    - It discusses the relationship between the sample quantiles of the bootstrap replicates U^{*b} and the quantiles of the Normal distribution.
    - It shows that P(\theta^*_{\alpha/2} \le \theta \le \theta^*_{1-\alpha/2}) = 1 - \alpha under the assumption of an exact normalizing transformation.
    - It notes that exact normalizing transformations rarely exist but approximate ones might.
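Since the outline above only names the bootstrap sections, a short sketch may help. It resamples the data with replacement, recomputes the statistic each time, and reads off a standard error and a percentile interval from the replicates. The sample, the choice of the median as the statistic, B = 2000, and the simple index-based quantile rule are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, B=2000, seed=0):
    """Sorted values of the statistic on B resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return sorted(statistic([rng.choice(data) for _ in range(n)]) for _ in range(B))

def percentile_interval(reps, alpha=0.05):
    """Approximate alpha/2 and 1 - alpha/2 sample quantiles of sorted replicates."""
    B = len(reps)
    lo = reps[int((alpha / 2) * B)]
    hi = reps[min(int((1 - alpha / 2) * B), B - 1)]
    return lo, hi

data = [3.1, 2.4, 5.9, 4.7, 3.8, 6.2, 2.9, 4.1, 5.0, 3.5]   # made-up sample
reps = bootstrap_replicates(data, statistics.median)
se_boot = statistics.pstdev(reps)     # bootstrap estimate of the standard error
ci = percentile_interval(reps)
```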
9 Parametric Inference
- 9.1 Parameter of Interest
  - Introduction to the idea of a parameter of interest in a statistical model.
- 9.2 The Method of Moments
  - Explanation of the method of moments for estimating parameters.
- 9.3 Maximum Likelihood
  - Introduction to the principle of maximum likelihood estimation (MLE).
- 9.4 Properties of Maximum Likelihood Estimators
  - Discussion of properties of MLEs.
- 9.5 Consistency of Maximum Likelihood Estimators
  - Conditions under which MLEs are consistent.
  - Hint for proving consistency for Uniform distribution using Y = \max\{X_1, \ldots , X_n\} and P(Y < c) = P(X_1 < c)\ldots P(X_n < c).
- 9.6 Equivariance of the MLE
  - Property that the MLE of a function of a parameter is that function applied to the MLE of the parameter.
- 9.7 Asymptotic Normality
  - Large sample behavior of MLEs tending towards a Normal distribution.
- 9.8 Optimality
  - Discussion of the optimality properties of MLEs.
- 9.9 The Delta Method
  - Method for finding the asymptotic distribution of a function of an estimator if the estimator is asymptotically Normal.
  - Approximation \sqrt{n}(\hat{\tau}_n - \tau) \approx \sqrt{n}(\hat{\theta}_n - \theta)g'(\theta) where \tau = g(\theta) and \hat{\tau}_n = g(\hat{\theta}_n).
  - Resulting asymptotic distribution \hat{\tau}_n \approx N(\tau, se^2(\hat{\tau}_n)) with se^2(\hat{\tau}_n) = (g'(\theta))^2 / (nI(\theta)).
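A worked instance of the delta method, assuming Bernoulli(p) data and the log-odds \tau = g(p) = \log(p/(1-p)). Here g'(p) = 1/(p(1-p)) and the Fisher information for one observation is I(p) = 1/(p(1-p)), so the squared standard error of g(\hat{p}) is (g'(p))^2/(nI(p)) = 1/(np(1-p)). The counts are hypothetical.

```python
import math

def log_odds_delta(successes, n):
    """Estimate of log(p/(1-p)) and its delta-method standard error."""
    p_hat = successes / n
    tau_hat = math.log(p_hat / (1 - p_hat))
    g_prime = 1 / (p_hat * (1 - p_hat))     # derivative of the log-odds at p_hat
    fisher = 1 / (p_hat * (1 - p_hat))      # I(p) for a single Bernoulli draw
    se = math.sqrt(g_prime ** 2 / (n * fisher))
    return tau_hat, se

tau_hat, se = log_odds_delta(60, 100)       # hypothetical: 60 successes in 100 trials
ci = (tau_hat - 1.96 * se, tau_hat + 1.96 * se)   # approximate 95 percent interval
```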
- 9.10 Multiparameter Models
  - Extension of parametric inference to models with multiple parameters.
- 9.11 The Parametric Bootstrap
  - Using the fitted parametric model to generate bootstrap samples and estimate the sampling distribution of an estimator.
- 9.12 Checking Assumptions
  - Methods for verifying the assumptions of the parametric model.
- 9.13 Appendix
  - 9.13.1 Score Function
    - Definition and Lemma 9.31: E_\theta [s(X; \theta)] = 0.
  - 9.13.2 Sufficiency
    - Definition of a statistic T(X^n) and a sufficient statistic containing all information in the data.
    - Formal definition requiring that the conditional distribution of the data given the sufficient statistic does not depend on the parameter \theta.
    - Factorization criterion: T is sufficient if and only if f(x^n; \theta) = h_n(x^n) g(T(x^n); \theta) for some functions h_n and g.
  - 9.13.3 Information Inequality (Cramér-Rao Lower Bound) (Not explicitly covered in the excerpts)
  - 9.13.4 Computing Maximum Likelihood Estimates
    - Discussion of analytical and numerical methods (Newton-Raphson, EM algorithm) for finding MLEs.
    - The method of moments estimator as a good starting value for numerical methods.
10 Hypothesis Testing and p-values
- 10.1 The Wald Test
  - Introduction of the Wald test.
  - Wald test statistic (form not explicitly given in the excerpts).
  - Mentioned in the context of multiple regression output where the “t-value” is the Wald test statistic for testing H_0: \beta_j = 0 versus H_1: \beta_j \neq 0.
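The excerpts do not give the form of the statistic, so the sketch below assumes the standard one: W = (\hat{\theta} - \theta_0)/\hat{se}, with H_0 rejected when |W| > z_{\alpha/2}. The coin data (60 heads in 100 tosses) and the function name are hypothetical.

```python
import math
from statistics import NormalDist

def wald_test(theta_hat, theta_0, se, alpha=0.05):
    """Return the Wald statistic, its two-sided p-value, and the decision."""
    w = (theta_hat - theta_0) / se
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(w)))
    return w, p_value, abs(w) > z

n, heads = 100, 60                          # hypothetical data
p_hat = heads / n
se = math.sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error of p_hat
w, p_value, reject = wald_test(p_hat, 0.5, se)
```

In regression output the same calculation is applied to each coefficient, with \theta_0 = 0.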
- 10.2 p-values
  - Definition and interpretation of p-values.
  - Mentioned in the context of hypothesis tests.
  - Example 10.28 showing ordered p-values from 10 independent hypothesis tests.
- 10.3 The \chi^2 Distribution
  - Introduction of the \chi^2 distribution.
- 10.4 Pearson’s \chi^2 Test For Multinomial Data
  - Description of Pearson’s \chi^2 test for multinomial data.
- 10.5 The Permutation Test
  - Explanation of the permutation test.
- 10.6 The Likelihood Ratio Test
  - Description of the likelihood ratio test.
- 10.7 Multiple Testing
  - Discussion of the challenges in multiple hypothesis testing.
  - Example 10.28 with ordered p-values.
  - Illustration of the Benjamini-Hochberg (BH) procedure in Figure 10.6.
  - Mention of Bonferroni testing where the rejection criterion is P_i < \alpha/m.
  - The BH procedure rejects when P_i \le T, where T corresponds to the rightmost undercrossing in Figure 10.6.
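The two rejection rules can be compared directly. This sketch assumes independent tests, so the BH comparison line is i\alpha/m, and takes T to be the largest ordered p-value that falls below its line. The ten p-values are illustrative numbers, not a claim about the book's example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_0i when P_i < alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject every P_i <= T, where T = P_(R) and R = max{i : P_(i) < i * alpha / m}."""
    m = len(p_values)
    threshold = 0.0                      # stays 0.0 if no p-value is below its line
    for i, p in enumerate(sorted(p_values), start=1):
        if p < i * alpha / m:
            threshold = p
    return [p <= threshold for p in p_values], threshold

p_values = [0.00017, 0.00448, 0.00671, 0.00907, 0.01220,
            0.33626, 0.39341, 0.53882, 0.58125, 0.98617]   # illustrative values
rejected_bonf = bonferroni(p_values)
rejected_bh, threshold = benjamini_hochberg(p_values)
```

On these numbers Bonferroni rejects two hypotheses and BH rejects five, which reflects the usual pattern: BH controls the false discovery rate and is less conservative.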
- 10.8 Goodness-of-fit Tests
  - Overview of goodness-of-fit tests.
11 Bayesian Inference
- 11.1 Introduction
  - Mention of Bayesian methods treating the parameter \theta as a random variable.
  - Focus on making probability statements about \theta given the data.
  - Contrast with frequentist methods where \theta is fixed and unknown.
- 11.2 Bayes Theorem for Parameters
  - Posterior density f(\theta|X_n) = \frac{L(\theta)f(\theta)}{c}.
  - L(\theta) is the likelihood function.
  - f(\theta) is the prior density.
  - c = \int L(\theta)f(\theta) d\theta is the normalizing constant.
- 11.3 Posterior Distributions
  - Example 11.3: Beta prior and Binomial likelihood leading to a Beta posterior for p.
  - Steps for approximating the posterior for \psi = \log(P/(1-P)) without calculus:
    1. Draw P_1, \ldots, P_B \sim Beta(s+1, n-s+1).
    2. Let \psi_i = \log(P_i/(1-P_i)) for i = 1, \ldots, B.
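The two steps above can be sketched as follows. The Beta(s+1, n-s+1) posterior corresponds to a flat prior on p; the values s = 12 and n = 20, the number of draws B, and the summary statistics are assumptions for illustration.

```python
import math
import random

def simulate_log_odds_posterior(s, n, B=10000, seed=0):
    """Draw P_i ~ Beta(s+1, n-s+1) and transform to psi_i = log(P_i / (1 - P_i))."""
    rng = random.Random(seed)
    draws = [rng.betavariate(s + 1, n - s + 1) for _ in range(B)]
    return [math.log(p / (1 - p)) for p in draws]

psi = sorted(simulate_log_odds_posterior(s=12, n=20))   # hypothetical data
posterior_mean = sum(psi) / len(psi)
# Approximate central 95 percent posterior interval from the sorted draws.
interval_95 = (psi[int(0.025 * len(psi))], psi[int(0.975 * len(psi)) - 1])
```

A histogram of the psi values approximates the posterior density of \psi, with no change-of-variables calculation needed.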
- 11.4 Point Estimation
  - The Bayes estimator is often the posterior mean E(\theta \mid X_n) = \int \theta f(\theta \mid X_n) d\theta.
- 11.5 Credible Sets
  - Definition of a credible set C_n such that P(\theta \in C_n \mid X_n) = 1 - \alpha.
  - Bayesian intervals refer to degree-of-belief probabilities about \theta.
  - Bayesian intervals do not generally trap the parameter 100(1-\alpha) percent of the time, unlike frequentist confidence intervals.
- 11.6 Bayesian Testing (Mentioned in the index)
- 11.7 Strengths and Weaknesses of Bayesian Methods (Mentioned in the index)
- 11.8 Asymptotic Approximations
  - Posterior of \theta is approximately Normal with mean \hat{\theta}_n (MLE) and variance \sigma^2_n = -1/\ell''(\hat{\theta}_n) \approx (nI(\hat{\theta}_n))^{-1}.
  - \ell = \log f(X|\theta) is the log-likelihood.
  - I(\theta) is the Fisher information.
  - \sigma_n \approx se(\hat{\theta}).
- 11.9 Hierarchical Bayes Models (Not explicitly covered in the excerpts)
- 11.10 Empirical Bayes (Not explicitly covered in the excerpts)
- 11.11 Computation
  - Mention of simulation being important for Bayesian computation (as elaborated in Chapter 24).
12 Decision Theory
- 12.1 Introduction
  - Mention of statistical inference as leading to an estimate or a decision.
  - Concept of a loss function L(\theta, \hat{\theta}) quantifying the loss of estimating \theta by \hat{\theta}.
  - Risk function R(\theta, \hat{\theta}) = E_{\theta}(L(\theta, \hat{\theta})).
  - Example 12.1 with squared error loss L(\theta, \hat{\theta}) = (\theta - \hat{\theta})^2 and risk being the mean squared error (MSE).
  - Example 12.2 comparing two estimators for a Normal mean with squared error loss.
  - Example 12.3 with Bernoulli data and squared error loss for estimating p.
- 12.2 Bayes Rules
  - Introduction of Bayesian decision theory.
  - Prior distribution \pi(\theta) for the parameter \theta.
  - Posterior distribution \pi(\theta|x^n).
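  - Posterior risk r(\hat{\theta}|x^n) = E_{\theta|x^n}(L(\theta, \hat{\theta})) = \int L(\theta, \hat{\theta}) \pi(\theta|x^n) d\theta; the Bayes risk r(\pi, \hat{\theta}) = \int R(\theta, \hat{\theta}) \pi(\theta) d\theta averages the risk function over the prior.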
  - Bayes estimator \hat{\theta}_{\pi} that minimizes the Bayes risk.
  - With squared error loss, the Bayes estimator is the posterior mean \hat{\theta}_{\pi} = E(\theta|x^n).
  - Example 12.4 finding the Bayes estimator for the mean of a Normal with a Normal prior under squared error loss.
- 12.3 Admissibility
  - Definition of an estimator \hat{\theta}_1 dominating \hat{\theta}_2 if R(\theta, \hat{\theta}_1) \le R(\theta, \hat{\theta}_2) for all \theta, with strict inequality for some \theta.
  - Definition of an admissible estimator as one that is not dominated by any other estimator.
  - Connection between Bayes estimators and admissibility (without full coverage).
  - Mention that Bayes estimators with constant risk are minimax.
- 12.4 Minimax Rules
  - Definition of the maximum risk M(\hat{\theta}) = \sup_{\theta} R(\theta, \hat{\theta}).
  - Definition of a minimax estimator \hat{\theta}_{minimax} that minimizes the maximum risk.
  - Bayes estimators with a constant risk function are minimax.
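To see constant risk and maximum risk side by side, consider Y ~ Binomial(n, p) with squared error loss. The sketch compares \hat{p} = Y/n with the shrinkage estimator (Y + a)/(n + a + b), which is the posterior mean under a Beta(a, b) prior. Choosing a = b = \sqrt{n}/2 is an assumption made here because it makes the risk constant in p; n = 25 is arbitrary.

```python
import math

def risk_mle(p, n):
    """Squared-error risk of Y/n: zero bias, variance p(1-p)/n."""
    return p * (1 - p) / n

def risk_shrunk(p, n, a, b):
    """Risk of (Y + a)/(n + a + b): squared bias plus variance."""
    denom = n + a + b
    bias = (n * p + a) / denom - p
    var = n * p * (1 - p) / denom ** 2
    return bias ** 2 + var

n = 25
a = b = math.sqrt(n) / 2                  # assumed prior parameters
grid = [i / 100 for i in range(101)]
max_risk_mle = max(risk_mle(p, n) for p in grid)
max_risk_shrunk = max(risk_shrunk(p, n, a, b) for p in grid)
```

The shrinkage estimator has the smaller maximum risk, although Y/n has lower risk when p is near 0 or 1.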
- 12.5 Point Estimation: A Summary
  - Recap of frequentist (bias, standard error, MSE) and Bayesian (prior, posterior, Bayes estimator, Bayes risk) approaches to point estimation.
- 12.6 Hypothesis Testing (Not covered in the provided excerpts for this chapter)
- 12.7 Relationship between Confidence Sets and Hypothesis Tests (Not covered in the provided excerpts for this chapter)
- 12.8 Bibliographic Remarks
  - References to books on decision theory.

Part III: Statistical Models and Methods

13 Linear and Logistic Regression
- 13.1 Simple Linear Regression
  - Introduces the basic concepts of simple linear regression.
  - Example 13.2 shows a plot of log surface temperature versus log light intensity for stars with an estimated linear regression line.
  - The model involves an intercept (\beta_0) and a slope (\beta_1).
  - The fitted line is given by \hat{r}(x) = \hat{\beta}_0 + \hat{\beta}_1x.
- 13.2 Least Squares and Maximum Likelihood
  - Discusses methods for estimating the parameters of the linear regression model, including least squares and maximum likelihood.
  - The residual sum of squares (RSS) is defined as RSS(\beta) = (Y - X\beta)^T (Y - X\beta).
  - The least squares estimator for \beta is given by \hat{\beta} = (X^TX)^{-1}X^TY.
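With a single covariate the matrix formula reduces to closed forms: \hat{\beta}_1 = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2 and \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. A minimal sketch on made-up data:

```python
def fit_simple_linear(xs, ys):
    """Least squares intercept and slope for one covariate."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def rss(xs, ys, beta0, beta1):
    """Residual sum of squares for a candidate line."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]          # made-up responses
beta0, beta1 = fit_simple_linear(xs, ys)
```

Any other line gives a larger RSS on the same data, which is what "least squares" means.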
- 13.3 Properties of the Least Squares Estimators
  - Examines the statistical properties of the least squares estimators.
- 13.4 Prediction
  - Covers how to use the fitted regression model for prediction.
  - Predicted values or fitted values are \hat{Y}_i = \hat{r}(X_i).
  - Residuals are defined as the difference between the observed and predicted values.
  - Example 13.5 shows the least squares estimates for the star data.
  - The variance of a prediction \hat{Y}^* at a new point x^* is given by \xi^2_n = V(\hat{Y}^*) + \sigma^2.
- 13.5 Multiple Regression
  - Extends the linear regression model to include multiple covariates.
  - Example 13.6 illustrates multiple regression with the 2000 Presidential Election data in Florida, predicting Buchanan’s votes based on Bush’s votes.
  - The output of a multiple regression program typically includes coefficient estimates, standard errors, t-values (Wald test statistics), and p-values.
- 13.6 Model Selection
  - Addresses the problem of choosing which covariates to include in a multiple regression model.
  - Smaller models are more parsimonious and might give better predictions than larger models.
  - Adding more variables decreases bias but increases variance, leading to a bias-variance tradeoff.
  - Underfitting (too few covariates) leads to high bias, while overfitting (too many covariates) leads to high variance.
  - Example 13.14 (not explicitly shown in the excerpts but mentioned) illustrates the problem of having many covariates.
  - AIC (Akaike Information Criterion) is introduced as an approximately unbiased estimate of a measure of prediction accuracy.
  - Theorem 13.18 states that AIC(M_j) is an approximately unbiased estimate of a(f, \hat{f}) (a measure of distance between the true density f and the estimated density \hat{f} for model M_j).
- 13.7 Logistic Regression
  - Introduces logistic regression for predicting a binary outcome variable.
  - The model is r(x) = P(Y = 1|X = x) = \frac{e^{\beta_0 + \sum_j \beta_j x_j}}{1 + e^{\beta_0 + \sum_j \beta_j x_j}}.
  - The maximum likelihood estimate (MLE) \hat{\beta} is obtained numerically.
  - Example 13.17 (referenced in Example 22.11) likely provides an example of logistic regression.
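"Obtained numerically" can mean many algorithms. The sketch below uses plain gradient ascent on the log-likelihood for one covariate, chosen here only because it is short; it is not presented as the book's method. The data, learning rate, and step count are assumptions.

```python
import math

def sigmoid(t):
    """Numerically stable e^t / (1 + e^t)."""
    if t >= 0:
        return 1 / (1 + math.exp(-t))
    e = math.exp(t)
    return e / (1 + e)

def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Maximize the Bernoulli log-likelihood in (b0, b1) by gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        resid = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += lr * sum(resid) / n                              # d loglik / d b0
        b1 += lr * sum(r * x for r, x in zip(resid, xs)) / n   # d loglik / d b1
    return b0, b1

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]             # made-up labels, not linearly separable
b0, b1 = fit_logistic(xs, ys)
prob_at_3 = sigmoid(b0 + b1 * 3.0)        # fitted P(Y = 1 | X = 3)
```

If the classes were perfectly separable, the MLE would not exist and the coefficients would keep growing, so the made-up labels are deliberately overlapping.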
- 13.8 Bibliographic Remarks
  - Provides references for further reading on linear and logistic regression.
14 Multivariate Models
- 14.1 Random Vectors
  - Introduces the concept of random vectors.
  - Discusses notation involving vectors \mathbf{x}, \mathbf{y} and matrices A.
- 14.2 Estimating the Correlation
  - Covers methods for estimating the correlation between multiple variables.
  - Errata mentions a correction for equation (14.6), where the numerator should be divided by n-1, likely related to the sample covariance or correlation.
- 14.3 Multivariate Normal
  - Discusses the Multivariate Normal distribution.
  - The index mentions “positive definite” and a matrix \Sigma on p.231, which is a key property of the covariance matrix in a Multivariate Normal distribution.
  - Example 12.9 shows the Bayes estimator for the mean of a Normal distribution with a Normal prior.
- 14.4 Multinomial
  - Covers the Multinomial distribution.
  - Errata corrects a phrase in Section 14.4, changing “3. ‘balls of the kth color’” to “3. ‘balls of the jth color’,” indicating a discussion of categories within the Multinomial distribution.
  - The index mentions Pearson’s \chi^2 test on p. 241, which is commonly used for analyzing Multinomial data.
15 Contingency Tables
- Introduces the analysis of multivariate discrete data.
- The data can be represented as counts in an r_1 \times r_2 \times \cdots \times r_m table.
- 15.1 Definition. The odds ratio is defined to be \psi = \frac{p_{00}p_{11}}{p_{01}p_{10}} (Equation 15.1).
- The log odds ratio is defined to be \gamma = \log(\psi) (Equation 15.2).
- 15.2 Theorem. The following statements are equivalent:
  - 1. Y \perp Z
  - 2. \psi = 1
  - 3. \gamma = 0
  - 4. For i, j \in \{0, 1\}, p_{ij} = p_{i \cdot}p_{\cdot j}.
- Discusses testing for independence (H_0: Y \perp Z versus H_1: Y \not\perp Z).
- Covers Prospective Sampling (Cohort Sampling), where exposed and unexposed groups are observed to count the number with disease in each group.
  - X_{01} \sim Binomial(X_{0\cdot}, P(D|E^c))
  - X_{11} \sim Binomial(X_{1\cdot}, P(D|E)).
  - Estimating P(D|E) and P(D|E^c) is possible, and \psi can be estimated as a function of these probabilities.
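For a 2 × 2 table of counts X_{ij}, the plug-in estimate of the odds ratio is \hat{\psi} = X_{00}X_{11}/(X_{01}X_{10}). The sketch also uses the standard large-sample standard error of the log odds ratio, \sqrt{1/X_{00} + 1/X_{01} + 1/X_{10} + 1/X_{11}}, to form a Wald statistic for H_0: \gamma = 0. The counts are hypothetical.

```python
import math

def odds_ratio(x00, x01, x10, x11):
    """Estimated odds ratio, log odds ratio, and standard error of the latter."""
    psi_hat = (x00 * x11) / (x01 * x10)
    gamma_hat = math.log(psi_hat)
    se_gamma = math.sqrt(1 / x00 + 1 / x01 + 1 / x10 + 1 / x11)
    return psi_hat, gamma_hat, se_gamma

psi_hat, gamma_hat, se_gamma = odds_ratio(40, 10, 20, 30)   # hypothetical counts
wald = gamma_hat / se_gamma     # large |wald| is evidence against independence
```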
16 Causal Inference
- Introduction
  - The statement “X causes Y” roughly means that changing the value of X will change the distribution of Y.
  - When X causes Y, X and Y will be associated, but association does not necessarily imply causation.
  - The chapter discusses two frameworks for discussing causation: counterfactual random variables and directed acyclic graphs (DAGs) (presented in the next chapter).
- 16.1 The Counterfactual Model
  - Introduces the counterfactual model, where X is a binary treatment variable (X=1 for treated, X=0 for not treated).
  - Introduces potential outcomes (C_0, C_1):
    - C_0: the outcome if the subject is not treated (X=0).
    - C_1: the outcome if the subject is treated (X=1).
  - The observed outcome Y is related to the potential outcomes by:
    - Y = C_1 if X = 1.
    - Y = C_0 if X = 0.
  - The causal effect of the treatment for a single subject is C_1 - C_0.
  - The average causal effect (ACE) or average treatment effect (ATE) is \theta = E(C_1 - C_0) = E(C_1) - E(C_0).
  - It is impossible to observe both C_0 and C_1 for the same subject, leading to the “fundamental problem of causal inference”.
- 16.2 Identifiability
  - Discusses conditions under which causal effects can be estimated from observed data.
  - If treatment X is randomly assigned and independent of (C_0, C_1), then:
    - E(C_1) = E(Y|X=1).
    - E(C_0) = E(Y|X=0).
    - In this case, \theta = E(Y|X=1) - E(Y|X=0) can be estimated.
  - Example 16.2 (not fully described) is mentioned in an exercise.
  - Theorem 16.4 is mentioned in an exercise requiring proof.
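A simulation makes the identifiability argument tangible. Every subject gets both potential outcomes, but only one is observed. Because assignment is a coin flip independent of (C_0, C_1), the difference in observed group means recovers \theta. The outcome distribution, the constant effect of 2, and the sample size are assumptions.

```python
import random

def simulate_randomized_study(n=20000, effect=2.0, seed=0):
    """Return the difference in mean observed outcome, treated minus control."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        c0 = rng.gauss(0.0, 1.0)    # potential outcome without treatment
        c1 = c0 + effect            # potential outcome with treatment
        x = rng.random() < 0.5      # random assignment, independent of (c0, c1)
        if x:
            treated.append(c1)      # only C_1 is observed for treated subjects
        else:
            control.append(c0)      # only C_0 is observed for controls
    return sum(treated) / len(treated) - sum(control) / len(control)

theta_hat = simulate_randomized_study()
```

If assignment instead depended on c0, for example treating only subjects with low c0, the same difference in means would no longer estimate \theta.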
- 16.3 Observational Studies
  - Discusses the challenges of estimating causal effects in observational studies where treatment assignment is not random.
  - In observational studies, X and (C_0, C_1) may be dependent due to confounding variables.
  - Figure 17.11 illustrates observational studies with measured and unmeasured confounders, indicating the relevance of DAGs (from Chapter 17) to this topic.
- 16.4 Instrumental Variables (Not explicitly detailed in the excerpts but likely covered)
17 Directed Graphs and Conditional Independence
- 17.1 Introduction
  - Introduces directed graphs consisting of nodes and arrows.
  - Graphs are useful for representing independence relations between variables.
  - They can also be used as an alternative to counterfactuals to represent causal relationships.
  - The term “Bayesian network” is sometimes used for a directed graph with a probability distribution, but this is considered poor terminology.
  - Statistical inference for directed graphs can be performed.
- 17.2 Conditional Independence
  - Discusses the concept of conditional independence.
- 17.3 DAGs
  - Introduces Directed Acyclic Graphs (DAGs).
- 17.4 Probability and DAGs
  - Explains the relationship between probability distributions and DAGs.
  - For a distribution consistent with a DAG, the probability function factors according to the graph structure, with each variable conditioned only on its parents, e.g., f(x, y, z) = f(x)f(y|x)f(z|y) for the DAG X \rightarrow Y \rightarrow Z.
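For the chain X \rightarrow Y \rightarrow Z, sampling follows the graph: draw X, then Y given X, then Z given Y alone. The sketch uses binary variables with made-up conditional probabilities and checks empirically that, once Y is fixed, the value of X does not change the distribution of Z.

```python
import random

def sample_chain(n, seed=0):
    """Draw (x, y, z) in topological order for the DAG X -> Y -> Z."""
    rng = random.Random(seed)
    p_y = {0: 0.2, 1: 0.7}     # assumed P(Y = 1 | X = x)
    p_z = {0: 0.1, 1: 0.9}     # assumed P(Z = 1 | Y = y); X does not appear
    out = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        y = int(rng.random() < p_y[x])
        z = int(rng.random() < p_z[y])
        out.append((x, y, z))
    return out

def cond_prob_z(samples, x, y):
    """Empirical P(Z = 1 | X = x, Y = y)."""
    zs = [s[2] for s in samples if s[0] == x and s[1] == y]
    return sum(zs) / len(zs)

samples = sample_chain(100000)
gap = abs(cond_prob_z(samples, 0, 1) - cond_prob_z(samples, 1, 1))
```

The small gap is sampling noise and reflects the independence X \perp Z | Y implied by the chain.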
- 17.5 More Independence Relations
  - Explores further independence relations implied by DAGs.
  - Introduces the concept of an unshielded collider.
- 17.6 Estimation for DAGs
  - Briefly mentions the two main estimation questions for DAGs:
    - Estimating the distribution f given a DAG G and data.
    - Estimating the DAG G given data.
  - Notes that these are involved topics beyond the scope of the book.
- 17.7 Bibliographic Remarks
  - Provides references for further reading on DAGs and their applications. Mentions texts by Edwards (1995) and Jordan (2004), as well as the early work by Wright (1934) on causal DAGs and modern treatments by Spirtes et al. (2000) and Pearl (2000). Also notes discussions on the challenges of estimating causal structure from data by Robins et al. (2003).
- 17.8 Appendix
  - Causation Revisited: Discusses causation using DAGs as an alternative to counterfactual random variables (from Chapter 16), noting that the two approaches are mathematically equivalent.
  - Introduces the idea of intervention in the context of DAGs.
  - Provides pseudocode for generating data from a distribution consistent with a DAG.
  - Illustrates different study designs (randomized study, observational study with measured confounders, observational study with unmeasured confounders) using DAGs in Figure 17.11.
  - Mentions a way to make the correspondence between the DAG approach and the counterfactual approach explicit using a variable Z defined based on (C_0, C_1) from Chapter 16.
- Theorem 17.13 states that two DAGs G_1 and G_2 are Markov equivalent if and only if (i) skeleton(G_1) = skeleton(G_2) and (ii) G_1 and G_2 have the same unshielded colliders.
- Example 17.14 illustrates Markov equivalent DAGs.
18 Undirected Graphs
- 18.1 Undirected Graphs
  - Definition of an undirected graph G = (V, E) with a finite set of vertices V and a set of edges E (unordered pairs of vertices).
  - Vertices correspond to random variables (X, Y, Z, \ldots).
  - An edge (X, Y) \in E means X and Y are joined by an edge, written as X \sim Y if X and Y are adjacent.
  - Definition of a path as a sequence X_0, \ldots, X_n where X_{i-1} \sim X_i for each i.
  - Definition of a complete graph (an edge between every pair of vertices).
  - Definition of a subgraph (a subset U \subset V of vertices together with their edges).
  - Definition of separation: C separates A and B (three distinct subsets of V) if every path from a variable in A to a variable in B intersects a variable in C.
- 18.2 Probability and Graphs
  - Constructing a graph based on conditional independence: no edge between X and Y \iff X \perp Y | \text{rest} (where “rest” refers to all other variables).
  - The resulting graph is called a pairwise Markov graph.
  - Global Markov property: if A and B are separated by C, then X_A \perp X_B | X_C.
  - Theorem 18.3: The pairwise Markov property and the global Markov property are equivalent: M_{pair}(G) = M_{global}(G).
  - Examples of independence relations implied by different graph structures (Figures 18.3, 18.4, 18.5, 18.6, 18.7, 18.8).
  - Example 18.4: Figure 18.7 implies X \perp Y, X \perp Z, and X \perp (Y, Z).
  - Example 18.5: Figure 18.8 implies X \perp W | (Y, Z) and X \perp Z | Y.
- 18.3 Cliques and Potentials
  - Definition of a clique: a set of variables in a graph that are all adjacent to each other.
  - Definition of a maximal clique: a clique where it’s not possible to include another variable and still be a clique.
  - Definition of a potential: any positive function.
  - The probability function f(x) for a distribution P that is Markov to a graph G can be written as f(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C) (Equation 18.1), where \mathcal{C} is the set of maximal cliques and \psi_C(x_C) are positive potential functions. Z = \sum_x \prod_{C \in \mathcal{C}} \psi_C(x_C) is the normalization constant.
  - Example 18.6: For the graph in Figure 18.1, maximal cliques are \{X, Y\} and \{Y, Z\}, so f(x, y, z) \propto \psi_1(x, y) \psi_2(y, z).
  - Example 18.7: Lists the maximal cliques for the graph in Figure 18.9 and gives the form of the probability function.
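A tiny numerical check of the clique factorization for the graph X - Y - Z with binary variables. The two potentials are hypothetical positive functions. The code computes the normalizing constant Z by brute force and verifies that X \perp Z | Y, i.e. f(x, y, z) = f(x, y) f(y, z) / f(y).

```python
from itertools import product

def psi1(x, y):
    return 2.0 if x == y else 1.0   # hypothetical potential on clique {X, Y}

def psi2(y, z):
    return 3.0 if y == z else 1.0   # hypothetical potential on clique {Y, Z}

states = list(product([0, 1], repeat=3))
Z = sum(psi1(x, y) * psi2(y, z) for x, y, z in states)     # normalizing constant
f = {(x, y, z): psi1(x, y) * psi2(y, z) / Z for x, y, z in states}

def marginal(keep):
    """Marginal distribution over the coordinates listed in keep."""
    out = {}
    for s, p in f.items():
        key = tuple(s[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

f_xy, f_yz, f_y = marginal((0, 1)), marginal((1, 2)), marginal((1,))
max_gap = max(abs(f[(x, y, z)] - f_xy[(x, y)] * f_yz[(y, z)] / f_y[(y,)])
              for x, y, z in states)
```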
- 18.4 Fitting Graphs to Data
  - Briefly mentions the problem of finding a graphical model that fits a given data set.
  - In the discrete case, one way to fit a graph to data is to use a log-linear model (the subject of Chapter 19).
- 18.5 Bibliographic Remarks
  - References thorough treatments of undirected graphs, including books by Whittaker (1990) and Lauritzen (1996).
19 Log-Linear Models
- Introduction to log-linear models for modeling multivariate discrete data.
- Strong connection between log-linear models and undirected graphs.
- 19.1 The Log-Linear Model
  - Let X = (X_1, \ldots, X_m) be a discrete random vector with probability function f(x) = P(X = x) = P(X_1 = x_1, \ldots, X_m = x_m).
  - X_j takes r_j values, assumed to be in \{0, 1, \ldots, r_j - 1\} without loss of generality.
  - Data can be seen as a sample from a Multinomial distribution with N = r_1 \times r_2 \times \cdots \times r_m categories.
  - p = (p_1, \ldots, p_N) denotes the multinomial parameter.
  - Definition of x_A = (x_j : j \in A) for a subset A \subset S = \{1, \ldots, m\}.
  - Theorem 19.1: The joint probability function f(x) can be written as \log f(x) = \sum_{A \subset S} \psi_A(x) (Equation 19.1), where \psi_A(x) only depends on x_A and \psi_\emptyset(x) is a constant.
  - The terms \psi_A(x) are called the log-linear parameters or interactions.
  - Example 19.2: Log-linear model for a Bernoulli random variable.
    - X \in \{0, 1\} with P(X = 1) = p_1 and P(X = 0) = p_2 = 1 - p_1.
    - \log f(x) = \psi_\emptyset(x) + \psi_{\{1\}}(x).
    - \psi_\emptyset(x) = \log(p_2) and \psi_{\{1\}}(x) = x \log(\frac{p_1}{p_2}).
  - Example 19.3: Log-linear model for X = (X_1, X_2) where X_1 \in \{0, 1\} and X_2 \in \{0, 1, 2\}, connected to a multinomial with 6 categories.
- 19.2 Graphical Log-Linear Models
  - Connection between log-linear models and undirected graphs: a log-linear model is graphical with respect to a graph G if \psi_A(x) = 0 whenever the indices in A form a set that is not a clique in G.
  - Theorem 19.4: Let G be an undirected graph with vertex set S = \{1, \ldots, m\}. A probability distribution f factors according to G (i.e., f(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C) where \mathcal{C} is the set of maximal cliques) if and only if its log-linear expansion \log f(x) = \sum_{A \subset S} \psi_A(x) satisfies \psi_A = 0 whenever A is not a clique in G.
  - Lemma 19.5: A partition (X_a, X_b, X_c) satisfies X_b \perp X_c | X_a if and only if f(x_a, x_b, x_c) = g(x_a, x_b)h(x_a, x_c) for some functions g and h.
  - Proof of Theorem 19.4 using Lemma 19.5.
  - If a term can be added to the model without changing the graph, the model is not graphical.
  - Example 19.6: Graphical log-linear model corresponding to a path graph.
  - Example 19.7: Graphical log-linear model corresponding to a more complex graph (Figure 19.1).
- 19.3 Hierarchical Log-Linear Models
  - Definition of hierarchical log-linear models: \psi_A = 0 and A \subset B implies that \psi_B = 0.
  - Lemma 19.9: A graphical model is hierarchical, but the reverse is not necessarily true.
  - Example 19.10: A hierarchical and graphical model.
  - Example 19.11: A hierarchical but not graphical model (complete graph).
  - Example 19.12: A model that is not hierarchical and therefore not graphical.
- 19.4 Model Generators
  - Hierarchical models can be written succinctly using generators.
  - Example: M = 1.2 + 1.3 stands for \log f = \psi_\emptyset + \psi_1 + \psi_2 + \psi_3 + \psi_{12} + \psi_{13}.
  - The generator M = 1.2.3 is the saturated model.
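Generator notation can be expanded mechanically: each generator term contributes itself and all of its subsets, which is what keeps the model hierarchical. A small sketch, where the function name and the tuple representation of terms are assumptions:

```python
from itertools import combinations

def expand_generator(generator):
    """Expand e.g. '1.2 + 1.3' into the index sets of all psi terms in the model."""
    terms = set()
    for part in generator.split("+"):
        variables = tuple(sorted(int(v) for v in part.strip().split(".")))
        for size in range(len(variables) + 1):
            terms.update(combinations(variables, size))   # every subset of the term
    return sorted(terms, key=lambda t: (len(t), t))

terms = expand_generator("1.2 + 1.3")
saturated = expand_generator("1.2.3")
```

Here `terms` lists the empty set, the three main effects, and the interactions (1, 2) and (1, 3), while `saturated` contains all 8 subsets of {1, 2, 3}.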
- 19.5 Fitting Log-Linear Models to Data
  - Given n random vectors from a multinomial distribution, the log-likelihood function can be expressed in terms of the log-linear parameters \beta.
  - The maximum likelihood estimates (MLE) \hat{\beta} are generally found numerically.
  - Fisher information matrix also found numerically, allowing estimation of standard errors.
  - Model selection problem: which \psi terms to include, similar to variable selection in linear regression.
20 Nonparametric Curve Estimation
- Discussion of nonparametric estimation of probability density functions and regression functions (curve estimation or smoothing).
- Contrast with estimating cumulative distribution functions, which can be done consistently without smoothness assumptions.
- Need for smoothing due to bias-variance tradeoff.
- 20.1 The Bias-Variance Tradeoff
  - Fundamental challenge in curve estimation: balancing bias and variance.
  - Oversmoothing leads to high bias and low variance.
  - Undersmoothing leads to low bias and high variance.
- 20.2 Histograms
  - Histogram as an example of a density estimator.
  - Dividing the real line into disjoint sets called bins.
  - Histogram estimator is a piecewise constant function with height proportional to the number of observations in each bin.
  - Smoothing parameter: number of bins (or binwidth).
  - Formal definition for iid data on [0, 1] with density f, using m bins B_j of width h = 1/m (Equation 20.7).
  - \nu_j: number of observations in B_j.
  - \hat{p}_j = \nu_j / n.
  - p_j = \int_{B_j} f(u) du.
  - Histogram estimator: \hat{f}_n(x) = \hat{p}_j / h for x \in B_j (Equation 20.8).
  - Theorem 20.3: Mean and variance of \hat{f}_n(x):
    - E(\hat{f}_n(x)) = p_j / h (Equation 20.9).
    - V(\hat{f}_n(x)) = p_j(1 - p_j) / (nh^2) (Equation 20.9).
  - Integrated Squared Error (ISE) as a measure of risk: R(f, \hat{f}_n) = \int (f(x) - \hat{f}_n(x))^2 dx (Equation 20.10).
  - Mean Integrated Squared Error (MISE): E[R(f, \hat{f}_n)] = \int E[(f(x) - \hat{f}_n(x))^2] dx = \int Bias^2(\hat{f}_n(x)) dx + \int Var(\hat{f}_n(x)) dx.
  - Approximate MISE for histograms (Equation 20.11).
  - Optimal bandwidth h_{opt} \propto n^{-1/3} and optimal number of bins m_{opt} \propto n^{1/3} (Equation 20.12).
  - Cross-validation for choosing the number of bins.
  - Cross-validation score \hat{J}(h) (Equation 20.13).
  - Theorem 20.7: Identity for \hat{J}(h):
    - \hat{J}(h) = \frac{2}{(n-1)h} - \frac{n+1}{h(n-1)} \sum_{j=1}^{m} \hat{p}_j^2 (Equation 20.14).
  - Example 20.8: Cross-validation for astronomy data.
  - Confidence bands for histograms.
  - Theorem 20.10: Approximate 1 - \alpha confidence band (\hat{l}_n(x), \hat{u}_n(x)) where:
    - \hat{l}_n(x) = (\max\{\sqrt{\hat{f}_n(x)} - c, 0 \})^2 (Equation 20.17).
    - \hat{u}_n(x) = (\sqrt{\hat{f}_n(x)} + c)^2 (Equation 20.17).
    - c = \frac{z_{\alpha/(2m)}}{2} \sqrt{\frac{m}{n}} (Equation 20.18). The factor 1/2 comes from the delta method: Var(\sqrt{\hat{f}_n(x)}) \approx 1/(4nh) = m/(4n).
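The histogram estimator and the cross-validation identity of Theorem 20.7 can be sketched in a few lines. This is a minimal illustration, assuming data supported on [0, 1]; the Beta sample, the candidate range for m, and the function names are hypothetical choices, not taken from the book.

```python
import numpy as np

def histogram_density(data, m):
    """Histogram estimator on [0, 1] with m equal-width bins (h = 1/m)."""
    n = len(data)
    h = 1.0 / m
    counts, _ = np.histogram(data, bins=m, range=(0.0, 1.0))  # nu_j
    p_hat = counts / n               # p_hat_j = nu_j / n
    return p_hat / h, p_hat, h       # bin heights f_hat, p_hat_j, binwidth

def cv_score(data, m):
    """J_hat(h) = 2/((n-1)h) - (n+1)/((n-1)h) * sum_j p_hat_j^2."""
    n = len(data)
    _, p_hat, h = histogram_density(data, m)
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat ** 2)

rng = np.random.default_rng(0)
data = rng.beta(2.0, 5.0, size=500)      # hypothetical sample on [0, 1]
candidates = list(range(1, 51))          # assumed search range for m
scores = [cv_score(data, m) for m in candidates]
best_m = candidates[int(np.argmin(scores))]
heights, p_hat, h = histogram_density(data, best_m)
```

The bin heights times the binwidth sum to one, so `heights` is a valid density on [0, 1]; `best_m` is the number of bins that minimizes the estimated risk over the candidates.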
- 20.3 Kernel Density Estimation
  - Kernel density estimator \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K(\frac{x - X_i}{h}) (Equation 20.20).
  - K: kernel function.
  - h: bandwidth (smoothing parameter).
  - Example 20.9: Gaussian kernel.
  - Expectation and variance of kernel density estimator (Equations 20.21 and 20.22).
  - Approximate bias and variance (Equations 20.23 and 20.24).
  - Approximate MISE for kernel density estimation (Equation 20.25).
  - Optimal bandwidth h_{opt} \propto n^{-1/5} (Equation 20.26).
  - Choice of kernel function is less critical than the choice of bandwidth.
  - Cross-validation for bandwidth selection.
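As a rough sketch of the kernel density estimator in Equation 20.20 with a Gaussian kernel: the sample, the grid, and the normal-reference bandwidth 1.06 s n^{-1/5} are assumptions made for illustration; cross-validation could be used instead.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kde(x, sample, h):
    """f_hat(x) = (1 / (n h)) * sum_i K((x - X_i) / h) on a grid x."""
    x = np.atleast_1d(x)
    u = (x[:, None] - sample[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(sample) * h)

rng = np.random.default_rng(1)
sample = rng.normal(size=400)                       # hypothetical data
# Assumed bandwidth: normal-reference rule, h proportional to n^(-1/5)
h = 1.06 * sample.std(ddof=1) * len(sample) ** (-1 / 5)
grid = np.linspace(-5.0, 5.0, 1001)
f_hat = kde(grid, sample, h)
```

Changing `h` by a factor of two in either direction shows the oversmoothing and undersmoothing described in 20.1 far more clearly than swapping the kernel does.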
- 20.4 Nonparametric Regression
  - Model: Y_i = r(x_i) + \epsilon_i, where E(\epsilon_i) = 0.
  - Goal: estimate the regression function r(x) = E(Y | X = x).
  - Nadaraya-Watson estimator: \hat{r}(x) = \frac{\sum_{i=1}^{n} K(\frac{x - x_i}{h}) Y_i}{\sum_{i=1}^{n} K(\frac{x - x_i}{h})} (Equation 20.27).
  - Local constant estimator.
  - Local linear estimator (Equation 20.28).
  - Choice of bandwidth h using cross-validation (Equation 20.29).
  - Risk function for regression (Equation 20.30).
  - Leave-one-out cross-validation score \hat{V}(h) (Equation 20.31).
  - Confidence bands for nonparametric regression.
  - Estimating \sigma^2 = Var(\epsilon_i) using differences of adjacent Y_i’s (Equation 20.36).
  - Approximate point-wise confidence interval (Equation 20.37).
  - Simultaneous confidence bands (Equation 20.38).
  - Example 20.11: Nonparametric regression for galaxy data.
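A minimal sketch of the Nadaraya-Watson estimator with leave-one-out cross-validation follows. It assumes a Gaussian kernel and synthetic sine data; the leave-one-out residual is computed with the shortcut (Y_i - \hat{r}(x_i)) / (1 - L_{ii}), which is exact for this estimator because its weights are renormalized after dropping a point.

```python
import numpy as np

def nw_weights(x_eval, x, h):
    """Gaussian kernel weights, normalised so each row sums to 1."""
    k = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

def nadaraya_watson(x_eval, x, y, h):
    """Kernel-weighted average of the Y_i at each evaluation point."""
    return nw_weights(x_eval, x, h) @ y

def loo_cv(x, y, h):
    """Leave-one-out score using (Y_i - r_hat(x_i)) / (1 - L_ii)."""
    L = nw_weights(x, x, h)
    resid = (y - L @ y) / (1.0 - np.diag(L))
    return np.mean(resid ** 2)

rng = np.random.default_rng(2)
n = 200
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)   # hypothetical data

bandwidths = np.linspace(0.02, 0.3, 15)                # assumed search grid
scores = [loo_cv(x, y, h) for h in bandwidths]
h_best = bandwidths[int(np.argmin(scores))]
r_hat = nadaraya_watson(x, x, y, h_best)
```

The shortcut avoids refitting n times, which is why the score can be scanned cheaply over a grid of bandwidths.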
21 Smoothing Using Orthogonal Functions
- Introduction to nonparametric curve estimation using orthogonal functions.
- Brief introduction to the theory of orthogonal functions.
- Application to density estimation and regression.
- 21.1 Orthogonal Functions and L^2 Spaces
  - Definition of L^2(a, b) as the space of square integrable functions on [a, b], where \int_a^b f(x)^2 dx < \infty.
  - Inner product of two functions f, g \in L^2(a, b): \langle f, g \rangle = \int_a^b f(x)g(x) dx.
  - Norm of a function: ||f|| = \sqrt{\langle f, f \rangle} = (\int_a^b f(x)^2 dx)^{1/2}.
  - Orthogonal functions: \langle \phi_j, \phi_k \rangle = 0 for j \neq k.
  - Orthonormal functions: \langle \phi_j, \phi_k \rangle = \delta_{jk} (Kronecker delta, 1 if j=k, 0 otherwise).
  - Any f \in L^2(a, b) can be expanded as f(x) = \sum_{j=1}^\infty \beta_j \phi_j(x), where \beta_j = \langle f, \phi_j \rangle = \int_a^b f(x) \phi_j(x) dx (Equation 21.3).
  - Parseval’s relation: ||f||^2 = \sum_{j=1}^\infty \beta_j^2 (Equation 21.6).
  - Example 21.1: Cosine basis on [0, 1]: \phi_0(x) = 1, \phi_j(x) = \sqrt{2} \cos(j \pi x) for j \ge 1.
  - Example 21.2: Doppler function f(x) = \sqrt{x(1-x)} \sin(\frac{2.1 \pi}{x + 0.05}) and its approximation using a cosine basis f_J(x) = \sum_{j=1}^J \beta_j \phi_j(x).
  - Example 21.3: Legendre polynomials P_j(x) = \frac{1}{2^j j!} \frac{d^j}{dx^j} (x^2 - 1)^j on [-1, 1].
  - If f is smooth, coefficients \beta_j will be small for large j.
- 21.2 Density Estimation
  - Let X_1, \ldots, X_n be iid from a distribution on [0, 1] with density f \in L^2.
  - f(x) = \sum_{j=0}^\infty \beta_j \phi_j(x) with orthonormal basis \phi_0, \phi_1, \ldots.
  - Estimator for coefficients: \hat{\beta}_j = \frac{1}{n} \sum_{i=1}^n \phi_j(X_i) (Equation 21.11).
  - Theorem 21.4: Mean and variance of \hat{\beta}_j:
    - E(\hat{\beta}_j) = \beta_j (unbiased estimator).
    - Var(\hat{\beta}_j) = \sigma_j^2 / n, where \sigma_j^2 = Var(\phi_j(X_i)) = \int (\phi_j(x) - \beta_j)^2 f(x) dx (Equation 21.12).
  - Orthogonal series density estimator: \hat{f}_J(x) = \sum_{j=0}^J \hat{\beta}_j \phi_j(x), where J is a truncation parameter.
  - Risk of the estimator (mean integrated squared error, MISE).
  - Theorem 21.5: MISE(\hat{f}_J) = \sum_{j=0}^J \frac{\sigma_j^2}{n} + \sum_{j=J+1}^\infty \beta_j^2, with \sigma_j^2 = Var(\phi_j(X_i)).
  - First term is variance, second term is squared bias.
  - Optimal J balances bias and variance.
  - Cross-validation for choosing J.
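The orthogonal series density estimator can be sketched with the cosine basis of Example 21.1. The Beta sample and the fixed truncation J = 8 are assumptions for illustration; in practice J would be chosen by cross-validation.

```python
import numpy as np

def cosine_basis(j, x):
    """phi_0(x) = 1, phi_j(x) = sqrt(2) cos(j pi x) on [0, 1]."""
    x = np.asarray(x, dtype=float)
    if j == 0:
        return np.ones_like(x)
    return np.sqrt(2.0) * np.cos(j * np.pi * x)

def series_density(x, sample, J):
    """Return f_hat_J(x) and the coefficients beta_hat_0, ..., beta_hat_J."""
    # beta_hat_j is the sample mean of phi_j(X_i)
    beta_hat = np.array([cosine_basis(j, sample).mean() for j in range(J + 1)])
    f_hat = sum(b * cosine_basis(j, x) for j, b in enumerate(beta_hat))
    return f_hat, beta_hat

rng = np.random.default_rng(4)
sample = rng.beta(2.0, 3.0, size=1000)        # hypothetical data on [0, 1]
grid = (np.arange(1000) + 0.5) / 1000         # midpoints of a fine grid
f_hat, beta_hat = series_density(grid, sample, J=8)
```

Because \phi_0 = 1, the first coefficient is always exactly 1 and the estimate integrates to one, although it is not guaranteed to be nonnegative everywhere.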
- 21.3 Regression
  - Regression model: Y_i = r(x_i) + \epsilon_i, i = 1, \ldots, n, with E(\epsilon_i) = 0, Var(\epsilon_i) = \sigma^2, and x_i = i/n (initially).
  - r(x) = \sum_{j=1}^\infty \beta_j \phi_j(x) with \beta_j = \int_0^1 r(x) \phi_j(x) dx (Equation 21.22).
  - Estimator for coefficients: \hat{\beta}_j = \frac{1}{n} \sum_{i=1}^n Y_i \phi_j(x_i), j = 1, 2, \ldots (Equation 21.23).
  - \hat{\beta}_j is approximately Normally distributed by the central limit theorem.
  - Orthogonal series regression estimator \hat{r}_J(x) = \sum_{j=1}^J \hat{\beta}_j \phi_j(x).
  - Estimating \sigma^2: \hat{\sigma}^2 = \frac{n}{k} \sum_{j=n-k+1}^n \hat{\beta}_j^2 with k \approx n/4 (Equation 21.28).
  - Risk estimation \hat{R}(J) = \frac{J \hat{\sigma}^2}{n} + \sum_{j=J+1}^n (\hat{\beta}_j^2 - \frac{\hat{\sigma}^2}{n})_+ (Equation 21.29), where (a)_+ = \max(a, 0).
  - Choose \hat{J} to minimize \hat{R}(J).
  - Example 21.10: Doppler test function and its estimation using orthogonal series regression.
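The regression recipe above (coefficients, variance estimate from the last k coefficients, risk estimate, minimizing J) can be sketched as follows. The basis indexing (\phi_1 = 1 and \phi_j(x) = \sqrt{2}\cos((j-1)\pi x) for j \ge 2), the noise level, and n = 512 are assumptions made so that the sums can start at j = 1 as in the outline.

```python
import numpy as np

def doppler(x):
    return np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))

n = 512
x = np.arange(1, n + 1) / n                       # x_i = i / n
rng = np.random.default_rng(3)
y = doppler(x) + 0.1 * rng.normal(size=n)         # hypothetical noise level

# Assumed basis: phi_1 = 1, phi_j(x) = sqrt(2) cos((j - 1) pi x) for j >= 2
j = np.arange(1, n + 1)
Phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(j - 1, x))
Phi[0] = 1.0
beta_hat = Phi @ y / n                            # beta_hat_j for j = 1..n

k = n // 4                                        # k is roughly n / 4
sigma2_hat = (n / k) * np.sum(beta_hat[n - k:] ** 2)

# R_hat(J) = J sigma2_hat / n + sum_{j > J} (beta_hat_j^2 - sigma2_hat / n)_+
excess = np.maximum(beta_hat ** 2 - sigma2_hat / n, 0.0)
tail = np.concatenate([np.cumsum(excess[::-1])[::-1][1:], [0.0]])
risk = j * sigma2_hat / n + tail
J_hat = int(j[np.argmin(risk)])

r_hat = beta_hat[:J_hat] @ Phi[:J_hat]            # fitted curve at the x_i
```

The reversed cumulative sum gives every tail sum in one pass, so the whole risk curve is available without a loop over J.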
- 21.4 Wavelets
  - Wavelets as a basis designed for inhomogeneous functions.
  - Allows placing “blips” in small regions without adding wiggles elsewhere.
  - Haar wavelets as an example.
  - Father wavelet (scaling function) \phi(x) = 1 for 0 \le x < 1, 0 otherwise.
  - Mother wavelet \psi(x) = 1 for 0 \le x < 1/2, -1 for 1/2 \le x < 1, 0 otherwise.
  - Rescaled and shifted wavelets: \psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k).
  - Sets of rescaled and shifted mother wavelets at resolution j: W_j = \{\psi_{jk}, k = 0, 1, \ldots, 2^j - 1\}.
  - Theorem 21.13: The set \{\phi, W_0, W_1, W_2, \ldots \} is an orthonormal basis for L^2(0, 1).
  - Expansion of f \in L^2(0, 1): f(x) = \alpha \phi(x) + \sum_{j=0}^\infty \sum_{k=0}^{2^j - 1} \beta_{j,k} \psi_{j,k}(x).
  - Discrete Wavelet Transform (DWT) for Haar wavelets.
  - Wavelet shrinkage for denoising.
  - Universal thresholding for wavelet coefficients.
  - Example 21.14: Estimation of the Doppler function using Haar wavelets and universal thresholding.
- 21.5 Appendix
  - The DWT for Haar Wavelets algorithm.
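A compact sketch of a Haar transform with universal thresholding is given below. It uses an orthonormal vector normalization, soft thresholding, and a median-based noise estimate from the finest level; these are common choices assumed here for illustration, and the scaling conventions may differ from the algorithm in the book's appendix.

```python
import numpy as np

def haar_dwt(y):
    """Orthonormal Haar transform of a vector whose length is a power of 2."""
    approx = np.asarray(y, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))   # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)          # coarser averages
    return approx, details[::-1]                      # coarsest level first

def haar_idwt(approx, details):
    """Invert haar_dwt, rebuilding from the coarsest level up."""
    for d in details:
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx

def denoise(y):
    approx, details = haar_dwt(y)
    sigma_hat = np.median(np.abs(details[-1])) / 0.6745   # finest level
    t = sigma_hat * np.sqrt(2.0 * np.log(len(y)))         # universal threshold
    shrunk = [np.sign(d) * np.maximum(np.abs(d) - t, 0.0) for d in details]
    return haar_idwt(approx, shrunk)

rng = np.random.default_rng(5)
n = 1024
x = np.arange(n) / n
signal = np.where(x < 0.5, 0.0, 1.0)          # hypothetical step function
y = signal + 0.1 * rng.normal(size=n)
smooth = denoise(y)
```

A step function is the friendliest case for Haar wavelets: the jump is captured by a single large coefficient, so thresholding removes almost all the noise without blurring the edge.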
22 Classification
- 22.1 Introduction
  - Definition of classification as supervised learning, predicting a discrete Y from X.
  - Terminology mapping from statistics to computer science:
    - classification \leftrightarrow supervised learning: predicting a discrete Y from X.
    - data \leftrightarrow training sample: (X_1, Y_1), \ldots, (X_n, Y_n).
    - covariates \leftrightarrow features: the X_i’s.
    - classifier \leftrightarrow hypothesis: map h: \mathcal{X} \rightarrow \mathcal{Y}.
    - estimation \leftrightarrow learning: finding a good classifier.
- 22.2 Error Rates and the Bayes Classifier
  - Goal: find a classification rule h that makes accurate predictions.
  - Definitions:
    - Classification rule h: \mathcal{X} \rightarrow \{0, 1\} (for binary case).
    - Loss function L(h) = P(h(X) \neq Y) (probability of misclassification or error rate).
    - Observed error rate \hat{L}_n(h) = \frac{1}{n} \sum_{i=1}^n I(h(X_i) \neq Y_i).
    - Regression function r(x) = E(Y|X = x) = P(Y = 1|X = x) (for Y \in \{0, 1\}).
  - Bayes Classifier h^*(x) = \begin{cases} 1 & \text{if } P(Y = 1|X = x) > P(Y = 0|X = x) \\ 0 & \text{otherwise} \end{cases} (Equation 22.5).
  - Alternative form of Bayes Classifier: h^*(x) = \begin{cases} 1 & \text{if } \pi f_1(x) > (1 - \pi) f_0(x) \\ 0 & \text{otherwise} \end{cases} (Equation 22.6), where \pi = P(Y = 1) and f_k(x) is the density of X given Y = k.
  - Theorem 22.5: The Bayes rule is optimal, L(h^*) \le L(h) for any other classification rule h.
  - Three main approaches to approximate the Bayes rule when unknown quantities are involved:
    - Empirical Risk Minimization: Choose a set of classifiers H and find \hat{h} \in H that minimizes some estimate of L(h).
    - Regression: Find an estimate \hat{r} of the regression function r and define \hat{h}(x) = \begin{cases} 1 & \text{if } \hat{r}(x) > 1/2 \\ 0 & \text{otherwise} \end{cases}.
    - Density Estimation: Estimate f_0 and f_1, let \hat{\pi} = n^{-1} \sum_{i=1}^n Y_i, define \hat{r}(x) = \frac{\hat{\pi} \hat{f}_1(x)}{\hat{\pi} \hat{f}_1(x) + (1 - \hat{\pi}) \hat{f}_0(x)}, and use \hat{h}(x) = \begin{cases} 1 & \text{if } \hat{r}(x) > 1/2 \\ 0 & \text{otherwise} \end{cases}.
  - Theorem 22.6: For Y \in \mathcal{Y} = \{1, \ldots, K\}, the optimal rule is h^*(x) = \arg\max_{y \in \mathcal{Y}} P(Y = y|X = x).
- 22.3 Gaussian and Linear Classifiers
  - Linear discriminant analysis (LDA) assuming f_k(x) \sim N(\mu_k, \Sigma) (common covariance matrix).
  - Bayes rule becomes linear: h(x) = \begin{cases} 1 & \text{if } w^T x \ge c \\ 0 & \text{otherwise} \end{cases}.
  - w = \Sigma^{-1}(\mu_1 - \mu_0) and c = \frac{1}{2} (\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0) - \log(\frac{\pi_1}{\pi_0}).
  - Estimating \mu_k with sample means \bar{X}_k and \Sigma with pooled sample covariance S_W.
  - Fisher’s discriminant analysis: find a linear combination U = w^T X such that the classes are well separated.
  - Optimizing J(w) = \frac{w^T B w}{w^T W w}, where B is between-class scatter and W is within-class scatter.
  - Solution: w \propto S_W^{-1} (\bar{X}_0 - \bar{X}_1); J(w) is unchanged by rescaling w or flipping its sign.
  - Fisher linear discriminant function U = w^T X = (\bar{X}_0 - \bar{X}_1)^T S_W^{-1} X.
  - Midpoint m = \frac{1}{2} (\bar{X}_0 + \bar{X}_1)^T S_W^{-1} (\bar{X}_0 - \bar{X}_1).
  - Fisher’s classification rule h(x) = \begin{cases} 0 & \text{if } w^T x \ge m \\ 1 & \text{if } w^T x < m \end{cases}.
  - Relationship with Bayes linear classifier when \hat{\pi} = 1/2.
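The plug-in version of the linear Bayes rule (w = \Sigma^{-1}(\mu_1 - \mu_0) with the threshold c given above) might look like the sketch below. The two Gaussian clusters are synthetic, and dividing the pooled scatter by n - 2 is one common convention assumed here.

```python
import numpy as np

def fit_lda(X, y):
    """Plug-in LDA: sample means, pooled covariance, class proportion."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    pi1 = y.mean()                                   # estimate of P(Y = 1)
    scatter = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    S = scatter / (len(X) - 2)                       # pooled covariance
    w = np.linalg.solve(S, mu1 - mu0)                # S^{-1} (mu1 - mu0)
    c = 0.5 * (mu1 + mu0) @ w - np.log(pi1 / (1.0 - pi1))
    return w, c

def predict_lda(X, w, c):
    """Classify as 1 when w^T x >= c, else 0."""
    return (X @ w >= c).astype(int)

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),   # hypothetical class 0
               rng.normal(2.0, 1.0, size=(200, 2))])  # hypothetical class 1
y = np.repeat([0, 1], 200)
w, c = fit_lda(X, y)
train_error = np.mean(predict_lda(X, w, c) != y)
```

With equal class sizes the log term vanishes and the rule reduces to comparing the projection w^T x with its value at the midpoint of the two sample means.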
- 22.4 Linear Regression and Logistic Regression
  - Direct approach: estimate r(x) = E(Y|X = x) and classify using \hat{h}(x) = \begin{cases} 1 & \text{if } \hat{r}(x) > 1/2 \\ 0 & \text{otherwise} \end{cases}.
  - Linear regression model Y = X\beta + \epsilon, with \hat{\beta} = (X^T X)^{-1} X^T Y and predicted values \hat{Y} = X\hat{\beta}.
  - Logistic regression model r(x) = P(Y = 1|X = x) = \frac{e^{\beta_0 + \sum_j \beta_j x_j}}{1 + e^{\beta_0 + \sum_j \beta_j x_j}} (Equation 22.24), with \hat{\beta} obtained numerically.
  - Example 22.11: Heart disease data, error rate using logistic regression (.27) and linear regression (.26).
  - Fitting richer logistic regression models by adding interaction terms (Equation 22.25).
  - Bias-variance tradeoff in choosing the complexity of the model.
- 22.5 Relationship Between Logistic Regression and LDA
- 22.6 Density Estimation and Naive Bayes
  - Naive Bayes classifier assumes conditional independence of features given the class label: f(x|Y=k) = \prod_{j=1}^d f_j(x_j|Y=k).
- 22.7 Trees
  - Classification trees partition the feature space into rectangular regions.
  - Prediction is constant within each region.
  - Recursive partitioning based on impurity measures (e.g., misclassification error, Gini index, cross-entropy).
  - Stopping criteria to prevent overfitting.
  - Pruning to simplify the tree.
  - Example 22.13: Classification tree for heart disease data yielding a misclassification rate of .21. Tree using only tobacco and age has a misclassification rate of .29 (Figure 22.4).
- 22.8 Assessing Error Rates and Choosing a Good Classifier
  - Observed classification error can underestimate the true error rate.
  - Example 22.14: Sequence of logistic regression models for heart disease data, showing observed error decreasing with complexity, while cross-validation error shows a bias-variance tradeoff (Figure 22.5).
  - Cross-Validation: Split data into training set T and validation set V. Train classifier on T, estimate error on V. k-fold cross-validation.
  - Probability Inequalities: Using theorems like Hoeffding’s inequality to bound the difference between observed error and true error.
  - Theorem 22.16 (Uniform Convergence): P(\max_{h \in H} |\hat{L}_n(h) - L(h)| > \epsilon) \le 2me^{-2n\epsilon^2} for a finite hypothesis space H of size m.
  - Theorem 22.17: \hat{L}_n(\hat{h}) \pm \epsilon is a 1 - \alpha confidence interval for L(\hat{h}), with \epsilon = \sqrt{\frac{1}{2n} \log(\frac{2m}{\alpha})}.
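To get a feel for the width of the interval in Theorem 22.17, the half-width \epsilon can be computed directly. The values n = 1000, m = 50, and \alpha = 0.05 are arbitrary illustrative inputs.

```python
import math

def hoeffding_epsilon(n, m, alpha):
    """Half-width solving 2 m exp(-2 n eps^2) = alpha for eps."""
    return math.sqrt(math.log(2.0 * m / alpha) / (2.0 * n))

eps = hoeffding_epsilon(n=1000, m=50, alpha=0.05)   # roughly 0.06
```

The dependence on m is only logarithmic, so the bound stays useful for fairly large finite hypothesis classes, while halving \epsilon requires four times as much data.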
- 22.9 Support Vector Machines
  - Class of linear classifiers for binary Y \in \{-1, +1\}.
  - Linear classifier h(x) = \text{sign}(H(x)), where H(x) = a_0 + \sum_{i=1}^d a_i x_i and \text{sign}(z) = -1 if z < 0, 0 if z = 0, 1 if z > 0.
  - Linearly separable data: there exists a hyperplane that perfectly separates the classes.
  - Lemma 22.27: The data can be separated by some hyperplane if and only if there exists a hyperplane H(x) = a_0 + \sum_{i=1}^d a_i x_i such that Y_i H(x_i) \ge 1 for all i.
  - Maximum margin hyperplane: separates classes and maximizes the distance to the closest point (margin).
  - Support vectors: points on the boundary of the margin.
  - Theorem 22.28: The maximum margin hyperplane \hat{H}(x) = \hat{a}_0 + \sum_{i=1}^d \hat{a}_i x_i is the solution to a constrained optimization problem.
- 22.10 Kernelization
  - Mapping covariates to a higher-dimensional space \phi(x) to create non-linear classifiers in the original space while using linear classifiers in the higher space.
  - Example with an ellipse separable in a higher dimension.
- 22.11 Other Classifiers
23 Probability Redux: Stochastic Processes
- 23.1 Introduction
- 23.2 Markov Chains
  - Definition of a Markov chain.
  - Transition matrix P(i, j) = P(X_{n+1} = j|X_n = i) = p_{ij}.
  - n-step transition probabilities p_{ij}(n) = P(X_n = j|X_0 = i).
  - Chapman-Kolmogorov equations.
  - Distribution at time n: \mu_n(j) = P(X_n = j).
  - Lemma 23.10: The marginal probabilities are given by \mu_n = \mu_0 P^n.
  - Stationary distribution \pi where \pi P = \pi.
  - Ergodic Markov chain.
  - Theorem 23.26: If \pi satisfies detailed balance (\pi_i p_{ij} = p_{ji} \pi_j), then \pi is a stationary distribution.
  - Example: Genotype of descendants as a Markov chain.
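The relations \mu_n = \mu_0 P^n, \pi P = \pi, and detailed balance can be checked numerically on a toy chain. The two-state transition matrix below is a made-up example, not one from the book.

```python
import numpy as np

P = np.array([[0.9, 0.1],          # hypothetical transition matrix
              [0.5, 0.5]])
mu0 = np.array([0.0, 1.0])         # start in state 1 with probability 1

# Marginal distribution after 50 steps: mu_n = mu_0 P^n
mu_50 = mu0 @ np.linalg.matrix_power(P, 50)

# Stationary distribution: left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = v / v.sum()

# Detailed balance: pi_i p_ij == pi_j p_ji for all i, j
flow = pi[:, None] * P
detailed_balance = np.allclose(flow, flow.T)
```

Here \pi = (5/6, 1/6), and the marginal distribution after 50 steps is already indistinguishable from it regardless of the starting state.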
- 23.3 Poisson Processes
  - Definition of a Poisson process.
  - Waiting times in a Poisson process.
- Summary of Terminology
  - Transition matrix: P(i, j) = P(X_{n+1} = j|X_n = i) = p_{ij}.
  - n-step transition probability: p_{ij}(n) = P(X_n = j|X_0 = i).
  - Marginal probability at time n: \mu_n(j) = P(X_n = j).
  - Stationary distribution \pi: \pi P = \pi.
  - Detailed balance: \pi_i p_{ij} = p_{ji} \pi_j.
- Example (Markov chain Monte Carlo): Brief description of MCMC, using a proposal distribution W \sim N(X_i, b^2) and acceptance probability r = \min \{ \frac{g(W)}{g(X_i)}, 1 \} to generate a Markov chain with stationary distribution f(x) \propto g(x).
24 Simulation Methods
- 24.1 Bayesian Inference Revisited
  - Mention of using simulation methods in Bayesian inference.
- 24.2 Basic Monte Carlo Integration
  - Estimating integrals using random sampling.
  - Estimate of \theta = E(g(X)) = \int g(x) f(x) dx by \hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n g(X_i), where X_i \sim f.
  - Central Limit Theorem for Monte Carlo integration: \sqrt{n}(\hat{\theta}_n - \theta) / \sigma \xrightarrow{d} N(0, 1), where \sigma^2 = Var(g(X)).
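A small sketch of basic Monte Carlo integration with a CLT-based standard error: the integrand g(x) = x^3 with X uniform on (0, 1) is an assumed toy example whose exact value is 1/4.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
x = rng.uniform(size=n)            # X_i drawn from f = Uniform(0, 1)
g = x ** 3                         # toy integrand; the exact integral is 1/4

theta_hat = g.mean()               # (1/n) * sum_i g(X_i)
se = g.std(ddof=1) / np.sqrt(n)    # estimated standard error, sigma / sqrt(n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)   # approximate 95% interval
```

The standard error shrinks like 1/\sqrt{n} whatever the dimension of the integral, which is the main attraction of the method.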
- 24.3 Importance Sampling
  - Reducing variance by sampling from an importance sampling distribution h(x).
  - Estimator: \hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n \frac{g(X_i) f(X_i)}{h(X_i)}, where X_i \sim h.
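Importance sampling is most useful for rare events. The sketch below estimates P(Z > 3) for a standard Normal Z by sampling from h = N(4, 1); the choice of h and the sample size are assumptions for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
x = rng.normal(loc=4.0, scale=1.0, size=n)    # X_i drawn from h = N(4, 1)

# f(x) / h(x) for f = N(0, 1) and h = N(4, 1) simplifies to exp(8 - 4x)
weights = np.exp(8.0 - 4.0 * x)
theta_hat = np.mean((x > 3.0) * weights)      # g(x) = I(x > 3)

exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0)) # 1 - Phi(3), for comparison
```

Sampling directly from N(0, 1) would put only about 135 of the 100,000 draws in the region of interest, whereas most draws from h land there and are reweighted.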
- 24.4 MCMC Part I: The Metropolis–Hastings Algorithm
  - Introduction to Markov chain Monte Carlo (MCMC) for sampling from a target distribution f(x).
  - Proposal distribution q(y|x).
  - Acceptance probability r = \min \{ \frac{f(y) q(x\mid y)}{f(x) q(y \mid x)}, 1 \}.
  - Algorithm:
    - Start with X_0.
    - For n \ge 0, given X_n, generate Y \sim q(\cdot|X_n).
    - Calculate acceptance probability r.
    - Set X_{n+1} = Y with probability r, and X_{n+1} = X_n with probability 1-r.
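The algorithm above can be sketched generically, working on the log scale for numerical stability. The target (a standard Normal known only up to a constant), the random-walk proposal, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def metropolis_hastings(log_f, propose, log_q, x0, n_steps, rng):
    """Run Metropolis-Hastings; log_q(a, b) is log q(a | b) up to a constant."""
    chain = np.empty(n_steps + 1)
    x = x0
    chain[0] = x
    for n in range(n_steps):
        y = propose(x, rng)                              # Y ~ q(. | X_n)
        # log of r = min{ f(y) q(x|y) / (f(x) q(y|x)), 1 }
        log_r = min(0.0, log_f(y) + log_q(x, y) - log_f(x) - log_q(y, x))
        if np.log(rng.uniform()) < log_r:                # accept w.p. r
            x = y
        chain[n + 1] = x
    return chain

# Hypothetical target and proposal
log_f = lambda x: -0.5 * x ** 2                          # unnormalised N(0, 1)
propose = lambda x, rng: rng.normal(x, 1.0)              # random walk, b = 1
log_q = lambda a, b: -0.5 * (a - b) ** 2                 # symmetric in a, b

rng = np.random.default_rng(9)
chain = metropolis_hastings(log_f, propose, log_q, 0.0, 20_000, rng)
draws = chain[2_000:]                                    # drop burn-in
```

For a symmetric proposal such as this one the q terms cancel, and r reduces to the ratio of target values; the general form matters only for asymmetric proposals.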
- 24.5 MCMC Part II: Different Flavors
  - Gibbs Sampler: Sampling each variable from its full conditional distribution.
    - Example with a bivariate Normal distribution.
    - Example involving drawing \mu and \psi_i iteratively from their conditional distributions.
    - f(\psi_i|rest) \propto \exp \{ -\frac{1}{2} (\psi_i - \mu)^2 \} \exp \{ -\frac{1}{2\sigma^2_i} (Z_i - \psi_i)^2 \}.
    - \psi_i|rest \sim N(e_i, d^2_i), where e_i = \frac{Z_i/\sigma^2_i + \mu}{1 + 1/\sigma^2_i} and d^2_i = \frac{1}{1 + 1/\sigma^2_i}.
  - Metropolis within Gibbs: Combining Metropolis-Hastings steps within a Gibbs sampling framework.
    - Algorithm involving proposals and acceptance probabilities for X_n and Y_n.
    - Acceptance for X: r = \min \{ \frac{f(Z, Y_n) q(X_n|Z)}{f(X_n, Y_n) q(Z|X_n)}, 1 \}.
    - Acceptance for Y: r = \min \{ \frac{f(X_{n+1}, Z) \tilde{q}(Y_n|Z)}{f(X_{n+1}, Y_n) \tilde{q}(Z|Y_n)}, 1 \}.
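The Gibbs sampler for a bivariate Normal is short enough to write out. This sketch assumes standard Normal margins with correlation \rho, so that each full conditional is N(\rho \cdot \text{other}, 1 - \rho^2); the value \rho = 0.6 and the chain length are arbitrary.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_steps, rng, x0=0.0, y0=0.0):
    """Alternate draws from X | Y and Y | X for a standard bivariate Normal."""
    draws = np.empty((n_steps, 2))
    x, y = x0, y0
    sd = np.sqrt(1.0 - rho ** 2)          # conditional standard deviation
    for t in range(n_steps):
        x = rng.normal(rho * y, sd)       # X | Y = y ~ N(rho y, 1 - rho^2)
        y = rng.normal(rho * x, sd)       # Y | X = x ~ N(rho x, 1 - rho^2)
        draws[t] = x, y
    return draws

rng = np.random.default_rng(10)
draws = gibbs_bivariate_normal(rho=0.6, n_steps=20_000, rng=rng)
sample_corr = np.corrcoef(draws[1_000:].T)[0, 1]   # after burn-in
```

Every draw is accepted, which is what distinguishes Gibbs sampling from Metropolis-Hastings; the price is that the full conditionals must be available in closed form.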

Frequently Asked Questions

What are probability inequalities and why are they important in statistics?

Probability inequalities, as discussed in Chapter 4, provide bounds on the probability of certain events occurring. These inequalities, such as Markov’s inequality, Chebyshev’s inequality, and Jensen’s inequality, are crucial because they allow us to make statements about the likelihood of events even when the exact probability distribution is unknown or complex. They are fundamental tools for understanding the concentration of probability mass and for deriving other statistical results, particularly in the study of the behavior of random variables.

What are the different modes of convergence for random variables, as outlined in Chapter 5?

Chapter 5 details several ways in which a sequence of random variables can converge to a limiting random variable or a constant. The main types of convergence discussed are:

- Convergence in probability: The probability that the difference between the random variable and its limit exceeds any small positive number approaches zero as the sequence progresses.
- Almost sure convergence (or convergence with probability 1): The sequence of random variables converges pointwise to its limit for all outcomes except for a set of probability zero.
- Convergence in mean square: The expected squared difference between the random variable and its limit approaches zero.
- Convergence in distribution (or weak convergence): The cumulative distribution functions of the sequence of random variables converge pointwise to the cumulative distribution function of the limit at all points where the limit distribution function is continuous.

These different modes of convergence have distinct implications and are used in various contexts within statistical theory.

What are the Law of Large Numbers and the Central Limit Theorem, and why are they central to statistical inference?

The Law of Large Numbers (LLN), presented in Chapter 5, essentially states that the sample average of a large number of independent and identically distributed (i.i.d.) random variables will converge to the true population mean. This theorem provides the theoretical justification for using sample means to estimate population means.

The Central Limit Theorem (CLT), also in Chapter 5, is another cornerstone of statistics. It states that the sum (or average) of a large number of i.i.d. random variables will be approximately normally distributed, regardless of the shape of the original distribution, provided the variables have finite variance. The CLT is vital because it allows us to make probabilistic statements and construct statistical inferences (like confidence intervals and hypothesis tests) about population parameters, even when the underlying distribution is not normal.

What are the primary goals of point estimation, confidence sets, and hypothesis testing, as introduced in Chapter 6?

Chapter 6 lays the groundwork for statistical inference.

- Point estimation aims to find a single “best” value to estimate an unknown population parameter based on observed sample data.
- Confidence sets (specifically confidence intervals in the one-dimensional case) provide a range of plausible values for an unknown population parameter, along with a measure of confidence that the true parameter lies within that range.
- Hypothesis testing is a formal procedure used to decide between two competing claims (hypotheses) about a population parameter, based on evidence from a sample. It involves formulating a null hypothesis and an alternative hypothesis, and then using data to determine whether there is enough evidence to reject the null hypothesis.

How can we estimate the cumulative distribution function (cdf), and what are statistical functionals, as discussed in Chapter 7?

Chapter 7 introduces methods for understanding the distribution of data. The empirical distribution function (EDF) is a non-parametric estimator of the true cdf, based directly on the observed sample. It assigns a probability of 1/n to each observed data point and is a step function that jumps at the values of the data points.

Statistical functionals are functions of the cdf. Many important statistical parameters and estimators can be expressed as functionals of the underlying distribution. Examples include the mean, variance, quantiles, and correlation. Studying statistical functionals provides a unified way to analyze the properties of various estimators.

What is the bootstrap method and how can it be used for variance estimation and constructing confidence intervals, as detailed in Chapter 8?

The bootstrap is a powerful resampling technique, explained in Chapter 8, used to estimate the sampling distribution of an estimator. It works by repeatedly drawing random samples with replacement from the original data to create many “bootstrap samples.” By calculating the estimator of interest for each bootstrap sample, we can approximate the variance of the estimator and construct bootstrap confidence intervals. This method is particularly useful when analytical methods for calculating variance or constructing confidence intervals are difficult or unavailable, especially for complex estimators or small sample sizes. The jackknife, a related technique for bias and variance estimation, is also mentioned in the appendix of this chapter.
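As a rough sketch of the procedure just described, the code below bootstraps the sample median. The exponential data, B = 2000 replications, and the percentile interval are illustrative assumptions; other interval constructions exist.

```python
import numpy as np

def bootstrap(data, statistic, B, rng):
    """Return B bootstrap replications of statistic(data)."""
    n = len(data)
    idx = rng.integers(0, n, size=(B, n))      # resample indices with replacement
    return np.array([statistic(data[row]) for row in idx])

rng = np.random.default_rng(11)
data = rng.exponential(scale=2.0, size=200)    # hypothetical sample
reps = bootstrap(data, np.median, B=2000, rng=rng)

se_boot = reps.std(ddof=1)                     # bootstrap standard error
ci = np.percentile(reps, [2.5, 97.5])          # percentile 95% interval
```

The median is a natural test case because its standard error has no simple closed form, unlike the mean, where the bootstrap answer can be compared with s / \sqrt{n}.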

What are the method of moments and maximum likelihood estimation, and what are some important properties of maximum likelihood estimators, as covered in Chapter 9?

Chapter 9 focuses on parametric inference. The method of moments is a technique for estimating parameters of a probability distribution by equating sample moments (like the sample mean and sample variance) to their corresponding population moments (expressed as functions of the parameters) and solving for the parameters.

Maximum likelihood estimation (MLE) is another widely used method. It involves finding the parameter values that maximize the likelihood function, which represents the probability of observing the given sample data under different possible parameter values. MLEs often have desirable asymptotic properties, such as consistency (converging to the true parameter as the sample size increases) and asymptotic normality. The delta method, also mentioned, is a technique for approximating the distribution of a function of an estimator, often used in conjunction with MLEs.

What are the Wald test and p-values, as introduced in Chapter 10 in the context of hypothesis testing?

Chapter 10 delves into hypothesis testing. The Wald test is a statistical test used to assess hypotheses about parameters in statistical models, often based on the maximum likelihood estimator. It compares the estimated parameter value to the hypothesized value, taking into account the estimated variance of the estimator.

A p-value is a measure of the statistical significance of an observed result in hypothesis testing. It is defined as the probability of obtaining test results at least as extreme as the results actually observed, assuming that the null hypothesis is true. A small p-value provides evidence against the null hypothesis, leading to its rejection if the p-value is below a predetermined significance level (alpha).
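A minimal sketch of a Wald test and its two-sided p-value, for the hypothetical case of testing a Bernoulli success probability p = p_0; the counts are invented for illustration.

```python
import math

def wald_test_bernoulli(successes, n, p0):
    """Wald statistic and two-sided p-value for H0: p = p0."""
    p_hat = successes / n                          # MLE of p
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)      # estimated standard error
    w = (p_hat - p0) / se
    p_value = math.erfc(abs(w) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|w|))
    return w, p_value

w, p_value = wald_test_bernoulli(successes=60, n=100, p0=0.5)
```

With 60 successes in 100 trials the statistic is a little above 2 and the p-value is about 0.04, so the null hypothesis would be rejected at the 5% level but not at the 1% level.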

Citation

BibTeX citation:

@online{bochman,
  author = {Bochman, Oren},
  title = {📖 {All} of {Statistics}},
  url = {https://orenbochman.github.io/reviews/2004/all-of-statisitcs/},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. n.d. “📖 All of Statistics.” https://orenbochman.github.io/reviews/2004/all-of-statisitcs/.