All of nonparametric statistics – Oren Bochman’s Blog

This book is Wasserman’s follow-up to “All of Statistics” and focuses on nonparametric methods. The first book already covered several nonparametric methods, such as wavelets, but this one goes into greater detail and covers a number of additional topics.

TL;DR - Too Long; Didn’t Read about Nonparametric Statistics

“Nonparametric statistics” is a bit of a misnomer: these methods do use parameters, but the number of parameters is not fixed and can grow as needed. Consider a problem like species classification, where we are likely to keep having to add new classes as we collect more data.

Here is a lighthearted deep dive into the book:

Glossary

This book uses a lot of heavy terminology, so let us unpack it to make the presentation easier to follow. The glossary is still a little tricky in places; sorry about that, I am still working on simplifying it.

Probability theory

These definitions are from probability theory and should be quite familiar to most readers.

\sigma-field: A collection \mathcal{A} of events that is suitable as the domain of a probability measure. More formally, a class of events \mathcal{A} such that:

\begin{aligned} \emptyset &\in \mathcal{A} && \text{(a null event)} \\ A &\in \mathcal{A} \implies A^c \in \mathcal{A} && \text{(events have complements)} \\ A_1, A_2, \ldots &\in \mathcal{A} \implies \bigcup_{i=1}^\infty A_i \in \mathcal{A} && \text{(closure under countable unions)} \end{aligned}

Probability measure: A function that assigns probabilities to the events in a \sigma-field in a way that is consistent with the axioms of probability. More formally, a function P defined on a \sigma-field \mathcal{A} such that P(A) \geq 0 for all A \in \mathcal{A}, P(\Omega) = 1, and if A_1, A_2, \ldots \in \mathcal{A} are disjoint then P\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i).
Probability space: The triple (\Omega, \mathcal{A}, P) where \Omega is the sample space, \mathcal{A} is a \sigma-field of events, and P is a probability measure.
Random variable: A map X : \Omega \to \mathbb{R} such that, for every real x, \{\omega \in \Omega : X(\omega) \leq x\} \in \mathcal{A}.
Mean of a function a: Depending on whether X is continuous or discrete, we have

\mathbb{E}(a(X)) = \int a(x) \, dF(x) \equiv \begin{cases} \int a(x) f(x) \, dx & \text{continuous case} \\ \sum_j a(x_j) f(x_j) & \text{discrete case} \end{cases}

Variance of a random variable X: A measure of the spread of a distribution defined as:

\mathbb{V}(X) = \mathbb{E}((X - \mathbb{E}(X))^2).

Mean squared error (MSE): A measure of the quality of an estimator \hat{\theta}_n, which decomposes as \text{mse} = \text{bias}^2(\hat{\theta}_n) + \mathbb{V}(\hat{\theta}_n). For curve estimation it is also written as the risk R(f(x), \hat{f}_n(x)) = \mathbb{E}(L(f(x), \hat{f}_n(x))), where L is squared error loss.
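
The decomposition mse = bias^2 + variance can be checked numerically. The following is a minimal sketch, not taken from the book: the shrunken sample mean, the true value `theta`, and the simulation sizes are all assumptions chosen for illustration. When the variance is computed with divisor equal to the number of replications, the identity holds exactly for the simulated moments, not just approximately.

```python
import random

def mse_decomposition(estimates, theta):
    """Return (mse, bias, variance) for a list of estimates of the true value theta."""
    b = len(estimates)
    mean_est = sum(estimates) / b
    mse = sum((t - theta) ** 2 for t in estimates) / b
    bias = mean_est - theta
    variance = sum((t - mean_est) ** 2 for t in estimates) / b
    return mse, bias, variance

rng = random.Random(0)   # fixed seed so the run is repeatable
theta = 2.0              # true mean (assumed for the simulation)
n = 20                   # observations per simulated sample

# A deliberately biased estimator: shrink the sample mean toward zero.
estimates = []
for _ in range(2000):
    sample = [rng.gauss(theta, 1.0) for _ in range(n)]
    estimates.append(0.9 * sum(sample) / n)

mse, bias, variance = mse_decomposition(estimates, theta)
print(f"mse = {mse:.4f}, bias^2 + variance = {bias ** 2 + variance:.4f}")
```
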

Functional analysis

The following come from functional analysis.

Linear functional (of F): A functional of the form \int a(x) dF(x).
Sobolev space of order m (W^{(m)}): A vector space of functions equipped with a norm, defined as \{ f \in L_2(0, 1) : D^m f \in L_2(0, 1) \} where D^m f is the mth weak derivative of f.
Sobolev space of order m and radius c (W^{(m, c)}): The set \{ f : f \in W^{(m)}, \|D^m f \|^2 \leq c^2 \}.
Periodic Sobolev class (\tilde{W}^{(m, c)}): \{ f \in W^{(m, c)} : D^j f(0) = D^j f(1), j = 0, \ldots, m - 1 \}.
Besov space (B_{p,q}^\xi): Besov space generalizes Sobolev spaces and characterizes smoothness in terms of differences.
Ellipsoid (\Theta): \{ \theta : \sum_{j=1}^\infty a_j^2 \theta_j^2 \leq c^2 \} where a_j is a sequence of numbers such that a_j \to \infty as j \to \infty.
Gâteaux derivative (of T at F in the direction G): L_F(G) = \lim_{\epsilon \to 0} \frac{T((1 - \epsilon)F + \epsilon G) - T(F)}{\epsilon}.
Hadamard differentiable (at F): There exists a linear functional L_F on a space \mathcal{D} such that for any \epsilon_n \to 0 and \{D, D_1, D_2, \ldots\} \subset \mathcal{D} such that d(D_n, D) \to 0 and F + \epsilon_n D_n \in \mathcal{F}, \lim_{n \to \infty} \left(\frac{T(F + \epsilon_n D_n) - T(F)}{\epsilon_n} - L_F(D_n)\right) = 0.
Linear smoother (estimator \hat{r}_n of r): For each x, there exists a vector \mathbf{l}(x) = (l_1(x), \ldots, l_n(x))^T such that \hat{r}_n(x) = \sum_{i=1}^n l_i(x) Y_i.
Overcomplete dictionary: A set of basis functions where the number of basis functions m is greater than the number of observations n.

Estimation theory

Confidence set: A generalization of confidence intervals: a set C_n of possible values of a quantity of interest \theta, which depends on the data X_1, \ldots, X_n.
Empirical probability distribution (\hat{P}_n(A)): \hat{P}_n(A) = \frac{\text{number of } X_i \in A}{n}.
Empirical influence function (\hat{L}(x)): \hat{L}(x) = L_{\hat{F}_n}(x).
Histogram estimator (\hat{f}_n(x)): \hat{f}_n(x) = \sum_{j=1}^m \frac{\hat{p}_j}{h} I(x \in B_j) where h = 1/m is the binwidth, B_1, \ldots, B_m are the bins, and \hat{p}_j is the proportion of observations in bin B_j.
Cross-validation estimator of risk (\hat{J}(h)): \hat{J}(h) = \int \hat{f}_n^2(x) dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_n^{(-i)}(X_i).
Leave-one-out cross-validation score (cv = \hat{R}(h)): \hat{R}(h) = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{r}^{(-i)}(x_i))^2 where \hat{r}^{(-i)} is the estimator obtained by omitting the ith pair (x_i, Y_i).
Plug-in estimator (of \theta = T(F)): Defined by \hat{\theta}_n = T(\hat{F}_n).
Influence function (L_F(x)): L_F(x) = \lim_{\epsilon \to 0} \frac{T((1 - \epsilon)F + \epsilon \delta_x) - T(F)}{\epsilon} where \delta_x is a point mass at x.
Minimax risk: \inf_{\hat{\theta}_n} \sup_{\theta \in \Theta} E_\theta [L(\hat{\theta}_n, \theta)] where the infimum is over all estimators \hat{\theta}_n and the supremum is over a class of parameters \Theta.
Linear shrinkage estimator (in Normal means problem): \hat{\theta} = bZ = (bZ_1, \ldots, bZ_n).
Thresholding: A nonlinear shrinkage method used in wavelet regression where small wavelet coefficients are set to zero.
Hard threshold estimator: \hat{\theta}_i = Z_i I(|Z_i| > \lambda).
Hard thresholding (estimator \hat{\beta}_{jk}): takes the following form

\hat{\beta}_{jk} = \begin{cases} 0 & \text{if } |D_{jk}| < \lambda \\ D_{jk} & \text{if } |D_{jk}| \geq \lambda \end{cases}

Soft threshold estimator: \hat{\theta}_i = \text{sign}(Z_i)(|Z_i| - \lambda)_+.
Soft thresholding (estimator \hat{\beta}_{jk}): \hat{\beta}_{jk} = \text{sign}(D_{jk})(|D_{jk}| - \lambda)_+.
Universal threshold (\lambda): \hat{\sigma} \sqrt{\frac{2 \log n}{n}}.
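
The two thresholding rules and the universal threshold defined above can be sketched in a few lines. This is only an illustration: the coefficient values and the threshold `lam = 1.0` are made up, and the function names are my own.

```python
import math

def hard_threshold(z, lam):
    """Keep z unchanged if |z| > lam, otherwise set it to zero."""
    return z if abs(z) > lam else 0.0

def soft_threshold(z, lam):
    """Shrink z toward zero by lam: sign(z) * (|z| - lam)_+."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def universal_threshold(sigma_hat, n):
    """Universal threshold sigma_hat * sqrt(2 log(n) / n)."""
    return sigma_hat * math.sqrt(2.0 * math.log(n) / n)

coeffs = [-3.0, -0.4, 0.1, 0.9, 2.5]   # made-up coefficients
lam = 1.0                              # made-up threshold
hard = [hard_threshold(z, lam) for z in coeffs]
soft = [soft_threshold(z, lam) for z in coeffs]
print("hard:", hard)
print("soft:", soft)
print("universal threshold for sigma_hat=2, n=100:", universal_threshold(2.0, 100))
```

Hard thresholding keeps the surviving coefficients at full size, while soft thresholding also pulls them toward zero by the threshold.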

Nonparametrics

Nonparametric delta method: The approximation \frac{T(\hat{F}_n) - T(F)}{\hat{se}} \approx N(0, 1) where \hat{se} = \hat{\tau}/\sqrt{n} and \hat{\tau}^2 = \frac{1}{n} \sum_{i=1}^n \hat{L}^2(X_i).
Wavelet regression: A method to estimate a regression function by projecting the data onto a wavelet basis and then shrinking the wavelet coefficients.
Multiresolution analysis (MRA) of \mathbb{R} (generated by \phi): A sequence of closed subspaces V_j \subset L_2(\mathbb{R}), j \in \mathbb{Z}, such that V_j \subset V_{j+1}, \bigcup_{j \in \mathbb{Z}} V_j is dense in L_2(\mathbb{R}), \bigcap_{j \in \mathbb{Z}} V_j = \{0\}, f(x) \in V_j \Leftrightarrow f(2x) \in V_{j+1}, and there exists a scaling function (father wavelet) \phi \in V_0 such that \{\phi(x - k) : k \in \mathbb{Z}\} forms a Riesz basis for V_0.
Father wavelet (scaling function) (\phi): A function in V_0 that generates the basis for V_0 in a multiresolution analysis.
Mother wavelet (\psi): A function such that \{\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k) : j, k \in \mathbb{Z}\} forms an orthonormal basis for L_2(\mathbb{R}).

Book Outline

Introduction
- 1.1 What Is Nonparametric Inference?
  - Basic idea: Use data to infer an unknown quantity with minimal assumptions where the number of parameters can grow arbitrarily as required by the data.
  - Here nonparametric inference refers to modern statistical methods that aim to keep the number of underlying assumptions as weak as possible.
  - Problems considered:
    - Estimating the distribution function (cdf): Given an i.i.d sample X_1, \ldots, X_n \sim F, estimate F(x) = P(X \leq x).
    - Estimating functionals: Given an i.i.d sample X_1, \ldots, X_n \sim F, estimate a functional T(F) such as the mean T(F) = \int x \, dF(x).
    - Density estimation: Given an i.i.d sample X_1, \ldots, X_n \sim F, estimate the density f(x) = F'(x).
    - Nonparametric regression or curve estimation: Given (X_1, Y_1), \ldots, (X_n, Y_n) estimate the regression function r(x) = \mathbb{E}(Y \mid X = x).
    - Normal means: Given Y_i \sim \mathcal{N}(\theta_i, \sigma^2), i = 1, \ldots, n, estimate \theta= (\theta_1, \ldots, \theta_n).
  - Typically we assume that the unknown quantity (distribution F, density f, or regression function r) lies in some large set \mathcal{F} called a statistical model. For example, when estimating a density f, we might assume f \in \mathcal{F} = \{ g : \int (g''(x))^2 dx \leq c^2 \}.
- 1.2 Notation and Background
  - Summary of some useful notation (see also Table 1.1).
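  - Definition of the mean of a: \mathbb{E}(a(X)) = \int a(x) \, dF(x) \equiv \begin{cases} \int a(x) f(x) \, dx & \text{continuous case} \\ \sum_j a(x_j) f(x_j) & \text{discrete case} \end{cases}
  - Definition of the variance of X: \mathbb{V}(X) = \mathbb{E}(X - \mathbb{E}(X))^2.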
  - Brief review of probability.
    - Sample space \Omega: The set of possible outcomes of an experiment.
    - Events and \sigma-field \mathcal{A}: A class of events \mathcal{A} is a \sigma-field if
      1. \emptyset \in \mathcal{A},
      2. A \in \mathcal{A} implies that A^c \in \mathcal{A} and
      3. A_1, A_2, \ldots \in \mathcal{A} implies that \cup_{i=1}^\infty A_i \in \mathcal{A}.
    - Probability measure P: A function P defined on a \sigma-field \mathcal{A} such that P(A) \geq 0 for all A \in \mathcal{A}, P(\Omega) = 1 and if A_1, A_2, \ldots \in \mathcal{A} are disjoint then P(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i).
    - Probability space (\Omega, \mathcal{A}, P): The triple consisting of the sample space, the \sigma-field, and the probability measure.
    - Random variable X : \Omega \to \mathbb{R}: A map X : \Omega \to \mathbb{R} such that, for every real x, \{\omega \in \Omega : X(\omega) \leq x\} \in \mathcal{A}.
    - Mean squared error (mse): mse = bias^2(\hat{\theta}_n) + \mathbb{V}(\hat{\theta}_n) (1.10).
- 1.3 Confidence Sets
  - Much of nonparametric inference is devoted to finding an estimator \hat{\theta}_n of some quantity of interest \theta and providing confidence sets for these quantities.
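  - Let \mathcal{F} be a class of distribution functions and let \theta be some quantity of interest. Let C_n be a set of possible values of \theta which depends on the data X_1, \ldots, X_n. The coverage of C_n is P_F(\theta \in C_n). C_n is a 1 - \alpha confidence set if \inf_{F \in \mathcal{F}} P_F(\theta \in C_n) \geq 1 - \alpha.
  - Finite sample confidence set: \inf_{F \in \mathcal{F}} P_F(\theta \in C_n) \geq 1 - \alpha for all n.
  - Asymptotic confidence set: the coverage is at least 1 - \alpha only in the limit as n \to \infty, either uniformly or pointwise.
  - Uniform asymptotic confidence set: \liminf_{n \to \infty} \inf_{F \in \mathcal{F}} P_F(\theta \in C_n) \geq 1 - \alpha.
  - Pointwise asymptotic confidence set: for every F \in \mathcal{F}, \liminf_{n \to \infty} P_F(\theta \in C_n) \geq 1 - \alpha.
  - Confidence balls and bands: confidence sets for functions.
  - Confidence envelope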
- 1.4 Useful Inequalities
  - Jensen’s inequality: If g is convex then Eg(X) \geq g(EX) (1.32). If g is concave then Eg(X) \leq g(EX) (1.33).
- 1.5 Bibliographic Remarks
  - References on probability inequalities and their use in statistics and pattern recognition include Devroye et al. (1996) and van der Vaart and Wellner (1996). To review basic probability and mathematical statistics, I recommend Casella and Berger (2002), van der Vaart (1998) and Wasserman (2004).
Estimating the cdf and Statistical Functionals
- 2.1 The cdf
  - Definition of the empirical cumulative distribution function (cdf) \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \le x).
  - Glivenko-Cantelli Theorem: \Vert \hat{F}_n - F \Vert_\infty \stackrel{a.s.}{\longrightarrow} 0.
  - Dvoretzky-Kiefer-Wolfowitz (DKW) inequality: P(\Vert \hat{F}_n - F \Vert_\infty > \epsilon) \le 2e^{-2n\epsilon^2} for all F and all \epsilon > 0.
  - Construction of a nonparametric 1 - \alpha confidence band for F(x) using the DKW inequality: (L(x), U(x)) where L(x) = \max\{\hat{F}_n(x) - \epsilon_n, 0\} and U(x) = \min\{\hat{F}_n(x) + \epsilon_n, 1\} with \epsilon_n = \sqrt{\frac{1}{2n} \log(\frac{2}{\alpha})}.
  - Theorem 2.6 summarizes the construction of the confidence band.
  - Example 2.7 illustrates a 95 percent confidence band.
- 2.2 Estimating Statistical Functionals
  - Introduction to statistical functionals T(F) which map a distribution function F to a number.
  - Plug-in estimator: Estimate T(F) by T(\hat{F}_n).
  - Example 2.32 (The mean): Plug-in estimator for the mean \theta = \int xdF(x) is \hat{\theta} = \int xd\hat{F}_n(x) = \bar{X}_n. Asymptotic nonparametric 95 percent confidence interval for \theta is \bar{X}_n \pm 2 \hat{se} where \hat{se}^2 = \hat{\sigma}^2/n.
  - Statistical functionals in the form T(F) = a(T_1(F), \ldots, T_m(F)).
- 2.3 Influence Functions
  - Definition of the influence function L(x) of a functional T at F.
  - T((1 - \epsilon)F + \epsilon \delta_x) \approx T(F) + \epsilon L(x) for small \epsilon, where \delta_x is a point mass at x.
  - Asymptotic variance of the plug-in estimator: \sqrt{n}(T(\hat{F}_n) - T(F)) \stackrel{d}{\longrightarrow} N(0, \tau^2) where \tau^2 = V_F(L(X_1)).
  - Estimated standard error \hat{se} = \hat{\tau}/\sqrt{n}, where \hat{\tau}^2 = \frac{1}{n} \sum_{i=1}^n \hat{L}^2(X_i) and \hat{L} is the estimated (empirical) influence function.
  - Pointwise asymptotic 1 - \alpha confidence interval T(\hat{F}_n) \pm z_{1-\alpha/2} \hat{se}.
- 2.4 Empirical Probability Distributions
  - The empirical probability distribution P_n assigns mass 1/n to each observation X_i.
  - \hat{F}_n(x) = P_n((-\infty, x]).
- 2.5 Bibliographic Remarks
  - Provides references for further reading on empirical processes, statistical functionals, and influence functions.
- 2.6 Appendix
  - (Not summarized here.)
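
The empirical cdf and the DKW confidence band from Section 2.1 of the outline can be sketched as follows. This is a minimal illustration under my own assumptions: the sample is made up, and the function names `ecdf` and `dkw_band` are hypothetical, not from the book.

```python
import bisect
import math

def ecdf(data):
    """Return F_hat, the empirical cdf of data, as a function of x."""
    xs = sorted(data)
    n = len(xs)
    def f_hat(x):
        # bisect_right counts how many observations are <= x
        return bisect.bisect_right(xs, x) / n
    return f_hat

def dkw_band(data, alpha=0.05):
    """Return (lower, upper, eps) for the 1 - alpha DKW confidence band."""
    f_hat = ecdf(data)
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * len(data)))
    def lower(x):
        return max(f_hat(x) - eps, 0.0)
    def upper(x):
        return min(f_hat(x) + eps, 1.0)
    return lower, upper, eps

data = [0.3, 1.2, 1.9, 2.4, 2.4, 3.1, 3.8, 4.5, 5.0, 6.7]  # made-up sample
f_hat = ecdf(data)
lower, upper, eps = dkw_band(data, alpha=0.05)
for x in (1.0, 2.4, 5.0):
    print(x, lower(x), f_hat(x), upper(x))
```

With only ten observations the band is very wide, which reflects how slowly \epsilon_n shrinks with n.
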
The Bootstrap and the Jackknife
- 3.1 The Jackknife
  - Introduction to the jackknife as a method for estimating bias and variance.
  - Definition of the jackknife estimator T_{(i)}: the estimator computed by omitting the i-th observation.
  - Jackknife estimate of bias: bias_{jack} = (n-1)(\bar{T}_{(.)} - T_n), where \bar{T}_{(.)} = \frac{1}{n} \sum_{i=1}^n T_{(i)} and T_n = T(X_1, \ldots, X_n).
  - Jackknife estimate of variance: V_{jack} = \frac{n-1}{n} \sum_{i=1}^n (T_{(i)} - \bar{T}_{(.)})^2.
  - Approximate 1 - \alpha confidence interval using the jackknife: T_n \pm t_{n-1, 1-\alpha/2} \sqrt{V_{jack}}.
  - Example 3.3 illustrating jackknife estimation of bias and standard error for the mean.
  - Example 3.5 showing the inconsistency of the jackknife variance estimator for the median.
- 3.2 The Bootstrap
  - Introduction to the bootstrap as a general method for statistical inference based on resampling.
  - Nonparametric bootstrap procedure:
    1. Draw a bootstrap sample X^*_1, \ldots, X^*_n with replacement from the original data X_1, \ldots, X_n.
    2. Compute the statistic of interest T^*_n = T(X^*_1, \ldots, X^*_n).
    3. Repeat steps 1 and 2 B times to get T^*_{n,1}, \ldots, T^*_{n,B}.
  - Bootstrap estimate of standard error: se_{boot} = \sqrt{\frac{1}{B} \sum_{b=1}^B (T^*_{n,b} - \bar{T}^*_n)^2}, where \bar{T}^*_n = \frac{1}{B} \sum_{b=1}^B T^*_{n,b}.
  - Example 3.12 provides pseudo-code for bootstrapping the median.
- 3.3 Parametric Bootstrap
  - Procedure for parametric bootstrap:
    1. Assume a parametric model F_\theta for the data.
    2. Estimate the parameter \theta from the data to get \hat{\theta}.
    3. Draw bootstrap samples from F_{\hat{\theta}}.
    4. Compute the statistic of interest for each bootstrap sample.
  - Useful when a good parametric model is known or can be assumed.
- 3.4 Bootstrap Confidence Intervals
  - Introduction to different types of bootstrap confidence intervals.
  - Normal approximation interval: T_n \pm z_{1-\alpha/2} se_{boot}.
  - Percentile interval: C_n = (T^*_{(B\alpha/2)}, T^*_{(B(1-\alpha/2))}), using the \alpha/2 and 1 - \alpha/2 quantiles of the bootstrap sample.
  - Pivotal interval: Based on a pivotal quantity R_n(X, \theta) whose distribution does not depend on \theta. Approximate the distribution of R_n(X, \theta) by the distribution of R^*_n(X^*, T_n). The 1 - \alpha bootstrap pivotal interval is (T_n - R^*_{1-\alpha/2}, T_n - R^*_{\alpha/2}), where R^*_\beta is the \beta quantile of R^*_n.
  - Bootstrap studentized pivotal interval: (\hat{\theta}_n - z^*_{1-\alpha/2} \hat{se}_{boot}, \hat{\theta}_n - z^*_{\alpha/2} \hat{se}_{boot}), where z^*_\beta is the \beta quantile of Z^*_{n,b} = \frac{T^*_{n,b} - T_n}{\hat{se}^*_b}, and \hat{se}^*_b is an estimate of the standard error of T^*_{n,b}.
  - Example 3.17 shows various bootstrap confidence intervals for skewness.
- 3.5 Some Theory
  - Brief overview of theoretical aspects of the bootstrap.
  - Consistency of the bootstrap distribution.
  - Discussion of when the bootstrap works and when it might fail.
- 3.6 Bibliographic Remarks
  - Provides references for further reading on the jackknife and bootstrap methods.
- 3.7 Appendix
  - (Not summarized here.)
Smoothing: General Concepts
- Introduction to the necessity of smoothing data to estimate curves like probability density functions (f) or regression functions (r).
- Discussion of two main types of problems studied:
  - Density estimation: Estimating the probability density function f given a sample X_1, \ldots, X_n \sim f.
  - Regression: Estimating the regression function r given pairs (x_1, Y_1), \ldots, (x_n, Y_n) where Y_i = r(x_i) + \epsilon_i and E(\epsilon_i) = 0.
- Example 4.3 (Density estimation) showing histograms of astronomy data with different amounts of smoothing.
- Example 4.4 (Nonparametric regression) discussing the Cosmic Microwave Background (CMB) data.
- Example 4.5 (Nonparametric regression) illustrating the LIDAR experiment data and regressograms.
- Example 4.6 (Nonparametric regression) showing BPD (Bronchopulmonary Dysplasia) data and fits from different regression methods.
- Example 4.7 (Nonparametric regression) presenting rock data and additive model fits.
- 4.1 The Bias–Variance Tradeoff
  - Introduction to the concepts of bias and variance in the context of curve estimation.
  - Definition of risk or mean squared error (mse): mse = R(f(x), \hat{f}_n(x)) = E(L(f(x), \hat{f}_n(x))) (4.9).
  - Relationship between risk, bias, and variance: R(f(x), \hat{f}_n(x)) = bias_x^2 + V_x (4.10) and risk = mse = bias^2 + variance (4.11).
  - Illustration of the bias-variance tradeoff where bias increases and variance decreases with more smoothing.
  - Example illustrating the mse of a histogram estimator: mse \approx Ah^2 + B/(nh) (4.18).
- 4.2 Kernels
  - Introduction to kernels as a smoothing tool.
  - Examples of common kernels: boxcar, Gaussian, Epanechnikov, and tricube.
- 4.3 Which Loss Function?
  - Discussion of loss functions beyond squared error, such as L_p loss and Kullback–Leibler loss.
  - Reason for the continued popularity of L_2 (squared error) loss.
- 4.4 Confidence Sets
  - Brief mention of the desire to provide confidence sets for estimated curves.
- 4.5 The Curse of Dimensionality
  - Explanation of how the difficulty of nonparametric estimation increases with the dimensionality of the data.
  - Even with computational advancements, the statistical challenge of high-dimensional data remains, leading to large confidence intervals.
- 4.6 Bibliographic Remarks
  - Listing of several key texts on smoothing methods.
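
The four kernels named in Section 4.2 can be written down and checked numerically. The formulas below are the standard definitions as I recall them, so treat this as a sketch rather than a quotation from the book; the `integrate` helper is a plain midpoint rule added only for the check that each kernel integrates to one.

```python
import math

def indicator(x):
    """I(x) = 1 if |x| <= 1, else 0."""
    return 1.0 if abs(x) <= 1.0 else 0.0

def boxcar(x):
    return 0.5 * indicator(x)

def gaussian(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def epanechnikov(x):
    return 0.75 * (1.0 - x * x) * indicator(x)

def tricube(x):
    return (70.0 / 81.0) * (1.0 - abs(x) ** 3) ** 3 * indicator(x)

def integrate(f, a=-5.0, b=5.0, cells=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    step = (b - a) / cells
    return sum(f(a + (i + 0.5) * step) for i in range(cells)) * step

kernels = {
    "boxcar": boxcar,
    "gaussian": gaussian,
    "epanechnikov": epanechnikov,
    "tricube": tricube,
}
for name, k in kernels.items():
    print(name, round(integrate(k), 4))
```
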
Nonparametric Regression
- Introduction to nonparametric regression, also known as “learning a function”.
- Given n pairs of observations (x_1, Y_1), \ldots, (x_n, Y_n) where Y_i = r(x_i) + \epsilon_i and E(\epsilon_i) = 0.
- The goal is to estimate the regression function r(x) = E(Y |X = x).
- Methods considered include local regression methods (kernel regression, local polynomial regression) and penalization methods (splines).
- Chapters 8 and 9 cover an approach based on orthogonal functions.
- All estimators in this chapter are linear smoothers.
- 5.1 Review of Linear and Logistic Regression
  - Brief review of standard parametric regression techniques as a contrast to nonparametric methods.
- 5.2 Linear Smoothers
  - Definition of a linear smoother: An estimator \hat{r}_n of r is a linear smoother if \hat{r}_n(x) = \sum_{i=1}^n \beta_i(x)Y_i.
  - Vector of fitted values r = (\hat{r}_n(x_1), \ldots, \hat{r}_n(x_n))^T.
  - Relationship r = LY where L is the smoothing matrix.
- 5.3 Choosing the Smoothing Parameter
  - The importance of selecting an appropriate smoothing parameter (e.g., bandwidth h).
  - Cross-validation as a common data-driven method for choosing the smoothing parameter.
  - The cross-validation score cv = \hat{R}(h) = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{r}^{(-i)}(x_i))^2 (5.30).
  - Definition of \hat{r}^{(-i)}(x) = \sum_{j=1}^n Y_j \beta_{j,(-i)}(x) (5.31) with specific conditions on \beta_{j,(-i)}(x) (5.32).
  - Generalized cross-validation (GCV) as an alternative.
- 5.4 Local Regression
  - Introduction to local regression techniques.
  - Kernel regression (Nadaraya-Watson estimator): \hat{r}_n(x) = \frac{\sum_{i=1}^n K(\frac{x-x_i}{h})Y_i}{\sum_{i=1}^n K(\frac{x-x_i}{h})} (5.38).
  - Local polynomial regression: Fitting a local polynomial of degree p in a neighborhood of x.
  - Example 5.54 illustrating local linear regression for the LIDAR data.
  - Theorem 5.60 stating that the local linear estimator has weights \beta_i(x) = \frac{K_h(x-x_i)(S_2(x) - S_1(x)(x_i - x))}{S_0(x)S_2(x) - S_1(x)^2} where S_j(x) = \sum_{i=1}^n K_h(x-x_i)(x_i - x)^j.
- 5.5 Penalized Regression, Regularization and Splines
  - Introduction to penalized regression and regularization methods.
  - Smoothing splines: Minimizing a penalized residual sum of squares M(\lambda) = \sum_{i=1}^n (Y_i - r(x_i))^2 + \lambda \int (r''(x))^2 dx (5.70, 5.71).
  - Theorem 5.73 stating that the minimizer is a natural cubic spline with knots at the data points.
  - Basis functions for splines: Truncated power basis (Theorem 5.74) and B-spline basis (Theorem 5.76).
- 5.6 Variance Estimation
  - Estimating the variance \sigma^2(x) = V(\epsilon|X=x).
  - Method based on squared residuals \hat{\sigma}^2(x) = \frac{\sum_{i=1}^n K_h(x-x_i)(Y_i - \hat{r}_n(x_i))^2}{\sum_{i=1}^n K_h(x-x_i)} (5.89).
  - Iterative procedure for estimating r(x) and \sigma^2(x).
  - Example 5.96 showing variance estimation for the CMB data.
- 5.7 Confidence Bands
  - Constructing confidence bands for the regression function r(x).
  - Typically of the form \hat{r}_n(x) \pm c \ se(x) (5.97).
  - The bias problem: confidence bands are often for r_n(x) = E(\hat{r}_n(x)) not r(x).
  - Constructing approximate pointwise confidence bands using the asymptotic normality of \hat{r}_n(x).
  - Simultaneous confidence bands using \kappa_0 and c derived from the L_\infty norm of a Gaussian process.
  - Approximate 1 - \alpha simultaneous confidence band I(x) = \hat{r}_n(x) \pm c \hat{\sigma}(x) ||\beta(x)|| (5.104).
  - Example 5.105 showing simultaneous confidence bands for the CMB data.
  - Remark on adjusting for the uncertainty in the smoothing parameter choice using the Bonferroni inequality.
  - Remark on using bootstrap methods for confidence bands.
- 5.8 Average Coverage
  - Discussion of confidence bands with average coverage instead of pointwise or simultaneous coverage.
- 5.9 Summary of Linear Smoothing
  - Steps to construct the estimate \hat{r}_n and a confidence band.
  - Example 5.110 applying the summary to the LIDAR data.
- 5.10 Local Likelihood and Exponential Families
  - Extending local methods to likelihood-based estimation for exponential families.
  - Local likelihood estimation for binary regression (logistic regression).
  - Example 5.119 showing local linear logistic regression.
  - Example 5.120 illustrating local likelihood for BPD data.
- 5.11 Scale-Space Smoothing
  - An approach that examines \hat{r}_h(x) over a range of bandwidths h.
  - The scale-space surface S = \{ r_h(x), x \in X , h \in H \}.
  - Method SiZer (significant zero crossings of derivatives).
- 5.12 Multiple Regression
  - Nonparametric regression with multiple covariates r(x_1, \ldots, x_p).
  - Challenges due to the curse of dimensionality.
  - Additive models: r(x) = \mu + \sum_{j=1}^p r_j(x_j).
  - Backfitting algorithm for fitting additive models.
  - Example 5.126 applying additive models to rock data.
  - Projection pursuit regression: r(x_1, \ldots, x_p) = \mu + \sum_{m=1}^M r_m(\alpha_m^T x).
  - Algorithm for projection pursuit regression.
  - Example 5.130 applying projection pursuit regression to rock data.
  - Regression trees: Partitioning the covariate space into rectangles and fitting a constant in each.
  - Complexity parameter \alpha and cost-complexity pruning (5.132).
  - Example 5.133 showing a regression tree for rock data.
  - Multivariate Adaptive Regression Splines (MARS): Building a model from piecewise linear basis functions.
  - Tensor Product Models: r(x) = \sum_{m=1}^M \beta_m h_m(x) where h_m are basis functions in a tensor product space (5.135).
- 5.13 Other Issues
  - Plug-In Bandwidths: Using formulas for asymptotically optimal bandwidths.
  - Choice of Kernel: Impact of kernel shape.
  - Boundary Effects: Problems near the boundaries of the data range.
  - Varying Coefficient Models: r(x) = \sum_{j=1}^p \beta_j(x) x_j.
  - Quantile Regression: Modeling conditional quantiles of Y given X.
  - Derivative Estimation: Estimating derivatives of the regression function (5.143).
- 5.14 Bibliographic Remarks
  - References for further reading on nonparametric regression.
- 5.15 Appendix
  - (Not summarized here.)
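
Kernel regression (5.38) and the leave-one-out cross-validation score (5.30) can be sketched together. Everything concrete here is an assumption for illustration: the Gaussian kernel, the simulated sine-curve data, and the bandwidth grid. The `loocv_shortcut` function uses the identity that, for this estimator, the leave-one-out residual equals (Y_i - \hat{r}_n(x_i)) / (1 - L_{ii}); the post does not state that identity, so the code also computes the score by brute-force refitting for comparison.

```python
import math
import random

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nw_weights(x, xs, h):
    """Weights of the Nadaraya-Watson estimator at x (they sum to one)."""
    k = [gaussian_kernel((x - xi) / h) for xi in xs]
    total = sum(k)
    return [ki / total for ki in k]

def nw_fit(x, xs, ys, h):
    """Nadaraya-Watson estimate of r(x)."""
    return sum(w * y for w, y in zip(nw_weights(x, xs, h), ys))

def loocv_score(xs, ys, h):
    """Leave-one-out CV score, refitting without each pair in turn."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        xs_i = xs[:i] + xs[i + 1:]
        ys_i = ys[:i] + ys[i + 1:]
        total += (ys[i] - nw_fit(xs[i], xs_i, ys_i, h)) ** 2
    return total / n

def loocv_shortcut(xs, ys, h):
    """Same score from a single fit, using the diagonal of the smoothing matrix."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        w = nw_weights(xs[i], xs, h)
        fit = sum(wj * yj for wj, yj in zip(w, ys))
        total += ((ys[i] - fit) / (1.0 - w[i])) ** 2
    return total / n

rng = random.Random(1)
xs = [i / 29 for i in range(30)]                       # simulated design points
ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]

bandwidths = [0.02, 0.05, 0.1, 0.2, 0.5]               # assumed candidate grid
scores = {h: loocv_score(xs, ys, h) for h in bandwidths}
best_h = min(scores, key=scores.get)
print("CV scores:", scores)
print("selected bandwidth:", best_h)
```
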
Density Estimation
- Introduction to the problem of estimating the probability density function f(x) from a sample X_1, \ldots, X_n.
- 6.1 Cross-Validation
  - Using cross-validation to evaluate the quality of a density estimator \hat{f}_n with the risk R = E(L) (integrated mean squared error).
- 6.2 Histograms
  - Definition of a histogram estimator \hat{f}_n(x) = m \sum_{j=1}^m \hat{p}_j I(x \in B_j) (6.7), where h = 1/m is the binwidth, Y_j is the count in bin B_j, and \hat{p}_j = Y_j/n.
  - Example 6.8 showing histograms of astronomy data with different numbers of bins (oversmoothing, undersmoothing, and cross-validation chosen).
  - Theorem 6.9 relating the expected value of the histogram estimator to the true density and its derivatives.
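  - The risk (mean integrated squared error) of a histogram estimator R(h) = \mathbb{E} \int (\hat{f}_n(x) - f(x))^2 dx (6.12).
  - The leave-one-out cross-validation score for histograms \hat{J}(h) = \int \hat{f}_n^2(x) dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_n^{(-i)}(X_i) (6.14), which estimates the risk up to a constant that does not depend on h.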
  - Theorem 6.15 giving a formula for the cross-validation score: \hat{J}(h) = \frac{2}{h(n - 1)} - \frac{n + 1}{h(n - 1)} \sum_{j=1}^m \hat{p}_j^2 (6.16).
  - Example 6.17 showing the application of cross-validation to choose the number of bins for the astronomy data.
  - Constructing confidence sets for f at the resolution of the histogram.
- 6.3 Kernel Density Estimation
  - The kernel density estimator \hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^n K(\frac{x - X_i}{h}) = \frac{1}{n} \sum_{i=1}^n K_h(x - X_i) (6.23).
  - The mean integrated squared error (MISE) as a measure of performance (6.27).
  - Asymptotic MISE (AMISE) and optimal bandwidth (6.29, 6.31).
  - Rule-of-thumb bandwidth selection based on assuming a Normal distribution (6.33).
  - Leave-one-out cross-validation for kernel density estimation \hat{J}(h) = \int \hat{f}_n^2(x) dx - \frac{2}{n} \sum_{i=1}^n \hat{f}_n^{(-i)}(X_i) (6.34).
  - Theorem 6.35 giving a computational formula for the cross-validation score: \hat{J}(h) = \frac{1}{h n^2} \sum_{i=1}^n \sum_{j=1}^n K^*(\frac{X_i - X_j}{h}) + \frac{2}{nh} K(0) + O(\frac{1}{n^2}), where K^*(x) = K^{(2)}(x) - 2K(x) and K^{(2)}(z) = \int K(z - y) K(y) dy.
  - Example 6.36 applying kernel density estimation to the astronomy data and using cross-validation for bandwidth selection.
  - Confidence bands for kernel density estimators using the relationship with regression.
- 6.4 Local Polynomials
  - Extending the idea of local fitting to density estimation.
  - Local polynomial density estimation.
  - Relation to kernel density estimation when the polynomial degree p=0.
- 6.5 Multivariate Problems
  - Challenges of density estimation in higher dimensions (curse of dimensionality).
- 6.6 Converting Density Estimation Into Regression
  - Transforming the density estimation problem into a regression problem by binning the data and using regression techniques on the counts.
  - Applying kernel regression to the square root of the counts.
  - Constructing confidence bands using regression techniques.
  - Example 6.51 applying this method to the Bart Simpson distribution.
- 6.7 Bibliographic Remarks
  - Listing key references for density estimation.
- 6.8 Appendix
  - Proof of Theorem 6.11 regarding the bias of the histogram estimator.
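
The closed-form cross-validation score for histograms (6.16) makes choosing the number of bins cheap. The sketch below assumes data on [0, 1] and uses simulated Beta(2, 5) draws in place of the astronomy data, which are not reproduced in this post; the function names and the search range of 1 to 50 bins are my own choices.

```python
import random

def bin_counts(data, m):
    """Counts of observations in m equal-width bins on [0, 1]."""
    counts = [0] * m
    for x in data:
        j = min(int(x * m), m - 1)   # x = 1.0 goes into the last bin
        counts[j] += 1
    return counts

def histogram_cv_score(data, m):
    """J_hat(h) = 2 / (h (n-1)) - (n+1) / (h (n-1)) * sum_j p_hat_j^2, with h = 1/m."""
    n = len(data)
    h = 1.0 / m
    sum_p2 = sum((c / n) ** 2 for c in bin_counts(data, m))
    return 2.0 / (h * (n - 1)) - (n + 1) / (h * (n - 1)) * sum_p2

rng = random.Random(0)
data = [rng.betavariate(2.0, 5.0) for _ in range(200)]   # simulated sample

scores = {m: histogram_cv_score(data, m) for m in range(1, 51)}
best_m = min(scores, key=scores.get)
print("number of bins chosen by cross-validation:", best_m)
```
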
Normal Means and Minimax Theory
- Introduction to the Normal means problem and its role in unifying some nonparametric problems, serving as a basis for Chapters 8 and 9.
- The chapter is noted as being more theoretical.
- Recommendation for readers not interested in theoretical details to read sections 7.1, 7.2, and 7.3 and then skip to the next chapter.
- 7.1 The Normal Means Model
  - Defining the basic Normal means model: Z_i \sim N(\theta_i, \sigma^2_n), for i = 1, 2, \ldots (7.2), where \theta = (\theta_1, \theta_2, \ldots) is the unknown parameter and \sigma^2_n is assumed known.
  - Practical note: In reality, \sigma^2_n would often need to be estimated (as discussed in Chapter 5), which might affect the exact theoretical results.
  - Example 7.3 illustrating the model with a one-way analysis of variance setup: X_{ij} = \theta_i + \sigma \delta_{ij} where \delta_{ij} are independent N(0,1) random variables.
- 7.2 Function Spaces
  - Introduction to function spaces relevant to the theoretical development.
- 7.3 Connection to Regression and Density Estimation
  - Discussing how the Normal means problem relates to nonparametric regression and density estimation problems.
- 7.4 Stein’s Unbiased Risk Estimator (SURE)
  - Introducing Stein’s Unbiased Risk Estimator as a method for estimating the risk of an estimator.
  - Example showing the application of SURE to the soft threshold estimator \hat{\theta}_i = \text{sign}(Z_i)(|Z_i| - \lambda)_+ and providing the risk estimator \hat{R}(Z) = \sum_{i=1}^n (\sigma^2 - 2\sigma^2 I(|Z_i| \leq \lambda) + \min(Z^2_i, \lambda^2)) (7.22).
  - Mentioning that SURE is not appropriate for the hard threshold estimator \hat{\theta}_i = Z_i I(|Z_i| > \lambda).
- 7.5 Minimax Risk and Pinsker’s Theorem
  - Discussing the concept of minimax risk.
  - Introducing Pinsker’s Theorem (Theorem 7.28) which gives an exact expression for the asymptotic minimax risk over an ellipsoid \Theta(c^2) = \{\theta : \sum a_i^2 \theta_i^2 \leq c^2 \}.
- 7.6 Linear Shrinkage and the James–Stein Estimator
  - Returning to model (7.1) and exploring improvements over the MLE using linear estimators of the form \hat{\theta} = bZ = (bZ_1, \ldots, bZ_n) where 0 \leq b \leq 1.
  - Defining the set of linear shrinkage estimators L = \{bZ : b \in [0, 1]\}.
  - Presenting the James-Stein estimator \hat{\theta}^{JS} = (1 - \frac{(n-2)\sigma^2_n}{\|Z\|^2})_+ Z (7.41) and noting it minimizes the estimated risk over L.
  - Mentioning a block estimation scheme by Cai et al. (2000) using the James-Stein estimator within blocks.
  - Theorem 7.50 (Cai, Low and Zhao, 2000) providing asymptotic optimality of their block estimator over Sobolev ellipsoids \Theta(m, c) = \{\theta : \sum_{i=1}^\infty a_i^2 \theta^2_i \leq c^2 \} where a_{2i} = a_{2i+1} = 1 + (2i\pi)^{2m}.
- 7.7 Adaptive Estimation Over Sobolev Spaces
  - Discussing adaptive estimation methods for Sobolev spaces.
- 7.8 Confidence Sets
  - Focusing on the construction of confidence sets B_n \subset \mathbb{R}^n for \theta = (\theta_1, \ldots, \theta_n) such that \inf_{\theta \in \mathbb{R}^n} P_\theta(\theta \in B_n) \geq 1 - \alpha (7.51).
  - Describing different methods for constructing confidence sets.
- 7.9 Optimality of Confidence Sets
  - Discussing the concept of optimal confidence sets.
- 7.10 Random Radius Bands?
  - Questioning the use of random radius bands for confidence sets.
- 7.11 Penalization, Oracles and Sparsity
  - Connecting penalization methods to oracle properties and the concept of sparsity in the parameter vector \theta.
  - Mentioning the LASSO (Tibshirani (1996)) and basis pursuit (Chen et al. (1998)) in the context of criterion (7.89).
  - Relating soft thresholding to wavelet methods (Chapter 9).
- 7.12 Bibliographic Remarks
  - Providing references for further reading on Normal means and minimax theory.
- 7.13 Appendix
  - Containing technical proofs, including the proof of Theorem 7.28 (Pinsker’s Theorem).
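
The SURE formula (7.22) from section 7.4 in the outline above can be sketched in a few lines of NumPy. This is only a minimal illustration, not code from the book: the function names `soft_threshold`, `sure_soft` and `choose_lambda` are made up here, and `sigma` is assumed to be the known noise standard deviation of each Z_i.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft threshold estimator: sign(z_i) * (|z_i| - lam)_+ ."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sure_soft(z, lam, sigma):
    """Risk estimate for soft thresholding, in the form of (7.22)."""
    z = np.asarray(z, dtype=float)
    terms = (sigma**2
             - 2 * sigma**2 * (np.abs(z) <= lam)
             + np.minimum(z**2, lam**2))
    return float(np.sum(terms))

def choose_lambda(z, sigma):
    """Pick the threshold that minimizes the SURE risk estimate.

    Between consecutive values of |z_i| the estimate is increasing in lam,
    so it is enough to search over 0 and the observed |z_i| (hypothetical
    helper, added for illustration).
    """
    z = np.asarray(z, dtype=float)
    candidates = np.concatenate(([0.0], np.abs(z)))
    risks = [sure_soft(z, lam, sigma) for lam in candidates]
    return float(candidates[int(np.argmin(risks))])
```

With `lam = 0` nothing is shrunk and the estimate reduces to n \sigma^2, the risk of the MLE, which is a quick sanity check on the formula.
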
Nonparametric Inference Using Orthogonal Functions

This chapter introduces the use of orthogonal functions (e.g., Fourier series, wavelets) for nonparametric inference in both regression and density estimation.
For nonparametric regression, the function is expanded in an orthonormal basis: r(x) = \sum^\infty_{j=1} \theta_j\phi_j(x) (8.3).
The concept of a modulator is introduced for shrinking the coefficients:
- “A modulator is a vector b = (b_1, \ldots, b_n) such that 0 \leq b_j \leq 1, j = 1, \ldots, n. A modulation estimator is an estimator of the form \hat{\theta} = bZ = (b_1 Z_1, b_2 Z_2, \ldots, b_nZ_n). (8.8)”.
Risk estimation and minimization over classes of modulators are discussed, connecting to the Pinsker bound.
The chapter also covers irregular designs and density estimation using orthogonal functions.
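
A modulation estimator can be sketched as follows. The choices here are assumptions made for illustration rather than the book's exact setup: a cosine basis on [0, 1], a regular design at the midpoints x_i = (i - 1/2)/n (so that the basis is exactly orthonormal on the grid), and the hypothetical function names `cosine_basis` and `modulation_fit`.

```python
import numpy as np

def cosine_basis(x, n_terms):
    """Columns are phi_1(x) = 1 and phi_j(x) = sqrt(2) cos((j-1) pi x), j >= 2."""
    x = np.asarray(x, dtype=float)
    j = np.arange(n_terms)
    phi = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, j))
    phi[:, 0] = 1.0
    return phi

def modulation_fit(x, y, b):
    """Fit r at the design points using theta_hat = b * Z.

    Z_j = (1/n) sum_i Y_i phi_j(x_i) are the empirical coefficients and
    b is the modulator, with entries in [0, 1].
    """
    y = np.asarray(y, dtype=float)
    b = np.asarray(b, dtype=float)
    phi = cosine_basis(x, len(b))
    z = phi.T @ y / len(y)
    theta_hat = b * z
    return phi @ theta_hat, z

# Noisy data, then a nested-subset modulator that keeps the first 5 terms.
rng = np.random.default_rng(0)
n = 128
x = (np.arange(n) + 0.5) / n
y = np.cos(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
b = np.r_[np.ones(5), np.zeros(15)]
r_hat, z = modulation_fit(x, y, b)
```

A modulator of ones followed by zeros is the simplest case (keep the low frequencies, drop the rest); smoother choices of b shrink the higher-frequency coefficients gradually instead.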

Wavelets and Other Adaptive Methods

This chapter focuses on wavelets, a powerful tool for adaptive nonparametric estimation, particularly for functions with varying degrees of smoothness.
Haar wavelets are introduced as a simple example.
The construction of more general wavelets through multi-resolution analysis (MRA) is outlined:
“9.11 Definition. Given a function \phi, define V_0, V_1, \ldots, as in (9.10). We say that \phi generates a multi-resolution analysis (MRA) of \mathbb{R} if V_j \subset V_{j+1}, j \geq 0, (9.12) and \cup_{j \geq 0} V_j is dense in L_2(\mathbb{R}). (9.13)”.
Wavelet regression techniques, including wavelet thresholding (VisuShrink and SureShrink), are discussed for denoising and estimating functions from noisy data. The oracle risk and the performance of thresholding estimators are mentioned.
Besov spaces are introduced as function spaces that are well-suited for characterizing the properties of functions that can be effectively estimated using wavelets.
The construction of confidence sets using wavelet methods is covered.
Other adaptive methods, such as the Intersection of Confidence Intervals (ICI) method by Goldenshluger and Nemirovski, are briefly introduced as alternatives to wavelet-based approaches for adapting to unknown smoothness.
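
To give a feel for wavelet thresholding, here is a minimal sketch of Haar denoising with the universal threshold. It is not the book's implementation: it assumes a signal of length 2^J, uses an orthonormal Haar transform applied directly to the data (so the threshold is \sigma \sqrt{2 \log n}), and estimates \sigma from the median absolute value of the finest-level coefficients divided by 0.6745, which is one common convention. The function names are made up for this example.

```python
import numpy as np

def haar_forward(y):
    """Orthonormal Haar transform of a length-2^J signal.

    Returns the single approximation coefficient and a list of detail
    coefficient arrays ordered from coarsest to finest.
    """
    a = np.asarray(y, dtype=float)
    details = []
    while len(a) > 1:
        even, odd = a[0::2], a[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        a = (even + odd) / np.sqrt(2.0)
    return a, details[::-1]

def haar_inverse(a, details):
    """Invert haar_forward (details ordered coarsest to finest)."""
    a = np.asarray(a, dtype=float)
    for d in details:
        out = np.empty(2 * len(a))
        out[0::2] = (a + d) / np.sqrt(2.0)
        out[1::2] = (a - d) / np.sqrt(2.0)
        a = out
    return a

def visu_shrink(y):
    """Soft-threshold all detail coefficients at the universal threshold."""
    n = len(y)
    a, details = haar_forward(y)
    sigma = np.median(np.abs(details[-1])) / 0.6745  # robust noise estimate
    lam = sigma * np.sqrt(2.0 * np.log(n))
    shrunk = [np.sign(d) * np.maximum(np.abs(d) - lam, 0.0) for d in details]
    return haar_inverse(a, shrunk)
```

SureShrink follows the same transform, threshold, invert pattern but picks the threshold at each resolution level by minimizing a SURE risk estimate instead of using one universal value.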

Other Topics

This final chapter covers a range of more advanced and specialized topics in nonparametric inference.
Measurement error in covariates and responses and its impact on estimation are discussed, along with potential correction methods. Inverse problems, where the goal is to infer an underlying function or parameter from indirect observations, are introduced.
Nonparametric Bayes, semiparametric inference, and issues with correlated errors are briefly mentioned.
The chapter touches upon classification, sieves, shape-restricted inference, and testing in a nonparametric framework.
Finally, computational issues relevant to implementing nonparametric methods are highlighted.

FAQ

What is nonparametric inference and how does it differ from parametric inference?

Nonparametric inference refers to statistical methods where the structure of the underlying model is not fully specified by a finite number of parameters. Instead of assuming a particular distribution (like a normal distribution) with fixed parameters, nonparametric methods make fewer assumptions about the data-generating process. They aim to estimate functions or distributions directly from the data. This contrasts with parametric inference, which relies on models defined by a fixed set of parameters. Nonparametric methods are more flexible and can capture complex relationships in the data, but they often require larger sample sizes and can be less efficient if a simple parametric model is indeed appropriate.

What are confidence sets and why are they important in nonparametric inference?

Confidence sets are generalizations of confidence intervals to higher dimensions or to function spaces. Instead of providing a range of plausible values for a single parameter, a confidence set provides a set of plausible functions, distributions, or other statistical objects. They are crucial in nonparametric inference because the goal is often to estimate an entire function or distribution, and it’s important to quantify the uncertainty associated with this estimate. A confidence set provides a way to assess the range of plausible estimates consistent with the observed data at a certain confidence level.

What are the bootstrap and the jackknife, and how are they used in nonparametric inference?

The bootstrap and the jackknife are resampling techniques used to estimate the properties of an estimator (like its variance or bias) or to construct confidence intervals without relying on strong parametric assumptions.

Bootstrap: This method involves repeatedly resampling with replacement from the original data to create many “bootstrap samples.” The statistic of interest is calculated for each bootstrap sample, and the distribution of these statistics is used to approximate the sampling distribution of the original estimator. This can be used to estimate standard errors and construct bootstrap confidence intervals.

Jackknife: This technique involves systematically leaving out one observation at a time from the original data and calculating the statistic of interest for each of these “leave-one-out” samples. These values are then used to estimate bias and variance of the estimator.

Both methods are valuable in nonparametric settings where analytical derivations of standard errors or confidence intervals might be difficult or unreliable due to the complexity of the estimators or the lack of distributional assumptions.
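
Both resampling schemes fit in a few lines. The sketch below estimates the standard error of a statistic; the function names, the default of 2000 bootstrap replications and the fixed seed are arbitrary choices made for illustration.

```python
import numpy as np

def bootstrap_se(data, statistic, n_boot=2000, seed=0):
    """Standard error from resampling the data with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    reps = np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])
    return float(reps.std(ddof=1))

def jackknife_se(data, statistic):
    """Standard error from the n leave-one-out replicates."""
    data = np.asarray(data)
    n = len(data)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2)))

sample = np.random.default_rng(1).exponential(size=50)
print(bootstrap_se(sample, np.mean), jackknife_se(sample, np.mean))
```

For the sample mean the jackknife value equals the usual s/\sqrt{n} exactly, which makes a handy check. For non-smooth statistics such as the median the jackknife is known to be unreliable, and the bootstrap is the safer of the two.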

What is the bias-variance tradeoff in smoothing, and how does it relate to the choice of smoothing parameters like bandwidth?

Smoothing techniques, such as kernel density estimation and nonparametric regression, aim to uncover underlying patterns in noisy data by averaging or weighting nearby observations. The bias-variance tradeoff is a fundamental concept in these methods.

Bias: This refers to the error introduced by approximating a complex real-world problem (which may be nonlinear) by a simpler model (the smoothed estimate). If the smoothing is too strong (e.g., a large bandwidth), the estimate might be overly smooth and miss important details in the true underlying function, leading to high bias.

Variance: This refers to the variability of the estimator. If the smoothing is weak (e.g., a small bandwidth), the estimator might be too sensitive to the noise in the data, resulting in high variance.

The choice of the smoothing parameter (like the bandwidth h in kernel methods) controls this tradeoff. A larger bandwidth typically leads to a smoother estimate with lower variance but potentially higher bias. A smaller bandwidth leads to a more wiggly estimate with higher variance but potentially lower bias. The goal is to choose a bandwidth that balances bias and variance to minimize the overall risk (e.g., mean squared error).
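
To make the tradeoff concrete, here is the standard asymptotic risk expansion for a kernel density estimator with bandwidth h. It is a sketch under the usual smoothness assumptions, with symbols introduced here for illustration: f is a twice-differentiable density and K is a symmetric kernel with variance \sigma_K^2.

\begin{aligned} R(h) &\approx \underbrace{\tfrac{1}{4}\sigma_K^4 h^4 \int (f''(x))^2\,dx}_{\text{integrated squared bias}} + \underbrace{\frac{\int K^2(x)\,dx}{nh}}_{\text{integrated variance}} \\ h^* &= \left(\frac{\int K^2(x)\,dx}{\sigma_K^4 \int (f''(x))^2\,dx}\right)^{1/5} n^{-1/5} \\ R(h^*) &= O(n^{-4/5}) \end{aligned}

The bias term grows with h and the variance term shrinks with h, so setting the derivative of R(h) to zero gives the balancing bandwidth h^*. The resulting n^{-4/5} rate is slower than the n^{-1} rate typical of parametric models, which is the price paid for making fewer assumptions.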

What are linear smoothers in nonparametric regression, and what are some examples?

A nonparametric regression estimator \hat{r}_n(x) is called a linear smoother if it can be expressed as a linear combination of the response variables Y_i, i.e., \hat{r}_n(x) = \sum_{i=1}^{n} l_i(x) Y_i, where l_i(x) are weights that depend on the predictor variables x_i and the point of estimation x, but not on the Y_i’s. These weights essentially determine how much each observation Y_i contributes to the estimate at x. Evaluating the weights at the design points gives the smoothing matrix L, with entries L_{ij} = l_j(x_i), such that the vector of fitted values is \hat{\mathbf{r}} = LY.

Examples of linear smoothers discussed include:

Regressogram: This method divides the range of the predictor variable into bins and estimates the regression function within each bin by the average of the Y_i’s in that bin. The weights l_i(x) are constant within each bin and zero outside.

Local Averages: This estimator averages the Y_i’s for x_i values within a certain distance (defined by a bandwidth h) of the target point x. The weights l_i(x) are equal for points within the window and zero outside.

Kernel Estimators: These methods use a kernel function K and a bandwidth h to weight the Y_i’s based on the distance of their corresponding x_i’s from the target x. The weights l_i(x) are proportional to K((x-x_i)/h).

Local Polynomials: These estimators fit a low-degree polynomial to the data within a local neighborhood of x and use the value of the fitted polynomial at x as the estimate \hat{r}_n(x). The weights are determined by the polynomial fit.

Smoothing Splines: These methods find a smooth function that minimizes the residual sum of squares plus a penalty term that penalizes the roughness of the function. The resulting estimator is also a linear smoother.
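
As a small illustration of the smoothing matrix L, the sketch below builds it for a kernel (Nadaraya-Watson) estimator with a Gaussian kernel and uses the leave-one-out shortcut, which needs only the diagonal of L, to compare bandwidths. The function names, the toy data and the bandwidth grid are assumptions made for this example.

```python
import numpy as np

def nw_smoothing_matrix(x, h):
    """Row i holds the kernel weights l_j(x_i); each row sums to one."""
    x = np.asarray(x, dtype=float)
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u**2)  # Gaussian kernel; its constant cancels below
    return k / k.sum(axis=1, keepdims=True)

def loocv_score(x, y, h):
    """Leave-one-out CV score via mean(((Y_i - r_hat(x_i)) / (1 - L_ii))^2)."""
    L = nw_smoothing_matrix(x, h)
    resid = (np.asarray(y, dtype=float) - L @ y) / (1.0 - np.diag(L))
    return float(np.mean(resid**2))

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=100))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(100)

grid = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
best_h = min(grid, key=lambda h: loocv_score(x, y, h))
fitted = nw_smoothing_matrix(x, best_h) @ y   # fitted values, r_hat = L Y
```

The trace of L is often used as the effective degrees of freedom of the fit: it shrinks toward 1 as h grows and approaches n as h goes to 0.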

How is the concept of minimaxity used in nonparametric inference, particularly in the context of normal means problems?

Minimaxity in statistics refers to finding an estimator that performs optimally in the worst-case scenario over a given class of parameters or functions. In nonparametric inference, where the underlying function or parameter might belong to a large, infinite-dimensional space, minimax theory helps in understanding the fundamental limitations of estimation. It aims to find an estimator \hat{\theta} that minimizes the maximum risk R(\theta, \hat{\theta}) over all \theta in a class \Theta, where the risk R quantifies the error of the estimator (e.g., mean squared error).

In the context of normal means problems, which serve as a simplified yet insightful model for many nonparametric problems (like regression or density estimation in an orthonormal basis), minimax theory helps determine the best possible rate of convergence for the risk. For instance, Pinsker’s theorem provides the minimax mean squared error for estimating a function in a Sobolev ellipsoid, showing that the optimal rate depends on the smoothness of the function class.

Minimax theory also guides the development of estimators that achieve these optimal rates. Examples like shrinkage estimators (e.g., James-Stein estimator in the finite normal means problem, and wavelet thresholding in the sequence space setting related to function estimation) are often motivated by minimax considerations, aiming to reduce risk compared to simpler estimators like the MLE, especially when many parameters are involved and some might be zero or small.
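
The risk reduction from shrinkage is easy to see in a small simulation. The sketch below implements the positive-part James-Stein estimator and compares its average squared-error loss with that of the MLE (which is just Z). Here `sigma` is assumed to be the known noise standard deviation of each Z_i, and the function names, the choice of \theta and the number of replications are arbitrary.

```python
import numpy as np

def james_stein(z, sigma):
    """Positive-part James-Stein: (1 - (n - 2) sigma^2 / ||z||^2)_+ * z.

    Assumes z is not identically zero.
    """
    z = np.asarray(z, dtype=float)
    n = z.size
    shrink = max(0.0, 1.0 - (n - 2) * sigma**2 / float(np.sum(z**2)))
    return shrink * z

def compare_risks(theta, sigma, n_rep=500, seed=0):
    """Monte Carlo estimates of the risk of the MLE and of James-Stein."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    mle_loss, js_loss = 0.0, 0.0
    for _ in range(n_rep):
        z = theta + sigma * rng.standard_normal(theta.size)
        mle_loss += np.sum((z - theta) ** 2)
        js_loss += np.sum((james_stein(z, sigma) - theta) ** 2)
    return mle_loss / n_rep, js_loss / n_rep

mle_risk, js_risk = compare_risks(theta=np.full(50, 0.5), sigma=1.0)
print(mle_risk, js_risk)
```

The MLE risk is n \sigma^2 regardless of \theta, while the James-Stein risk is much smaller when \|\theta\|^2 is small relative to n \sigma^2 and approaches the MLE risk as \|\theta\| grows.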

What role do orthogonal functions and wavelets play in nonparametric inference?

Orthogonal functions (like Fourier series, cosine basis) and wavelets provide flexible bases for representing functions and distributions in nonparametric inference.

Orthogonal Functions: By expanding an unknown function r(x) in an orthonormal basis \{\phi_j(x)\}, we can represent it by its coefficients \theta_j = \int r(x) \phi_j(x) dx. The data can then be projected onto these basis functions, yielding noisy estimates Z_j of these coefficients. Nonparametric inference then proceeds by appropriately shrinking or selecting these coefficients. This approach transforms the problem of estimating a function into a problem of estimating a sequence of coefficients, which can be analyzed using techniques similar to the normal means problem. Modulation estimators, which apply weights b_j to the estimated coefficients Z_j, are a common approach. The choice of basis can be tailored to the expected properties of the function (e.g., smoothness, periodicity).

Wavelets: Wavelets are basis functions that are localized in both time (or space) and frequency. This localization property makes them particularly well-suited for analyzing functions with local features like sharp changes or discontinuities. Wavelet methods involve decomposing the data into wavelet coefficients, applying a thresholding rule to these coefficients (to remove noise and capture important features), and then reconstructing the function from the thresholded coefficients. Wavelet shrinkage estimators like VisuShrink and SureShrink are popular nonparametric regression and density estimation techniques known for their adaptivity to varying smoothness of the underlying function. They can achieve near-optimal performance over a wide range of function spaces.

Both orthogonal functions and wavelets allow for a sparse representation of many signals and functions, meaning that many of their coefficients are close to zero. This sparsity is exploited in nonparametric methods to achieve efficient estimation and adaptivity.

What are some of the challenges and extensions in nonparametric inference discussed in the text?

The text touches upon several challenges and extensions in nonparametric inference:

Curse of Dimensionality: In multivariate settings, many nonparametric methods suffer from the curse of dimensionality, where the amount of data needed to achieve a certain level of accuracy increases exponentially with the number of predictor variables.

Measurement Error: When the predictor variables are measured with error, it can significantly affect the performance of nonparametric estimators. Deconvolution techniques are needed to correct for this bias.

Inverse Problems: These problems involve inferring an unknown function or parameter from indirect measurements. They are often ill-posed, requiring regularization techniques, many of which have nonparametric flavors.

Nonparametric Bayes: This area combines the flexibility of nonparametric models with the framework of Bayesian inference, using priors over infinite-dimensional spaces of functions or distributions (e.g., Dirichlet process mixtures).

Semi-parametric Inference: These methods combine parametric and nonparametric components in a model. For example, one might model the effect of some covariates parametrically while leaving the effect of others unspecified.

Correlated Errors: The standard assumptions of independence often do not hold in practice. Nonparametric methods for data with correlated errors are more complex and less developed than those for independent data.

Classification: While the main focus is on regression and density estimation, nonparametric methods are also used for classification problems, such as k-nearest neighbors or kernel-based methods.

Shape-Restricted Inference: In some applications, prior knowledge about the shape of the function (e.g., monotonicity, convexity) is available. Nonparametric methods that incorporate these shape constraints can lead to improved estimates.

Testing: Nonparametric hypothesis testing aims to compare distributions or test for relationships without assuming specific parametric forms.

Computational Issues: Implementing nonparametric methods can be computationally intensive, especially for large datasets or complex techniques like bootstrap or wavelet analysis.

These topics highlight the ongoing research and the breadth of problems that nonparametric inference seeks to address, often requiring specialized techniques beyond the basic methodologies of regression and density estimation.

Some Thoughts

The book provides a ~~comprehensive~~ overview of nonparametric inference, starting from fundamental concepts like CDF and functional estimation and progressing to advanced topics such as wavelet analysis, minimax theory, and specialized problems. The book emphasizes both the theoretical underpinnings and the practical application of various nonparametric methods for regression, density estimation, and other inference tasks.

Resources

Videos for Nonparametric Statistics Course by Max Menzies using this book

Citation

BibTeX citation:

@online{bochman2025,
  author = {Bochman, Oren},
  title = {All of Nonparametric Statistics},
  date = {2025-06-10},
  url = {https://orenbochman.github.io/reviews/2006/all-of-nonparametric-statistics/},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2025. “All of Nonparametric Statistics.” June 10, 2025. https://orenbochman.github.io/reviews/2006/all-of-nonparametric-statistics/.