These are some notes on one of my favorite books of all time: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. In this note I will try to explain why I like it so much, despite the criticism scattered throughout these notes.
The book is available for free from the authors’ website. It is a great introduction to statistical learning and a good starting point for anyone interested in machine learning. It is a bit light on the mathematics, but it covers the concepts well, and the authors have a more advanced book, The Elements of Statistical Learning, which is more mathematically rigorous.
Another reason I like this book is that the authors have published videos on YouTube explaining the material. I first read the book before the videos came out and found it easy to understand for most topics. The videos, however, are somewhat orthogonal to the text: they give lots of background information beyond what is covered in the book. This is a good thing, as it adds a lot of context to the material.
The authors also make their slides and figures available for free.
A little while back an edition of this book came out with the labs in Python. This makes it more relevant for data scientists and machine learning engineers than the original, which used R. There is also now a chapter on deep learning, which is a big plus, though I dare say it's hard to get that lab to run properly on a local machine.
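Before attempting the labs locally, a quick smoke test of the environment can save some frustration. This is only a minimal sketch, assuming the book's ISLP package has been installed (e.g. with `pip install ISLP`) along with its dependencies; the deep learning lab additionally needs a working PyTorch install.

```python
# Minimal smoke test for the ISLP lab environment.
# Assumes `pip install ISLP` and its dependencies have already been run
# in the active environment (the deep learning lab also needs torch).
import numpy as np
import pandas as pd
from ISLP import load_data  # data-loading helper shipped with the ISLP package

# Load one of the book's datasets and inspect it as a basic sanity check.
Auto = load_data('Auto')
print(Auto.shape)
print(Auto.columns.tolist())
```

If this runs cleanly, the core labs should work; problems with the deep learning chapter usually come from the PyTorch setup rather than from ISLP itself.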
The book covers many algorithms as well as many practical aspects of data science and machine learning.
I haven’t yet come across videos covering this material by people other than the authors; I hope to find some. For reference, here is the table of contents of the Python edition:
1 Introduction
2 Statistical Learning
2.1 What Is Statistical Learning?
2.1.1 Why Estimate f?
2.1.2 How Do We Estimate f?
2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
2.1.4 Supervised Versus Unsupervised Learning
2.1.5 Regression Versus Classification Problems
2.2 Assessing Model Accuracy
2.2.1 Measuring the Quality of Fit
2.2.2 The Bias-Variance Trade-Off
2.2.3 The Classification Setting
2.3 Lab: Introduction to Python
2.3.1 Getting Started
2.3.2 Basic Commands
2.3.3 Introduction to Numerical Python
2.3.4 Graphics
2.3.5 Sequences and Slice Notation
2.3.6 Indexing Data
2.3.7 Loading Data
2.3.8 For Loops
2.3.9 Additional Graphical and Numerical Summaries
2.4 Exercises
3 Linear Regression
3.1 Simple Linear Regression
3.1.1 Estimating the Coefficients
3.1.2 Assessing the Accuracy of the Coefficient Estimates
3.1.3 Assessing the Accuracy of the Model
3.2 Multiple Linear Regression
3.2.1 Estimating the Regression Coefficients
3.2.2 Some Important Questions
3.3 Other Considerations in the Regression Model
3.3.1 Qualitative Predictors
3.3.2 Extensions of the Linear Model
3.3.3 Potential Problems
3.4 The Marketing Plan
3.5 Comparison of Linear Regression with K-Nearest Neighbors
3.6 Lab: Linear Regression
3.6.1 Importing Packages
3.6.2 Simple Linear Regression
3.6.3 Multiple Linear Regression
3.6.4 Multivariate Goodness of Fit
3.6.5 Interaction Terms
3.6.6 Non-linear Transformations of the Predictors
3.6.7 Qualitative Predictors
3.7 Exercises
4 Classification
4.1 An Overview of Classification
4.2 Why Not Linear Regression?
4.3 Logistic Regression
4.3.1 The Logistic Model
4.3.2 Estimating the Regression Coefficients
4.3.3 Making Predictions
4.3.4 Multiple Logistic Regression
4.3.5 Multinomial Logistic Regression
4.4 Generative Models for Classification
4.4.1 Linear Discriminant Analysis for p = 1
4.4.2 Linear Discriminant Analysis for p > 1
4.4.3 Quadratic Discriminant Analysis
4.4.4 Naive Bayes
4.5 A Comparison of Classification Methods
4.5.1 An Analytical Comparison
4.5.2 An Empirical Comparison
4.6 Generalized Linear Models
4.6.1 Linear Regression on the Bikeshare Data
4.6.2 Poisson Regression on the Bikeshare Data
4.6.3 Generalized Linear Models in Greater Generality
4.7 Lab: Logistic Regression, LDA, QDA, and KNN
4.7.1 The Stock Market Data
4.7.2 Logistic Regression
4.7.3 Linear Discriminant Analysis
4.7.4 Quadratic Discriminant Analysis
4.7.5 Naive Bayes
4.7.6 K-Nearest Neighbors
4.7.7 Linear and Poisson Regression on the Bikeshare Data
4.8 Exercises
5 Resampling Methods
5.1 Cross-Validation
5.1.1 The Validation Set Approach
5.1.2 Leave-One-Out Cross-Validation
5.1.3 k-Fold Cross-Validation
5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation
5.1.5 Cross-Validation on Classification Problems
5.2 The Bootstrap
5.3 Lab: Cross-Validation and the Bootstrap
5.3.1 The Validation Set Approach
5.3.2 Cross-Validation
5.3.3 The Bootstrap
5.4 Exercises
6 Linear Model Selection and Regularization
6.1 Subset Selection
6.1.1 Best Subset Selection
6.1.2 Stepwise Selection
6.1.3 Choosing the Optimal Model
6.2 Shrinkage Methods
6.2.1 Ridge Regression
6.2.2 The Lasso
6.2.3 Selecting the Tuning Parameter
6.3 Dimension Reduction Methods
6.3.1 Principal Components Regression
6.3.2 Partial Least Squares
6.4 Considerations in High Dimensions
6.4.1 High-Dimensional Data
6.4.2 What Goes Wrong in High Dimensions?
6.4.3 Regression in High Dimensions
6.4.4 Interpreting Results in High Dimensions
6.5 Lab: Linear Models and Regularization Methods
6.5.1 Subset Selection Methods
6.5.2 Ridge Regression and the Lasso
6.5.3 PCR and PLS Regression
6.6 Exercises
7 Moving Beyond Linearity
7.1 Polynomial Regression
7.2 Step Functions
7.3 Basis Functions
7.4 Regression Splines
7.4.1 Piecewise Polynomials
7.4.2 Constraints and Splines
7.4.3 The Spline Basis Representation
7.4.4 Choosing the Number and Locations of the Knots
7.4.5 Comparison to Polynomial Regression
7.5 Smoothing Splines
7.5.1 An Overview of Smoothing Splines
7.5.2 Choosing the Smoothing Parameter λ
7.6 Local Regression
7.7 Generalized Additive Models
7.7.1 GAMs for Regression Problems
7.7.2 GAMs for Classification Problems
7.8 Lab: Non-Linear Modeling
7.8.1 Polynomial Regression and Step Functions
7.8.2 Splines
7.8.3 Smoothing Splines and GAMs
7.8.4 Local Regression
7.9 Exercises
8 Tree-Based Methods
8.1 The Basics of Decision Trees
8.1.1 Regression Trees
8.1.2 Classification Trees
8.1.3 Trees Versus Linear Models
8.1.4 Advantages and Disadvantages of Trees
8.2 Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees
8.2.1 Bagging
8.2.2 Random Forests
8.2.3 Boosting
8.2.4 Bayesian Additive Regression Trees
8.2.5 Summary of Tree Ensemble Methods
8.3 Lab: Tree-Based Methods
8.3.1 Fitting Classification Trees
8.3.2 Fitting Regression Trees
8.3.3 Bagging and Random Forests
8.3.4 Boosting
8.3.5 Bayesian Additive Regression Trees
8.4 Exercises
9 Support Vector Machines
9.1 Maximal Margin Classifier
9.1.1 What Is a Hyperplane?
9.1.2 Classification Using a Separating Hyperplane
9.1.3 The Maximal Margin Classifier
9.1.4 Construction of the Maximal Margin Classifier
9.1.5 The Non-separable Case
9.2 Support Vector Classifiers
9.2.1 Overview of the Support Vector Classifier
9.2.2 Details of the Support Vector Classifier
9.3 Support Vector Machines
9.3.1 Classification with Non-Linear Decision Boundaries
9.3.2 The Support Vector Machine
9.3.3 An Application to the Heart Disease Data
9.4 SVMs with More than Two Classes
9.4.1 One-Versus-One Classification
9.4.2 One-Versus-All Classification
9.5 Relationship to Logistic Regression
9.6 Lab: Support Vector Machines
9.6.1 Support Vector Classifier
9.6.2 Support Vector Machine
9.6.3 ROC Curves
9.6.4 SVM with Multiple Classes
9.6.5 Application to Gene Expression Data
9.7 Exercises
10 Deep Learning
10.1 Single Layer Neural Networks
10.2 Multilayer Neural Networks
10.3 Convolutional Neural Networks
10.3.1 Convolution Layers
10.3.2 Pooling Layers
10.3.3 Architecture of a Convolutional Neural Network
10.3.4 Data Augmentation
10.3.5 Results Using a Pretrained Classifier
10.4 Document Classification
10.5 Recurrent Neural Networks
10.5.1 Sequential Models for Document Classification
10.5.2 Time Series Forecasting
10.5.3 Summary of RNNs
10.6 When to Use Deep Learning
10.7 Fitting a Neural Network
10.7.1 Backpropagation
10.7.2 Regularization and Stochastic Gradient Descent
10.7.3 Dropout Learning
10.7.4 Network Tuning
10.8 Interpolation and Double Descent
10.9 Lab: Deep Learning
10.9.1 Single Layer Network on Hitters Data
10.9.2 Multilayer Network on the MNIST Digit Data
10.9.3 Convolutional Neural Networks
10.9.4 Using Pretrained CNN Models
10.9.5 IMDB Document Classification
10.9.6 Recurrent Neural Networks
10.10 Exercises
11 Survival Analysis and Censored Data
11.1 Survival and Censoring Times
11.2 A Closer Look at Censoring
11.3 The Kaplan–Meier Survival Curve
11.4 The Log-Rank Test
11.5 Regression Models With a Survival Response
11.5.1 The Hazard Function
11.5.2 Proportional Hazards
11.5.3 Example: Brain Cancer Data
11.5.4 Example: Publication Data
11.6 Shrinkage for the Cox Model
11.7 Additional Topics
11.7.1 Area Under the Curve for Survival Analysis
11.7.2 Choice of Time Scale
11.7.3 Time-Dependent Covariates
11.7.4 Checking the Proportional Hazards Assumption
11.7.5 Survival Trees
11.8 Lab: Survival Analysis
11.8.1 Brain Cancer Data
11.8.2 Publication Data
11.8.3 Call Center Data
11.9 Exercises
12 Unsupervised Learning
12.1 The Challenge of Unsupervised Learning
12.2 Principal Components Analysis
12.2.1 What Are Principal Components?
12.2.2 Another Interpretation of Principal Components
12.2.3 The Proportion of Variance Explained
12.2.4 More on PCA
12.2.5 Other Uses for Principal Components
12.3 Missing Values and Matrix Completion
12.4 Clustering Methods
12.4.1 K-Means Clustering
12.4.2 Hierarchical Clustering
12.4.3 Practical Issues in Clustering
12.5 Lab: Unsupervised Learning
12.5.1 Principal Components Analysis
12.5.2 Matrix Completion
12.5.3 Clustering
12.5.4 NCI60 Data Example
12.6 Exercises
13 Multiple Testing
13.1 A Quick Review of Hypothesis Testing
13.1.1 Testing a Hypothesis
13.1.2 Type I and Type II Errors
13.2 The Challenge of Multiple Testing
13.3 The Family-Wise Error Rate
13.3.1 What Is the Family-Wise Error Rate?
13.3.2 Approaches to Control the Family-Wise Error Rate
13.3.3 Trade-Off Between the FWER and Power
13.4 The False Discovery Rate
13.4.1 Intuition for the False Discovery Rate
13.4.2 The Benjamini–Hochberg Procedure
13.5 A Re-Sampling Approach to p-Values and False Discovery Rates
13.5.1 A Re-Sampling Approach to the p-Value
13.5.2 A Re-Sampling Approach to the False Discovery Rate
13.5.3 When Are Re-Sampling Approaches Useful?
13.6 Lab: Multiple Testing
13.6.1 Review of Hypothesis Tests
13.6.2 Family-Wise Error Rate
13.6.3 False Discovery Rate
13.6.4 A Re-Sampling Approach