“Essentially, all models are wrong, but some are useful” — George Box
The chapter is 64 pages long and covers the following topics:
- Linear Regression
  - Simple Linear Regression
    - Estimating the Coefficients
    - Assessing the Accuracy of the Coefficient Estimates
    - Assessing the Accuracy of the Model
  - Multiple Linear Regression
    - Estimating the Regression Coefficients
    - Some Important Questions
  - Other Considerations in the Regression Model
    - Qualitative Predictors
    - Extensions of the Linear Model
    - Potential Problems
  - The Marketing Plan
  - Comparison of Linear Regression with K-Nearest Neighbors
- There is a lab: Linear Regression
- Linear regression is the fundamental statistical technique for modeling the relationship between a predictor variable and a response variable. In this model we assume a linear relationship between the predictor and the response, represented by a straight line. The goal is to estimate the coefficients of the model that minimize the difference between observed and predicted values.
- The model’s accuracy is assessed 🙈 🙉 🙊 via metrics like:
- the Residual Standard Error (RSE) 💔
- the adjusted R^2 💋
- the fantastic F-statistic, 💖 and
- the black 🐑 of statistics - p-values. 😱
The coefficients are interpreted as the average change in the response for a one-unit increase in the predictor, holding all other variables constant 🥱. Qualitative predictors are incorporated using 🥟 dummy variables, and interactions between predictors can capture complex relationships.
Model diagnostics help identify potential issues like non-linearity 🧛, heteroscedasticity 👾, outliers 👽, high leverage points 👻, and collinearity 🧟.
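Before diving into the details, here is a minimal sketch of what all of this looks like in practice. It assumes Python with numpy, pandas, and statsmodels, and it uses a made-up synthetic data set rather than the book’s Advertising data; it fits a simple linear regression and prints the RSE, R^2, F-statistic, and p-values mentioned above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y = 2 + 3x + noise (a stand-in for a real data set)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 2, size=100)
df = pd.DataFrame({"x": x, "y": y})

# Fit y ~ x by ordinary least squares
fit = smf.ols("y ~ x", data=df).fit()

print(fit.params)                      # estimated coefficients (intercept and slope)
print("RSE:", np.sqrt(fit.mse_resid))  # residual standard error = sqrt(RSS / (n - 2))
print("R^2:", fit.rsquared)            # proportion of variance explained
print("F:", fit.fvalue, "p:", fit.f_pvalue)  # overall F-test
print(fit.pvalues)                     # per-coefficient p-values
```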
Here is a light-hearted deep dive into regression to help you ease into this chapter on linear regression. 🎧
“The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively” — Fred Mosteller¹ and John Tukey
Linear Regression Orientation 🗝️
There are plenty of big words in this chapter, but don’t worry, we’ll 💪 break them down together 🤲! Here’s a quick overview of the key concepts covered in this chapter:
- Simple Linear Regression: Modeling the relationship between a single input (predictor) variable and an output (response) variable using a straight line.
- Multiple Linear Regression: Extending simple linear regression to accommodate multiple predictor variables.
- Model Assessment: Evaluating the accuracy and fit of linear regression models using metrics like RSE, R^2, the F-statistic, and p-values.
- Model Interpretation: Understanding the practical meaning of estimated coefficients and their significance in the context of the data.
- Qualitative Predictors: When we want to play with non-numerical predictors, e.g. color categories, we could encode them as numbers, but it is better to use dummy variables that track whether a category is on ▶️ or off ⏹️. In short: incorporating categorical variables into regression models using dummy variables.
- Interactions: Modeling complex relationships by including interaction terms between predictors. If the predictors are A and B, the interaction term is \beta A \times B, where \beta is the coefficient representing the strength of the interaction.
- Polynomial Regression: Capturing non-linear relationships using polynomial terms of predictor variables.
- Model Diagnostics 🌡️: Identifying and addressing potential issues 🤒 like non-linearity, heteroscedasticity, outliers, high leverage points, and collinearity.
- Heteroscedasticity 👾: When the spread of the response (and hence of the residuals) grows as the predictors increase. You can usually eyeball 🙄 this in the residuals vs. fitted values plot.
- Outliers 👽: Observations that don’t fit the general pattern of the data. If they are mistakes we can toss them out 🗑️, but if they are real we should investigate them. A related problem is high leverage points 👻: observations with unusual predictor values that can pull the fit towards themselves.
- Collinearity 🧟: When predictors are highly correlated, leading to unstable coefficient estimates. The VIF (Variance Inflation Factor) helps detect collinearity.
Simple Linear Regression
Model: The relationship between response (Y) and predictor (X) is represented as Y = β_0 + β_1 X + \epsilon
where
- β_0 is the intercept,
- β_1 is the slope, and
- \epsilon is the error term.
We have two problems:
- Inference: how can we find the best values of β_0 and β_1, i.e. the ones that minimize the difference between the observed and predicted values of Y?
- Prediction: given a value of X and the estimated parameters \hat{β}_0 and \hat{β}_1, how can we predict the corresponding value \hat{y}?
\hat{y} = \hat{β}_0 + \hat{β}_1 x
where \hat{y} is the predicted value of Y for a given value of X.
and we can define the residuals as:
e_i = y_i - \hat{y}_i
Estimating the Coefficients
Least Squares Estimation: The coefficients β_0 and β_1 are estimated by minimizing the Residual Sum of Squares (RSS), which quantifies the difference between observed and predicted values.
The RSS is defined as:
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2
The least squares estimates are:
\hat{β}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\hat{β}_0 = \bar{y} - \hat{β}_1 \bar{x}
where \bar{x} and \bar{y} are the sample means of the predictor and the response, respectively.
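These closed-form estimates are easy to compute directly. Here is a minimal sketch (assuming Python with numpy; the data is synthetic and the variable names are illustrative):

```python
import numpy as np

# Simulate data from a known line: y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1 + 2 * x + rng.normal(0, 1.5, size=50)

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates from the formulas above
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values, residuals, and RSS
y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat
rss = np.sum(residuals ** 2)

print(beta0_hat, beta1_hat, rss)
# Sanity check against numpy's polyfit (returns [slope, intercept])
print(np.polyfit(x, y, 1))
```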
Assessing the Accuracy of the Coefficient Estimates
- Assessing Accuracy: The standard errors of the coefficients help construct confidence intervals and perform hypothesis tests to assess the significance of the relationship between X and Y.
SE(\hat{β}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
SE(\hat{β}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]
where \sigma^2 = Var[\epsilon]
- R^2: This statistic quantifies the proportion of variability in the response explained by the model, indicating how well the model fits the data.
We can use these standard errors to compute confidence intervals. A 95% CI is a range which, under repeated sampling, will contain the true value of the parameter with probability 95%.
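Here is a sketch (Python with numpy and scipy, synthetic data) that plugs the formulas above into code: estimate \sigma^2 by RSS/(n-2), compute the standard errors, and form 95% confidence intervals using the appropriate t quantile.

```python
import numpy as np
from scipy import stats

# Synthetic data from y = 1 + 2x + noise
rng = np.random.default_rng(1)
n = 60
x = rng.uniform(0, 10, size=n)
y = 1 + 2 * x + rng.normal(0, 1.5, size=n)

# Least squares fit (slope, intercept)
beta1_hat, beta0_hat = np.polyfit(x, y, 1)
resid = y - (beta0_hat + beta1_hat * x)

# Estimate sigma^2 by RSS / (n - 2), then the standard errors
sigma2_hat = np.sum(resid ** 2) / (n - 2)
sxx = np.sum((x - x.mean()) ** 2)
se_beta1 = np.sqrt(sigma2_hat / sxx)
se_beta0 = np.sqrt(sigma2_hat * (1 / n + x.mean() ** 2 / sxx))

# 95% confidence intervals use the t distribution with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_beta1 = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
ci_beta0 = (beta0_hat - t_crit * se_beta0, beta0_hat + t_crit * se_beta0)
print(ci_beta0, ci_beta1)
```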
This approach may seem strange: we appear to have computed the values precisely using the estimators above. What could possibly change that would give us a different value?
This is intimated in the slides, which mention repeated sampling.
Where does this repeated sampling come from?
It is a Fisherian thought experiment: if we repeated the experiment many times, we would get a different sample each time, drawn from the same distribution that generates the data for each counterfactual version of the experiment.
Getting back to the book, we may rephrase this as follows. If we knew the data-generating rule, we could simulate data to our heart’s content. Each time the random number generator spits out different Xs, we get a new data set, and for each data set the least squares estimator gives us different parameters, each one a different approximation of the population regression line (the one fit to the points from all possible experiments). The point is that we are using the x values and \sigma^2 from a sample, not the population, so the estimates will vary from sample to sample.
The book says:
“In other words, in real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved.”
I.e. we don’t know the data-generating rule, and it is likely to be non-linear. But we can compare our approximate rule to the data and estimate the error associated with each parameter. That is what the standard error means (roughly, how much the estimate would vary across repeated samples).
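Since in a simulation we do know the data-generating rule, we can act out the repeated-sampling story directly. The sketch below (Python with numpy, purely illustrative) draws many data sets from the same rule, refits the least squares line each time, and checks that the spread of the slope estimates roughly matches the analytic standard error.

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_sims = 100, 2000
beta0_true, beta1_true, sigma = 2.0, 3.0, 2.0

slopes = np.empty(n_sims)
for i in range(n_sims):
    # Each counterfactual experiment: new Xs, new noise, same data-generating rule
    x = rng.uniform(0, 10, size=n)
    y = beta0_true + beta1_true * x + rng.normal(0, sigma, size=n)
    slopes[i], _ = np.polyfit(x, y, 1)   # least squares slope for this data set

# The empirical spread of the slope estimates across simulated experiments
print("mean slope:", slopes.mean())        # close to the true value 3.0
print("empirical SD of slope:", slopes.std())

# Compare with the analytic SE formula, using the true sigma and one typical x sample
# (approximate, since the Xs are redrawn in every simulated experiment)
x = rng.uniform(0, 10, size=n)
print("analytic SE:", np.sqrt(sigma**2 / np.sum((x - x.mean()) ** 2)))
```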
Assessing the Accuracy of the Model
Hypothesis Testing and Confidence Intervals
Some Important Questions
- Is at least one of the predictors X_1, X_2, \ldots , X_p useful in predicting the response?
- Do all the predictors help to explain Y, or is only a subset of the predictors useful?
- How well does the model fit the data?
- Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Multiple Linear Regression:
Model: The model extends to multiple predictors: Y = β_0 + β_1 X_1 + β_2 X_2 + \ldots + β_p X_p + \epsilon
Interpretation: Each coefficient β_j represents the average change in Y for a one-unit increase in X_j, holding all other predictors constant.
F-statistic: This statistic tests the overall significance of the model, determining whether at least one predictor is useful in predicting the response.
Variable Selection: Techniques like Mallow’s Cp, AIC, BIC, and adjusted R^2 are employed to choose the best subset of predictors for the model.
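A sketch of multiple linear regression (Python with statsmodels, synthetic data with illustrative column names): the fitted model reports the F-statistic for overall significance, per-coefficient p-values, and adjusted R^2, while AIC and BIC can be compared across candidate subsets of predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x1 and x2; x3 is pure noise
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
df["y"] = 1 + 2 * df.x1 - 1.5 * df.x2 + rng.normal(0, 1, size=n)

full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
subset = smf.ols("y ~ x1 + x2", data=df).fit()

# Overall F-test: is at least one predictor useful?
print("F:", full.fvalue, "p:", full.f_pvalue)
# Per-coefficient p-values: which predictors appear useful?
print(full.pvalues)
# Criteria for comparing subsets of predictors
print("adj R^2:", full.rsquared_adj, subset.rsquared_adj)
print("AIC:", full.aic, subset.aic, "BIC:", full.bic, subset.bic)
```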
Qualitative Predictors:
When we have categorical predictors, we can handle them as follows (see the sketch after this list):
- Dummy Variables: Categorical variables are incorporated by creating dummy variables, which take on values of 0 or 1 to represent different categories.
- Baseline Category: One category is chosen as the baseline, and its coefficient represents the average response for that category. Other dummy variable coefficients represent differences from the baseline.
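A sketch of dummy encoding with an illustrative categorical column (Python with pandas and statsmodels; the column name region and its levels are made up). The formula interface creates the dummy variables automatically and drops one level as the baseline.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 300
region = rng.choice(["East", "South", "West"], size=n)
# Each region has its own mean response
means = {"East": 10.0, "South": 12.0, "West": 9.0}
y = np.array([means[r] for r in region]) + rng.normal(0, 1, size=n)
df = pd.DataFrame({"region": region, "y": y})

# C(region) expands into dummy variables; the first level in sorted order ("East")
# becomes the baseline, so its mean is absorbed into the intercept.
fit = smf.ols("y ~ C(region)", data=df).fit()
print(fit.params)
# Intercept ~ mean of the baseline level; the other coefficients are
# differences from that baseline (South - East, West - East).
```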
Interactions:
- Interaction Terms: Including interaction terms like X_1 X_2 in the model allows the relationship between one predictor and the response to vary depending on the value of another predictor.
- Synergy Effect: Interactions can capture synergistic effects where the combined impact of two predictors is greater than the sum of their individual impacts.
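An interaction-term sketch (Python with statsmodels, synthetic data): in the formula interface x1:x2 adds just the product term, while x1 * x2 is shorthand for x1 + x2 + x1:x2.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 300
x1 = rng.uniform(0, 5, size=n)
x2 = rng.uniform(0, 5, size=n)
# The effect of x1 on y depends on x2 (a synergy effect): the slope grows with x2
y = 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(0, 1, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})

additive = smf.ols("y ~ x1 + x2", data=df).fit()
interact = smf.ols("y ~ x1 * x2", data=df).fit()   # x1 + x2 + x1:x2

print(interact.params)        # the x1:x2 coefficient estimates the interaction strength
print(additive.rsquared_adj, interact.rsquared_adj)  # the interaction model fits better
```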
Polynomial Regression:
- Non-linear Relationships: Polynomial terms like X^2 are used to model non-linear relationships between predictors and the response.
- Overfitting: Higher-degree polynomials can lead to overfitting, where the model fits the training data too closely but generalizes poorly to new data.
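A polynomial-regression sketch (Python with statsmodels, synthetic data): I(x**2) adds a squared term inside the formula, and comparing fits on held-out data is one way to see the overfitting risk of high-degree polynomials.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(13)
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1 + x - 0.8 * x**2 + rng.normal(0, 1, size=n)   # a quadratic truth
df = pd.DataFrame({"x": x, "y": y})
train, test = df.iloc[:150], df.iloc[150:]

for formula in ["y ~ x",
                "y ~ x + I(x**2)",
                "y ~ x + I(x**2) + I(x**3) + I(x**4) + I(x**5)"]:
    fit = smf.ols(formula, data=train).fit()
    test_mse = np.mean((test.y - fit.predict(test)) ** 2)
    print(formula, "train R^2:", round(fit.rsquared, 3), "test MSE:", round(test_mse, 3))
# Higher-degree polynomials keep nudging the training R^2 up,
# but the test MSE typically stops improving (or gets worse): overfitting.
```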
Model Diagnostics:
- Residual Plots: Visualizing residuals against fitted values helps assess linearity, homoscedasticity, and the presence of outliers.
- High Leverage Points: Observations with extreme predictor values can have a disproportionate impact on the model and should be investigated.
- Collinearity: High correlation among predictors can lead to unstable coefficient estimates and inflated standard errors. VIF (Variance Inflation Factor) helps detect collinearity.
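A diagnostics sketch (Python with statsmodels, synthetic data): residuals vs. fitted values, leverage from the hat matrix, and VIFs for collinearity. The variable names and thresholds are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(17)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.1, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
y = 1 + 2 * x1 + x3 + rng.normal(0, 1, size=n)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
fit = sm.OLS(y, X).fit()

# Residuals vs. fitted values: look for curvature (non-linearity) or a funnel shape
# (heteroscedasticity); here we just print a crude summary instead of plotting.
print("residual spread:", fit.resid.std())

# Leverage: diagonal of the hat matrix; values far above the average deserve a look
leverage = fit.get_influence().hat_matrix_diag
print("max leverage:", leverage.max(), "average:", X.shape[1] / n)

# VIF per predictor (skipping the constant); values well above 5-10 suggest collinearity
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, "VIF:", variance_inflation_factor(X.values, i))
```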
Key Quotes:
“The least squares approach chooses \hat{\beta}_0 and \hat{\beta}_1 to minimize the RSS.”
“We interpret β_j as the average effect on Y of a one unit increase in X_j , holding all other predictors fixed.”
“The woes of (interpreting) regression coefficients: … a regression coefficient β_j estimates the expected change in Y per unit change in X_j , with all other predictors held fixed. But predictors usually change together!”
“A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.”
“There will always be one fewer dummy variable than the number of levels. The level with no dummy variable — African American in this example — is known as the baseline.”
“When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model. But when advertising is split between the two media, then the model tends to underestimate sales.”
Resources
Slides
Chapter
Footnotes
1. Fred Mosteller studied the attribution problem of the disputed Federalist Papers and authored the amazing “Fifty Challenging Problems in Probability”.
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {Chapter 3: {Linear} {Regression}},
date = {2024-06-21},
url = {https://orenbochman.github.io/notes-islr/posts/ch03/},
langid = {en}
}