Caution
Section omitted to comply with the Honor Code
Bayesian Statistics: Techniques and Models
Oren Bochman
Conditional Probability, Bayes’ Law, Homework
Section omitted to comply with the Honor Code
---
title: "Homework on Deviance information criterion - M2L5HW2"
subtitle: "Bayesian Statistics: Techniques and Models"
categories:
- Bayesian Statistics
keywords:
- Conditional Probability
- Bayes' Law
- Homework
---
::::: {.content-visible unless-profile="HC"}
::: {.callout-caution}
Section omitted to comply with the Honor Code
:::
:::::
::::: {.content-hidden unless-profile="HC"}
::: {#exr-dic-1}
[DIC]{.column-margin} What is the primary interpretation of the penalty term in the deviance information criterion (DIC)?
:::
::: {.solution .callout-tip collapse="true"}
#### Solution:
- [ ] It penalizes overly simple models.
- [x] It gives an effective number of parameters estimated in the model.
- [ ] It estimates the optimal number of predictor variables (covariates) to include in the model.
- [ ] It gives an estimate of how much your mean squared error would increase for each additional parameter estimated.
It penalizes overly complicated models which fit this particular data set well, but may fail to generalize. This penalty will be particularly useful for hierarchical models.
:::
::: {#exr-dic-2}
[DIC]{.column-margin} DIC is a helpful tool for selecting among competing models. Which of the following changes to a linear model is not appropriate to evaluate with DIC?
:::
::: {.solution .callout-tip collapse="true"}
#### Solution:
- [ ] Adding or removing candidate covariates (predictors)
- [ ] Choice of distribution for the likelihood
- [ ] Transformation of covariates (predictors)
- [x] Minor changes to the prior distributions
If we optimize the model with respect to the prior, we might as well have not used priors. This practice can lead to inflated confidence and misleading results.
One exception is if we use a completely different class of priors or prior structure that has a specific purpose, like variable selection. We will explore this in the next lesson.
:::
::: {#exr-dic-3}
[DIC]{.column-margin} Although the residual analysis of the Anscombe data showed no major problem that we will pursue, it is still worthwhile to compare some competing models. First, calculate and report the DIC for the original model (that you fit for the previous quiz). Round your answer to the nearest whole number.
\index{dataset!Anscombe}
```{r}
#| label: C2-L07-Ex2-1
library("rjags")
library("car") # load the 'car' package
data("Anscombe") # load the data set
#?Anscombe # read a description of the data
#head(Anscombe) # look at the first few lines of the data
#pairs(Anscombe) # scatter plots for each pair of variables
mod_0=lm(data = Anscombe,formula = 'education ~ income + young + urban')
#summary((mod_0))
#par(mar = c(2.5, 1, 2.5, 1))
#plot(mod_0)
```
```{r}
#| label: model1
mod_string = " model {
for (i in 1:length(education)) {
education[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1]*income[i] + b[2]*young[i] + b[3]*urban[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:3) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(1.0/2.0, 1.0*1500.0/2.0)
## Initial guess of variance based on overall
## variance of education variable. Uses low prior
## effective sample size. Technically, this is not
## a true 'prior', but it is not very informative.
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
data_jags = as.list(Anscombe)
head(data_jags)
params_1 = c("b", "sig")
inits = function() {
#inits = list("b"=rnorm(2,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
inits = list("b"=rnorm(3,0.0,1.0/1.0e6), "prec"=rgamma(1,0.5,1.0*1500.0/2.0))
}
set.seed(72)
mod_1 = jags.model(textConnection(mod_string), data=data_jags, n.chains=3)
update(mod_1, 1e3) # burn-in
mod1_sim = coda.samples(model=mod_1,
variable.names=params_1,
n.iter=5e3)
mod1_csim = as.mcmc(do.call(rbind, mod1_sim))
summary(mod1_sim)
```
```{r}
#| label: mcmc diagnostics for mod1
#| fig-width: 7
#| fig-height: 10
plot(mod1_sim)
```
```{r}
#| label: model2
mod_string_2 = " model {
for (i in 1:length(education)) {
education[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1] * income[i] + b[2]* young[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:2) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(1.0/2.0, 1.0*1500.0/2.0)
## Initial guess of variance based on overall
## variance of education variable. Uses low prior
## effective sample size. Technically, this is not
## a true 'prior', but it is not very informative.
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
params_2 = c("b", "sig")
inits2 = function() {
inits = list("b"=rnorm(2,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
set.seed(72)
mod_2 = jags.model(textConnection(mod_string_2), inits = inits2, data=data_jags, n.chains=3)
update(mod_2, 1e3) # burn-in
mod2_sim = coda.samples(model=mod_2,
variable.names=params_2,
n.iter=5e3)
mod2_csim = as.mcmc(do.call(rbind, mod2_sim))
summary(mod2_sim)
```
```{r}
#| label: fig-mcmc-diagnostics-for-mod2
#| fig-width: 7
#| fig-height: 8
plot(mod2_sim)
```
```{r}
#| label: model3
mod_string_3 = " model {
for (i in 1:length(education)) {
education[i] ~ dnorm(mu[i], prec)
mu[i] = b0 + b[1]*income[i] + b[2] * young[i] + b[3] * income[i]* young[i]
}
b0 ~ dnorm(0.0, 1.0/1.0e6)
for (i in 1:3) {
b[i] ~ dnorm(0.0, 1.0/1.0e6)
}
prec ~ dgamma(1.0/2.0, 1.0*1500.0/2.0)
## Initial guess of variance based on overall
## variance of education variable. Uses low prior
## effective sample size. Technically, this is not
## a true 'prior', but it is not very informative.
sig2 = 1.0 / prec
sig = sqrt(sig2)
} "
params_3 = c("b", "sig")
inits3 = function() {
inits = list("b"=rnorm(3,0.0,100.0), "prec"=rgamma(1,1.0,1.0))
}
set.seed(72)
mod_3 = jags.model(textConnection(mod_string_3), inits = inits3, data=data_jags, n.chains=3)
update(mod_3, 1e3) # burn-in
mod3_sim = coda.samples(model=mod_3,
variable.names=params_3,
n.iter=5e3)
mod3_csim = as.mcmc(do.call(rbind, mod3_sim))
summary(mod3_sim)
```
```{r}
#| label: mcmc diagnostics for mod3
#| fig-width: 7
#| fig-height: 10
plot(mod3_sim)
```
```{r}
#| label: C2-L07-Ex2-8
dic.samples(mod_1, n.iter=1e5)
dic.samples(mod_2, n.iter=1e5)
dic.samples(mod_3, n.iter=1e5)
```
```{r}
#| label: C2-L07-Ex2-9
#mod3_sim
#length(mod3_csim[mod3_csim[,1]>=0,3])/dim(mod3_csim)[1]
length(mod1_csim[mod1_csim[,1]>=0,3])/dim(mod1_csim)[1]
```
:::
::: {.solution .callout-tip collapse="true"}
#### Solution:
486.4
This number by itself is not very useful. We now need to compare it to the DIC from other models and see which is lowest.
:::
::: {#exr-dic-4}
[DIC]{.column-margin} We will consider two alternative models for the Anscombe data. Because income and urban may be more highly correlated with each other than with education, and since urban was less significant than income in our models so far, we'll consider dropping it (we'll discuss correlated covariates more in the next lesson).
The two alternative models we will try are based on these adjustments:
1) Remove the term in the linear model for urban.
2) In addition to dropping urban, add an interaction term β3×income×youth.
Fit both models in JAGS and calculate the DIC for each. If predictive performance is our criterion, which model would you conclude performs best?
:::
```{r}
#| label: lbl-dic-4
mod_0=lm(data = Anscombe,formula = 'education ~ income + young + urban')
summary((mod_0))
par(mar = c(2.5, 1, 2.5, 1))
plot(mod_0)
```
::: {.solution .callout-tip collapse="true"}
#### Solution:
- [ ] The DIC is lowest for the second model without the urban covariate. This is our preferred model.
- [x] The DIC is lowest for the original model with all covariates. This is our preferred model.
- [ ] The DIC is lowest for the third model with the interaction term. This is our preferred model.
- [ ] The DIC is indistinguishable among the three models. We cannot clearly identify a preferred model.
With DIC, a decrease of even a few points can indicate significant gains in model predictive performance.
:::
::: {#exr-dic-5}
[DIC]{.column-margin} Using the model favored by the DIC, obtain a Monte Carlo estimate of the posterior probability that the coefficient for income is positive (greater than 0.0). Round your answer to two decimal places.
:::
::: {.solution .callout-tip collapse="true"}
#### Solution:
1.00
There is strong evidence that increases in per-capita income are associated with increases in per-capita education expenditures. We cannot conclude that one causes the other since these data are merely observational, but we do know they are correlated.
:::
::: {#exr-dic-6}
[convergence]{.column-margin} Which of the following accurately summarizes our conclusions based on the model favored by the DIC?
:::
::: {.solution .callout-tip collapse="true"}
#### Solution:
- [ ] Increases in per-capita income and percent youth are associated with decreases in mean per-capita education expenditures. Increases in percent urban are irrelevant.
- [ ] Increases in per-capita income and percent youth are associated with decreases in mean per-capita education expenditures. Increases in percent urban are associated with increases in mean per-capita education expenditures.
- [ ] Increases in per-capita income and percent urban are associated with increases in mean per-capita education expenditures. Increases in percent youth are associated with decreases in mean per-capita education expenditures.
- [x] Increases in per-capita income and percent youth are associated with increases in mean per-capita education expenditures. Increases in percent urban are associated with decreases in mean per-capita education expenditures.
```{r}
#| label: lbl-dic-6
summary(mod1_sim)
(pm_params1 = colMeans(mod1_csim))
```
Check the sign (positive or negative) of the coefficients in the model. A positive coefficient means that the variables (the covariate and response) increase or decrease together and a negative coefficient means that one increases when the other decreases.
:::
:::::