Chapter 13: Multiple Testing – An Introduction to Statistical Learning

Figure 1: Opening Remarks

Figure 2: Examples and Framework

TL;DR - Multiple Testing 🔬🔬🔬

Multiple testing refers to the practice of conducting multiple hypothesis tests simultaneously, which can lead to an increased risk of making at least one Type I error. The Family-wise Error Rate (FWER) and False Discovery Rate (FDR) are two common error rates used to control for the risks of multiple testing. The Bonferroni correction, Holm’s method, and the Benjamini-Hochberg procedure are popular methods to adjust p-values and control for these error rates. Re-sampling methods like permutation testing can also be used to approximate the null distribution of the test statistic when theoretical distributions are unknown or assumptions about the data are difficult to justify.

Glossary of Key Terms

Hypothesis Testing: A statistical method used to assess the evidence provided by data against a specific null hypothesis in favor of an alternative hypothesis.
Null Hypothesis (H0): The default statement of no effect or no difference that we aim to disprove using statistical evidence.
Alternative Hypothesis (Ha): The statement that contradicts the null hypothesis and represents the claim we want to support if there is enough evidence.
Test Statistic: A numerical summary of the data that is used to assess the compatibility of the data with the null hypothesis.
P-value: The probability of observing a test statistic as extreme as or more extreme than the one calculated from the data, assuming the null hypothesis is true.
Type I Error: Rejecting a true null hypothesis (false positive).
Type II Error: Failing to reject a false null hypothesis (false negative).
Multiple Hypothesis Testing: Performing multiple hypothesis tests simultaneously, which increases the probability of making at least one Type I error.
Family-wise Error Rate (FWER): The probability of making at least one Type I error among all hypotheses tested.
False Discovery Rate (FDR): The expected proportion of falsely rejected null hypotheses among all rejected null hypotheses.
Bonferroni Correction: A method to control the FWER by dividing the desired overall significance level by the number of tests.
Holm’s Method: A step-down method to control the FWER that adjusts p-value thresholds sequentially, potentially leading to more rejections than the Bonferroni correction.
Benjamini-Hochberg Procedure: A method to control the FDR by identifying the largest p-value that satisfies a specific criterion and rejecting all null hypotheses with p-values less than or equal to this identified p-value.
Re-sampling Methods: Techniques like permutation testing that repeatedly re-sample or shuffle the observed data to approximate the null distribution of the test statistic (a permutation test reassigns observations to groups, which amounts to sampling without replacement). They are especially helpful when theoretical null distributions are unknown or assumptions about the data are difficult to justify.
Permutation Testing: A re-sampling method where data points are randomly reassigned to different groups to create new datasets under the assumption of the null hypothesis. This allows for the estimation of the null distribution of the test statistic without relying on specific distributional assumptions.

Chapter Outline

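Central Problem: When testing multiple hypotheses simultaneously, the probability of making at least one Type I error (false positive) grows quickly with the number of tests. For example, with 100 independent tests at α = 0.05 and every null hypothesis true, the chance of at least one false positive exceeds 99%. Specific methods are therefore needed to control this error rate.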
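To put numbers on this growth, here is a minimal sketch. It assumes the m tests are independent and that every null hypothesis is true, in which case the FWER equals 1 − (1 − α)^m. The function name `fwer_independent` is made up for this illustration.

```python
def fwer_independent(alpha: float, m: int) -> float:
    """FWER for m independent tests, each at level alpha, when all nulls are true."""
    # P(no Type I error in any test) = (1 - alpha)^m, so FWER is the complement.
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 10, 100, 1000):
    print(f"m = {m:>4}: FWER = {fwer_independent(0.05, m):.4f}")
```

With dependent tests the exact value differs, so this is best read as a rough guide to how fast the problem escalates.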

Key Concepts:

Type I Error: Rejecting the null hypothesis when it is actually true (false positive).
Type II Error: Failing to reject the null hypothesis when it is actually false (false negative).
Family-Wise Error Rate (FWER): The probability of making at least one Type I error across all tested hypotheses.
False Discovery Rate (FDR): The expected proportion of rejected null hypotheses that are actually false positives.
Multiple Testing Correction: Adjusting p-values or rejection thresholds to account for the increased risk of Type I errors when testing multiple hypotheses.

Methods for Controlling FWER:

Bonferroni Correction: The simplest and most conservative method. It divides the desired FWER level α by the number of hypotheses being tested (m) and uses this stricter level as the rejection threshold for each individual hypothesis.
Formula: Reject H0j if its p-value pj < α/m.
Advantages: Easy to implement.
Disadvantages: Can be overly conservative, leading to higher Type II error rates, especially with a large number of hypotheses.

Holm’s Method: A step-down procedure that offers more power than Bonferroni while still controlling the FWER. It orders the p-values from smallest to largest, p(1) ≤ p(2) ≤ … ≤ p(m), and compares each to a threshold that loosens as j grows, stopping at the first p-value that exceeds its threshold.
Formula: Let L = min{j : p(j) > α/(m + 1 - j)}, where p(j) is the jth smallest p-value. Reject every H0j with pj < p(L).
Advantages: More powerful than Bonferroni; it rejects every hypothesis Bonferroni rejects, and possibly more.
Disadvantages: Still relatively conservative.

Tukey’s and Scheffé’s Methods: Specialized methods tailored for specific types of multiple comparisons. Tukey’s method is designed for all pairwise comparisons of means, while Scheffé’s method is more general and allows for any linear combination of means to be tested.
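The Bonferroni and Holm rules described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: the function names and the five p-values are made up for the example, and ties between p-values are not given any special treatment.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0j when p_j < alpha / m (same threshold for every test)."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down rule: walk the sorted p-values and stop at the
    first one that exceeds its threshold alpha / (m + 1 - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if p_values[i] > alpha / (m + 1 - rank):
            break  # this and all larger p-values are not rejected
        reject[i] = True
    return reject


# Illustrative p-values chosen for this sketch.
p = [0.006, 0.918, 0.012, 0.601, 0.756]
print("Bonferroni:", bonferroni_reject(p))
print("Holm:      ", holm_reject(p))
```

On these values Bonferroni rejects only the first hypothesis (0.006 < 0.05/5), while Holm also rejects the third, because 0.012 is compared with the looser threshold 0.05/4.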

Methods for Controlling FDR:

Benjamini-Hochberg Procedure: A step-up procedure that controls the FDR at a desired level (q). It orders the p-values from smallest to largest, finds the largest one that falls below its rank-dependent threshold, and rejects that hypothesis together with every hypothesis that has a smaller p-value.
Formula: Let L = max{j : p(j) < qj/m}, where p(j) is the jth smallest p-value. Reject every H0j with pj ≤ p(L).
Advantages: Less conservative than FWER control methods, leading to higher power.
Disadvantages: Tolerates some false positives among the rejections, so it typically yields more false positives than FWER control methods.

Re-Sampling Approaches:

When the theoretical null distribution of a test statistic is unknown, re-sampling methods (e.g., permutation tests) can be used to approximate the null distribution and calculate p-values. This is particularly useful for small sample sizes or complex test statistics.
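A permutation test for two groups can be sketched as follows. This is a simplified illustration, not the chapter's exact procedure: it uses the absolute difference in group means as the test statistic, samples random permutations instead of enumerating all of them, and the function name and data are hypothetical.

```python
import random


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Under the null hypothesis the group labels are exchangeable, so we
    shuffle the pooled data and recompute the statistic many times.
    """
    rng = random.Random(seed)
    n_x = len(x)
    pooled = list(x) + list(y)

    def abs_mean_diff(values):
        first, second = values[:n_x], values[n_x:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of observations to groups
        if abs_mean_diff(pooled) >= observed:
            at_least_as_extreme += 1
    # Fraction of permuted statistics at least as extreme as the observed one.
    return at_least_as_extreme / n_perm


# Made-up measurements for two small groups.
group_a = [12, 14, 13, 15, 16]
group_b = [9, 10, 11, 8, 10]
print(permutation_p_value(group_a, group_b))
```

In a multiple testing setting, one would compute such a p-value for each hypothesis and then pass the resulting p-values to a correction procedure.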

Illustrations: The chapter uses simulated data and real datasets (the Fund and Khan datasets) to demonstrate the impact of multiple testing on error rates and the effectiveness of the different correction methods.
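As a small illustration in the same spirit, the Benjamini-Hochberg procedure described above can be sketched in plain Python. The function name and the p-values are made up for this example, and the strict inequality follows the formula as stated in these notes.

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Step-up rule: find the largest rank j with p_(j) < q * j / m,
    then reject the hypotheses with the j smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] < q * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject


# Illustrative p-values chosen for this sketch.
p = [0.001, 0.025, 0.028, 0.039, 0.9]
print(benjamini_hochberg_reject(p, q=0.05))
```

Here the second p-value (0.025) misses its own threshold of 0.02, yet it is still rejected because a larger p-value (0.039 < 0.04) passes. That is the step-up behaviour. A Bonferroni threshold of 0.05/5 = 0.01 would reject only the first hypothesis.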

Key Quotes:

“The Bonferroni correction gives us peace of mind that we have not falsely rejected too many null hypotheses, but for a price: we reject few null hypotheses, and thus will typically make quite a few Type II errors.”

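“If m = 10,000, then we expect to falsely reject 100 null hypotheses by chance! That’s a lot of Type I errors, i.e., false positives!”
(Context: this assumes all m null hypotheses are true and each is tested at level α = 0.01, so the expected number of false rejections is α × m = 100.)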

Conclusion:

Multiple hypothesis testing is a common challenge in data analysis, requiring careful consideration of error control methods. Choosing the appropriate method depends on the specific research question, desired level of error control, and available resources. Re-sampling approaches offer a flexible alternative when theoretical null distributions are unavailable.

Multiple Hypothesis Testing Study Guide

Short Answer Questions

Instructions: Answer the following questions in 2-3 sentences each.

What is the purpose of a test statistic in hypothesis testing?
Explain the concept of a p-value and its role in hypothesis testing.
What is the difference between a Type I error and a Type II error?
Why is multiple hypothesis testing problematic, even if we control the Type I error rate for each individual test?
What is the family-wise error rate (FWER), and how does it differ from the Type I error rate?
Explain how the Bonferroni correction controls the FWER.
What is the advantage of Holm’s method over the Bonferroni correction for controlling the FWER?
Define the false discovery rate (FDR). How does it differ from the FWER?
Describe the Benjamini-Hochberg procedure for controlling the FDR.
When might a re-sampling approach be useful for multiple hypothesis testing? Provide an example.

Short Answer Key

A test statistic summarizes the data’s compatibility with the null hypothesis. Its value helps determine the strength of evidence against the null hypothesis.
The p-value is the probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the data, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
A Type I error occurs when we reject a true null hypothesis, while a Type II error occurs when we fail to reject a false null hypothesis.
When performing multiple hypothesis tests, the probability of making at least one Type I error increases, even if the individual Type I error rate for each test is controlled. This leads to an inflated overall false positive rate.
The family-wise error rate (FWER) is the probability of making at least one Type I error among all hypotheses tested. Unlike the Type I error rate which focuses on a single test, the FWER considers the probability of making any false rejections across multiple tests.
The Bonferroni correction controls the FWER by dividing the desired overall significance level (alpha) by the number of tests (m). This more stringent significance level for each individual test ensures that the FWER is maintained at or below alpha.
Holm’s method is a step-down procedure that adjusts the p-value thresholds for each hypothesis sequentially, starting with the smallest p-value. This allows for potentially more rejections compared to the Bonferroni correction, which uses a fixed threshold for all tests, while still controlling the FWER.
The false discovery rate (FDR) is the expected proportion of false rejections (Type I errors) among all rejected hypotheses. It focuses on controlling the rate of false positives among the discoveries made, rather than controlling the probability of making any Type I error like the FWER.
The Benjamini-Hochberg procedure controls the FDR by ranking p-values and finding the largest p-value that satisfies a specific criterion. It then rejects all null hypotheses with p-values less than or equal to this identified p-value.
Re-sampling approaches are useful when the theoretical null distribution of the test statistic is unknown, or when we want to avoid making strong assumptions about the data. For instance, in a two-sample t-test with small sample sizes, where the normality assumption might not hold, we can use permutation testing to approximate the null distribution and calculate p-values.

Essay Questions

Discuss the trade-off between controlling the FWER and the power of multiple hypothesis tests. How do different correction methods impact this trade-off?
In what situations might controlling the FDR be more appropriate than controlling the FWER? Explain your reasoning with examples.
Compare and contrast the Bonferroni correction and Holm’s method for controlling the FWER. Discuss their strengths and weaknesses, and provide scenarios where one might be preferred over the other.
Explain the concept of permutation testing and its use in multiple hypothesis testing when theoretical null distributions are unavailable. How does it help in estimating p-values?
Describe a real-world example of multiple hypothesis testing and discuss the implications of choosing different error control methods (FWER, FDR) for the specific problem.

Reuse

CC SA BY-NC-ND

Citation

BibTeX citation:

@online{bochman2024,
  author = {Bochman, Oren},
  title = {Chapter 13: {Multiple} {Testing}},
  date = {2024-09-20},
  url = {https://orenbochman.github.io/notes-islr/posts/ch13/},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2024. “Chapter 13: Multiple Testing.” September 20, 2024. https://orenbochman.github.io/notes-islr/posts/ch13/.