Six quick tips to improve modeling

Author

Oren Bochman

Published

Monday, August 26, 2024

Keywords

modeling, machine learning, data science, tips, tricks, best practices

The following tips are from Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill (Gelman and Hill 2007).

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. New York: Cambridge University Press.
  1. Fit many models:
    • Start with a baseline
    • Add more effects
    • Add interactions
    • Add non-linearities
    • Add more levels
    • Use different priors
    • Use different algorithms (each has its own inductive biases); a sketch of this incremental workflow follows this list.
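
Here is a minimal sketch of that incremental workflow, using statsmodels' formula API on made-up data (the column names, formulas, and coefficients are illustrative assumptions, not from the book):

```python
# Sketch: fit a ladder of models, from a simple baseline to richer ones,
# and compare them side by side. All columns and coefficients are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "group": rng.integers(0, 5, size=n),
})
df["y"] = (1 + 0.5 * df["x1"] - 0.3 * df["x2"] + 0.4 * df["x1"] * df["x2"]
           + rng.normal(scale=1.0, size=n))

formulas = {
    "baseline":      "y ~ x1",
    "more effects":  "y ~ x1 + x2",
    "interaction":   "y ~ x1 * x2",
    "non-linearity": "y ~ x1 * x2 + I(x1 ** 2)",
    "more levels":   "y ~ x1 * x2 + C(group)",  # treat group as a categorical level
}

for name, formula in formulas.items():
    fit = smf.ols(formula, data=df).fit()
    print(f"{name:<15} AIC={fit.aic:9.1f}  R^2={fit.rsquared:.3f}")
```
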
  2. Do a little work to make your computations faster and more reliable
    1. Fitting on a subset of the data is faster than fitting on the full data set.
    2. Redundant parameterization (e.g. re-centering and scaling) adds some parameters but doesn't really change the model, yet it makes the computation more stable by improving the geometry the sampler has to work with.
    3. Fake data and predictive simulations help us understand whether the problems we are facing are due to the model or the data. Fake data creates a version of the data for which we know what to expect (a simulation sketch follows this list).
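
A minimal fake-data sketch, assuming a simple linear model: simulate data with known parameters, fit the model, and check that we recover them. If we cannot, the problem is in the model or the code, not in the real data.

```python
# Sketch: fake-data simulation with known parameters.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
true_intercept, true_slope, sigma = 2.0, 0.7, 1.5

x = rng.normal(loc=50, scale=10, size=1000)   # predictor on an awkward raw scale
x_std = (x - x.mean()) / x.std()              # re-centering and scaling for stability
y = true_intercept + true_slope * x_std + rng.normal(scale=sigma, size=x.size)

fit = sm.OLS(y, sm.add_constant(x_std)).fit()
print(fit.params)  # should land close to [2.0, 0.7]
print(fit.bse)     # with plausible standard errors
```
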
  3. Graphing the relevant
    • Graphing the data is fine
    • Graphing the model is more informative (regression lines and curves)
    • I like to plot lots of graphs, such as regression diagnostics, residuals, and posterior predictive checks. Gelman and Hill, however, suggest that these are not the relevant graphs and warn that one should be prepared to explain any graph one shows; some diagnostic graphs, e.g. Cook's distance vs. leverage, are rather hard to explain. A sketch of graphing the fitted model over the data follows this list.
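
A minimal sketch of graphing the model rather than only the data: overlay the fitted regression line and its uncertainty band on a scatter of the raw points (the data here are simulated for illustration):

```python
# Sketch: plot the data together with the fitted regression line and a
# confidence band for the mean, instead of the data alone.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.8 * x + rng.normal(scale=2.0, size=x.size)

fit = sm.OLS(y, sm.add_constant(x)).fit()

grid = np.linspace(0, 10, 100)
pred = fit.get_prediction(sm.add_constant(grid)).summary_frame(alpha=0.05)

plt.scatter(x, y, s=10, alpha=0.5, label="data")
plt.plot(grid, pred["mean"], color="black", label="fitted line")
plt.fill_between(grid, pred["mean_ci_lower"], pred["mean_ci_upper"],
                 color="black", alpha=0.2, label="95% CI for the mean")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```
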
  4. Transformations
    • Logarithms of all-positive variables (primarily because this leads to multiplicative models on the original scale, which often makes sense)
    • Standardizing based on the scale or potential range of the data (so that coefficients can be more directly interpreted and scaled); an alternative is to present coefficients in scaled and unscaled forms
    • Transforming before multilevel modeling (thus attempting to make coefficients more comparable, thus allowing more effective second-level regressions, which in turn improve partial pooling). There is some risk in transformations for real-world models:
    • Can we be certain the predictions are still valid on the original scale?
    • What happens if new data come in and the z-transformation we used is no longer valid?
    • What if we are working with elasticities (percent change in the response per percent change in a predictor): does the transformation still make sense? (A sketch of log-transforming and standardizing follows this list.)
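
A minimal sketch of two of these transformations on made-up income data: take the log of an all-positive outcome and standardize a predictor, then report the coefficient in both scaled and unscaled forms (column names and numbers are illustrative assumptions):

```python
# Sketch: log-transform an all-positive outcome and standardize a predictor,
# then report the slope per raw unit and per standard deviation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
education = rng.normal(loc=12, scale=3, size=n)
# multiplicative data-generating process on the original (income) scale
income = np.exp(10 + 0.08 * (education - 12) + rng.normal(scale=0.5, size=n))

df = pd.DataFrame({"education": education, "income": income})
df["log_income"] = np.log(df["income"])
df["z_education"] = (df["education"] - df["education"].mean()) / df["education"].std()

raw = smf.ols("log_income ~ education", data=df).fit()
std = smf.ols("log_income ~ z_education", data=df).fit()

print("per extra year of education:   ", raw.params["education"])
print("per 1 SD increase in education:", std.params["z_education"])
# On the original scale a coefficient b corresponds to a multiplicative
# change of roughly exp(b) in income.
```
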
  5. Consider all coefficients as potentially varying
    • Practical concerns sometimes limit the feasible complexity of a model
    • Ideally we would like a model that is as complex as the data might be; in reality we need both a good fit and the ability to understand the model. A sketch of a varying-intercept, varying-slope model follows this list.
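
A minimal sketch of letting both the intercept and the slope vary by group, using statsmodels' MixedLM on simulated data (the group structure and effect sizes are made up):

```python
# Sketch: varying-intercept, varying-slope (multilevel) model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_groups, n_per_group = 20, 30
rows = []
for g in range(n_groups):
    a_g = 1.0 + rng.normal(scale=0.5)   # group-specific intercept
    b_g = 0.5 + rng.normal(scale=0.2)   # group-specific slope
    x = rng.normal(size=n_per_group)
    y = a_g + b_g * x + rng.normal(scale=1.0, size=n_per_group)
    rows.append(pd.DataFrame({"group": g, "x": x, "y": y}))
df = pd.concat(rows, ignore_index=True)

# re_formula="~x" puts a random effect on the slope as well as the intercept
model = smf.mixedlm("y ~ x", data=df, groups=df["group"], re_formula="~x")
fit = model.fit()
print(fit.summary())
```
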
  6. Estimate causal inferences in a targeted way, not as a byproduct of a large regression
    • Least squares regression does not care about causality.
    • If you do care, you need to go beyond it: sketch the structural model, identify the causal effect and the roles of the confounders, and only then use regression to estimate the effects (a sketch follows below).
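
A minimal sketch of that targeted workflow on simulated data: the structural model says z confounds both the treatment x and the outcome y, so we adjust for z deliberately rather than hoping a large regression sorts it out (all variable names and effect sizes are made up):

```python
# Sketch: the adjusted regression recovers the causal effect only because we
# first identified z as a confounder of x and y in the structural model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 5000
z = rng.normal(size=n)                      # confounder
x = 0.8 * z + rng.normal(size=n)            # treatment depends on z
y = 0.5 * x + 1.2 * z + rng.normal(size=n)  # true causal effect of x is 0.5

df = pd.DataFrame({"x": x, "y": y, "z": z})
naive = smf.ols("y ~ x", data=df).fit()         # ignores the confounder: biased
adjusted = smf.ols("y ~ x + z", data=df).fit()  # adjusts for the confounder

print("naive estimate of the effect of x:   ", naive.params["x"])
print("adjusted estimate of the effect of x:", adjusted.params["x"])  # ~0.5
```
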

Citation

BibTeX citation:
@online{bochman2024,
  author = {Bochman, Oren},
  title = {Six Quick Tips to Improve Modeling},
  date = {2024-08-26},
  url = {https://orenbochman.github.io/posts/2024/2024-08-26-six-quick-tips/},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2024. “Six Quick Tips to Improve Modeling.” August 26, 2024. https://orenbochman.github.io/posts/2024/2024-08-26-six-quick-tips/.