Zero inflated data is data, typically counts, that contains more zeros than a standard distribution such as the Poisson would predict.
This type of data is common in many fields, including biology, economics, and social sciences.
In this post, we will explore how to analyze zero inflated data.
Imagine you are creating a regression model to predict the sales for a product as a function of price.
You have a dataset that contains the sales data for the product, as well as the price of the product.
Looking at the data, you notice:
- As price increases, the number of sales decreases.
- There are many days where no sales are made.
- The regression line does not fit the data well.
When you dig deeper into the data, you notice that:
- Some products, when out of stock, cannot be bought until the next shipment arrives.
- Occasionally the IT system fails and no sales are recorded.
- Even after accounting for the two points above, demand is stochastic: there is always a chance that no sales are made on a given day.
- For high value products, if we are more expensive than the competition by some threshold, chances of zero sales are high.
- Sales get canceled, and a cancellation can also be recorded as a negative sale on some other day.
- Sales are seasonal, some days for some products are far more likely to have zero sales than others.
- There was a marketing campaign that increased sales for a certain period of time, but the budget ran out and sales dropped to zero.
- There was a well marketed promotion that made clients buy more than they needed at a low price and then they did not buy for a long time.
- Traffic to our store for some products is mainly from a recommendation engine and that engine may stop recommending our product or drop our product’s ranking leading to zero sales.
- Some products get bad reviews on the site, and their sales might drop to zero.
- While we can track some of these factors, we cannot track all of them. Also, their effects can be specific to a certain product or a certain day.
- In the end, we cannot match cause to effect; we can only speculate why we get zero sales on some days.
- However, regardless of the root cause, we probably want our model to fit the real functional relationship between price and sales well.
- We might also want to be able to predict the stochastic nature of the demand (i.e. the probability of zero sales on a certain day, assuming a stochastic model of demand).
- We might also want to know how many more zeros we are getting than we would expect from a simple model of demand.
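The last point can be checked with a quick simulation. What follows is a minimal sketch, assuming daily sales follow a Poisson distribution with an extra chance of a forced zero. The rate of 3 sales per day, the 30% forced-zero probability, and all function names are made up for illustration. It compares the observed share of zero days with the share a plain Poisson with the same mean would imply.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_daily_sales(n_days, lam, p_structural_zero, rng):
    """Each day is a forced zero with probability p_structural_zero, else a Poisson draw."""
    return [0 if rng.random() < p_structural_zero else sample_poisson(lam, rng)
            for _ in range(n_days)]

def excess_zero_share(counts):
    """Observed share of zeros minus the share implied by a Poisson with the same mean."""
    mean = sum(counts) / len(counts)
    observed = sum(1 for c in counts if c == 0) / len(counts)
    expected = math.exp(-mean)  # P(Y = 0) for a Poisson with this mean
    return observed, expected, observed - expected

# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
sales = simulate_daily_sales(n_days=5000, lam=3.0, p_structural_zero=0.3, rng=rng)
observed, expected, excess = excess_zero_share(sales)
print(f"observed zeros: {observed:.3f}, Poisson-implied zeros: {expected:.3f}, excess: {excess:.3f}")
```

A clearly positive excess suggests that a plain Poisson model understates the number of zero days, which is the situation the models below are meant to handle.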
Some approaches to model zero inflated data are:
- Zero inflated Poisson regression
- Zero inflated negative binomial regression
- Zero inflated generalized linear models
- Zero inflated mixed models
- Zero inflated hurdle models
Zero inflated Poisson regression
Zero inflated Poisson (ZIP) regression models a count outcome that has more zeros than a Poisson distribution alone would produce.
The model assumes that the data is generated by a Poisson distribution, but that two processes can generate zeros:
- A structural process that forces a zero with probability \pi_i, regardless of what the Poisson process would have produced.
- The Poisson process itself, which yields a zero with probability e^{-\lambda_i}.
The model is defined as:
Y_i \sim \begin{cases} 0 & \text{with probability } \pi_i \\ \text{Poisson}(\lambda_i) & \text{with probability } 1 - \pi_i \end{cases}
where Y_i is the observed value, \pi_i is the probability of a structural zero, and \lambda_i is the mean of the Poisson distribution. The overall probability of observing a zero is therefore \pi_i + (1 - \pi_i) e^{-\lambda_i}.
The parameters can be estimated by maximum likelihood.
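To make the estimation step concrete, here is a minimal sketch of maximum likelihood for the simplest case: a constant \pi and \lambda with no covariates, fitted by the EM algorithm. The simulated data, the "true" values 0.3 and 3.0, and the function names are assumptions for illustration. A real regression would let \pi_i and \lambda_i depend on predictors and would normally use a statistics library.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def zip_pmf(k, pi, lam):
    """P(Y = k) for a zero inflated Poisson with structural-zero probability pi."""
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    return pi * (k == 0) + (1.0 - pi) * poisson

def fit_zip_em(counts, n_iter=200):
    """Maximum likelihood for constant (pi, lam) via EM; no covariates in this sketch."""
    n = len(counts)
    n_zero = sum(1 for c in counts if c == 0)
    total = sum(counts)
    pi, lam = 0.5 * n_zero / n, max(total / n, 1e-8)  # crude starting values
    for _ in range(n_iter):
        # E-step: probability that an observed zero is a structural zero.
        w = pi / (pi + (1.0 - pi) * math.exp(-lam))
        # M-step: update the mixing weight and the Poisson mean.
        pi = n_zero * w / n
        lam = total / (n - n_zero * w)
    return pi, lam

# Simulated data with made-up "true" values, used only to check the fit.
rng = random.Random(1)
true_pi, true_lam = 0.3, 3.0
counts = [0 if rng.random() < true_pi else sample_poisson(true_lam, rng)
          for _ in range(5000)]
pi_hat, lam_hat = fit_zip_em(counts)
print(f"pi_hat = {pi_hat:.3f}, lam_hat = {lam_hat:.3f}")
print(f"fitted P(Y = 0) = {zip_pmf(0, pi_hat, lam_hat):.3f}")
```

EM is used here only because it keeps the sketch short and dependency-free; a general-purpose optimizer applied to the log-likelihood would target the same estimates.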
Zero inflated negative binomial regression
Zero inflated negative binomial (ZINB) regression models a count outcome that has excess zeros and whose non-zero part is overdispersed, meaning its variance exceeds its mean.
The model assumes that the data is generated by a negative binomial distribution, but that two processes can generate zeros:
- A structural process that forces a zero with probability \pi_i, regardless of what the negative binomial process would have produced.
- The negative binomial process itself, which also yields zeros with some probability.
The model is defined as:
Y_i \sim \begin{cases} 0 & \text{with probability } \pi_i \\ \text{NegBin}(\mu_i, \theta) & \text{with probability } 1 - \pi_i \end{cases}
where Y_i is the observed value, \pi_i is the probability of a structural zero, \mu_i is the mean of the negative binomial distribution, and \theta is the dispersion parameter.
The parameters can be estimated by maximum likelihood.
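The definition can be turned into a probability mass function directly. The sketch below assumes the common parameterization in which the negative binomial has mean \mu and variance \mu + \mu^2 / \theta; other texts parameterize the dispersion differently. The parameter values are hypothetical.

```python
import math

def negbin_pmf(k, mu, theta):
    """Negative binomial pmf with mean mu and variance mu + mu**2 / theta (assumed form)."""
    log_p = (math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
             + theta * math.log(theta / (theta + mu))
             + k * math.log(mu / (theta + mu)))
    return math.exp(log_p)

def zinb_pmf(k, pi, mu, theta):
    """P(Y = k) for a zero inflated negative binomial with structural-zero probability pi."""
    return pi * (k == 0) + (1.0 - pi) * negbin_pmf(k, mu, theta)

def zinb_mean_var(pi, mu, theta):
    """Closed-form mean and variance of the mixture under the parameterization above."""
    mean = (1.0 - pi) * mu
    var = (1.0 - pi) * mu * (1.0 + mu / theta + pi * mu)
    return mean, var

# Hypothetical parameter values, for illustration only.
pi, mu, theta = 0.3, 3.0, 1.5
mean, var = zinb_mean_var(pi, mu, theta)
print(f"P(Y = 0) = {zinb_pmf(0, pi, mu, theta):.3f}")
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

Smaller values of \theta give a heavier tail and more zeros from the count process itself, so part of the excess zeros can be absorbed by dispersion rather than by \pi.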
Zero inflated generalized linear models
The zero inflated generalized linear model combines a zero inflation component with a generalized linear model (GLM) for the remaining observations.
It generalizes the zero inflated Poisson and zero inflated negative binomial regression models by allowing other response distributions for the non-structural part.
The model is defined as:
Y_i \sim \begin{cases} 0 & \text{with probability } \pi_i \\ \text{GLM}(\eta_i) & \text{with probability } 1 - \pi_i \end{cases}
where Y_i is the observed value, \pi_i is the probability of a structural zero, and \eta_i is the linear predictor of the GLM component, which is mapped to the mean of the response through a link function.
The parameters can be estimated by maximum likelihood.
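For the price and sales example, one way to connect the linear predictors to the model is a log link for the Poisson mean and a logit link for the zero probability. The sketch below assumes exactly that setup; the coefficient values are invented for illustration and are not estimates from any data.

```python
import math

def zip_components(price, beta, gamma):
    """Return (pi, lam) for one price: logit link for pi, log link for lam."""
    lam = math.exp(beta[0] + beta[1] * price)                     # log(lam) = b0 + b1 * price
    pi = 1.0 / (1.0 + math.exp(-(gamma[0] + gamma[1] * price)))   # logit(pi) = g0 + g1 * price
    return pi, lam

def expected_sales(price, beta, gamma):
    """E[Y] = (1 - pi) * lam under the zero inflated Poisson."""
    pi, lam = zip_components(price, beta, gamma)
    return (1.0 - pi) * lam

def prob_zero_sales(price, beta, gamma):
    """P(Y = 0) = pi + (1 - pi) * exp(-lam)."""
    pi, lam = zip_components(price, beta, gamma)
    return pi + (1.0 - pi) * math.exp(-lam)

# Hypothetical coefficients: sales fall and structural zeros rise as price increases.
beta = (3.0, -0.10)
gamma = (-4.0, 0.15)
for price in (10, 20, 30):
    print(f"price={price}: expected sales={expected_sales(price, beta, gamma):.2f}, "
          f"P(zero)={prob_zero_sales(price, beta, gamma):.2f}")
```

Keeping separate coefficients for the two components lets price affect the typical sales volume and the chance of a structural zero in different ways.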
Zero inflated mixed models
The zero inflated mixed model adds random effects to a zero inflated regression model.
It extends the zero inflated Poisson and zero inflated negative binomial regression models to grouped or repeated observations, such as daily sales recorded for many different products.
The model is defined as:
Y_i \sim \begin{cases} 0 & \text{with probability } \pi_i \\ \text{Mixed}(\eta_i) & \text{with probability } 1 - \pi_i \end{cases}
where Y_i is the observed value, \pi_i is the probability of a structural zero, and \eta_i is the linear predictor, which now contains both fixed and random effects.
The parameters can be estimated by maximum likelihood, with the random effects integrated out of the likelihood.
Zero inflated hurdle models
The hurdle model is a two-part alternative to the zero inflated models above.
Instead of mixing two sources of zeros, it assumes that all zeros come from a single binary process, and that positive values come from a count distribution truncated at zero.
The model is defined as:
Y_i \sim \begin{cases} 0 & \text{with probability } \pi_i \\ \text{TruncatedCount}(\eta_i) & \text{with probability } 1 - \pi_i \end{cases}
where Y_i is the observed value, \pi_i is the probability of a zero value, \eta_i is the linear predictor of the count component, and TruncatedCount is a count distribution, such as the Poisson or negative binomial, restricted to values of one or more.
The parameters can be estimated by maximum likelihood.
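A hurdle model is commonly written with a zero-truncated count distribution for the positive values, so that every zero comes from the binary part. The sketch below assumes that formulation with a Poisson count component and contrasts it with the zero inflated Poisson; the parameter values are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Plain Poisson pmf."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def hurdle_poisson_pmf(k, pi, lam):
    """Hurdle model: P(Y = 0) = pi; positive counts follow a zero-truncated Poisson."""
    if k == 0:
        return pi
    return (1.0 - pi) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

def zip_pmf(k, pi, lam):
    """Zero inflated Poisson: zeros come from both the structural and the Poisson part."""
    return pi * (k == 0) + (1.0 - pi) * poisson_pmf(k, lam)

# Hypothetical parameter values, for illustration only.
pi, lam = 0.3, 3.0
print(f"hurdle P(Y = 0) = {hurdle_poisson_pmf(0, pi, lam):.3f}")
print(f"ZIP    P(Y = 0) = {zip_pmf(0, pi, lam):.3f}")
```

With the same \pi, the hurdle model puts exactly \pi on zero, while the zero inflated model puts more, because its count component can also produce zeros. This also means a hurdle model can describe data with fewer zeros than the count distribution implies, which a zero inflated model cannot.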
In conclusion, zero inflated data contains more zeros than a standard model of the data would predict, and those zeros can arise from several different causes that we often cannot observe.
There are several approaches to model zero inflated data, including zero inflated Poisson regression, zero inflated negative binomial regression, zero inflated generalized linear models, zero inflated mixed models, and zero inflated hurdle models.
Citation
@online{bochman2024,
author = {Bochman, Oren},
title = {Zero Inflated Data},
date = {2024-06-23},
url = {https://orenbochman.github.io/posts/2024/2024-06-23-zero-inflated-data/2024-06-23-zero-inflated-data.html},
langid = {en}
}