6  Distributions - M1L3

Bayesian Statistics: From Concept to Data Analysis

Outline of distributions
Bayesian Statistics
Author

Oren Bochman

Keywords

Distributions, Bernoulli Distribution, Binomial Distribution

6.1 Distributions

6.2 The Bernoulli & Binomial Distribution

Figure 6.1: Bernoulli and Binomial Distributions

These two distributions are built on the model of a (possibly biased) coin toss trial.

  • We use the Bernoulli distribution to model the outcome of a single such coin toss trial.
  • We use the Binomial distribution to model the number of heads obtained in N independent trials.

6.2.1 The Bernoulli Distribution

Arises when modeling events with two possible outcomes, Success and Failure; for a coin toss these can be Heads and Tails.

X \sim \mathrm{Bernoulli}(p) = \begin{cases} \mathbb{P}r(X=1) = p & \text{success} \\ \mathbb{P}r(X=0)=1-p & \text{failure} \end{cases} \tag{6.1}

where the parameter p is the probability of getting heads.

Notation:

  • we use (Roman) p when its value is known.
  • we use (Greek) \theta when its value is unknown.

Both outcomes can be written as a single probability function. Since X is discrete this is a probability mass function, but we may also call it a Probability Density Function (PDF) in the measure-theoretic sense.

f(X=x\mid p) = p^x(1-p)^{1-x} \mathbb{I}_{\{0,1\}}(x) \tag{6.2}

\mathbb{E}[X]= p \tag{6.3}

\mathbb{V}ar[X]= p(1-p) \tag{6.4}

import numpy as np
from scipy.stats import bernoulli
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
p = 0.3
mean, var, skew, kurt = bernoulli.stats(p, moments='mvsk')
print(f'{mean=:1.2f}, {var=:1.2f}, {skew=:1.2f}, {kurt=:1.2f}')
mean=0.30, var=0.21, skew=0.87, kurt=-1.24
# the support of a Bernoulli RV is {0, 1}
x = np.arange(2)
ax.plot(x, bernoulli.pmf(x, p), 'bo', ms=8, label='bernoulli pmf')
ax.vlines(x, 0, bernoulli.pmf(x, p), colors='b', lw=5, alpha=0.5)

rv = bernoulli(p)
ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1,
        label='frozen pmf')
ax.legend(loc='best', frameon=False)
plt.show()

Bernoulli distribution

## Generate random numbers
r = bernoulli.rvs(p, size=10)
r
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1])
Figure 6.2: Jacob Bernoulli
Tip: Biographical note on Jacob Bernoulli

It seems that to make a correct conjecture about any event whatever, it is necessary to calculate exactly the number of possible cases and then to determine how much more likely it is that one case will occur than another. (Bernoulli 1713)

The Bernoulli distribution as well as the Binomial distribution are due to Jacob Bernoulli (1655-1705), a prominent mathematician in the Bernoulli family. He discovered the fundamental mathematical constant e. However, his most important contribution was in the field of probability, where he derived the first version of the law of large numbers.

for a fuller biography see

6.2.2 The Binomial Distribution

\overbrace{\underbrace{\fbox{0}\ \ldots \fbox{0}}_{N_0}\ \underbrace{\fbox{1}\ \ldots \fbox{1}}_{N_1}}^N \tag{6.5}

The Binomial distribution models the count of successes in independent Bernoulli trials. It arises when we sum N independent and identically distributed Bernoulli RVs, each with the same probability of success \theta.

Tip: Conditions
  • Discrete data
  • Two possible outcomes for each trial
  • Each trial is independent
  • The probability of success/failure is the same in each trial

Binomial reparams mindmap


X \sim \mathrm{Bin}(n,\theta) \tag{6.6}

The probability mass function is

f(X=x \mid \theta) = {n \choose x} \theta^x(1-\theta)^{n-x} \tag{6.7}

\mathcal{L}(\theta)=\prod_{i=1}^{n} {n\choose x_i} \theta^{x_i} (1-\theta)^{n-x_i} \tag{6.8}

\begin{aligned} \ell( \theta) &= \log \mathcal{L}( \theta) \\ &= \sum_{i=1}^n \left[\log {n\choose x_i} + x_i \log \theta + (n-x_i)\log (1- \theta) \right] \end{aligned} \tag{6.9}

\mathbb{E}[X]= n \theta \tag{6.10}

\mathbb{V}ar[X]= n \theta (1-\theta) \tag{6.11}

\mathbb{H}(X) = \frac{1}{2}\log_2 \left (2\pi e\, n\, \theta(1 - \theta)\right) + O(\frac{1}{n}) \tag{6.12}

\mathcal{I}(\theta)=\frac{n}{ \theta \cdot (1- \theta)} \tag{6.13}
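
A quick numerical sanity check of these quantities, a minimal sketch in which the values of n and \theta are purely illustrative:

import numpy as np
from scipy.stats import binom

# illustrative values: n trials per observation, true success probability theta
n, theta = 10, 0.3
x = binom.rvs(n, theta, size=5000, random_state=np.random.default_rng(42))

# the sample moments should be close to n*theta and n*theta*(1-theta)
print(f'sample mean {x.mean():.3f} vs n*theta          = {n*theta:.3f}')
print(f'sample var  {x.var():.3f} vs n*theta*(1-theta) = {n*theta*(1-theta):.3f}')

# the log-likelihood above is maximized in closed form at theta_hat = xbar / n
theta_hat = x.mean() / n
print(f'MLE theta_hat = {theta_hat:.3f}')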

6.2.2.1 Relationships

Figure 6.3: binomial distribution relations

The Binomial Distribution is related to:

  • the Geometric distribution,
  • The Multinomial distribution with two categories is the binomial.
  • the Poisson distribution. If X \sim \mathrm{Binomial}(n, p) and Y \sim \mathrm{Poisson}(np), then \mathbb{P}r(X = k) \approx \mathbb{P}r(Y = k) for large n and small p (a numerical check appears after this list).
  • the Bernoulli distribution. If X \sim \mathrm{Binomial}(n, p) with n = 1, then X \sim \mathrm{Bernoulli}(p).
  • the Normal distribution. If X \sim \mathrm{Binomial}(n, p) and Y \sim \mathcal{N}(\mu=np,\ \sigma^2=np(1-p)), then for integers j and k, \mathbb{P}r(j \le X \le k) \approx \mathbb{P}r(j - \tfrac{1}{2} \le Y \le k + \tfrac{1}{2}). The approximation is better when p \approx 0.5 and when n is large. For more information, see the normal approximation to the binomial.
  • the Hypergeometric distribution. The difference between a binomial distribution and a hypergeometric distribution is the difference between sampling with replacement and sampling without replacement. As the population size increases relative to the sample size, the difference becomes negligible. So if Y \sim \mathrm{HyperGeometric}(N, a, b) and X \sim \mathrm{Binomial}(N, p) with p = \frac{a}{a+b} held fixed, then

\lim_{a+b\to \infty} \mathbb{P}r(Y = k) = \mathbb{P}r(X = k)
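
As a rough numerical check of the Poisson and normal approximations above, a minimal sketch in which the values of n, p, j, and k are purely illustrative:

import numpy as np
from scipy.stats import binom, poisson, norm

# large n, small p: Binomial(n, p) is close to Poisson(np)
n, p = 500, 0.01
k = np.arange(15)
print(np.abs(binom.pmf(k, n, p) - poisson.pmf(k, n * p)).max())   # the pmfs nearly coincide

# large n, p near 0.5: normal approximation with continuity correction
n, p = 500, 0.5
j, k = 240, 260
exact = binom.cdf(k, n, p) - binom.cdf(j - 1, n, p)
mu, sigma = n * p, np.sqrt(n * p * (1 - p))
approx = norm.cdf(k + 0.5, mu, sigma) - norm.cdf(j - 0.5, mu, sigma)
print(exact, approx)                                              # nearly equal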

import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
n, p = 5, 0.4
mean, var, skew, kurt = binom.stats(n, p, moments='mvsk')
print(f'{mean=:1.2f}, {var=:1.2f}, {skew=:1.2f}, {kurt=:1.2f}')
mean=2.00, var=1.20, skew=0.18, kurt=-0.37
x = np.arange(binom.ppf(0.01, n, p), binom.ppf(0.99, n, p))
ax.plot(x, binom.pmf(x, n, p), 'bo', ms=8, label='binom pmf')
ax.vlines(x, 0, binom.pmf(x, n, p), colors='b', lw=5, alpha=0.5)
rv = binom(n, p)
ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1,
        label='frozen pmf')
ax.legend(loc='best', frameon=False)
plt.show()

## generate random numbers
r = binom.rvs(n, p, size=10)
r
array([0, 1, 2, 2, 4, 2, 1, 2, 2, 1])

6.2.3 The Standard Uniform Distribution

X \sim U[0,1] \tag{6.14}

f(x)= \begin{cases} 1, & \text{if}\ x \in [0,1] \\ 0, & \text{otherwise} \end{cases} = \mathbb{I}_{\{0 \le x \le 1\}}(x) \tag{6.15}

import numpy as np
from scipy.stats import uniform
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)

mean, var, skew, kurt = uniform.stats(moments='mvsk')
print(f'{mean=:1.2f}, {var=:1.2f}, {skew=:1.2f}, {kurt=:1.2f}')
mean=0.50, var=0.08, skew=0.00, kurt=-1.20
# we use ppf to get the domain from a range of (0.01,0.99)
x = np.linspace(uniform.ppf(0.01), uniform.ppf(0.99), 100)
ax.plot(x, uniform.pdf(x), 'r-', lw=5, alpha=0.6, label='uniform pdf')
rv = uniform()
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

## generate random numbers
r = uniform.rvs(size=1000)

# And compare the histogram:
ax.hist(r, density=True, bins='auto', histtype='stepfilled', alpha=0.2)
ax.set_xlim([x[0], x[-1]])
ax.legend(loc='best', frameon=False)
plt.show()

6.2.4 The Continuous Uniform Distribution

X \sim \mathrm{Uniform}[\theta_1,\theta_2] \tag{6.16}

f(x)= \frac{1}{\theta_2-\theta_1} \mathbb{I}_{\{\theta_1 \le x \le \theta_2\}}(x) \tag{6.17}
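
In scipy the continuous uniform is parameterized by loc and scale rather than by the endpoints, i.e. \mathrm{Uniform}[loc, loc+scale]. A minimal sketch with illustrative endpoints \theta_1 = 2 and \theta_2 = 5:

from scipy.stats import uniform

# Uniform[theta1, theta2] corresponds to uniform(loc=theta1, scale=theta2 - theta1)
theta1, theta2 = 2.0, 5.0
rv = uniform(loc=theta1, scale=theta2 - theta1)
print(rv.pdf(3.0), 1 / (theta2 - theta1))        # density is 1/(theta2 - theta1) on the interval
print(rv.mean(), (theta1 + theta2) / 2)          # midpoint of the interval
print(rv.var(), (theta2 - theta1) ** 2 / 12)     # (theta2 - theta1)^2 / 12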

6.3 The Normal, Z, t Distributions

The normal, AKA Gaussian distribution is one of the most important distributions in statistics.

It arises as the limiting distribution of sums (and averages) of random variables, due to the Central Limit Theorem (Section F.1). Because of this property, the normal distribution is often used to model the "errors," or unexplained variation, of individual observations in regression models.

6.3.1 The Standard Normal distribution

The standard normal distribution is given by

\mathcal{Z} \sim \mathcal{N}[0,1] \tag{6.18}

f(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^2}{2}} \tag{6.19}

\mathbb{E}[\mathcal{Z}] = 0 \tag{6.20}

\mathbb{V}ar[\mathcal{Z}]= 1 \tag{6.21}

6.3.2 The Normal distribution

Now consider X = \sigma \mathcal{Z}+\mu where \sigma > 0 and \mu is any real constant. Then \mathbb{E}(X) = \mathbb{E}(\sigma \mathcal{Z}+\mu) = \sigma \mathbb{E}(\mathcal{Z}) + \mu = \sigma \times 0 + \mu = \mu and \mathbb{V}ar(X) = \mathbb{V}ar(\sigma \mathcal{Z} + \mu) = \sigma^2 \mathbb{V}ar(\mathcal{Z}) + 0 = \sigma^2 \cdot 1 = \sigma^2.

Then, X follows a normal distribution with mean \mu and variance \sigma^2 (standard deviation \sigma) denoted as

X \sim \mathcal{N}[\mu,\sigma^2] \tag{6.22}

f(x\mid \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2} \tag{6.23}

\mathbb{E}[X]= \mu \tag{6.24}

\mathbb{V}ar[X]= \sigma^2 \tag{6.25}

  • The normal distribution is symmetric about the mean \mu and is often described as a bell-shaped curve.
  • Although X can take on any real value (positive or negative), more than 99% of the probability mass is concentrated within three standard deviations of the mean.

The normal distribution has several desirable properties.

One is that if X_1 \sim \mathcal{N}(\mu_1, \sigma^2_1) and X_2 \sim \mathcal{N}(\mu_2, \sigma^2_2) are independent, then X_1+X_2 \sim \mathcal{N}(\mu_1+\mu_2, \sigma^2_1+\sigma^2_2).

Consequently, if we take the average of n Independent and Identically Distributed (IID) normal random variables we have

\bar X = \frac{1}{n}\sum_{i=1}^n X_i \sim \mathcal{N}(\mu, \frac{\sigma^2}{n}) \tag{6.26}
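
A small simulation can confirm this; a minimal sketch in which \mu, \sigma, and n are purely illustrative:

import numpy as np
from scipy.stats import norm

# draw many samples of size n from N(mu, sigma^2) and look at their sample means
mu, sigma, n = 2.0, 3.0, 25
rng = np.random.default_rng(0)
xbar = norm.rvs(loc=mu, scale=sigma, size=(100_000, n), random_state=rng).mean(axis=1)
print(xbar.mean(), mu)             # close to mu
print(xbar.var(), sigma**2 / n)    # close to sigma^2 / n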

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)

mean, var, skew, kurt = norm.stats(moments='mvsk')
print(f'{mean=:1.2f}, {var=:1.2f}, {skew=:1.2f}, {kurt=:1.2f}')
mean=0.00, var=1.00, skew=0.00, kurt=0.00
x = np.linspace(norm.ppf(0.01),
                norm.ppf(0.99), 100)
ax.plot(x, norm.pdf(x),
       'r-', lw=5, alpha=0.6, label='norm pdf')

rv = norm()
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')
r = norm.rvs(size=1000)

ax.hist(r, density=True, bins='auto', histtype='stepfilled', alpha=0.2)
ax.set_xlim([x[0], x[-1]])
ax.legend(loc='best', frameon=False)
plt.show()

6.3.3 The t-Distribution

If we have normal data, we can use (Equation C.32) to help us estimate the mean \mu. Reversing the transformation from the previous section, we get

\frac {\bar X - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) \tag{6.27}

However, we may not know the value of \sigma. If we estimate it from the data, we can replace it with the sample standard deviation S = \sqrt{\sum_i \frac{(X_i-\bar X)^2}{n-1}}. This causes the expression (Equation C.33) to no longer follow a standard normal distribution, but a t distribution with \nu = n - 1 degrees of freedom.

X \sim t[\nu] \tag{6.28}

f(t\mid\nu) = \frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{\nu}{2})\sqrt{\nu\pi}}\left (1 + \frac{t^2}{\nu}\right)^{-(\frac{\nu+1}{2})}\mathbb{I}_{t\in\mathbb{R}} \qquad \text{(PDF)} \tag{6.29}

\text{where }\Gamma(w)=\int_{0}^{\infty}t^{w-1}e^{-t}\mathrm{d}t \text{ is the gamma function}

f(t\mid\nu)={\frac {1}{{\sqrt {\nu }}\,\mathrm {B} ({\frac {1}{2}},{\frac {\nu }{2}})}}\left(1+{\frac {t^{2}}{\nu }}\right)^{-(\nu +1)/2}\mathbb{I}_{t\in\mathbb{R}} \qquad \text{(PDF)} \tag{6.30}

\text{where } B(u,v)=\int_{0}^{1}t^{u-1}(1-t)^{v-1}\mathrm{d}t \text{ is the beta function}

\mathbb{E}[X] = 0 \qquad \text{ if } \nu > 1 \tag{6.31}

\mathbb{V}ar[X] = \frac{\nu}{\nu - 2} \qquad \text{ if } \nu > 2 \tag{6.32}

The t distribution is symmetric and resembles the Normal Distribution but with thicker tails. As the degrees of freedom increase, the t distribution looks more and more like the standard normal distribution.
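
A quick look at the tail probabilities illustrates both points; a minimal sketch in which the cut-off 3 and the degrees of freedom are purely illustrative:

from scipy.stats import t, norm

# P(T > 3) exceeds P(Z > 3) but shrinks toward it as the degrees of freedom grow
for nu in (1, 5, 30, 100):
    print(nu, t.sf(3, nu), norm.sf(3))

# the variance matches nu / (nu - 2) when it exists
print(t.var(5), 5 / (5 - 2))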

6.4 The Exponential Distribution

The Exponential distribution models the waiting time between events that occur at a rate \lambda; such events typically come from a Poisson process.

Indeed, if the waiting times between successive events are independent and \mathrm{Exp}(\lambda) distributed, then for any fixed time window of length t, the number of events occurring in that window follows a Poisson distribution with mean t\lambda (a simulation sketch appears at the end of this section).

X \sim \mathrm{Exp}(\lambda) \tag{6.33}

f(x \mid \lambda) = \lambda e^{-\lambda x}\, \mathbb{I}_{\lambda\in\mathbb{R}^+ }\, \mathbb{I}_{x\in\mathbb{R}^+_0 }(x) \quad \text{(PDF)} \tag{6.34}

\mathbb{E}[X]= \frac{1}{\lambda} \tag{6.35}

\mathbb{V}ar[X]= \frac{1}{\lambda^2} \tag{6.36}

import numpy as np
from scipy.stats import expon
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)

mean, var, skew, kurt = expon.stats(moments='mvsk')
print(f'{mean=:1.2f}, {var=:1.2f}, {skew=:1.2f}, {kurt=:1.2f}')
mean=1.00, var=1.00, skew=2.00, kurt=6.00
x = np.linspace(expon.ppf(0.01), expon.ppf(0.99), 100)
ax.plot(x, expon.pdf(x), 'r-', lw=5, alpha=0.6, label='expon pdf')

rv = expon()
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

r = expon.rvs(size=1000)

ax.hist(r, density=True, bins='auto', histtype='stepfilled', alpha=0.2)
ax.set_xlim([x[0], x[-1]])
ax.legend(loc='best', frameon=False)
plt.show()
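
The link to the Poisson distribution described above can be checked by simulation: sum exponential inter-arrival times and count how many events land in a window of length t. This is a minimal sketch; the rate \lambda, the window length, and the number of replications are purely illustrative, and note that scipy's expon takes scale = 1/\lambda.

import numpy as np
from scipy.stats import expon, poisson

lam, t_window, reps = 2.0, 5.0, 10_000
rng = np.random.default_rng(1)
counts = np.empty(reps, dtype=int)
for i in range(reps):
    # 50 waits are (more than) enough to cover a window of length 5 at rate 2
    waits = expon.rvs(scale=1 / lam, size=50, random_state=rng)
    counts[i] = np.searchsorted(np.cumsum(waits), t_window)   # events before time t_window
print(counts.mean(), lam * t_window)                          # close to t * lambda
print(counts.var(), poisson(lam * t_window).var())            # also close to t * lambda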

7 Additional Discrete Distributions

7.1 The Geometric Distribution

The Geometric distribution arises when we ask "How many Bernoulli trials are required to get the first success?", i.e., it models the number of trials until a success is observed, such as the number of coin flips until the first head. It takes values on the positive integers starting with one (since at least one trial is needed to observe a success).

X \sim \mathrm{Geo}(p) \tag{7.1}

\mathbb{P}r(X = x\mid p) = p(1-p)^{x-1} \qquad \forall x \in \mathbb{N};\quad 0\le p \le 1 \tag{7.2}

\mathbb{E}[X] = \frac{1}{p} \tag{7.3}

\mathbb{V}ar[X]=\frac{1-p}{p^2} \tag{7.4}

\mathbb{M}_X(t) = \frac{pe^t}{1-(1-p)e^t} \qquad t < -\log(1-p) \tag{7.5}
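
scipy's geom uses the same support \{1, 2, 3, \ldots\} as above, so its pmf and moments can be checked directly; a minimal sketch with an illustrative p:

from scipy.stats import geom

p = 0.25
print(geom.pmf(3, p), p * (1 - p) ** 2)   # P(X = 3) = p(1-p)^{x-1}
print(geom.mean(p), 1 / p)                # 1/p
print(geom.var(p), (1 - p) / p ** 2)      # (1-p)/p^2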

7.2 The Multinomial Distribution

Another generalization of the Bernoulli distribution and the Binomial distribution is the Multinomial distribution, which counts the outcomes of independent trials when there are k different possible outcomes. Suppose we have n trials and there are k different possible outcomes that occur with probabilities p_1, \ldots, p_k. For example, if we roll a six-sided die that might be loaded so that the sides are not equally likely, then n is the total number of rolls, k = 6, p_1 is the probability of rolling a one, and x_1, \ldots, x_6 count the number of times we observe each of the faces one through six, where

X \sim \mathrm{Multinomial}(n, p_1,\ldots,p_k)

\mathbb{P}r(X = x \mid n, p_1,\ldots,p_k) = \frac{n!}{x_1! \cdots x_k! } \prod_{i=1}^k p_i^{x_i}
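
Continuing the loaded-die example, a minimal sketch with scipy's multinomial; the probability vector and the outcome counts are purely illustrative:

import numpy as np
from scipy.stats import multinomial

# a loaded six-sided die rolled n times
p = [0.1, 0.1, 0.1, 0.1, 0.2, 0.4]
n = 12
rv = multinomial(n, p)

x = [1, 1, 2, 2, 2, 4]                    # one possible vector of face counts, summing to n
print(rv.pmf(x))                           # probability of observing exactly these counts
print(rv.rvs(size=3, random_state=np.random.default_rng(2)))   # a few simulated count vectors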

7.3 The Poisson Distribution

The Poisson distribution arises when modeling count data. The parameter \lambda > 0 is the rate at which we expect to observe the thing we are counting. We write this as X \sim \mathrm{Poisson}(\lambda)

\mathbb{P}r(X = x \mid \lambda) = \frac{\lambda^x e^{−\lambda}}{x!} \qquad \forall x \in \mathbb{N}_0 \qquad \text{PDF} \tag{7.6}

\mathbb{E}[X] = \lambda \qquad \text{Expectation} \tag{7.7}

\mathbb{V}ar[X] = \lambda \qquad \text{Variance} \tag{7.8}

\mathbb{M}_X(t) = \exp[\lambda(e^t-1)] \qquad \text{Moment Generating fn.} \tag{7.9}

\mathcal{I}(\lambda) = \frac{1}{\lambda} \tag{7.10}
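
A minimal sketch with scipy's poisson; the rate \lambda is purely illustrative:

import numpy as np
from scipy.stats import poisson

lam = 4.0
print(poisson.mean(lam), poisson.var(lam))   # expectation and variance both equal lambda
k = np.arange(10)
print(poisson.pmf(k, lam).round(3))          # lambda^k e^{-lambda} / k!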

7.3.1 Relations

Figure 7.1: Relations of the Poisson distribution

A Poisson process is a process wherein events occur on average at rate \lambda, events occur one at a time, and events occur independently of each other.

Figure 7.2: Siméon Denis Poisson
Tip: Biographical Note on Siméon Denis Poisson

The Poisson distribution is due to Baron Siméon Denis Poisson (1781-1840), see (Poisson 2019, 205–7), a French mathematician and physicist who worked on statistics, complex analysis, partial differential equations, the calculus of variations, analytical mechanics, electricity and magnetism, thermodynamics, elasticity, and fluid mechanics.

for a fuller biography see

7.4 Hypergeometric Distribution

Consider an urn with a white balls and b black balls. Draw N balls from this urn without replacement. The number of white balls drawn, n, is hypergeometrically distributed.

X \sim \mathrm{Hypergeometric}(N,a,b)

\mathrm{Hypergeometric}(n\mid N,a,b) = \frac{\normalsize{\binom{a}{n} \binom{b}{N - n}}} {\normalsize{\binom{a + b}{N}}} \quad \text{(PDF)} \tag{7.11}

\mathbb{E}[X]=N\frac{a}{a+b} \qquad \text{(expectation)} \tag{7.12}

\mathbb{V}ar[X]=N\,\frac{ab}{(a + b)^2}\,\frac{a+b-N}{a+b-1} \qquad \text{(variance)} \tag{7.13}
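
A minimal sketch with scipy's hypergeom, whose parameterization (M, n, N) maps onto the urn above as M = a + b total balls, n = a white balls, and N draws; the counts used here are purely illustrative:

from scipy.stats import hypergeom

a, b, N = 7, 13, 5
rv = hypergeom(a + b, a, N)
print(rv.pmf(2))                                # P(exactly 2 white balls among the N drawn)
print(rv.mean(), N * a / (a + b))               # matches Equation 7.12
print(rv.var(), N * a * b / (a + b) ** 2 * (a + b - N) / (a + b - 1))   # matches Equation 7.13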