Notation
Bayes' theorem
Given Bayes' formula

$$ p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} $$

we define four different names, one for each term: prior, posterior, likelihood and marginal likelihood.
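As a quick numeric illustration of the formula (with hypothetical numbers, not from the text above), consider a diagnostic test with 99% sensitivity and 95% specificity for a condition with 1% prevalence:

```python
# Hypothetical numbers for illustration only.
prior = 0.01                                         # p(theta): prevalence
likelihood = 0.99                                    # p(x | theta): positive test given condition
false_positive_rate = 0.05                           # 1 - specificity
marginal = likelihood * prior + false_positive_rate * (1 - prior)  # p(x)
posterior = likelihood * prior / marginal            # p(theta | x), Bayes' formula

print(round(posterior, 4))                           # → 0.1667
```

Even a quite accurate test yields a modest posterior here, because the prior is small.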
Prior
The prior distribution represents our knowledge about our uncertain quantity (parameters) before some evidence is taken into account.
Posterior
The posterior distribution represents our knowledge about our uncertain quantity (parameters) after some evidence is taken into account.
Likelihood
The likelihood describes how probable the observed data is given some value of the uncertain quantity (parameter). It is a function of the parameters of the chosen statistical model, and it connects the data we are interested in to those parameters.
Marginal likelihood
The marginal likelihood may also be referred to as the evidence. We obtain this distribution by marginalizing theta out of the joint distribution, i.e. integrating over theta. Thus we can write

$$ p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta $$
Once we have updated our prior to a posterior, the formula turns into

$$ p(\tilde{x} \mid x) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid x)\, d\theta $$

where $x$ represents the old data and $\tilde{x}$ the data we want to predict.
The marginal likelihood is generally difficult to compute, except for the small number of likelihood–prior pairs that are conjugate. When this is not the case, we can resort to numerical integration, discretization, or Monte Carlo methods, among others.
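To make the Monte Carlo option concrete, here is a minimal sketch (my own example, not from the text): for Bernoulli data with a Uniform(0, 1) prior on $\theta$, the marginal likelihood has the closed form $\mathrm{B}(k+1,\, n-k+1)$, so we can check the Monte Carlo average of the likelihood over prior draws against it:

```python
import math
import random

random.seed(1)

def marginal_likelihood_mc(k, n, n_samples=200_000):
    """Monte Carlo estimate of p(x) = E_{theta ~ prior}[ p(x | theta) ].

    Assumes k successes in n Bernoulli trials and a Uniform(0, 1) prior.
    """
    total = 0.0
    for _ in range(n_samples):
        theta = random.random()                # draw theta from the prior
        total += theta**k * (1 - theta)**(n - k)
    return total / n_samples

# Exact value via the Beta function: B(k+1, n-k+1) = 1 / ((n+1) * C(n, k)).
exact = 1 / ((5 + 1) * math.comb(5, 3))
approx = marginal_likelihood_mc(3, 5)
```

The estimate converges at the usual $O(1/\sqrt{N})$ Monte Carlo rate, which is why this is viable even when no closed form exists.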
Prior predictive
The prior predictive density is the marginal likelihood using the prior:

$$ p(\tilde{x}) = \int p(\tilde{x} \mid \theta)\, p(\theta)\, d\theta $$
Posterior predictive
The posterior predictive density is the marginal likelihood using the posterior:

$$ p(\tilde{x} \mid x) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid x)\, d\theta $$
Both the prior predictive and the posterior predictive have a simple closed form if we have conjugacy.
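As one concrete closed form (a standard result for the Beta–Bernoulli pair, used here as an illustrative sketch): under a $\mathrm{Beta}(a, b)$ prior, the predictive probability that the next trial succeeds is $a/(a+b)$ before seeing data and $(a+k)/(a+b+n)$ after observing $k$ successes in $n$ trials.

```python
def predictive_success_prob(a, b, successes=0, trials=0):
    """Probability that the next Bernoulli trial succeeds, Beta(a, b) prior.

    With no data this is the prior predictive a / (a + b); after observing
    `successes` in `trials` it becomes the posterior predictive.
    """
    return (a + successes) / (a + b + trials)

prior_pred = predictive_success_prob(2, 2)        # prior predictive: 2 / 4
post_pred = predictive_success_prob(2, 2, 3, 5)   # posterior predictive: 5 / 9
```

No integral needs to be evaluated explicitly: conjugacy has already done it for us.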
Conjugacy
If the posterior and the prior are of the same probability distribution family, we say that we have conjugacy, and the prior and posterior distributions are called conjugate distributions. The prior is then called a conjugate prior for the likelihood function.
Some of the most common conjugacies:
- Beta-Binomial
- Exponential-Gamma
- Multinomial-Dirichlet
- Poisson-Gamma
- Normal-Gamma
- Normal-Normal
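For these pairs the posterior update reduces to simple parameter arithmetic. As a sketch of the Poisson–Gamma case from the list (my own example values): a $\mathrm{Gamma}(\alpha, \beta)$ prior on the Poisson rate, combined with counts $x_1, \dots, x_n$, gives a $\mathrm{Gamma}(\alpha + \sum_i x_i,\; \beta + n)$ posterior.

```python
def poisson_gamma_update(alpha, beta, counts):
    """Conjugate update: Gamma(alpha, beta) prior on a Poisson rate.

    Observing counts x_1..x_n yields a Gamma(alpha + sum(x), beta + n) posterior.
    """
    return alpha + sum(counts), beta + len(counts)

# Illustrative data: three observed counts under a Gamma(2, 1) prior.
alpha_post, beta_post = poisson_gamma_update(2.0, 1.0, [3, 1, 4])
posterior_mean = alpha_post / beta_post            # (2 + 8) / (1 + 3) = 2.5
```

The other conjugacies in the list admit analogous closed-form updates on their parameters.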
Proportionality
When calculating the posterior we can write

$$ p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) $$

where $\propto$ means "proportional to" and expresses that the two sides are identical up to factors not involving $\theta$. This is very useful because, as we have concluded, $p(x)$ can be tricky to compute. The trick works because the posterior always integrates to 1, so no information is lost if we multiply or divide by factors that do not depend on $\theta$. These factors can be reinserted at the end of the proportionality calculations by normalizing, so that the posterior again integrates to 1.
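The normalize-at-the-end step can be sketched on a grid (an illustrative discretization, with a made-up Bernoulli likelihood and flat prior): we evaluate only the unnormalized product $p(x \mid \theta)\, p(\theta)$, dropping all constants, and then divide by the total so the result sums to 1.

```python
# Grid over (0, 1); a flat prior contributes only a constant factor, so we drop it.
thetas = [i / 100 for i in range(1, 100)]
unnorm = [t**3 * (1 - t)**2 for t in thetas]    # likelihood for 3 successes, 2 failures
total = sum(unnorm)                             # plays the role of p(x) on the grid
posterior = [u / total for u in unnorm]         # normalized: sums to 1
```

The constant $p(x)$ was never computed analytically; it reappears implicitly as the normalizing sum.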