Notation
Bayes' theorem
Given Bayes' formula

$$ p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)} $$

we define four different names, one for each term: prior, posterior, likelihood and marginal likelihood.
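As a quick numeric illustration of the formula (with hypothetical numbers, not from the text above), consider a diagnostic test with 99% sensitivity and 95% specificity for a condition with 1% prevalence:

```python
# Hypothetical numbers for illustration only.
prior = 0.01                                         # p(theta): prevalence
likelihood = 0.99                                    # p(x | theta): positive test given condition
false_positive_rate = 0.05                           # 1 - specificity
marginal = likelihood * prior + false_positive_rate * (1 - prior)  # p(x)
posterior = likelihood * prior / marginal            # p(theta | x), Bayes' formula

print(round(posterior, 4))                           # → 0.1667
```

Even a quite accurate test yields a modest posterior here, because the prior is small.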
Prior
The prior distribution represents our knowledge about our uncertain quantity (parameters) before some evidence is taken into account.
Posterior
The posterior distribution represents our knowledge about our uncertain quantity (parameters) after some evidence is taken into account.
Likelihood
The likelihood describes how probable the observed data is given some value of the uncertain quantity (parameter). It is a function of the parameters of the chosen statistical model, and it connects the data we are interested in to those parameters.
Marginal likelihood
The marginal likelihood may also be referred to as the evidence. We obtain this distribution by marginalizing theta out of the joint distribution, i.e. integrating over theta. Thus we can write

$$ p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta $$
Once we have updated our prior to a posterior, the formula turns into

$$ p(\tilde{x} \mid x) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid x)\, d\theta $$

where $x$ represents the old data and $\tilde{x}$ the data we want to predict.
The marginal likelihood is generally difficult to compute, except for the small number of likelihood–prior pairs that are conjugate. When this is not the case, we can resort to numerical integration, discretization, or Monte Carlo methods, among others.
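To make the Monte Carlo option concrete, here is a minimal sketch (my own example, not from the text): for Bernoulli data with a Uniform(0, 1) prior on $\theta$, the marginal likelihood has the closed form $\mathrm{B}(k+1,\, n-k+1)$, so we can check the Monte Carlo average of the likelihood over prior draws against it:

```python
import math
import random

random.seed(1)

def marginal_likelihood_mc(k, n, n_samples=200_000):
    """Monte Carlo estimate of p(x) = E_{theta ~ prior}[ p(x | theta) ].

    Assumes k successes in n Bernoulli trials and a Uniform(0, 1) prior.
    """
    total = 0.0
    for _ in range(n_samples):
        theta = random.random()                # draw theta from the prior
        total += theta**k * (1 - theta)**(n - k)
    return total / n_samples

# Exact value via the Beta function: B(k+1, n-k+1) = 1 / ((n+1) * C(n, k)).
exact = 1 / ((5 + 1) * math.comb(5, 3))
approx = marginal_likelihood_mc(3, 5)
```

The estimate converges at the usual $O(1/\sqrt{N})$ Monte Carlo rate, which is why this is viable even when no closed form exists.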
Prior predictive
The prior predictive density is the marginal likelihood using the prior:

$$ p(\tilde{x}) = \int p(\tilde{x} \mid \theta)\, p(\theta)\, d\theta $$
Posterior predictive
The posterior predictive density is the marginal likelihood using the posterior:

$$ p(\tilde{x} \mid x) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid x)\, d\theta $$
Both the prior predictive and the posterior predictive have a simple closed form if we have conjugacy.
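As one concrete closed form (a standard result for the Beta–Bernoulli pair, used here as an illustrative sketch): under a $\mathrm{Beta}(a, b)$ prior, the predictive probability that the next trial succeeds is $a/(a+b)$ before seeing data and $(a+k)/(a+b+n)$ after observing $k$ successes in $n$ trials.

```python
def predictive_success_prob(a, b, successes=0, trials=0):
    """Probability that the next Bernoulli trial succeeds, Beta(a, b) prior.

    With no data this is the prior predictive a / (a + b); after observing
    `successes` in `trials` it becomes the posterior predictive.
    """
    return (a + successes) / (a + b + trials)

prior_pred = predictive_success_prob(2, 2)        # prior predictive: 2 / 4
post_pred = predictive_success_prob(2, 2, 3, 5)   # posterior predictive: 5 / 9
```

No integral needs to be evaluated explicitly: conjugacy has already done it for us.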
Conjugacy
If the posterior and the prior are of the same probability distribution family, we say that we have conjugacy, and the prior and posterior distributions are called conjugate distributions. The prior is then called a conjugate prior for the likelihood function.
Some of the most common conjugacies:
- Beta-Binomial
- Exponential-Gamma
- Multinomial-Dirichlet
- Poisson-Gamma
- Normal-Gamma
- Normal-Normal
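For these pairs the posterior update reduces to simple parameter arithmetic. As a sketch of the Poisson–Gamma case from the list (my own example values): a $\mathrm{Gamma}(\alpha, \beta)$ prior on the Poisson rate, combined with counts $x_1, \dots, x_n$, gives a $\mathrm{Gamma}(\alpha + \sum_i x_i,\; \beta + n)$ posterior.

```python
def poisson_gamma_update(alpha, beta, counts):
    """Conjugate update: Gamma(alpha, beta) prior on a Poisson rate.

    Observing counts x_1..x_n yields a Gamma(alpha + sum(x), beta + n) posterior.
    """
    return alpha + sum(counts), beta + len(counts)

# Illustrative data: three observed counts under a Gamma(2, 1) prior.
alpha_post, beta_post = poisson_gamma_update(2.0, 1.0, [3, 1, 4])
posterior_mean = alpha_post / beta_post            # (2 + 8) / (1 + 3) = 2.5
```

The other conjugacies in the list admit analogous closed-form updates on their parameters.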
Proportionality
When calculating the posterior we can write

$$ p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) $$

where $\propto$ means "proportional to" and expresses that the two sides are identical up to factors not involving $\theta$. This is very useful because, as we have concluded, $p(x)$ can be tricky to compute. The trick works because the posterior always integrates to 1, so no information is lost if we multiply or divide by factors that do not depend on $\theta$. These factors can be reinserted at the end of the proportionality calculations by normalizing, so that the posterior again integrates to 1.
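The normalize-at-the-end step can be sketched on a grid (an illustrative discretization, with a made-up Bernoulli likelihood and flat prior): we evaluate only the unnormalized product $p(x \mid \theta)\, p(\theta)$, dropping all constants, and then divide by the total so the result sums to 1.

```python
# Grid over (0, 1); a flat prior contributes only a constant factor, so we drop it.
thetas = [i / 100 for i in range(1, 100)]
unnorm = [t**3 * (1 - t)**2 for t in thetas]    # likelihood for 3 successes, 2 failures
total = sum(unnorm)                             # plays the role of p(x) on the grid
posterior = [u / total for u in unnorm]         # normalized: sums to 1
```

The constant $p(x)$ was never computed analytically; it reappears implicitly as the normalizing sum.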