Random variable
A random variable is a variable whose value depends on the outcome of a random event. In probability theory a random variable is understood as a measurable function defined on a probability space $(\Omega, \mathcal{F}, P)$. A random variable maps from the sample space $\Omega$ to a measurable space $E$ (often the real numbers).
Probability mass function
The probability mass function (PMF), also known as the discrete density function, is a function that gives the probability that a discrete random variable is exactly equal to some value. It differs from the probability density function in that it is associated with discrete rather than continuous random variables.
Probability density function
The probability density function (PDF) must be integrated over an interval to yield a probability:

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$

In the continuous case the probability of any single point is always 0, $P(X = x) = 0$, which is why we need to evaluate the PDF over an interval instead.
Stochastic process
A stochastic process is a random process that is usually defined as a family of random variables $\{X_t\}_{t \in T}$. Each random variable takes values in the same mathematical space, known as the state space $S$. There are two types of stochastic processes: discrete-time and continuous-time. Examples of stochastic processes include the Bernoulli process [1] and the random walk, among others. The Bernoulli process can be viewed as flipping a coin multiple times, where the sequence of flips forms a set of independent and identically distributed (i.i.d.) Bernoulli random variables.
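A Bernoulli process and its associated random walk can be sketched as follows (the function name and parameters are illustrative, not from the text):

```python
import random

def bernoulli_process(p, n, seed=0):
    """Simulate n i.i.d. Bernoulli(p) trials, e.g. repeated coin flips."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

flips = bernoulli_process(p=0.5, n=10)

# A simple random walk is the running sum of +/-1 steps derived from the flips.
walk = []
position = 0
for f in flips:
    position += 1 if f == 1 else -1
    walk.append(position)
```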
Statistical inference
Statistical inference is the process of inferring properties of an underlying probability distribution through data analysis: making logical claims that are justified by the data.
Classical inference
In classical (frequentist) inference, parameters are fixed, non-random quantities, and probability statements concern only the data. For a frequentist, the probability of an event is the long-run proportion of trials in which that event occurs.
Bayesian inference
Bayesian inference is a method used to update the probability of a model using Bayes' theorem

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

Contrary to how classical inference works, Bayesian inference takes the uncertainty of the parameters into account when building the model: the parameters themselves are random variables. The Bayesian approach bases its decisions on prior knowledge.
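As a minimal sketch of such an update (the coin example and the Beta prior are illustrative assumptions, not from the text): with a conjugate Beta prior on a coin's heads-probability, Bayes' theorem reduces to simple counting.

```python
def beta_update(alpha, beta, data):
    """Conjugate Bayesian update: a Beta(alpha, beta) prior over a coin's
    heads-probability combined with 0/1 observations yields a Beta posterior."""
    heads = sum(data)
    tails = len(data) - heads
    return alpha + heads, beta + tails

# Uniform prior Beta(1, 1), then observe 7 heads and 3 tails.
a, b = beta_update(1, 1, [1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
posterior_mean = a / (a + b)  # (1 + 7) / (2 + 10) = 2/3
```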
Kolmogorov axioms
The Kolmogorov axioms consist of three axioms that form the foundation of probability theory.
First axiom
The probability of an event is always non-negative:

$$P(E) \ge 0 \quad \text{for all } E \in \mathcal{F}$$

where $\mathcal{F}$ is the event space.
Second axiom
The probability that at least one of the outcomes in the sample space occurs is 1:

$$P(\Omega) = 1$$

where $\Omega$ is the sample space.
Third axiom
Any countable sequence of mutually exclusive (disjoint) events $E_1, E_2, \ldots$ satisfies

$$P\!\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$$
Conditional probability
The conditional probability of event $A$ given that event $B$ has occurred is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$
Independent events
Two events $A$ and $B$ are independent if

$$P(A \cap B) = P(A)\,P(B)$$

Thus the following holds for the conditional probability of independent events:

$$P(A \mid B) = P(A)$$
Law of total probability
Given an event $A$, what is its probability taking every event $B_i$ into account? The law of total probability states that if we have a sequence of events $B_1, B_2, \ldots$ that partitions the sample space, the following holds

$$P(A) = \sum_i P(A \mid B_i)\, P(B_i)$$
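A small numerical illustration (the partition and defect-rate numbers are made up for the example): three events $B_i$ partition the sample space, and $P(A)$ is the weighted sum of the conditional probabilities.

```python
# P(B_i) for a partition of the sample space (hypothetical numbers,
# e.g. the share of production coming from each of three factories).
p_B = [0.5, 0.3, 0.2]
# P(A | B_i), e.g. the defect rate of each factory.
p_A_given_B = [0.01, 0.02, 0.05]

# Law of total probability: P(A) = sum_i P(A | B_i) * P(B_i)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
# 0.5*0.01 + 0.3*0.02 + 0.2*0.05 = 0.021
```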
Joint distributions
The joint distribution of multiple random variables defined on the same probability space is a probability distribution that gives the probability that each random variable falls into a particular set of values.
It can be written in terms of conditional probabilities using the chain rule:

$$P(X, Y) = P(X \mid Y)\, P(Y)$$
Chain rule
The chain rule of probability can be illustrated by the following example:

$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2)$$
Expectation
Expectation is the probability-weighted average of all possible values of a random variable: the long-run average outcome of a distribution (not necessarily its most common outcome, which is the mode).
Discrete

$$E[X] = \sum_x x\, p_X(x)$$

Continuous

$$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx$$

Conditional discrete

$$E[X \mid Y = y] = \sum_x x\, p_{X \mid Y}(x \mid y)$$

Conditional continuous

$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X \mid Y}(x \mid y)\,dx$$

Law of total expectation, discrete

$$E[X] = \sum_y E[X \mid Y = y]\, P(Y = y)$$

Law of total expectation, continuous

$$E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y]\, f_Y(y)\,dy$$

In both the discrete and the continuous case this can be written as

$$E[X] = E\big[E[X \mid Y]\big]$$
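The law of total expectation, $E[X] = E[E[X \mid Y]]$, can be checked directly on a small discrete joint distribution (the probabilities below are arbitrary illustration values):

```python
# Joint distribution p[(x, y)] = P(X = x, Y = y) over two binary variables.
p = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.4}

# Direct expectation E[X].
e_x = sum(x * pr for (x, _), pr in p.items())

# E[E[X | Y]]: compute E[X | Y = y] for each y, then average over P(Y = y).
e_total = 0.0
for y in {y for (_, y) in p}:
    p_y = sum(pr for (_, yy), pr in p.items() if yy == y)
    e_x_given_y = sum(x * pr for (x, yy), pr in p.items() if yy == y) / p_y
    e_total += e_x_given_y * p_y
```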
Linearity of expectation
Linearity of expectation is a property stating that the expected value of a sum of random variables equals the sum of their individual expectations, regardless of whether they are independent:

$$E[X + Y] = E[X] + E[Y]$$

More generally, the following holds for constants $a_i$:

$$E\!\left[\sum_{i=1}^{n} a_i X_i\right] = \sum_{i=1}^{n} a_i\, E[X_i]$$
Variance
Variance is defined as

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - E[X]^2$$
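A quick numerical check that the definitional form $E[(X - E[X])^2]$ and the shortcut form $E[X^2] - E[X]^2$ of variance agree (the sample values are illustrative):

```python
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)
mean = sum(xs) / n

# Definitional form: E[(X - E[X])^2]
var_def = sum((x - mean) ** 2 for x in xs) / n

# Shortcut form: E[X^2] - E[X]^2
var_short = sum(x * x for x in xs) / n - mean ** 2
# Both give 4.0 for this sample.
```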
Law of total variance

$$\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Y)] + \mathrm{Var}(E[X \mid Y])$$
Covariance
Covariance is defined as

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\,E[Y]$$

However, the shortcut form $E[XY] - E[X]\,E[Y]$ is susceptible to catastrophic cancellation [2], meaning that subtracting good approximations of two nearby numbers may yield a bad approximation to their difference.
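The cancellation issue can be demonstrated by comparing the shortcut form with the numerically stable two-pass form (the data values are chosen to provoke the failure; 2/3 is the population covariance of the offsets 0, 1, 2):

```python
def cov_naive(xs, ys):
    """Shortcut form E[XY] - E[X]E[Y]; prone to catastrophic cancellation
    when the means are large relative to the spread of the data."""
    n = len(xs)
    e_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return e_xy - (sum(xs) / n) * (sum(ys) / n)

def cov_two_pass(xs, ys):
    """Definitional form E[(X - E[X])(Y - E[Y])]; numerically stable."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# A huge common offset: the true covariance is 2/3, but the naive form
# subtracts two nearly equal numbers around 1e18 and loses the answer.
xs = [1e9 + v for v in (0.0, 1.0, 2.0)]
ys = [1e9 + v for v in (0.0, 1.0, 2.0)]
stable = cov_two_pass(xs, ys)   # ~0.6667, correct
unstable = cov_naive(xs, ys)    # wildly off (0.0 on IEEE-754 doubles)
```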
Correlation
Correlation is defined as

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\, \sigma_Y}$$

where $\rho_{X,Y}$ is the correlation coefficient and $\sigma_X$, $\sigma_Y$ represent the standard deviations of $X$ and $Y$.
Order statistics
The kth order statistic of a statistical sample is equal to its kth-smallest value.
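A direct implementation by sorting (the function name is illustrative):

```python
def order_statistic(sample, k):
    """Return the kth order statistic: the kth-smallest value (1-indexed)."""
    if not 1 <= k <= len(sample):
        raise ValueError("k must be between 1 and len(sample)")
    return sorted(sample)[k - 1]

data = [9, 2, 7, 4, 5]
smallest = order_statistic(data, 1)  # 2, the minimum (1st order statistic)
largest = order_statistic(data, 5)   # 9, the maximum (nth order statistic)
median = order_statistic(data, 3)    # 5, the median of five values
```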
Hoeffding's inequality
https://en.wikipedia.org/wiki/Hoeffding%27s_inequality
Hoeffding's inequality states an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a given amount. If $X_1, \ldots, X_n$ are independent with $a_i \le X_i \le b_i$ and $S_n = \sum_{i=1}^n X_i$, then for all $t > 0$

$$P\big(S_n - E[S_n] \ge t\big) \le \exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right)$$
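An empirical sanity check of the bound for $n$ fair coin flips, where each $X_i \in [0, 1]$ so the denominator in the exponent is $n$ (the concrete numbers are chosen for illustration):

```python
import math
import random

rng = random.Random(42)
n, t, trials = 100, 10, 20000

# Count how often the sum of n fair coin flips exceeds its mean n/2 by t.
exceed = 0
for _ in range(trials):
    s = sum(1 for _ in range(n) if rng.random() < 0.5)
    if s - n / 2 >= t:
        exceed += 1

empirical = exceed / trials
# Hoeffding bound with a_i = 0, b_i = 1: exp(-2 t^2 / n) = exp(-2) ~ 0.135
bound = math.exp(-2 * t * t / n)
```

The observed tail frequency (around a few percent for these numbers) stays below the bound, as the inequality guarantees.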
Boole's inequality
https://en.wikipedia.org/wiki/Boole%27s_inequality
Boole's inequality is also known as the union bound. It states that for any finite or countable set of events, the probability that at least one of them happens is no greater than the sum of the probabilities of the individual events:

$$P\!\left(\bigcup_i A_i\right) \le \sum_i P(A_i)$$
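An empirical check of the union bound with three overlapping events defined on a uniform draw (the event intervals are arbitrary illustration choices):

```python
import random

rng = random.Random(0)
trials = 10000

union_count = 0
individual_counts = [0, 0, 0]
for _ in range(trials):
    u = rng.random()
    # Three overlapping events on a uniform draw u in [0, 1).
    events = [u < 0.3, 0.2 < u < 0.5, u > 0.8]
    union_count += any(events)
    for i, hit in enumerate(events):
        individual_counts[i] += hit

p_union = union_count / trials            # ~0.7 for these events
sum_p = sum(c / trials for c in individual_counts)  # ~0.8
```

Because the events overlap, the sum of individual probabilities overcounts, so the union bound holds with room to spare.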