Markov chains

    A Markov chain is a sequence of random variables $X_0, X_1, X_2, \ldots$ with the following property:

    $P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$

    for all states $i_0, \ldots, i_{n-1}, i, j$ in the state space of the Markov chain. The state space is a discrete set.

    Stochastic matrix

    A stochastic matrix is a square matrix, $P$, that satisfies

    1. $P_{ij} \geq 0$ for all $i, j$
    2. $\sum_j P_{ij} = 1$ for each row $i$

    N-step transition matrix

    Let $X_0, X_1, \ldots$ be a Markov chain with transition matrix $P$. Then $P^n$ is the n-step transition matrix, and we can calculate the probability that we go from state $i$ to state $j$ in $n$ steps as

    $P(X_n = j \mid X_0 = i) = (P^n)_{ij}$
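
    As a quick numerical sketch (the two-state matrix below is a hypothetical example, chosen only for illustration), the n-step probabilities can be read off a matrix power:

        import numpy as np

        # Hypothetical two-state transition matrix (rows sum to 1).
        P = np.array([[0.9, 0.1],
                      [0.5, 0.5]])

        # 3-step transition matrix.
        P3 = np.linalg.matrix_power(P, 3)

        # Probability of going from state 0 to state 1 in 3 steps.
        print(P3[0, 1])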

    Distribution of Markov chains

    The sequence of random variables in a Markov chain is generally not identically distributed. If our Markov chain has the transition matrix $P$ and the initial distribution $\alpha$, the distribution for $X_n$ is $\alpha P^n$.

    Namely,

    $P(X_n = j) = (\alpha P^n)_j = \sum_i \alpha_i (P^n)_{ij}$

    Markov property

    The Markov property states that the past and the future are independent given the present. The present here can be viewed as the most recent past. Let $X_0, X_1, \ldots$ be a Markov chain. Then

    $P(X_{n+1} = j \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1}, X_n = i) = P(X_{n+1} = j \mid X_n = i)$

    for all $n \geq 0$ and all states $i_0, \ldots, i_{n-1}, i, j$.

    Joint distribution

    The marginal distributions of a Markov chain are determined by the initial distribution $\alpha$ and the transition matrix $P$. Consider, for example, the joint probability

    $P(X_5 = i, X_6 = j, X_9 = k)$

    The event is then moving to $i$ in five steps, then to $j$ in one step, and then to $k$ in three steps. The resulting probability is calculated with

    $P(X_5 = i, X_6 = j, X_9 = k) = (\alpha P^5)_i \, P_{ij} \, (P^3)_{jk}$

    This is obtained by combining the Markov property with conditional probability and time-homogeneity. Formally, let $X_0, X_1, \ldots$ be a Markov chain with transition matrix $P$ and initial distribution $\alpha$. Then for all $0 \leq n_1 < n_2 < \cdots < n_m$ and states $i_1, \ldots, i_m$,

    $P(X_{n_1} = i_1, \ldots, X_{n_m} = i_m) = (\alpha P^{n_1})_{i_1} (P^{n_2 - n_1})_{i_1 i_2} \cdots (P^{n_m - n_{m-1}})_{i_{m-1} i_m}$
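
    A minimal numpy sketch of this computation (reusing the hypothetical matrix $P$ from above and an assumed initial distribution $\alpha$):

        import numpy as np

        P = np.array([[0.9, 0.1],
                      [0.5, 0.5]])
        alpha = np.array([0.5, 0.5])  # assumed initial distribution

        i, j, k = 0, 1, 0
        # P(X_5 = i, X_6 = j, X_9 = k) = (alpha P^5)_i * P_ij * (P^3)_jk
        p = ((alpha @ np.linalg.matrix_power(P, 5))[i]
             * P[i, j]
             * np.linalg.matrix_power(P, 3)[j, k])
        print(p)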

    Stationary distribution

    A stationary distribution is a distribution $\pi$ such that if the distribution over states at step $n$ is $\pi$, then the distribution over states at step $n + 1$ is also $\pi$. That is,

    $\pi P = \pi$

    To find a stationary distribution, note that one of the equations in $\pi P = \pi$ is redundant, and we must use the fact that $\sum_i \pi_i = 1$. Then we are able to obtain the unique solution (when one exists).
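
    One way to sketch this numerically (assuming the same hypothetical $P$ as above): replace one equation of $\pi(P - I) = 0$ with the normalization constraint and solve the linear system.

        import numpy as np

        P = np.array([[0.9, 0.1],
                      [0.5, 0.5]])
        n = P.shape[0]

        # Transpose of (P - I), with the last equation replaced by sum(pi) = 1.
        A = (P - np.eye(n)).T
        A[-1, :] = 1.0
        b = np.zeros(n)
        b[-1] = 1.0

        pi = np.linalg.solve(A, b)
        print(pi)  # [5/6, 1/6] for this particular P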

    Limiting distribution

    A limiting distribution is a distribution $\lambda$ such that no matter what the initial distribution $\alpha$ is, the distribution over states converges to $\lambda$ as the number of steps goes to infinity:

    $\lim_{n \to \infty} (\alpha P^n)_j = \lambda_j$ for all states $j$ and all initial distributions $\alpha$

    When a limiting distribution exists, it is always a stationary distribution. However, the converse is not true: a stationary distribution is not always a limiting distribution. Think of a distribution that is stationary but such that the chain is not certain to converge to it from some other initial distribution.

    Positive matrix

    A matrix $M$ is said to be positive if all the entries of the matrix are positive.

    Regular transition matrix

    A transition matrix $P$ is said to be regular if some power $P^n$ of $P$ is positive.

    Limit theorem for regular Markov chains

    If the transition matrix $P$ is regular, a limiting distribution exists, and it is unique. All of the limiting probabilities are positive.
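
    A quick numerical illustration (again with the hypothetical $P$ from above): for a regular matrix, the rows of $P^n$ all converge to the limiting distribution.

        import numpy as np

        P = np.array([[0.9, 0.1],
                      [0.5, 0.5]])

        # For a regular chain every row of P^n approaches the limiting
        # distribution; here both rows are close to [5/6, 1/6].
        print(np.linalg.matrix_power(P, 50))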

    Communication class

    Two states communicate if each can be reached from the other in some number of steps. If a Markov chain has exactly one communication class, all states communicate with each other. If we have multiple communication classes, one state may not be able to communicate with another state in any number of steps.

    Closed communication class

    A communication class is closed if the chain cannot leave it; for finite chains, the closed classes are exactly those consisting of recurrent states.

    Irreducibility

    A Markov chain is called irreducible if it has exactly one communication class. Thus, if the matrix is regular we know the chain is also irreducible. Every finite irreducible Markov chain has a unique positive stationary distribution; if the chain is aperiodic as well, this distribution is also the limiting distribution.

    Limit theorem for finite irreducible Markov chains

    Let $\mu_j = E(T_j)$ be the expected return time to state $j$, where $T_j = \min\{n \geq 1 : X_n = j\}$. Then $\mu_j < \infty$ and the vector $\pi$ with $\pi_j = 1/\mu_j$ is the unique stationary distribution. All finite regular Markov chains are finite irreducible Markov chains. Furthermore, the long-run fraction of time the chain spends in state $j$ is $\pi_j$ with probability 1.

    Recurrent state

    A recurrent state has the property that a Markov chain starting at this state eventually returns to that state with probability 1.

    Transient state

    A transient state has the property that a Markov chain starting at this state has a positive probability of never returning to this state.

    Periodicity

    The states of a communication class all have the same period. The period of a state $i$ is defined as

    $d(i) = \gcd\{n \geq 1 : (P^n)_{ii} > 0\}$

    Thus, if a Markov chain is irreducible and all states have a period greater than one, the Markov chain is periodic.

    Aperiodic

    When the period is $1$, the state is said to be aperiodic. Thus, if a Markov chain is irreducible and all states have a period equal to one, the Markov chain is aperiodic.

    Ergodic

    A Markov chain is said to be ergodic if it is irreducible, aperiodic, and all states have finite expected return times (all states are positive recurrent). Ergodic Markov chains have positive limiting distributions. That is, let $X_0, X_1, \ldots$ be an ergodic Markov chain. Then there exists a unique positive stationary distribution $\pi$ which is also the limiting distribution for the Markov chain.

    Fundamental limit theorem of ergodic Markov chains

    For an ergodic Markov chain, there exists a unique positive stationary distribution that is also the limiting distribution of the Markov chain.

    Time reversibility

    An irreducible Markov chain is said to be time reversible if

    $\pi_i P_{ij} = \pi_j P_{ji}$ for all states $i, j$

    where $\pi$ is a stationary distribution and $P$ is the transition matrix. The equation above is called the detailed balance condition.

    Absorbing chains

    A Markov chain is called an absorbing chain if it has at least one absorbing state, that is, a state $a$ with $P_{aa} = 1$. When dealing with absorbing Markov chains we usually partition the matrix and write it like

    $P = \begin{pmatrix} Q & R \\ \mathbf{0} & I \end{pmatrix}$

    where, with $t$ transient states and $a$ absorbing states, $Q$ is a $t \times t$ matrix, $R$ is a $t \times a$ matrix, $\mathbf{0}$ is an $a \times t$ matrix full of 0s, and $I$ is an $a \times a$ identity matrix.

    Fundamental matrix

    The fundamental matrix of an absorbing Markov chain is

    $F = (I - Q)^{-1}$

    The entry $F_{ij}$ gives the expected number of visits to transient state $j$ for a chain started in transient state $i$, before absorption.

    Absorption probability

    The probability that the Markov chain is absorbed in absorbing state $j$ when starting in transient state $i$ is given by

    $(FR)_{ij}$

    Absorption time

    The expected number of steps until the Markov chain is absorbed when starting in transient state $i$ is given by the $i$th row sum of the fundamental matrix,

    $(F\mathbf{1})_i = \sum_j F_{ij}$
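
    A small sketch putting these pieces together, using a hypothetical gambler's-ruin chain on states {0, 1, 2, 3} where 0 and 3 are absorbing (the specific chain is an assumption, chosen only to exercise $F$, $FR$, and $F\mathbf{1}$):

        import numpy as np

        # Transient states {1, 2}; absorbing states {0, 3}.
        # With probability 1/2 the gambler moves up or down one unit.
        Q = np.array([[0.0, 0.5],
                      [0.5, 0.0]])   # transitions among transient states
        R = np.array([[0.5, 0.0],
                      [0.0, 0.5]])   # transitions into absorbing states

        F = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix
        print(F)              # expected visits to transient states
        print(F @ R)          # absorption probabilities
        print(F.sum(axis=1))  # expected steps until absorption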

    First hitting time for irreducible chain

    The first hitting time for an irreducible chain is found by modifying the transition matrix so that the state we are interested in becomes an absorbing state. The expected first hitting time is then the absorption time of the modified chain.

    Continuous Markov chains

    Markov Property

    A continuous-time stochastic process with discrete state space, $(X_t)_{t \geq 0}$, is a continuous-time Markov chain if

    $P(X_{t+s} = j \mid X_s = i, X_u = x_u \text{ for } 0 \leq u < s) = P(X_{t+s} = j \mid X_s = i)$

    for all $s, t \geq 0$. If this probability does not depend on $s$, the process is said to be time-homogeneous:

    $P(X_{t+s} = j \mid X_s = i) = P(X_t = j \mid X_0 = i)$ for all $s, t \geq 0$.

    Transition function

    The transition probabilities can be arranged in a matrix function $P(t)$, defined for each $t \geq 0$, that is called the transition function

    $P_{ij}(t) = P(X_t = j \mid X_0 = i)$

    Chapman-Kolmogorov Equations

    For a continuous-time Markov chain with transition function $P(t)$,

    $P(s + t) = P(s) P(t)$

    for all $s, t \geq 0$.

    Holding times

    The holding time $T_i$ at a state $i$ is the length of time that a continuous-time Markov chain stays in $i$ before transitioning to a new state. $T_i$ has an exponential distribution.

    Absorbing state

    For each state $i$, let $q_i$ be the parameter of the exponential distribution for the holding time $T_i$. If $q_i$ is defined to be in the interval $[0, \infty)$, a state $i$ with $q_i = 0$ is said to be an absorbing state. This is because when the process visits state $i$ it never leaves.

    Explosive

    For each state $i$, let $q_i$ be the parameter of the exponential distribution for the holding time $T_i$. If $q_i$ is defined to be in the interval $(0, \infty]$, a state $i$ with $q_i = \infty$ is said to be explosive. This is because the process may visit state $i$ infinitely many times in a finite interval of time.

    Embedded chain

    The embedded chain in a continuous-time Markov chain is the discrete-time Markov chain with transition probabilities $\widetilde{P}_{ij}$, the probability that the process jumps to state $j$ when it leaves state $i$. The transition matrix $\widetilde{P}$ for the embedded chain is a stochastic matrix with diagonal entries 0.

    Transition rates

    The quantities $q_{ij} = q_i \widetilde{P}_{ij}$ are called the transition rates of a continuous-time process. From the transition rates we may recover the embedded-chain transition probabilities and the holding-time parameters:

    $q_i = \sum_{j \neq i} q_{ij}$ and $\widetilde{P}_{ij} = q_{ij} / q_i$
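
    A minimal sketch of these relationships (the 3-state rate matrix is a hypothetical example):

        import numpy as np

        # Hypothetical transition rates q_ij (diagonal unused).
        rates = np.array([[0.0, 2.0, 1.0],
                          [1.0, 0.0, 3.0],
                          [2.0, 2.0, 0.0]])

        q = rates.sum(axis=1)          # holding-time parameters q_i
        P_tilde = rates / q[:, None]   # embedded-chain transition matrix
        print(q)
        print(P_tilde)                 # rows sum to 1, diagonal is 0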

    Absorbing chain

    In a continuous-time absorbing Markov chain we write the generator matrix $Q$ in the following form

    $Q = \begin{pmatrix} V & * \\ \mathbf{0} & \mathbf{0} \end{pmatrix}$

    where $V$ is the $t \times t$ matrix of transition rates between the $t$ transient states.

    Fundamental matrix

    The fundamental matrix for a continuous-time Markov chain is defined as

    $F = (-V)^{-1}$

    Mean time until absorption

    The mean time until absorption for a chain that started in transient state $i$ is the $i$th row sum of $F$,

    $(F\mathbf{1})_i = \sum_j F_{ij}$

    Stationary distribution with generator matrix

    A continuous-time Markov chain with generator matrix $Q$ has stationary distribution $\pi$ if and only if

    $\pi Q = \mathbf{0}$

    To compute this we need to use the fact that $\sum_i \pi_i = 1$. One of the equations in $\pi Q = \mathbf{0}$ is therefore redundant and we may remove whichever one we like.
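
    The same linear-algebra sketch as in the discrete case works here (the generator below is an assumed example built from the rates above, with diagonal $-q_i$ so rows sum to 0):

        import numpy as np

        # Generator: off-diagonal entries are rates q_ij,
        # diagonal entries are -q_i so that rows sum to 0.
        Q = np.array([[-3.0,  2.0,  1.0],
                      [ 1.0, -4.0,  3.0],
                      [ 2.0,  2.0, -4.0]])
        n = Q.shape[0]

        A = Q.T.copy()
        A[-1, :] = 1.0   # replace one redundant equation with sum(pi) = 1
        b = np.zeros(n)
        b[-1] = 1.0
        print(np.linalg.solve(A, b))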

    Global balance

    Let $\pi$ be a stationary distribution of a continuous-time Markov chain. From $\pi Q = \mathbf{0}$ we get

    $q_i \pi_i = \sum_{j \neq i} \pi_j q_{ji}$ for all $i$

    These are called the global balance equations. They say that the transition rates into and out of any state are the same when stationary.

    Time reversibility

    A continuous-time Markov chain with generator matrix $Q$ and a unique stationary distribution $\pi$ is time reversible if

    $\pi_i q_{ij} = \pi_j q_{ji}$ for all $i \neq j$

    These are called the local balance, or detailed balance, equations, and they state that the long-term transition rate from $i$ to $j$ is equal to the long-term transition rate from $j$ to $i$.

    Little's formula

    In a queueing system we can describe the long-term properties by the following formula

    $L = \lambda W$

    where $L$ is the long-term average number of customers in the system, $\lambda$ is the rate of arrivals, and $W$ is the long-term average time that a customer spends in the system. For example, if customers arrive at rate $\lambda = 2$ per minute and spend on average $W = 3$ minutes in the system, then the system holds $L = 6$ customers on average.

    Branching process

    In a branching process all nonzero states are transient; state 0 (extinction) is absorbing.

    Mean generation size

    In a branching process the size of the nth generation is the sum of the offspring of the individuals in the previous generation,

    $Z_n = \sum_{i=1}^{Z_{n-1}} X_i$

    where the $X_i$ are independent and identically distributed offspring counts with mean $\mu$, so that $E(Z_n) = \mu^n$.

    The long-term mean generation size can be divided into three cases:

    1. $\mu < 1$ (subcritical): $E(Z_n) = \mu^n \to 0$
    2. $\mu = 1$ (critical): $E(Z_n) = 1$ for all $n$
    3. $\mu > 1$ (supercritical): $E(Z_n) = \mu^n \to \infty$

    Variance of the generation size

    By the law of total variance the following holds, where $\sigma^2$ is the offspring variance:

    $\operatorname{Var}(Z_n) = \begin{cases} \sigma^2 \mu^{n-1} \dfrac{\mu^n - 1}{\mu - 1}, & \mu \neq 1 \\ n \sigma^2, & \mu = 1 \end{cases}$

    Probability generating function

    In the discrete case, the probability generating function of the discrete random variable $X$ is

    $G(s) = E(s^X) = \sum_{k=0}^{\infty} s^k P(X = k)$

    We can see that $G(1) = 1$. If we do successive differentiations we obtain

    $G^{(k)}(0) = k! \, P(X = k)$

    This is useful for computing the probability given the generating function

    $P(X = k) = \frac{G^{(k)}(0)}{k!}$

    So if we know the generating function of a distribution we can use this to find out what distribution we have.
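
    As a sketch of this recipe (taking the known pgf $G(s) = e^{\lambda(s-1)}$ of a Poisson distribution as the example), sympy can recover the probabilities by differentiating at 0:

        import sympy as sp

        s, lam = sp.symbols('s lam', positive=True)
        G = sp.exp(lam * (s - 1))   # pgf of a Poisson(lam) random variable

        # P(X = k) = G^(k)(0) / k!
        for k in range(4):
            pk = sp.diff(G, s, k).subs(s, 0) / sp.factorial(k)
            print(k, sp.simplify(pk))  # lam**k * exp(-lam) / k!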

    Sums of independent random variables

    If we let $S = X_1 + \cdots + X_n$, where the $X_i$ are independent, the probability generating function of $S$ is

    $G_S(s) = \prod_{i=1}^{n} G_{X_i}(s)$

    If the $X_i$ are also identically distributed we can simplify:

    $G_S(s) = \left[ G_X(s) \right]^n$

    Moments

    We may find the mean and the variance with the probability generating function:

    $G'(1) = E(X)$ and $G''(1) = E(X(X-1))$

    which gives

    $E(X) = G'(1)$ and $\operatorname{Var}(X) = G''(1) + G'(1) - \left[ G'(1) \right]^2$

    Extinction forever

    We can find the probability that a branching process eventually goes extinct; in the case $\mu \leq 1$ the probability of going extinct is 1. In general we can write the generating function for the nth generation as the n-fold composition

    $G_n(s) = G(G_{n-1}(s))$

    However, this is not as useful in practical terms; it mainly serves to prove that the probability of eventual extinction is the smallest nonnegative root of the equation

    $s = G(s)$
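
    A small numerical sketch (assuming a Poisson(1.5) offspring distribution, so $G(s) = e^{1.5(s-1)}$ and $\mu = 1.5 > 1$): iterating $s \mapsto G(s)$ from $s = 0$ converges to the smallest nonnegative root.

        import math

        lam = 1.5   # assumed offspring mean (supercritical)
        G = lambda s: math.exp(lam * (s - 1))   # Poisson offspring pgf

        s = 0.0
        for _ in range(200):   # fixed-point iteration s -> G(s)
            s = G(s)
        print(s)   # extinction probability, approximately 0.417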

    Markov chain Monte Carlo

    Instead of starting with a Markov chain and learning what happens when the number of steps approaches infinity (the limiting distribution), we now start with a target distribution, the desired limiting distribution, and derive a Markov chain from it. If we collect enough samples of the Markov chain, we have an approximate sample from our target distribution. This is useful because computing the normalizing constant or marginal likelihood is challenging in many models.

    The law of large numbers

    The law of large numbers is central in probability theory. It states that if $X_1, X_2, \ldots$ is a sequence of independent and identically distributed random variables with a common mean $\mu$, then the following holds with probability 1:

    $\lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu$

    Also, if $X$ is a random variable with the same distribution as the sequence and $r$ is a bounded, real-valued function, then the sequence $r(X_1), r(X_2), \ldots$ is also an independent and identically distributed sequence with finite mean, and with probability 1,

    $\lim_{n \to \infty} \frac{r(X_1) + \cdots + r(X_n)}{n} = E(r(X))$

    Strong law of large numbers

    Let $X_0, X_1, \ldots$ be an ergodic Markov chain with stationary distribution $\pi$. Let $X$ be a random variable with distribution $\pi$, and let $r$ be a bounded, real-valued function. Then, with probability 1,

    $\lim_{n \to \infty} \frac{1}{n} \sum_{m=0}^{n-1} r(X_m) = E(r(X))$

    where $E(r(X)) = \sum_j r(j) \pi_j$. When using this in practice, we may ignore the first elements of the sequence before computing the average to improve accuracy. This technique is called burn-in.

    Metropolis-Hastings algorithm

    The Metropolis-Hastings algorithm is one of the most common methods in Markov chain Monte Carlo. It is a method for obtaining a sequence of random samples from a probability distribution from which direct sampling is difficult [1]. The sequence is used to approximate the distribution. Metropolis-Hastings works quite well with multidimensional data, while other methods may be better for single-dimensional distributions. The algorithm constructs a reversible Markov chain whose stationary distribution is $\pi$, where $\pi$ is a discrete target probability distribution. Thus, the goal of the algorithm is to construct a Markov chain $X_0, X_1, \ldots$ with stationary distribution $\pi$ and then simulate it.
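
    A minimal sketch of the algorithm for a discrete target (the unnormalized target weights and the symmetric random-walk proposal are assumptions chosen for illustration; with a symmetric proposal the acceptance probability reduces to $\min(1, \pi_j / \pi_i)$):

        import random

        weights = [1.0, 4.0, 2.0, 3.0]   # unnormalized target pi (assumed)
        n_states = len(weights)

        def step(i):
            # Symmetric random-walk proposal on a cycle of states.
            j = (i + random.choice([-1, 1])) % n_states
            # Accept with probability min(1, pi_j / pi_i); the normalizing
            # constant cancels, which is the point of MCMC.
            if random.random() < min(1.0, weights[j] / weights[i]):
                return j
            return i

        x, counts = 0, [0] * n_states
        for n in range(100_000):
            x = step(x)
            if n >= 1_000:   # burn-in
                counts[x] += 1
        total = sum(counts)
        print([c / total for c in counts])   # approx. [0.1, 0.4, 0.2, 0.3]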

    Poisson process

    A Poisson process is a special type of counting process. Events arrive at specific time instants, starting at time $t = 0$. Then we count the number of arrivals that have occurred by time $t$. With Poisson processes we may focus on (i) the number of events that occur in a fixed time interval, (ii) when events occur, and (iii) the behavior of individual events.

    Counting process

    A counting process is a collection of nonnegative integer-valued random variables $(N_t)_{t \geq 0}$ such that $N_s \leq N_t$ whenever $s \leq t$. Contrary to Markov chains, which operate with a sequence of random variables, a counting process is an uncountable collection indexed over a continuous time interval.

    Definition

    A Poisson process with parameter $\lambda > 0$ is a counting process $(N_t)_{t \geq 0}$ with the following properties

    1. $N_0 = 0$.
    2. $N_t$ has a Poisson distribution with parameter $\lambda t$ for all $t > 0$.
    3. $N_{t+s} - N_s$ has the same distribution as $N_t$ for all $s, t \geq 0$.
    4. $N_t - N_s$ and $N_v - N_u$ are independent random variables for $0 \leq s < t \leq u < v$.

    Stationary increments

    Stationary increments is the third rule in the definition above. The distribution of the number of arrivals in an interval only depends on the length of the interval.

    Independent increments

    Independent increments is the fourth rule in the definition above. The number of arrivals on disjoint intervals are independent random variables.

    First arrival times

    If we let $X_1$ denote the first arrival time, then $X_1 > t$ if and only if there are no arrivals in the interval $(0, t]$. We have

    $P(X_1 > t) = P(N_t = 0) = e^{-\lambda t}$

    We can see that $X_1$ has an exponential distribution with parameter $\lambda$.

    Nth arrival times

    Let $S_n$ be the time of the nth arrival in a Poisson process with parameter $\lambda$. Then $S_n$ has a gamma distribution with parameters $n$ and $\lambda$, with density

    $f_{S_n}(t) = \frac{\lambda^n t^{n-1}}{(n-1)!} e^{-\lambda t}, \quad t > 0$
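
    Since the inter-arrival times are independent Exponential($\lambda$) random variables, a Poisson process can be simulated by accumulating exponential gaps; a minimal sketch (the rate and time horizon are arbitrary choices):

        import numpy as np

        rng = np.random.default_rng(0)
        lam, t_max = 2.0, 10.0   # assumed rate and time horizon

        # Accumulate iid Exponential(lam) inter-arrival times.
        arrivals = []
        t = rng.exponential(1 / lam)
        while t <= t_max:
            arrivals.append(t)
            t += rng.exponential(1 / lam)

        print(len(arrivals))   # N_{t_max}, approximately Poisson(lam * t_max)
        print(arrivals[:3])    # S_1, S_2, S_3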

    Distribution of arrival times

    Let $S_1, S_2, \ldots$ be the arrival times of a Poisson process with parameter $\lambda$. The joint distribution of $S_1, \ldots, S_n$, conditional on $N_t = n$, is the distribution of the order statistics of $n$ independent and identically distributed uniform random variables on $(0, t)$. We have

    $f(s_1, \ldots, s_n \mid N_t = n) = \frac{n!}{t^n}, \quad 0 < s_1 < \cdots < s_n < t$

    That is, if $U_1, \ldots, U_n$ are uniformly distributed random variables that are independent and identically distributed on $(0, t)$, then conditional on $N_t = n$, their order statistics $U_{(1)}, \ldots, U_{(n)}$ have the same distribution as $S_1, \ldots, S_n$.

    Memorylessness

    Memorylessness means that the waiting time distributions are the same for all observers, and all observers will wait, on average, the same amount of time. Formally, a random variable $X$ is memoryless if

    $P(X > s + t \mid X > s) = P(X > t)$

    for all $s, t > 0$.

    Thinning

    A thinned Poisson process is a subprocess of a parent Poisson process, obtained by independently keeping each arrival with some probability. If arrivals of a rate-$\lambda$ process are kept with probability $p$, the thinned process is a Poisson process with rate $p\lambda$, and it is independent of the complementary thinned process of the same parent process.

    Superposition process

    If we have independent Poisson processes $(N^{(1)}_t), \ldots, (N^{(k)}_t)$ with respective parameters $\lambda_1, \ldots, \lambda_k$, then let $N_t = N^{(1)}_t + \cdots + N^{(k)}_t$ for $t \geq 0$. $(N_t)_{t \geq 0}$ is then a Poisson process with parameter $\lambda_1 + \cdots + \lambda_k$.

    Spatial Poisson process

    A spatial Poisson process is a collection of random variables $(N_A)$, indexed by bounded sets $A \subseteq \mathbb{R}^d$, with parameter $\lambda$ if

    1. $N_A$ has a Poisson distribution with parameter $\lambda |A|$ for each bounded set $A$, where $|A|$ is the size (area, volume) of $A$.
    2. $N_A$ and $N_B$ are independent random variables if $A$ and $B$ are disjoint sets.

    Brownian motion

    Brownian motion is a continuous-time stochastic process $(B_t)_{t \geq 0}$ that has the following properties

    1. $B_0 = 0$, and $B_t \sim N(0, t)$ for $t > 0$
    2. $B_{t+s} - B_s \sim N(0, t)$, for $s, t \geq 0$ (stationary increments)
    3. $B_{t+s} - B_s$ is independent of $(B_u)_{0 \leq u \leq s}$, for $s, t \geq 0$ (independent increments)
    4. The function $t \mapsto B_t$ is continuous with probability 1
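
    A standard way to sketch a Brownian path numerically is to cumulatively sum independent $N(0, \Delta t)$ increments (the grid size and horizon below are arbitrary choices):

        import numpy as np

        rng = np.random.default_rng(1)
        t_max, n = 1.0, 1000
        dt = t_max / n

        # Increments B_{t+dt} - B_t are iid N(0, dt); continuity is
        # approximated by linear interpolation between grid points.
        increments = rng.normal(0.0, np.sqrt(dt), size=n)
        B = np.concatenate([[0.0], np.cumsum(increments)])  # B_0 = 0
        print(B[-1])   # B_1, a N(0, 1) sample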

    Martingale

    A stochastic process $(Y_t)_{t \geq 0}$ is a martingale if for all $t \geq 0$

    1. $E(|Y_t|) < \infty$
    2. $E(Y_{t+s} \mid Y_u, 0 \leq u \leq t) = Y_t$ for all $s \geq 0$

    Undirected weighted graphs

    Limiting distribution

    The limiting distribution of a random walk on an undirected weighted graph is given by the edge weights. For a node $v$ it is

    $\pi_v = \frac{w(v)}{\sum_u w(u)}$

    where $w(v) = \sum_u w(v, u)$ is the sum of the weights of the edges at node $v$. Thus, the sum of the weights of the edges at each node divided by the total weight over all nodes.
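
    A small sketch checking this formula against the general stationary-distribution computation (the 3-node weighted graph is a made-up example):

        import numpy as np

        # Symmetric weight matrix of a hypothetical undirected graph.
        W = np.array([[0.0, 2.0, 1.0],
                      [2.0, 0.0, 3.0],
                      [1.0, 3.0, 0.0]])

        w = W.sum(axis=1)   # total edge weight at each node
        pi = w / w.sum()    # pi_v = w(v) / sum_u w(u)
        print(pi)           # [0.25, 5/12, 1/3]

        # Check stationarity against the random-walk transition matrix.
        P = W / w[:, None]
        print(pi @ P)       # equals pi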

    References