Wellford's online algorithm
It can be quite useful to be able to calculate the mean and variance in a single pass whenever you have a program that updates a population at discrete time steps. This is where Wellford's online algorithm comes in! The name online is based on the fact that the algorithm processes the input piece-by-piece without knowledge of the complete input stream from the start. Let's derive the formulas. To make an online algorithm viable, a recurrence relation is required between the quantities one is interested in. In this case we need to find a recurrence relation for the mean and variance. E.g., we want to find a relation for the mean and variance such that the next value () for each of these quantities is based on the previous value ()
where and are the relations we want to find. So we start by some denotations. Given a population with samples we denote the following
Naive approach
Let's start by deriving a recurrence formula for before getting into the naivety of the approach. It all comes down to a little bit of algebraic manipulation
So (1) would look something like this in Python
Python
class RunningMean:
def __init__(self):
self.n = 0
self._mean = 0
def update(self, new_sample):
self.n += 1
self._mean += (new_sample - self._mean) / self.n
def mean(self): return self._mean
The naivety of the approach comes from the fact that it suffers from numerical instability or catastrophic cancellation as can be seen when we derive a recurrence formula for
We want to relate with as we have a recurrence relation and want to capture the transition of the mean shift. Note that
Squaring (3) yields
Putting (4) into the summation of (2) from to , utilizing the fact that as well as which follows from (1) we get the following
Putting (5) into (2) we have the following
Let and . From (1) we have the following relation
Use the substitute of and and plug (7) into (6)
So (9) would look something like this in Python
Python
import math
class RunningVar:
def __init__(self):
self.n = 0
self._mean = 0
self._var = 0
def update(self, new_sample):
self.n += 1
delta1 = new_sample - self._mean
self._mean += delta1 / self.n
delta2 = new_sample - self._mean
self._var += (delta1 * delta2 - self._var) / self.n
def mean(self): return self._mean
def var(self): return self._var
def std(self): return math.sqrt(self.var())
Here is the unbiased sample variance for completeness
Avoiding numerical instability
Formula (9) suffers from numerical instability as it repeatedly subtracts a small value from a big value that scales with . This means that subtracting good approximations to two nearby numbers may yield a very bad approximation to the difference of the original numbers. What Wellford came up with instead is quite clever. If we multiply both sides of (8) with we get the following
Now we define . Note that
So from (10) we get the following recurrence relation
To get the biased sample variance, we simply divide with the population size
To get the unbiased sample variance
So (11) would look something like this in Python
Python
import math
class Wellford:
def __init__(self):
self.n = 0
self._mean = 0
self.m2 = 0
def update(self, new_sample):
self.n += 1
delta1 = new_sample - self._mean
self._mean += delta1 / self.n
delta2 = new_sample - self._mean
self.m2 = delta1 * delta2
def mean(self): return self._mean
def var(self): return self.m2 / self.n # population var
def std(self): return math.sqrt(self.var())
Running the algorithms
Running the algorithms we can see that (9) performs very poorly in practice, while Wellford's algorithm is a good estimate of the real variance.
Python
rm = RunningMean()
rv = RunningVar()
w = Wellford()
n = 10
mean_sum = 0
for i in range(n):
rm.update(i)
rv.update(i)
w.update(i)
mean_sum += i
mean = mean_sum / n
var_sum = 0
for i in range(n):
var_sum = (i - mean)**2
var = var_sum / n
print("Real")
print("mean", mean)
print("var", var)
print("std", math.sqrt(var))
print()
print("RunningMean")
print("mean", rm.mean())
print()
print("RunningVar")
print("mean", rv.mean())
print("var", rv.var())
print("std", rv.std())
print()
print("Wellford")
print("mean", w.mean())
print("var", w.var())
print("std", w.std())
Text
Real
mean 4.5
var 2.025
std 1.4230249470757708
RunningMean
mean 4.5
RunningVar
mean 4.5
var 8.25
std 2.8722813232690143
Wellford
mean 4.5
var 2.25
std 1.5