Notation

    Definition                                  Description
    $\mathcal{I}$                               Problem instance.
    $\mathcal{A}$                               The action space.
    $\{a : \Delta(a) > 0\}$                     All the arms that contribute to the regret.
    $a$                                         Arm/action.
    $T$                                         Total number of rounds.
    $t$                                         Each round.
    $\mu(a)$                                    The mean reward for the arm $a$.
    $\mu^* = \max_{a \in \mathcal{A}} \mu(a)$   The optimal mean reward.
    $\bar{\mu}_t(a)$                            The average reward of arm $a$ up to round $t$.
    $\Delta(a) = \mu^* - \mu(a)$                Describes how bad arm $a$ is compared to the best arm. Called the gap.
    $R(T)$                                      The regret for an algorithm.
    $r(a)$                                      The confidence radius (fixed sample size $N$).
    $n_t(a)$                                    The number of samples from arm $a$ up to round $t$.
    $r_t(a)$                                    The confidence radius at round $t$.
    $[\mathrm{LCB}_t(a), \mathrm{UCB}_t(a)]$    The confidence interval.
    $H_t$                                       The $t$-history.
    $H$                                         A feasible $t$-history.

    Stochastic bandits

    Multi-armed bandits are a framework for algorithms that make decisions under uncertainty over time. At its core, an algorithm has $K$ possible actions to choose from, called arms, and $T$ rounds. The algorithm chooses an arm each round and receives a reward for that arm. The reward follows a distribution that depends only on the chosen arm. Typically, the algorithm only observes the reward of the chosen arm each round and therefore needs to explore different arms to acquire new information; this creates a tradeoff between exploration and exploitation. There are three types of feedback the algorithm can receive after each round: bandit feedback, partial feedback, and full feedback.

    IID rewards

    A basic model with independent and identically distributed (IID) rewards, called stochastic bandits, is given by the following protocol:

    Given: K arms, T rounds
        for each round t ∈ T
            pick arm a_t
            observe reward r_t ∈ [0, 1] for a_t

    Here we want to maximize the total reward over $T$ rounds. When the algorithm only observes the reward for the arm selected in the current round we call it bandit feedback. We denote the reward distribution for each arm $a$ as $\mathcal{D}_a$. The reward is randomly sampled from $\mathcal{D}_a$ each time arm $a$ is picked. The distribution of each arm is unknown to the algorithm. The rewards are bounded to $[0, 1]$ in each round to ease the calculations. Especially important is the mean reward vector $\mu = (\mu(a) : a \in \mathcal{A})$. We have that $\mu(a) = \mathbb{E}[\mathcal{D}_a]$ and the best mean reward is given by $\mu^* = \max_{a \in \mathcal{A}} \mu(a)$.
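
    As a concrete illustration, here is a minimal simulation of this protocol in Python, assuming Bernoulli reward distributions; the arm means and the uniformly random policy are made-up choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical problem instance: K = 3 Bernoulli arms with means mu(a),
    # which are unknown to the algorithm and used only to generate rewards.
    mu = np.array([0.3, 0.5, 0.7])
    K, T = len(mu), 1000

    total_reward = 0
    for t in range(T):
        a_t = rng.integers(K)           # placeholder policy: pick an arm uniformly at random
        r_t = rng.binomial(1, mu[a_t])  # reward r_t in [0, 1], drawn from the arm's distribution
        total_reward += r_t             # only the chosen arm's reward is observed (bandit feedback)

    print("total reward:", total_reward)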

    Regret

    The regret is a function of $T$ and compares the cumulative mean reward of always playing the optimal arm with the cumulative mean reward of the arms the algorithm actually played, up to round $T$. It is denoted as:

    $$R(T) = \mu^* \cdot T - \sum_{t=1}^{T} \mu(a_t).$$

    We note that $R(T)$ is a random variable, as the arm $a_t$ chosen in round $t$ is randomly sampled. We call it regret as the algorithm "regrets" not knowing the best arm. If we have a regret bound of the form $R(T) \le C \cdot f(T)$, where $f(T)$ does not depend on the mean reward vector $\mu$ and the constant $C$ does not depend on $T$, we call this regret bound instance-dependent if $C$ does depend on $\mu$ and instance-independent otherwise.
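
    As a small worked example with made-up numbers: take $K = 2$ arms with $\mu(a_1) = 0.75$ and $\mu(a_2) = 0.5$, so $\mu^* = 0.75$. An algorithm that plays each arm $T/2$ times collects expected reward $\tfrac{T}{2}(0.75 + 0.5)$, giving

    $$R(T) = 0.75\,T - \tfrac{T}{2}(0.75 + 0.5) = 0.125\,T,$$

    i.e. regret that grows linearly in $T$, whereas the algorithms below aim for regret that grows sublinearly in $T$.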

    Non-adaptive exploration

    There are two different ways we can explore: either based on the history of observed rewards, or in some fixed way. When the exploration is fixed in advance, the exploration phase does not adapt during its execution and is therefore called non-adaptive.

    Uniform exploration

    One way to choose arms is to pick them uniformly, regardless of previous results, and then exploit by playing the arm that empirically performs best. The algorithm has the following structure:

    Exploration:
        try each arm N times
    Selection:
        pick the arm a* with the highest average reward
    Exploitation:
        play a* for the rest of the rounds
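
    A minimal Python sketch of this explore-first scheme, assuming Bernoulli arms; the values of K, T and N here are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T, N = len(mu), 10_000, 200    # N = per-arm exploration budget (arbitrary here)

    # Exploration: try each arm N times.
    sums = np.zeros(K)
    for a in range(K):
        sums[a] = rng.binomial(1, mu[a], size=N).sum()

    # Selection: pick the arm with the highest average reward.
    a_star = int(np.argmax(sums / N))

    # Exploitation: play a_star for the remaining T - N*K rounds.
    total_reward = sums.sum() + rng.binomial(1, mu[a_star], size=T - N * K).sum()
    print("chosen arm:", a_star, "total reward:", total_reward)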

    The parameter $N$ is fixed here, but we will see that we can pick a value dependent on $T$ and $K$ to minimize the regret. The average reward $\bar{\mu}(a)$ should be a good estimate of the true mean reward $\mu(a)$. By utilizing the Hoeffding inequality we can write

    $$\Pr\left[\, |\bar{\mu}(a) - \mu(a)| \le r(a) \,\right] \ge 1 - \frac{2}{T^4},$$

    where we define the confidence radius $r(a) = \sqrt{2 \log T / N}$. A clean event is the event where this inequality holds for all arms simultaneously. A bad event is the complement of the clean event. Assume first that $K = 2$ and that we have a clean event. Let $a^*$ be the best arm. If the algorithm chooses the other arm $a \ne a^*$, it must be because it has a better average reward, $\bar{\mu}(a) > \bar{\mu}(a^*)$. We rearrange this according to the clean event inequality we got from the Hoeffding inequality. Thus, we have:

    $$\mu(a) + r(a) \ge \bar{\mu}(a) > \bar{\mu}(a^*) \ge \mu(a^*) - r(a^*)
      \quad\Longrightarrow\quad
      \mu(a^*) - \mu(a) \le r(a) + r(a^*) = O\!\left(\sqrt{\frac{\log T}{N}}\right).$$

    This means that we have at most $O(\sqrt{\log T / N})$ regret in each round of the exploitation phase. The exploration phase has at most 1 regret each round. To derive an upper bound we note that the first $2N$ rounds (for $K = 2$ arms) are used for exploration and the remaining $T - 2N$ rounds are used for exploitation:

    $$R(T) \le 2N + O\!\left(\sqrt{\frac{\log T}{N}}\right) \cdot (T - 2N) \le 2N + O\!\left(\sqrt{\frac{\log T}{N}} \cdot T\right).$$

    Setting $N = T^{2/3} (\log T)^{1/3}$ we get the following:

    $$R(T) \le O\!\left(T^{2/3} (\log T)^{1/3}\right).$$

    In the case we have a bad event we have the following:

    $$\mathbb{E}[R(T)] \le \mathbb{E}\left[R(T) \mid \text{clean event}\right] + T \cdot \Pr[\text{bad event}]
      \le O\!\left(T^{2/3} (\log T)^{1/3}\right) + T \cdot O\!\left(T^{-4}\right),$$

    so the bad-event contribution is negligible. When we have $K > 2$ arms, the same argument applies, but the exploration phase now takes $NK$ rounds, so we instead get $R(T) \le NK + O\!\left(\sqrt{\log T / N} \cdot T\right)$. We can set $N = (T/K)^{2/3} (\log T)^{1/3}$ to achieve the analogous result $R(T) \le O\!\left(T^{2/3} (K \log T)^{1/3}\right)$.
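
    To see why this choice of $N$ balances the exploration and exploitation terms, a quick check:

    $$N K = \left(\frac{T}{K}\right)^{2/3} (\log T)^{1/3} \cdot K = T^{2/3} (K \log T)^{1/3},
      \qquad
      \sqrt{\frac{2 \log T}{N}} \cdot T = \sqrt{2}\, \left(\frac{K}{T}\right)^{1/3} (\log T)^{1/3} \cdot T = \sqrt{2}\, T^{2/3} (K \log T)^{1/3},$$

    so both terms are of order $T^{2/3} (K \log T)^{1/3}$.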

    Epsilon greedy

    One drawback with this algorithm is that it has poor performance during the exploration phase. The $\epsilon$-greedy algorithm does not have this issue, since it spreads the exploration uniformly over the rounds:

    for each round t ∈ T:
        draw e_t uniformly from [0, 1]
        if e_t <= ε_t:
            explore: pick an arm uniformly at random
        else:
            exploit: pick the arm with the highest average reward so far
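
    A minimal Python sketch of $\epsilon$-greedy, assuming Bernoulli arms and the decaying exploration probability $\epsilon_t \sim t^{-1/3} (K \log t)^{1/3}$ discussed below:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T = len(mu), 10_000

    counts = np.zeros(K)              # number of pulls per arm
    sums = np.zeros(K)                # accumulated reward per arm

    for t in range(1, T + 1):
        # Exploration probability ~ t^(-1/3) (K log t)^(1/3), forced to 1 on the first round.
        eps_t = 1.0 if t == 1 else min(1.0, (K * np.log(t) / t) ** (1 / 3))
        if rng.random() <= eps_t:
            a_t = rng.integers(K)                               # explore: uniform random arm
        else:
            a_t = int(np.argmax(sums / np.maximum(counts, 1)))  # exploit: best empirical arm
        r_t = rng.binomial(1, mu[a_t])
        counts[a_t] += 1
        sums[a_t] += r_t

    print("empirical means:", sums / np.maximum(counts, 1))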

    With the exploration probability $\epsilon_t = t^{-1/3} (K \log t)^{1/3}$ we get the regret bound $\mathbb{E}[R(t)] \le O\!\left(t^{2/3} (K \log t)^{1/3}\right)$ for each round $t$. However, in both of these algorithms the exploration does not depend on the history of the observed rewards. We could do better.

    Adaptive exploration

    We could react directly to the history of rewards to select more suitable candidates for exploration. This is called adaptive exploration. To define this framework we let $n_t(a)$ be the number of samples from arm $a$ in rounds $1, \dots, t$, and we let $\bar{\mu}_t(a)$ be the average reward of arm $a$ so far. Again with the help of the Hoeffding inequality we want to derive

    $$\Pr\left[\, |\bar{\mu}_t(a) - \mu(a)| \le r_t(a) \,\right] \ge 1 - \frac{2}{T^4},$$

    where we define $r_t(a) = \sqrt{2 \log T / n_t(a)}$. Here $r_t(a)$ is called the confidence radius. If $n_t(a)$ were fixed we would have the same scenario as in the uniform case, but $n_t(a)$ is a random variable so it cannot be fixed. The samples from arm $a$ are not completely independent either, because the decision to play $a$ at round $t$ may depend on the previous rewards of $a$. To build a solid argument we introduce something called a reward tape, that is, a $K \times T$ table where each cell in row $a$ is independently sampled from $\mathcal{D}_a$. The $j$th time an arm $a$ is drawn, its reward is taken from the $j$th cell in the arm's reward tape. We let $\bar{v}_j(a)$ be the average reward for arm $a$ over the first $j$ times that it is drawn. By the Hoeffding inequality we have

    $$\forall j: \quad \Pr\left[\, |\bar{v}_j(a) - \mu(a)| \le \sqrt{\frac{2 \log T}{j}} \,\right] \ge 1 - \frac{2}{T^4}.$$

    With the help of Boole's inequality (the union bound) over all arms $a$ and all values of $j$ we get

    $$\Pr\left[\, \forall a\ \forall j: \ |\bar{v}_j(a) - \mu(a)| \le \sqrt{\frac{2 \log T}{j}} \,\right] \ge 1 - \frac{2}{T^2}.$$

    This implies the following, which is the clean event in the derivations below:

    $$\Pr\left[\, \forall a\ \forall t: \ |\bar{\mu}_t(a) - \mu(a)| \le r_t(a) \,\right] \ge 1 - \frac{2}{T^2}.$$

    The Upper Confidence Bound (UCB) is defined as

    $$\mathrm{UCB}_t(a) = \bar{\mu}_t(a) + r_t(a),$$

    and the Lower Confidence Bound (LCB) is defined as

    $$\mathrm{LCB}_t(a) = \bar{\mu}_t(a) - r_t(a)$$

    for an arm $a$ at round $t$. The confidence interval is given by $[\mathrm{LCB}_t(a), \mathrm{UCB}_t(a)]$.
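
    These quantities are cheap to maintain from running pull counts and reward sums; a small helper sketch in Python (the function name and signature are my own):

    import numpy as np

    def confidence_bounds(sums, counts, T):
        """Return (LCB, UCB) arrays for all arms from per-arm reward sums and pull counts."""
        counts = np.maximum(counts, 1)              # avoid division by zero for unpulled arms
        avg = sums / counts                         # empirical mean reward per arm
        radius = np.sqrt(2 * np.log(T) / counts)    # confidence radius r_t(a) = sqrt(2 log T / n_t(a))
        return avg - radius, avg + radius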

    Higher-confidence elimination

    We can now introduce the first algorithm based on this framework, namely the higher-confidence elimination algorithm for $K = 2$ arms. The idea is to alternate between the two arms until we find that one arm is much better than the other.

    while alternating between arms a and a':
        if round t is even and UCB_t(a) < LCB_t(a'):
            abandon arm a and use a' forever (and symmetrically with a and a' swapped)

    Higher-confidence elimination has a regret of

    $$\mathbb{E}[R(t)] \le O\!\left(\sqrt{t \log T}\right) \quad \text{for all rounds } t \le T.$$

    Successive elimination

    The higher-confidence elimination algorithm operates on $K = 2$ arms. Successive elimination generalizes it to $K > 2$ arms.

    set all arms to active
    for each phase:
        try each active arm once
        deactivate every arm a such that UCB_t(a) < LCB_t(a') for some active arm a'
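
    A minimal Python sketch of successive elimination, assuming Bernoulli arms; the arm means and horizon are made up for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T = len(mu), 10_000

    counts = np.zeros(K)
    sums = np.zeros(K)
    active = np.ones(K, dtype=bool)
    t = 0

    while t < T:
        # One phase: try each active arm once.
        for a in np.flatnonzero(active):
            if t >= T:
                break
            sums[a] += rng.binomial(1, mu[a])
            counts[a] += 1
            t += 1
        # Deactivate every arm whose UCB falls below some active arm's LCB.
        avg = sums / np.maximum(counts, 1)
        radius = np.sqrt(2 * np.log(T) / np.maximum(counts, 1))
        ucb, lcb = avg + radius, avg - radius
        best_lcb = lcb[active].max()
        active &= ~(ucb < best_lcb)

    print("surviving arms:", np.flatnonzero(active))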

    Successive elimination has a regret of

    $$\mathbb{E}[R(t)] \le O\!\left(\sqrt{K t \log T}\right) \quad \text{for all rounds } t \le T,$$

    and it also satisfies the instance-dependent bound

    $$\mathbb{E}[R(T)] \le O(\log T) \sum_{a:\ \Delta(a) > 0} \frac{1}{\Delta(a)},$$

    where the sum runs over all the arms that contribute to the regret.

    Optimism under uncertainty

    An algorithm that in each round optimistically picks the arm with the highest upper confidence bound is called UCB1. The arm chosen in each round is picked either because its average reward is large or because its confidence radius is large (meaning that the arm has not been explored much).

    try each arm once
    for each round t ∈ T:
        pick arm a_t = argmax_{a ∈ A} UCB_t(a)

    where

    $$\mathrm{UCB}_t(a) = \bar{\mu}_t(a) + \sqrt{\frac{2 \log T}{n_t(a)}}.$$
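
    A minimal Python sketch of UCB1, assuming Bernoulli arms:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T = len(mu), 10_000

    counts = np.zeros(K)
    sums = np.zeros(K)

    # Initialization: try each arm once.
    for a in range(K):
        sums[a] += rng.binomial(1, mu[a])
        counts[a] += 1

    # Main loop: always pick the arm with the highest upper confidence bound.
    for t in range(K, T):
        ucb = sums / counts + np.sqrt(2 * np.log(T) / counts)
        a_t = int(np.argmax(ucb))
        sums[a_t] += rng.binomial(1, mu[a_t])
        counts[a_t] += 1

    print("empirical means:", sums / counts)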

    Bayesian bandits

    Bayesian bandits have, like stochastic bandits, $K$ arms and $T$ rounds. Additionally, we introduce the Bayesian assumption, that is, the problem instance $\mathcal{I}$ is drawn from some known distribution $\mathbb{P}$. The problem instance is specified by the mean reward vector $\mu$ together with the reward distributions $\mathcal{D}_a$, once we fix $K$ and $T$. The known distribution $\mathbb{P}$ is called the prior distribution or the Bayesian prior. We want to optimize the Bayesian regret, that is, the expected regret for a specific problem instance taken in expectation over all instances, as follows:

    $$BR(T) = \mathbb{E}_{\mathcal{I} \sim \mathbb{P}}\left[\, \mathbb{E}\left[ R(T) \mid \mathcal{I} \right] \,\right].$$

    The $t$-history is a random variable that records the arms chosen and the rewards observed in the first $t$ rounds; its distribution depends on the mean reward vector $\mu$ and on the algorithm. The following denotes the $t$-history:

    $$H_t = \left( (a_1, r_1), (a_2, r_2), \dots, (a_t, r_t) \right).$$

    A fixed sequence $H = \left( (a'_1, r'_1), \dots, (a'_t, r'_t) \right)$ is called a feasible $t$-history if for some bandit algorithm it satisfies

    $$\Pr\left[ H_t = H \right] > 0.$$

    If such an algorithm exists, we call it H-consistent.

    Thompson sampling

    for each round t ∈ T
        observe H_{t-1} = H for some feasible (t-1)-history H
        draw arm a_t independently from p_t(·| H)

    where for each arm $a$,

    $$p_t(a \mid H) = \Pr\left[\, a \text{ is the best arm} \mid H_{t-1} = H \,\right],$$

    that is, the posterior probability that arm $a$ is optimal given the observed history.

    Thompson sampling independent priors

    When we have independent priors (the prior distribution factorizes over the arms, so each arm $a$ has its own posterior $P_H^a$ given the history $H$) we can simplify the Thompson sampling algorithm to

    for each round t ∈ T
        observe H_{t-1} = H for some feasible (t-1)-history H
        for each arm a, sample mean reward mu_t(a) independently from P_H^a
        choose the arm with the largest mu_t(a)
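
    A minimal Python sketch of this independent-priors version, assuming Bernoulli rewards and a Beta(1, 1) prior on each arm's mean, so that the per-arm posterior $P_H^a$ stays a Beta distribution:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T = len(mu), 10_000

    # Beta(alpha, beta) posterior parameters per arm, starting from a Beta(1, 1) prior.
    alpha = np.ones(K)
    beta = np.ones(K)

    for t in range(T):
        samples = rng.beta(alpha, beta)    # sample mu_t(a) from each arm's posterior P_H^a
        a_t = int(np.argmax(samples))      # play the arm with the largest sampled mean
        r_t = rng.binomial(1, mu[a_t])
        alpha[a_t] += r_t                  # conjugate posterior update for Bernoulli rewards
        beta[a_t] += 1 - r_t

    print("posterior means:", alpha / (alpha + beta))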