Notation

    Definition                                  Description
    $\mathcal{I}$                               Problem instance.
    $\mathcal{A}$                               The action space.
    $\{a : \Delta(a) > 0\}$                     All the arms that contribute to the regret.
    $a$                                         Arm/action.
    $T$                                         Total number of rounds.
    $t$                                         Each round.
    $\mu(a)$                                    The mean reward for the arm $a$.
    $\mu^* = \max_{a \in \mathcal{A}} \mu(a)$   The optimal mean reward.
    $\bar{\mu}_t(a)$                            The average reward of arm $a$ up to round $t$.
    $\Delta(a) = \mu^* - \mu(a)$                Describes how bad arm $a$ is compared to the best arm. Called the gap.
    $R(T)$                                      The regret for an algorithm.
    $r(a)$                                      The confidence radius (fixed sample size $N$).
    $n_t(a)$                                    The number of samples from arm $a$ up to round $t$.
    $r_t(a)$                                    The confidence radius at round $t$.
    $[\mathrm{LCB}_t(a), \mathrm{UCB}_t(a)]$    The confidence interval.
    $H_t$                                       The $t$-history.
    $H$                                         A feasible $t$-history.

    Stochastic bandits

    Multi-armed bandits are a framework for algorithms that make decisions under uncertainty over time. At its core, an algorithm has $K$ possible actions to choose from, called arms, and $T$ rounds. The algorithm chooses an arm each round and receives a reward for that arm. The reward follows a distribution that depends only on the chosen arm. Typically, the algorithm only observes the reward of the chosen arm each round and therefore needs to explore different arms to acquire new information; this creates a tradeoff between exploration and exploitation. There are three types of feedback the algorithm can receive after each round: bandit feedback, partial feedback, and full feedback.

    IID rewards

    A basic model with independent and identically distributed (IID) rewards, called stochastic bandits, is given by the following protocol:

    Given: K arms, T rounds
        for each round t ∈ T
            pick arm a_t
            observe reward r_t ∈ [0, 1] for a_t

    Here we want to maximize the total reward over $T$ rounds. When the algorithm only observes the reward for the arm selected in the current round we call it bandit feedback. We denote the reward distribution for each arm $a$ as $\mathcal{D}_a$. The reward is randomly sampled from $\mathcal{D}_a$ each time arm $a$ is picked. The distribution of each arm is unknown to the algorithm. The rewards are bounded to $[0, 1]$ in each round to ease the calculations. Especially important is the mean reward vector $\mu = (\mu(a) : a \in \mathcal{A})$. We have that $\mu(a) = \mathbb{E}[\mathcal{D}_a]$ and the best mean reward is given by $\mu^* = \max_{a \in \mathcal{A}} \mu(a)$.
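
    As a concrete illustration, here is a minimal simulation of this protocol in Python, assuming Bernoulli reward distributions; the arm means and the uniformly random policy are made-up choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical problem instance: K = 3 Bernoulli arms with means mu(a),
    # which are unknown to the algorithm and used only to generate rewards.
    mu = np.array([0.3, 0.5, 0.7])
    K, T = len(mu), 1000

    total_reward = 0
    for t in range(T):
        a_t = rng.integers(K)           # placeholder policy: pick an arm uniformly at random
        r_t = rng.binomial(1, mu[a_t])  # reward r_t in [0, 1], drawn from the arm's distribution
        total_reward += r_t             # only the chosen arm's reward is observed (bandit feedback)

    print("total reward:", total_reward)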

    Regret

    The regret is a function of $T$ and compares the cumulative mean reward of always playing the optimal arm with the cumulative mean reward of the arms the algorithm actually played, up to round $T$. It is denoted as:

    $$R(T) = \mu^* \cdot T - \sum_{t=1}^{T} \mu(a_t).$$

    We note that $R(T)$ is a random variable, as the arm $a_t$ chosen in round $t$ is randomly sampled. We call it regret as the algorithm "regrets" not knowing the best arm. If we have a regret bound of the form $R(T) \le C \cdot f(T)$, where $f(T)$ does not depend on the mean reward vector $\mu$ and the constant $C$ does not depend on $T$, we call this regret bound instance-dependent if $C$ does depend on $\mu$ and instance-independent otherwise.
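
    As a small worked example with made-up numbers: take $K = 2$ arms with $\mu(a_1) = 0.75$ and $\mu(a_2) = 0.5$, so $\mu^* = 0.75$. An algorithm that plays each arm $T/2$ times collects expected reward $\tfrac{T}{2}(0.75 + 0.5)$, giving

    $$R(T) = 0.75\,T - \tfrac{T}{2}(0.75 + 0.5) = 0.125\,T,$$

    i.e. regret that grows linearly in $T$, whereas the algorithms below aim for regret that grows sublinearly in $T$.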

    Non-adaptive exploration

    There are two different ways we can explore: either based on the history of observed rewards, or in some fixed way. When the exploration is fixed in advance, the exploration phase does not adapt during its execution and is therefore called non-adaptive.

    Uniform exploration

    One way to choose arms is to pick them uniformly, regardless of previous results, and then exploit by playing the arm that empirically performs best. The algorithm has the following structure:

    Exploration:
        try each arm N times
    Selection:
        pick the arm a* with the highest average reward
    Exploitation:
        play a* for the rest of the rounds
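
    A minimal Python sketch of this explore-first scheme, assuming Bernoulli arms; the values of K, T and N here are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T, N = len(mu), 10_000, 200    # N = per-arm exploration budget (arbitrary here)

    # Exploration: try each arm N times.
    sums = np.zeros(K)
    for a in range(K):
        sums[a] = rng.binomial(1, mu[a], size=N).sum()

    # Selection: pick the arm with the highest average reward.
    a_star = int(np.argmax(sums / N))

    # Exploitation: play a_star for the remaining T - N*K rounds.
    total_reward = sums.sum() + rng.binomial(1, mu[a_star], size=T - N * K).sum()
    print("chosen arm:", a_star, "total reward:", total_reward)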

    The parameter $N$ is fixed here, but we will see that we can pick a value dependent on $T$ and $K$ to minimize the regret. The average reward $\bar{\mu}(a)$ should be a good estimate of the true mean reward $\mu(a)$. By utilizing the Hoeffding inequality we can write

    $$\Pr\left[\, |\bar{\mu}(a) - \mu(a)| \le r(a) \,\right] \ge 1 - \frac{2}{T^4},$$

    where we define the confidence radius $r(a) = \sqrt{2 \log T / N}$. A clean event is the event where this inequality holds for all arms simultaneously. A bad event is the complement of the clean event. Assume first that $K = 2$ and that we have a clean event. Let $a^*$ be the best arm. If the algorithm chooses the other arm $a \ne a^*$, it must be because it has a better average reward, $\bar{\mu}(a) > \bar{\mu}(a^*)$. We rearrange this according to the clean event inequality we got from the Hoeffding inequality. Thus, we have:

    $$\mu(a) + r(a) \ge \bar{\mu}(a) > \bar{\mu}(a^*) \ge \mu(a^*) - r(a^*)
      \quad\Longrightarrow\quad
      \mu(a^*) - \mu(a) \le r(a) + r(a^*) = O\!\left(\sqrt{\frac{\log T}{N}}\right).$$

    This means that we have at most $O(\sqrt{\log T / N})$ regret in each round of the exploitation phase. The exploration phase has at most 1 regret each round. To derive an upper bound we note that the first $2N$ rounds (for $K = 2$ arms) are used for exploration and the remaining $T - 2N$ rounds are used for exploitation:

    $$R(T) \le 2N + O\!\left(\sqrt{\frac{\log T}{N}}\right) \cdot (T - 2N) \le 2N + O\!\left(\sqrt{\frac{\log T}{N}} \cdot T\right).$$

    Setting $N = T^{2/3} (\log T)^{1/3}$ we get the following:

    $$R(T) \le O\!\left(T^{2/3} (\log T)^{1/3}\right).$$

    In the case we have a bad event we have the following:

    $$\mathbb{E}[R(T)] \le \mathbb{E}\left[R(T) \mid \text{clean event}\right] + T \cdot \Pr[\text{bad event}]
      \le O\!\left(T^{2/3} (\log T)^{1/3}\right) + T \cdot O\!\left(T^{-4}\right),$$

    so the bad-event contribution is negligible. When we have $K > 2$ arms, the same argument applies, but the exploration phase now takes $NK$ rounds, so we instead get $R(T) \le NK + O\!\left(\sqrt{\log T / N} \cdot T\right)$. We can set $N = (T/K)^{2/3} (\log T)^{1/3}$ to achieve the analogous result $R(T) \le O\!\left(T^{2/3} (K \log T)^{1/3}\right)$.
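
    To see why this choice of $N$ balances the exploration and exploitation terms, a quick check:

    $$N K = \left(\frac{T}{K}\right)^{2/3} (\log T)^{1/3} \cdot K = T^{2/3} (K \log T)^{1/3},
      \qquad
      \sqrt{\frac{2 \log T}{N}} \cdot T = \sqrt{2}\, \left(\frac{K}{T}\right)^{1/3} (\log T)^{1/3} \cdot T = \sqrt{2}\, T^{2/3} (K \log T)^{1/3},$$

    so both terms are of order $T^{2/3} (K \log T)^{1/3}$.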

    Epsilon greedy

    One drawback with this algorithm is that it has poor performance during the exploration phase. The $\epsilon$-greedy algorithm does not have this issue, since it spreads the exploration uniformly over the rounds:

    for each round t ∈ T:
        draw e_t uniformly from [0, 1]
        if e_t <= ε_t:
            explore: pick an arm uniformly at random
        else:
            exploit: pick the arm with the highest average reward so far
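
    A minimal Python sketch of $\epsilon$-greedy, assuming Bernoulli arms and the decaying exploration probability $\epsilon_t \sim t^{-1/3} (K \log t)^{1/3}$ discussed below:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T = len(mu), 10_000

    counts = np.zeros(K)              # number of pulls per arm
    sums = np.zeros(K)                # accumulated reward per arm

    for t in range(1, T + 1):
        # Exploration probability ~ t^(-1/3) (K log t)^(1/3), forced to 1 on the first round.
        eps_t = 1.0 if t == 1 else min(1.0, (K * np.log(t) / t) ** (1 / 3))
        if rng.random() <= eps_t:
            a_t = rng.integers(K)                               # explore: uniform random arm
        else:
            a_t = int(np.argmax(sums / np.maximum(counts, 1)))  # exploit: best empirical arm
        r_t = rng.binomial(1, mu[a_t])
        counts[a_t] += 1
        sums[a_t] += r_t

    print("empirical means:", sums / np.maximum(counts, 1))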

    With the exploration probability $\epsilon_t = t^{-1/3} (K \log t)^{1/3}$ we get the regret bound $\mathbb{E}[R(t)] \le O\!\left(t^{2/3} (K \log t)^{1/3}\right)$ for each round $t$. However, in both of these algorithms the exploration does not depend on the history of the observed rewards. We could do better.

    Adaptive exploration

    We could react directly to the history of rewards to select more suitable candidates for exploration. This is called adaptive exploration. To define this framework we let $n_t(a)$ be the number of samples from arm $a$ in rounds $1, \dots, t$, and we let $\bar{\mu}_t(a)$ be the average reward of arm $a$ so far. Again with the help of the Hoeffding inequality we want to derive

    $$\Pr\left[\, |\bar{\mu}_t(a) - \mu(a)| \le r_t(a) \,\right] \ge 1 - \frac{2}{T^4},$$

    where we define $r_t(a) = \sqrt{2 \log T / n_t(a)}$. Here $r_t(a)$ is called the confidence radius. If $n_t(a)$ were fixed we would have the same scenario as in the uniform case, but $n_t(a)$ is a random variable so it cannot be fixed. The samples from arm $a$ are not completely independent either, because the decision to play $a$ at round $t$ may depend on the previous rewards of $a$. To build a solid argument we introduce something called a reward tape, that is, a $K \times T$ table where each cell in row $a$ is independently sampled from $\mathcal{D}_a$. The $j$th time an arm $a$ is drawn, its reward is taken from the $j$th cell in the arm's reward tape. We let $\bar{v}_j(a)$ be the average reward for arm $a$ over the first $j$ times that it is drawn. By the Hoeffding inequality we have

    $$\forall j: \quad \Pr\left[\, |\bar{v}_j(a) - \mu(a)| \le \sqrt{\frac{2 \log T}{j}} \,\right] \ge 1 - \frac{2}{T^4}.$$

    With the help of Boole's inequality (the union bound) over all arms $a$ and all values of $j$ we get

    $$\Pr\left[\, \forall a\ \forall j: \ |\bar{v}_j(a) - \mu(a)| \le \sqrt{\frac{2 \log T}{j}} \,\right] \ge 1 - \frac{2}{T^2}.$$

    This implies the following, which is the clean event in the derivations below:

    $$\Pr\left[\, \forall a\ \forall t: \ |\bar{\mu}_t(a) - \mu(a)| \le r_t(a) \,\right] \ge 1 - \frac{2}{T^2}.$$

    The Upper Confidence Bound (UCB) is defined as

    $$\mathrm{UCB}_t(a) = \bar{\mu}_t(a) + r_t(a),$$

    and the Lower Confidence Bound (LCB) is defined as

    $$\mathrm{LCB}_t(a) = \bar{\mu}_t(a) - r_t(a)$$

    for an arm $a$ at round $t$. The confidence interval is given by $[\mathrm{LCB}_t(a), \mathrm{UCB}_t(a)]$.
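
    These quantities are cheap to maintain from running pull counts and reward sums; a small helper sketch in Python (the function name and signature are my own):

    import numpy as np

    def confidence_bounds(sums, counts, T):
        """Return (LCB, UCB) arrays for all arms from per-arm reward sums and pull counts."""
        counts = np.maximum(counts, 1)              # avoid division by zero for unpulled arms
        avg = sums / counts                         # empirical mean reward per arm
        radius = np.sqrt(2 * np.log(T) / counts)    # confidence radius r_t(a) = sqrt(2 log T / n_t(a))
        return avg - radius, avg + radius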

    Higher-confidence elimination

    We can now introduce the first algorithm based on this framework, namely the higher-confidence elimination algorithm for $K = 2$ arms. The idea is to alternate between the two arms until we find that one arm is much better than the other.

    while alternating between arms a and a':
        if round t is even and UCB_t(a) < LCB_t(a'):
            abandon arm a and use a' forever (and symmetrically with a and a' swapped)

    Higher-confidence elimination has a regret of

    $$\mathbb{E}[R(t)] \le O\!\left(\sqrt{t \log T}\right) \quad \text{for all rounds } t \le T.$$

    Successive elimination

    The higher-confidence elimination algorithm operates on $K = 2$ arms. Successive elimination generalizes it to $K > 2$ arms.

    set all arms to active
    for each phase:
        try each active arm once
        deactivate every arm a such that UCB_t(a) < LCB_t(a') for some active arm a'
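
    A minimal Python sketch of successive elimination, assuming Bernoulli arms; the arm means and horizon are made up for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T = len(mu), 10_000

    counts = np.zeros(K)
    sums = np.zeros(K)
    active = np.ones(K, dtype=bool)
    t = 0

    while t < T:
        # One phase: try each active arm once.
        for a in np.flatnonzero(active):
            if t >= T:
                break
            sums[a] += rng.binomial(1, mu[a])
            counts[a] += 1
            t += 1
        # Deactivate every arm whose UCB falls below some active arm's LCB.
        avg = sums / np.maximum(counts, 1)
        radius = np.sqrt(2 * np.log(T) / np.maximum(counts, 1))
        ucb, lcb = avg + radius, avg - radius
        best_lcb = lcb[active].max()
        active &= ~(ucb < best_lcb)

    print("surviving arms:", np.flatnonzero(active))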

    Successive elimination has a regret of

    $$\mathbb{E}[R(t)] \le O\!\left(\sqrt{K t \log T}\right) \quad \text{for all rounds } t \le T,$$

    and it also satisfies the instance-dependent bound

    $$\mathbb{E}[R(T)] \le O(\log T) \sum_{a:\ \Delta(a) > 0} \frac{1}{\Delta(a)},$$

    where the sum runs over all the arms that contribute to the regret.

    Optimism under uncertainty

    An algorithm that in each round optimistically picks the arm with the highest upper confidence bound is called UCB1. The arm chosen in each round is picked either because its average reward is large or because its confidence radius is large (meaning that the arm has not been explored much).

    try each arm once
    for each round t ∈ T:
        pick arm a_t = argmax_{a ∈ A} UCB_t(a)

    where

    $$\mathrm{UCB}_t(a) = \bar{\mu}_t(a) + \sqrt{\frac{2 \log T}{n_t(a)}}.$$
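
    A minimal Python sketch of UCB1, assuming Bernoulli arms:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T = len(mu), 10_000

    counts = np.zeros(K)
    sums = np.zeros(K)

    # Initialization: try each arm once.
    for a in range(K):
        sums[a] += rng.binomial(1, mu[a])
        counts[a] += 1

    # Main loop: always pick the arm with the highest upper confidence bound.
    for t in range(K, T):
        ucb = sums / counts + np.sqrt(2 * np.log(T) / counts)
        a_t = int(np.argmax(ucb))
        sums[a_t] += rng.binomial(1, mu[a_t])
        counts[a_t] += 1

    print("empirical means:", sums / counts)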

    Bayesian bandits

    Bayesian bandits have, like stochastic bandits, $K$ arms and $T$ rounds. Additionally, we introduce the Bayesian assumption, that is, the problem instance $\mathcal{I}$ is drawn from some known distribution $\mathbb{P}$. The problem instance is specified by the mean reward vector $\mu$ together with the reward distributions $\mathcal{D}_a$, once we fix $K$ and $T$. The known distribution $\mathbb{P}$ is called the prior distribution or the Bayesian prior. We want to optimize the Bayesian regret, that is, the expected regret for a specific problem instance taken in expectation over all instances, as follows:

    $$BR(T) = \mathbb{E}_{\mathcal{I} \sim \mathbb{P}}\left[\, \mathbb{E}\left[ R(T) \mid \mathcal{I} \right] \,\right].$$

    The $t$-history is a random variable that records the arms chosen and the rewards observed in the first $t$ rounds; its distribution depends on the mean reward vector $\mu$ and on the algorithm. The following denotes the $t$-history:

    $$H_t = \left( (a_1, r_1), (a_2, r_2), \dots, (a_t, r_t) \right).$$

    A fixed sequence $H = \left( (a'_1, r'_1), \dots, (a'_t, r'_t) \right)$ is called a feasible $t$-history if for some bandit algorithm it satisfies

    $$\Pr\left[ H_t = H \right] > 0.$$

    If such an algorithm exists, we call it H-consistent.

    Thompson sampling

    for each round t ∈ T
        observe H_{t-1} = H for some feasible (t-1)-history H
        draw arm a_t independently from p_t(·| H)

    where for each arm $a$,

    $$p_t(a \mid H) = \Pr\left[\, a \text{ is the best arm} \mid H_{t-1} = H \,\right],$$

    that is, the posterior probability that arm $a$ is optimal given the observed history.

    Thompson sampling independent priors

    When we have independent priors (the prior distribution factorizes over the arms, so each arm $a$ has its own posterior $P_H^a$ given the history $H$) we can simplify the Thompson sampling algorithm to

    for each round t ∈ T
        observe H_{t-1} = H for some feasible (t-1)-history H
        for each arm a, sample mean reward mu_t(a) independently from P_H^a
        choose the arm with the largest mu_t(a)
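
    A minimal Python sketch of this independent-priors version, assuming Bernoulli rewards and a Beta(1, 1) prior on each arm's mean, so that the per-arm posterior $P_H^a$ stays a Beta distribution:

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([0.3, 0.5, 0.7])    # unknown arm means (used only to simulate rewards)
    K, T = len(mu), 10_000

    # Beta(alpha, beta) posterior parameters per arm, starting from a Beta(1, 1) prior.
    alpha = np.ones(K)
    beta = np.ones(K)

    for t in range(T):
        samples = rng.beta(alpha, beta)    # sample mu_t(a) from each arm's posterior P_H^a
        a_t = int(np.argmax(samples))      # play the arm with the largest sampled mean
        r_t = rng.binomial(1, mu[a_t])
        alpha[a_t] += r_t                  # conjugate posterior update for Bernoulli rewards
        beta[a_t] += 1 - r_t

    print("posterior means:", alpha / (alpha + beta))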