# reward.pdf

**reward.pdf**

This PDF 1.5 document was generated by LaTeX with the hyperref package / pdfTeX-1.40.13, and was uploaded to fichier-pdf.fr on 24/03/2014.

Document size: 6.8 MB (21 pages).

### Document preview

Primate prediction of reward for the adversarial multi-armed bandit problem

Alexis JACQ

March 22, 2014

Contents

1 Introduction
2 Primate prediction of reward
3 The Exp.3 algorithm facing a switching strategy
4 Network view of the Exp.3
5 Exp.3 network with reward prediction
    5.1 Empirical learning of the reward
    5.2 Integrate-and-fire dopamine effect
    5.3 First algorithm
    5.4 Evolution of dopamine level
    5.5 It brings about a more reactive learning
6 Improvements
    6.1 Results
7 Long game
8 Learning sequences of actions
    8.1 Network structure
    8.2 Results
9 Conclusion

1 Introduction

We introduce an algorithm that solves adversarial multi-armed bandit (AMAB) problems with a reactive behavior when facing an adversary with moving strategies. It is inspired by the Exp.3 algorithm. Like primates, the algorithm uses a prediction of rewards [1], but here the prediction is used to regulate the reinforcement impacts. We then show different applications where this algorithm seems to perform better than classical Exp.3 (especially for learning sequences of actions).

2 Primate prediction of reward

Dopamine seems to be involved in signalling "good" rewards [2]. Conditioning experiments with primates have shown that [1]:

(1) the primate gets a dopamine peak when it receives a good reward (figure 1, top).

(2) when the primate learns that a conditioning signal (CS) predicts a good reward, it gets a dopamine peak just after the CS, and the reward no longer causes a dopamine peak (figure 1, middle).

(3) if we do not reward a primate that has learned the CS, it gets a "negative" peak of dopamine (figure 1, bottom).

In the learning phase of these experiments, the primate has to learn by association a behavior after the CS in order to get the reward. This gives the context of our learning algorithm: there is a fixed number of different CS, and for each CS we want to learn the best action to maximize the following reward. But the supervisor can suddenly decide to change the rules of the game: he can move the action that leads to the best reward. The learning agent then has to forget its strategy for a new one. In this sense, the supervisor can be seen as an adversarial player that leads the game.


Figure 1: Changes in dopamine neurons' output code for an error in the prediction of appetitive events. (Top) Before learning, a drop of appetitive fruit juice occurs in the absence of prediction, hence a positive error in the prediction of reward. The dopamine neuron is activated by this unpredicted occurrence of juice. (Middle) After learning, the conditioned stimulus predicts reward, and the reward occurs according to the prediction, hence no error in the prediction of reward. The dopamine neuron is activated by the reward-predicting stimulus but fails to be activated by the predicted reward (right). (Bottom) After learning, the conditioned stimulus predicts a reward, but the reward fails to occur because of a mistake in the behavioral response of the primate. The activity of the dopamine neuron is depressed exactly at the time when the reward would have occurred. The depression occurs more than 1 s after the conditioned stimulus without any intervening stimuli, revealing an internal representation of the time of the predicted reward. Neuronal activity is aligned on the electronic pulse that drives the solenoid valve delivering the reward liquid (top) or the onset of the conditioned visual stimulus (middle and bottom). Each panel shows the peri-event time histogram and raster of impulses from the same neuron. Horizontal distances of dots correspond to real-time intervals. Each line of dots shows one trial. The original sequence of trials is plotted from top to bottom.

3 The Exp.3 algorithm facing a switching strategy

The Exp.3 algorithm (see algorithm 1) performs a stochastic exploration of the different possible actions. In this way, an adversarial player cannot deterministically predict Exp.3's behavior. It is one of the best algorithms when each pair (action, adversarial action) sends a deterministic reward, and its speed in learning the best action is optimal: in [3], a matching lower bound shows that a rate of order O(T^{-1/2}) is the best possible for learning an action over a horizon T (and it is reached by the best variants of Exp.3).

Algorithm 1 Exp.3 during a learning period T

parameters: β, η ∈ (0,1]
initialization:
    W_a = 1 ∀ a ∈ a_1...a_K    ▷ weights of actions
for t = 1 : T do
    pull action a_t following the distribution P[a_t = a_j] = (1 − β) W_{a_j} / (Σ_{i=1}^K W_{a_i}) + β/K
    receive reward r_t = r(a_t, t)
    update the weight W_{a_t} ← W_{a_t} exp( η r_t / (K · P[a_t = a_j]) )
end for
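Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under our own naming, not the original implementation; the `reward` callback is an assumption standing in for the adversary.

```python
import math
import random

def exp3(reward, K, T, eta=0.3, beta=0.01, rng=random):
    """Run Exp.3 for T rounds; reward(a, t) returns the reward of arm a at time t."""
    W = [1.0] * K                      # one weight per action
    history = []
    for t in range(T):
        total = sum(W)
        # mixture of the weight distribution and uniform exploration
        probs = [(1 - beta) * w / total + beta / K for w in W]
        a = rng.choices(range(K), weights=probs)[0]
        r = reward(a, t)
        # exponential update, scaled by the probability of the pulled arm
        W[a] *= math.exp(eta * r / (K * probs[a]))
        history.append((a, r))
    return history

# toy game: arm 0 always pays 1, every other arm pays -1
hist = exp3(lambda a, t: 1.0 if a == 0 else -1.0, K=10, T=500)
```

The β/K term keeps every arm's probability bounded away from zero, which is what makes the importance-weighted update `r / (K * probs[a])` well behaved.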

Now, if an agent is given a reward r for action a and −r for action b, and then the rules switch (−r for a and r for b), Exp.3 will forget action a in almost the same time it took to learn it (as long as β is small). If β is large, the agent explores more and quickly detects a new good action, but it does not perform optimal exploitation of the best actions (figure 2).

Moreover, if an action a gives the best reward r during a long period, the probability of pulling a becomes very close to 1 and the reward impact η·r_t/(K·P[a_t = a_j]) becomes very small. Thus, if the action's reward switches to −r, the agent needs the same long period to forget a (figure 3).


Figure 2: Frequency of using action a (red) and action b (green) during a game. Before the switching time t_s = 400, action a sends reward 1 and b sends −1; after t_s, a sends −1 and b sends 1. There is a choice of 10 actions (including a and b). The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. Left: η = 0.3 and β = 0.01. Right: η = 0.3 and β = 0.


Figure 3: Frequency of using action a (red) and action b (green) during a game with a long period before the switching time t_s = 1000. There is a choice of 10 actions (including a and b). We use parameters η = 0.1 and β = 0.01. After t_s the algorithm needs a period of 1000 steps to play action b more often than a, and a period of 2000 steps to definitely forget a.


We would like to build an algorithm that learns the reward given by each action, in order to regulate the reward impact. Like the primate, it would predict the reward and use a null reward impact to stop the exponential growth of W_a. And to instantly forget an action, it would experience a negative reward when frustrated: if an agent is given a reward R = r for action a and R = 0 for action b, and then the rules switch (R = 0 for a and R = r for b), we expect the agent to instantly forget a out of frustration (it uses a negative impact even without a negative punishment R = −r).


4 Network view of the Exp.3

We can see the Exp.3 algorithm as a network in which a node representing the CS is connected to each node representing an action (see figure 4). The edges are oriented (CS→actions) and weighted according to the Exp.3 weights. When the CS spikes, it makes one action node spike, chosen following the Exp.3 distribution:

P[a_j spikes | CS] = (1 − β) W_{a_j}(CS) / (Σ_{i=1}^K W_{a_i}(CS)) + β/K

Then the weight of the used edge is updated by the obtained reward r:

W^{t+1}_{a_j}(CS) = W^t_{a_j}(CS) exp( η r / (K · P[a_j | CS]) )

We do not update the others:

W^{t+1}_{a_i}(CS) = W^t_{a_i}(CS) ∀ i ≠ j

(We can also see the reward as a node that reinforces the edges when it is spiked.)


Figure 4: A network representation of the Exp.3 algorithm.


5 Exp.3 network with reward prediction

Let us come back to the primate's dopamine evolution. Imagine that the nodes in the above network are neurons (or groups of neurons). When the CS node spikes, it makes an action node spike; the agent performs the action and receives the corresponding reward. The reward then activates a reward node/neuron that causes a dopamine peak, and this dopamine reinforces the edge between the CS and the chosen action. Now, if the agent can learn the corresponding reward, imagine that the CS also causes a dopamine peak at the learned level. We assume that this prior dopamine peak has an inhibiting effect on the reinforcement impact.

This process could produce a dopamine evolution like the one in figure 1. We can model it with the following network:


Figure 5: Network with regulated dopamine reinforcement. The D-node represents the dopamine circuit that reinforces the CS→action edge. D is inhibited by the CS and activated by the reward R.

5.1 Empirical learning of the reward

Let the agent learn the empirical reward given by action a after the CS C_s, following this stochastic update:

R̂(C_s, a) ← (1 − µ) · R̂(C_s, a) + µ · R

where R is the last reward given by (C_s, a) and µ ∈ (0,1] is a parameter.
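This update is a standard exponential moving average. A small sketch (the function name is ours) shows R̂ converging to a fixed reward at the geometric rate (1 − µ)^t:

```python
def update_estimate(R_hat, R, mu=0.1):
    """One stochastic-averaging step: R_hat tracks the reward of a (CS, action) pair."""
    return (1 - mu) * R_hat + mu * R

R_hat = 0.0
for _ in range(100):
    R_hat = update_estimate(R_hat, 1.0)   # fixed reward R = 1
# after t steps the remaining error is (1 - mu)**t, here 0.9**100, about 2.7e-5
```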

5.2 Integrate-and-fire dopamine effect

The reinforcement impact must occur at reward time. In practice, the agent would also have to learn empirically the average reward time after the CS; here we are in a multi-armed bandit context, so there is no delay between CS, action and reward.

When an action a is chosen after a CS, the agent (which has learned an empirical reward R̂) receives a reward R. This causes a dopamine level D = −R̂ (inhibition from the CS) + R (activation from the reward). The weight of the edge W_a is then reinforced as follows:

W_a(CS) ← W_a(CS) exp( η D / (K · P[a_j | CS]) )

5.3 First algorithm

We can rewrite the Exp.3 algorithm with this modification:

Algorithm 2 Learning-reward Exp.3 (LR-Exp.3)

parameters: β, η, µ ∈ (0,1]
initialization:
    W_a = 1 ∀ a ∈ a_1...a_K    ▷ weights of actions
    R̂(a) = 0 ∀ a ∈ a_1...a_K    ▷ empirical reward of actions
for t = 1 : T do
    pull action a_t following the distribution P[a_t = a_j] = (1 − β) W_{a_j} / (Σ_{i=1}^K W_{a_i}) + β/K
    receive reward r_t = r(a_t, t)
    set the dopamine level D_t = r_t − R̂(a_t)
    update the weight W_{a_t} ← W_{a_t} exp( η D_t / (K · P[a_t = a_j]) )
    update the empirical reward R̂(a_t) ← (1 − µ) · R̂(a_t) + µ · r_t
end for
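Algorithm 2 can be sketched in Python like this (an illustration under our own naming; the `reward` callback is an assumption):

```python
import math
import random

def lr_exp3(reward, K, T, eta=0.1, beta=0.01, mu=0.1, rng=random):
    """Exp.3 with an empirical reward estimate: the dopamine level
    D_t = r_t - R_hat[a_t] replaces the raw reward in the weight update."""
    W = [1.0] * K
    R_hat = [0.0] * K                      # empirical reward of each action
    history = []
    for t in range(T):
        total = sum(W)
        probs = [(1 - beta) * w / total + beta / K for w in W]
        a = rng.choices(range(K), weights=probs)[0]
        r = reward(a, t)
        D = r - R_hat[a]                   # prediction error, the "dopamine level"
        W[a] *= math.exp(eta * D / (K * probs[a]))
        R_hat[a] = (1 - mu) * R_hat[a] + mu * r
        history.append((a, r))
    return history, W, R_hat
```

Note that as R_hat[a] approaches the true reward of a, D tends to zero and the weight stops growing, which is exactly the inhibition effect described above.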

5.4 Evolution of dopamine level

With such a reinforcement, the agent does not need punishment to forget an action. When the strategy moves, it gets a frustration D = 0 − R̂, which decreases the action weight. What is more, if we initialize R̂ = 0, the first reward has a full impact, and the following ones decrease to zero: for a fixed reward R (given by CS and action a), R̂ → R at the speed of (1 − µ)^t → 0 as t → ∞. Of course, we could improve this speed using a time-dependent parameter µ = 1/t (which brings classical concentration inequalities), but then the agent could not track a moving reward R when the adversary switches to a new strategy. Now, when the agent forgets action a, it also forgets the empirical reward R̂(a), because the new R equals zero (or some other small value). Finally, the new good action b gives a large reward against a small empirical reward R̂(b), so this new action is instantly reinforced. Figure 6 shows the evolution of the dopamine level given by LR-Exp.3 during the same game used in figures 2 and 3 (it also shows the evolution of the empirical rewards associated with each action):


Figure 6: Evolution of the dopamine level (blue) during a T = 1200 game. The strategy switch time is set at t_s = 400. There is no punishment: before t_s, action a sends reward 1 and b sends 0; after t_s, a sends 0 and b sends 1. There is a choice of 10 actions (including a and b). The red curve shows the evolution of the empirical reward R̂(a) and the green one shows the evolution of R̂(b). The parameters used are η = µ = 0.1 and β = 0.01. The evolution of the dopamine level has the same aspect as the dopamine peaks of the primate at reward time (figure 1).


5.5 It brings about a more reactive learning

Now, let's take a look at the frequency of actions chosen by LR-Exp.3:


Figure 7: Frequency of using action a (red) and action b (green) during a game with LR-Exp.3. Before the switching time t_s = 400, action a sends reward 1 and b sends reward 0; after t_s, a sends 0 and b sends 1. There is a choice of 10 actions (including a and b). The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. The parameters used are η = µ = 0.1 and β = 0.01. The algorithm detects the switch almost instantly and moves to explore other actions, while Exp.3 needs a minimum time to forget.

6 Improvements

Since the reinforcement of the best action is limited by inhibition from the empirical reward, the algorithm no longer needs the exploration term β/K to choose an action. But we keep it to normalize the reinforcement term (K · P[a_t = a_j]): otherwise the reinforcement at the beginning is too large. We will even use a large β to get a good upper bound on the reinforcement-normalization term (indeed, for a large β this bound is essentially β).

If the rules of the game have just moved and we are looking for a new best action, we hope that this action has a small empirical reward compared with the reward it will send. The best would be to forget an empirical reward very quickly, to make sure the action will be selected again when it becomes the best. To do that we can accelerate the empirical learning when D is negative (e.g. using µ̃ = µ^{1/K} seems to give good results).

In the same spirit, we can force the agent to forget an action that causes frustration (if this is a false alarm, then since the empirical reward is also forgotten, the action sends large dopamine peaks D and is quickly selected again). In other terms, the learning agent doesn't like jokes. A simple idea is to add a real parameter ε > 1 that increases η when D is negative.

Finally, in all the games shown above there were 10 actions: one action sent a reward R = 1 and all the others were bad (R = −1 for classical Exp.3 and R = 0 for LR-Exp.3). But we would like similar results when each action sends a reward in [−1,1] for Exp.3 and in [0,1] for LR-Exp.3. The intuitive way is to use D = R − max_{a_i ∈ a_1..a_K} R̂(a_i) (frustration relative to the best empirical reward over all actions).

All these modifications lead us to the following algorithm:

Algorithm 3 Improved LR-Exp.3

parameters: β, η, µ ∈ (0,1], ε > 1
initialization:
    W_a = 1 ∀ a ∈ a_1...a_K    ▷ weights of actions
    R̂(a) = 0 ∀ a ∈ a_1...a_K    ▷ empirical reward of actions
for t = 1 : T do
    pull action a_t following the distribution P[a_t = a_j] = W_{a_j} / (Σ_{i=1}^K W_{a_i})
    receive reward r_t = r(a_t, t)
    set the dopamine level D_t = r_t − R̂(a_t)
    if D_t < 0, set η̃ = ε·η and µ̃ = µ^{1/K}; otherwise set η̃ = η and µ̃ = µ
    update the weight W_{a_t} ← W_{a_t} exp( η̃ D_t / (K·(1 − β)·P[a_t = a_j] + β) )
    update the empirical reward R̂(a_t) ← (1 − µ̃) · R̂(a_t) + µ̃ · r_t
end for
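Only a few lines change relative to LR-Exp.3. A sketch of the modified per-step update (names are ours; `eps` is the ε > 1 boost and the denominator is the β-bounded normalization described above):

```python
import math

def improved_step(W, R_hat, a, r, eta=0.3, beta=0.9, mu=0.1, eps=10.0):
    """One update of improved LR-Exp.3 for pulled arm a with reward r.
    W and R_hat are mutated in place."""
    K = len(W)
    p = W[a] / sum(W)                       # pure weight distribution, no beta mixing
    D = r - R_hat[a]                        # dopamine level
    if D < 0:                               # frustration: reinforce and forget faster
        eta_t, mu_t = eps * eta, mu ** (1.0 / K)
    else:
        eta_t, mu_t = eta, mu
    # large beta bounds the normalization term K(1-beta)p + beta away from zero
    W[a] *= math.exp(eta_t * D / (K * (1 - beta) * p + beta))
    R_hat[a] = (1 - mu_t) * R_hat[a] + mu_t * r
    return W, R_hat
```

With β = 0.9 the denominator stays between β and 1 + β·(1 − K·p)/1, so even a near-deterministic arm receives a usefully sized update.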

6.1 Results

The following figures show the frequency of actions chosen by improved LR-Exp.3, still using the T = 1200 game with a switch time at t_s = 400.


Figure 8: Frequency of using action a (red) and action b (green) during a game with improved LR-Exp.3. Before the switching time t_s = 400, action a sends reward 1 and b sends reward 0; after t_s, a sends 0 and b sends 1. There is a choice of 10 actions (including a and b); only one action gives a positive reward r = 1 at a time. The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. The parameters used are η = 0.3, µ = 0.1, β = 0.9 and ε = 10.


Figure 9: Frequency of using action a (red) and action b (green) during a game with improved LR-Exp.3. There is a choice of 3 actions (including a and b). Before the switching time t_s = 400, action a sends reward 1, b sends reward 0.5 and the other action sends reward 0.8; after t_s, a sends 0.4, b sends 1 and the other action sends 0.5. The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. The parameters used are η = 0.1, µ = 0.1, β = 0.9 and ε = 10.



Figure 10: Frequency of using action a (red) and action b (green) during a game with improved LR-Exp.3 (left) and classical Exp.3 (right). There is a choice of 3 actions (including a and b). Before the switching time t_s = 400, action a sends reward 1, b sends reward 0.5 and the other action sends reward 0.8; after t_s, a sends 0.7 (hard to detect), b sends 1 and the other action sends 0.5. The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. The parameters used for improved LR-Exp.3 are η = 0.1, µ = 0.1, β = 0.9 and ε = 10; for classical Exp.3, η = 0.3 and β = 0.01 (which seems to be the best for this particular problem).

7 Long game

We now ask: what about a longer game? All the above tests were on a game with only one revolution of the rules. But what happens if we play against a Markovian adversary that often moves the rules?

Here we present results for a game of period T = 10000. There are 3 possible actions. At each time step t, one action gives a reward r = 1, another gives r = 7.6, and the last one gives r = 0. There are two Markovian adversaries, M1 and M2: M1 moves the number of the best action, M2 moves the number of the second action. If at some time the best and second actions have the same number, only the best action counts, and we use reward r = 1 if the agent chooses the corresponding number. When M1 or M2 decides to move the rules, it chooses a new action number at random between the two other numbers. Their probability of staying in a state depends on the current number (for M1, P = 0.999 for each number; for M2, P = 0.999 for 1, 0.993 for 2 and 0.995 for 3). The following figures compare the regret curves (the difference between the best possible score obtained by a medium player and the score obtained by the algorithm) given by classical Exp.3 and improved LR-Exp.3 on this game. To be sure to span the possible parameters for Exp.3, we generate them using a Halton sequence in the unit square and then in the square scaled by 1/10 (Halton sequences have the property of optimizing the discrepancy over a given space).
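A Halton sequence interleaves radical-inverse (van der Corput) sequences in coprime bases. A sketch of a 2-D generator for sweeping (β, η), assuming the usual bases 2 and 3:

```python
def radical_inverse(n, base):
    """Van der Corput radical inverse of n in the given base, in [0, 1)."""
    inv, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        inv += digit / denom
    return inv

def halton_2d(count, scale=1.0):
    """First `count` 2-D Halton points scaled into (0, scale)^2, bases 2 and 3."""
    return [(scale * radical_inverse(i, 2), scale * radical_inverse(i, 3))
            for i in range(1, count + 1)]

# parameter sweep for Exp.3: take beta = x and eta = y over the unit square,
# then again with scale=0.1 for the square scaled by 1/10
points = halton_2d(100)
```

Unlike a uniform random sweep, consecutive Halton points fill the square evenly (low discrepancy), so 100 points already cover the parameter space well.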



Figure 11: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red). Each curve is a mean over 20 such games with fixed parameters. We test 100 sets of parameters for Exp.3, given by a Halton sequence on the unit square. The parameters of improved LR-Exp.3 are fixed: η = 0.3, µ = 0.1, β = 0.9. Right: the visited Halton sequence in the unit square; the Y-axis is the parameter η, the X-axis the parameter β.


Figure 12: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red). Each curve is a mean over 20 such games with fixed parameters. We test 100 sets of parameters for Exp.3, given by a Halton sequence on the unit square scaled by 1/10. The parameters of improved LR-Exp.3 are fixed: η = 0.3, µ = 0.1, β = 0.9. Right: the visited Halton sequence in the scaled square; the Y-axis is the parameter η, the X-axis the parameter β.

8 Learning sequences of actions

Imagine an agent that has to find and learn the best sequence of actions. It receives a reward only when it has just realized an effective sequence. Example: with 3 possible actions, the sequence [a,b,c,c] gives a reward r = 1, [a,b,a] gives r = 0.5, and every other sequence (including singletons) gives zero reward. We want the learner to be able to quickly find (and then to choose almost always) the best sequence [a,b,c,c]. Since we do not guide the agent with small (increasing) rewards for the sub-sequences, this problem is difficult to solve with simple Exp.3, and it becomes even harder if an adversary sometimes moves the rules.

8.1 Network structure

To solve such a problem we need to encode the possible sequences. For that, we set a maximal depth d_m, which gives the maximal length of sequence the agent can learn. Then, imagine that the agent can see its own actions: an action becomes a CS for the following one. Each action leads to a following action, just as in the network described in figure 4. In this way, we build a kind of reinforcement random walk in a decision tree:

(Diagram: a decision tree over the actions. Each node a_i carries weighted outgoing edges W^t_{a_j | a_1,...,a_i}, one per possible next action, conditioned on the path followed so far.)

Note (1): in this graph, the first node at the left represents any leaf of the tree: the tree is repeated recursively at each leaf. From a cyclic point of view, we could imagine that the tree is plunged into a surface of revolution. We can draw such a tree assuming depth 2 and only two different actions (with more nodes the drawing becomes unreadable):

(Diagram: the cyclic tree of depth 2 over the two actions a_1 and a_2.)

Note (2): another (but equivalent) way to see it: we can also imagine a higher-order Markov chain on a clique. The order of the chain is the depth, and there is one node in the clique for each action.

In the decision tree, at each time step of the walk, we remember the last d_m actions done. If these last actions realize an effective sequence (of length l ≤ d_m), the agent is given the associated reward. When the agent receives a reward, it reinforces accordingly the last l edges used. The naive algorithm is to use connected classical Exp.3 instances to choose the following action and to reinforce a path. Thus, an action is chosen according to:

P[a_t = a_j] = (1 − β) W_{a_j} / (Σ_{i=1}^K W_{a_i}) + β/K

where the W are the weights of the edges that spring out of the last node left. The following algorithm describes such a reinforcement from a current depth d (assuming d_m = 4):


Algorithm 4 Exp.3 reinforcement of the path from depth d

inputs:
    the last 4 actions visited [a_1, a_2, a_3, a_4]
    the weights of the edges that spring out of the last 4 nodes left [W_0, W_1, W_2, W_3]
    the depth d
    the parameters η, β
compute the reward: r = r([a_1, a_2, a_3, a_4])
for i = 1 : 4 do
    d ← (d − 1) mod 4    ▷ the parameters may depend on the depth
    p ← (1 − β(d)) W_{i−1}(a_i) / (Σ_{j=1}^4 W_{i−1}(a_j)) + β(d)/4
    W_{i−1}(a_i) ← W_{i−1}(a_i) exp( η(d) r / p )
end for
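The walk plus Algorithm 4 can be sketched as follows, keying one Exp.3 weight vector per context of recent actions so the tree is built lazily (all names and the reward interface are ours; depth-dependent parameters are omitted for brevity):

```python
import math
import random

def choose(weights, beta, rng):
    """One Exp.3 draw from a weight vector, with uniform exploration beta."""
    K = len(weights)
    total = sum(weights)
    probs = [(1 - beta) * w / total + beta / K for w in weights]
    a = rng.choices(range(K), weights=probs)[0]
    return a, probs[a]

def run_sequence_exp3(reward, K, d_m, T, eta=0.1, beta=0.01, rng=random):
    """Random walk in the decision tree: each context (tuple of recent actions)
    owns its own Exp.3 weight vector, and a reward reinforces the path used."""
    weights = {}                            # context tuple -> weight vector
    context = ()
    decisions = []                          # (context, action, prob) history
    score = 0.0
    for t in range(T):
        W = weights.setdefault(context, [1.0] * K)
        a, p = choose(W, beta, rng)
        decisions.append((context, a, p))
        context = (context + (a,))[-d_m:]   # remember the last d_m actions
        r = reward(context)                 # reward of the realized sequence
        score += r
        if r != 0.0:                        # reinforce the last edges used
            for ctx, act, pa in decisions[-d_m:]:
                weights[ctx][act] *= math.exp(eta * r / pa)
    return score
```

The dictionary grows only with visited contexts, so the full K^{d_m} tree is never materialized.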

Now, we can also try connected improved LR-Exp.3 instances: an action is chosen according to:

P[a_t = a_j] = W_{a_j} / (Σ_{i=1}^K W_{a_i})

where the W are the weights of the edges that spring out of the last node left. The reinforcement of the rewarded path then becomes (still assuming d_m = 4):


Algorithm 5 LR-Exp.3 reinforcement of the path from depth d

inputs:
    the last 4 actions visited [a_1, a_2, a_3, a_4]
    the weights of the edges that spring out of the last 4 nodes left [W_0, W_1, W_2, W_3]
    the depth d
    the parameters µ, η, β, ε
    the empirical reward R̂(a_4, d)
compute the reward: r = r([a_1, a_2, a_3, a_4])
set the dopamine level: D = r − R̂(a_4, d)
if D < 0, set η̃ = ε·η and µ̃ = µ^{1/K}; otherwise set η̃ = η and µ̃ = µ
update the empirical reward R̂(a_4, d) ← (1 − µ̃) · R̂(a_4, d) + µ̃ · r
for i = 1 : 4 do    ▷ here the parameters do not depend on the depth
    p ← (1 − β) W_{i−1}(a_i) / (Σ_{j=1}^4 W_{i−1}(a_j)) + β/4
    W_{i−1}(a_i) ← W_{i−1}(a_i) exp( η̃ D / p )
end for

8.2 Results

The following figures show the regret curves given by both algorithms. The naive algorithm has difficulty finding the optimal sequence, while improved LR-Exp.3 reactively finds it (when a flat level is reached, the algorithm is looping over the optimal sequence). We try different depths and different numbers of actions. At the middle of each game we move the rules, and the optimal sequence becomes a new one.



Figure 13: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red) with 3 actions and depth d = 3, over a T = 2000 period. Only one sequence is effective; at the switching time t_s = 1000 we change the effective sequence. The red curve is a mean over 200 such games with fixed parameters. We test 200 sets of parameters for Exp.3, given by a Halton sequence on the unit square scaled by 1/10. The parameters of improved LR-Exp.3 are fixed: η = 0.2, µ = 0.1, β = 0.9. Right: the visited Halton sequence in the scaled square; the Y-axis is the parameter η, the X-axis the parameter β.


Figure 14: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red) with 4 actions and depth d = 4, over a T = 10000 period. Only one sequence is effective; at the switching time t_s = 5000 we change the effective sequence. The red curve is a mean over 50 such games with fixed parameters. We test 50 sets of parameters for Exp.3, given by a Halton sequence on the unit square scaled by 1/10. The parameters of improved LR-Exp.3 are fixed: η = 0.2, µ = 0.1, β = 0.9. Right: the visited Halton sequence in the scaled square; the Y-axis is the parameter η, the X-axis the parameter β.



Figure 15: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red) with 3 actions and depth d = 3, over a T = 3000 period. There are two effective sequences, one with r = 1 and one with r = 0.5. At the switching time t_s = 1500 we change the effective sequences. The red curve is a mean over 50 such games with fixed parameters. We test 50 sets of parameters for Exp.3, given by a Halton sequence on the unit square scaled by 1/10. The parameters of improved LR-Exp.3 are fixed: η = 0.2, µ = 0.1, β = 0.9. One can see that there are more chances to find an effective sequence here, which explains the better results of Exp.3. Right: the visited Halton sequence in the scaled square; the Y-axis is the parameter η, the X-axis the parameter β.

9 Conclusion

This algorithm seems to be more reactive when facing a switching strategy, and it also seems better suited to quickly finding sequences of actions. It could offer a possible explanation of the primate behavior described in figure 1 (if we assume that a neuron activates the following connected ones with a kind of reinforcement distribution).

It would be interesting to study this algorithm with a deeper, more theoretical approach.

References

[1] Schultz, W. (1999) The Reward Signal of Midbrain Dopamine Neurons. News in Physiological Sciences, Vol. 14, No. 6, 249-255.

[2] Schultz, W. (1997) Dopamine neurons and their role in reward mechanisms. Current Opinion in Neurobiology, Vol. 7, No. 2, 191-197.

[3] Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E. (2002) The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, Vol. 32, No. 1, 48-77.
