
Primate prediction of reward for the adversarial multi-armed bandit problem
Alexis JACQ
March 22, 2014

Contents
1 Introduction                                       2
2 Primate prediction of reward                       2
3 The Exp.3 algorithm facing a switching strategy    4
4 Network view of the Exp.3                          7
5 Exp.3 network with reward prediction               8
  5.1 Empirical learning of the reward               8
  5.2 Integrate-and-fire dopamine effect             8
  5.3 First algorithm                                9
  5.4 Evolution of dopamine level                    10
  5.5 It brings about a more reactive learning       11
6 Improvements                                       11
  6.1 Results                                        12
7 Long game                                          14
8 Learning sequences of actions                      15
  8.1 Network structure                              16
  8.2 Results                                        19
9 Conclusion                                         21

1 Introduction

We introduce an algorithm that solves adversarial multi-armed bandit (AMAB) problems with a reactive behavior when facing an adversary whose strategy moves over time. It is inspired by the Exp.3 algorithm. Like primates, this algorithm uses a prediction of rewards [1], but here the prediction is used to regulate the reinforcement impacts. We then show different applications where this algorithm seems to perform better than classical Exp.3 (especially for learning sequences of actions).

2 Primate prediction of reward

Dopamine seems to be involved in signaling "good" rewards [2]. Conditioning experiments with primates have shown that [1]:
(1) a primate gets a dopamine peak when it receives a good reward (figure 1, top);
(2) when the primate learns that a conditioned stimulus (CS) predicts a good reward, it gets a dopamine peak just after the CS, and the reward itself no longer causes a dopamine peak (figure 1, middle);
(3) if we do not reward a primate that has learned the CS, it gets a "negative" peak of dopamine (figure 1, bottom).
In the learning phase of these experiments, the primate has to learn, by association, a behavior after the CS in order to get the reward. This gives the context of our learning algorithm: there is a fixed number of different CS, and for each CS we want to learn the best action so as to maximize the following reward. But the supervisor can suddenly decide to change the rules of the game: he can move the action that leads to the best reward. The learning agent then has to drop its strategy for a new one. In this sense, the supervisor can be seen as an adversarial player that leads the game.


Figure 1: Changes in the output code of dopamine neurons for an error in the prediction of appetitive events. (Top) Before learning, a drop of appetitive fruit juice occurs in the absence of prediction, hence a positive error in the prediction of reward; the dopamine neuron is activated by this unpredicted occurrence of juice. (Middle) After learning, the conditioned stimulus predicts reward, and the reward occurs according to the prediction, hence no error in the prediction of reward; the dopamine neuron is activated by the reward-predicting stimulus but fails to be activated by the predicted reward (right). (Bottom) After learning, the conditioned stimulus predicts a reward, but the reward fails to occur because of a mistake in the behavioral response of the primate; the activity of the dopamine neuron is depressed exactly at the time when the reward would have occurred. The depression occurs more than 1 s after the conditioned stimulus without any intervening stimuli, revealing an internal representation of the time of the predicted reward. Neuronal activity is aligned on the electronic pulse that drives the solenoid valve delivering the reward liquid (top) or on the onset of the conditioned visual stimulus (middle and bottom). Each panel shows the peri-event time histogram and raster of impulses from the same neuron. Horizontal distances of dots correspond to real-time intervals; each line of dots shows one trial. The original sequence of trials is plotted from top to bottom.


3 The Exp.3 algorithm facing a switching strategy

The Exp.3 algorithm (see Algorithm 1) performs a stochastic exploration of the different possible actions. In this way, an adversarial player cannot deterministically predict the behavior of Exp.3. It is one of the best algorithms when each pair (action, adversarial action) yields a deterministic reward, and its speed of learning the best action is optimal: in [3], a matching lower bound shows that an average per-round regret of order O(T^{-1/2}) is the best achievable over a horizon T (and this rate is reached by the best variants of Exp.3).

Algorithm 1 Exp.3 during a learning period T
parameters: β, η ∈ (0,1]
initialization:
    W_a = 1  ∀ a ∈ a_1...a_K;                        ▷ Weights of actions
for t = 1 : T do
    pull action a_t following the distribution P[a_t = a_j] = (1 − β) W_{a_j} / Σ_{i=1}^{K} W_{a_i} + β/K;
    receive reward r_t = r(a_t, t);
    update the weight W_{a_t} = W_{a_t} exp( η r_t / (K · P[a_t = a_j]) );
end for
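For concreteness, the following short Python sketch implements one round of Algorithm 1 under the update rule above (this code is not part of the original report; the function, variable names and the toy reward function are ours):

    import numpy as np

    def exp3_step(weights, reward_fn, eta, beta, rng):
        """One round of Exp.3: sample an action, observe its reward, update its weight.

        weights   -- array of K positive action weights W_a (modified in place)
        reward_fn -- callable mapping the chosen action index to a reward
        """
        K = len(weights)
        # P[a_t = a_j] = (1 - beta) * W_j / sum_i W_i + beta / K
        probs = (1 - beta) * weights / weights.sum() + beta / K
        a = int(rng.choice(K, p=probs))
        r = reward_fn(a)
        # Multiplicative update of the pulled arm only
        weights[a] *= np.exp(eta * r / (K * probs[a]))
        return a, r

    # Toy usage: 10 arms, arm 3 pays +1 and every other arm pays -1
    rng = np.random.default_rng(0)
    w = np.ones(10)
    for t in range(1000):
        exp3_step(w, lambda a: 1.0 if a == 3 else -1.0, eta=0.3, beta=0.01, rng=rng)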

Now, if an agent is given a reward r for action a and −r for action b, and the rules then switch (−r for a and r for b), Exp.3 will take almost as long to forget action a as it took to learn it (as long as β is small). If β is large, the agent explores more and quickly detects a new good action, but it does not optimally exploit the best actions (figure 2).
Moreover, if an action a gives the best reward r during a long period, the weight W_a dominates and the selection probability of a becomes very close to 1, so the per-step reward impact η r_t / (K · P[a_t = a_j]) becomes very small. Thus, if the reward of the action turns to −r, the agent needs a period of the same length to forget a (figure 3).



Figure 2: Frequency of using action a (red) and action b (green) during a game. Before the switching time ts = 400, action a sends reward 1 and b sends −1; after ts, a sends −1 and b sends 1. There is a choice of 10 actions (including a and b). The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. Left: η = 0.3 and β = 0.01. Right: η = 0.3 and β = 0.


Figure 3: Frequency of using action a (red) and action b (green) during a game with a long period before the switching time ts = 1000. There is a choice of 10 actions (including a and b). We use the parameters η = 0.1 and β = 0.01. We can see that after ts the agent needs a period of 1000 steps to play action b more often than a, and a period of 2000 steps to definitely forget a.


We would like to build an algorithm that learns the reward given by each action and uses it to regulate the reward impact. Like the primate, it would predict the reward and use a null reward impact to stop the exponential growth of W_a. And to instantly forget an action it would experience a negative impact in front of a frustration: if an agent is given a reward R = r for action a and R = 0 for action b, and the rules then switch (R = 0 for a and R = r for b), we expect the agent to instantly forget a by frustration (it uses a negative impact even without a negative punishment R = −r).


4 Network view of the Exp.3

We can see the Exp.3 algorithm as a network in which a node representing the CS is connected to each node representing an action (see figure 4). The edges are oriented (CS→actions) and weighted according to the Exp.3 weights. When the CS node spikes, it makes one action node spike, chosen according to the Exp.3 distribution:
    P[a_j spikes | CS] = (1 − β) W_{a_j}(CS) / Σ_{i=1}^{K} W_{a_i}(CS) + β/K

The weight of the used edge is then updated according to the obtained reward r:

    W_{a_j}^{t+1}(CS) = W_{a_j}^{t}(CS) exp( η r / (K · P[a_j | CS]) )

We do not update the others:

    W_{a_i}^{t+1}(CS) = W_{a_i}^{t}(CS)   ∀ i ≠ j

(We can also see the reward as a node that reinforces the used edges when it spikes.)

Figure 4: A network representation of the Exp.3 algorithm: a CS node connected to the action nodes a1, a2, a3 by edges of weights W_{a_1}^t, W_{a_2}^t, W_{a_3}^t.


5 Exp.3 network with reward prediction

Let us come back to the evolution of primate dopamine. Imagine that the nodes in the above network are neurons (or groups of neurons). When the CS node spikes, it makes an action node spike; the agent performs the action and receives the corresponding reward. The reward then activates a reward node/neuron that causes a dopamine peak. This dopamine reinforces the edge between the CS and the chosen action. Now, if the agent can learn the corresponding reward, imagine that the CS also causes a dopamine peak at the learned level. We assume that this prior dopamine peak has an inhibitory effect on the reinforcement impact.
This process could produce a dopamine evolution like the one in figure 1. We can model it with the following network:

Figure 5: Network with regulated dopamine reinforcement. The D-node represents the dopamine circuit that reinforces the CS→action edge; D is inhibited (−) by the CS and activated (+) by the reward R.

5.1 Empirical learning of the reward

Let the agent learn the empirical reward given by action a after CS C_s in the following stochastic way:

    R̂(C_s, a) ← (1 − µ) R̂(C_s, a) + µ R

where R is the last reward given by (C_s, a) and µ ∈ (0,1] is a parameter.

5.2 Integrate-and-fire dopamine effect

The reinforcement impact must occur at the reward time. In practice, the agent should therefore also learn empirically the average reward time after the CS. Here we are in a multi-armed bandit context, so there is no delay between CS, action and reward.
When an action a is chosen after a CS, the agent (which has learned an empirical reward R̂) receives a reward R. This causes a dopamine level D = −R̂ (inhibition from the CS) + R (activation from R). The weight of the edge W_a is then reinforced as follows:

    W_a(CS) ← W_a(CS) exp( η D / (K · P[a_j | CS]) )

5.3 First algorithm

We can rewrite the Exp.3 algorithm with this modification:
Algorithm 2 Learning-reward Exp.3 (LR-Exp.3)
parameters: β, η, µ ∈ (0,1]
initialization:
    W_a = 1  ∀ a ∈ a_1...a_K;                        ▷ Weights of actions
    R̂(a) = 0  ∀ a ∈ a_1...a_K;                       ▷ Empirical reward of actions
for t = 1 : T do
    Pull action a_t following the distribution P[a_t = a_j] = (1 − β) W_{a_j} / Σ_{i=1}^{K} W_{a_i} + β/K;
    Receive reward r_t = r(a_t, t);
    Set the dopamine level D_t = r(a_t, t) − R̂(a_t);
    Update the weight W_{a_t} = W_{a_t} exp( η D_t / (K · P[a_t = a_j]) );
    Update the empirical reward R̂(a_t) ← (1 − µ) R̂(a_t) + µ r_t;
end for
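A minimal Python sketch of one round of Algorithm 2 follows; it reuses the Exp.3 sampling rule, replaces the reinforcement signal by the dopamine level, and adds the empirical-reward update (the names are ours and this is only an illustration, not the original implementation):

    import numpy as np

    def lr_exp3_step(weights, r_hat, reward_fn, eta, beta, mu, rng):
        """One round of LR-Exp.3 (Algorithm 2).

        weights -- array of K action weights W_a (modified in place)
        r_hat   -- array of K empirical rewards R_hat(a), initialised to 0 (modified in place)
        """
        K = len(weights)
        probs = (1 - beta) * weights / weights.sum() + beta / K
        a = int(rng.choice(K, p=probs))
        r = reward_fn(a)
        # Dopamine level: observed reward minus the learned prediction for this action
        D = r - r_hat[a]
        # Reinforce with D instead of r, so a fully predicted reward has no impact
        weights[a] *= np.exp(eta * D / (K * probs[a]))
        # The empirical reward tracks the recent rewards of this action (exponential average)
        r_hat[a] = (1 - mu) * r_hat[a] + mu * r
        return a, r, D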

5.4 Evolution of dopamine level

With such a reinforcement, the agent does not need punishment to forget an action. When the strategy moves, it gets a frustration D = 0 − R̂ that decreases the action weight. Moreover, if we initialize R̂ = 0, the first reward has a full impact and the following ones decrease to zero: for a fixed reward R (given by a CS and an action a), R̂ → R at the speed of (1 − µ)^t → 0 as t → ∞. Of course we could improve this speed by using a time-dependent parameter µ = 1/t (which brings classical concentration inequalities), but then the agent could not track a moving reward R when the adversary switches to a new strategy. Now, when it forgets the action a it also forgets the empirical reward R̂(a), because the new R equals zero (or another small value). Finally, the new good action b gives a large reward while its empirical reward R̂(b) is still small, so this new action is instantly reinforced. Figure 6 shows the evolution of the dopamine level given by LR-Exp.3 during the same game as in figures 2 and 3 (it also shows the evolution of the empirical rewards associated with each action):


Figure 6: Evolution of the dopamine level (blue) during a T = 1200 game. The strategy switch time is set at ts = 400. There is no punishment: before ts, action a sends reward 1 and b sends 0; after ts, a sends 0 and b sends 1. There is a choice of 10 actions (including a and b). The red curve shows the evolution of the empirical reward R̂(a) and the green one shows the evolution of R̂(b). The parameters are η = µ = 0.1 and β = 0.01. We can see that the evolution of the dopamine level has the same shape as the dopamine peaks of the primate at the reward time (figure 1).


5.5 It brings about a more reactive learning

Now, let us take a look at the frequency of the actions chosen by LR-Exp.3:

Figure 7: Frequency of using action a (red) and action b (green) during a game with LR-Exp.3. Before the switching time ts = 400, action a sends reward 1 and b sends reward 0; after ts, a sends 0 and b sends 1. There is a choice of 10 actions (including a and b). The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. The parameters are η = µ = 0.1 and β = 0.01. We can see that the algorithm detects the switch almost instantly and moves on to explore other actions, while Exp.3 needs a minimum time to forget.

6 Improvements

Since the reinforcement of the best action is limited by the inhibition from the empirical reward, the algorithm no longer needs the exploration term β/K to choose an action. But we keep β to normalize the reinforcement term (K · P[a_t = a_j]): otherwise the reinforcement at the beginning is too large. We will even use a large β to have a good upper bound on the reinforcement-normalization term (indeed, for a large β this bound is precisely β).
If the rules of the game have just moved, we are looking for a new best action, and we hope that this action has a small empirical reward compared to the reward it will send. The best would be to forget an empirical reward very quickly, so that the action is sure to be selected again when it becomes the best. To do that we can accelerate the empirical learning when D is negative (e.g. using µ̃ = µ^{1/K} seems to give good results).
In the same spirit, we can force the agent to forget an action that causes frustration (if this is a false alarm, then since the empirical reward is also forgotten, the action sends large dopamine peaks D and is quickly selected again). In other words, the learning agent does not like jokes. A simple idea is to add a real parameter ε > 1 that multiplies η when D is negative.
Finally, in all the above game figures there were 10 actions: one action sent a reward R = 1 and all the others were bad (R = −1 for classical Exp.3 and R = 0 for LR-Exp.3). But we would like to have similar results when each action sends a reward in [−1, 1] for Exp.3 and in [0, 1] for LR-Exp.3. The intuitive way is to use D = R − max_{a_i ∈ a_1..a_K} R̂(a_i) (frustration relative to the best empirical reward over all actions).
All these modifications lead us to the following algorithm:
Algorithm 3 Improved LR-Exp.3
parameters: β, η, µ ∈ (0,1], ε > 1
initialization:
    W_a = 1  ∀ a ∈ a_1...a_K;                        ▷ Weights of actions
    R̂(a) = 0  ∀ a ∈ a_1...a_K;                       ▷ Empirical reward of actions
for t = 1 : T do
    Pull action a_t following the distribution P[a_t = a_j] = W_{a_j} / Σ_{i=1}^{K} W_{a_i};
    Receive reward r_t = r(a_t, t);
    Set the dopamine level D_t = r(a_t, t) − R̂(a_t);
    If D_t < 0, set η̃ = ε η and µ̃ = µ^{1/K}; otherwise set η̃ = η and µ̃ = µ;
    Update the weight W_{a_t} = W_{a_t} exp( η̃ D_t / (K (1 − β) P[a_t = a_j] + β) );
    Update the empirical reward R̂(a_t) ← (1 − µ̃) R̂(a_t) + µ̃ r_t;
end for
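A minimal Python sketch of one round of Algorithm 3, with the asymmetric rates used when D < 0 and the β-normalised reinforcement term (here `eps` stands for the amplification parameter written ε above; all names are ours):

    import numpy as np

    def improved_lr_exp3_step(weights, r_hat, reward_fn, eta, beta, mu, eps, rng):
        """One round of improved LR-Exp.3 (Algorithm 3)."""
        K = len(weights)
        # The sampling distribution no longer contains an explicit exploration term
        probs = weights / weights.sum()
        a = int(rng.choice(K, p=probs))
        r = reward_fn(a)
        D = r - r_hat[a]
        if D < 0:
            # Frustration: larger reinforcement step and faster relearning of the prediction
            eta_t, mu_t = eps * eta, mu ** (1.0 / K)
        else:
            eta_t, mu_t = eta, mu
        # beta is kept only to bound the normalisation of the reinforcement term
        weights[a] *= np.exp(eta_t * D / (K * (1 - beta) * probs[a] + beta))
        r_hat[a] = (1 - mu_t) * r_hat[a] + mu_t * r
        return a, r, D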

6.1 Results

The following figures show the frequency of the actions chosen by improved LR-Exp.3, still using the T = 1200 game with a switch time at ts = 400.



Figure 8: Frequency of using action a (red) and action b (green) during a game with improved LR-Exp.3. Before the switching time ts = 400, action a sends reward 1 and b sends reward 0; after ts, a sends 0 and b sends 1. There is a choice of 10 actions (including a and b), and only one action gives a positive reward r = 1 at a time. The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. The parameters are η = 0.3, µ = 0.1, β = 0.9 and ε = 10.


Figure 9: Frequency of using action a (red) and action b (green) during a game with improved LR-Exp.3. There is a choice of 3 actions (including a and b). Before the switching time ts = 400, action a sends reward 1, b sends reward 0.5 and a third action sends reward 0.8; after ts, a sends reward 0.4, b sends reward 1 and the third action sends reward 0.5. The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. The parameters are η = 0.1, µ = 0.1, β = 0.9 and ε = 10.



Figure 10: Frequency of using action a (red) and action b (green) during a game with improved LR-Exp.3 (left) and classical Exp.3 (right). There is a choice of 3 actions (including a and b). Before the switching time ts = 400, action a sends reward 1, b sends reward 0.5 and a third action sends reward 0.8; after ts, a sends reward 0.7 (hard to detect), b sends reward 1 and the third action sends reward 0.5. The frequency is computed using a sliding window that smooths the action occurrences (0 or 1), averaged over 100 runs of the same game. The parameters for improved LR-Exp.3 are η = 0.1, µ = 0.1, β = 0.9 and ε = 10; the parameters for classical Exp.3 are η = 0.3, β = 0.01 (which seem to be the best for this particular problem).

7 Long game

We now ask: what about a longer game? All the above tests were on a game with only one change of the rules. But what happens if we play against a Markovian adversary that often moves the rules?
Here we present the results for a game of length T = 10000. There are 3 possible actions. At each time step t, one action gives a reward r = 1, another gives r = 7.6, and the last one gives r = 0. There are two Markovian adversaries, M1 and M2: M1 moves the index of the best action, M2 moves the index of the second action. If at some time the best and second actions have the same index, only the best action is taken into account and we use reward r = 1 if the agent chooses the corresponding index. When M1 or M2 decides to move the rules, it chooses a new action index at random among the two others. Their probability of staying in a state depends on the current index (for M1, P = 0.999 for each index; for M2, P = 0.999 for 1, 0.993 for 2, 0.995 for 3). The following figures compare the regret curves (the difference between the best possible score and the score obtained by the algorithm) for this game, for classical Exp.3 and improved LR-Exp.3. To be sure to span the possible parameters for Exp.3, we generate them using a Halton sequence, first in the unit rectangle and then in the "unit/10" rectangle (Halton sequences have a low discrepancy over a given space).
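For reference, a 2D Halton sequence (bases 2 and 3) can be generated with the standard radical-inverse construction sketched below; scaling the points by 0.1 gives the "unit/10" rectangle. This is generic code, not the report's original script, and the mapping of coordinates to η and β is an assumption:

    def radical_inverse(index, base):
        """Van der Corput radical inverse of `index` (>= 1) in the given base."""
        f, result = 1.0, 0.0
        while index > 0:
            f /= base
            result += f * (index % base)
            index //= base
        return result

    def halton_2d(n, scale=1.0):
        """First n points of the 2D Halton sequence (bases 2 and 3), scaled to [0, scale]^2.

        Here we use the first coordinate for beta and the second for eta (an assumption).
        """
        return [(scale * radical_inverse(i, 2), scale * radical_inverse(i, 3))
                for i in range(1, n + 1)]

    # 100 candidate (beta, eta) pairs in the unit rectangle, then in the unit/10 rectangle
    params_unit = halton_2d(100)
    params_tenth = halton_2d(100, scale=0.1)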



Figure 11: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red). Each curve is a mean over 20 such games with fixed parameters. We test 100 different parameter pairs for Exp.3, given by a Halton sequence on the unit rectangle. The parameters of improved LR-Exp.3 are fixed: η = 0.3, µ = 0.1, β = 0.9. Right: the visited Halton sequence in the unit rectangle; the Y-axis shows the parameter η, the X-axis the parameter β.


Figure 12: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red). Each curve is a mean over 20 such games with fixed parameters. We test 100 different parameter pairs for Exp.3, given by a Halton sequence on the unit/10 rectangle. The parameters of improved LR-Exp.3 are fixed: η = 0.3, µ = 0.1, β = 0.9. Right: the visited Halton sequence in the unit/10 rectangle; the Y-axis shows the parameter η, the X-axis the parameter β.

8 Learning sequences of actions


Imagine an agent that has to find and learn the best sequence of actions. It receives a reward only when it has just realized an effective sequence. Example: with 3 possible actions, the sequence [a,b,c,c] gives a reward r = 1, [a,b,a] gives r = 0.5, and every other sequence (including singletons) gives zero reward. We want the learner to be able to quickly find (and then to choose almost always) the best sequence [a,b,c,c]. Since we do not guide the agent with small (increasing) rewards for the sub-sequences, this problem is difficult to solve with simple Exp.3, and it becomes even harder if an adversary sometimes moves the rules. A small sketch of such a reward structure is given below.
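To make this reward structure concrete, here is a small hypothetical reward function in Python matching the example above (the function name and the suffix-matching convention are ours):

    def sequence_reward(history):
        """Reward of the last actions: positive only if the history ends with an effective sequence."""
        effective = {("a", "b", "c", "c"): 1.0, ("a", "b", "a"): 0.5}
        for seq, r in effective.items():
            if tuple(history[-len(seq):]) == seq:
                return r
        return 0.0

    # Only the most recent actions matter
    assert sequence_reward(["c", "a", "b", "c", "c"]) == 1.0
    assert sequence_reward(["a", "b"]) == 0.0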

8.1 Network structure

To solve such a problem we need to encode the possible sequences. For that, we set a maximal depth d_m. This depth gives the maximal length of sequence the agent can learn. Then, imagine that the agent can see its own actions: an action becomes a CS for the following one. Each action leads to a following action, just as in the network described in figure 4. In this way, we build a kind of reinforced random walk in a decision tree:

(Diagram: a decision tree in which each action node leads to the next action nodes through conditional weights W_{a_1}^t, W_{a_1|a_1}^t, W_{a_2|a_1}^t, W_{a_1|a_1,a_1}^t, ...)
Note (1): on this graph, the first node at the left represents any leaf of the tree: we repeat the tree recursively at each leaf. From a cyclic point of view, we could imagine that the tree is embedded in a surface of revolution. We can draw such a tree assuming depth 2 and only two different actions (with more nodes the drawing becomes unreadable):


(Drawing: the cyclic depth-2 tree over the two actions a1 and a2.)

Note (2): another (but equivalent) way to see it: we can also imagine a higher-order Markov chain on a clique. The order of the chain is the depth, and there is one node in the clique for each action.
In the decision tree, at each time step of the walk we remember the last d_m actions done. If these last actions realize an effective sequence (of length l < d_m), the agent is given the associated reward. When the agent receives a reward, it reinforces accordingly the last l edges used. The naive algorithm is to use connected classical Exp.3 learners to choose the following action and to reinforce a path. Thus, an action is chosen according to
    P[a_t = a_j] = (1 − β) W_{a_j} / Σ_{i=1}^{K} W_{a_i} + β/K

where the W are the weights of the edges that spring out of the last node left. The following algorithm describes such a reinforcement from a current depth d (assuming d_m = 4):


Algorithm 4 Exp.3 reinforcement of the path from depth d
inputs:
    the last 4 actions visited [a_1, a_2, a_3, a_4]
    the weights of the edges that spring out of the last 4 nodes left [W_0, W_1, W_2, W_3]
    the depth d
    the parameters η, β
compute the reward: r = r([a_1, a_2, a_3, a_4])
for i = 1 : 4 do
    d ← (d − 1) mod 4                                ▷ We can make the parameters depend on the depth
    p ← (1 − β(d)) W_{i−1}(a_i) / Σ_{j=1}^{4} W_{i−1}(a_j) + β(d)/4
    W_{i−1}(a_i) ← W_{i−1}(a_i) exp( η(d) r / p )
end for

Now, we can also try connected improved LR-Exp.3 learners: an action is chosen according to

    P[a_t = a_j] = W_{a_j} / Σ_{i=1}^{K} W_{a_i}

where the W are the weights of the edges that spring out of the last node left. The reinforcement of the rewarded path then becomes (still assuming d_m = 4):


Algorithm 5 LR-Exp.3 reinforcement of the path from depth d
inputs:
    the last 4 actions visited [a_1, a_2, a_3, a_4]
    the weights of the edges that spring out of the last 4 nodes left [W_0, W_1, W_2, W_3]
    the depth d
    the parameters µ, η, β, ε
    the empirical reward R̂(a_4, d)
compute the reward: r = r([a_1, a_2, a_3, a_4])
set the dopamine level: D = r − R̂(a_4, d)
if D < 0 set η̃ = ε η and µ̃ = µ^{1/K}; otherwise η̃ = η and µ̃ = µ
update the empirical reward R̂(a_4, d) ← (1 − µ̃) R̂(a_4, d) + µ̃ r
for i = 1 : 4 do                                     ▷ Here the parameters do not depend on the depth
    p ← (1 − β) W_{i−1}(a_i) / Σ_{j=1}^{4} W_{i−1}(a_j) + β/4
    W_{i−1}(a_i) ← W_{i−1}(a_i) exp( η̃ D / p )
end for
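One possible way to assemble the connected improved LR-Exp.3 learners is sketched below in Python: each context (the last actions, in the higher-order Markov-chain view of Note (2)) carries its own weight vector, the next action is drawn from the current context, and a reward reinforces the last d_m traversed edges as in Algorithm 5. The class name, the data layout, and the choice of conditioning on the last d_m − 1 actions are our assumptions, not the report's implementation:

    import numpy as np
    from collections import defaultdict, deque

    class SequenceLRExp3:
        """Improved LR-Exp.3 on the tree/chain of action histories (maximal depth d_m)."""

        def __init__(self, n_actions, depth=4, eta=0.2, beta=0.9, mu=0.1, eps=10.0, seed=0):
            self.K, self.d = n_actions, depth
            self.eta, self.beta, self.mu, self.eps = eta, beta, mu, eps
            self.rng = np.random.default_rng(seed)
            self.weights = defaultdict(lambda: np.ones(n_actions))  # context -> weights of outgoing edges
            self.r_hat = defaultdict(float)                         # (context, action) -> empirical reward
            self.path = deque(maxlen=depth)                         # last d_m traversed edges

        def act(self):
            # Context = the actions of the last d_m - 1 steps
            context = tuple(a for (_, a, _) in self.path)[-(self.d - 1):]
            w = self.weights[context]
            probs = w / w.sum()
            a = int(self.rng.choice(self.K, p=probs))
            self.path.append((context, a, probs[a]))
            return a

        def reinforce(self, reward):
            # Dopamine level from the prediction attached to the last traversed edge
            context, a, _ = self.path[-1]
            D = reward - self.r_hat[(context, a)]
            eta_t = self.eps * self.eta if D < 0 else self.eta
            mu_t = self.mu ** (1.0 / self.K) if D < 0 else self.mu
            self.r_hat[(context, a)] = (1 - mu_t) * self.r_hat[(context, a)] + mu_t * reward
            # Reinforce every edge of the remembered path with the same dopamine level
            for ctx, a_i, p in self.path:
                norm = self.K * (1 - self.beta) * p + self.beta
                self.weights[ctx][a_i] *= np.exp(eta_t * D / norm)

With 3 actions and a reward function like the sequence_reward sketch of the beginning of this section, calling act() and then reinforce(...) at every time step gives one possible loop in the spirit of the experiments below (up to the details we had to assume).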

8.2 Results

The following figures show the regret curves given by both algorithms. We can see that the naive algorithm has difficulty finding the optimal sequence, while the improved LR-Exp.3 reactively finds it (a flat level means that the algorithm is looping over the optimal sequence). We tried different depths and different numbers of actions. At the middle of each game we move the rules, and the optimal sequence becomes a new one.



Figure 13: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red) with 3 actions and depth d = 3. The game lasts T = 2000 time steps. Only one sequence is effective; at the switching time ts = 1000 we change the effective sequence. The red curve is a mean over 200 such games with fixed parameters. We test 200 different parameter pairs for Exp.3, given by a Halton sequence on the unit/10 rectangle. The parameters of improved LR-Exp.3 are fixed: η = 0.2, µ = 0.1, β = 0.9. Right: the visited Halton sequence in the unit/10 rectangle; the Y-axis shows the parameter η, the X-axis the parameter β.


Figure 14: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red) with 4 actions and depth d = 4. The game lasts T = 10000 time steps. Only one sequence is effective; at the switching time ts = 5000 we change the effective sequence. The red curve is a mean over 50 such games with fixed parameters. We test 50 different parameter pairs for Exp.3, given by a Halton sequence on the unit/10 rectangle. The parameters of improved LR-Exp.3 are fixed: η = 0.2, µ = 0.1, β = 0.9. Right: the visited Halton sequence in the unit/10 rectangle; the Y-axis shows the parameter η, the X-axis the parameter β.



Figure 15: Left: regret curves given by Exp.3 (blue) and improved LR-Exp.3 (red) with 3 actions and depth d = 3. The game lasts T = 3000 time steps. There are two effective sequences, one with r = 1 and one with r = 0.5; at the switching time ts = 1500 we change the effective sequences. The red curve is a mean over 50 such games with fixed parameters. We test 50 different parameter pairs for Exp.3, given by a Halton sequence on the unit/10 rectangle. The parameters of improved LR-Exp.3 are fixed: η = 0.2, µ = 0.1, β = 0.9. One can see that there are more chances to find an effective sequence, which explains the better results of Exp.3 here. Right: the visited Halton sequence in the unit/10 rectangle; the Y-axis shows the parameter η, the X-axis the parameter β.

9 Conclusion

This algorithm seems to be more reactive when facing a switching strategy, and it also seems better suited to quickly finding sequences of actions. It could offer a possible explanation of the primate behavior described in figure 1 (if we assume that a neuron activates the following connected ones with a kind of reinforcement distribution).
It would be interesting to study this algorithm with a deeper, more theoretical approach.

References
[1] Schultz, W. (1999) The Reward Signal of Midbrain Dopamine Neurons. News in Physiological Sciences, Vol. 14, No. 6, pp. 249-255.
[2] Schultz, W. (1997) Dopamine neurons and their role in reward mechanisms. Current Opinion in Neurobiology, Volume 7, Issue 2, pp. 191-197.
[3] Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E. (2002) The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, Vol. 32, No. 1, pp. 48-77.


