
Ideas of algorithm architectures for emotional
machine learning
Alexis JACQ
March 1, 2014

Contents

1 Preliminary comments
2 Motivation
3 First elements : the reinforcement learning context
4 Emotional learning
  4.1 Definition used
  4.2 Model
    4.2.1 Qualitative emotions
    4.2.2 Quantitative mixtures of emotions
5 Conclusion and forward
6 Appendix : Network of connected Exp.3
  6.1 Learning a sequence of actions
  6.2 When only leaves pull actions

1 Preliminary comments

This document is about a project I work on in my spare time. It is an old project and each year I improve it with the helpful knowledge I can extract from my studies and my readings.
I am aware that I lack references and that I make simplifications or hypotheses without good justification. I am actively looking for papers to confirm or challenge my assumptions.
I have no results (or just a few), hence the "ideas" in the title.

2 Motivation

Human learning is a large landscape with a foreground, a background and, in between, towns, forests and mountains. One can try to study each component one by one, but if you want to approach human behavior (or animal behavior, to be more humble) you probably need to take care of all of it in one glance.

The first step of learning, in a mammal's life, is an education taught by the family. During this phase, the attention of the learner depends almost entirely on his emotions : for example, the mother may get attention because of love, the big brother because of admiration and the father because of fear. Of course, more basic emotions from the body (like hunger, pain, comfort, etc.) have the same impact on attention. Note that this leads us to set definitions (see section 3).

Furthermore, the current emotion determines associations between different observations and between decisions and observations. In terms of reinforcement learning, the current emotion determines what kind of self-reward/self-punishment is going to be activated for a given observation. Such an effect is possible only if the learner can make links between his actions and his observations. An intuitive way to set this is to assume that the learner observes his own actions.

Reinforcement learning in neural networks seems to be one of the best approaches to model human learning, from Hebb's conjectures [1] up to more recent discoveries about neuroplasticity [2] and reward circuits [3].

3 First elements : the reinforcement learning context

The world is an adversarial context since it is always changing and, most of the time, depends on the agents' behavior. A learning agent in the world must choose actions in real time and converge as quickly as possible to the best choices. That is why I introduce here an adversarial multi-armed bandit context. The Exp.3 algorithm and its improvements (Exp.4, Exp.5...) successfully solve this kind of problem [4]. But here we also want to learn complex actions (like sequences of actions) and to make links between different actions, between different observations, and between actions and observations (for example, we hope to make an agent able to learn a sequence of actions by imitation).
The Exp.3 algorithm is a stochastic choice of an arm among a set of n arms a_i, i = 1..n, with probabilities P_i^t, i = 1..n, that evolve in time :

[Diagram : a set of arms a_1, a_2, a_3, ... with selection probabilities P_1^t, P_2^t, P_3^t, P_4^t, ... that evolve in time.]
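To make the update concrete, here is a minimal MATLAB sketch of a single Exp.3 learner ; it uses the same mixing rule p = (1 − β)W_i/ΣW + β/n and update W_i ← W_i·exp(ηr/p) as Algorithm 1 in the appendix, but the toy reward, the number of arms and the values of η and β below are arbitrary choices of mine :

n = 3;                       % number of arms
W = ones(1, n);              % arm weights
eta = 0.1; beta = 0.01;      % learning and exploration rates (arbitrary values)
for t = 1:1000
    P = (1 - beta) * W / sum(W) + beta / n;   % probabilities P_i^t
    a = find(rand <= cumsum(P), 1);           % sample an arm from P
    r = 2 * (a == 2) - 1;                     % toy reward : arm 2 pays 1, the others -1
    W(a) = W(a) * exp(eta * r / P(a));        % exponential update of the pulled arm
end
disp(P)                                       % P now concentrates on arm 2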


Then we can intuitively set a network of exp.3 that encodes a sequence of actions at time t (to
simplify the drawing I assume only 2 actions) :

[Diagram : a tree of Exp.3 nodes ; the root chooses a_1 or a_2 with probabilities P_1^t, P_2^t, and each node at depth d chooses the next action with conditional probabilities such as P^t_{1|1}, P^t_{2|1}, P^t_{1|1,1}, P^t_{2|1,1}, ...]
Note (1) : on this graph, the first node at the left represents any leaf of the tree : we repeat the tree recursively at each leaf. From a cyclic point of view, we could imagine that the tree is embedded in a surface of revolution. We can draw such a tree assuming depth = 2 and only two different actions (with more nodes the drawing becomes unreadable) :

[Diagram : the depth-2 tree over the two actions a_1, a_2, with the four depth-2 sequences a_11, a_12, a_21, a_22 drawn cyclically.]

Note (2) : another (but equivalent) way to see it : we can also imagine a higher-order Markov chain on a clique. The order of the chain is the depth, and there is one node in the clique for each arm.

If an action is pulled at each depth, the agent performs a sequence of actions. If it gets a reward at a given depth, the whole circuit from the root to the current node is reinforced following the Exp.3 rules at each node (we can play with a decreasing reinforcement while backtracking the path, or we can fix the Exp.3 parameters η and β at each depth). We obtain a naive algorithm that learns sequences of actions (figure 1).
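As a sketch of this naive algorithm, one possible MATLAB representation (my own choice, not specified in the text) stores one weight matrix per depth, with one row per encoded prefix of actions, and walks the tree with the Exp.3 probabilities :

D = 4;  n = 4;  beta = 0.1;                  % depth, actions per node, exploration rate
W = cell(1, D);
for d = 1:D, W{d} = ones(n^(d-1), n); end    % one row of weights per prefix of actions
prefix = 1;  seq = zeros(1, D);
for d = 1:D
    w = W{d}(prefix, :);
    P = (1 - beta) * w / sum(w) + beta / n;  % Exp.3 probabilities of this node
    seq(d) = find(rand <= cumsum(P), 1);     % pull an action at depth d
    prefix = (prefix - 1) * n + seq(d);      % move to the corresponding child node
end
disp(seq)   % one sampled sequence ; its reward then reinforces the path (see the appendix)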

[Plot : regret curve ; horizontal axis from 0 to 10000 time steps, vertical axis from −500 to 4000.]

Figure 1: A network of connected Exp.3 learning a sequence of 4 actions, with a choice among 4 different actions at each step. This is the regret curve : the cumulative reward the learner would get by always performing the goal sequence minus its actual cumulative reward ; as soon as the regret curve becomes constant, the learner has learned the correct sequence. The plot is an average over 100 regret curves. (See the appendix for more explanations about this result.)

Now, if only the leaf action (the last action of the sequence) is pulled and gives a reward as a single action (which reinforces the whole path), the network seems to react more sensitively than a simple Exp.3 (figure 2). Note that it could be interesting to make links with the Boltzmann machine [5] if, at a given depth d, we consider the sum

Σ_j h_j^t + P^t_{i|j,d} · v_i^t

where, at time t, v_i^t is the activity of the node coding for action i at depth d (the visible variable) and h_j^t is the activity of the node coding for action j at depth d − 1 (the hidden variable). In this way we may be able to build a kind of Exp.3-reinforced Boltzmann machine...

[Plot : cumulative reward curves for EXP3 and NetworkEXP3 ; horizontal axis from 0 to 1000, vertical axis from −50 to 400.]

Figure 2: A network of connected Exp.3 that pulls only the last action is faster than a simple Exp.3 at detecting an adversarial switch of strategy. The curves are the cumulative reward won playing against an enemy that always plays “1” up to time 333 and then always plays “2” up to the end. If the learner plays 1 against 1 or 2 against 2, he gets reward −1, otherwise he gets reward 1. The plot is an average over 100 such curves. (See the appendix for more explanations about this result.)

Let us go back to our subject : to be more general, we introduce the possibility of pulling the action a at depth d and time t with probability P_pull^t(a, d). This probability can also be reinforced by the reward, as the first step of the backtracking path, following the Exp.3 rules :
[Diagram : the same tree of Exp.3 nodes, with an additional pull node at each depth ; the action is pulled at depth 1 with probability P_pull^t(a, 1), at depth 2 with probability P_pull^t(a, 2), and so on.]
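One way to realise this pull probability (my reading of the drawing, not a prescription of the text) is to treat the pull decision at each depth as a two-armed Exp.3 choice between "pull" and "continue", whose weight is updated first when the reward backtracks along the path :

Wpull = [1 1];  beta = 0.1;  eta = 0.02;           % weights of [pull, continue] (arbitrary parameters)
Ppull = (1 - beta) * Wpull / sum(Wpull) + beta / 2;
pulled = rand <= Ppull(1);                         % pull the action with probability Ppull(1)
if pulled
    r = 1;                                         % example : reward observed after pulling
    Wpull(1) = Wpull(1) * exp(eta * r / Ppull(1)); % first step of the backtracking update
end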
Finally, the learner must be able to change these rules given its observations. A reasonable assumption is to consider the reward itself as an observation. We can add observations of actions in order to make an association between the action and the reward (there are certainly simpler ways to identify which action to reward, but I am tempted to hope that this one may lead to model drafts of self-consciousness...). We can induce such associations with an integrate-and-fire network whose inputs are the level of the reward (a float O_R ∈ [−1, 1] ; we take its absolute value for the integration) and the binary activations of the observed actions (O_A = 0 or 1, the observation of a sequence of actions A = a_1, ..., a_d up to a depth d ; this requires as many nodes as in the tree of action sequences to encode all of them : m = n + n^2 + n^3 + ... + n^D, where D is the maximal depth) :

[Diagram : the observation nodes O_{A1}, O_{A2}, ..., O_{Am} and the reward observation O_R feed integrate-and-fire nodes Σ ; when the node associated with A_i fires, it triggers "Reinforce A_i".]

If the integrate-and-fire node Σ(O_{A_i}, O_R) is activated, the corresponding sequence of actions A_i is reinforced with the reward coefficient O_R.
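A minimal sketch of this association step, assuming one integrate-and-fire node per observed sequence and a firing threshold that I fix arbitrarily at 1.5 :

m = 5;                       % number of observed action sequences
O_A = [0 1 0 0 0];           % binary observations of the sequences A_1..A_m
O_R = -0.7;                  % observed reward level in [-1, 1]
theta = 1.5;                 % firing threshold (arbitrary choice)
for i = 1:m
    if O_A(i) + abs(O_R) > theta                   % integrate-and-fire condition
        fprintf('reinforce sequence %d with reward %.2f\n', i, O_R);
        % here : apply the Exp.3 update along the path that encodes A_i
    end
end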
With such an algorithm, a learner can also learn from the actions of another agent if it can imagine the reward received by that agent. By imagine I mean that, given information such as communication about the other's reward, the learner can activate the same reward (but since this activation does not come from a direct observation of a reward, we can assume that such an imagined reward is less intense than a direct one). Another possibility : both the learner and the other agent win the same reward from the other agent's action sequence.
Note : I said above that at each depth (say the current depth d) there is only a possibility of pulling the current action (a_{i,d} at time t). If it is pulled, the learner observes the action at the currently observed depth d0. If it is not pulled, and the following action a_j at depth d + 1 is then pulled, the learner observes that action at the currently observed depth d0 : if we keep this assumption, we lose the information that improved on the single Exp.3 in figure 2.
To solve this problem we finally introduce an imaginary circuit : imagination is, according to the simplest and shortest definition, a learned logical sequence of false observations. When an action a_i is activated at depth d in the action-sequence tree, the same action a_i at the same depth d, corresponding to the same sequence, is activated in the observations (if in addition the action is actually pulled, the corresponding observation has two reasons to be activated, and an error (a wrong move) can be detected).

[Diagram : the decision tree (left, with its probabilities P_1^t, P^t_{1|1}, ...) linked to the observation nodes O_{A1}, ..., O_{A6} (right) ; activating an action in the tree activates the observation of the corresponding sequence.]

Then, a risk of wrong learning occurs if the learner gets a reward while he was only imagining an action : we can simply give imagined activations a level between 0 and 1 that is too small to reach the integrate-and-fire threshold even if the reward level is 1 (or −1). But, in real life with humans or animals, are such errors actually impossible ?
A last detail : it is also possible and relevant to explore the possibility of activating the reward itself with the same kind of imaginary circuit (as humans we can imagine different kinds of pleasure that follow sequences of actions). It is possible if, while we apply a reinforcement, we also reinforce a back-edge that leads to the observation of that reinforcement.

Once again, the imagined reward has a small level, and so has the imagined action activation : to make sense of this imagined observation we can accept integration and firing with a small threshold, together with the possibility of the above kind of error (does the imagination of a reward make sense if this "reward" has no reinforcement impact at all ?) :

[Diagram : a node a of the decision tree, its observation O_{Ai} and the reward observation R, connected through an integrate-and-fire node Σ ; the two reinforcement edges are labelled r1 and r2.]

• r1 leads to reinforcing the whole sequence up to a_i ; r2 leads to reinforcing the green edge (the back-edge towards the observation of the reward)
• the decision tree is taken at the depth of the sequence of actions A_i
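A small numerical illustration of the imaginary-level safeguard discussed above (the levels and the threshold are arbitrary values of mine) : an imagined action contributes too little to the integrate-and-fire node to trigger a reinforcement, even with a maximal reward.

theta = 1.5;          % integrate-and-fire threshold (same arbitrary value as before)
imag_level = 0.3;     % activation level of an imagined action, well below 1
O_R = 1.0;            % a real reward arrives
real_input = 1.0 + abs(O_R);          % really observed action : 2.0 > theta, the node fires
imag_input = imag_level + abs(O_R);   % imagined action only   : 1.3 < theta, no wrong reinforcement
fprintf('real: %.1f (fires %d)  imagined: %.1f (fires %d)\n', ...
        real_input, real_input > theta, imag_input, imag_input > theta);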

4 Emotional learning

4.1 Definition used

"Emotion : 1. A person's internal state of being and involuntary physiological response to an object or a situation, based on or tied to physical state and sensory data." (emotion, Wiktionary)

According to this definition, by emotion I mean a state of behavior, caused by typical observations, in which rewards, observations and decisions are distorted. That is why I take into account basic sensations like hunger as well as abstract feelings like love.

Observations are distorted : when I am afraid, I will probably be more attentive to movements, shapes, etc. If I am in love with a girl, I will be better at distinguishing her in the crowd of a full amphitheater (but I will not hear the maths teacher's theorem). Finally, if I am hungry in the subway station, I will (despite myself) pay attention to every food advertisement.

Rewards are distorted : for hunger this is a trivial result : I feel more gratification eating something when I am hungry. Also, the first kiss of a love story often brings more pleasure than the next ones... When I am sad, a joke from a friend gives less joy than when I am happy. And we could find many similar phenomena for each kind of emotion.

Decisions are distorted : once again, this is obvious : if I have to choose between $1 and an apple, I will choose the apple if I am hungry and the coin if I am not. A more general observation : happy people say "yes !" and smile while sad people say "no..." and hang their heads.

Caused by typical observations : for hunger we are mainly talking about information from the body (sometimes visual or olfactory observations can induce hunger, but mostly they just induce the conscious realization of a hunger that we were already experiencing unconsciously...).

4.2 Model

Somehow, an emotional state biases observations, rewards and decisions, and observations can move the emotional state. In the real world, an emotion seems to be a quantitative value (we can be very hungry or just a little) and we generally feel a mixture of different emotions. But first, I will make the coarse assumption that we have just one qualitative emotion at a time.

4.2.1 Qualitative emotions

Let E1 and E2 be two different emotional states. A simple example is hunger and fear : we want to model a mouse that runs around a map where it finds candies. Sometimes candies are poisoned and the mouse becomes fearful. But when it starts getting too hungry, hunger dominates its fear and it tries the candies again. In this example the mouse learns just one action, but we can now imagine that it has to press a button to get food, and sometimes the button gives an electric shock...

In this particular case, we just need 4 kinds of observation : the reward from food, the punishment from the electric shock, the need for food, and the observation of action sequences. An intuitive simplification is to assume that observations of action sequences do not contribute to moving the emotional state.
How do observations change the emotional state ? The naive model that could work here is an integrate-and-fire network : E1 (fear) is activated when α1·Os (electric shock) + β1·Of (eating food) + γ1·On (need for food) exceeds a threshold, with α1 > 0, β1 ≤ 0 and γ1 ≤ 0. In the same way, E2 (hunger) is activated if α2·Os + β2·Of + γ2·On > threshold, with α2 < 0, β2 > 0 and γ2 > 0. Here Os, On and Of are the activation levels of the observation nodes (which may represent the level of a signal rather than a frequency of activation) :

[Diagram : the observation nodes Of, Os and On feed two integrate-and-fire nodes Σ, with weights (α1, β1, γ1) and (α2, β2, γ2) ; the first node activates E1, the second activates E2.]

With such signals, it is possible that several emotional states are activated at the same time. The solution is to choose the emotion that is activated with the biggest signal.
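As a sketch of this activation rule (the weights and observation levels below are illustrative, not fitted values), each emotion's signal is a weighted sum of the observations, and the emotion with the largest positive signal wins :

% Rows : emotions (1 = fear, 2 = hunger) ; columns : weights on [Os, Of, On].
A = [ 1.0, -0.5, -0.5 ;      % fear   : alpha1 > 0, beta1 <= 0, gamma1 <= 0
     -0.5,  1.0,  1.0 ];     % hunger : alpha2 < 0, beta2 > 0, gamma2 > 0
obs = [0.2; 0.1; 0.9];       % current levels of Os (shock), Of (food), On (need for food)
s = A * obs;                 % signal of each emotion
[smax, e] = max(s);          % keep the emotion activated with the biggest signal
if smax > 0
    fprintf('active emotion : %d (signal %.2f)\n', e, smax);   % here : emotion 2 (hunger)
end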
How does the emotional state disturb observations, rewards and decisions ? Here we assume that we are using the reinforcement structure described in section 3, which governs the processes between observations, rewards and decisions. Now, let us imagine that there is such a structure for each emotional state, with specific edge weights. Somehow, there is one "brain" for each emotion. The following drawing summarizes this idea (assuming only two emotions E1 and E2) :

[Diagram : two copies of the reinforcement structure (decision tree, observed actions, observed rewards, other observations), one for emotion E1 and one for emotion E2 ; the "other" observations move the emotional state through integrate-and-fire nodes Σ.]

Legend :
• Exp.3 node
• Σ : integrate-and-fire node
• deterministic edges with fixed weight
• probabilistic edges with weights that can be reinforced
• edges that induce a reinforcement
• edges that move the emotional state
• edges that induce "imaginations"
Of course, we could imagine exactly the same model with a single structure (instead of one per emotion) in which each edge carries one weight per emotion.

4.2.2 Quantitative mixtures of emotions

Now, we would like to model behaviors that do not depend on one qualitative emotion but on the contribution of many emotions. For example, our mouse should be able to feel 30% fear and 70% hunger. After the above section, the first idea is to look at a resulting state where the edge weights are a linear combination of their weights in each emotional state. That is especially intuitive since the integrate-and-fire model that activates the states returns a specific signal level for each emotion (we can assume a threshold equal to zero : if there is a positive signal s_i for emotion E_i we take E_i into account, but if s_i < 0 we do not count E_i). In other terms, if the integrate-and-fire network between observations and emotions returns signals s_1 for E_1, ..., s_n for E_n (assuming n different states), the resulting emotional state is given by :
E_tot = Σ_{i=1..n} s_i · 1{s_i > 0} · E_i

That means, for each edge e with weight w_e :

w_e = Σ_{i=1..n} s_i · 1{s_i > 0} · w_e^i

where the w_e^i are the weights of edge e in each emotional state i. Note that we are talking about edges that can be either probabilistic or deterministic (if it is an Exp.3 edge, the weight is the probability of the edge).
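In MATLAB terms, the resulting weight of an edge can be computed directly from the emotion signals and its per-emotion weights (a sketch with made-up numbers) :

s  = [0.9; -0.3; 0.4];        % signals s_i returned for the emotions E_1..E_n
we = [0.2, 0.7, 0.5];         % hypothetical weights w_e^i of one edge e in each emotional state
w  = we * (s .* (s > 0));     % w_e = sum_i s_i 1{s_i > 0} w_e^i  (here : 0.9*0.2 + 0.4*0.5 = 0.38)

If the edge is an Exp.3 edge, one may want to renormalise the combined weights of each node so that its probabilities still sum to 1 ; the text leaves this point open.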

5 Conclusion and forward

All these ideas are intuitive but naive : it is almost certain that implementation will bring a lot of problems (in particular complexity problems). Another important problem is the order of the processes : on the one hand, the algorithm performs a random walk in the decision tree that runs permanently and, on the other hand, it receives observations at the same time (the optimal way to implement this algorithm may be to use several processors : for each emotion, one processor for decisions and another for observations, plus a processor to compute the new/resulting emotional state).
This algorithm could be generalized : one idea would be to try a continuum between the decision tree and the observations. I mean that the random walk could take place over all the nodes and not only in the decision tree. In this way, the learning agent may be able to imagine any observation as well as any reward or any action.
Plasticity does not only depend on rewards : the simple fact of activating the same path of neurons several times induces a reinforcement. We could imagine that the random walk reinforces each step it takes (with small reinforcements) while rewards reinforce only the decision tree (with big reinforcements, like Exp.3).
Instead of Exp.3, it could also be interesting and maybe relevant to explore other multi-armed bandit algorithms like UCB or Thompson Sampling. I have not yet looked for papers about sequential learning, but I am sure that I could also improve my decision-tree structure.
I intend to implement this algorithm step by step. I have already started with small trials in MATLAB, but I want to implement it in a more basic and faster language, like C or C++.

This is an epigenetic algorithm. But to find good parameters (for example the fixed edge weights of the integrate-and-fire networks), a good approach may be to use a genetic algorithm with selection of the best agents and mutations.

References
[1] Hebb, D.O. (1949) The Organization of Behavior : A Neuropsychological Theory. New York : Wiley.
[2] Bear, M.F., Connors, B.W., Paradiso, M.A. (1996) Neuroscience : Exploring the Brain. Baltimore : Williams and Wilkins.
[3] Schultz, W., Dayan, P., Montague, P.R. (1997) A Neural Substrate of Prediction and Reward. Science, Vol. 275, pp. 1593–1599.
[4] Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E. (2002) The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, Vol. 32, No. 1, pp. 48–77.
[5] Ackley, D.H., Hinton, G.E., Sejnowski, T.J. (1985) A Learning Algorithm for Boltzmann Machines. Cognitive Science, Vol. 9, pp. 147–169.
[6] Dayan, P., Abbott, L.F. (2001) Theoretical Neuroscience : Computational and Mathematical Modeling of Neural Systems. The MIT Press.

6 Appendix : Network of connected Exp.3

Note : The following implementations are coded in MATLAB.

6.1 Learning a sequence of actions

Here we implement the network of the decision tree where the action is systematically pulled at each node. The maximal depth of the tree is D = 4 and each node has 4 sons. The possible actions are numbered from 1 to 4 and the goal sequence is 1 → 2 → 3 → 4. Each depth d is associated with the Exp.3 constants η(d) = (D − d)/100 and β(d) = 0.1 ∀d.
We walk randomly in the tree and at each depth we look at the vector of the last 4 actions. If this vector is [1 2 3 4] the reward is positive (r = 1). The reward is r = 0 for [X 2 3 4] (X ≠ 1), for [X Y 3 4] (Y ≠ 2) and for [X Y Z 4] (Z ≠ 3). All other vectors (for example [4 3 2 1]) give a negative reward (r = −1).
Then at each step we reinforce the path following the Exp.3 iteration :
Algorithm 1 Exp.3 reinforcement of the path from depth d
inputs :
  the last 4 actions visited [a_1, a_2, a_3, a_4]
  the weights of the edges that spring out of the last 4 nodes visited [W_0, W_1, W_2, W_3]
  the depth d

compute the reward : r = r([a_1, a_2, a_3, a_4])

for i = 1 : 4 do
  d ← (d − 1) mod 4
  p ← (1 − β(d)) · W_{i−1}(a_i) / Σ_{j=1..4} W_{i−1}(a_j) + β(d)/4
  W_{i−1}(a_i) ← W_{i−1}(a_i) · exp(η(d) · r / p)
end for
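A possible MATLAB rendering of this update (my own sketch ; the cell-array layout of the path weights is an implementation choice, not something fixed by the pseudocode) :

function W = exp3_reinforce_path(W, acts, d, r, eta, beta)
% Exp.3 reinforcement of the path ending at depth d (Algorithm 1).
%   W    : 1x4 cell array, W{i} holds the weights of the edges leaving the
%          i-th node of the path (W{1} = W_0, ..., W{4} = W_3)
%   acts : the last 4 actions visited [a1 a2 a3 a4]
%   d    : current depth ; eta, beta : per-depth parameter functions
for i = 1:4
    d = mod(d - 1, 4);                                      % backtrack one depth
    w = W{i};
    p = (1 - beta(d)) * w(acts(i)) / sum(w) + beta(d) / 4;  % probability of the taken edge
    w(acts(i)) = w(acts(i)) * exp(eta(d) * r / p);          % Exp.3 exponential update
    W{i} = w;
end
end

% Example of use (D = 4) :
%   eta  = @(d) (4 - d) / 100;   beta = @(d) 0.1;
%   W    = {ones(1,4), ones(1,4), ones(1,4), ones(1,4)};
%   W    = exp3_reinforce_path(W, [1 2 3 4], 4, 1, eta, beta);   % reward r = 1 for the goal sequence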

6.2 When only leaves pull actions

Here we make Exp.3 algorithms play the balanced game with the following payoff matrix (learner's gain, enemy's gain) :

        1        2
  1   -1, 1    1, -1
  2   1, -1    -1, 1

We simulate an enemy that always plays "1" and then switches with probability 0.001 to a state where it always plays "2", and so on back and forth (the enemy is a two-state Markov chain with a small switching probability). If the Exp.3 learner plays 1 against 1 or 2 against 2, he gets reward = −1, otherwise he gets reward = 1.
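For reference, a compact MATLAB sketch of this game for a single Exp.3 learner (the η and β below are one of the three settings compared next ; the network version connects three such Exp.3 as described below) :

T = 10000;  n = 2;                      % horizon and number of actions
eta = 0.01;  beta = 0.1;                % one of the three parameter settings compared below
W = ones(1, n);  enemy = 1;  gain = zeros(1, T);
for t = 1:T
    if rand < 0.001, enemy = 3 - enemy; end           % the enemy switches between 1 and 2
    P = (1 - beta) * W / sum(W) + beta / n;
    a = find(rand <= cumsum(P), 1);                   % learner's move
    r = 1 - 2 * (a == enemy);                         % -1 when matching, +1 otherwise
    W(a) = W(a) * exp(eta * r / P(a));
    gain(t) = r;
end
plot(cumsum(gain))                                    % cumulative reward curve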
Then we compare the cumulative gain obtained by 3 different Exp.3 (with different η and β parameters) and the network composed of these three Exp.3 :
• η = 0.1, β = 0.001
• η = 0.01, β = 0.1
• η = 1, β = 0

[Diagram : the network (*) built from these three Exp.3, each choosing between actions 1 and 2, with pull nodes as described in section 3.]

If we run such a game with a time horizon of 10000, we obtain the following result (an average over 100 different runs, but with the same enemy Markov chain) :

[Plot : cumulative gain curves for EXP3, networkEXP3, EXP3 eta*100 and EXP3 eta*10 ; horizontal axis from 0 to 10000, vertical axis from −6000 to 6000.]

Figure 3: The η and β values are those described above, with the same color code. The green curve is the cumulative reward obtained by the network (*).

This good result could be explained by a better sensitivity to variations at different time scales. We can zoom in on a 1000-step run with a single switch to better observe this phenomenon :
[Plot : cumulative reward curves for EXP3 and NetworkEXP3 ; horizontal axis from 0 to 1000, vertical axis from −50 to 400.]

Figure 4: A network of connected Exp.3 that pulls only the last action is faster than a simple Exp.3 at detecting an adversarial switch of strategy. The curves are the cumulative reward won playing against an enemy that always plays "1" up to time 333 and then always plays "2" up to the end. The plot is an average over 100 such curves. The blue curve has parameters η = 0.01 and β = 0.1, and the network's parameters are the same as in network (*).
