
# Ideas of algorithm architectures for emotional machine learning

Alexis JACQ

March 1, 2014

Contents

1. Preliminary comments
2. Motivation
3. First elements: the reinforcement learning context
4. Emotional learning
   - 4.1 Definition used
   - 4.2 Model
     - 4.2.1 Qualitative emotions
     - 4.2.2 Quantitative mixtures of emotions
5. Conclusion and forward
6. Appendix: Network of connected Exp.3
   - 6.1 Learning a sequence of actions
   - 6.2 When only leaves pull actions

## 1 Preliminary comments

This document is about a project I work on in my spare time. It is an old project, and each year I improve it with the knowledge I can extract from my studies and my readings. I am aware that I lack references and make simplifications or hypotheses without good justification. I am actively looking for papers to confirm or challenge my assumptions. I have no results (or just a few), hence the "ideas" in the title.

## 2 Motivation

Human learning is a large landscape with a foreground, a background and, in between, towns, forests and mountains. One can try to study each component one by one, but to approach human behavior (or animal behavior, to be more humble) you probably need to take care of all of it in one glance.

The first step of learning, in a mammal's life, is an education taught by family. During this phase, the attention of the learner depends almost entirely on his emotions: for example, the mother may get attention because of love, the big brother because of admiration, and the father because of fear. Of course, more basic emotions from the body (like hunger, pain, comfort, etc.) have the same impact on attention. Note that this leads us to set definitions (see section 4).

Furthermore, the current emotion determines associations between different observations, and between decisions and observations. In terms of reinforcement learning, the current emotion determines what kind of self-reward or self-punishment is going to be activated for a given observation. Such an effect is possible only if the learner can make links between his actions and his observations. An intuitive way to set this up is to assume that the learner observes his own actions.

Reinforcement learning in neural networks seems to be one of the best approaches to model human learning, from Hebb's conjectures [1] up to more recent discoveries about neuroplasticity [2] and reward circuits [3].

## 3 First elements: the reinforcement learning context

The world is an adversarial context, since it is always changing and most of the time depends on the agents' behavior. A learning agent in the world must choose actions in real time and converge as quickly as possible to the best choices. That's why I introduce here an adversarial multi-armed bandit context. The Exp.3 algorithm and its improvements (Exp.4, Exp.5, ...) successfully solve this kind of problem [4]. But here we also want to learn complex actions (like sequences of actions) and to make links between different actions, between different observations, and between actions and observations (for example, we hope to make an agent able to learn a sequence of actions by imitation).

The Exp.3 algorithm is a stochastic choice of an arm among a set of n arms $a_i$, $i = 1..n$, with probabilities $P_i^t$, $i = 1..n$, that evolve in time:

(Diagram: a single Exp.3 node whose arms $a_1, a_2, a_3, \dots$ are pulled with probabilities $P_1^t, P_2^t, P_3^t, \dots$)
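For concreteness, a single Exp.3 step can be sketched as follows (a Python sketch rather than the author's MATLAB; `exp3_step` and `pull_reward` are illustrative names, and the update follows the rule of Algorithm 1 in the appendix):

```python
import math
import random

def exp3_step(weights, eta, beta, pull_reward):
    """One Exp.3 iteration: mix the weight distribution with uniform
    exploration, sample an arm, then apply the importance-weighted
    exponential update to the pulled arm only."""
    n = len(weights)
    total = sum(weights)
    probs = [(1 - beta) * w / total + beta / n for w in weights]
    i = random.choices(range(n), weights=probs)[0]
    r = pull_reward(i)                        # observed reward in [-1, 1]
    weights[i] *= math.exp(eta * r / probs[i])
    return i, r
```

Repeated over time, the probabilities $P_i^t$ drift toward the rewarding arms, while the β term keeps every arm explored.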

Then we can intuitively set up a network of Exp.3 nodes that encodes a sequence of actions at time t (to simplify the drawing I assume only 2 actions):

(Diagram: a tree of Exp.3 nodes. At the root, action $a_1$ or $a_2$ is chosen with probability $P_1^t$ or $P_2^t$; after $a_1$ the next action is chosen with conditional probabilities $P_{1|1}^t, P_{2|1}^t$, then $P_{1|1,1}^t, P_{2|1,1}^t$, and so on; similarly $P_{1|2}^t, P_{2|2}^t$ after $a_2$.)

Note (1): on this graph, the first node at the left represents any leaf of the tree: we repeat the tree recursively at each leaf. From a cyclic point of view, we could imagine that the tree is embedded in a surface of revolution. We can draw such a tree assuming depth = 2 and only two different actions (with more nodes the drawing becomes unreadable):

(Diagram: the depth-2 tree over actions $a_1, a_2$, with second-level nodes $a_{11}, a_{12}, a_{21}, a_{22}$, drawn on a cycle so that each leaf loops back to the root.)

Note (2): another (but equivalent) way to see it: we can also imagine a higher-order Markov chain on a clique. The order of the chain is the depth, and there is one node in the clique for each arm.

If an action is pulled at each depth, the agent performs a sequence of actions. If it gets a reward at a given depth, the whole circuit from the beginning to the current node is reinforced following the Exp.3 rules at each node (we can play with a decreasing reinforcement while backtracking the path, or we can fix the Exp.3 parameters η and β at each depth). We obtain a naive algorithm that learns sequences of actions (figure 1).
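The walk down the tree can be sketched as follows (a Python sketch; `node_weights`, a lookup from the prefix of already-chosen actions to that node's Exp.3 weights, is a hypothetical accessor, not something defined in the text):

```python
import random

def sample_sequence(node_weights, beta, depth):
    """Sample a sequence of actions from the tree of connected Exp.3 nodes:
    at each depth, draw the next action from the Exp.3 distribution of the
    node reached by the prefix of actions already chosen."""
    seq = []
    for _ in range(depth):
        W = node_weights(tuple(seq))      # weights stored at the current node
        n = len(W)
        total = sum(W)
        probs = [(1 - beta) * w / total + beta / n for w in W]
        seq.append(random.choices(range(n), weights=probs)[0])
    return seq
```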

Figure 1: A network of connected Exp.3 learning a sequence of 4 actions, with a choice among 4 different actions at each step. This is the regret curve: the cumulative reward the learner would have obtained by always performing the goal sequence, minus its actual cumulative reward. As soon as the regret curve becomes constant, the learner has learned the right sequence. This plot is an average over 100 regret curves. (See the appendix for more explanations about this result.)

Now, if only the leaf action (the last action of the sequence) is pulled and gives a reward as a single action (which reinforces the whole path), the network seems to react more sensitively than a simple Exp.3 (figure 2). Note that it could be interesting to make links with the Boltzmann machine [5] if, at a given depth d, we consider the sum:

$$\sum_j P^t_{i|j,d}\, h^t_j\, v^t_i$$

where, at time t, $v^t_i$ is the activity of the node coding for action i at depth d (the visible variable) and $h^t_j$ the activity of the node coding for action j at depth d − 1 (the hidden variable). In this way we may be able to build a kind of Exp.3-reinforced Boltzmann machine...

Figure 2: A network of connected Exp.3 that pulls only the last action is faster than a simple Exp.3 at detecting an adversarial switch of strategy. The curves (legend: EXP3, NetworkEXP3) are the cumulative reward won playing against an enemy that always plays "1" up to time 333 and then switches to always playing "2" until the end. If the learner plays 1 against 1 or 2 against 2, he gets reward −1; otherwise he gets reward 1. The plot is an average over 100 such curves. (See the appendix for more explanations about this result.)

Let's go back to our subject: to be more general, we introduce the possibility of pulling the action a at depth d and time t with probability $P^t_{pull}(a, d)$. This probability can also be reinforced by the reward, as the first step of the backtracking path, following the Exp.3 rules:

(Diagram: the same decision tree as above, with additional pull edges: at depth 1 the current action is pulled with probability $P^t_{pull}(a, 1)$, at depth 2 with $P^t_{pull}(a, 2)$, and so on.)

Finally, the learner must be able to change these rules given its observations. A reasonable assumption is to consider even the reward as an observation. We can add observations of actions so as to make the association between the action and the reward (there are certainly simpler ways to identify which action to reward, but I am tempted to hope that this one may lead to model drafts of self-consciousness...). We can induce such associations with an integrate-and-fire network whose inputs are the level of the reward (a real number $O_R \in [-1, 1]$; we take its absolute value for the integration) and the binary activation values of the observed actions ($O_A = 0$ or $1$, the observation of a sequence of actions $A = a_1, \dots, a_d$ up to a depth d; this implies as many nodes as in the tree of action sequences to encode all of them: $m = n + n^2 + n^3 + \dots + n^D$, where D is the maximal depth):

(Diagram: observation nodes $O_{A_1}, O_{A_2}, \dots, O_{A_m}$ and the reward observation $O_R$ each feed an integrate-and-fire node Σ; the node for $O_{A_i}$ fires "Reinforce $A_i$".)

If the node $\Sigma(O_{A_i}, O_R)$ (an integrate-and-fire node) is activated, the corresponding sequence of actions $A_i$ is reinforced by the reward coefficient $O_R$.
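A minimal sketch of this association step (Python; the threshold value and the list encoding of the observation nodes are assumptions for illustration, not taken from the text):

```python
def fired_associations(o_actions, o_reward, threshold=1.5):
    """Integrate-and-fire association: each node sums the binary action
    observation O_A with the absolute reward level |O_R|; the sequences
    whose node crosses the threshold are to be reinforced by O_R."""
    return [i for i, o_a in enumerate(o_actions)
            if o_a + abs(o_reward) > threshold]
```

For instance, `fired_associations([1, 0, 1], -0.8)` marks sequences 0 and 2 for reinforcement by the (here negative) reward coefficient.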

With such an algorithm, a learner can also learn from the actions of another agent if it can imagine the reward received by the other agent. By imagine I mean that, given information such as communication about the other's reward, the learner can activate the same reward (but since this activation did not come from a direct observation of a reward, we can assume that such an imagined reward is less intense than a direct one). Another possibility: both the learner and the other agent win the same reward from the other's action sequence.

Note: I said above that at each depth (say the current depth d) there is only a possibility of pulling the current action ($a_{i,d}$ at time t). If it is pulled, the learner observes the action at the current observed depth d′. If it is not pulled, and the following action $a_j$ at depth d + 1 is then pulled, the learner observes that action at the current observed depth d′: if we keep this assumption, we lose the information that improved on the single Exp.3 in figure 2.

To solve this problem we finally introduce an imaginary circuit: imagination is, according to the simplest and shortest definition, a learned logical sequence of false observations. When an action $a_i$ is activated at depth d in the action-sequence tree, the same action $a_i$ at the same depth d, corresponding to the same sequence, is activated in the observations (if in addition the action is pulled, the corresponding observation has two reasons to be activated, and an error, i.e. a wrong move, can be detected).

(Diagram: the decision tree on the left, the observation nodes $O_{A_1}, O_{A_2}, \dots$ on the right; imagination edges link each action node of the tree to the corresponding observation node.)

Then, a risk of wrong learning occurs if the learner gets a reward while he was just imagining an action: we can simply give imagined activations a level between 0 and 1 that is too small for the integrate-and-fire threshold even if the reward level is 1 (or −1). But, in real life with humans or animals, are such errors actually impossible?

A last detail: it is also possible and relevant to explore activating the reward with the same kind of imaginary circuit (as humans we can imagine the different kinds of pleasure that follow sequences of actions). This is possible if, whenever we apply a reinforcement, we also reinforce a back-edge that leads to the observation of that reinforcement.

Once again, the imagined reward has a small level, and so has the imagined action activation: to make sense of this imagined observation we can accept integration and firing with a small threshold, together with the possibility of the above kind of error (does the imagination of a reward make sense if this "reward" has no reinforcement impact at all?):

(Diagram: an action a in the decision tree, its observation $O_{A_i}$, an integrate-and-fire node Σ and the reward observation R, with two reinforcement edges r1 and r2.)

- r1 leads to reinforcing the whole sequence up to $a_i$; r2 leads to reinforcing the green edge (the back-edge to the observation of the reward).
- The decision tree is taken at the depth of the sequence of actions $A_i$.

## 4 Emotional learning

### 4.1 Definition used

"Emotion: 1. A person's internal state of being and involuntary physiological response to an object or a situation, based on or tied to physical state and sensory data." (emotion, Wiktionary)

According to this definition, by emotion I mean a state of behavior, caused by typical observations, in which rewards, observations and decisions are distorted. That's why I take into account basic sensations like hunger as well as abstract feelings like love.

Observations are distorted: when I am afraid, I will probably be more attentive to movements, shapes, etc. If I'm in love with a girl, I will be better at distinguishing her in the crowd of a full amphitheater (but I will not hear the maths teacher's theorem). Finally, if I'm hungry in the subway station, I will (despite myself) pay attention to every food advertisement.

Rewards are distorted: for hunger, this is a trivial result: I feel more gratification eating something when I'm hungry. Also, the first kiss of a love story often brings more pleasure than the next ones... When I'm sad, a joke from a friend gives less joy than when I'm happy. And we could find many similar phenomena for each kind of emotion.

Decisions are distorted: once again, that's obvious: if I have to choose between $1 and an apple, I will choose the apple if I'm hungry and the coin if I am not. And a more general result: happy people say "yes!" and smile, while sad people say "no..." and hang their heads.

Caused by typical observations: for hunger we are mainly talking about information from the body (sometimes visual or olfactory observations can induce hunger, but mostly they just induce the conscious realization of a hunger we were already experiencing unconsciously...).

### 4.2 Model

Somehow, an emotional state biases observations, rewards and decisions, and observations can move the emotional state. In the real world, an emotion seems to be a quantitative value (we can be very hungry or just a little) and we generally feel a mixture of different emotions. But first, I will make the coarse assumption that we have just one qualitative emotion at a time.

#### 4.2.1 Qualitative emotions

Let E1 and E2 be two different emotional states. A simple example is hunger and fear: we want to model a mouse that runs around a map where it finds candies. Sometimes a candy is poisoned and the mouse becomes fearful. But when it starts getting too hungry, hunger dominates its fear and it tries the candies again. In this example, the mouse learns just one action, but we can now imagine that it has to press a button to get food, and sometimes the button gives an electric shock...

In this last particular case, we just need 4 kinds of observation: reward from food, punishment from the electric shock, need for food, and observation of action sequences. An intuitive simplification is to assume that observations of action sequences do not contribute to moving the emotional state.

How do observations change the emotional state? A naive model that could work here is an integrate-and-fire network: E1 (fear) is activated when $\alpha_1 O_s$ (electric shock) $+\ \beta_1 O_f$ (eating food) $+\ \gamma_1 O_n$ (need for food) exceeds a threshold, with $\alpha_1 > 0$, $\beta_1 \le 0$ and $\gamma_1 \le 0$. In the same way, E2 (hunger) is activated if $\alpha_2 O_s + \beta_2 O_f + \gamma_2 O_n$ exceeds the threshold, with $\alpha_2 < 0$, $\beta_2 > 0$ and $\gamma_2 > 0$. Here, $O_s$, $O_n$ and $O_f$ are the activation levels of the observation nodes (which may represent the level of a signal rather than a frequency of activation):

(Diagram: observation nodes $O_f$, $O_s$, $O_n$ feed two integrate-and-fire nodes Σ with weights $\beta_1, \alpha_1, \gamma_1$ and $\beta_2, \alpha_2, \gamma_2$; one Σ activates E1, the other activates E2.)

It is possible that the signals activate several emotional states at once. The solution is to choose the emotion activated with the biggest signal.
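This activation-and-selection rule can be sketched as follows (Python; the dictionary encoding of observations and per-emotion weights is an assumption for illustration):

```python
def active_emotion(obs, emotion_weights, threshold=0.0):
    """Compute each emotion's integrate-and-fire signal as a weighted sum
    of observation levels, then return the index of the emotion with the
    biggest signal above the threshold (or None if no emotion fires)."""
    signals = [sum(w[k] * obs[k] for k in w) for w in emotion_weights]
    best = max(range(len(signals)), key=signals.__getitem__)
    return best if signals[best] > threshold else None
```

With fear weighting the shock positively and hunger weighting the need for food positively, an observed shock selects fear and a strong need selects hunger.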

How does the emotional state disturb observations, rewards and decisions? Here we assume that we are using the reinforcement structure described in section 3, which governs the processes between observations, rewards and decisions. Now, let us imagine that there is such a structure for each emotional state, with specific edge weights. Somehow, there is one "brain" per emotion. The following drawing summarizes this idea (assuming only two emotions E1 and E2):

(Diagram: two copies of the reinforcement structure, labeled Emotion 1 and Emotion 2. Each has its own decision tree, observations of actions, observations of rewards, other observations, and integrate-and-fire nodes Σ; edges between the two copies move the emotional state between E1 and E2.)

Legend:

- Exp.3 node
- Σ : integrate-and-fire node
- deterministic edges with fixed weight
- probabilistic edges whose weight can be reinforced
- edges that induce a reinforcement
- edges that move the emotional state
- edges that induce "imaginations"

Of course, we could imagine exactly the same model with a single structure (rather than one per emotion) in which each edge has one weight per emotion.

#### 4.2.2 Quantitative mixtures of emotions

Now, we would like to model behaviors that depend not on one qualitative emotion but on the contribution of many emotions. For example, our mouse should be able to feel 30% fear and 70% hunger. After the above section, the first idea is to look at a resulting state where the edge weights are a linear combination of their weights in each emotional state. That is especially intuitive since the integrate-and-fire model that activates the states returns a specific signal level for each emotion (we can assume a threshold equal to zero: if there is a positive signal $s_i$ for emotion $E_i$ we take $E_i$ into account, but if $s_i < 0$ we don't count $E_i$). In other terms, if the integrate-and-fire network between observations and emotions returns signals $s_1$ for $E_1$, ..., $s_n$ for $E_n$ (assuming n different states), the resulting emotional state is given by:

$$E_{tot} = \sum_{i=1}^{n} s_i\, \mathbf{1}_{\{s_i > 0\}}\, E_i$$

That means, for each edge e with weight $w_e$:

$$w_e = \sum_{i=1}^{n} s_i\, \mathbf{1}_{\{s_i > 0\}}\, w_e^i$$

where $w_e^i$ is the weight of edge e in emotional state i. Note that these edges can be either probabilistic or deterministic (if it's an Exp.3 edge, the weight is the probability of the edge).
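As a small sketch of this mixture (Python; the dict-of-edges representation is an assumption for illustration):

```python
def mixed_weights(signals, per_emotion_weights):
    """Resulting edge weights: every emotion i with positive signal s_i
    contributes s_i * w_e^i to the weight of each edge e; emotions with
    non-positive signal are ignored (the indicator 1_{s_i > 0})."""
    mixed = {}
    for s, weights in zip(signals, per_emotion_weights):
        if s <= 0:
            continue
        for edge, w in weights.items():
            mixed[edge] = mixed.get(edge, 0.0) + s * w
    return mixed
```

If the mixed weights of an Exp.3 node are meant to be probabilities, they would still need to be renormalized afterwards.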

## 5 Conclusion and forward

All these ideas are intuitive but naive: it is almost certain that an implementation will raise many problems (in particular complexity problems). Another important problem is the order of the processes: on the one hand, the algorithm performs a random walk in the decision tree that runs permanently; on the other hand, it receives observations at the same time (the optimal way to implement this algorithm may be to use several processors: for each emotion, one processor for decisions and another for observations, plus a processor to compute the new, resultant emotional state).

This algorithm could be generalized: one idea would be to try a continuum between the decision tree and the observations. I mean that the random walk could range over all the nodes, and not only those of the decision tree. In this way, the learning agent might be able to imagine any observation as well as any reward or any action.

Plasticity does not only depend on rewards: the simple fact of activating the same path of neurons several times induces a reinforcement. We could imagine that the random walk reinforces each step it uses (with small reinforcements) while rewards reinforce only the decision tree (with big reinforcements, as in Exp.3).

Instead of the Exp.3 algorithms, it could also be interesting, and maybe relevant, to explore other multi-armed bandit algorithms like UCB or Thompson sampling. I didn't try to look for papers about sequential learning, but I'm sure I could also improve my decision-tree structure.

I intend to implement this algorithm step by step. I have already started with small trials in MATLAB, but I want to implement it in a more basic and faster language, like C or C++.

This is an epigenetic algorithm. But to find good parameters (for example the fixed edge weights of the integrate-and-fire networks), a good approach may be to use a genetic algorithm with selection of the best agents and mutations.

## References

[1] Hebb, D.O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York: Wiley.

[2] Bear, M.F., Connors, B.W., Paradiso, M.A. (1996). Neuroscience: Exploring the Brain. Baltimore: Williams and Wilkins.

[3] Schultz, W., Dayan, P., Montague, P.R. (1997). A Neural Substrate of Prediction and Reward. Science, Vol. 275, pp. 1593–1599.

[4] Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E. (2002). The Nonstochastic Multiarmed Bandit Problem. SIAM Journal on Computing, Vol. 32, No. 1, pp. 48–77.

[5] Ackley, D.H., Hinton, G.E., Sejnowski, T.J. (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, Vol. 9, pp. 147–169.

[6] Dayan, P., Abbott, L.F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. The MIT Press.

## 6 Appendix: Network of connected Exp.3

Note: the following implementations are coded in MATLAB.

### 6.1 Learning a sequence of actions

Here we implement the network of the decision tree where at each node the action is systematically pulled. The maximal depth of the tree is D = 4 and each node has 4 sons. The possible actions are written from 1 to 4 and the goal sequence is the order of actions 1 → 2 → 3 → 4. Each depth d is associated with the Exp.3 constants $\eta(d) = (D - d)/100$ and $\beta(d) = 0.1$ for all d.

We walk randomly in the tree, and at each depth we look at the vector of the 4 last actions. If this vector is [1 2 3 4] the reward is positive (r = 1). The reward is r = 0 for [X 2 3 4] (X ≠ 1), for [X Y 3 4] (Y ≠ 2) and for [X Y Z 4] (Z ≠ 3). All the other vectors (for example [4 3 2 1]) give a negative reward (r = −1).

Then at each step we reinforce the path following the Exp.3 iteration:

Algorithm 1: Exp.3 reinforcement of the path from depth d

inputs:
- the last 4 actions visited $[a_1, a_2, a_3, a_4]$
- the weights of the edges that spring out of the last 4 nodes left, $[W_0, W_1, W_2, W_3]$
- the depth d

compute the reward: $r = r([a_1, a_2, a_3, a_4])$

for i = 1 : 4 do
&nbsp;&nbsp;$d \leftarrow (d - 1) \bmod 4$
&nbsp;&nbsp;$p \leftarrow (1 - \beta(d)) \dfrac{W_{i-1}(a_i)}{\sum_{j=1}^{4} W_{i-1}(a_j)} + \beta(d)/4$
&nbsp;&nbsp;$W_{i-1}(a_i) \leftarrow W_{i-1}(a_i)\, \exp\!\big(\eta(d)\, r / p\big)$
end for
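A direct transcription of Algorithm 1 (in Python rather than the original MATLAB; actions are 0-indexed here, and the function names are illustrative):

```python
import math

def reinforce_path(actions, weight_vectors, d, reward, eta, beta):
    """Exp.3 update along the visited path: for each of the last 4
    (action, weight-vector) pairs, recompute the probability p of the
    chosen action and update its weight by exp(eta(d) * r / p), with
    depth-dependent eta and beta given as callables."""
    for a, W in zip(actions, weight_vectors):
        d = (d - 1) % 4
        total = sum(W)
        p = (1 - beta(d)) * W[a] / total + beta(d) / 4
        W[a] *= math.exp(eta(d) * reward / p)
```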

### 6.2 When only leaves pull actions

Here we make Exp.3 algorithms play the balanced game with the payoff matrix (row player's gain, column player's gain):

|   | 1    | 2    |
|---|------|------|
| 1 | −1,1 | 1,−1 |
| 2 | 1,−1 | −1,1 |

We simulate an enemy that always plays "1", then switches with probability 0.001 to a state where it always plays "2", and so on (the enemy is a two-state Markov chain with a small switching probability). If the Exp.3 learner plays 1 against 1 or 2 against 2, he gets reward −1; otherwise he gets reward 1.
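The enemy and the reward rule can be sketched as follows (Python; the placeholder learner policy is an assumption, where the real experiment plugs in an Exp.3 learner):

```python
import random

def play_game(horizon, learner, p_switch=0.001, seed=0):
    """Simulate the two-state Markov-chain enemy: it plays its current
    state (1 or 2) and switches with probability p_switch; the learner
    gets reward -1 on a match and +1 on a mismatch."""
    rng = random.Random(seed)
    state = 1
    rewards = []
    for _ in range(horizon):
        if rng.random() < p_switch:
            state = 2 if state == 1 else 1
        move = learner()                # e.g. an Exp.3 arm choice
        rewards.append(-1 if move == state else 1)
    return rewards
```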

Then we compare the cumulative gain obtained by 3 different Exp.3 instances (with different η and β parameters) and the network composed of these three Exp.3:

(Diagram (*): three Exp.3 nodes choosing between actions 1 and 2, with parameters η = 0.1, β = 0.001; η = 0.01, β = 0.1; and η = 1, β = 0, connected in a network where only the leaf pulls the action.)

If we run such a game with time horizon 10000 we obtain the following result (an average over 100 different runs, but with the same enemy Markov chain):

Figure 3: Cumulative rewards (legend: EXP3, networkEXP3, EXP3 eta*100, EXP3 eta*10). The η and β values are those described above, with the same color code. The green curve is the cumulative reward obtained by the network (*).

This good result could be explained by a better sensitivity to variations at different time scales. We can zoom in on a 1000-step run with one switch to observe this phenomenon better:

Figure 4: A network of connected Exp.3 that pulls only the last action is faster than a simple Exp.3 at detecting an adversarial switch of strategy. The curves (legend: EXP3, NetworkEXP3) are the cumulative reward won playing against an enemy that always plays "1" up to time 333 and then switches to always playing "2" until the end. The plot is an average over 100 such curves. The blue curve has parameters η = 0.01, β = 0.1, and the network's parameters are the same as in the network (*) above.