# IdrissHassineRapport PFE .pdf

À propos / Télécharger Aperçu

**IdrissHassineRapport_PFE.pdf**

Ce document au format PDF 1.5 a été généré par TeX / MiKTeX pdfTeX-1.40.13, et a été envoyé sur fichier-pdf.fr le 27/10/2015 à 16:34, depuis l'adresse IP 194.57.x.x.
La présente page de téléchargement du fichier a été vue 584 fois.

Taille du document: 679 Ko (57 pages).

Confidentialité: fichier public

### Aperçu du document

Ministère de l’Enseignement Supérieur et de la Recherche Scientifique

Université Tunis El Manar

Département Technologies de l’Information et de la Communication

Projet de Fin d’Etudes

présenté par

Idriss HASSINE

pour obtenir

le Diplôme National d’Ingénieur

en Informatique

Automatic image annotation using topic models

réalisé au sein de

LR-SITI-ENIT

Encadrant

Mr. Hamid AMIRI Professeur (ENIT)

Année Universitaire 2013/2014

Signatures

Encadrant Entreprise

Encadrant ENIT

Abstract

Image annotation is an important open problem in computer vision research. Because an image

is captured without caption, the goal of image annotation is to assign relevant text keywords from

a word vocabulary to the image, to describe its visual content. Image annotation may be useful for

many tasks such as object recognition, or text-based image retrieval. Therefore a lot of attention in

recent years has been paid to the design and development of systems allowing automatic annotation

of images.

Specifically, we are interested in topic models and thereby we evaluate their performance in image

annotation.

Keywords: Automatic Image Annotation (AIA), topic models.

Acknowledgements

This project would not have been possible without the help and support of many friends and

colleagues.

Foremost, I thank my advisor Mr.Hamid Amiri. He has been an exemplary teacher, mentor, and

collaborator. I also thank Mr. Marouan Haj Ayech for his insightful comments and suggestions.

I am fortunate to have interacted with a number of superb colleagues, have all been influential

to my project through collaboration, discussion, constructive criticism, and I would like to recognize

their contribution to this work.

My experience in LR-SITI-ENIT would not have been possible without the generous material aid

from officials of the research unit.

I thank all my friends and family for their support and distraction. I especially thank my parents

Habib Hassine and Faten Hacheni and sister Nesrine Hassine who have given me a lifetime of love

and care. Finally, I thank Mohamed Hassine my brother and my dear friend for his kindness, patience,

humor, friendship and support for me.

Dedicated to my parents, Faten Hacheni and Habib Hassine.

Contents

Introduction

1

1

3

3

3

4

4

5

5

6

State of the art

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2 Automatic image annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.3 Topic models for automatic image annotation . . . . . . . . . . . . . . . . . . . . .

1.3.1 Latent Dirichlet Allocation(LDA) . . . . . . . . . . . . . . . . . . . . . . .

1.3.2 Model Gaussian-Multinomial Mixture (GM-Mixture) . . . . . . . . . . . . .

1.3.3 Model Gaussian-Multinomial LDA (GM-LDA) . . . . . . . . . . . . . . . .

1.3.4 Correspondence Latent Dirichlet Allocation (CorrLDA) . . . . . . . . . . .

1.3.5 Supervised Latent Dirichlet Allocation with Multivariate Binary Response

Variables (sLDA-bin) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.3.6 Topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA) . . .

1.3.7 Max-Margin Latent Dirichlet Allocation (MMLDAa ) . . . . . . . . . . . . .

1.3.8 Correspondence CTM (corrCTM) . . . . . . . . . . . . . . . . . . . . . . .

1.4 Annotation Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.1.1 PASCAL VOC07 . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.1.2 Corel 5K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.1.3 LabelMe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.1.4 UIUC-Sport . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.2 Visual feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.2.1 Scale Invariant Feature Transform (SIFT) . . . . . . . . . . . . . .

1.4.2.2 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.3 The performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.3.1 Caption perplexity . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.3.2 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.3.3 F-measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4.3.4 Mean Average Precision (MAP) . . . . . . . . . . . . . . . . . . .

1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

7

8

9

9

9

9

9

9

10

10

10

10

10

11

11

12

12

12

2

3

Detailed study of used topic models

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .

2.2 Data representation . . . . . . . . . . . . . . . . . . . . . .

2.3 Framework of topic modeling . . . . . . . . . . . . . . . . .

2.3.1 Learning : . . . . . . . . . . . . . . . . . . . . . .

2.3.1.1 EM algorithm . . . . . . . . . . . . . . .

2.3.1.2 Variational EM algorithm . . . . . . . . .

2.3.1.3 Varitional Inference . . . . . . . . . . . .

2.3.2 Inference Algorithm . . . . . . . . . . . . . . . . .

2.4 Topic models for image annotation . . . . . . . . . . . . . .

2.4.1 Correspondence LDA (CorrLDA) . . . . . . . . . .

2.4.1.1 Variational Inference . . . . . . . . . . .

2.4.1.2 Maximizing Variational Parameters . . . .

2.4.1.3 Maximizing Model Parameters . . . . . .

2.4.1.4 Variational EM . . . . . . . . . . . . . . .

2.4.2 Correspondence Correlated Topic Model (CorrCTM)

2.4.2.1 Variational Inference . . . . . . . . . . .

2.4.2.2 Maximizing Variational Parameters . . . .

2.4.2.3 Maximizing Parameter Estimation . . . .

2.4.2.4 Learning algorithm . . . . . . . . . . . .

2.5 Comparison between LDA and CTM based topic models . .

2.5.1 Topic proportions θ . . . . . . . . . . . . . . . . .

2.5.2 Variational Parameters . . . . . . . . . . . . . . . .

2.5.3 Inference Process . . . . . . . . . . . . . . . . . . .

2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . .

Experiences and Results

3.1 Introduction . . . . . . . . . . . . . . . . . .

3.2 Technology and material environment . . . .

3.2.1 Hardware environment . . . . . . . .

3.2.2 Technology and software environment

3.3 Data configuration . . . . . . . . . . . . . .

3.4 Training models . . . . . . . . . . . . . . . .

3.5 Testing models . . . . . . . . . . . . . . . .

3.6 Test set likelihood . . . . . . . . . . . . . . .

3.7 Word Perplexity . . . . . . . . . . . . . . . .

3.8 Conclusion . . . . . . . . . . . . . . . . . .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

14

14

14

14

15

15

16

16

17

18

18

18

20

21

22

24

25

26

27

28

30

30

31

31

31

.

.

.

.

.

.

.

.

.

.

32

32

32

32

32

33

33

35

37

38

39

Conclusion and outlook

40

Bibliography

41

Appendix A : Indexing of textual words (Corel 5K)

42

Appendix B : LDA

44

List of Figures

1.1

1.2

1.3

1.4

1.5

Graphic model LDA (Blei 2003) . . . . . . . . . . . . . . . . . . . .

GM-Mixture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

GM-LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CorrLDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA)

.

.

.

.

.

4

5

6

6

8

2.1

2.2

General process for a topic model . . . . . . . . . . . . . . . . . . . . . . . . . . .

(Left) Graphical model representation of our modified CorrLDA. (Right) Graphical

model representation of the variational distribution used to approximate the posterior

in Corr-LDA. Note that the variational graphical model creates independency between

z,θ and y, which enables us to factorize the joint probability. . . . . . . . . . . . . .

(left) Proposed CorrCTM, (right) Variational distribution used to approximate the

posterior for each image in CorrCTM. . . . . . . . . . . . . . . . . . . . . . . . . .

15

2.3

3.1

3.2

3.3

3.4

3.5

3.6

3.7

3.8

Training CorrLDA . . . . . . . . . . . .

Different versions of the learning models .

Testing CorrLDA . . . . . . . . . . . . .

Result of testing CorrLDA . . . . . . . .

Annotation of CorrLDA . . . . . . . . . .

Test data set . . . . . . . . . . . . . . . .

The average test likelihood . . . . . . . .

Word Perplexity . . . . . . . . . . . . . .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

19

24

34

34

35

36

36

37

38

38

List of Tables

2.1

2.2

2.3

Symbol and Description (CorrLDA) . . . . . . . . . . . . . . . . . . . . . . . . . .

Parameter and Update Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Symbol and Description (CorrCTM) . . . . . . . . . . . . . . . . . . . . . . . . . .

18

22

25

3.1

Symbol and Description (LDA)

44

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

List of Algorithms

2.1

2.2

2.3

2.4

2.5

2.6

2.7

3.1

3.2

EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Variational EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . .

Inference Algorithm, note that r is a set of regions belonging to image I

Variational Inference for corrLDA . . . . . . . . . . . . . . . . . . . .

Parameter Estimation for corrLDA . . . . . . . . . . . . . . . . . . . .

Variational Inference for CorrCTM . . . . . . . . . . . . . . . . . . .

Parameter Estimation for CorrCTM . . . . . . . . . . . . . . . . . . .

Variational Inference for LDA . . . . . . . . . . . . . . . . . . . . . .

Parameter Estimation for LDA . . . . . . . . . . . . . . . . . . . . . .

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

16

16

17

23

23

29

30

45

46

List of Acronyms

LDA

GM-Mixture

GM-LDA

CorrLDA

sLDA-bin

Tr-mmLDA

MMLDAa

CTM

CorrCTM

sCTM-bin

trmmCTM

SIFT

R

P

mR

mP

MAP

p

Mult

N(.,.)

E-Step

M-Step

L

Unif

SVMs

maxiter

Γ

Dir

Eq

Latent Dirichlet Allocation

Gaussian-Multinomial Mixture model

Gaussian-Multinomial Latent Dirichlet Allocation

Correspondence Latent Dirichlet Allocation

Supervised Latent Dirichlet Allocation with Multivariate Binary Response

Topic regression Multi-Modal Latnet Dirichlet Allocation

Max-Margin Latent dirichlet Allocation

Correlated Topic Model

Correspondence CTM

Supervised CTM with Multivariate Binary Response

Topic regression Multi-Modal CTM

Scale-invariant feature transform

Recall

Precision

Mean recall

Mean precision

Mean Average Precision

Probability

Multinomial law

Normal law

Estimation Step

Maximization Step

Lower bound

Uniform law

Support Vector Machines

Maximum number of iterations

Gamma function

Dirichlet law

Posterior Expectation

Introduction

Background : Image annotation is an important open problem in computer vision research. Because a large amount of images are available without caption, the goal of image annotation is to

assign relevant text keywords from a word vocabulary to the image, to describe its visual content.

Image annotation can be applied in object recognition (propose keywords for image), or text-based

image retrieval (propose images from a combination of keywords). That is why a lot of attention in

recent years has been paid to the design and development of automatically annotating images.

Generally, an annotation algorithm is performed in two processes. The first offline process consists

in learning patterns of association between image and text from a set of annotated images. The second

online process allows to predict textual annotations for any unlabeled image. Previous approaches can

be mainly categorized into 2 types.

For the first type, a classifier is learned for each vocabulary word which is treated as "concept

class". This is a supervised learning model, and often SVMs or boosting classifiers are typically used.

In practice, this line of approaches would inevitably suffer from scalability problem, as the annotation

performance sharply decreases when the discriminative boundaries for large scale classifiers become

ambiguous.

In the case of second type, the joint probabilistic distribution between image feature and text word

is learned by a generative latent factor framework which assumes that each image is formed with a

small set of hidden factors, and these factors correspondingly associate with text words.

Contributions : In this project we will use the second approach, and we will consider the following

objectives:

– Implementing the most commonly used topic models .

– Evaluating annotation performance of each model .

– Implementing an interactive system for image annotation, which takes an image without annotation as input and generates an annotation as output composed of words from a fixed vocabulary.

The host laboratory : This work was performed in the LR-SITI-ENIT laboratory .

This laboratory is a part of the National Engineering School of Tunis. Research areas of the laboratory are various such as image processing, pattern recognition and computer graphics. In the area

of pattern recognition, LR-SITI-ENIT has developed new approaches for both 2D and 3D problems

of object segmentation and recognition. Implemented algorithms are used in various fields such as

the biomedical field and industry field which need machine learning.

The main work of LR-SITI-ENIT focuses on :

1

– The development of models and learning methods for prediction and classification.

– The proposed methods for segmentation and characterization of 2D and 3D images.

Application areas favored by Researchers are document images and medical images.

Report structure : This report is organized as follows. In chapter 1, we will describe the topic

models of the state of the art and some criteria adopted by researchers to evaluate their models.

In chapter 2, we will describe the general process of automatic image annotation using topic

models. Then, we will present the models that we used in our work.

In chapter 3, we will present the technology and materials environment used in our project. Then,

we will put some screen prints to describe the work performed. Finally, we will conduct a comparative

study between the implemented topic models by evaluating the perfermances of each model.

2

Chapter 1

State of the art

1.1

Introduction

Automatic image annotation is the ability to provide a set of words for a given image in order to

describe its visual content using a system without any intervention of the human being. This work is

the subject of extensive research on image processing.

In this chapter we will talk about the automatic image annotation then we will present some models

used to annotate images. The main role of these models is to provide the automatic annotation of

images. It is necessary to note that these models can perform other tasks such as annotating an image

using a single word region, disambiguating a word by using visual content, searching images...

Probabilistic models for automatic annotation consist in estimating for each word ω the posterior

probability :

– p(ω | b) where b describes the known information on the blob (for example, the vector that

presents the visual content of the blob) for a corresponding model.

– p(ω | d) where d describes the known image information (for example, the ensemble β d all the

visual vector image) .

In order to determine the ability of a system to annotate an image, we needed a corpus of images that

d . So, the aim of the system is to

are manually annotated previously by a bag of reference words WRe

f

estimate an annotation to a given picture that reflects the reality using a set of words in a specific way.

Firstly, we will briefly introduce the automatic annotation of images, then we will describe some

probabilistic models of the state of the art which are based essentially on the distribution of Dirichlet.

We will study the contribution of each model to image annotation compared to other models . Finally,

we will present the criteria used to measure performance, their characteristics and the necessary conditions to use .

1.2

Automatic image annotation

The objective of automatic annotation is to associate a set of words to a given image via a computer

system.

3

The principle of this method is:

– First, the computer system must learn to annotate images based on images annotated previously.

– Then, the system must be able to annotate a new image that did not have a text description.

For probabilistic models, automatic annotation consists in estimating the posterior probability :

– p (ω| i), where i describes the information generated from the image (for example, all the visual

vector image).

– If a segmented image is available, we must first estimate the posterior probability p (ω | b)

where b describes the information generated from an image region (for example, the vector

with the visual content of the image region).

1.3

Topic models for automatic image annotation

In this section, we will focus on the models based on Latent Dirichlet Allocation(LDA). Therefore,

first we introduce LDA and then, we will study LDA extensions to image annotation problem : GMMixture, CorrLDA,sLDA-bin, tr-mmLDA, MMLDAa and CorrCTM.

1.3.1

Latent Dirichlet Allocation(LDA)

In this part, we will focus on the models based on the Dirichlet distribution . The Dirichlet distribution estimates the probability vector Θd = (p1 , p2 , ... , pr ) where p j is the probability that the

concept (also called hidden or latent class) z j is in the document, depending on the frequency of occurrences of each concept α j in the document.

Figure 1.1: Graphic model LDA (Blei 2003)

Latent Dirichlet Allocation(LDA) is a probabilistic model proposed by Blei (Blei 2003). The

purpose of designing LDA is to model discrete data, in particular large collection of textual document

(see Figure 1.1 and the part B of the appendix).

The number of dimensions corresponds to the number of descriptors of the hidden concepts. Latent Dirichlet Allocation (LDA) is a generative probabilistic model of a corpus. It is a bayesian

hierarchical model that contains three levels and considers that words and concepts are interchangeable.

The main idea is that the documents are described as random mixtures over latent topics, where

each topic is defined by a distribution of words.

In the next section we will describe some models of the state of the art.

4

1.3.2

Model Gaussian-Multinomial Mixture (GM-Mixture)

GM-Mixture is a simple mixing model (Blei 2004) that does not use the LDA, but it is important

to describe in order to present the next models that are inspired from LDA.

GM-Mixture uses a single random variable Z, and therefore only one component z. It is generally

used to express the relationship between an image and its annotation. The strength of this model is

that it provides high associativity between the blobs and image caption.[1]

the principle of this model : we denote by r the number of dimension of GM-Mixture,the annotation for an image d follows the following steps:

1. Choose a value z from {1,2,...,r} according to p(Z| θ ) ∼ Mult(θ ).

2. For each blob b of the image:

– We predict that the visual features of b are selected according to p(Bd | z, µ1:r ,σ1:r )∼ N(µz , σz ).

– We predict that ω is selected according to p(Wd | z, β1:r )∼ Mult(βz ).

The following Figure (1.2) describes the process of this model.

Figure 1.2: GM-Mixture

1.3.3

Model Gaussian-Multinomial LDA (GM-LDA)

GM-LDA model extends the GM-MIXTURE model but uses LDA. The strength of this model is

that it provides high flexibility between the blobs and image caption. [1]

The principle of this model : we denote by r the number of dimension of GM-LDA, the visualtextual annotation follows the following generative process :

1. Choose proportions θ d according to p(θ | α) ∼ Mult(α).

n

o

2. For each p ∈ 1, 2, ..., nβ d

– Select a component z p from {1, 2, ..., r} according to p(Zd | θ d ) ∼ Mult(θ d ).

– Choose the vector visual b p according to p(Bd | z p , µ1:r , σ1:r ) ∼ N(µz p , σz p ).

For each j ∈ {1, 2, ..., nω d }

– Select a component y j from {1, 2, ..., r} according to p(Yd | θ d ) ∼ Mult(θ d ).

5

– Choose the word ω j according to p(Wd | y j , β1:r ) ∼ N(βy j ).

The following Figure (1.3) describes the process of this model.

Figure 1.3: GM-LDA

1.3.4

Correspondence Latent Dirichlet Allocation (CorrLDA)

This model inherits the flexibility of GM-LDA model and the associativity of the GM-Mixture

model. Presented by (Blei & Jordan, 2003), this model has a partial interchangeability. It begins by

generating blocks in a first phase and then the words in a second phase. [1]

The principle of this model is :

1. Choose proportions θ d according to p(θ | α) ∼ Dir(α).

n

o

2. For each p ∈ 1, 2, ..., nβ d

– Select a component z p from {1, 2, ..., r} according to p(Zd | θ d ) ∼ Mult(θ d ).

– Choose the vector visual b p according to p(Bd | z p , µ1:r , σ1:r ) ∼ N(µz p , σz p )

For each j ∈ {1, 2, ..., nω d }

n

o

– Select a component y j from {1, 2, ..., r} according to p(Yd | θ d ) ∼ Uni f ( 1, 2, ..., nβ d ).

– Choose the word ω j according to p(Wd | y j , β1:r ) ∼ Mult(βy j ).

The following Figure (1.4) describes the process of this model.

Figure 1.4: CorrLDA

6

1.3.5

Supervised Latent Dirichlet Allocation with Multivariate Binary Response

Variables (sLDA-bin)

Supervised LDA is modeled on LDA which employs hidden variables given a collection of documents (As a bag of words). LDA decomposes the distribution of word counts from each document

into contributions from K topics.In LDA, a document is modeled as a proportion topics (mixture of

topics). Each topic is a multinomial distribution over words. If the number of topics K is much smaller

than the size of the vocabulary, LDA could be regarded as effecting the reduction of dimensionality of

learning a small set of projection directions (subjects) that account for most correlations in the data.

Working with labeled documents or documents paired with their values associated response, the

purpose is to predict the response values given to the document. This is precisely the objective which

contributed to the appearance of supervised LDA (sLDA). [2]

Particularly, sLDA added to LDA a linear regression module allowing to a variable answer for

1 N

real-value to be predicted linearly from heading empirical proportion z¯= N

∑n=1 zn .

1.3.6

Topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA)

Tr-mmLDA is a new model of statistics topic whose objective is to perform image and video annotation.This model draws originality from a new latent variable regression approach that is very useful

for capturing correlations between image or video functions and annotation text. This contribution

can not use the old technique that is based on sharing a set of latent topics between the two methods

of data.

It is a new orientation that incorporates a regression model for performing a correlation between

the two sets of topics, which captures more general forms of association while ensuring that the

number of topics in both modality data will be different.[3]

It is assumed that we have K image topics and L text topics and that we have an LDA model containing Z={z1 , z2 , ..., zN }and topic proportion θ . A real-valued topic proportion variable for caption

text x ∈ RL such as x = A¯z +µ + n where :

– A is an L ×K regression coefficients matrix.

– µ is a vector of the mean parameters.

– n ∼N(n; 0;Λ) is a zero-mean uncorrelated Gaussian noise with a diagonal precision matrix Λ .

The principle of this model [3] :

1. Take an image topic proportion θ | α ∼ Dir (α).

2. For each rn , n ∈ {1, 2, ..., N}.

Take topic assignement zn = k| θ ∼ Mult(θk ).

Take visual word rn =t| zn = k∼ Mult(βktr ).

1 N

3. Given the empirical image topic proportion z¯= N

∑n=1 zn , extract a true-valued topic proportion variable for the legend text x|¯z,A,µ, Λ∼ N(x;A¯z+µ,Λ).

4. Calculate topic proportion ηl =

exp(xl )

.

∑Lk=1 exp(xk )

7

5. For each caption word ωm , m∈ {1, 2, ..., M}.

Take topic assignment sm = l| η ∼ Mult(ηl ).

Take caption word ω m = t| sm =l∼ Mult(βltω ).

The following Figure (1.5) describes the process of this model.

Figure 1.5: Topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA)

1.3.7

Max-Margin Latent Dirichlet Allocation (MMLDAa )

In this part, we will describe the problem of the classification and image annotation presenting

the MMLDA model. This is a new hierarchical model that incorporates discriminant learning with

maximum margin. It is based on sLDA (supervised LDA) and MedLDA (discriminative maximum

entropy LDA). This model is inspired from MedLDA. The objective is to enhance the suitability for

the vision tasks. The feature vector used for classification in MedLDA is built only on the representation of latent topic. However, for many vision applications, representations raw entities usually give

excellent performance. But, It was not possible to solve many image classification problems by the

use of the MedLDA until the appearance of MMLDA.[4]

There are two types of MMLDA :

– MMLDAc : for image classification.

– MMLDAa : for images annotation.

MMLDAa is used for the problem of image annotation. The purpose of this model is to use a set of

tags to annotate a given image. It is not a problem of standard classification because multiple tags can

be associated with a single image.

8

1.3.8

Correspondence CTM (corrCTM)

Most of the models mentioned so far are inspired from the famous LDA model that is based on

the proportion of topics. These topics are independent and have no correlation between them. So,

corr-CTM has made the yield of annotation service more efficient.[5]

Corr-CTM is derived from the CTM. By comparing this model with the LDA, this model has

made the following two changes :

1. Using a normal distribution multivariate instead of Dirichlet in order to capture topic correlation.

2. Adding a normal additional logistic for topics normally distributed.

These two contributions have given the CTM model an adjustment better than LDA in the representation of textual collection thanks to modeling the correlated space instead of modeling the space of

independent topics (LDA). Corr-CTM first uses a covariance matrix for the capture correlation and

then a normal logistic function instead of the Dirichlet function in second stage.

1.4

1.4.1

Annotation Performance

Datasets

In this section, we will present the various data sets used as a support to test the efficacy of each

model.

1.4.1.1

PASCAL VOC07

It contains 10000 pictures grouped into 20 classes object and annotated while keeping tags that

appear at least 5 times, which provides a basis of 6200 pictures with vocabulary of a capacity of 186

words in total.

1.4.1.2

Corel 5K

It contains 5000 images, each image is annotated manually from 1 to 5 keywords with a vocabulary

of capacity of 260 words.

1.4.1.3

LabelMe

It contains 2687 words, 2149 for training and 538 for testing. This base removes words that have

an annotation repeated at least 3 times for a given vocabulary from 136 entities . LabelME contains 8

different types of scenes, "coast", "forest", "highway", "inside the city", "mountain", "open country",

"street" and "high-rise building".

9

1.4.1.4

UIUC-Sport

It contains 8 categories of sporting events: rowing (250 images), badminton (200 images), polo

shirts (182 images), petanque (137 images), snowboard (190 images), croquet (236 images), sailing

(190 images), and climbing (194 images). The images are classified into easy and medium according

to judgments from human subjects. Details on the distance of foreground objects are also supplied

for each image.

1.4.2

Visual feature extraction

After detecting the characteristics of a given image, it should be noted that each image is abstracted

by several local patches. These are entity representation methods that describe the way in which

patches are represented as numerical vectors.

These vectors are called feature descriptors. A good descriptor must have the capacity to handle

changes in intensity, scale, rotation and having affinity to some extent. Among these descriptors we

can mention the Scale-Invariant Feature Transform (SIFT) that will be described in the next section.

1.4.2.1

Scale Invariant Feature Transform (SIFT)

SIFT : Scale-Invariant Feature Transform, is an algorithm that was developed in 1999 by “David

Lowe”. This algorithm is used in the area of computer vision for detecting and identifying similar

components between the various digital images.

The basic step of the proposed Lowe method is to calculate what we called "SIFT descriptors"

for the images to study. Its digital information is derived from the local analysis of an image and

characterizes the visual content of the image seen differently from different sides, the scale (zoom

and resolution of the sensor), framing, angle of observation and the exposure (brightness).

Therefore, two photographers who take a picture for the same object will have all the possibility

of having similar SIFT descriptors. These two values can be increasingly similar if the moments of

the shooting and angles of views are close.

The applications of the method are numerous and are growing. It has been used in the early

twenty-first century fields such as object detection, mapping and navigation, photo stitching, 3D modeling, research by image content, tracking and match moving video.

1.4.2.2

K-means Clustering

K-means clustering is a process of vector quantization. It often used to reduce dimensionality of

input data while keeping pertinent information. This method is intended to divide n observations into

k clusters in unsupervised fashion. The main idea behind K-means is to assign the nearest cluster to

each observation .

1.4.3

The performance measures

Several methods to test the performance of auto-annotation are proposed :

10

– Asking novice users to assess the results obtained from the system.

– Measuring the performance based on keywords associated with the image, such as the words

around an image of web.

– Measuring the performance based on keywords associated with the image drawn manually by

professionals.

The whole is annotated and is so called ground truth.

In the next part, we will present the most used criteria in measuring the performance of the image

automatic annotation.

1.4.3.1

Caption perplexity

In theory of information, the perplexity is a measure on how a probability distribution or a probability model predicts samples. It can be used to compare the probability models. Perplexity is a

test to evaluate the models often used as part of language modeling. It is monotonously decreasing

the likelihood of the test data, and is algebraically similar to the reciprocal of the geometric mean

per-word likelihood.[8]

The lower a score of perplexity is, the better the performance generalization is. For a test set of M

documents, the formula of caption perplexity is:

o

n M

∑d=1 log p(wd )

perplexity = exp −

M

N

∑d=1

1.4.3.2

d

Precision and Recall

Recall and precision are two classical measures in information retrieval. Precision is the ratio of

the number of relevant documents found to the total number of selected documents. The recall is the

ratio of the number of the relevant documents found to the total number of relevant documents.[7]

We denote by:

– n : the number of relevant documents.

– r : the number of relevant documents found.

– w : the number not relevant documents found.

The recall R and precision P are then defined by :

R=

r

n and P

=

r

r+w

In the context of use of these two measurements for the automatic image annotation, we must

apply for each word of lexicon a query q = {ω} containing only the word. Then, we calculate the

number nW 6=∅ of words for which at least one image was found :

nW 6=∅ = ∑ω∈w | {ω | R (ω) > 0 and P(ω)> 0}| .

The next step consists in measuring the mean recall mR and the mean precision on all the words

of W6=∅ :

r(ω)

mR = ∑ω∈W6=∅ n(ω)

11

and

r(ω)

mP = ∑ω∈W6=∅ r(ω)+w(ω)

where :

– n(ω) : the number of test images annotated by ω .

– r(ω) : the number of images that initially contains the word caption, and the system annotated

with the word.

– w(ω): the number of Images not initially annotated by this word and annotated by the system

with this word.

It is important to indicate that the number nW 6=∅ of words which at least one image was found when

calculating mP and mR because a system can obtain a high average recall and high average precision

by specifying only the most frequent words.

1.4.3.3

F-measure

A popular measure of a test’s accuracy combining precision and recall is as follows :

F=

2×(recall×precision)

(recall+precision)

This is called F1 measure because precision and recall are weighted equally.This is a special case

of the general measure Fη ( for positive real values η ).

Fη =

1.4.3.4

(1+η 2 )(recall×precision)

(η 2 ×recall+precision)

Mean Average Precision (MAP)

For a set of queries mean average precision is the mean of the average precision values for each

query.

Q

MAP =

∑q=1 AveP(q)

Q

Where :

– Ave : function that calculates the average.

– P : the precision value for each query.

– Q : the number of queries.

1.5

Conclusion

In this chapter, we presented the different topic models used for image annotation and the different

criteria to evaluate them. We can note that the appearance of these models follows a chronological

order. Sometimes, a new model is the fusion of two other models by changing the laws of parameters.

12

There are several different models introduced by many researchers. These models need to be

evaluated to test the contribution of each model to image annotation. In this context, we presented the

criteria used for testing the performance of these topic models such as caption perplexity and testing

the value of average likelihood.

In the next chapter, we will present the models that we will study and specify the difference in

structure between them. We will describe algorithms for each model.

This study will be very detailed, we will start by mathematical expressions of the model parameters then we will try to find the appropriate equation that maximizes lower bound to obtain the

algorithm corresponding to each model.

13

Chapter 2

Detailed study of used topic models

2.1

Introduction

In this chapter, we will describe the topic models for image annotation, design them differently

according to our perception and finally implement them.

We will start with describing data used for our work, then we will study in depth the correspondence LDA (CorrLDA) model . We will present the whole learning process associated to this model

called Variational EM. So, we will start by calculating the lower bound, maximizing it and ending by

writing the learning algorithm of this model.

The same process is applied to Correspondence Correlated Topic Model (Corr-CTM).

2.2

Data representation

In this part, we will describe the data used in our work.

– A corpus C is a collection of documents: C = {doc1 , ..., docd , ..., docD }.

– A document docd is a pair of an image rd and its annotation wd : docd = {rd , wd }.

– An image is a set of visual words rd = {rd1 , ..., rdn , ..., rdNd }.

– An annotation is a set of textual words wd = {wd1 , ..., wdm , ..., wdMd }.

T

j

Vr

1

– A visual word rdn = rdn , ..., rdn , ...rdn , where ∃i = 1 and j = 0 , ∀ j 6= i.

T

j

w

– A textual word wdm = w1dm , ..., wdm , ...wVdm

, where ∃i = 1 and j = 0 , ∀ j 6= i.

2.3

Framework of topic modeling

A general process associated to a topic model is divided into two stages: Learning and Inference

(see Figure 2.1).

14

Figure 2.1: General process for a topic model

2.3.1

Learning :

Topic models are probabilistic models containing many hidden variables. So, learning such models

requires more powerful algorithms than tradiotional EM. Therefore, common alternative approches

are either Gibbs sampling or variational EM. In our case, we will use the second approach because it

is more cited in the litterature.

In the next section, we will compare EM to Variational EM.

2.3.1.1

EM algorithm

Traditionally, Expectation-Maximization (EM) is used to learn probabilistic models. Indeed, EM

is an iterative algorithm which performs two steps :

Expectation-Step to approximate the log-likelihood and Maximization-Step to estimate model

parameters which maximize the corpus log-likelihood.

However, topic models which are special cases of probabilistic models did not use traditional EM.

They rather apply a variant of EM called Varitional EM.

The main difference between EM and Variational EM is the introduction of Variational Inference

in the E-Step (see the following algorithms (2.1) and (2.2)).

15

Algorithm 2.1 EM algorithm

repeat

E-Step:

Q(π|π (t) )=E[logLZ|X,π (t) (π;x,z)]

M-Step:

π (t+1) =argmax Q(π|π (t) )

until (convergence)

Where :

– z={z1 ,z2 ,...,zd ,...,zD } : hidden latent variables.

– x={x1 ,x2 ,...,xd ,...,xD } : sample of training observations.

– π : model parameters.

– logL : log-likelihood.

– Q : expectation of logL.

2.3.1.2

Variational EM algorithm

Variational EM algorithm, described in the algorithm below (Algorithm 2.2), is an iterative algorithm, whose each iteration consists of two steps :

– E-Step : finds for each document the optimizing values of the variational parameters.

– M-Step : maximizes the resulting lower bound on the log likelihood with respect to the model

parameters.

The algorithm converges once the perplexity of the model given training corpus does not change .

Therefore, Variational EM algorithm is based on Variational Inference procedure that we will

explain in the subsequent subsection.

Algorithm 2.2 Variational EM algorithm

repeat

E-Step:

for each observation (xd ) do

[Vpd ]=Variational Inference(xd ,π (t) )

end for

M-Step:

π (t+1) =argmax (Lxd )

until (convergence)

Where :

– Vp : Variational parameters of xd .

– xd : dth observation.

– Lxd : lower bound on logL(xd ).

2.3.1.3

Varitional Inference

As indicated in the LDA paper [7] , E-Step typically amounts to an inference step :

16

p(Z | X, π)= p(X|Z,π)p(X|π)

p(X|π)

p(Z | X, π) is intractable because it is difficult to calculate p(X | π). To resolve this inferential

problem, Variational EM approximates logLxd =log p(X = xd | π) by lower bound Lxd . In LDA paper,

this idea is expressed in the following equation :

logLxd = Lxd + DKL (q(V pxd ) k p(Z | xd , π))

Where q(Vpxd ) is called variational distribution and Vpxd are the variational paramters associated

to xd .

In EM algorithm, the ultimate goal is to find model parameters that maximize the log-likelihood

given training data. Since the log-likelihood of training data is the sum of log-likelihood of all document xd=1:D , so maximizing logL(corpus) is equivalent to to maximizing logL(xd ) for each document

xd .

From the equation of logLxd , for each document xd , maximizing logL(xd ) amounts to minimizing

DKL and maximizing Lxd . Therefore, Variational EM maximizes a lower bound (Lxd ) for each document in procedure called Variational Inference and returns optimal values of variational paramaters.

2.3.2

Inference Algorithm

Once we obtain variational and model parameters, it remains to predict annotations from an unknown image. The first step is to perform Variational Inference for the unknown image by fixing model

parameters which have already been obtained from a training set.

For predicting annotations, we approximate the conditional distribution over words in the textual

vocabulary. The conditional probability p(w|r) can be treated as the predicted confidence score of

each annotation word in word (w) vocabulary given the wholly codewords of the unknown image (r).

Then, we can choose the most promising annotations by ranking the confidence scores.

The next algorithm 2.3 describes briefly this procedure.

Algorithm 2.3 Inference Algorithm, note that r is a set of regions belonging to image I

for each document docd ={rd ,wd }in test data set do

for each textual word wi in textual vocabulary do

doc f ictive ={rd ,wi }

[Vp]=Variational Inference(doc f ictive , π)

compute p(wi |r) in function of Vp

end for

Rank values of p(wi |r)

w=

ˆ top-5 of ranked list

end for

Where doc f ictive ={rd ,wi } is an imaginary document composed of regions of docd and one word

from textual vocabulary.

17

2.4

Topic models for image annotation

In this section, we will present models for image annotation. We will carefully study models

oriented to the task of annotation.

2.4.1

Correspondence LDA (CorrLDA)

We introduce correspondence LDA (CorrLDA) as a model combining exibility of GM-LDA with

associability of GM-Mixture. With this model, we reach simultaneous dimensionality reduction in the

representation of region descriptions and words, while also modeling the conditional correspondence

between their respective reduced representations.

The table below (2.1) presents symbol and description used in CorrLDA.

Symbol

D

K

M

N

Vr

Vw

v

w

y

z

θ

α

δ

β

γ

Φ

λ

Description

Number of documents in training set.

Number of topics.

Number of textual word in document d.

Number of visual word in document d.

Size of visual vocabulary.

Size of textual vocabulary.

j

j

Visual word vn =1 only if the nth visual word in document d is indexed j , else vn =0.

j

j

Textual word wn =1 only if the mth textual word in document d is indexed j , else wn =0.

Discrete indexing variable y∈ {1, · · · ,M}.

Latent topic. zin = 1 if zn is the ith latent topic, else zin = 0

Multinomial. θi , i ∈ {1, · · · ,K}.

Dirichlet prior for θ . α i , i ∈ {1, · · · ,K}.

Multinomial. δi j , i ∈ {1, · · · ,K}, j ∈ {1, · · · , Vr }

Multinomial. βi j , i ∈ {1, · · · ,K}, j ∈ {1, · · · , Vw }

Variational Dirichlet. γ i , i ∈ {1, · · · ,K}.

Variational Multinomial. Φni , n ∈ {1, · · · ,N}, i ∈ {1, · · · , K}

Variational Multinomial. λmn , m ∈ {1, · · · ,M}, n ∈ {1, · · · , N}

Table 2.1: Symbol and Description (CorrLDA)

2.4.1.1

Variational Inference

Simplifying the Graphical Model

To start our derivation, firstly we have to simplify the original graphical model described in chapter

1 by introducing three variational parameters. We only concentrate on the hidden variables θ ,z,y and

give a simplified graphical model in the following figure (Figure 2.2).[9]

18

Figure 2.2: (Left) Graphical model representation of our modified CorrLDA. (Right) Graphical model

representation of the variational distribution used to approximate the posterior in Corr-LDA. Note that

the variational graphical model creates independency between z,θ and y, which enables us to factorize

the joint probability.

The expression of factorized distribution on the latent variables of a document is the following:

M

q(θ , z, y | γ, Φ, λ )= q(θ | γ)∏N

n=1 q(zn ,Φn )∏m=1 q(ym ,λm ).

Note that γ is the variational Dirichlet, Φn is a variational multinomial over K topics, and λm is a

variational multinomial over M words.

– Finding the Lower Bound : By looking at the graphical model in Figure 2.2, we can factorize

q(θ , z, y | γ, Φ, λ ) into q(θ | λ )q(z | Φ)q(y | λ ) . So, we can write the expression of lower bound

as follows :

L[γ, Φ, λ ; α, δ , β ]=Eq [logp(θ | α)]+Eq [logp(z | θ )] +Eq [logp(r | z, δ )] +Eq [logp(y | N)]

+Eq [logp(ω | y, z, β )] -Eq [logq(θ | γ)] -Eq [logq(z | Φ)]-Eq [logq(y | λ )] .

– Expanding Factors : Now we need to expand each term to make L[γ, Φ, λ ; α, µ, σ , β ] as

an explicit function of model parameters α, µ, σ , β and variational parameters γ, Φ, λ . These

equations can be used in several other occasions.

K

K

Eq [logp(θ | α)]=log Γ(∑Kj=1 α j ) − ∑K

i=1 log Γ(αi ) + ∑i=1 (αi − 1)(ψ(γ j ) − ψ(∑ j=1 γ j )).

K

k

Eq [logp(z | θ )]=∑N

n=1 ∑i=1 Φni (ψ(γi )-ψ(∑ j=1 γ j ) ).

Vr

K

Eq [logp(r | z, δ )] =∑N

n=1 ∑i=1 ∑i=1 Φni log δi j ].

Eq [logp(y | N)]=-log(N).

j

Vw

N

K

Eq [logp(ω | y, z, β )]=∑M

m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 ωn log(βi j ).

K

k

Eq [logq(θ | γ)]=log Γ(∑Kj=1 γ j )-∑K

i=1 log Γ(γi ) + ∑i=1 (γi − 1)(ψ(γi )-ψ(∑ j=1 γ j )).

19

K

Eq [logq(z | Φ)]=∑N

n=1 ∑i=1 Φni log(Φni ).

N

Eq [logq(y | λ )] =∑M

m=1 ∑n=1 λmn log λ mn .

Thereafter we obtained the expression of the lower bound depending on the model parameters

(α, µ, σ , β ) and variational parameters (γ, Φ, λ ).

K

K

L[γ, Φ, λ ; δ , σ , β ]=log Γ(∑Kj=1 α j ) − ∑K

i=1 log Γ(αi ) + ∑i=1 (αi − 1)(ψ(γ j ) − ψ(∑ j=1 γ j ))

k

K

+∑N

n=1 ∑i=1 Φni (ψ(γi )-ψ(∑ j=1 γ j ) )

Vr

K

+∑N

n=1 ∑i=1 ∑i=1 Φni log δi j

-log(N)

j

Vw

K

N

+∑M

m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 ωn log(βi j )

K

k

-log Γ(∑Kj=1 γ j )+∑K

i=1 log Γ(γi ) - ∑i=1 (γi − 1)(ψ(γi )-ψ(∑ j=1 γ j ))

K

-∑N

n=1 ∑i=1 Φni log(Φni )

N

-∑M

m=1 ∑n=1 λmn log λ mn .

The next step is to maximize the lower bound , so we will look for values of (α, µ, σ , β ) which

maximize L[γ, Φ, λ ; α, µ, σ , β ].

2.4.1.2

Maximizing Variational Parameters

– Variational Dirichlet γ : In this part we have three variational parameters. These are γ, Φ and

λ . We begin with Variational Dirichlet γ by choosing terms of L[γ, Φ, λ ; α, δ , β ] that contain

γ.

k

L[γi ] = ∑K

i=1 (αi − 1)(ψ(γi )-ψ(∑ j=1 γ j )

K

k

+ ∑N

n=1 ∑i=1 Φni (ψ(γi )-ψ(∑ j=1 γ j )

K

k

-log Γ(∑Kj=1 γ j )+∑K

i=1 log Γ(γi ) - ∑i=1 (γi − 1)(ψ(γi )-ψ(∑ j=1 γ j )).

We derive L[γ] relative to γi . We obtain :

dL[γi ]

dγi =

0

0

k

K

N

ψ (γi )(αi + ∑K

n=1 Φni − γi )-ψ (∑ j=1 γ j ).∑i=1 (α j+ ∑n=1 Φn j − γ j ).

We put this equation equal to zero, there is the maximum at :

γi =αi +∑N

n=1 Φni .

20

– Variational Multinomial Φ: We now move to maximizing L[γ, Φ, λ ; α, µ, σ , β ] by adding

Lagrange multiplier ∑K

i=1 Φni = 1.

k

K

L[Φni ] = ∑N

n=1 ∑i=1 Φni (ψ(γi )-ψ(∑ j=1 γ j )

Vr

K

+∑N

n=1 ∑i=1 ∑i=1 Φni log δi j

j

K

V

+ ∑N

n=1 ∑i=1 ∑ j=1 Φni ωn log(βi j )

K

- ∑N

n=1 ∑i=1 Φni log(Φni )

+η(∑K

i=1 Φni − 1).

We derive L[Φni ] relative to Φni . We obtain :

dL[Φni ]

k

dΦni =ψ(γi )-ψ(∑ j=1 γ j )

+log(δiVr )+∑M

m=1 λmn log(βiVw )-log(Φni )-1+η.

We put this equation equal to zero, there is the maximum at :

Φni ∝ δivn exp(ψ(γi ) − ψ(∑kj=1 γ j ) + ∑M

m=1 λmn log(βiwm )).

– Variational Multinomial λ : Finally, we maximize L[γ, Φ, λ ; α, δ , β ] by adding Lagrange multiplier ∑N

n=1 λmn = 1.

j

Vw

N

K

M

N

N

L[λmn ]=∑M

m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 ωn log(βi j )-∑m=1 ∑n=1 λmn log λ mn +η(∑n=1 λmn − 1).

Taking the derivative λmn :

dL[λmn ]

Vw

K

dλmn =∑i=1 Φni ∑ j=1 log(βiwm ) − log(λmn )+η.

Setting this derivative to zero yields the maximizing value of the variational parameter λmn .

λmn ∝exp(∑K

i=1 Φni log(βiwm )).

2.4.1.3

Maximizing Model Parameters

– Conditional Multinomial δ : We again isolate terms containing δ and add Lagrange multiplir

ers (∑Vj=1

δ i j =1).

Nd

Vr

K

L[δ i j ]=∑D

d=1 ∑n=1 ∑i=1 Φdni log(δivn )-η(∑ j=1 δ i j -1).

21

Setting

dL[δi j ]

dδi j =0

leads to :

j

Nd

δ i j ∝∑D

d=1 ∑n=1 Φdni vdn .

j

j

Note that vdn is an indicator variable, vdn =1 only if the nth visual word in document d is indexed j.

– Conditional Multinomial β : Next, we isolate terms containing β and add a Lagrange multiw

plier (∑Vj=1

β i j =1).

Nd

Md

Vw

K

L[βi j ]= ∑D

d=1 ∑m=1 ∑i=1 ∑n=1 Φni λmn log(βiwm )-η(∑ j=1 β i j -1).

Setting

dL[βi j ]

dβi j =0

leads to :

j

Md

Nd

βi j ∝∑D

d=1 ∑m=1 wdm ∑n=1 Φdni λdmn .

Where :

j

j

wdm is an indicator variable, wdm =1 only if the mth textual word in document d is indexed j.

i]

– Dirichlet prior α : Now we will estimate the Dirichlet α. Observe that dL[α

dαi depends on

α j where j6=i . Therefore, a linear-time Newton-Raphson algorithm is developed to find the

maximal α. We omit the details of this algorithm.

2.4.1.4

Variational EM

The derivation described above is summarized in Table 2.2. We find the maximum likelihood

estimates of the model parameters with a variational EM procedure. Particularly, the E-Step computes

the variational posterior {γd , Φd , λd }D

d=1 for each image and caption given the current settings of model

parameters {δ , β , α}. Then, the M-Step subsequently finds maximum likelihood estimates of the

model parameters from expected sufficient statistics taken under the variational distribution.

The variational EM algorithm alternates between two steps until the bound on the expected log

likelihood converges. The E-Step (Variational Inference) and the complete variational EM are summarized in Algorithm 2.4 and Algorithm 2.5, respectively.[8]

Parameter

γi

Φni

λmn

δij

βi j

α

Update Rule

=αi +∑N

n=1 Φni

∝ δivn exp(ψ(γi ) − ψ(∑kj=1 γ j ) + ∑M

m=1 λmn log(βiwm ))

K

∝exp(∑i=1 Φni log(βiwm ))

j

Nd

∝∑D

d=1 ∑n=1 Φdni vdn

j

Md

Nd

∝∑D

d=1 ∑m=1 wdm ∑n=1 Φdni λdmn

Using Newton-Raphson algorithm

Table 2.2: Parameter and Update Rule

22

Algorithm 2.4 Variational Inference for corrLDA

Require α,δ , β from M-Step

initialize γi =αi +N/K, for all i ∈ {1, . . . , K}

initialize Φni =1/K, for all i ∈ {1, . . . , K}

initialize λmn =1/N, for all n ∈ {1, . . . , N}

t := 1

maxiter := 100

repeat

for n := 1 to N do

for i := 1 to K do

Φni := δivn exp(ψ(γi ) − ψ(∑kj=1 γ j ) + ∑M

m=1 λmn log(βiwm ))

end for

end for

normalize{Φni }K

i=1 to sum to 1

for m := 1 to M do

for n := 1 to N do

λmn :=exp(∑K

i=1 Φni log(βiwm ))

end for

end for

normalize{λmn }N

n=1 to sum to 1

for i := 1 to K do

γi :=αi +∑N

n=1 Φni

end for

t := t + 1

until (t > maxitr) or (convergence)

Remark that after we have a training model, we can perform the E-Step with fixed model parameters α, β and δ on an unseen document to estimate the variational posterior.

Algorithm 2.5 Parameter Estimation for corrLDA

t := 1

maxiter := 100

repeat

for d := 1 to D do

[γ,Φ,λ ]=Variational Inference for corrLDA(docd , α, δ , β )

end for

for i := 1 to K do

for j := 1 to Vr do

j

Nd

δ i j :=∑D

d=1 ∑n=1 Φdni vdn ,

end for

for j := 1 to Vw do

j

Md

Nd

βi j :=∑D

d=1 ∑m=1 wdm ∑n=1 Φdni λdmn

end for

end for

w

normalize{βi j }Vj=1

to sum to 1

r

normalize{δi j }Vj=1

to sum to 1

update α by Newton-Raphson algorithm

t := t + 1

until (t > maxitr) or (convergence)

23

2.4.2

Correspondence Correlated Topic Model (CorrCTM)

CorrCTM is built on CTM. As we explained in the first chapter, the contribution of CTM compared to LDA, is using multivariate normal distribution instead of Dirichlet to capture the topic correlation,and adding an additional logistic normal mapping step to incorporate the normally distributed

topics. These contributions make CTM more reliable than LDA.

We derive posterior probability function in the E-Step, then start maximizing the lower-boundary

in order to calculate model parameters. Unfortunately, the direct calculation of exact posterior function over latent variables is intractable. So, we use variational EM framework to approximate the

inference instead of exact inference in the E-Step, then we try to maximize the lower-boundary to get

a new model parameter estimate.

Figure 2.3 presents the CorrCTM model.

Figure 2.3: (left) Proposed CorrCTM, (right) Variational distribution used to approximate the posterior for each image in CorrCTM.

In the following part, we will focus on deriving a Variational Inference algorithm for approximating the posterior distribution, then estimate model parameters. Next, we will derive prediction

algorithm for annotating new images with model parameters.

The table below (Table 2.3) presents symbol and description used in CorrCTM.

24

Symbol

D

K

M

N

Vr

Vw

v

w

y

z

θ

µ

Σ

δ

β

γ

ν

Φ

λ

ζ

Description

Number of documents in training set.

Number of topics.

Number of textual word in document d.

Number of visual word in document d.

Size of visual vocabulary.

Size of textual vocabulary.

j

j

Visual word vn =1 only if the nth visual word in document d is indexed j , else vn =0.

j

j

Textual word wn =1 only if the mth textual word in document d is indexed j , else wn =0.

Discrete indexing variable y∈ {1, · · · ,M}.

Latent topic. zin = 1 if zn is the ith latent topic, else zin = 0

Multinomial. θi , i ∈ {1, · · · ,K}.

Normal vector µi , i ∈ {1, · · · ,K}.

Normal matrix µi , i ∈ {1, · · · ,K}.

Multinomial. δi j , i ∈ {1, · · · ,K}, j ∈ {1, · · · , Vr }

Multinomial. βi j , i ∈ {1, · · · ,K}, j ∈ {1, · · · , Vw }

Variational mean, γ i , i ∈ {1, · · · ,K}.

Varriance of normal distribution ν i , i ∈ {1, · · · ,K}.

Variational Multinomial. Φni , n ∈ {1, · · · ,N}, i ∈ {1, · · · , K}

Variational Multinomial. λmn , m ∈ {1, · · · ,M}, n ∈ {1, · · · , N}

Additional variational parameter

Table 2.3: Symbol and Description (CorrCTM)

2.4.2.1

Variational Inference

To start our derivation, firstly we have to simplify the original graphical model by introducing four

(γ, ν, Φ, λ ) variational parameters. The expression of factorized distribution on the latent variables of

a document is:

M

q(θ , z, y | γ, ν, Φ, λ )= q(θ | γ, ν)∏N

n=1 q(zn ,Φn )∏m=1 q(ym ,λm ).

Note that θ =f(η)=

exp(η)

.

∑K

i=1 (ηi )

So, we can write the expression of lower bound as follows :

L(ζ ,γ, ν, Φ, λ ; µ, Σ, δ , β ) =Eq [log p(θ | µ, Σ)] + Eq [log p(z | θ ]+Eq [log p(r | z, δ )] -Eq [logp(y | N)]

+Eq [logp(ω | y, z, β )]-Eq [log q(θ1:K | γ, ν)]

-Eq [log q(z | Φ)]-Eq [logq(y | λ )] .

Knowing that :

Eq [log p(η | µ, Σ)]= 21 log| Σ−1 |- K2 log (2π)- 21 Eq [(η − µ)T Σ−1 (η − µ)].

n

o

2

K

−1 ( K exp γ + νi ) + 1 − log(ζ )].

Eq [log p(z | f (η)]=∑N

[

γ

Φ

−

ζ

∑

∑

i

i=1

n=1 i=1 i ni

2

j

V rt

K

Eq [log p(r | z, δ )]=∑N

n=1 ∑i=1 Φni ∑ j=1 rn log(δi j ).

Eq [logp(y | N)]=-log(N).

25

j

Vw

K

N

Eq [logp(ω | y, z, β )]=∑M

m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 ωn log(βi j ).

1

2

Eq [log q(θ1:K | γ, ν)]=-∑K

i=1 2 [log(νi ) + log(2π)+1].

K

Eq [log q(z | φ )]=∑N

n=1 ∑i=1 Φni log(Φni ).

N

Eq [logq(y | λ )]=∑M

m=1 ∑n=1 λmn log λ mn .

Thereafter, we obtained the expression of the lower bound depending on the model parameters

(µ, Σ, π, β )) and variational parameters (ζ ,γ, ν, Φ, λ ).

L(ζ ,γ, ν, Φ, λ ; µ, Σ, δ , β ) = 12 log| Σ−1 |- K2 log (2π)- 12 Eq [(η − µ)T Σ−1 (η − µ)].

n

o

2

K

−1 ( K exp γ + νi ) + 1 − log(ζ )]

γ

Φ

−

ζ

+∑N

[

∑i=1

i

n=1 ∑i=1 i ni

2

j

Vt

K

+∑N

n=1 ∑i=1 Φni ∑ j=1 rn log(δi j )

-log(N)

j

Vw

N

K

+∑M

m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 ωn log(βi j )

1

2

+∑K

i=1 2 [log(νi ) + log(2π)+1]

K

-∑N

n=1 ∑i=1 Φni log(Φni )

N

-∑M

m=1 ∑n=1 λmn log λ mn .

Our aim is to maximize as much as possible the expression of the lower bound L(ζ ,γ, ν, Φ, λ ; µ, Σ, δ , β ).

In E-Step, we approach the variational parameters {ζ ,γ, ν, Φ, λ } to lower bound the marginal likelihood, then maximize the lower boundary in M-Step to obtain model parameters {µ, Σ, δ , β }.

The same treatment of CorrLDA variational parameters will be performed in the next part.

2.4.2.2

Maximizing Variational Parameters

We isolate terms containing λ :

j

Vw

N

K

M

N

L[λ ]=∑M

m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 ωn log(βi j )-∑m=1 ∑n=1 λmn log λ mn .

Setting

dLL[λ ]

dλ =0

leads to :

λmn ∝ exp ∑K

i=1 φni log βiwm .

Then we isolate terms containing ζ :

26

n

o

2

K

−1 ( K exp γ + νi ) − log(ζ )].

[−ζ

L[ζ ] =∑N

∑

∑

i

i=1

n=1 i=1

2

Setting

dL[ζ ]

=0

dζ

leads to :

n

o

νi2

exp

γ

+

ζ =∑K

i

i=1

2 .

For γ and ν 2 ,we have :

K

−1 ( K exp{γ +

L[γ] = − 12 (γ − µ)T Σ−1 (γ − µ)+∑N

∑i=1

i

n=1 [∑i=1 γi Φni − ζ

νi2

2 })].

and

−1 ( K exp{γ +

L[ν 2 ] = - 12 Trace(diag(ν 2 )Σ−1 )+∑N

∑i=1

i

n=1 [-ζ

νi2

K 1

2

2 })]+∑i=1 2 log(νi ).

We can use Newton’s method to find the local optimal of the coordinate object function. In

practice, we use LBFGS and constrained Newton method to iterate γ and ν 2 .[10]

Next, we follow the same process for Φ.

j

Vr

K

N

K

N

K

L[Φ]=∑N

n=1 ∑i=1 γi Φni +∑n=1 ∑i=1 ∑ j=1 Φni rn log(δi j )-∑n=1 ∑i=1 Φni log(Φni )

Vw j

N

K

+∑M

m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 rm log(βi j ).

Setting

dL[Φ]

dΦ =0

leads to :

Φni ∝δivn exp γi + ∑M

m=1 λmn log βiwm .

2.4.2.3

Maximizing Parameter Estimation

Taking a collection of image data with annotation words {(rd ,wd )}D

d=1 , we will try to maximize

likelihood estimation for model parameters {µ, Σ, δ , β }. We will then update model parameters by

setting derivation equal zero with respect to each model parameter.

1

T −1

L[µ]=∑D

d=1 − 2 (γd − µ) Σ (γd − µ)

Setting

dL[µ]

dµ =0

leads to :

µ = D1 ∑D

d=1 γd .

The terms containing Σ:

1

−1 1

2 −1

T −1

L[Σ]=∑D

d=1 [ 2 log| Σ |- 2 Trace((diag(ν )Σ ) + (γd − µ) Σ (γd − µ)].

Setting

dL[Σ]

dΣ =0

leads to :

27

2

T

Σ= D1 ∑D

d=1 (I.νd + (γd − µ)(γd − µ) ).

The terms containing δ :

j

Vr

N

K

L[δi j ]=∑D

d=1 ∑n=1 ∑i=1 Φni ∑ j=1 rn log(δi j )

Setting

dL[δi j ]

dδi j =0

leads to :

j

Nd

δi j =∑D

d=1 ∑n=1 Φdni νdn .

The terms containing βi j :

j

Vw

M

N

K

L[βi j ]=∑D

d=1 [− ∑m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 ωn log(βi j )]

Setting

dL[βi j ]

dβi j =0

leads to :

j

Md

Nd

βi j ∝∑D

d=1 ∑m=1 wdm ∑n=1 Φdni λdmn .

2.4.2.4

Learning algorithm

The algorithm below (Algorithm 2.6) presents Variational Inference for CorrCTM . Note that

LBFGS [10] is an iterative method for solving unconstrained nonlinear optimization problems. We

use LBFGS to iterate γ and ν.

28

Algorithm 2.6 Variational Inference for CorrCTM

intialize : ζ =10

intialize : Φdni = 1k

intialize : νi2 =10

intialize : λi2 =0

t := 1

repeat

K

ζd := ∑ exp{γdi +

2

νdi

2 }

i=1

for m := 1 to Md do

for n := 1 to Nd do

Vw

K

{ ∑ Φdni ∑ log βi j }

λdmn ∝ exp i=1

end for

Normalize(λdm )

end for

γd := LBFGS(γd )

2 := LBFGS(ν 2 )

νdi

di

for n := 1 to Nd do

for i := 1 to K do

Vr

j=1

j

Md

Vw

m=1

j=1

j

{γdi + ∑ rn log δi j + ∑ λdmn ∑ wdm log βi j }

j=1

Φdni ∝ exp

end for

Normalize(Φdn )

end for

t := t + 1

until (t > maxitr) or (convergence)

29

The algorithm below (Algorithm 2.7) presents parameter estimation for CorrCTM.

Algorithm 2.7 Parameter Estimation for CorrCTM

β and δ are randomally initialized, then normalized

t := 1

repeat

for docd ∈ C do

[ζd , γd , νd , Φd , λd ]:=Variational Inference(docd , µ, Σ, δ , β )

end for

D

µ := ∑ γd

Σ :=

d=1

D

1

{Iνd2 + (γd − µ) (γd − µ)T }

∑

D

d=1

for i := 1 to K do

for j := 1 to Vr do

D

Nd

Vr

j

δi j ∝ ∑ ∑ Φdni ∑ rdn

d=1 n=1

j=1

end for

Normalize(δi )

for j := 1 to Vw do

D

Md Nd

j

βi j ∝ ∑ ∑ ∑ λdmn Φdni wdm

d=1 m=1 n=1

end for

Normalize(βi )

end for

t := t + 1

until (t > maxitr) or (convergence)

2.5

Comparison between LDA and CTM based topic models

Topic models can be regrouped into two main categories: LDA based topic models (LDA,CorrLDA,sLDAbin,...) and CTM based topic models (CTM,CorrCTM,sCTM-bin ...).

2.5.1

Topic proportions θ

LDA based topic models are all under Dirichlet assumption: topic proportions of an image are

randomly drawn from a Dirichlet distribution. Under Dirichlet, each topic proportion is assigned

independently, which leads to an unrealistic limitation that the presence of one topic is not correlated

with the presence of others.

CTM based topic models are all under normal distribution: topic proportions of an image are

correlated using logistic normal mapping. Annotation words of an image have correlation upon image

content, this correlation could be modeled by topic correlations. CTM produces correlated topic

proportions for words in one document thanks to the logistic normal mapping.

So, LDA based topic models allow each document to exhibit multiple topics, but ignores the

correlation between topics. CTM based topic models are based on LDA and overcome the limitation

of LDA.

30

2.5.2

Variational Parameters

Taking the equation of Φ which maximizes Likelihood in CorrLDA and CorrCTM, we note that

the per-codeword variational distribution over topics Φ in CorrCTM depends on the different topic

proportions hyper parameter γi of topic proportion. In CorrLDA γi , i∈ {1, . . . , K} are generated from

Dirichlet, which implies that γi , i∈ {1, . . . , K} are independent and there is no correlation between

them.

Nevertheless, in CorrCTM the γi , i∈ {1, . . . , K} is generated from normal distribution, and γi ,γ j

are calculated in relation to each other where the variance νi is involved in the covariance matrix Σ.

As a result, the covariance structure highly correlates topic proportions for each image, and the topic

assignment for the codewords Φni and word λmn would also be correlated.

2.5.3

Inference Process

The main contribution of CTM based topic models is using correlation between topics via the

logistic normal distribution. At the same time, the non-conjugacy of the logistic normal distribution

adds complexity to the Variational Inference process.

The drawback of using the logistic normal is that it is not conjugate to the multinomial, which

complicates the corresponding approximate posterior inference procedure.

Therefore, a negative mark for models based on CTM is that they take a very long time in execution, while inference algorithms based on LDA models are faster than those models based on CTM.

2.6

Conclusion

This chapter dealt with the various topic models that were implemented in our work: CorrLDA

and CorrCTM. It is important to note that each module was implemented separately for each model,

each equation was tested apart before it was assembled in the model. This strategy allowed us to avoid

many errors in the implementation.

It should be noted that in the implementation stage, we encountered many problems that are basically mathematical. Sometimes the inference algorithm diverges. This is a problem of initialization

parameters. We spent a lot of time in finding good intialisation leading the algorithm to converge.

In the next chapter, we will test the performance of each model in the annotation using many

comparison criteria. We will also present a comparative study between them based on the results and

curves obtained.

31

Chapter 3

Experiences and Results

3.1

Introduction

In the previous chapter, we introduced the models that we used in this work. These models would

not be important if we do not test their perfermance to achieve our goals.

In this chapter, we will present the assessment of our models for image annotation. We will

test the performance of each model by testing the variation of average predictive likelihood, average

perplexity for feature and word for various values of number of topics. We will discuss the different

results of these experiments, interpret each observation generated and explain the cause of each gap

between the values.

But before presenting our study, we will describe the environmental material and technology used

in this work, the image database that was used and the method adopted to divide it into two bases :

the first for training and the second for testing.

3.2

3.2.1

Technology and material environment

Hardware environment

My project was carried out using a laptop HP having the following characteristics :

– Processor AMD Athlon II DUAL-Core M320(2.1GHz).

– RAM 4 Go.

– Hard drive 500 Go.

– Display 15 inches.

We used other more powerful machines belonging to the laboratory to execute algorithms of high

complexity.

3.2.2

Technology and software environment

Elementary OS-i386.20130810 : The operating system used is ”Elementary OS”. It is a GNU /

Linux distribution designed to create a complete environment (called Pantheon) based on the GNOME

desktop as well as on the Ubuntu distribution.

32

We used this system because it is very light and robust especially when using the virtual machine.

Matlab 13 : (matrix laboratory) is a multi-paradigm numerical computing environment and fourthgeneration programming language. Developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces and

interfacing with programs written in other languages, including C, C++, Java, and Fortran. Matlab

was used to implement several mathematical algorithms useful to our work.

Code::Blocks 10.05 : It is a free and open source, cross-platform IDE which supports multiple

compilers including GCC and Visual C++. This editor is used to implement topic models for our

project. Among the reasons that lead us to use Code::Blocks are :

– Backup compilation (generation) and execution with a single key (F9).

– Code errors are directly shown in the code.

– Automatic formatting.

– Direct command help C / C + + from the editor with F1.

3.3

Data configuration

We consider our models with the Corel 5k dataset. As Corel 5K is a copyright dataset, we just use

the public statistic and evaluate the annotation performance. This dataset includes 5000 images (4500

for training, 500 for testing) and 260 words.

While there exists a great variety of image features in the literature to choose from, as the purpose

of our work is to compare model rather than feature selection, we restrict our choice to SIFT feature,

which has been shown to be discriminative in numerous classification and recognition problem.

3.4

Training models

Now we will launch training models for each value of K (number of topics). the train data set

contains 5000 images. we will add in the tab «Set programs argument» of CodeBlocks editor :

[est] [alpha] [k] [settings] [data] [random/seeded/*] [directory]

The term [est] : means estimation.

The term [K] : number of topics.

The term [settings] : contains values of convergence criteria .

The term [data] : contains train data set.

The term [random] : describes how the topics will be initialized. "random" initializes each topic

randomly; "seeded" initializes each topic to a distribution smoothed from a randomly chosen document

The term [directory] : is the result of training.

Example: est 0.1 2 settings.txt train.txt random out2.

33

The following figure (Figure 3.1) shows the execution of the program during the learning phase

of CorrLDA.

Figure 3.1: Training CorrLDA

In this case, a directory named out2 appears in the current directory. It represents the learning

model for two topics. To test the result of the annotation model for other value K, we must start

learning again for the new value. For this reason we start learning for different values of K. We

obtained directories out2, out 20 and out50 as the following figure (Figure 3.2) shows for K = 2, 20

and 50.

Figure 3.2: Different versions of the learning models

34

3.5

Testing models

The learning phase is complete, we will now test the model by using test data set. The test data

contains 500 images without annotation. The model will estimate an annotation for each image. We

will add in the tab «Set programs argument» of CodeBlocks editor :

inf [settings] [model] [data] [name]

Variational Inference is performed on the data using the model in [model].

Example: inf settings.txt out50.final test.txt test.

The following figure (Figure 3.3) shows the execution of the program during the test phase of

CorrLDA for number of topics equal to 50.

Figure 3.3: Testing CorrLDA

A directory named test appears in the current directory, it contains 5 files, each file begins with

the prefix final as the following figure (Figure 3.4) shows.

35

Figure 3.4: Result of testing CorrLDA

The file final50-pd.dat contains value of like likelihood for each document in test data set.

The file final50-gamma.dat contains the matrix gamma for the test data set.

The file final50-anno.dat contains 500 lines such that each row represents a ranked list of textual

words assigned to the corresponding image. This list is ordered by the probability of each textual word

knowing the current iamge. Each word is represented by its index in the list of textual vocabulary. The

following figure (Figure 3.5) shows the annotation assigned to the documents of the data set.

Figure 3.5: Annotation of CorrLDA

Take for example the first image, we take the top 4 of words which had index {4,2,6,11}, corresponding to : {jetski, sky,sand,beach}.

36

We can verify annotation of the first image by opening the file containing test data set. Figure (3.6)

shows test data set ”test.txt”.

Figure 3.6: Test data set

The first row is the total number of images in the base equal to 499.

From the second line each image is shown in a couple of lines :

– [M] [term_1] [term_2]... [term_M] : where [M] is the number of textual words in the document.

– [N] [term_1] :[count] [term_2] :[count] ... [term_N] :[count] : where [N] is the number of visual

words in the document and the [count] associated with each term is how many times that term

appeared in the document. Note that [term_1] is an integer which indexes the term.

Indexation of first document is put in the red rectangle : {1,2,3,4} corresponding to {sea,sky,person,jetsky}.

The list of basic textual words with their index is put into the appendix A.

3.6

Test set likelihood

We trained a number of models for both corrCTM and corrLDA by changing values of the topic

number K on the corel 5K dataset. To compare the generalization performance of these two models,

we compute the per-image average log likelihood of the hold-out test set.

Note that for the values of CorrCTM, we adopted results of others searchers in order to compare

the results found in CorrLDA because the evaluation of CorrCTM model is not yet completed, we

hope finish as soon as possible.

A model which better fits the data will assign a higher likelihood to the test set (for the log

likelihood, lower value in abstract is better). We see in Figure 3.7 that CorrCTM provides a better

37

fit than CorrLDA in the held-out test data for all values of K. This indicates that CorrCTM is more

robust to over fitting when changing the topic number than CorrLDA.

Figure 3.7: The average test likelihood

3.7

Word Perplexity

Word Perplexity is a parameter for testing the quality of annotation performance provided by the

topic model. A highly efficient model corresponds to low values of Word Perplexity. The Word Perplexity P(word) denoted in the following equation shows how well the model predicts the conditional

probability p(w|r) for all images in held out test data set.

n D Md

o

log(wm |rd )

P(word)== exp − ∑d=1 ∑m=1

D M

∑d=1

d

We see in the following figure (Figure 3.8) that models based on CTM like CorrCTM provide

lower perplexity value compared to LDA based models like CorrLDA for different number of topics

K. This essentially increases the capability of CTM based models to capture topic correlation.

Figure 3.8: Word Perplexity

38

We observe that CorrCTM gains almost reduction on word perplexity on Corel 5K. This result

is anticipated because topic assignments are correlated according to the prior covariance structure

among topics.

A word will be selected depending on the propagation of topic correlation, which leads to a lower

perplexity for word in CorrCTM compared with CorrLDA.

3.8

Conclusion

We presented in this chapter the general process to perform the annotation using topic models.

This process is composed of two steps : a training step and testing step. After the testing phase we

tried to evaluate the perfermance annotation of each model. Models based on CTM provide higher

performance than models based on LDA. But it should be noted that models based on CTM have large

complexity algorithms.

39