
Ministry of Higher Education and Scientific Research

Université Tunis El Manar

Department of Information and Communication Technologies

Final Year Project (Projet de Fin d'Études)
presented by
Idriss HASSINE

to obtain
the National Engineering Diploma
in Computer Science

Automatic image annotation using topic models

carried out at
LR-SITI-ENIT

Supervisor

Mr. Hamid AMIRI, Professor (ENIT)

Academic Year 2013/2014

Signatures
Company supervisor

ENIT supervisor

Abstract

Image annotation is an important open problem in computer vision research. Because images are often captured without captions, the goal of image annotation is to assign relevant text keywords from a word vocabulary to an image in order to describe its visual content. Image annotation is useful for many tasks, such as object recognition or text-based image retrieval. Therefore, a lot of attention has been paid in recent years to the design and development of systems for the automatic annotation of images.
Specifically, we are interested in topic models, and we evaluate their performance on image annotation.

Keywords: Automatic Image Annotation (AIA), topic models.

Acknowledgements

This project would not have been possible without the help and support of many friends and colleagues.
Foremost, I thank my advisor Mr. Hamid Amiri. He has been an exemplary teacher, mentor, and collaborator. I also thank Mr. Marouan Haj Ayech for his insightful comments and suggestions.
I am fortunate to have interacted with a number of superb colleagues, all of whom have been influential to this project through collaboration, discussion, and constructive criticism, and I would like to recognize their contribution to this work.
My experience in LR-SITI-ENIT would not have been possible without the generous material aid from the officials of the research unit.
I thank all my friends and family for their support and distraction. I especially thank my parents Habib Hassine and Faten Hacheni and my sister Nesrine Hassine, who have given me a lifetime of love and care. Finally, I thank my brother and dear friend Mohamed Hassine for his kindness, patience, humor, friendship and support.

Dedicated to my parents, Faten Hacheni and Habib Hassine.

Contents

Introduction

1 State of the art
  1.1 Introduction
  1.2 Automatic image annotation
  1.3 Topic models for automatic image annotation
    1.3.1 Latent Dirichlet Allocation (LDA)
    1.3.2 Model Gaussian-Multinomial Mixture (GM-Mixture)
    1.3.3 Model Gaussian-Multinomial LDA (GM-LDA)
    1.3.4 Correspondence Latent Dirichlet Allocation (CorrLDA)
    1.3.5 Supervised Latent Dirichlet Allocation with Multivariate Binary Response Variables (sLDA-bin)
    1.3.6 Topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA)
    1.3.7 Max-Margin Latent Dirichlet Allocation (MMLDAa)
    1.3.8 Correspondence CTM (corrCTM)
  1.4 Annotation Performance
    1.4.1 Datasets
      1.4.1.1 PASCAL VOC07
      1.4.1.2 Corel 5K
      1.4.1.3 LabelMe
      1.4.1.4 UIUC-Sport
    1.4.2 Visual feature extraction
      1.4.2.1 Scale Invariant Feature Transform (SIFT)
      1.4.2.2 K-means Clustering
    1.4.3 The performance measures
      1.4.3.1 Caption perplexity
      1.4.3.2 Precision and Recall
      1.4.3.3 F-measure
      1.4.3.4 Mean Average Precision (MAP)
  1.5 Conclusion

2 Detailed study of used topic models
  2.1 Introduction
  2.2 Data representation
  2.3 Framework of topic modeling
    2.3.1 Learning
      2.3.1.1 EM algorithm
      2.3.1.2 Variational EM algorithm
      2.3.1.3 Variational Inference
    2.3.2 Inference Algorithm
  2.4 Topic models for image annotation
    2.4.1 Correspondence LDA (CorrLDA)
      2.4.1.1 Variational Inference
      2.4.1.2 Maximizing Variational Parameters
      2.4.1.3 Maximizing Model Parameters
      2.4.1.4 Variational EM
    2.4.2 Correspondence Correlated Topic Model (CorrCTM)
      2.4.2.1 Variational Inference
      2.4.2.2 Maximizing Variational Parameters
      2.4.2.3 Maximizing Parameter Estimation
      2.4.2.4 Learning algorithm
  2.5 Comparison between LDA and CTM based topic models
    2.5.1 Topic proportions θ
    2.5.2 Variational Parameters
    2.5.3 Inference Process
  2.6 Conclusion

3 Experiments and Results
  3.1 Introduction
  3.2 Technology and material environment
    3.2.1 Hardware environment
    3.2.2 Technology and software environment
  3.3 Data configuration
  3.4 Training models
  3.5 Testing models
  3.6 Test set likelihood
  3.7 Word Perplexity
  3.8 Conclusion

Conclusion and outlook

Bibliography

Appendix A: Indexing of textual words (Corel 5K)

Appendix B: LDA

List of Figures

1.1 Graphic model LDA (Blei 2003)
1.2 GM-Mixture
1.3 GM-LDA
1.4 CorrLDA
1.5 Topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA)
2.1 General process for a topic model
2.2 (Left) Graphical model representation of our modified CorrLDA. (Right) Graphical model representation of the variational distribution used to approximate the posterior in Corr-LDA.
2.3 (Left) Proposed CorrCTM. (Right) Variational distribution used to approximate the posterior for each image in CorrCTM.
3.1 Training CorrLDA
3.2 Different versions of the learning models
3.3 Testing CorrLDA
3.4 Result of testing CorrLDA
3.5 Annotation of CorrLDA
3.6 Test data set
3.7 The average test likelihood
3.8 Word Perplexity

List of Tables

2.1 Symbol and Description (CorrLDA)
2.2 Parameter and Update Rule
2.3 Symbol and Description (CorrCTM)
3.1 Symbol and Description (LDA)

List of Algorithms

2.1 EM algorithm
2.2 Variational EM algorithm
2.3 Inference Algorithm, note that r is a set of regions belonging to image I
2.4 Variational Inference for corrLDA
2.5 Parameter Estimation for corrLDA
2.6 Variational Inference for CorrCTM
2.7 Parameter Estimation for CorrCTM
3.1 Variational Inference for LDA
3.2 Parameter Estimation for LDA

List of Acronyms

LDA          Latent Dirichlet Allocation
GM-Mixture   Gaussian-Multinomial Mixture model
GM-LDA       Gaussian-Multinomial Latent Dirichlet Allocation
CorrLDA      Correspondence Latent Dirichlet Allocation
sLDA-bin     Supervised Latent Dirichlet Allocation with Multivariate Binary Response
tr-mmLDA     Topic-regression Multi-Modal Latent Dirichlet Allocation
MMLDAa       Max-Margin Latent Dirichlet Allocation
CTM          Correlated Topic Model
CorrCTM      Correspondence CTM
sCTM-bin     Supervised CTM with Multivariate Binary Response
trmmCTM      Topic-regression Multi-Modal CTM
SIFT         Scale-Invariant Feature Transform
R            Recall
P            Precision
mR           Mean recall
mP           Mean precision
MAP          Mean Average Precision
p            Probability
Mult         Multinomial law
N(.,.)       Normal law
E-Step       Estimation Step
M-Step       Maximization Step
L            Lower bound
Unif         Uniform law
SVMs         Support Vector Machines
maxiter      Maximum number of iterations
Γ            Gamma function
Dir          Dirichlet law
Eq           Posterior Expectation

Introduction
Background: Image annotation is an important open problem in computer vision research. Because large numbers of images are available without captions, the goal of image annotation is to assign relevant text keywords from a word vocabulary to an image in order to describe its visual content. Image annotation can be applied to object recognition (proposing keywords for an image) or text-based image retrieval (proposing images from a combination of keywords). That is why a lot of attention has been paid in recent years to the design and development of systems for automatically annotating images.
Generally, an annotation algorithm is performed in two processes. The first, offline, process consists in learning patterns of association between images and text from a set of annotated images. The second, online, process predicts textual annotations for any unlabeled image. Previous approaches can be mainly categorized into two types.
In the first type, a classifier is learned for each vocabulary word, which is treated as a "concept class". This is a supervised learning model, and SVMs or boosting classifiers are typically used. In practice, this line of approaches inevitably suffers from a scalability problem, as the annotation performance decreases sharply when the discriminative boundaries of large-scale classifiers become ambiguous.
In the second type, the joint probability distribution between image features and text words is learned by a generative latent factor framework, which assumes that each image is formed from a small set of hidden factors, and these factors correspondingly associate with text words.
Contributions: In this project we use the second approach, with the following objectives:
– Implementing the most commonly used topic models.
– Evaluating the annotation performance of each model.
– Implementing an interactive system for image annotation, which takes an unannotated image as input and generates as output an annotation composed of words from a fixed vocabulary.
The host laboratory: This work was performed in the LR-SITI-ENIT laboratory.
This laboratory is part of the National Engineering School of Tunis (ENIT). Its research areas are varied, including image processing, pattern recognition and computer graphics. In the area of pattern recognition, LR-SITI-ENIT has developed new approaches for both 2D and 3D problems of object segmentation and recognition. The implemented algorithms are used in various fields that need machine learning, such as the biomedical and industrial fields.
The main work of LR-SITI-ENIT focuses on:
– The development of models and learning methods for prediction and classification.
– Proposed methods for the segmentation and characterization of 2D and 3D images.
Application areas favored by researchers are document images and medical images.
Report structure: This report is organized as follows. In Chapter 1, we describe the topic models of the state of the art and some criteria adopted by researchers to evaluate their models.
In Chapter 2, we describe the general process of automatic image annotation using topic models. Then, we present the models that we used in our work.
In Chapter 3, we present the technology and material environment used in our project. Then, we include some screenshots to describe the work performed. Finally, we conduct a comparative study between the implemented topic models by evaluating the performance of each model.

Chapter 1
State of the art

1.1 Introduction

Automatic image annotation is the ability of a system to provide a set of words for a given image in order to describe its visual content, without any human intervention. This task is the subject of extensive research in image processing.
In this chapter, we will talk about automatic image annotation and then present some models used to annotate images. The main role of these models is to provide the automatic annotation of images. Note that these models can also perform other tasks, such as annotating an image region with a single word, disambiguating a word by using visual content, or searching images.
Probabilistic models for automatic annotation consist in estimating, for each word ω, the posterior probability:
– p(ω | b), where b describes the known information about the blob (for example, the vector that represents the visual content of the blob) for a corresponding model.
– p(ω | d), where d describes the known image information (for example, the set β_d of all the visual vectors of the image).
In order to determine the ability of a system to annotate an image, we need a corpus of images that were previously annotated manually with a bag of reference words W_Ref^d. The aim of the system is then to estimate, for a given picture, an annotation that reflects this reality, using a set of words in a specific way.
Firstly, we will briefly introduce the automatic annotation of images; then we will describe some probabilistic models of the state of the art, which are based essentially on the Dirichlet distribution. We will study the contribution of each model to image annotation compared to the other models. Finally, we will present the criteria used to measure performance, their characteristics, and the conditions necessary to use them.

1.2 Automatic image annotation

The objective of automatic annotation is to associate a set of words to a given image via a computer system.
The principle of this method is:
– First, the computer system must learn to annotate images based on previously annotated images.
– Then, the system must be able to annotate a new image that has no text description.
For probabilistic models, automatic annotation consists in estimating the posterior probability:
– p(ω | i), where i describes the information generated from the image (for example, all the visual vectors of the image).
– If a segmented image is available, we must first estimate the posterior probability p(ω | b), where b describes the information generated from an image region (for example, the vector with the visual content of the image region).

1.3 Topic models for automatic image annotation

In this section, we will focus on the models based on Latent Dirichlet Allocation (LDA). Therefore, we first introduce LDA, and we then study LDA extensions to the image annotation problem: GM-Mixture, GM-LDA, CorrLDA, sLDA-bin, tr-mmLDA, MMLDAa and CorrCTM.

1.3.1 Latent Dirichlet Allocation (LDA)

In this part, we will focus on the models based on the Dirichlet distribution. The Dirichlet distribution generates the probability vector Θ_d = (p_1, p_2, ..., p_r), where p_j is the probability that the concept (also called hidden or latent class) z_j is in the document, depending on the frequency of occurrence of each concept α_j in the document.

Figure 1.1: Graphic model LDA (Blei 2003)

Latent Dirichlet Allocation (LDA) is a probabilistic model proposed by Blei (Blei 2003). The purpose of LDA is to model discrete data, in particular large collections of textual documents (see Figure 1.1 and part B of the appendix).
The number of dimensions corresponds to the number of descriptors of the hidden concepts. LDA is a generative probabilistic model of a corpus. It is a hierarchical Bayesian model with three levels that considers words and concepts as interchangeable.
The main idea is that documents are described as random mixtures over latent topics, where each topic is defined by a distribution over words.
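To make this generative view concrete, the following minimal sketch (Python with NumPy; the number of topics, vocabulary size and document length are hypothetical) draws one document from the LDA generative process:

import numpy as np

rng = np.random.default_rng(0)
K, V, N = 4, 1000, 50                    # topics, vocabulary size, document length (hypothetical)
alpha = np.full(K, 0.1)                  # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.01), size=K)   # one word distribution per topic

theta = rng.dirichlet(alpha)             # 1. draw the topic proportions of the document
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)           # 2a. draw a latent topic for this word
    doc.append(rng.choice(V, p=beta[z])) # 2b. draw the word from that topic's distribution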
In the next section we will describe some models of the state of the art.

1.3.2 Model Gaussian-Multinomial Mixture (GM-Mixture)

GM-Mixture is a simple mixture model (Blei 2004) that does not use LDA, but it is important to describe it in order to present the following models, which are inspired by LDA.
GM-Mixture uses a single random variable Z, and therefore only one component z. It is generally used to express the relationship between an image and its annotation. The strength of this model is that it provides high associativity between the blobs and the image caption. [1]
The principle of this model: denoting by r the number of dimensions of GM-Mixture, the annotation of an image d follows these steps:
1. Choose a value z from {1, 2, ..., r} according to p(Z | θ) ∼ Mult(θ).
2. For each blob b of the image:
– The visual features of b are selected according to p(B_d | z, µ_{1:r}, σ_{1:r}) ∼ N(µ_z, σ_z).
– The word ω is selected according to p(W_d | z, β_{1:r}) ∼ Mult(β_z).
The following figure (Figure 1.2) describes the process of this model.

Figure 1.2: GM-Mixture

1.3.3 Model Gaussian-Multinomial LDA (GM-LDA)

The GM-LDA model extends the GM-Mixture model using LDA. The strength of this model is that it provides high flexibility between the blobs and the image caption. [1]
The principle of this model: denoting by r the number of dimensions of GM-LDA, the visual-textual annotation follows this generative process:
1. Choose proportions θ_d according to p(θ | α) ∼ Dir(α).
2. For each p ∈ {1, 2, ..., n_{β_d}}:
– Select a component z_p from {1, 2, ..., r} according to p(Z_d | θ_d) ∼ Mult(θ_d).
– Choose the visual vector b_p according to p(B_d | z_p, µ_{1:r}, σ_{1:r}) ∼ N(µ_{z_p}, σ_{z_p}).
3. For each j ∈ {1, 2, ..., n_{ω_d}}:
– Select a component y_j from {1, 2, ..., r} according to p(Y_d | θ_d) ∼ Mult(θ_d).
– Choose the word ω_j according to p(W_d | y_j, β_{1:r}) ∼ Mult(β_{y_j}).
The following figure (Figure 1.3) describes the process of this model.

Figure 1.3: GM-LDA

1.3.4 Correspondence Latent Dirichlet Allocation (CorrLDA)

This model inherits the flexibility of the GM-LDA model and the associativity of the GM-Mixture model. Presented by (Blei & Jordan, 2003), this model has a partial interchangeability. It first generates the blobs in a first phase, and then the words in a second phase. [1]
The principle of this model is:
1. Choose proportions θ_d according to p(θ | α) ∼ Dir(α).
2. For each p ∈ {1, 2, ..., n_{β_d}}:
– Select a component z_p from {1, 2, ..., r} according to p(Z_d | θ_d) ∼ Mult(θ_d).
– Choose the visual vector b_p according to p(B_d | z_p, µ_{1:r}, σ_{1:r}) ∼ N(µ_{z_p}, σ_{z_p}).
3. For each j ∈ {1, 2, ..., n_{ω_d}}:
– Select a component y_j according to p(Y_d | θ_d) ∼ Unif({1, 2, ..., n_{β_d}}).
– Choose the word ω_j according to p(W_d | y_j, β_{1:r}) ∼ Mult(β_{y_j}).
The following figure (Figure 1.4) describes the process of this model.

Figure 1.4: CorrLDA
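As an illustration, here is a minimal sketch of the CorrLDA generative process in Python/NumPy. It follows the discrete variant used in Chapter 2, where the Gaussian over region features is replaced by a multinomial δ over a visual vocabulary; all sizes are hypothetical.

import numpy as np

rng = np.random.default_rng(1)
K, Vr, Vw = 5, 500, 200                  # topics, visual and textual vocabulary sizes
N, M = 40, 4                             # regions and caption words of one document
alpha = np.full(K, 0.5)
delta = rng.dirichlet(np.full(Vr, 0.1), size=K)  # topic -> visual word distributions
beta = rng.dirichlet(np.full(Vw, 0.1), size=K)   # topic -> textual word distributions

theta = rng.dirichlet(alpha)                     # topic proportions of the document
z = rng.choice(K, size=N, p=theta)               # one latent topic per region
regions = [rng.choice(Vr, p=delta[zi]) for zi in z]
y = rng.integers(0, N, size=M)                   # each caption word picks a region uniformly
caption = [rng.choice(Vw, p=beta[z[yi]]) for yi in y]

Each caption word is thus tied to the topic of an actual region of the image, which is the source of the model's associativity.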

1.3.5 Supervised Latent Dirichlet Allocation with Multivariate Binary Response Variables (sLDA-bin)

Supervised LDA builds on LDA, which employs hidden variables given a collection of documents (as bags of words). LDA decomposes the distribution of word counts of each document into contributions from K topics. In LDA, a document is modeled as a proportion of topics (a mixture of topics). Each topic is a multinomial distribution over words. If the number of topics K is much smaller than the size of the vocabulary, LDA can be regarded as performing dimensionality reduction, learning a small set of projection directions (topics) that account for most correlations in the data.
Working with labeled documents, or documents paired with their associated response values, the purpose is to predict the response values of a given document. This is precisely the objective that led to the appearance of supervised LDA (sLDA). [2]
In particular, sLDA adds to LDA a linear regression module that allows a real-valued response variable to be predicted linearly from the empirical topic proportion z̄ = (1/N) ∑_{n=1}^N z_n.

1.3.6 Topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA)

Tr-mmLDA is a statistical topic model whose objective is to perform image and video annotation. The originality of this model comes from a new latent-variable regression approach that is very useful for capturing correlations between image or video features and annotation text. It does not rely on the older technique based on sharing a single set of latent topics between the two data modalities.
It is a new orientation that incorporates a regression model to link the two sets of topics, which captures more general forms of association while allowing the numbers of topics in the two data modalities to be different. [3]
We assume that we have K image topics and L text topics, and that we have an LDA model containing Z = {z_1, z_2, ..., z_N} and topic proportion θ. A real-valued topic proportion variable for the caption text is x ∈ R^L, such that x = A z̄ + µ + n, where:
– A is an L × K matrix of regression coefficients.
– µ is a vector of mean parameters.
– n ∼ N(n; 0, Λ) is zero-mean uncorrelated Gaussian noise with a diagonal precision matrix Λ.
The principle of this model [3]:
1. Draw an image topic proportion θ | α ∼ Dir(α).
2. For each r_n, n ∈ {1, 2, ..., N}:
– Draw a topic assignment z_n = k | θ ∼ Mult(θ).
– Draw a visual word r_n = t | z_n = k ∼ Mult(β^r_k).
3. Given the empirical image topic proportion z̄ = (1/N) ∑_{n=1}^N z_n, draw a real-valued topic proportion variable for the caption text x | z̄, A, µ, Λ ∼ N(x; A z̄ + µ, Λ).
4. Compute the topic proportions η_l = exp(x_l) / ∑_{k=1}^L exp(x_k).
5. For each caption word ω_m, m ∈ {1, 2, ..., M}:
– Draw a topic assignment s_m = l | η ∼ Mult(η).
– Draw a caption word ω_m = t | s_m = l ∼ Mult(β^ω_l).
The following figure (Figure 1.5) describes the process of this model.

Figure 1.5: Topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA)
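The following sketch (Python/NumPy, hypothetical sizes and parameter values) runs through the five steps above, including the regression and logistic mapping that link the K image topics to the L text topics:

import numpy as np

rng = np.random.default_rng(2)
K, L = 6, 4                              # image topics and text topics
Vr, Vw, N, M = 300, 150, 30, 5
alpha = np.full(K, 0.5)
beta_r = rng.dirichlet(np.full(Vr, 0.1), size=K)   # image-topic word distributions
beta_w = rng.dirichlet(np.full(Vw, 0.1), size=L)   # text-topic word distributions
A = rng.normal(size=(L, K))              # regression coefficient matrix
mu = np.zeros(L)
lam = np.ones(L)                         # diagonal precision of the Gaussian noise

theta = rng.dirichlet(alpha)                       # step 1
z = rng.choice(K, size=N, p=theta)                 # step 2
r = [rng.choice(Vr, p=beta_r[zi]) for zi in z]
zbar = np.bincount(z, minlength=K) / N             # empirical image topic proportion
x = A @ zbar + mu + rng.normal(scale=1/np.sqrt(lam))   # step 3: regression
eta = np.exp(x) / np.exp(x).sum()                  # step 4: logistic mapping
s = rng.choice(L, size=M, p=eta)                   # step 5
w = [rng.choice(Vw, p=beta_w[si]) for si in s]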

1.3.7 Max-Margin Latent Dirichlet Allocation (MMLDAa)

In this part, we will describe the problem of image classification and annotation by presenting the MMLDA model. This is a hierarchical model that incorporates max-margin discriminative learning. It is based on sLDA (supervised LDA) and MedLDA (maximum entropy discrimination LDA), and is directly inspired by MedLDA. The objective is to enhance suitability for vision tasks. The feature vector used for classification in MedLDA is built only on the latent topic representation. However, for many vision applications, raw feature representations usually give excellent performance, and it was not possible to solve many image classification problems with MedLDA until the appearance of MMLDA. [4]
There are two types of MMLDA:
– MMLDAc: for image classification.
– MMLDAa: for image annotation.
MMLDAa is used for the image annotation problem. The purpose of this model is to use a set of tags to annotate a given image. This is not a standard classification problem, because multiple tags can be associated with a single image.


1.3.8 Correspondence CTM (corrCTM)

Most of the models mentioned so far are inspired by the famous LDA model, which is based on proportions of topics that are independent, with no correlation between them. By modeling these correlations, corr-CTM makes annotation more effective. [5]
Corr-CTM is derived from the CTM. Compared with LDA, this model makes the following two changes:
1. It uses a multivariate normal distribution instead of the Dirichlet in order to capture topic correlation.
2. It adds an additional logistic normal mapping for the normally distributed topics.
These two contributions give the CTM model a better fit than LDA for representing textual collections, thanks to modeling a correlated topic space instead of a space of independent topics (LDA). Corr-CTM first uses a covariance matrix to capture the correlation, and then a logistic normal function instead of the Dirichlet in a second stage.

1.4 Annotation Performance

1.4.1 Datasets

In this section, we will present the various datasets used as a support to test the efficacy of each model.
1.4.1.1 PASCAL VOC07

It contains 10000 pictures grouped into 20 object classes and annotated keeping only the tags that appear at least 5 times, which provides a base of 6200 pictures with a vocabulary of 186 words in total.

1.4.1.2 Corel 5K

It contains 5000 images; each image is annotated manually with 1 to 5 keywords from a vocabulary of 260 words.

1.4.1.3 LabelMe

It contains 2687 images, 2149 for training and 538 for testing. This base removes the words whose annotation is repeated fewer than 3 times, giving a vocabulary of 136 entities. LabelMe contains 8 different types of scenes: "coast", "forest", "highway", "inside the city", "mountain", "open country", "street" and "high-rise building".

1.4.1.4 UIUC-Sport

It contains 8 categories of sporting events: rowing (250 images), badminton (200 images), polo (182 images), petanque (137 images), snowboarding (190 images), croquet (236 images), sailing (190 images), and rock climbing (194 images). The images are classified into easy and medium according to judgments from human subjects. Details on the distance of the foreground objects are also supplied for each image.

1.4.2 Visual feature extraction

After detecting the salient characteristics of a given image, each image is abstracted as several local patches. Feature representation methods describe how these patches are represented as numerical vectors.
These vectors are called feature descriptors. A good descriptor must be able to handle changes in intensity, scale and rotation, and affine changes to some extent. Among these descriptors we can mention the Scale-Invariant Feature Transform (SIFT), which is described in the next section.

1.4.2.1 Scale Invariant Feature Transform (SIFT)

SIFT, the Scale-Invariant Feature Transform, is an algorithm developed in 1999 by David Lowe. This algorithm is used in the area of computer vision for detecting and identifying similar components between different digital images.
The basic step of Lowe's method is to compute what are called "SIFT descriptors" for the images under study. This numerical information is derived from the local analysis of an image and characterizes its visual content in a way that is robust to the viewpoint, the scale (zoom and sensor resolution), the framing, the angle of observation and the exposure (brightness).
Therefore, two photographers who take a picture of the same object are very likely to obtain similar SIFT descriptors, and these descriptors become increasingly similar as the shooting times and viewing angles get closer.
The applications of the method are numerous and still growing. Since the early twenty-first century it has been used in fields such as object detection, mapping and navigation, photo stitching, 3D modeling, content-based image retrieval, video tracking and match moving.
1.4.2.2 K-means Clustering

K-means clustering is a process of vector quantization. It is often used to reduce the dimensionality of input data while keeping the pertinent information. This method divides n observations into k clusters in an unsupervised fashion. The main idea behind K-means is to assign each observation to its nearest cluster.
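As a sketch of how a visual vocabulary can be built, the following Python/NumPy code runs plain Lloyd iterations on a randomly generated stand-in for descriptor data; in practice X would hold the SIFT descriptors extracted from the training images.

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: quantize descriptors into k visual words."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # squared Euclidean distance of every descriptor to every center
        d = (X ** 2).sum(1)[:, None] - 2 * X @ centers.T + (centers ** 2).sum(1)
        labels = d.argmin(1)
        for j in range(k):               # move each center to the mean of its points
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, labels

X = np.random.default_rng(1).normal(size=(10000, 128))   # stand-in for SIFT descriptors
vocabulary, assignments = kmeans(X, k=500)               # 500 visual words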

1.4.3 The performance measures

Several methods have been proposed to test the performance of automatic annotation:
– Asking novice users to assess the results obtained from the system.
– Measuring the performance based on keywords associated with the image, such as the words surrounding an image on the web.
– Measuring the performance based on keywords associated with the image, assigned manually by professionals.
The annotated set is called the ground truth.
In the next part, we will present the criteria most used to measure the performance of automatic image annotation.
1.4.3.1 Caption perplexity

In information theory, perplexity is a measure of how well a probability distribution or a probability model predicts samples. It can be used to compare probability models. Perplexity is an evaluation measure often used in language modeling. It is monotonically decreasing in the likelihood of the test data, and is algebraically equivalent to the inverse of the geometric mean per-word likelihood. [8]
The lower the perplexity score, the better the generalization performance. For a test set of M documents, the formula of caption perplexity is:

perplexity = exp{ −(∑_{d=1}^M log p(w_d)) / (∑_{d=1}^M N_d) }
1.4.3.2 Precision and Recall

Recall and precision are two classical measures in information retrieval. Precision is the ratio of the number of relevant documents found to the total number of selected documents. Recall is the ratio of the number of relevant documents found to the total number of relevant documents. [7]
We denote by:
– n: the number of relevant documents.
– r: the number of relevant documents found.
– w: the number of non-relevant documents found.
The recall R and precision P are then defined by:

R = r / n and P = r / (r + w)

To use these two measures for automatic image annotation, we issue for each word of the lexicon a query q = {ω} containing only that word. Then, we compute the number n_{W≠∅} of words for which at least one image was found:

n_{W≠∅} = |{ω | R(ω) > 0 and P(ω) > 0}|

The next step consists in measuring the mean recall mR and the mean precision mP over all the words of W≠∅:

mR = (1 / n_{W≠∅}) ∑_{ω∈W≠∅} r(ω) / n(ω)

and

mP = (1 / n_{W≠∅}) ∑_{ω∈W≠∅} r(ω) / (r(ω) + w(ω))

where:
– n(ω): the number of test images annotated by ω in the ground truth.
– r(ω): the number of images whose caption initially contains the word and that the system annotated with the word.
– w(ω): the number of images not initially annotated with this word but annotated by the system with it.
It is important to report the number n_{W≠∅} of words for which at least one image was found when calculating mP and mR, because a system can obtain a high average recall and a high average precision by predicting only the most frequent words.
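The computation can be sketched as follows (plain Python; the three-word lexicon and its counts are hypothetical):

def mean_precision_recall(n, r, w):
    """n, r, w map each lexicon word to the counts defined above."""
    active = [t for t in n if n[t] > 0 and r[t] > 0]    # words with R > 0 and P > 0
    mR = sum(r[t] / n[t] for t in active) / len(active)
    mP = sum(r[t] / (r[t] + w[t]) for t in active) / len(active)
    return mP, mR, len(active)

n = {"sky": 40, "tiger": 10, "car": 25}   # ground-truth occurrences
r = {"sky": 30, "tiger": 4, "car": 0}     # correct system annotations
w = {"sky": 12, "tiger": 2, "car": 5}     # wrong system annotations
print(mean_precision_recall(n, r, w))     # "car" is excluded since r = 0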
1.4.3.3 F-measure

A popular measure of a test's accuracy combining precision and recall is:

F = 2 × (recall × precision) / (recall + precision)

This is called the F1 measure because precision and recall are weighted equally. It is a special case of the general measure F_η (for positive real values η):

F_η = (1 + η²) × (recall × precision) / (η² × recall + precision)

1.4.3.4 Mean Average Precision (MAP)

For a set of queries, the mean average precision is the mean of the average precision values of the queries:

MAP = (∑_{q=1}^Q AveP(q)) / Q

where:
– AveP(q): the average precision of query q.
– Q: the number of queries.
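For ranked retrieval, AveP(q) is usually computed as the mean of the precision values at the ranks of the relevant items; a minimal sketch with two hypothetical queries:

import numpy as np

def average_precision(ranked_relevance):
    """AP of one query; ranked_relevance[i] = 1 if the i-th result is relevant."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

queries = [[1, 0, 1, 0, 0], [0, 1, 1, 1, 0]]   # relevance of the ranked results
MAP = np.mean([average_precision(q) for q in queries])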

1.5 Conclusion

In this chapter, we presented the different topic models used for image annotation and the different criteria to evaluate them. We can note that the appearance of these models follows a chronological order. Sometimes, a new model is the fusion of two other models obtained by changing the laws of the parameters.
There are many different models introduced by various researchers. These models need to be evaluated to test the contribution of each one to image annotation. In this context, we presented the criteria used for testing the performance of these topic models, such as caption perplexity and the average test likelihood.
In the next chapter, we will present the models that we studied and specify the differences in structure between them. We will describe the algorithms for each model.
This study will be very detailed: we will start with the mathematical expressions of the model parameters, and we will then look for the appropriate equations that maximize the lower bound, in order to obtain the algorithm corresponding to each model.

Chapter 2
Detailed study of used topic models

2.1 Introduction

In this chapter, we will describe the topic models for image annotation, design them according to our own perception, and finally implement them.
We will start by describing the data used in our work; then we will study in depth the correspondence LDA (CorrLDA) model. We will present the whole learning process associated with this model, called variational EM: we will start by calculating the lower bound, then maximize it, and end by writing the learning algorithm of this model.
The same process is applied to the Correspondence Correlated Topic Model (Corr-CTM).

2.2 Data representation

In this part, we will describe the data used in our work.
– A corpus C is a collection of documents: C = {doc_1, ..., doc_d, ..., doc_D}.
– A document doc_d is a pair of an image r_d and its annotation w_d: doc_d = {r_d, w_d}.
– An image is a set of visual words r_d = {r_d1, ..., r_dn, ..., r_dN_d}.
– An annotation is a set of textual words w_d = {w_d1, ..., w_dm, ..., w_dM_d}.
– A visual word r_dn = (r_dn^1, ..., r_dn^j, ..., r_dn^{V_r})^T is a one-hot vector: there exists an i with r_dn^i = 1, and r_dn^j = 0 for all j ≠ i.
– A textual word w_dm = (w_dm^1, ..., w_dm^j, ..., w_dm^{V_w})^T is defined in the same way.
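In code, this representation amounts to storing each document as two lists of one-hot vectors (or, equivalently, of vocabulary indices); a minimal sketch with hypothetical vocabulary sizes:

import numpy as np

Vr, Vw = 500, 260                # visual / textual vocabulary sizes (hypothetical)

def one_hot(index, size):
    v = np.zeros(size, dtype=int)
    v[index] = 1
    return v

doc = {                          # doc_d = {r_d, w_d}
    "r": [one_hot(i, Vr) for i in [12, 301, 12, 87]],   # r_d1, ..., r_dNd
    "w": [one_hot(j, Vw) for j in [5, 42]],             # w_d1, ..., w_dMd
}
corpus = [doc]                   # C = {doc_1, ..., doc_D}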

2.3 Framework of topic modeling

The general process associated with a topic model is divided into two stages: learning and inference (see Figure 2.1).

Figure 2.1: General process for a topic model

2.3.1 Learning

Topic models are probabilistic models containing many hidden variables, so learning such models requires more powerful algorithms than traditional EM. Common alternative approaches are either Gibbs sampling or variational EM. In our case, we use the second approach because it is more cited in the literature.
In the next section, we will compare EM to variational EM.

2.3.1.1 EM algorithm

Traditionally, Expectation-Maximization (EM) is used to learn probabilistic models. EM is an iterative algorithm which performs two steps: an Expectation step to compute the expected log-likelihood, and a Maximization step to estimate the model parameters that maximize the corpus log-likelihood.
However, topic models, which are special cases of probabilistic models, do not use traditional EM. They rather apply a variant of EM called variational EM.
The main difference between EM and variational EM is the introduction of variational inference in the E-Step (see the following algorithms, 2.1 and 2.2).

Algorithm 2.1 EM algorithm
repeat
  E-Step:
    Q(π | π^(t)) = E[log L_{Z|X,π^(t)}(π; x, z)]
  M-Step:
    π^(t+1) = argmax_π Q(π | π^(t))
until (convergence)

where:
– z = {z_1, z_2, ..., z_d, ..., z_D}: hidden latent variables.
– x = {x_1, x_2, ..., x_d, ..., x_D}: sample of training observations.
– π: model parameters.
– logL: log-likelihood.
– Q: expectation of logL.
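As a concrete instance of these two steps, the sketch below runs EM on a two-component one-dimensional Gaussian mixture with fixed unit variances (a deliberately simplified example, not one of the topic models studied here):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # initial parameters

for _ in range(50):
    # E-Step: posterior responsibility of each component for each point
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2) * pi
    resp = dens / dens.sum(1, keepdims=True)
    # M-Step: parameters that maximize the expected log-likelihood
    pi = resp.mean(0)
    mu = (resp * x[:, None]).sum(0) / resp.sum(0)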
2.3.1.2 Variational EM algorithm

The variational EM algorithm, described below (Algorithm 2.2), is an iterative algorithm, each iteration of which consists of two steps:
– E-Step: finds, for each document, the optimizing values of the variational parameters.
– M-Step: maximizes the resulting lower bound on the log-likelihood with respect to the model parameters.
The algorithm converges once the perplexity of the model on the training corpus no longer changes.
The variational EM algorithm is thus based on the variational inference procedure that we explain in the next subsection.

Algorithm 2.2 Variational EM algorithm
repeat
  E-Step:
    for each observation x_d do
      [Vp_d] = Variational-Inference(x_d, π^(t))
    end for
  M-Step:
    π^(t+1) = argmax_π ∑_d L_{x_d}
until (convergence)

where:
– Vp_d: variational parameters of x_d.
– x_d: the d-th observation.
– L_{x_d}: lower bound on logL(x_d).
2.3.1.3 Variational Inference

As indicated in the LDA paper [7], the E-Step typically amounts to an inference step:

p(Z | X, π) = p(X | Z, π) p(Z | π) / p(X | π)

p(Z | X, π) is intractable because it is difficult to calculate p(X | π). To resolve this inferential problem, variational EM approximates logL_{x_d} = log p(X = x_d | π) by a lower bound L_{x_d}. In the LDA paper, this idea is expressed in the following equation:

logL_{x_d} = L_{x_d} + D_{KL}(q(Vp_{x_d}) ‖ p(Z | x_d, π))

where q(Vp_{x_d}) is called the variational distribution and Vp_{x_d} are the variational parameters associated with x_d.
In the EM algorithm, the ultimate goal is to find the model parameters that maximize the log-likelihood of the training data. Since the log-likelihood of the training data is the sum of the log-likelihoods of all the documents x_{d=1:D}, maximizing logL(corpus) is equivalent to maximizing logL(x_d) for each document x_d.
From the equation of logL_{x_d}, for each document x_d, maximizing logL(x_d) amounts to minimizing D_{KL} and maximizing L_{x_d}. Therefore, variational EM maximizes a lower bound L_{x_d} for each document in a procedure called variational inference, and returns the optimal values of the variational parameters.

2.3.2 Inference Algorithm

Once we obtain the variational and model parameters, it remains to predict annotations for an unknown image. The first step is to perform variational inference for the unknown image with the model parameters fixed to the values obtained from the training set.
To predict annotations, we approximate the conditional distribution over the words of the textual vocabulary. The conditional probability p(w | r) can be treated as the predicted confidence score of each annotation word w in the vocabulary, given all the codewords of the unknown image r. Then, we can choose the most promising annotations by ranking the confidence scores.
Algorithm 2.3 briefly describes this procedure.

Algorithm 2.3 Inference Algorithm, note that r is a set of regions belonging to image I
for each document doc_d = {r_d, w_d} in the test data set do
  for each textual word w_i in the textual vocabulary do
    doc_fictive = {r_d, w_i}
    [Vp] = Variational-Inference(doc_fictive, π)
    compute p(w_i | r) as a function of Vp
  end for
  rank the values of p(w_i | r)
  ŵ = top-5 of the ranked list
end for

where doc_fictive = {r_d, w_i} is an imaginary document composed of the regions of doc_d and one word from the textual vocabulary.
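The prediction loop can be sketched as follows; score_word is assumed to run variational inference on the fictive document {r_d, w} with fixed model parameters and return p(w | r):

def annotate(image_regions, score_word, vocabulary, top=5):
    """Rank every word of the textual vocabulary by its confidence
    score p(w | r) and keep the top ones, mirroring Algorithm 2.3."""
    scores = [(score_word(image_regions, w), w) for w in vocabulary]
    scores.sort(reverse=True)
    return [w for _, w in scores[:top]]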


2.4 Topic models for image annotation

In this section, we will present the models oriented to the task of image annotation and study them carefully.

2.4.1 Correspondence LDA (CorrLDA)

We introduce correspondence LDA (CorrLDA) as a model combining the flexibility of GM-LDA with the associativity of GM-Mixture. With this model, we reach simultaneous dimensionality reduction in the representation of region descriptions and words, while also modeling the conditional correspondence between their respective reduced representations.
The table below (Table 2.1) presents the symbols and descriptions used in CorrLDA.
Symbol   Description
D        Number of documents in the training set.
K        Number of topics.
M        Number of textual words in document d.
N        Number of visual words in document d.
V_r      Size of the visual vocabulary.
V_w      Size of the textual vocabulary.
v        Visual word: v_n^j = 1 only if the n-th visual word in document d is indexed j, else v_n^j = 0.
w        Textual word: w_m^j = 1 only if the m-th textual word in document d is indexed j, else w_m^j = 0.
y        Discrete indexing variable y ∈ {1, ..., N}.
z        Latent topic: z_n^i = 1 if z_n is the i-th latent topic, else z_n^i = 0.
θ        Multinomial. θ_i, i ∈ {1, ..., K}.
α        Dirichlet prior for θ. α_i, i ∈ {1, ..., K}.
δ        Multinomial. δ_ij, i ∈ {1, ..., K}, j ∈ {1, ..., V_r}.
β        Multinomial. β_ij, i ∈ {1, ..., K}, j ∈ {1, ..., V_w}.
γ        Variational Dirichlet. γ_i, i ∈ {1, ..., K}.
Φ        Variational multinomial. Φ_ni, n ∈ {1, ..., N}, i ∈ {1, ..., K}.
λ        Variational multinomial. λ_mn, m ∈ {1, ..., M}, n ∈ {1, ..., N}.

Table 2.1: Symbol and Description (CorrLDA)

2.4.1.1 Variational Inference

Simplifying the Graphical Model
To start our derivation, we first simplify the original graphical model described in Chapter 1 by introducing three variational parameters. We concentrate only on the hidden variables θ, z, y and give a simplified graphical model in the following figure (Figure 2.2). [9]

Figure 2.2: (Left) Graphical model representation of our modified CorrLDA. (Right) Graphical model representation of the variational distribution used to approximate the posterior in Corr-LDA. Note that the variational graphical model creates independence between z, θ and y, which enables us to factorize the joint probability.

The expression of the factorized distribution over the latent variables of a document is the following:

q(θ, z, y | γ, Φ, λ) = q(θ | γ) ∏_{n=1}^N q(z_n | Φ_n) ∏_{m=1}^M q(y_m | λ_m)

Note that γ is the variational Dirichlet, Φ_n is a variational multinomial over the K topics, and λ_m is a variational multinomial over the N regions.
– Finding the Lower Bound: By looking at the graphical model in Figure 2.2, we can factorize q(θ, z, y | γ, Φ, λ) into q(θ | γ) q(z | Φ) q(y | λ). So, we can write the expression of the lower bound as follows:

L[γ, Φ, λ; α, δ, β] = E_q[log p(θ | α)] + E_q[log p(z | θ)] + E_q[log p(r | z, δ)] + E_q[log p(y | N)]
  + E_q[log p(ω | y, z, β)] − E_q[log q(θ | γ)] − E_q[log q(z | Φ)] − E_q[log q(y | λ)]
– Expanding Factors: Now we need to expand each term to make L[γ, Φ, λ; α, δ, β] an explicit function of the model parameters α, δ, β and of the variational parameters γ, Φ, λ. These equations can be reused on several other occasions.

E_q[log p(θ | α)] = log Γ(∑_{j=1}^K α_j) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

E_q[log p(z | θ)] = ∑_{n=1}^N ∑_{i=1}^K Φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

E_q[log p(r | z, δ)] = ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^{V_r} Φ_ni v_n^j log δ_ij

E_q[log p(y | N)] = −log(N)

E_q[log p(ω | y, z, β)] = ∑_{m=1}^M ∑_{n=1}^N λ_mn ∑_{i=1}^K Φ_ni ∑_{j=1}^{V_w} w_m^j log β_ij

E_q[log q(θ | γ)] = log Γ(∑_{j=1}^K γ_j) − ∑_{i=1}^K log Γ(γ_i) + ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

E_q[log q(z | Φ)] = ∑_{n=1}^N ∑_{i=1}^K Φ_ni log Φ_ni

E_q[log q(y | λ)] = ∑_{m=1}^M ∑_{n=1}^N λ_mn log λ_mn

Thereafter we obtain the expression of the lower bound as a function of the model parameters (α, δ, β) and the variational parameters (γ, Φ, λ):

L[γ, Φ, λ; α, δ, β] = log Γ(∑_{j=1}^K α_j) − ∑_{i=1}^K log Γ(α_i) + ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  + ∑_{n=1}^N ∑_{i=1}^K Φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  + ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^{V_r} Φ_ni v_n^j log δ_ij
  − log(N)
  + ∑_{m=1}^M ∑_{n=1}^N λ_mn ∑_{i=1}^K Φ_ni ∑_{j=1}^{V_w} w_m^j log β_ij
  − log Γ(∑_{j=1}^K γ_j) + ∑_{i=1}^K log Γ(γ_i) − ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  − ∑_{n=1}^N ∑_{i=1}^K Φ_ni log Φ_ni
  − ∑_{m=1}^M ∑_{n=1}^N λ_mn log λ_mn

The next step is to maximize the lower bound: we first look for the values of the variational parameters (γ, Φ, λ) that maximize L[γ, Φ, λ; α, δ, β], and then for the model parameters (α, δ, β).

2.4.1.2 Maximizing Variational Parameters

– Variational Dirichlet γ: We have three variational parameters: γ, Φ and λ. We begin with the variational Dirichlet γ, by selecting the terms of L[γ, Φ, λ; α, δ, β] that contain γ:

L[γ] = ∑_{i=1}^K (α_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  + ∑_{n=1}^N ∑_{i=1}^K Φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  − log Γ(∑_{j=1}^K γ_j) + ∑_{i=1}^K log Γ(γ_i) − ∑_{i=1}^K (γ_i − 1)(ψ(γ_i) − ψ(∑_{j=1}^K γ_j))

We differentiate L[γ] with respect to γ_i and obtain:

dL[γ]/dγ_i = ψ′(γ_i)(α_i + ∑_{n=1}^N Φ_ni − γ_i) − ψ′(∑_{j=1}^K γ_j) ∑_{j=1}^K (α_j + ∑_{n=1}^N Φ_nj − γ_j)

Setting this equation equal to zero, the maximum is at:

γ_i = α_i + ∑_{n=1}^N Φ_ni


– Variational Multinomial Φ: We now maximize L[γ, Φ, λ; α, δ, β] with respect to Φ, adding a Lagrange multiplier for the constraint ∑_{i=1}^K Φ_ni = 1:

L[Φ_ni] = ∑_{n=1}^N ∑_{i=1}^K Φ_ni (ψ(γ_i) − ψ(∑_{j=1}^K γ_j))
  + ∑_{n=1}^N ∑_{i=1}^K ∑_{j=1}^{V_r} Φ_ni v_n^j log δ_ij
  + ∑_{m=1}^M ∑_{n=1}^N λ_mn ∑_{i=1}^K Φ_ni ∑_{j=1}^{V_w} w_m^j log β_ij
  − ∑_{n=1}^N ∑_{i=1}^K Φ_ni log Φ_ni
  + η(∑_{i=1}^K Φ_ni − 1)

We differentiate L[Φ_ni] with respect to Φ_ni and obtain:

dL[Φ_ni]/dΦ_ni = ψ(γ_i) − ψ(∑_{j=1}^K γ_j) + log δ_{i v_n} + ∑_{m=1}^M λ_mn log β_{i w_m} − log Φ_ni − 1 + η

Setting this equation equal to zero, the maximum is at:

Φ_ni ∝ δ_{i v_n} exp(ψ(γ_i) − ψ(∑_{j=1}^K γ_j) + ∑_{m=1}^M λ_mn log β_{i w_m})

– Variational Multinomial λ: Finally, we maximize L[γ, Φ, λ; α, δ, β] with respect to λ, adding a Lagrange multiplier for the constraint ∑_{n=1}^N λ_mn = 1:

L[λ_mn] = ∑_{m=1}^M ∑_{n=1}^N λ_mn ∑_{i=1}^K Φ_ni ∑_{j=1}^{V_w} w_m^j log β_ij − ∑_{m=1}^M ∑_{n=1}^N λ_mn log λ_mn + η(∑_{n=1}^N λ_mn − 1)

Taking the derivative with respect to λ_mn:

dL[λ_mn]/dλ_mn = ∑_{i=1}^K Φ_ni log β_{i w_m} − log λ_mn − 1 + η

Setting this derivative to zero yields the maximizing value of the variational parameter λ_mn:

λ_mn ∝ exp(∑_{i=1}^K Φ_ni log β_{i w_m})
2.4.1.3 Maximizing Model Parameters

– Conditional Multinomial δ: We again isolate the terms containing δ and add Lagrange multipliers for the constraints ∑_{j=1}^{V_r} δ_ij = 1:

L[δ_ij] = ∑_{d=1}^D ∑_{n=1}^{N_d} ∑_{i=1}^K Φ_dni v_dn^j log δ_ij + η_i(∑_{j=1}^{V_r} δ_ij − 1)

Setting dL[δ_ij]/dδ_ij = 0 leads to:

δ_ij ∝ ∑_{d=1}^D ∑_{n=1}^{N_d} Φ_dni v_dn^j

Note that v_dn^j is an indicator variable: v_dn^j = 1 only if the n-th visual word in document d is indexed j.

– Conditional Multinomial β: Next, we isolate the terms containing β and add a Lagrange multiplier for the constraint ∑_{j=1}^{V_w} β_ij = 1:

L[β_ij] = ∑_{d=1}^D ∑_{m=1}^{M_d} ∑_{n=1}^{N_d} ∑_{i=1}^K Φ_dni λ_dmn w_dm^j log β_ij + η_i(∑_{j=1}^{V_w} β_ij − 1)

Setting dL[β_ij]/dβ_ij = 0 leads to:

β_ij ∝ ∑_{d=1}^D ∑_{m=1}^{M_d} w_dm^j ∑_{n=1}^{N_d} Φ_dni λ_dmn

where w_dm^j is an indicator variable: w_dm^j = 1 only if the m-th textual word in document d is indexed j.

– Dirichlet prior α: Now we estimate the Dirichlet α. Observe that dL[α]/dα_i depends on the α_j with j ≠ i. Therefore, a linear-time Newton-Raphson algorithm is used to find the maximizing α. We omit the details of this algorithm.
2.4.1.4 Variational EM

The derivation described above is summarized in Table 2.2. We find the maximum likelihood estimates of the model parameters with a variational EM procedure. In particular, the E-Step computes the variational posterior {γ_d, Φ_d, λ_d}_{d=1}^D for each image and caption given the current settings of the model parameters {α, δ, β}. The M-Step then finds maximum likelihood estimates of the model parameters from the expected sufficient statistics taken under the variational distribution.
The variational EM algorithm alternates between the two steps until the bound on the expected log-likelihood converges. The E-Step (variational inference) and the complete variational EM are summarized in Algorithm 2.4 and Algorithm 2.5, respectively. [8]

Parameter   Update Rule
γ_i         = α_i + ∑_{n=1}^N Φ_ni
Φ_ni        ∝ δ_{i v_n} exp(ψ(γ_i) − ψ(∑_{j=1}^K γ_j) + ∑_{m=1}^M λ_mn log β_{i w_m})
λ_mn        ∝ exp(∑_{i=1}^K Φ_ni log β_{i w_m})
δ_ij        ∝ ∑_{d=1}^D ∑_{n=1}^{N_d} Φ_dni v_dn^j
β_ij        ∝ ∑_{d=1}^D ∑_{m=1}^{M_d} w_dm^j ∑_{n=1}^{N_d} Φ_dni λ_dmn
α           Newton-Raphson algorithm

Table 2.2: Parameter and Update Rule


Algorithm 2.4 Variational Inference for corrLDA
Require: α, δ, β from the M-Step
initialize γ_i := α_i + N/K, for all i ∈ {1, ..., K}
initialize Φ_ni := 1/K, for all i ∈ {1, ..., K}
initialize λ_mn := 1/N, for all n ∈ {1, ..., N}
t := 1
maxiter := 100
repeat
  for n := 1 to N do
    for i := 1 to K do
      Φ_ni := δ_{i v_n} exp(ψ(γ_i) − ψ(∑_{j=1}^K γ_j) + ∑_{m=1}^M λ_mn log β_{i w_m})
    end for
    normalize {Φ_ni}_{i=1}^K to sum to 1
  end for
  for m := 1 to M do
    for n := 1 to N do
      λ_mn := exp(∑_{i=1}^K Φ_ni log β_{i w_m})
    end for
    normalize {λ_mn}_{n=1}^N to sum to 1
  end for
  for i := 1 to K do
    γ_i := α_i + ∑_{n=1}^N Φ_ni
  end for
  t := t + 1
until (t > maxiter) or (convergence)
Remark that after we have a training model, we can perform the E-Step with fixed model parameters α, β and δ on an unseen document to estimate the variational posterior.
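As an illustration, the following Python/NumPy sketch implements the three update rules of Algorithm 2.4 for one document; the vectorized shapes and the max-subtraction for numerical stability are our own additions, and psi is SciPy's digamma function ψ:

import numpy as np
from scipy.special import psi            # the digamma function ψ

def corr_lda_e_step(v, w, alpha, delta, beta, iters=100):
    """v: visual word indices (length N); w: textual word indices (length M);
    delta: K x Vr and beta: K x Vw topic-word multinomials."""
    K, N, M = len(alpha), len(v), len(w)
    gamma = alpha + N / K                           # initializations of Algorithm 2.4
    phi = np.full((N, K), 1.0 / K)
    lam = np.full((M, N), 1.0 / N)
    log_beta_w = np.log(beta[:, w])                 # (K, M): entry [i, m] = log β_{i w_m}
    for _ in range(iters):
        dig = psi(gamma) - psi(gamma.sum())
        # Φ_ni ∝ δ_{i v_n} exp(ψ(γ_i) − ψ(∑_j γ_j) + ∑_m λ_mn log β_{i w_m})
        s = dig[None, :] + (log_beta_w @ lam).T     # (N, K)
        phi = delta[:, v].T * np.exp(s - s.max(1, keepdims=True))
        phi /= phi.sum(1, keepdims=True)
        # λ_mn ∝ exp(∑_i Φ_ni log β_{i w_m})
        t = (phi @ log_beta_w).T                    # (M, N)
        lam = np.exp(t - t.max(1, keepdims=True))
        lam /= lam.sum(1, keepdims=True)
        # γ_i = α_i + ∑_n Φ_ni
        gamma = alpha + phi.sum(0)
    return gamma, phi, lam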
Algorithm 2.5 Parameter Estimation for corrLDA
t := 1
maxiter := 100
repeat
  for d := 1 to D do
    [γ_d, Φ_d, λ_d] := Variational-Inference-for-corrLDA(doc_d, α, δ, β)
  end for
  for i := 1 to K do
    for j := 1 to V_r do
      δ_ij := ∑_{d=1}^D ∑_{n=1}^{N_d} Φ_dni v_dn^j
    end for
    normalize {δ_ij}_{j=1}^{V_r} to sum to 1
    for j := 1 to V_w do
      β_ij := ∑_{d=1}^D ∑_{m=1}^{M_d} w_dm^j ∑_{n=1}^{N_d} Φ_dni λ_dmn
    end for
    normalize {β_ij}_{j=1}^{V_w} to sum to 1
  end for
  update α by the Newton-Raphson algorithm
  t := t + 1
until (t > maxiter) or (convergence)


2.4.2 Correspondence Correlated Topic Model (CorrCTM)

CorrCTM is built on CTM. As we explained in the first chapter, the contribution of CTM compared to LDA is to use a multivariate normal distribution instead of the Dirichlet to capture topic correlation, and to add an additional logistic normal mapping step to incorporate the normally distributed topics. These contributions make CTM more reliable than LDA.
We derive the posterior probability function in the E-Step, then maximize the lower bound in order to calculate the model parameters. Unfortunately, the direct calculation of the exact posterior function over the latent variables is intractable. So, we use the variational EM framework to approximate the inference in the E-Step, and we then maximize the lower bound to get a new estimate of the model parameters.
Figure 2.3 presents the CorrCTM model.

Figure 2.3: (Left) Proposed CorrCTM. (Right) Variational distribution used to approximate the posterior for each image in CorrCTM.

In the following part, we will focus on deriving a variational inference algorithm to approximate the posterior distribution, and then estimate the model parameters. Next, we will derive the prediction algorithm for annotating new images with the model parameters.
The table below (Table 2.3) presents the symbols and descriptions used in CorrCTM.


Symbol   Description
D        Number of documents in the training set.
K        Number of topics.
M        Number of textual words in document d.
N        Number of visual words in document d.
V_r      Size of the visual vocabulary.
V_w      Size of the textual vocabulary.
v        Visual word: v_n^j = 1 only if the n-th visual word in document d is indexed j, else v_n^j = 0.
w        Textual word: w_m^j = 1 only if the m-th textual word in document d is indexed j, else w_m^j = 0.
y        Discrete indexing variable y ∈ {1, ..., N}.
z        Latent topic: z_n^i = 1 if z_n is the i-th latent topic, else z_n^i = 0.
θ        Multinomial. θ_i, i ∈ {1, ..., K}.
µ        Mean vector of the normal distribution. µ_i, i ∈ {1, ..., K}.
Σ        Covariance matrix of the normal distribution. Σ_ij, i, j ∈ {1, ..., K}.
δ        Multinomial. δ_ij, i ∈ {1, ..., K}, j ∈ {1, ..., V_r}.
β        Multinomial. β_ij, i ∈ {1, ..., K}, j ∈ {1, ..., V_w}.
γ        Variational mean. γ_i, i ∈ {1, ..., K}.
ν        Variance of the variational normal distribution. ν_i, i ∈ {1, ..., K}.
Φ        Variational multinomial. Φ_ni, n ∈ {1, ..., N}, i ∈ {1, ..., K}.
λ        Variational multinomial. λ_mn, m ∈ {1, ..., M}, n ∈ {1, ..., N}.
ζ        Additional variational parameter.

Table 2.3: Symbol and Description (CorrCTM)

2.4.2.1 Variational Inference

To start our derivation, we first simplify the original graphical model by introducing four variational parameters (γ, ν, Φ, λ). The expression of the factorized distribution over the latent variables of a document is:

q(θ, z, y | γ, ν, Φ, λ) = q(θ | γ, ν) ∏_{n=1}^N q(z_n | Φ_n) ∏_{m=1}^M q(y_m | λ_m)

Note that θ = f(η) = exp(η) / ∑_{i=1}^K exp(η_i).
So, we can write the expression of the lower bound as follows:

L(ζ, γ, ν, Φ, λ; µ, Σ, δ, β) = E_q[log p(η | µ, Σ)] + E_q[log p(z | θ)] + E_q[log p(r | z, δ)] + E_q[log p(y | N)]
  + E_q[log p(ω | y, z, β)] − E_q[log q(θ_{1:K} | γ, ν)] − E_q[log q(z | Φ)] − E_q[log q(y | λ)]

knowing that:

E_q[log p(η | µ, Σ)] = (1/2) log |Σ^{-1}| − (K/2) log(2π) − (1/2) E_q[(η − µ)^T Σ^{-1} (η − µ)]

E_q[log p(z | f(η))] = ∑_{n=1}^N [∑_{i=1}^K γ_i Φ_ni − ζ^{-1}(∑_{i=1}^K exp{γ_i + ν_i²/2}) + 1 − log(ζ)]

E_q[log p(r | z, δ)] = ∑_{n=1}^N ∑_{i=1}^K Φ_ni ∑_{j=1}^{V_r} r_n^j log(δ_ij)

E_q[log p(y | N)] = −log(N)

E_q[log p(ω | y, z, β)] = ∑_{m=1}^M ∑_{n=1}^N λ_mn ∑_{i=1}^K Φ_ni ∑_{j=1}^{V_w} w_m^j log(β_ij)

E_q[log q(θ_{1:K} | γ, ν)] = −∑_{i=1}^K (1/2)[log(ν_i²) + log(2π) + 1]

E_q[log q(z | Φ)] = ∑_{n=1}^N ∑_{i=1}^K Φ_ni log(Φ_ni)

E_q[log q(y | λ)] = ∑_{m=1}^M ∑_{n=1}^N λ_mn log λ_mn

Thereafter, we obtain the expression of the lower bound as a function of the model parameters (µ, Σ, δ, β) and the variational parameters (ζ, γ, ν, Φ, λ):

L(ζ, γ, ν, Φ, λ; µ, Σ, δ, β) = (1/2) log |Σ^{-1}| − (K/2) log(2π) − (1/2) E_q[(η − µ)^T Σ^{-1} (η − µ)]
  + ∑_{n=1}^N [∑_{i=1}^K γ_i Φ_ni − ζ^{-1}(∑_{i=1}^K exp{γ_i + ν_i²/2}) + 1 − log(ζ)]
  + ∑_{n=1}^N ∑_{i=1}^K Φ_ni ∑_{j=1}^{V_r} r_n^j log(δ_ij)
  − log(N)
  + ∑_{m=1}^M ∑_{n=1}^N λ_mn ∑_{i=1}^K Φ_ni ∑_{j=1}^{V_w} w_m^j log(β_ij)
  + ∑_{i=1}^K (1/2)[log(ν_i²) + log(2π) + 1]
  − ∑_{n=1}^N ∑_{i=1}^K Φ_ni log(Φ_ni)
  − ∑_{m=1}^M ∑_{n=1}^N λ_mn log λ_mn

Our aim is to maximize the expression of the lower bound L(ζ, γ, ν, Φ, λ; µ, Σ, δ, β) as much as possible. In the E-Step, we fit the variational parameters {ζ, γ, ν, Φ, λ} to lower-bound the marginal likelihood; we then maximize the lower bound in the M-Step to obtain the model parameters {µ, Σ, δ, β}.
The same treatment as for the CorrLDA variational parameters is performed in the next part.
2.4.2.2

Maximizing Variational Parameters

We isolate terms containing λ :
j

Vw
N
K
M
N
L[λ ]=∑M
m=1 ∑n=1 λmn ∑i=1 Φni ∑ j=1 ωn log(βi j )-∑m=1 ∑n=1 λmn log λ mn .

Setting

dLL[λ ]
dλ =0

leads to :



λmn ∝ exp ∑K
i=1 φni log βiwm .
Then we isolate terms containing ζ :

26

n
o
2
K
−1 ( K exp γ + νi ) − log(ζ )].
[−ζ
L[ζ ] =∑N


i
i=1
n=1 i=1
2
Setting

dL[ζ ]
=0


leads to :
n
o
νi2
exp
γ
+
ζ =∑K
i
i=1
2 .

For γ and ν 2 ,we have :
K
−1 ( K exp{γ +
L[γ] = − 12 (γ − µ)T Σ−1 (γ − µ)+∑N
∑i=1
i
n=1 [∑i=1 γi Φni − ζ

νi2
2 })].

and
−1 ( K exp{γ +
L[ν 2 ] = - 12 Trace(diag(ν 2 )Σ−1 )+∑N
∑i=1
i
n=1 [-ζ

νi2
K 1
2
2 })]+∑i=1 2 log(νi ).

We can use Newton’s method to find the local optimal of the coordinate object function. In
practice, we use LBFGS and constrained Newton method to iterate γ and ν 2 .[10]
Next, we follow the same process for Φ:

L_[Φ] = Σ_{n=1}^{N} Σ_{i=1}^{K} γ_i Φ_{ni} + Σ_{n=1}^{N} Σ_{i=1}^{K} Φ_{ni} Σ_{j=1}^{V_r} r_n^j log δ_{ij} − Σ_{n=1}^{N} Σ_{i=1}^{K} Φ_{ni} log Φ_{ni} + Σ_{m=1}^{M} Σ_{n=1}^{N} λ_{mn} Σ_{i=1}^{K} Φ_{ni} Σ_{j=1}^{V_w} ω_m^j log β_{ij}.

Setting dL_[Φ]/dΦ_{ni} = 0, subject to Σ_{i} Φ_{ni} = 1, leads to:

Φ_{ni} ∝ δ_{i,v_n} exp( γ_i + Σ_{m=1}^{M} λ_{mn} log β_{i,w_m} ).
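
A matching C sketch of this update (again with hypothetical names; log_delta[i][j] holds log δ_ij and v[n] is the codeword index of region n):

    #include <math.h>

    /* phi[n][i] ∝ delta[i][v[n]] * exp( gamma[i] + sum_m lambda[m][n] * log_beta[i][w[m]] ),
       normalized over the K topics for each region n. */
    void update_phi(double **phi, const double *gamma,
                    double **log_delta, double **log_beta, double **lambda,
                    const int *v, const int *w, int N, int M, int K)
    {
        for (int n = 0; n < N; n++) {
            double norm = 0.0;
            for (int i = 0; i < K; i++) {
                double s = gamma[i] + log_delta[i][v[n]];
                for (int m = 0; m < M; m++)
                    s += lambda[m][n] * log_beta[i][w[m]];
                phi[n][i] = exp(s);
                norm += phi[n][i];
            }
            for (int i = 0; i < K; i++)
                phi[n][i] /= norm;  /* each row of phi sums to 1 */
        }
    }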

2.4.2.3 Maximum Likelihood Parameter Estimation

Given a collection of images with annotation words {(r_d, w_d)}_{d=1}^{D}, we maximize the likelihood with respect to the model parameters {µ, Σ, δ, β}. We update each model parameter by setting the corresponding derivative of the lower bound to zero.
The terms containing µ:

L_[µ] = Σ_{d=1}^{D} −½ (γ_d − µ)^T Σ^{−1} (γ_d − µ).

Setting dL_[µ]/dµ = 0 leads to:

µ = (1/D) Σ_{d=1}^{D} γ_d.

The terms containing Σ:

L_[Σ] = Σ_{d=1}^{D} [ ½ log |Σ^{−1}| − ½ Trace(diag(ν_d²) Σ^{−1}) − ½ (γ_d − µ)^T Σ^{−1} (γ_d − µ) ].

Setting dL_[Σ]/dΣ = 0 leads to:

Σ = (1/D) Σ_{d=1}^{D} ( diag(ν_d²) + (γ_d − µ)(γ_d − µ)^T ).

The terms containing δ:

L_[δ_{ij}] = Σ_{d=1}^{D} Σ_{n=1}^{N_d} Σ_{i=1}^{K} Φ_{dni} Σ_{j=1}^{V_r} r_{dn}^j log δ_{ij}.

Setting dL_[δ_{ij}]/dδ_{ij} = 0, subject to Σ_{j} δ_{ij} = 1, leads to:

δ_{ij} ∝ Σ_{d=1}^{D} Σ_{n=1}^{N_d} Φ_{dni} r_{dn}^j.

The terms containing β:

L_[β_{ij}] = Σ_{d=1}^{D} Σ_{m=1}^{M_d} Σ_{n=1}^{N_d} λ_{dmn} Σ_{i=1}^{K} Φ_{dni} Σ_{j=1}^{V_w} ω_{dm}^j log β_{ij}.

Setting dL_[β_{ij}]/dβ_{ij} = 0, subject to Σ_{j} β_{ij} = 1, leads to:

β_{ij} ∝ Σ_{d=1}^{D} Σ_{m=1}^{M_d} ω_{dm}^j Σ_{n=1}^{N_d} Φ_{dni} λ_{dmn}.
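
In an implementation, the δ and β updates amount to accumulating sufficient statistics over all documents after each E-step and then normalizing each row. A minimal C sketch of one document's contribution (hypothetical names; r and ω are one-hot, so the indicator sums reduce to indexing by v[n] and w[m]):

    /* Accumulate one document's contribution to the M-step statistics:
       ss_delta[i][v[n]] += phi[n][i]                  (delta update)
       ss_beta[i][w[m]]  += lambda[m][n] * phi[n][i]   (beta update)
       After all documents, normalize each row of ss_delta and ss_beta. */
    void accumulate_mstep(double **ss_delta, double **ss_beta,
                          double **phi, double **lambda,
                          const int *v, const int *w, int N, int M, int K)
    {
        for (int n = 0; n < N; n++)
            for (int i = 0; i < K; i++) {
                ss_delta[i][v[n]] += phi[n][i];
                for (int m = 0; m < M; m++)
                    ss_beta[i][w[m]] += lambda[m][n] * phi[n][i];
            }
    }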

2.4.2.4 Learning algorithm

The algorithm below (Algorithm 2.6) presents the Variational Inference procedure for CorrCTM. Note that LBFGS [10] is an iterative method for solving unconstrained nonlinear optimization problems; we use it to update γ and ν².


Algorithm 2.6 Variational Inference for CorrCTM
initialize ζ := 10
initialize Φ_dni := 1/K
initialize ν_i² := 10
initialize λ_dmn := 0
t := 1
repeat
    ζ_d := Σ_{i=1}^{K} exp(γ_di + ν_di²/2)
    for m := 1 to M_d do
        for n := 1 to N_d do
            λ_dmn ∝ exp( Σ_{i=1}^{K} Φ_dni Σ_{j=1}^{V_w} ω_dm^j log β_ij )
        end for
        Normalize(λ_dm)
    end for
    γ_d := LBFGS(γ_d)
    ν_di² := LBFGS(ν_di²)
    for n := 1 to N_d do
        for i := 1 to K do
            Φ_dni ∝ exp( γ_di + Σ_{j=1}^{V_r} r_dn^j log δ_ij + Σ_{m=1}^{M_d} λ_dmn Σ_{j=1}^{V_w} ω_dm^j log β_ij )
        end for
        Normalize(Φ_dn)
    end for
    t := t + 1
until (t > maxitr) or (convergence)


The algorithm below (Algorithm 2.7) presents parameter estimation for CorrCTM.
Algorithm 2.7 Parameter Estimation for CorrCTM
β and δ are randomly initialized, then normalized
t := 1
repeat
    for doc_d ∈ C do
        [ζ_d, γ_d, ν_d, Φ_d, λ_d] := VariationalInference(doc_d, µ, Σ, δ, β)
    end for
    µ := (1/D) Σ_{d=1}^{D} γ_d
    Σ := (1/D) Σ_{d=1}^{D} { diag(ν_d²) + (γ_d − µ)(γ_d − µ)^T }
    for i := 1 to K do
        for j := 1 to V_r do
            δ_ij ∝ Σ_{d=1}^{D} Σ_{n=1}^{N_d} Φ_dni r_dn^j
        end for
        Normalize(δ_i)
        for j := 1 to V_w do
            β_ij ∝ Σ_{d=1}^{D} Σ_{m=1}^{M_d} Σ_{n=1}^{N_d} λ_dmn Φ_dni ω_dm^j
        end for
        Normalize(β_i)
    end for
    t := t + 1
until (t > maxitr) or (convergence)

2.5 Comparison between LDA and CTM based topic models

Topic models can be grouped into two main categories: LDA based topic models (LDA, CorrLDA, sLDA-bin, ...) and CTM based topic models (CTM, CorrCTM, sCTM-bin, ...).

2.5.1 Topic proportions θ

LDA based topic models all rest on the Dirichlet assumption: the topic proportions of an image are drawn from a Dirichlet distribution. Under the Dirichlet, each topic proportion is assigned independently, which leads to the unrealistic limitation that the presence of one topic is uncorrelated with the presence of the others.
CTM based topic models instead place a normal distribution on the topic proportions, which are correlated through the logistic normal mapping. The annotation words of an image are correlated through the image content, and this correlation can be modeled by topic correlations; CTM produces correlated topic proportions for the words in one document thanks to the logistic normal mapping.
So, LDA based topic models allow each document to exhibit multiple topics but ignore the correlation between topics. CTM based topic models build on LDA and overcome this limitation.
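
To make the contrast concrete, the two generative assumptions on the topic proportions differ only in the prior:

    LDA based:  θ ∼ Dirichlet(α), whose components are drawn with no correlation structure;
    CTM based:  η ∼ N(µ, Σ),  θ_i = exp(η_i) / Σ_{j=1}^{K} exp(η_j),

where the off-diagonal entries of Σ encode the correlations between topics.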

2.5.2 Variational Parameters

Comparing the equation of Φ that maximizes the likelihood in CorrLDA and in CorrCTM, we note that the per-codeword variational distribution over topics Φ in CorrCTM depends on the topic-proportion parameters γ_i. In CorrLDA the γ_i, i ∈ {1, ..., K}, derive from a Dirichlet, which implies that they are independent and carry no correlation between them.
In CorrCTM, by contrast, the γ_i, i ∈ {1, ..., K}, derive from a normal distribution, and γ_i, γ_j are computed in relation to each other, the variances ν_i entering through the covariance matrix Σ. As a result, the covariance structure strongly correlates the topic proportions of each image, and the topic assignments for the codewords (Φ_ni) and for the words (λ_mn) are correlated as well.

2.5.3 Inference Process

The main contribution of CTM based topic models is the use of correlation between topics via the logistic normal distribution. At the same time, the logistic normal is not conjugate to the multinomial, which complicates the corresponding approximate posterior inference procedure.
A drawback of CTM based models is therefore their long running time: inference algorithms for LDA based models are faster than those for CTM based models.

2.6 Conclusion

This chapter dealt with the two topic models implemented in our work: CorrLDA and CorrCTM. It is important to note that each module was implemented separately for each model, and each equation was tested on its own before being assembled into the model. This strategy allowed us to avoid many implementation errors.
It should also be noted that during the implementation stage we encountered many problems of an essentially mathematical nature. The inference algorithm sometimes diverges, which is a matter of parameter initialization; we spent a lot of time finding good initializations that lead the algorithm to converge.
In the next chapter, we will test the annotation performance of each model using several comparison criteria, and present a comparative study between the models based on the results and curves obtained.


Chapter 3
Experiments and Results

3.1 Introduction

In the previous chapter, we introduced the models used in this work. These models would be of little value if we did not test how well they achieve our goals.
In this chapter, we present the evaluation of our models for image annotation. We test the performance of each model through the variation of the average predictive likelihood and of the average perplexity for features and words over various numbers of topics. We discuss the results of these experiments, interpret each observation and explain the cause of each gap between the values.
Before presenting our study, we describe the hardware and software environment used in this work, the image database, and the method adopted to divide it into two sets: the first for training and the second for testing.

3.2 Technology and material environment

3.2.1 Hardware environment

The project was carried out on an HP laptop with the following characteristics:
– Processor: AMD Athlon II Dual-Core M320 (2.1 GHz).
– RAM: 4 GB.
– Hard drive: 500 GB.
– Display: 15 inches.
We also used more powerful machines belonging to the laboratory to execute the algorithms of high complexity.

3.2.2 Technology and software environment

Elementary OS-i386.20130810: The operating system used is Elementary OS, a GNU/Linux distribution designed to provide a complete environment (called Pantheon) based on the GNOME desktop and on the Ubuntu distribution. We used this system because it is very light and robust, especially when run in a virtual machine.
Matlab 13: MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment and fourth-generation programming language. Developed by MathWorks, it allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, Java and Fortran. Matlab was used to implement several mathematical algorithms useful to our work.
Code::Blocks 10.05: A free and open-source, cross-platform IDE which supports multiple compilers, including GCC and Visual C++. This editor was used to implement the topic models of our project. Among the reasons that led us to choose Code::Blocks:
– Saving, compilation (build) and execution with a single key (F9).
– Code errors are shown directly in the code.
– Automatic formatting.
– Direct C/C++ help from the editor with F1.

3.3 Data configuration

We evaluate our models on the Corel 5K dataset. As Corel 5K is copyrighted, we use only the publicly available statistics to evaluate annotation performance. The dataset includes 5000 images (4500 for training, 500 for testing) and 260 words.
While a great variety of image features exists in the literature, the purpose of our work is to compare models rather than to select features, so we restrict our choice to the SIFT feature, which has been shown to be discriminative in numerous classification and recognition problems.

3.4 Training models

We now launch the training of the models for each value of K (the number of topics); the training set contains 4500 images. In the «Set programs argument» tab of the Code::Blocks editor we add:
[est] [alpha] [k] [settings] [data] [random/seeded/*] [directory]
– The term [est] means estimation.
– The term [k] is the number of topics.
– The term [settings] contains the values of the convergence criteria.
– The term [data] contains the training data set.
– The term [random/seeded/*] describes how the topics are initialized: "random" initializes each topic randomly; "seeded" initializes each topic to a distribution smoothed from a randomly chosen document.
– The term [directory] is where the training result is written.
Example: est 0.1 2 settings.txt train.txt random out2.
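
The report does not list the contents of settings.txt. As an indication only, a settings file in the style of Blei's publicly available topic-model code, whose command-line conventions the above matches, looks like this (the exact keys in our implementation may differ):

    var max iter 20
    var convergence 1e-6
    em max iter 100
    em convergence 1e-4
    alpha estimate

The "var" lines bound the per-document variational loop, the "em" lines bound the outer EM loop.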
33

The following figure (Figure 3.1) shows the execution of the program during the learning phase
of CorrLDA.

Figure 3.1: Training CorrLDA
In this case, a directory named out2 appears in the current directory; it contains the model learned with two topics. To test the annotation model for another value of K, we must run the training again with the new value. For this reason we ran the training for several values of K and obtained the directories out2, out20 and out50, as the following figure (Figure 3.2) shows, for K = 2, 20 and 50.

Figure 3.2: Different versions of the learning models


3.5 Testing models

Once the learning phase is complete, we test the model using the test data set, which contains 500 images without annotations; the model estimates an annotation for each image. In the «Set programs argument» tab of the Code::Blocks editor we add:
inf [settings] [model] [data] [name]
Variational Inference is performed on the data using the model in [model].
Example: inf settings.txt out50.final test.txt test.
The following figure (Figure 3.3) shows the execution of the program during the test phase of CorrLDA for a number of topics equal to 50.

Figure 3.3: Testing CorrLDA
A directory named test appears in the current directory; it contains five files, each beginning with the prefix final, as the following figure (Figure 3.4) shows.


Figure 3.4: Result of testing CorrLDA
The file final50-pd.dat contains the log likelihood value of each document in the test data set.
The file final50-gamma.dat contains the matrix gamma for the test data set.
The file final50-anno.dat contains 500 lines, each row representing a ranked list of textual words assigned to the corresponding image. The list is ordered by the probability of each textual word given the current image, and each word is represented by its index in the textual vocabulary. The following figure (Figure 3.5) shows the annotations assigned to the documents of the data set.

Figure 3.5: Annotation of CorrLDA
Take the first image, for example: its top four words have the indices {4, 2, 6, 11}, corresponding to {jetski, sky, sand, beach}.


We can verify the annotation of the first image by opening the file containing the test data set. Figure 3.6 shows the test data set "test.txt".

Figure 3.6: Test data set
The first row is the total number of images in the base, equal to 499.
From the second line on, each image is described by a pair of lines:
– [M] [term_1] [term_2] ... [term_M]: where [M] is the number of textual words in the document.
– [N] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]: where [N] is the number of visual words in the document and the [count] associated with each term is how many times that term appears in the document. Note that [term_1] is an integer which indexes the term.
The indexing of the first document, highlighted in the red rectangle, is {1, 2, 3, 4}, corresponding to {sea, sky, person, jetsky}. The list of textual words with their indices is given in Appendix A.
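
For illustration, with made-up indices and counts (not taken from the actual Corel file), one image would be encoded as:

    3 2 5 17
    4 8:2 41:1 102:3 256:1

that is, M = 3 textual words (indices 2, 5 and 17) and N = 4 distinct visual words, where codeword 8 occurs twice and codeword 102 three times.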

3.6 Test set likelihood

We trained a number of models for both CorrCTM and CorrLDA on the Corel 5K dataset while varying the topic number K. To compare the generalization performance of the two models, we compute the per-image average log likelihood on the held-out test set.
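
Concretely, the quantity plotted in Figure 3.7 is the standard held-out measure (the report does not write it out explicitly):

    L_test = (1 / D_test) Σ_{d=1}^{D_test} log p(r_d, w_d),

where each per-image log likelihood is approximated by its variational lower bound.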
Note that for the values of CorrCTM we adopted the results of other researchers in order to compare with the results found for CorrLDA, because our evaluation of the CorrCTM model is not yet complete; we hope to finish it as soon as possible.
A model which better fits the data will assign a higher likelihood to the test set (log likelihood values are negative, so a smaller absolute value is better). We see in Figure 3.7 that CorrCTM provides a better fit than CorrLDA on the held-out test data for all values of K. This indicates that CorrCTM is more robust to overfitting, when the topic number changes, than CorrLDA.

Figure 3.7: The average test likelihood

3.7 Word Perplexity

Word perplexity measures the quality of the annotation provided by a topic model; a highly efficient model corresponds to low word perplexity values. The word perplexity P(word), given in the following equation, shows how well the model predicts the conditional probability p(w | r) over all images in the held-out test set:

P(word) = exp( − [ Σ_{d=1}^{D} Σ_{m=1}^{M_d} log p(w_m | r_d) ] / [ Σ_{d=1}^{D} M_d ] ).
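
Computing this quantity from the per-word conditional log probabilities is direct; below is a minimal C sketch with hypothetical names, where logp[d][m] holds log p(w_m | r_d):

    #include <math.h>

    /* Word perplexity over a held-out set:
       exp( -(sum of log p(w_m | r_d) over all words) / (total word count) ).
       M[d] is the number of annotation words of document d. */
    double word_perplexity(double **logp, const int *M, int D)
    {
        double sum = 0.0;
        long total = 0;
        for (int d = 0; d < D; d++) {
            for (int m = 0; m < M[d]; m++)
                sum += logp[d][m];
            total += M[d];
        }
        return exp(-sum / (double)total);
    }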

We see in the following figure (Figure 3.8) that CTM based models like CorrCTM provide lower perplexity values than LDA based models like CorrLDA for the different numbers of topics K. This essentially reflects the capability of CTM based models to capture topic correlation.

Figure 3.8: Word Perplexity


We observe that CorrCTM achieves a clear reduction in word perplexity on Corel 5K. This result is expected, because topic assignments are correlated according to the prior covariance structure among topics.
A word is selected according to the propagation of topic correlations, which leads to a lower word perplexity for CorrCTM compared with CorrLDA.

3.8 Conclusion

We presented in this chapter the general process used to perform annotation with topic models. This process is composed of two steps: a training step and a testing step. After the testing phase, we evaluated the annotation performance of each model. Models based on CTM provide higher performance than models based on LDA, but it should be noted that CTM based models rely on algorithms of higher complexity.


