# Exam 1415

M2 SE
Université Paul Sabatier

Exam - March 27th, 2015 (2 hours)

Documents are allowed.
The use of communication devices (phone, ...) is strictly forbidden.
Take care of the presentation.

## 1 Picture compression

In this first exercise, we want to build a picture-compression algorithm with the help of a principal component analysis. For the sake of simplicity, we only consider grayscale pictures, i.e. each pixel is represented by a value in $[0, 1]$ related to its amount of white (0 for a black pixel and 1 for a white pixel). Thereby, a grayscale picture of $n$ pixels by $p$ pixels can be seen as an $n \times p$ data matrix $X$ corresponding to the observations of $p$ variables $x^1, \dots, x^p$ on $n$ individuals. For any $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, p\}$, the value of the $j$-th pixel of line $i$ is given by $x^j_i \in [0, 1]$. First, we compute the uniform mean of each variable and we consider the centered data matrix $\bar X$. To illustrate the several steps of the procedure, we use the picture of Gandalf the Grey displayed in Figure 1, which has $n = 513$ lines and $p = 366$ columns.
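As a sketch of this set-up (hypothetical numpy code; a random matrix stands in for the Gandalf picture):

```python
import numpy as np

# A grayscale picture is an n x p matrix with entries in [0, 1].
# Random data stands in for the 513 x 366 Gandalf picture.
rng = np.random.default_rng(0)
n, p = 513, 366
X = rng.uniform(0.0, 1.0, size=(n, p))

# Uniform mean of each of the p variables (one mean per column).
means = X.mean(axis=0)

# Centered data matrix X_bar: each column now has mean zero.
X_bar = X - means
```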

Figure 1: Pictures of Gandalf the Grey and of the means of the related p variables.
Figure 2 shows the boxplots for the $p = 366$ centered variables related to the picture of Gandalf. In the sequel, we do not reduce the variances of these variables.

1.1 With the help of Figure 2, does it seem justified not to reduce the variances of the observations? Explain why we make that choice in practice for our procedure.
Figure 2: Boxplots for the p centered variables related to the picture of Gandalf.
The next step is to compute the PCA for the uniformly weighted data given by $\bar X$. We denote by $\lambda_1 > \cdots > \lambda_p > 0$ the eigenvalues and by $v^1, \dots, v^p \in \mathbb{R}^p$ the associated orthonormal basis of eigenvectors.
The matrix with the eigenvectors in columns is denoted by V .
1.2 By introducing the appropriate notations, explain for which matrix we compute the eigenvalues and eigenvectors to get this PCA. Describe in your own words what the matrix $C = \bar X V$ is.

Hereafter, we denote by $c^1, \dots, c^p \in \mathbb{R}^n$ the column vectors of the matrix $C$.
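With uniform weights $1/n$, this step can be sketched in numpy as the eigendecomposition of ${}^t\bar X \bar X / n$ (an assumed convention; random data stands in for the picture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.uniform(size=(n, p))
X_bar = X - X.mean(axis=0)            # centered data

# p x p matrix whose eigenvalues / eigenvectors give the PCA
# (uniform weights 1/n on the individuals).
S = X_bar.T @ X_bar / n

# eigh returns eigenvalues in increasing order; flip so that
# lambda_1 >= ... >= lambda_p.
eigvals, V = np.linalg.eigh(S)
eigvals, V = eigvals[::-1], V[:, ::-1]

# C = X_bar V: column c^k holds the coordinates of the n individuals
# on the k-th principal axis (the k-th principal component).
C = X_bar @ V
```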
1.3 Figure 3 shows the boxplots of the variables given by c1 , . . . , cp . Explain the shape of
the distribution of these data.
Our aim is now to rebuild the picture of Gandalf from the result of the PCA. For a number $\ell \in \{0, \dots, p\}$ of principal components, we define the picture at level $\ell$ by the corresponding $n \times p$ matrix $I^\ell$,

$$ I^\ell = I^{\mathrm{moy}} + \sum_{k=1}^{\ell} c^k \, {}^t v^k, $$

where ${}^t v^k$ is the line vector transposed from $v^k$ and $I^{\mathrm{moy}}$ is the matrix with the uniform means of the initial variables (i.e. the picture on the right of Figure 1).

Figure 3: Boxplots of the p columns of C.
1.4 Explain why we add $I^{\mathrm{moy}}$ in the definition of $I^\ell$. For any $k \in \{1, \dots, p\}$, what is the matrix $c^k \, {}^t v^k$?
1.5 Figure 4 shows the results obtained with the picture of Gandalf for several values of $\ell$. Comment on these results, especially the cases $\ell = 0$ and $\ell = 366$.
The goal of a compression algorithm is to get a picture as close as possible to the original with the smallest possible amount of information. To quantify this amount of information, we count the number of real values needed to build the picture. Thus, the raw picture needs a size $np$ and the compressed one at level $\ell$ has a size $p + \ell(n + p)$:
• $I^{\mathrm{moy}}$ has a size $p$ (one value for each initial variable),
• for each level $k \in \{1, \dots, \ell\}$, $c^k$ has a size $n$ and $v^k$ has a size $p$.
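These sizes are easy to check numerically for the Gandalf picture (plain Python arithmetic; the chosen levels are just examples):

```python
# Storage cost counted in real values, for n = 513 lines and p = 366 columns.
n, p = 513, 366

raw_size = n * p                          # size of the raw picture

def compressed_size(level, n, p):
    # p values for I_moy, plus n + p values per kept component.
    return p + level * (n + p)

sizes = {level: compressed_size(level, n, p) for level in (0, 10, 50, 366)}

# Level above which the "compressed" picture costs more than the raw one.
bound = (n - 1) * p / (n + p)
```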
1.6 Explain why our procedure is only interesting if we have

$$ \ell \leqslant \frac{(n-1)p}{n+p}. $$
1.7 From the previous question, discuss the case of a picture with a lot more columns than
lines (i.e. p large with respect to n). Conversely, what can you say about a picture with a lot
more lines than columns (i.e. n large with respect to p)? What is the best case for our procedure?
How do you interpret that from the point of view of a PCA?
Back to the picture of Gandalf, we consider the quality of the representation of each line in the rebuilt picture $I^\ell$. For any $i \in \{1, \dots, n\}$, the quality of the $i$-th line in $I^\ell$ is measured by

$$ \cos^2 \theta_i = \frac{(c^1_i)^2 + \cdots + (c^\ell_i)^2}{(c^1_i)^2 + \cdots + (c^p_i)^2} \in [0, 1]. $$
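In numpy, these per-line qualities can be sketched as follows (random data, same assumed PCA convention as before):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, level = 80, 20, 4
X = rng.uniform(size=(n, p))
X_bar = X - X.mean(axis=0)

_, V = np.linalg.eigh(X_bar.T @ X_bar / n)
C = X_bar @ V[:, ::-1]          # principal components, decreasing eigenvalues

# cos^2(theta_i): share of the squared norm of line i carried by the
# first `level` principal components.
cos2 = (C[:, :level] ** 2).sum(axis=1) / (C ** 2).sum(axis=1)
```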

On the left of each picture in Figure 5, these cosines are plotted for each line.

1.8 With the help of Figure 5, what are the parts of the picture of Gandalf that are the hardest to rebuild? Comment on our picture-compression procedure in the light of this remark.
At last, we focus on a data-driven way to choose the level $\ell$. To do that, we consider the normalised least-squares criterion between the original picture $I^{\mathrm{orig}}$ and its compressed version $I^\ell$ at level $\ell$,

$$ \mathrm{Crit}(\ell) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \left( I^{\mathrm{orig}}_{ij} - I^{\ell}_{ij} \right)^2. $$
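As a numeric check of this criterion (hypothetical numpy sketch; following the formula of question 1.9, we assume the $\lambda_k$ denote the singular values of $\bar X / \sqrt{n}$, so that $\mathrm{Crit}(\ell)$ is a tail sum of $\lambda_k^2$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, level = 60, 15, 3
X = rng.uniform(size=(n, p))
means = X.mean(axis=0)
X_bar = X - means

# SVD of the centered data; lam_k = s_k / sqrt(n) is our assumed convention
# for the lambda_k of the text.
U, s, Vt = np.linalg.svd(X_bar, full_matrices=False)
lam = s / np.sqrt(n)

# Compressed picture at the given level.
I_moy = np.tile(means, (n, 1))
I_level = I_moy + (U[:, :level] * s[:level]) @ Vt[:level, :]

# Crit from its definition ...
crit = ((X - I_level) ** 2).sum() / n
# ... and as the tail sum of squared lam_k.
tail = (lam[level:] ** 2).sum()
```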
1.9 What is the behavior of $\mathrm{Crit}(\ell)$ when $\ell$ grows? Explain why we have

$$ \mathrm{Crit}(\ell) = \sum_{k=\ell+1}^{p} \lambda_k^2. $$

1.10 What is the link between Crit(`) and the part of inertia explained by the ` first principal
components?
Let $\alpha \in [0, 1]$; we propose to pick a level $\ell^*_\alpha$ in the following way,

$$ \ell^*_\alpha = \min \left\{ \ell \in \{0, \dots, p\} \ \text{such that} \ \mathrm{Crit}(\ell) \leqslant \alpha \sum_{k=1}^{p} \lambda_k^2 \right\}. $$

1.11 Explain the meaning of such a choice for $\ell^*_\alpha$.
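A sketch of this data-driven choice (numpy; same singular-value convention as in question 1.9, random data as a stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 15
X_bar = rng.uniform(size=(n, p))
X_bar -= X_bar.mean(axis=0)

# lam_k^2 in decreasing order; their partial sums give Crit(0), ..., Crit(p).
lam2 = np.linalg.svd(X_bar, compute_uv=False) ** 2 / n
total = lam2.sum()
crits = total - np.cumsum(np.concatenate(([0.0], lam2)))  # Crit(level), level = 0..p

def pick_level(alpha):
    # Smallest level whose criterion falls below alpha times the total.
    return int(np.argmax(crits <= alpha * total))

levels = [pick_level(a) for a in (1e-2, 1e-3, 1e-4)]
```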

Figure 6 shows the results obtained with $\alpha$ equal to 1%, 0.1% and 0.01%.

1.12

1.13 What is the Sindarin name of Gandalf? ☺
Figure 4: Rebuilds of the picture of Gandalf at levels 0, 1 and 10 (top from left to right) and at
levels 25, 50 and 366 (bottom from left to right).

Figure 5: Rebuilds of the picture of Gandalf at levels 1, 10, 25 and 50 (top to bottom, left to right), aligned with the $\cos^2$ of the lines.


Figure 6: Rebuilt pictures of Gandalf with $\alpha = 10^{-2}$ ($\ell^*_\alpha = 7$), $\alpha = 10^{-3}$ ($\ell^*_\alpha = 23$) and $\alpha = 10^{-4}$ ($\ell^*_\alpha = 62$), from left to right.


## 2 Oenological variables

In this second section, we are interested in the notion of quality of wine with respect to some physico-chemical properties. The data concern 4898 Portuguese white wines and come from the work of Cortez et al. For each wine, the following 11 continuous variables have been measured:
• alcohol
• chlorides
• citric acid
• density
• fixed acidity
• free sulfur dioxide
• pH
• residual sugar
• sulphates
• total sulfur dioxide
• volatile acidity
The quality of the wines has been graded with an integer between 0 (bad wine) and 10 (good wine) on the basis of sensory data. Our aim is to predict this quality variable from the physico-chemical properties.

Figure 7 shows the distribution of the grades given to the 4898 white wines. There is no grade below 3 and no 10 either. The grades are essentially concentrated between 5 and 7. To deal with groups of similar sizes in the sequel, we consider three classes: bad (1) for a grade equal to 3, 4 or 5, medium (2) for a grade equal to 6 and good (3) for a grade equal to 7, 8 or 9.
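The recoding into three classes is simple to sketch (plain Python):

```python
# Recode the sensory grade (3 to 9 in this data set) into the three classes
# used in the sequel: bad (1), medium (2), good (3).
def wine_class(grade):
    if grade <= 5:
        return 1    # bad: grades 3, 4 or 5
    if grade == 6:
        return 2    # medium: grade 6
    return 3        # good: grades 7, 8 or 9

classes = [wine_class(g) for g in range(3, 10)]
```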
Figure 7: Distribution of the grades given to the wines.


Figure 8: Boxplots of the 11 physico-chemical variables.
As illustrated in Figure 8, some of the variables have numerous outliers. To normalize the data, we decide to discard all the observations with at least one outlier among their coordinates. After this cleaning step, the size of the data set decreases from 4898 white wines to $n = 4015$.
variance reduction method.
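The cleaning step can be sketched as follows (an assumption: "outlier" is taken in the usual boxplot sense, a value outside the fences $q_1 - 1.5\,\mathrm{IQR}$ and $q_3 + 1.5\,\mathrm{IQR}$ of its variable; random data stands in for the 11 wine variables):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(500, 11))   # stand-in for the 11 physico-chemical variables

def keep_mask(data):
    # Boxplot fences per variable; keep only the observations with
    # no outlier among their coordinates.
    q1, q3 = np.percentile(data, [25, 75], axis=0)
    iqr = q3 - q1
    inside = (data >= q1 - 1.5 * iqr) & (data <= q3 + 1.5 * iqr)
    return inside.all(axis=1)

clean = data[keep_mask(data)]
```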
In the sequel, we split our data into two subsets of the same size for practical reasons. The first one is used to train the following procedures and the second one is devoted to testing the results.

A first approach simply consists in a discriminant analysis (MDA) of the training data set to separate the classes bad (1), medium (2) and good (3). Then we apply the result of this MDA to the testing data to measure the quality of the obtained classification.


2.2 The eigenvalues obtained during the computation of the MDA are the following:

0.283   0.022   0   ···   0

Explain why we get all the information in the principal plane.
2.3 Figure 9 is the representation of the classes in the principal plane. Discuss the quality of these plotted groups.

2

2

2

2

2

2
1
3
1
2
2
2
2
3
2
2 121 1 21
1
22
2
2
2
2
2
1
22
2
3
2
3
1
2
3
1
1
2
2
2
2
1
3 21 2
32 2232
2 2
2 1 2 3 3 12111 1
22
2
2 3 22 2 2 223
3
2
3
1
1
2
3
2
3 1
2
1
2
13
31
222121321212
2 33
2 2
3
2
3
2
1
2 2 13222 22221 2
1
2
2
2
3
1
2
2
2
2
3
3
2
2
3
12 2 2 223 1 2 2
1 2
2
2
2
2
2
1
1
1
2
23 2
11 1
13
32 22 3 32
21
221122 22 2 22 3232 23 2
3 231
12
1 1 21
2 11212222 1
22 2231 2 22 3 2 322
212212223
3 32
3
3
2
3
2212 2 2122 222
2
3
2
3
1
2
2
2222
3 23 3
23 2
22131123121 2 2222322211 12221232 33 322
122233112122 2
2
1
2
3
1
2
2
1 1
1
2
2
1
1
2
1 1212213
1
1
2
2
2
1
1
2
2
2
2
1
3
3
3
3
1
2
3
2
2
2
1
1
2
2
3
2
2
3
2
2
1
2
1
1
2
3
3
3
3
2
3
2
3
3
1
2
1 322 3 2 22 2 3 2 333 3
12 32 1 1221221
121 122221212213223 221
13
3232
1
2
1
2
3
3
1
3
1
3
2
2
3
1
2
1
2
1
1
2
2
1
2
1
1
2
2
2
2
1
2
3
1 2222
32232 311 12 323 2 32 2133232
2 1 21 121123122 1211123122112212
2
3
21 1 211312122 121221 1
332
2
112212122
123 1
122322232233
2112
31
231 3323 3322 33 23
2132
3 22
3232323 22
22213
11 22112 12
3
2
3
2
2
2
2
2
3
1
1
2
1
1
2
2
3
1 3332 2 2 2
2
1
22 2233121
2 222 1 211 11 2 31 23 1222 33 222133
1 1 1 12
3 323 2 3 23
1 1 1
2 12322
21132121212 2222 22
121
21 2221 21222 3232
132
21 1
3
1
2
1
2
2
2
1
2
1
2
3
2
2
3
12 1233 32 22 32 3 323
1 3231231112222121 1
2 11121131 22212
11 1 21222 212
13
223
22113221221
11121 12
3
2 1221
3322332332 23322 2 33
22
221
22
112
11
2
2
3
2
1
2
1
2
2
2
2
2
2
1
1
2
2
2
1
2
3
1
1 212332 1 33 2232
23
11 2 1 1 1 1 11113211122 112121 1
13 1322
3222223223
1 213221
221 232123212
221322232
1 11 11 12 1 1
11221 1 11112
3 3 2323332 32232232 213 33 3 3
3
31
2
2
1
3
2
2
1
2
2
1 1 1 1 1 2 1121 11212
2
2
1
1
2
2
1
2
2
1
2
3
1
2
3
2
3
1
2 212 22222 2 2 2 2222 2 2 3
1
21 21212
11
1 3
2 1 111 11 112 1 1111
2222311122
32232 3232 3 22
212
1 211
1 12
221211
3213212312
22 1
13
22123222 233 33 32
1
1
1
2
1
1
2
22 2
3
1
2
1
1
1
1
2
2
2
1
2
1
111 2111211 1 1 1 2 11212 3122 3 1 2 23
33 2
3
1
1
1
2
2322232223 3 322 3 2
21 2 111 2 2
11
32 2 13 2122332
13
1 2212
121 2112
12
1
1
2 1121
132
3
2
3
2
2
3
2
3
2
2
2
2
1
1 1 21 12
1 11
1 1
2 2 31 2 2 3 332 33 3 3 3 3 3
3
11 2 11222
1 12 1 2
2
1
12 1
3
3
2 2 2 2222 1 1
2 22 3 2 32313 3 322 2
121
333
1 1
1 31
1
1
1
2
2
1
2
3
1
1
1
33
2
33
222 3 2
232
21 1111 12
1 11 1 111 22132 2213 2 3 2 2 3 233 3 223 333
2
2
33 2 2
1
1
3
2
2
2
2
21
2
2
3
3
2
2122
1
3
3
3
1
3
1 1 11121 1223222 1 12
1 11
12 3 3 332 2 3322 3 3 3 3
1
1 1
33
1 112 2
3 3
3
1
1
2
2
3 323 3 3 2 3 3333 3 2 3
3
2
1
2
1 2
1 323 3 3 3
23 1 1
2
3
1 1 1 11 1 31
3
2
22 13 3 333
1
22 1 1 2
2 2
11
2
11
3 3 3312
2 333 3
2
1
3
3
2
33
1
23
3
2 3
2
1 3
1
2
3
2
3
3
3
2
3
1
3
3
1
2 2 2 3 332
1
1
2
1
1
2
1
3
2
23
2
3

3

2

23
1

3

Figure 9: Graph of the classes in the principal plane of the MDA.
To apply the result of the MDA to the testing data, several strategies are possible. A quite common one is to project each new data point into the principal plane and to look for the classes of its 5 closest neighbors in the training data set. Then, we assign to the new data point the class most represented among these neighbors. The results are summarized as classification counts in the following table, with the true class in line and the predicted class in column:
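A minimal sketch of this 5-nearest-neighbours vote in the plane (hypothetical toy coordinates, not the exam's actual data):

```python
import numpy as np
from collections import Counter

def knn_predict(train_xy, train_cls, point, k=5):
    # Distances in the principal plane, then majority vote among the
    # k closest training points.
    d = np.linalg.norm(train_xy - point, axis=1)
    nearest = np.argsort(d)[:k]
    votes = Counter(train_cls[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set in the plane: class 1 on the left, class 2 on the right.
train_xy = np.array([[-1.0, 0.0], [-1.1, 0.2], [-0.9, -0.1],
                     [1.0, 0.0], [1.2, 0.1], [0.8, -0.2]])
train_cls = np.array([1, 1, 1, 2, 2, 2])

pred = knn_predict(train_xy, train_cls, np.array([-1.0, 0.1]))
```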

|        | Bad | Medium | Good |
|--------|-----|--------|------|
| Bad    | 205 | 353    | 63   |
| Medium | 331 | 522    | 74   |
| Good   | 165 | 257    | 38   |

These results lead to a misclassification rate equal to

$$ 1 - \frac{205 + 522 + 38}{2008} = 0.619. $$

2.4 Comment on these results.
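Reading the counts back from the table (the diagonal and the total are the quantities used in the rate above), the computation can be replayed:

```python
import numpy as np

# Classification counts with the true class in line and the predicted class
# in column (bad, medium, good), as read from the table above.
confusion = np.array([[205, 353, 63],
                      [331, 522, 74],
                      [165, 257, 38]])

# Misclassification rate: one minus the share of well-classified wines.
rate = 1 - np.trace(confusion) / confusion.sum()
```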

A second approach is the use of a decision tree to predict the class of a wine. As above, we
compute the CART tree with the help of the training data set and we measure the quality of the
obtained classification procedure on the testing data set.
2.5 With the help of Figure 10, what is the main criterion that seems to lead to a good wine?
We apply the obtained tree to the testing data set. Here are the classification counts that we get:

|        | Bad | Medium | Good |
|--------|-----|--------|------|
| Bad    | 314 | 296    | 11   |
| Medium | 167 | 618    | 142  |
| Good   | 19  | 272    | 169  |

This leads to a misclassification rate equal to

$$ 1 - \frac{314 + 618 + 169}{2008} = 0.4517. $$

2.6 Comment on these results.

2.7 From the two considered approaches, can we conclude that there exists a way to predict the quality of a wine from its physico-chemical properties?
2.8 What other method(s) would you propose to predict the quality of a white wine from
its physico-chemical properties?


[Figure 10 shows the CART tree. The root splits on alcohol < 10.75; the deeper splits involve volatile.acidity >= 0.235, free.sulfur.dioxide < 17.5, alcohol < 11.74, alcohol < 9.775 and free.sulfur.dioxide >= 17.5; the leaves are labelled Medium or Good.]
Figure 10: CART tree computed on the training data set.

## 3 Credits

The author of the picture of Gandalf (Figure 1) is Nidoart. This picture comes from Wikimedia Commons and is distributed under the terms of the licence CC BY-SA 3.0.

The oenological data come from the work *Modeling wine preferences by data mining from physicochemical properties* by Cortez, Cerdeira, Almeida, Matos and Reis, published in 2009 in volume 47 of Decision Support Systems.
