Lecture 5 .pdf



Original name: Lecture 5.pdf
Title: Lecture5
Author: Giuliana Cortese

This PDF 1.3 document was generated by pdftopdf filter / Mac OS X 10.9.1 Quartz PDFContext and was uploaded to fichier-pdf.fr on 12/10/2015 at 18:38.
Document size: 4.7 MB (19 pages).


Lecture 5
How much variety is there?
Measures of variation

Center vs. variation

• Two distributions can have the same center
• But differ with respect to variation

2 basketball teams: Clippers and Knicks
• Similar mean height, between 6’6” and 6’7”
• But they don’t match up well…

Basketball teams: Mean heights

      Knicks                   Inches    Clippers               Inches
 1    Howard Eisley              74      Earl Boykins             65
 2    Charlie Ward               74      Keyon Dooling            75
 3    Mark Jackson               75      Jeff McInnis             76
 4    Larry Robinson             75      Quentin Richardson       78
 5    Latrell Sprewell           77      Corey Maggette           78
 6    Lavor Postell              77      Eric Piatkowski          78
 7    Allan Houston              78      Elton Brand              80
 8    Shandon Anderson           78      Harold Jamison           81
 9    Clarence Weatherspoon      79      Darius Miles             81
10    Kurt Thomas                81      Obinna Ekezie            81
11    Othella Harrington         81      Sean Rooks               82
12    Marcus Camby               83      Lamar Odom               82
13    Travis Knight              84      Michael Olowokandi       84
14    Felton Spencer             84
      Tot.                     1100      Tot.                   1021
      n                          14      n                        13
      Mean                     78.6      Mean                  78.54

Mean = Tot. / n

What is Variability?

• Variability refers to how "spread out" a group of values is.
• The terms variability, spread, and dispersion are synonymous, and refer to how spread out a distribution is.


Measures of variability

There are four measures frequently used:
• range
• interquartile range
• variance
• standard deviation

Range
It is simply the highest value minus the lowest value.

Range

Calculation
• Largest minus smallest value

E.g., Clippers
• shortest 65 inches (Boykins)
• tallest 84 inches (Olowokandi)
• Range 84-65=19 inches

Interpretation
• Really easy: “All the player heights fit in a 19-inch range, from 65 to 84 inches.”

But
• Sensitive to extreme values
• Uses only extreme values!
• Increases with n
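The range calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `value_range` is made up here, and the heights are the Clippers values from the lecture's table.

```python
# Clippers heights in inches, taken from the lecture's table.
clippers = [65, 75, 76, 78, 78, 78, 80, 81, 81, 81, 82, 82, 84]


def value_range(values):
    """Return the range: largest value minus smallest value."""
    return max(values) - min(values)


print(value_range(clippers))      # 19
print(value_range(clippers[1:]))  # 9, once the shortest player (65) is left out
```

Dropping a single extreme case cuts the range roughly in half, which is the sensitivity the slide warns about.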

Interquartile range: Motivation
• Range ignoring extreme values
• Less sensitive to extreme values
• Could be called “trimmed range”

Interquartile Range
IQR = 3rd quartile – 1st quartile = Q3 – Q1

Interquartile range: examples (1) and (2) [slide figures not recovered]

Interquartile range: example with odd n

Clippers (n=13)

      Player                 Inches (Y)
 1    Earl Boykins               65
 2    Keyon Dooling              75
 3    Jeff McInnis               76
 4    Quentin Richardson         78
 5    Corey Maggette             78
 6    Eric Piatkowski            78
 7    Elton Brand                80
 8    Harold Jamison             81
 9    Darius Miles               81
10    Obinna Ekezie              81
11    Sean Rooks                 82
12    Lamar Odom                 82
13    Michael Olowokandi         84

1st quartile=77 (midway between the 3rd and 4th values)
median=80 (the 7th value)
3rd quartile=81.5 (midway between the 10th and 11th values)
IQR=81.5-77=4.5”

Interpretation:
About half the players (7 of 13) have heights within a 4.5” range,
between 77” (6’5”) and 81.5” (6’9.5”).

Interquartile range: example with even n

Knicks (n=14)

      Player                 Inches (Y)
 1    Howard Eisley              74
 2    Charlie Ward               74
 3    Mark Jackson               75
 4    Larry Robinson             75
 5    Latrell Sprewell           77
 6    Lavor Postell              77
 7    Allan Houston              78
 8    Shandon Anderson           78
 9    Clarence Weatherspoon      79
10    Kurt Thomas                81
11    Othella Harrington         81
12    Marcus Camby               83
13    Travis Knight              84
14    Felton Spencer             84

1st quartile=75” (the 4th value)
median=78” (midway between the 7th and 8th values)
3rd quartile=81” (the 11th value)
IQR=81-75=6”

Interpretation:
About half the players (6 or 8 of 14) have heights within a 6” range,
between 75” (6’3”) and 81” (6’9”).

Interquartile Range: example (3)

[Figure: box plot of age, with 25% of the cases in each of the four segments]

minimum   Q1    Median (Q2)   Q3    maximum
   15     35        49        65      94      age

Interquartile range
= 65 – 35 = 30
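The quartile values in the basketball examples are consistent with a simple "median of each half" recipe. The sketch below assumes that convention (for odd n the overall median is left out of both halves); other textbooks and software use slightly different rules, so results can differ by a fraction. The function name `quartiles` is illustrative.

```python
from statistics import median

clippers = [65, 75, 76, 78, 78, 78, 80, 81, 81, 81, 82, 82, 84]
knicks = [74, 74, 75, 75, 77, 77, 78, 78, 79, 81, 81, 83, 84, 84]


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumed convention: for odd n the overall median belongs to neither half.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # the smallest n // 2 values
    upper = ordered[-half:]   # the largest n // 2 values
    return median(lower), median(upper)


for name, heights in (("Clippers", clippers), ("Knicks", knicks)):
    q1, q3 = quartiles(heights)
    print(name, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
# Clippers Q1 = 77.0 Q3 = 81.5 IQR = 4.5
# Knicks Q1 = 75 Q3 = 81 IQR = 6
```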

Variance and standard deviation
(in the sample)

• Variance ($s^2$)
• Standard deviation ($s$)

To understand
• Must understand deviation

Deviations from the mean
(Y quantitative variable)

• $Y_i$ is the value for a particular case
   • = 65 for Earl Boykins
• $\bar{Y}$ is the mean over all the cases
   • = 78.54 for the Clippers
• Deviation from the mean: $Y_i - \bar{Y}$
   • = 65 - 78.54 = -13.54 for Earl Boykins
• Interpretation: He is 13.54” shorter than the team mean

Variance (in the sample)
How close are the values in the distribution to the middle of the distribution?

Variance = average squared difference of the values from the mean.

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}$$

where $s^2$ is the variance and $n$ is the number of values (sample size). In the sample, the sum of squared deviations is divided by $n-1$.

Variance: Interpretation
• More variety → larger variance
• Beyond that, not easy to interpret
• Variance is in squared units
• Need to un-square them

Variance: example (1)

      Clippers' Player       Height   Deviation   Squared deviation
 1    Earl Boykins             65      -13.54          183.29
 2    Keyon Dooling            75       -3.54           12.52
 3    Jeff McInnis             76       -2.54            6.44
 4    Quentin Richardson       78       -0.54            0.29
 5    Corey Maggette           78       -0.54            0.29
 6    Eric Piatkowski          78       -0.54            0.29
 7    Elton Brand              80        1.46            2.14
 8    Harold Jamison           81        2.46            6.06
 9    Darius Miles             81        2.46            6.06
10    Obinna Ekezie            81        2.46            6.06
11    Sean Rooks               82        3.46           11.98
12    Lamar Odom               82        3.46           11.98
13    Michael Olowokandi       84        5.46           29.83
      Mean                    78.54    Sum             277.23
                                       N-1              12
                                       Variance (s^2)   23.10

Note influence of extreme cases (esp. Boykins)
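The table's arithmetic can be retraced step by step. The following is a minimal sketch using the Clippers heights from the lecture; the variable names are illustrative, and the last line cross-checks against `statistics.variance` from the Python standard library, which also divides by n - 1.

```python
from statistics import mean, variance

clippers = [65, 75, 76, 78, 78, 78, 80, 81, 81, 81, 82, 82, 84]

y_bar = mean(clippers)                          # team mean, about 78.54
deviations = [y - y_bar for y in clippers]      # Y_i minus the mean
squared = [d ** 2 for d in deviations]          # squared deviations

sum_sq = sum(squared)                           # about 277.23
s2 = sum_sq / (len(clippers) - 1)               # divide by N-1 = 12

print(round(y_bar, 2), round(sum_sq, 2), round(s2, 2))   # 78.54 277.23 23.1
print(round(squared[0] / sum_sq, 2))                      # share of the sum due to Boykins
print(abs(s2 - variance(clippers)) < 1e-9)                # True
```

About two thirds of the sum of squares comes from the single 65-inch case, which is the influence the slide's note points at.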

Standard deviation
• The standard deviation is simply the square root of the variance.
• Has the same units as the original data
• Example. For Clippers, $s = \sqrt{23.10} = 4.81$ inches

Standard deviation: Interpretation
• Variance is in squared units
   • Variance of Clipper heights is 23.10 inches squared
• Standard deviation (SD) is in original units
   • SD of Clipper heights = 4.81 inches
• Deviations from mean also in inches
   • Boykins’s deviation –13.54 inches
• Can compare
   • Standard deviation is a … standard to which … deviations are compared

Sample Standard Deviation
(Calculation Example)

Age data (n=8): 17 19 21 22 23 23 23 38
Mean = $\bar{X}$ = 23.25

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$

Comparing Standard Deviations

[Figure: dot plots of Data A, Data B and Data C on a common axis from 11 to 21]

Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.570

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall
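For the age data in the calculation example, the sample standard deviation can be worked out as follows. This is a sketch with illustrative variable names; it assumes the usual n - 1 divisor, the same one `statistics.stdev` uses.

```python
from math import sqrt
from statistics import mean, stdev

ages = [17, 19, 21, 22, 23, 23, 23, 38]   # age data from the slide, n = 8

x_bar = mean(ages)                                  # 23.25
sum_sq = sum((x - x_bar) ** 2 for x in ages)        # 281.5
s = sqrt(sum_sq / (len(ages) - 1))                  # square root of 281.5 / 7

print(x_bar, sum_sq, round(s, 2))        # 23.25 281.5 6.34
print(abs(s - stdev(ages)) < 1e-9)       # True
```

Most of the sum of squares comes from the single value 38, so one unusually old case dominates the result.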

Variance and Standard deviation: Exercises (1) and (2) [slide content not recovered]

Variance and Standard deviation: Exercise (3)

Given exam scores

Student   Score
   1        88
   2        72
   3        65
   4        95
   5        31

• Calculate variance, standard deviation, range and IQR
• Interpret range and IQR
• Calculate and interpret the standard score of the most extreme value.

Exercise (4)

Answer

          Score   Deviation   Dev. Sq.
            31      -39.2     1536.64
            65       -5.2       27.04
            72        1.8        3.24
            88       17.8      316.84
            95       24.8      615.04
Mean      70.2      Sum       2498.80
                    N-1          4
                    Variance   624.7
                    SD          25.0

Q1=(31+65)/2=48
Q3=(88+95)/2=91.5
IQR=91.5-48=43.5

Interpretation
• All the scores fit in a 64-point range (31 to 95).
• But over half the scores fit in a 43.5-point range.
• (Here even the IQR is influenced by the lowest score.)

Exercise (5)
• Range=9-5=4
• 75th percentile=8, 25th percentile=6. IQR=2.
• The mean deviation from the mean is 0. This will always be the case.
• The mean of the squared deviations is 1.5. Therefore, the variance is 1.5.
Comparing measures of variation

Team        Range   Variance   Std Dev   IQR
Clippers     19      23.1       4.81     4.5
Knicks       10      12.57      3.55     6

Using range, variance, or SD, Clippers look more variable.
But using IQR, Knicks look more variable. Why?

Influence of extreme values

Centrality measure    Extreme values
Mean                  Influential
Trimmed mean          Less influential
Median                Not influential

Variation measure     Extreme values
Range                 Very influential
Variance (& SD)       Influential
IQR                   Less influential

One extreme height (Boykins) expands range and variance of Clippers, but can’t affect IQR.
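The comparison table can be reproduced with a short script, and the same script shows why the IQR reacts differently to an extreme case. This is a sketch: `spread_summary` and `iqr` are illustrative names, the quartiles use the "median of each half" convention assumed in the lecture's examples, and the `more_extreme` data set is hypothetical (Boykins made 10 inches shorter still).

```python
from statistics import median, stdev, variance

clippers = [65, 75, 76, 78, 78, 78, 80, 81, 81, 81, 82, 82, 84]
knicks = [74, 74, 75, 75, 77, 77, 78, 78, 79, 81, 81, 83, 84, 84]


def iqr(values):
    """Q3 - Q1, with quartiles taken as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])


def spread_summary(values):
    """Return (range, variance, standard deviation, IQR), rounded to 2 decimals."""
    return (
        max(values) - min(values),
        round(variance(values), 2),
        round(stdev(values), 2),
        iqr(values),
    )


print(spread_summary(clippers))   # (19, 23.1, 4.81, 4.5)
print(spread_summary(knicks))     # (10, 12.57, 3.55, 6)

# Hypothetical data: the shortest Clipper is 55 inches instead of 65.
more_extreme = [55] + clippers[1:]
print(spread_summary(more_extreme))   # range and variance grow, IQR stays 4.5
```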

Formulas for frequency tables

IQR from a frequency table
• Tricky to get a recipe that’s always right.
• Rough method, usually right for large n.
   • Q1: first value with c%>=25%
   • Q3: first value with c%>=75%

IQR from a frequency table: Example

People in household   # households (f)      %        c%
        1                   190           32.7%     32.7%
        2                   316           54.4%     87.1%
        3                    54            9.3%     96.4%
        4                    17            2.9%     99.3%
        5                     2            0.3%     99.7%
        6                     2            0.3%    100.0%
      TOTAL                 581

• Q1=1, Q3=2, IQR=Q3-Q1=1
• Interpretation: More than 50% of surveyed households had 1 to 2 residents.
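The rough cumulative-percentage method can be written out directly. The sketch below uses the household counts from the example; the function name `quartile_from_table` and the dictionary layout are assumptions made for illustration.

```python
# Frequency table from the example: people in household -> number of households.
households = {1: 190, 2: 316, 3: 54, 4: 17, 5: 2, 6: 2}


def quartile_from_table(table, target_pct):
    """Return the first value whose cumulative percentage reaches target_pct."""
    total = sum(table.values())
    cumulative = 0
    for value in sorted(table):
        cumulative += table[value]
        if 100 * cumulative / total >= target_pct:
            return value
    raise ValueError("target_pct must be at most 100")


q1 = quartile_from_table(households, 25)   # c% = 32.7% at value 1
q3 = quartile_from_table(households, 75)   # c% = 87.1% at value 2
print(q1, q3, q3 - q1)                     # 1 2 1
```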

Variance from a frequency table

Data set
• Mean: $\bar{Y} = \frac{1}{n}\sum_{i} Y_i$
• Variance (mean squared deviation): $s^2 = \frac{1}{n-1}\sum_{i} (Y_i - \bar{Y})^2$

Frequency table
• Mean: $\bar{Y} = \frac{1}{n}\sum f\,Y$, where $n = \sum f$
• Variance (mean squared deviation): $s^2 = \frac{1}{n-1}\sum f\,(Y - \bar{Y})^2$

Variance from frequency table: Example

        Y       f    Y-Ybar   (Y-Ybar)^2   f(Y-Ybar)^2
       65       1    -13.54     183.33       183.33
       75       1     -3.54      12.53        12.53
       76       1     -2.54       6.45         6.45
       78       3     -0.54       0.29         0.87
       80       1      1.46       2.13         2.13
       81       3      2.46       6.05        18.15
       82       2      3.46      11.97        23.94
       84       1      5.46      29.81        29.81
Mean  78.54    13      0       Sum           277.23
                               N-1            12
                               Variance       23.10
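The frequency-table formulas can be checked against the Clippers example. This is a minimal sketch: the dictionary maps each height Y to its frequency f, and the variable names are illustrative.

```python
# Height (inches) -> frequency, as in the example table.
table = {65: 1, 75: 1, 76: 1, 78: 3, 80: 1, 81: 3, 82: 2, 84: 1}

n = sum(table.values())                                  # 13 players
y_bar = sum(f * y for y, f in table.items()) / n         # weighted mean
sum_sq = sum(f * (y - y_bar) ** 2 for y, f in table.items())
s2 = sum_sq / (n - 1)                                    # divide by N-1 = 12

print(n, round(y_bar, 2), round(sum_sq, 2), round(s2, 2))   # 13 78.54 277.23 23.1
```

The result matches the case-by-case calculation, because each squared deviation is simply counted f times.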

Summary: Measures of variation
• Sensitive (to extreme values)
   • range
   • variance ($s^2$)
   • standard deviation ($s$)
• Robust
   • interquartile range (IQR)
• Standard deviation is basis for standardization

Dummy variables: review

Suppose Y is a dummy variable
• e.g. Y=1 if a student is female, Y=0 if male
• Some proportion p have Y=1 (female)
• p is also the mean, i.e. $\bar{Y} = p$

Dummy variables: variance & SD
• Can calculate variance (& SD) in usual way
• But there’s a shortcut
   • $s^2 = p(1-p)$
   • $s = \sqrt{s^2} = \sqrt{p(1-p)}$
• The shortcut is exact when the sum of squared deviations is divided by n; with n-1 it is a close approximation for large n.

[Figure: $s^2 = p(1-p)$ plotted against p]

Dummy variance & SD: Examples

                        Average college
                        2000      1970      Citadel
mean (p)                .55       .30       .03
variance                .2475     .21       .0291
standard deviation      .4975     .4583     .1706

Makes sense:
Colleges with more gender variety have larger variance (& SD)
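The shortcut can be compared with the "usual way" on a small made-up sample. Everything about the data here is hypothetical (11 ones and 9 zeros, chosen so that p = .55 as in the first column of the examples); the functions come from Python's standard `statistics` module.

```python
from math import sqrt
from statistics import mean, pvariance, variance

# Hypothetical dummy variable: 11 women (Y=1) and 9 men (Y=0), so p = 0.55.
y = [1] * 11 + [0] * 9
n = len(y)

p = mean(y)                     # the mean of a dummy variable is the proportion
shortcut = p * (1 - p)          # 0.2475
print(round(shortcut, 4), round(sqrt(shortcut), 4))   # 0.2475 0.4975

# Dividing the sum of squares by n reproduces the shortcut exactly (up to rounding).
print(abs(pvariance(y) - shortcut) < 1e-12)           # True

# Dividing by n - 1 gives a slightly larger value: shortcut * n / (n - 1).
print(round(variance(y), 4))                          # 0.2605
```

With only 20 cases the two divisors differ visibly; for large samples the gap becomes negligible.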

Summary

“What’s typical?” isn’t the whole story
• Lots of cases aren’t typical
• Some important cases may be very atypical

Measures of variation
• Variance $s^2$, Standard deviation $s$
• IQR
• Variance and s.d. are sensitive, IQR is robust
• Remaining lectures use variance and s.d.
• S.D. is basis for standardization

Overview and remarks on variation

Transformations

Often we change units of the data. What happens?
• Change feet to centimeters?
• Pounds to kilograms
• Add 20 points to each student's score.

Transformations

Multiplying each value in a data set by a constant
• multiplies the mean by the same amount, and the standard deviation by the absolute value of that amount.
• multiplies the variance by the square of the constant.
   • Mean and SD are on scale of the data.
   • Variance is on the squared scale

Transformations

Adding the same value to each item in a data set
• changes the mean by that amount
• but does not change the standard deviation or variance.
   • Everything shifts together.
   • Spread of the items does not change.
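Both transformation rules can be verified numerically. The sketch below reuses the lecture's exam scores (adding 20 points) and the Clippers heights (converting inches to centimeters with the factor 2.54); the variable names are illustrative, and `isclose` is used because floating-point results match only up to rounding.

```python
from math import isclose
from statistics import mean, stdev, variance

scores = [88, 72, 65, 95, 31]                                       # exam scores
heights_in = [65, 75, 76, 78, 78, 78, 80, 81, 81, 81, 82, 82, 84]   # inches

# Adding a constant: the mean shifts, the spread does not change.
plus_20 = [x + 20 for x in scores]
print(isclose(mean(plus_20), mean(scores) + 20))      # True
print(isclose(stdev(plus_20), stdev(scores)))         # True
print(isclose(variance(plus_20), variance(scores)))   # True

# Multiplying by a constant c = 2.54: mean and SD scale by c, variance by c squared.
c = 2.54
heights_cm = [c * y for y in heights_in]
print(isclose(mean(heights_cm), c * mean(heights_in)))               # True
print(isclose(stdev(heights_cm), c * stdev(heights_in)))             # True
print(isclose(variance(heights_cm), c ** 2 * variance(heights_in)))  # True
```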

Quick Tips:
• The range does not use the concept of deviations.
• It is affected by outliers (large or small values relative to the rest of the data set).
• The range does not utilize all the information in the data set, only the largest and smallest values.
• Thus it is not a very useful measure of spread or variation.

Quick Tips:
• If all of the observations have the same value, the variance in the sample (and standard deviation) will be zero. That is, there is no variability in the data set.
• The variance (standard deviation) is influenced by outliers in the data set.
• The unit for the standard deviation is the same as that for the raw data.
• Thus it is preferred to use the standard deviation rather than the variance as the measure of variability.

How many measurements are within one, two, and three standard deviations from the mean?
The empirical rule for bell-shaped distributions

A practical example

Quiz grades

47 53 53 60 61 61 62 62 63 63 63 64 64 65 65 65 66 66 67 67 68 68 68 69 69
69 69 70 70 70 70 70 71 71 72 72 72 73 73 73 73 73 73 74 74 74 74 74 75 75 75
76 76 76 77 77 77 77 78 78 78 79 79 79 79 80 80 80 80 80 81 81 81 81 82 82
82 82 82 82 82 84 85 85 85 86 87 87 87 88 89 90 91 92 93 94 96 103 104 105

$\bar{x} = \sum_i x_i / n$ = (47 + 53 + 53 + ……. + 103 + 104 + 105)/150 = 75.73
The mean of this population is 76 (rounded).

$s^2 = \sum_i (x_i - \bar{x})^2 / (n-1)$ = {(47-76)^2 + (53-76)^2 + (53-76)^2 + …… + (103-76)^2 + (104-76)^2 + (105-76)^2}/150 = 109
The variance of this population is 109 (with n = 150, dividing by n or by n-1 gives nearly the same value).

We find the standard deviation in this population data by taking the square root of the variance.
$s = \sqrt{109} = 10.44$

[Figure: Histogram of x (Frequency against x, axis from 40 to 110), with the interval within 1s of the mean marked]

76 – 10.44 = 65.56
76 + 10.44 = 86.44

Values outside this interval:
47 53 53 60 61 61 62 62 63 63 63 64 64 65 65 65
87 87 88 89 90 91 92 93 94 96 103 104 105

We find that 30 of 150 values (20%) do not lie within 1s of the mean.
We find that 80% of the values lie within 1s of the mean.

[Figure: Histogram of x, with the interval within 2s of the mean marked]

76 – 2*10.44 = 55.12
76 + 2*10.44 = 96.88

Values outside this interval:
47 53 53
103 104 105

We find that 6 of 150 values (4%) do not lie within 2s of the mean.
We find that 96% of the values lie within 2s of the mean.

[Figure: Histogram of x, with the interval within 3s of the mean marked]

76 – 3*10.44 = 44.68
76 + 3*10.44 = 107.32

We find that 100% of the values lie within 3s of the mean.

The Empirical Rule
• Knowing the value of the mean and the value of the standard
deviation for a data set can provide a great deal of information about
the data set.
• If the data set has a single mound and is symmetrical (“bell-shaped”),
then one can use some properties of this type of distribution.
• One property is called the Empirical Rule.
• The Empirical Rule relates the mean and the standard deviation of a
bell-shaped distribution.
• In particular, it relates the mean to one, two, and three standard
deviations.

One Sigma Rule

(σ = standard deviation of the population)

• One can expect that approximately 68% of the data values will lie within one standard deviation from the mean.
• Approximately 32% (approximately 1/3) of the values are outside one standard deviation from the mean.

Two Sigma Rule

• One can expect that approximately 95% of the data values will lie within two standard deviations from the mean.
• Approximately 5% (1/20) of the values are outside two standard deviations from the mean.

Three Sigma Rule

• One can expect that approximately 99.7% of the data values will lie within three standard deviations from the mean.
• Approximately 0.3% (1/333) of the values are outside three standard deviations from the mean.
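The three percentages come from the normal distribution, and they can be recovered with `NormalDist` from Python's standard `statistics` module (available since Python 3.8). This is only a sketch for an exactly normal distribution; real data that is merely roughly bell-shaped will match the rule approximately. The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```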

Z Score

• The Z score is:
   Z = (a given value - mean) / standard deviation
• We can pick any point on the X axis in the figure (given value) and find out how many standard deviations above or below the mean that point falls.
• A Z score represents the number of standard deviations a given value ($X_i$) is above or below the mean.
• The larger the Z value in absolute terms, the further away $X_i$ will be from the mean; values beyond three standard deviations are very unlikely.

Z Score

For a data set that is normally distributed with a mean of 76 and a standard deviation of 10.44

• Find out the Z score for the value 97.
   This value (X = 97) is 21 units above the mean, with a Z value of:
   Z = (97 - 76)/10.44 = +2.01, i.e. about +2
   This Z score shows that the raw score (97) is about two standard deviations above the mean.
• Find out the X value for a Z score of -3
   Z = (X - 76)/10.44 = -3
   X = 76 - 3*10.44 = 44.68
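Both directions of the calculation fit in two small functions. The names `z_score` and `raw_value` are made up for this sketch; the numbers are the ones from the example (mean 76, standard deviation 10.44).

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


def raw_value(z, mean, sd):
    """Invert the Z score: recover the raw value from z, the mean and the SD."""
    return mean + z * sd


print(round(z_score(97, 76, 10.44), 2))     # 2.01
print(round(raw_value(-3, 76, 10.44), 2))   # 44.68
```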

Standardization: deviation vs. standard deviation

• General formula is a Z score
   $Z = \frac{Y_i - \bar{Y}}{s}$
• Z = –13.54/4.81 = -2.81. So Earl Boykins is 2.81 standard deviations below the mean height for his team
• Interpretation: The case is Z standard deviations from the mean

Standard scores: Interpretation

      Clippers' Player       Height   Deviation      Z
 1    Earl Boykins             65      -13.54      -2.82
 2    Keyon Dooling            75       -3.54      -0.74
 3    Jeff McInnis             76       -2.54      -0.53
 4    Quentin Richardson       78       -0.54      -0.11
 5    Corey Maggette           78       -0.54      -0.11
 6    Eric Piatkowski          78       -0.54      -0.11
 7    Elton Brand              80        1.46       0.30
 8    Harold Jamison           81        2.46       0.51
 9    Darius Miles             81        2.46       0.51
10    Obinna Ekezie            81        2.46       0.51
11    Sean Rooks               82        3.46       0.72
12    Lamar Odom               82        3.46       0.72
13    Michael Olowokandi       84        5.46       1.14
      Mean                    78.54
      Std. Dev.                4.81

• Extreme values → extreme standard scores
• It’s rare to find Z>2 or Z<-2
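The Z column can be recomputed from the heights alone. The sketch below uses the unrounded mean and standard deviation, which is presumably why the table shows -2.82 for Boykins while the hand calculation with rounded inputs (-13.54/4.81) gives -2.81; the variable names are illustrative.

```python
from statistics import mean, stdev

clippers = [65, 75, 76, 78, 78, 78, 80, 81, 81, 81, 82, 82, 84]

y_bar = mean(clippers)      # about 78.54
s = stdev(clippers)         # about 4.81 (n - 1 divisor)

z_scores = [(y - y_bar) / s for y in clippers]
print([round(z, 2) for z in z_scores])
# [-2.82, -0.74, -0.53, -0.11, -0.11, -0.11, 0.3, 0.51, 0.51, 0.51, 0.72, 0.72, 1.14]

# Cases more than 2 standard deviations from the mean.
extreme = [y for y, z in zip(clippers, z_scores) if abs(z) > 2]
print(extreme)              # [65]
```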

Reversing standardization

Given
• standard score Z
• mean $\bar{Y}$
• standard deviation $s$

You can get back the raw value $Y_i$:

$$Y_i = \bar{Y} + Z \cdot s$$

This is just a rearrangement of the standardization formula:

$$Z = \frac{Y_i - \bar{Y}}{s}$$

Reversing standardization: Example
• Earl Boykins is 2.81 standard deviations below the mean for his team (Z = -2.81).
• His team has a mean height of 78.54 inches, and a standard deviation of 4.81 inches
• What is Earl Boykins’ height again?

$$Y_i = 78.54 + (-2.81)(4.81) = 78.54 - 13.52 = 65.02 \approx 65 \text{ inches}$$

Comparing the variation…..

The Coefficient of Variation (CV)
• For a sample, the Coefficient of Variation is the ratio of the standard deviation over the mean: $CV = \frac{s}{\bar{X}}$
• This coefficient provides a proportionate measure of variation
• For example, a standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500 (CV=0.1 in the first case, and CV=0.02 in the second case)
• E.g., CV=0.143: the standard deviation is 14.3% as large as the mean

Comparing the variation…..
• For distributions having the same mean, the distribution with the largest standard deviation has the greatest variation.
• When considering distributions with different means, decision makers can't compare the uncertainty in the distributions only by comparing standard deviations.
• When coefficients of variation for different distributions are compared, the distribution with the largest coefficient of variation has the greatest relative variation.

The Coefficient of Variation (CV)
• The CV has no units since the standard deviation and the mean have the same units, and thus cancel out each other.
• The CV allows us to compare the variation of two (or more) different variables (measured on different scales).

Example 1
The mean number of parking tickets issued in a neighborhood over a four-month period was 90, and the standard deviation was 5. The average revenue generated from the tickets was $5,400, and the standard deviation was $775. Compare the variations of the two variables.

Solution

CV (tickets) = 5/90 = 0.056 = 5.6%
CV (revenue) = 775/5400 = 0.144 = 14.4%

Since the CV is larger for the revenues, there is more variability in the recorded revenues than in the number of tickets issued.

Example (2)

Mark teaches two sections of statistics. He gives each section a different test covering the same material. The mean score on the test for the “day” section is 27, with a standard deviation of 3.4. The mean score for the “night” section is 94 with a standard deviation of 8.0. Which section has the greatest variation or dispersion of scores?

           Day     Night
Mean        27      94
S.D.       3.4     8.0

Direct comparison of the two standard deviations shows that the night section has the greatest variation.

Solution

Comparing the coefficients of variation shows quite different results:

C.V.(day) = (3.4/27) x 100 = 12.6%

C.V.(night) = (8/94) x 100 = 8.5%

Thus, using the CV Mark finds that the “night” section test results have a smaller variation relative to their mean than do the “day” section test results.
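Both CV examples reduce to one division each. The sketch below wraps it in a small function; the name `coefficient_of_variation` is illustrative, and the inputs are the means and standard deviations quoted in the two examples (with the night-section mean taken as 94, the value used in the CV calculation).

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a proportion of the mean (unit-free)."""
    return sd / mean


# Example 1: parking tickets vs. revenue.
cv_tickets = coefficient_of_variation(5, 90)
cv_revenue = coefficient_of_variation(775, 5400)
print(round(100 * cv_tickets, 1), round(100 * cv_revenue, 1))   # 5.6 14.4

# Example (2): day vs. night section.
cv_day = coefficient_of_variation(3.4, 27)
cv_night = coefficient_of_variation(8.0, 94)
print(round(100 * cv_day, 1), round(100 * cv_night, 1))         # 12.6 8.5
```

The night section has the larger standard deviation but the smaller CV, which is why the two comparisons point in opposite directions.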

