Lecture 4.pdf

Original name: Lecture 4.pdf
Title: Lecture 4
Author: Giuliana Cortese


Lecture 4
Population and sample
Measures of center
What is typical? Measures of center

Population and sample
Cases (individuals) of interest?

[Diagram: a sample is drawn from the population by (random) sampling; descriptive statistics summarize the sample, and inference goes from the sample back to the population]

NOTE: Sample should be representative of the population

Examples

A political scientist wants to know what percentage of college-age

Government economists inquire about average household income.

A market research firm wants to learn what percent of adults aged

Wish information about a large group of individuals (Population)!
Time, cost, inconvenience, etc. forbid contacting every individual.
Collect information about only part of the group (Sample)
Draw conclusions about the whole (Population)!

Population data (N): data from every individual of interest
(are fixed and complete)
Sample data (n): data from only some individuals of interest
(are not complete and vary from sample to sample)

What is typical?
Measures of center
Mean:
  point on which a distribution would balance
  value that minimizes the sum of the squared deviations
Trimmed mean
Median: the value that minimizes the sum of absolute deviations
Mode: the most common value

Mode
The value that occurs most frequently

Mode: quantitative variables (1)

Quantitative variable (height)

Height (X)   # winners (f)   cf    %         c%
63           3               3     25.00%    25.00%
64           1               4     8.33%     33.33%
66           1               5     8.33%     41.67%
68           3               8     25.00%    66.67%
69           2               10    16.67%    83.33%
70           1               11    8.33%     91.67%
71           1               12    8.33%     100.00%

modes or modal heights are 63” and 68” (bimodal)

Interpretation:
The most common heights for recent Miss Americas are 63” and 68”
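
As a small illustration of finding the mode(s) from a frequency table, here is a sketch in Python. The dictionary name `height_freq` is made up for this example; the counts are the ones in the height table.

```python
from statistics import multimode

# Frequency table from the slide: height in inches -> number of winners
height_freq = {63: 3, 64: 1, 66: 1, 68: 3, 69: 2, 70: 1, 71: 1}

# Expand the table back into one value per winner
heights = [h for h, f in height_freq.items() for _ in range(f)]

# multimode returns every value tied for the highest frequency
modes = multimode(heights)
print(modes)  # two modes, so the distribution is bimodal
```

Using `multimode` rather than `mode` matters here because the data have two equally common values.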

Mode: quantitative variables (2)
Not so easy if everyone has a slightly different value

[Figure: histogram of starting salaries for BAs in sociology, 2000-2001; horizontal axis: starting salary in thousands (22.5 to 47.5); N = 96, Mean = 28.7, Std. Dev = 4.31. Source: National Association of Colleges and Employers survey of college placement offices]

Interpretation:
The most common starting salaries are in the mid-to-high \$20s

Mode: qualitative variable (1)

Talent                # winners (f)
dance                 3
instrument            3
opera                 2
popular song          9
song and instrument   2

Here the mode is popular song
Interpretation:
The most common talent for recent Miss Americas is popular song

Mode: examples
Some data:
Age of participants: 17 19 21 22 23 23 23 38
Mode = 23 (occurs 3 times)

Frequency table:
Major               Frequency (f)   Proportion (P)   Percentage
criminology         22              .489             48.9%
sociology           16              .356             35.6%
no information      3               .067             6.7%
education           1               .022             2.2%
env science         1               .022             2.2%
history             1               .022             2.2%
political science   1               .022             2.2%
Total               45              1                100.00%

Mode = “criminology” (occurs 22 times)

Mean (average)
The balancing point

Mean (average)
The arithmetic average, or mean, is
  the most common measure of central tendency.
  the balancing point of the distribution.
Arguing from examples
“Is Miss America an Undernourished Role Model?”
  Rubinstein, Sharon MHS. Caballero, Benjamin MD, PhD. Journal of the American Medical Association, 283(12):1569, March 22/29, 2000.
Yes example
  Suzette Charles (1984) weighed just 100 pounds
No example (Robert Renneisen Jr., CEO Miss America Org.)
  “Recent winners have had some of the highest body mass readings, a reflection of pageant officials’ emphasizing brains over beauty.”

[Photo: Katie Harman (2002)]

What’s typical?
Isolated examples are useless
Let’s get (available) winners from the past 19 years

Year   Name                    Height          Weight   BMI
1984   Suzette Charles         5' 3" (63")     100      17.7
1985   Sharlene Wells          5' 8" (68")     120      18.2
1986   Susan Akin              5' 9" (69")     114      16.8
1987   Kellye Cash             5' 8" (68")     116      17.6
1988   Kaye Lani Rae Rafko     5' 10" (70")    131      18.8
1989   Gretchen Carlson        5' 3" (63")     108      19.1
1990   Debbye Turner           5' 8" (68")     118      17.9
1991   Marjorie Vincent        5' 6" (66")     110      17.8
1998   Kate Shindle            5' 11" (71")    145      20.2
1999   Nicole Johnson          5' 9" (69")     133      19.6
2001   Angela Perez Baraquio   5' 4" (64")     118      20.3
2002   Katie Harman            5' 3" (63")     110      19.5

BMI = 703 × Weight / Height². (Weight in pounds, height in inches.)
BMI<19.1: “underweight” (NHANES)
BMI<18.5: “undernourished” (WHO)
BMI=22.7: average for women age 20-29
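
The BMI formula can be checked against the table with a few lines of Python. This is only a sketch; the function name `bmi` is chosen for this example.

```python
def bmi(weight_lb, height_in):
    """Body mass index from weight in pounds and height in inches."""
    return 703 * weight_lb / height_in ** 2

# Suzette Charles (1984): 100 pounds, 63 inches
suzette = bmi(100, 63)
print(round(suzette, 1))

# Compare with the WHO "undernourished" threshold quoted above
print(suzette < 18.5)
```

Rounded to one decimal, the result agrees with the BMI column of the table.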

Calculating the mean: Example

BMI (Y)
17.7
18.2
16.8
17.6
18.8
19.1
17.9
17.8
20.2
19.6
20.3
19.5
sum     223.6
n       12
sum/n   18.6

[Figure: BMI axis from 16 to 20; the mean is the center of mass, the balance point]

BMI<19.1: “underweight” (NHANES)
BMI<18.5: “undernourished” (WHO)
BMI=22.7: average for women age 20-29
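
The sum / n calculation can be reproduced as follows. This is a minimal sketch using the one-decimal BMI values from the table; those rounded values add up to 223.5 rather than the 223.6 shown on the slide, presumably because the slide summed unrounded BMIs. The mean rounds to 18.6 either way.

```python
# BMI values for the 12 winners, as printed in the table (one decimal)
bmi_values = [17.7, 18.2, 16.8, 17.6, 18.8, 19.1,
              17.9, 17.8, 20.2, 19.6, 20.3, 19.5]

n = len(bmi_values)      # number of cases
total = sum(bmi_values)  # sum of the values
mean_bmi = total / n     # mean = sum / n

print(n, round(total, 1), round(mean_bmi, 1))
```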

Mean is sensitive to extreme values

Prediction
Miss America 2004=Ruben
  Talent: popular song
  height: 6’4” (76”)
  weight: 350 pounds

Year   Name                    BMI (Y)
1984   Suzette Charles         17.7
1985   Sharlene Wells          18.2
1986   Susan Akin              16.8
1987   Kellye Cash             17.6
1988   Kaye Lani Rae Rafko     18.8
1989   Gretchen Carlson        19.1
1990   Debbye Turner           17.9
1991   Marjorie Vincent        17.8
1998   Kate Shindle            20.2
1999   Nicole Johnson          19.6
2001   Angela Perez Baraquio   20.3
2002   Katie Harman            19.5
2003   Ruben Studdard          42.6   (extreme value)
sum                            266.2
n                              13
sum/n                          20.5

[Figure: BMI axis from 15 to 45 showing the extreme value; mean without Ruben: 18.6]

BMI<19.1: “underweight” (NHANES)
BMI<18.5: “undernourished” (WHO)
BMI=22.7: average for women age 20-29
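
To see how far one extreme value pulls the mean, the sketch below compares the mean with and without the 42.6 case. Variable names are illustrative only.

```python
from statistics import mean

winners = [17.7, 18.2, 16.8, 17.6, 18.8, 19.1,
           17.9, 17.8, 20.2, 19.6, 20.3, 19.5]
with_outlier = winners + [42.6]  # add the single extreme value

mean_without = mean(winners)
mean_with = mean(with_outlier)
print(round(mean_without, 1), round(mean_with, 1))

# How many of the 13 cases lie below the new mean?
below = sum(v < mean_with for v in with_outlier)
print(below)
```

With the extreme value included, every other case falls below the mean, which is why the mean stops describing a typical value.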

Mean: example

Some data:
Age of participants: 17 19 21 22 23 23 23 38
Mean = 186/8 = 23.25

Mean of age in Kline’s data

Means Section of AGE
Parameter        Value
Mean             50.19334
Median           49
Geometric Mean   46.66865
Harmonic Mean    43.00606
Sum              46730
Mode             49

[Figure: histogram of AGE (percent on the vertical axis), with the mean marked as the balancing point]

Mean

The mean is affected by extreme values (outliers)

[Figure: two dot plots on a 0 to 10 axis, one with Mean = 3 and one with Mean = 4]

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Mean:

The median is equal to 2 (as before!)


Trimmed mean
Trim (discard) most extreme values
Calculate mean for the remaining values
Not as sensitive to extreme values
  because it ignores them
  not sensitive = robust

Trimmed mean
A mean trimmed 10% is a mean computed with 10% of the values trimmed off; 5% from the bottom and 5% from the top.
A mean trimmed 50% is computed by trimming the upper 25% of the values and the lower 25% of the values and computing the mean of the remaining values.

Calculate the trimmed mean
1. Sort data by Y
2. Trim largest and smallest values
3. Get mean of remaining values

Year   Name                    BMI (Y)
1986   Susan Akin              16.8
1987   Kellye Cash             17.6
1984   Suzette Charles         17.7
1991   Marjorie Vincent        17.8
1990   Debbye Turner           17.9
1985   Sharlene Wells          18.2
1988   Kaye Lani Rae Rafko     18.8
1989   Gretchen Carlson        19.1
2002   Katie Harman            19.5
1999   Nicole Johnson          19.6
1998   Kate Shindle            20.2
2001   Angela Perez Baraquio   20.3
2003   Ruben Studdard          42.6
trimmed sum                    206.8
trimmed n                      11
trimmed mean                   18.8

Trim the 2 most extreme
  2/13=15% trimmed mean
Trim the 4 most extreme
  4/13=31% trimmed mean
etc.

Note
15% trimmed
  7.5% off top + 7.5% off bottom
  not 15% top, 15% bottom

How much to trim?
Often use round numbers, then round down
  e.g., 20% trimmed mean
  trim 20% × 13 = 2.6 cases
  → round down to 2 cases

Interpretation:
Ignoring the 2 most extreme values, the average BMI is 18.8

Trimming the 4 most extreme values of the same sorted table instead leaves:
trimmed sum    168.9
trimmed n      9
trimmed mean   18.8
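
The trimming recipe (sort, trim the same number of cases from each end, average the rest) might be sketched as below. The function name `trimmed_mean` and the handling of an odd number of cases to trim are assumptions made for this example; the slides only show rounding 2.6 cases down to 2.

```python
def trimmed_mean(values, percent):
    """Mean after trimming `percent` of the cases in total, half from each end."""
    ordered = sorted(values)
    n = len(ordered)
    cases = int(percent / 100 * n)  # round down, e.g. 20% of 13 = 2.6 -> 2
    per_side = cases // 2           # assumption: trim equally from both ends
    kept = ordered[per_side:n - per_side]
    return sum(kept) / len(kept)

bmi_values = [17.7, 18.2, 16.8, 17.6, 18.8, 19.1, 17.9,
              17.8, 20.2, 19.6, 20.3, 19.5, 42.6]

print(round(trimmed_mean(bmi_values, 20), 1))  # 2 cases trimmed
print(round(trimmed_mean(bmi_values, 31), 1))  # 4 cases trimmed
```

Both calls give the same value to one decimal, matching the trimmed means of 18.8 in the tables.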

Calculate the trimmed mean
Table 1 shows the number of touchdown (TD) passes thrown by each of the 31 teams in the National Football League in the 2000 season. The relevant percentiles are shown in Table 2.
The trimmed mean is therefore:
[Table 1, Table 2, and the resulting value are missing from this copy]

Median
The median (50th percentile) is the midpoint of a distribution: the same number of values are above the median as below it.

Calculate the median
1. Arrange the observations in increasing order
2. Find the middle rank with the formula MR=(n+1)/2
3. Identify the value of the median
   If n is odd, MR is equal to an actual observation and the median is equal to the value of that observation (the median of 2, 4, and 7 is 4).
   If n is even, MR falls between two observations and the median is the arithmetic mean of those observations (the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5).

Median (1)
For odd n, e.g. n=13
Trim till you just can’t trim no more
  12 cases trimmed
  1 is left
  its value=18.8
This value, 18.8, is the median (also the trimmed mean of the single remaining value)

Year   Name                    BMI (Y)
1986   Susan Akin              16.8
1987   Kellye Cash             17.6
1984   Suzette Charles         17.7
1991   Marjorie Vincent        17.8
1990   Debbye Turner           17.9
1985   Sharlene Wells          18.2
1988   Kaye Lani Rae Rafko     18.8   <---- median
1989   Gretchen Carlson        19.1
2002   Katie Harman            19.5
1999   Nicole Johnson          19.6
1998   Kate Shindle            20.2
2001   Angela Perez Baraquio   20.3
2003   Ruben Studdard          42.6
trimmed sum                    18.8
trimmed n                      1
median                         18.8
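
The middle-rank rule MR = (n+1)/2 can be sketched in Python as follows. The function name `median_by_middle_rank` is hypothetical; the logic follows the odd and even cases described for the median.

```python
def median_by_middle_rank(values):
    """Median found via the middle rank MR = (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    mr = (n + 1) / 2
    if n % 2 == 1:
        # MR is a whole number: take the observation at that rank
        return ordered[int(mr) - 1]
    # MR falls between two ranks: average the two neighbouring observations
    lower = ordered[int(mr) - 1]
    upper = ordered[int(mr)]
    return (lower + upper) / 2

bmi_13 = [16.8, 17.6, 17.7, 17.8, 17.9, 18.2, 18.8,
          19.1, 19.5, 19.6, 20.2, 20.3, 42.6]

print(median_by_middle_rank(bmi_13))         # odd n
print(median_by_middle_rank([2, 4, 7, 12]))  # even n
```

Ranks are counted from 1 on the slides, so the code subtracts 1 to get Python's zero-based index.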

Median: Interpretation

The interpretation “half of values are larger than the median, half are smaller” isn’t always quite right
More precisely (gray)
  At most half of values are more than the median
  At most half are less
or, equivalently (brackets)
  At least half are the median or less
  At least half are the median or more

Median (2)
For even n
Trim till you just can’t trim no more
  10 cases trimmed
  2 are left
  their mean=18.5
This is the median

Year   Name                    BMI (Y)
1986   Susan Akin              16.8
1987   Kellye Cash             17.6
1984   Suzette Charles         17.7
1991   Marjorie Vincent        17.8
1990   Debbye Turner           17.9
1985   Sharlene Wells          18.2
1988   Kaye Lani Rae Rafko     18.8
1989   Gretchen Carlson        19.1
2002   Katie Harman            19.5
1999   Nicole Johnson          19.6
1998   Kate Shindle            20.2
2001   Angela Perez Baraquio   20.3
trimmed sum                    37.0
trimmed n                      2
median                         18.5

Median: Interpretation
Half of values are larger than the median
Half are smaller

Median: Interpretation

[Figure: distribution of adult residents across US households (from GSS); bars for 1 to 5 residents, vertical axis: households (500 to 1500), with the mode, median, and mean marked]

Here far less than half of households have more than the median number of residents.
But the careful interpretation is still correct:
  At least half of households (87%) have 2 residents or fewer
  At least half (62%) have 2 residents or more.

Common misconception

Median is not halfway between the largest and smallest values
  That’s the midpoint
  The midpoint is very sensitive to extremes
Median is not sensitive to extremes;
  it ignores them

[Figure: BMI axis from 15 to 45 with the median and the midpoint marked]

Central Tendency
Median: the exact middle value

Calculation:
  If there is an odd number of observations, find the middle value
  If there is an even number of observations, find the middle two values and average them.

Median: example
Some data:
Age of participants: 17 19 21 22 23 23 23 38

Median = (22+23)/2 = 22.5

Median of age in Kline’s data

Median

The median is not affected by extreme values (outliers).

[Figure: two dot plots on a 0 to 10 axis, both with Median = 3]

Quartiles

[Figure: ordered data split by Q1, Q2, and Q3 into four segments of 25% each]

The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile

Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

Finding the quartiles
Arrange the observations in increasing order
Find the positions of the 1st and 3rd quartiles with the following formulas
  Position of 1st quartile = (n+1)/4
  Position of 3rd quartile = 3(n+1)/4
Identify the values of the 1st and 3rd quartiles
Calculate the interquartile range
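
A sketch of these steps in Python, applied to the 13 sorted BMI values used earlier in the lecture. The slides do not say what to do when a position is not a whole number; this example assumes linear interpolation between the two neighbouring observations, and the helper name `value_at_position` is invented here.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in sorted data, interpolating if fractional."""
    rank = int(position)
    fraction = position - rank
    value = ordered[rank - 1]
    if fraction > 0:
        # assumption: move proportionally towards the next observation
        value += fraction * (ordered[rank] - value)
    return value

bmi_sorted = sorted([17.7, 18.2, 16.8, 17.6, 18.8, 19.1, 17.9,
                     17.8, 20.2, 19.6, 20.3, 19.5, 42.6])
n = len(bmi_sorted)

q1 = value_at_position(bmi_sorted, (n + 1) / 4)      # position 3.5
q3 = value_at_position(bmi_sorted, 3 * (n + 1) / 4)  # position 10.5
iqr = q3 - q1                                        # interquartile range

print(round(q1, 2), round(q3, 2), round(iqr, 2))
```

Other software may use slightly different position rules, so small differences in quartile values are to be expected.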

Measures of central tendency
The center of a distribution could be defined three ways:
  the point on which a distribution would balance,
  the value whose average absolute deviation from all the other values is minimized,
  the value whose squared difference from all the other values is minimized.

Balance scale

Central tendency is the point at which the distribution is in balance.

Five numbers 2, 3, 4, 9, 16 placed upon a balance scale. If each number weighs one pound, and is placed at its position along the number line, then it would be possible to balance them by placing a fulcrum at 6.8 (the mean) and not at 4 (the median).

Balance scale

Symmetric distributions
The distribution is balanced by placing the fulcrum in the geometric middle.
The same distribution can't be balanced by placing the fulcrum to the left of center.

Asymmetric distributions
To balance it, we cannot put the fulcrum halfway between the lowest and highest values (as we did in Figure 3). Placing the fulcrum at the "half way" point would cause it to tip towards the left.
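
The balance idea can be checked numerically: the scale balances where the distances to the left and right of the fulcrum cancel out. The sketch below uses the five numbers from the example; `net_tilt` is a name made up for this illustration.

```python
values = [2, 3, 4, 9, 16]

def net_tilt(fulcrum):
    """Sum of signed distances from the fulcrum; zero means the scale balances."""
    return sum(v - fulcrum for v in values)

mean_value = sum(values) / len(values)

print(mean_value)            # the mean of the five numbers
print(net_tilt(mean_value))  # approximately zero: balanced at the mean
print(net_tilt(4))           # positive: with the fulcrum at the median, the right side is heavier
```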

Central tendency and symmetry
The mean, median, and mode are identical in a symmetric bell-shaped distribution.
When distributions have a positive skew, the mean is typically higher than the median.
When distributions have a negative skew, the mean is typically smaller than the median.

Distribution with a very large positive skew

This histogram shows the salaries of major league baseball players (in thousands of dollars).
No single measure of central tendency is sufficient for data such as these.
  The mode of \$250,000 or the median of \$500,000 does not give any indication that some players make many millions of dollars.
  The mean of \$1,183,000 doesn’t tell you that only about one third of baseball players make that much.

When the various measures differ, you should report the mean, median, the mean trimmed 50%, and quartiles.

SMALLEST ABSOLUTE DEVIATION
Second definition of center of a distribution: the number for which the sum of the absolute differences is smallest (the median)
Compute the sum of the absolute differences.
Consider the distribution made up of the five numbers 2, 3, 4, 9, 16.
  The sum of the absolute differences from 10 is 28.
  The sum of the absolute differences from 5 is 21.
  Is there a value for which the sum of the absolute differences is even smaller than 21?
  Yes. For these data, there is a value for which the sum of absolute deviations is only 20.

SMALLEST SQUARED DIFFERENCES
(the mean)
Changing the target from 10 to 5, we calculate the sum of the squared differences from 5 as 9 + 4 + 1 + 16 + 121 = 151.
Can you find the target number for which the sum of squared deviations is 134.8?
The target that minimizes the sum of squared differences provides another useful definition of central tendency.

The distribution balances at the mean of 6.8 and not at the median of 4.
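
Both definitions can be tried out on the five numbers 2, 3, 4, 9, 16. The following sketch (function names are illustrative) computes the sum of absolute and of squared differences for any target, so the figures quoted on the slides can be reproduced.

```python
values = [2, 3, 4, 9, 16]

def sum_abs_dev(target):
    """Sum of absolute differences between each value and the target."""
    return sum(abs(v - target) for v in values)

def sum_sq_dev(target):
    """Sum of squared differences between each value and the target."""
    return sum((v - target) ** 2 for v in values)

median_value = sorted(values)[len(values) // 2]  # middle of 5 sorted values
mean_value = sum(values) / len(values)

print(sum_abs_dev(10), sum_abs_dev(5), sum_abs_dev(median_value))
print(sum_sq_dev(5), round(sum_sq_dev(mean_value), 1))
```

The smallest absolute total occurs at the median and the smallest squared total at the mean, which answers the two questions posed on the slides.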

Summary: Sensitive vs. Robust Statistics

Mean:
  affected by all values
  including extreme values
  efficient but sensitive

Median:
  nearly all values trimmed
  affected only by middle values
  crude but robust

Trimmed mean:
  not affected by outlier values
  compromise between mean and median

Working with frequency tables

Mean of a frequency table

Suppose the heights of Miss Americas are summarized in a frequency table.
Can you use the same formula for the mean?
No!

Height (Y)   # winners (f)
63           3
64           1
66           1
68           3
69           2
70           1
71           1
TOTAL  471   12

471 / 12 (n=12 winners)? 39.25”: crazy
471 / 7 (N=7 values)? 67.3”: still wrong

Mean formula for a frequency table

In frequency tables, each line can represent more than one case.
So you use a different formula!

Height (Y)   # winners (f)   fY
63           3               189
64           1               64
66           1               66
68           3               204
69           2               138
70           1               70
71           1               71
total        n = 12          ∑ fY = 802
             (total winners)

Mean = ∑ fY / n = 802 / 12 ≈ 66.8

Comparison of mean formulas

Data set
  Each line represents one case
  Each line gets equal weight
  Mean = ∑ Y / n

Frequency table
  Some lines represent multiple cases
  Mean = ∑ fY / n

Both formulas give total inches (802) over total winners (12)
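
The frequency-table formula can be sketched in Python as a weighted sum. The dictionary name `height_freq` is an assumption for this example; the counts come from the table.

```python
# Frequency table: height in inches (Y) -> number of winners (f)
height_freq = {63: 3, 64: 1, 66: 1, 68: 3, 69: 2, 70: 1, 71: 1}

n = sum(height_freq.values())                             # total winners
total_inches = sum(y * f for y, f in height_freq.items()) # sum of f * Y
mean_height = total_inches / n

# The tempting but wrong version: sum of the distinct heights only
sum_of_distinct = sum(height_freq)

print(n, total_inches, round(mean_height, 2), sum_of_distinct)
```

Weighting each height by its frequency is what makes the result equal to the mean of the 12 individual heights.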

Mean of a dummy variable
…is the proportion p of cases with a value of 1

Underweight.
Y=1 if BMI<19.1. Otherwise Y=0.

BMI     Underweight
17.7    1
18.2    1
16.8    1
17.6    1
18.8    1
19.1    0
17.9    1
17.8    1
20.2    0
19.6    0
20.3    0
19.5    0

Y   f   fY   p
1   7   7    0.58
0   5   0    0.42
n = 12
∑ fY = 7

58% of recent Miss Americas are underweight
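
A short sketch of the same calculation in Python: code each case as 1 or 0, then take the ordinary mean. Variable names are illustrative.

```python
bmi_values = [17.7, 18.2, 16.8, 17.6, 18.8, 19.1,
              17.9, 17.8, 20.2, 19.6, 20.3, 19.5]

# Dummy variable: 1 if BMI is below 19.1, otherwise 0
underweight = [1 if b < 19.1 else 0 for b in bmi_values]

# The mean of a 0/1 variable is the proportion of 1s
p = sum(underweight) / len(underweight)
print(sum(underweight), round(p, 2))
```

Note that a BMI of exactly 19.1 is coded 0, because the rule is strictly "less than".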

Median from a frequency table
(not real data)

Height (Y)   # winners (f)   %         c%
63           3               25.00%    25.00%
64           1               8.33%     33.33%
66           2               16.67%    50.00%
68           2               16.67%    66.67%
69           2               16.67%    83.33%
70           1               8.33%     91.67%
71           1               8.33%     100.00%
n            12

If there is a value with cum. % of 50%
  then median is average of that value and the next
Here 66” has c%=50%
  So median is average of 66” and 68”
  median is 67”

Interpretation:
Exactly half of recent Miss Americas are shorter than 67”
Exactly half are taller
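
The cumulative-percentage rule might be sketched as below. The slide only states the case where a value reaches exactly 50%; taking the first value whose cumulative share passes 50% otherwise is an assumption added here, and `median_from_freq` is a hypothetical name.

```python
def median_from_freq(freq):
    """Median from a {value: frequency} table using cumulative counts."""
    n = sum(freq.values())
    values = sorted(freq)
    cumulative = 0
    for i, value in enumerate(values):
        cumulative += freq[value]
        if 2 * cumulative == n:
            # cumulative % is exactly 50%: average this value and the next
            return (value + values[i + 1]) / 2
        if 2 * cumulative > n:
            # assumption: otherwise take the first value that passes 50%
            return value

# Illustrative table from the slide (not real data)
example_freq = {63: 3, 64: 1, 66: 2, 68: 2, 69: 2, 70: 1, 71: 1}
print(median_from_freq(example_freq))
```

Comparing counts (`2 * cumulative == n`) instead of percentages avoids rounding problems with values such as 8.33%.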

Summary: Interpretation

Mean: average, center of gravity
Trimmed mean: mean of non-extreme values
Median: halfway point (but not halfway between smallest and largest)
  At least half the values are at or above the median.
  At least half are at or below.
Mode: Most common value

Summary: Measures and variables

                 Nominal   Ordinal   Interval, ratio
Mode             X         X         X
Median                     X         X
(Trimmed) Mean                       X

Can treat dummy variables (with values 0,1) like interval variables