# Lecture 3 .pdf

Original name: Lecture 3.pdf
Title: Lecture3-M
Author: Giuliana Cortese

This PDF 1.3 document was generated by pdftopdf filter / Mac OS X 10.7.5 Quartz PDFContext and was uploaded to fichier-pdf.fr on 12/10/2015 at 18:38.
Document size: 5.8 MB (20 pages).
Privacy: public file

### Document preview

Lecture 3

Graphs

Pie chart
Bar chart
Line chart

Histogram
Frequency polygon
Box-plot
Scatter plots

Common graphs for frequency distributions

Graphs for “quantitative” variables

There are many types of graphs that can be used
to portray distributions of quantitative variables.

histograms, best suited for large amounts of data
frequency polygons
box plots, good at depicting differences between distributions
scatterplots, used to show the relationship between two variables

Graphs for “qualitative” variables

Qualitative data do not come with a pre-established ordering (the
way numbers are ordered)

Bar/Pie Charts

Used for categorical variables to show frequency or
proportion in each category.
Translate the data from frequency tables into a pictorial
representation…

Graphs for qualitative variables (example)
When Apple Computer introduced the iMac computer in August
1998, the company wanted to learn whether the iMac was
expanding Apple’s market share. 500 iMac customers were
interviewed. Each customer was categorized as:
1) previous Macintosh owner;
2) previous Windows owner;
3) new computer purchaser.

There is no natural sense in which the category of previous
Windows users comes before or after the category of previous
Macintosh users.

Table 1 shows the frequencies and
the relative frequencies (proportion
of responses in each category).

Graphs for qualitative variables (example)

For example, the relative frequency
for "none" is 85/500 = 0.17.
Although most iMac purchasers were
Macintosh owners, 12% of
purchasers were former Windows
users, and 17% of purchasers were
buying a computer for the first time.
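
As a minimal sketch, the relative frequencies can be recomputed in Python. Only the count of 85 for "none" is stated in the text; the counts of 60 (Windows) and 355 (Macintosh) are assumptions derived from the stated 12% and from the total of 500.

```python
# Frequencies for the 500 iMac customers.
# 85 ("None") is given in the text; 60 and 355 are assumptions derived
# from 12% of 500 and from the remainder of the 500 customers.
frequencies = {"Macintosh": 355, "Windows": 60, "None": 85}

total = sum(frequencies.values())

# Relative frequency = category frequency / total number of responses.
relative = {category: count / total for category, count in frequencies.items()}

for category, proportion in relative.items():
    print(f"{category:<10} {frequencies[category]:>4} {proportion:.2f}")
```

The relative frequencies add up to 1, which is a quick check that no category was left out.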

Pie charts

Each category is represented by a
slice of the pie.

The area of the slice is proportional to the percentage of
responses in the category (the relative frequency multiplied by
100).

Graphical mistakes to avoid with pie charts

Pie charts:

can be confusing when used to compare the outcomes of two
different surveys or experiments.

are not recommended when you have a large number of categories.

can be misleading when the slices are labeled with percentages
based on a small number of observations. In that case the slices
should be labeled with the actual frequencies observed instead of
with percentages.

For example, if just 5 people had been interviewed by Apple Computers,
and 3 were former Windows users, it would be misleading to display a
pie chart with the Windows slice showing 60%.

Perspective distortion

Tilt pie away
 Perspective shrinks back
Comparisons even harder

[Figure: tilted pie chart of the federal budget, from the website of the War Resisters’ League. Slices: Current Military 26%, Past Military 20%, Human Resources 32%, General Government and Physical Resources (16% and 6%).]

[Figure: bar chart "Majors in Soc 549" (Sociology, Criminology, Psychology); vertical axis from 0 to 16.]

Bar chart (column chart)

It is used to represent the frequencies of different categories.
More common than pie charts
Can show order and changes over time
Appropriate for nominal as well as for ordinal and interval variables
Easy to compare vertical distances

Bar Charts

The Y-axis shows the number of observations in each category.
Categories are shown on the X-axis.

Bar Chart (examples)

Example: the bar chart shown in Figure 2 shows how many purchasers
of iMac computers were previous Macintosh users, previous Windows
users, and new computer purchasers.

Other uses

Example: Figure 2 shows the percent increases in the Dow Jones,
Standard and Poor 500 (S & P), and Nasdaq stock indexes from
May 24th 2000 to May 24th 2001. The Y-axis is not frequency but
rather the signed quantity percentage increase. Both the S & P and
the Nasdaq had negative increases, which means that they decreased
in value.

Bar charts can also show change over time.

Example: the figure shows the percent increase in the consumer
price index (CPI) over four three-month periods. The fluctuation
in inflation is apparent in the graph.

Comparing distributions

Often we need to compare the "distributions"
of responses between the surveys or
conditions. Bar charts are often excellent for
illustrating differences between two
distributions.

Comparing distributions
Figure 3 shows the number of people playing card games at the
Yahoo web site on a Sunday and on a Wednesday in the spring of 2001.

There were more players overall on Wednesday compared to Sunday.
There were about twice as many people playing hearts on
Wednesday as on Sunday.

[Figure: bar chart "Majors in Soc 549" (Sociology, Criminology, Psychology); vertical axis from 8 to 15.]

Graphical mistakes to avoid with bar
charts (Axis Distortion)

The heights of the pictures accurately represent the number
of buyers, yet Figure 6 is misleading because the viewer's
attention will be captured by areas. This can exaggerate the size
differences between the groups.
In terms of percentages, the ratio of previous Macintosh owners to
previous Windows owners is about 6 to 1. But the ratio of the
two areas in Figure 6 is about 35 to 1.
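
The two ratios can be checked with a short sketch. It assumes that each picture keeps its proportions, so that width grows with height, and it uses counts of 355 and 60 buyers, which are derived from the stated percentages (71% and 12% of 500) rather than quoted from the slides.

```python
# Assumed counts, derived from 71% and 12% of 500 customers.
mac_buyers = 355
windows_buyers = 60

# Ratio of the heights, which is the ratio the data actually supports.
height_ratio = mac_buyers / windows_buyers

# If a picture is scaled proportionally, its width grows with its height,
# so its area grows with the square of the height.
area_ratio = height_ratio ** 2

print(round(height_ratio, 1), round(area_ratio))  # prints: 5.9 35
```

Under that assumption the eye compares areas in a ratio of about 35 to 1 while the data differ by only about 6 to 1.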

Graphical mistakes to avoid with bar charts

Baseline = bottom of the Y-axis, representing the least number of
cases in a category.
Normally, this number should be zero. Indeed, starting the vertical
axis above zero exaggerates all differences.

A distortion in bar charts may result from setting the
baseline to a value other than zero.

Figure 7 shows the iMac data with a baseline of 50. The number of
Windows-switchers seems minuscule compared to its true value of 12%.
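
A rough sketch of why the baseline matters: the visible height of a bar is its count minus the baseline. The counts 355 and 60 are assumptions derived from 71% and 12% of 500, and `visible_height` is a hypothetical helper written for this illustration.

```python
def visible_height(count, baseline=0):
    """Height of the bar that is actually drawn above the baseline."""
    return max(count - baseline, 0)

# Assumed counts, derived from 71% and 12% of 500 customers.
mac, windows = 355, 60

# With a baseline of zero the bar heights keep the true ratio.
true_ratio = visible_height(mac) / visible_height(windows)

# With a baseline of 50 the Windows bar shrinks from 60 to 10 units.
distorted_ratio = visible_height(mac, 50) / visible_height(windows, 50)

print(round(true_ratio, 1), round(distorted_ratio, 1))  # prints: 5.9 30.5
```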

Graphical mistakes to avoid with bar charts
(Perspective distortion)

Reduces differences (caps same size)
Exaggerates differences
Hides side of smaller bars
Also hides part of top
Rotation would make it worse

[Figures: bar charts "Majors in Soc 549" (Sociology, Criminology, Psychology), drawn flat and in 3-D perspective; vertical axes from 0 to 14 and from 6 to 14.]

Graphs for quantitative variables

Line Graph

Line graphs are appropriate only when both
the X- and Y-axes display ordinal or
interval (rather than qualitative) variables.

Although bar graphs can also be used in this
situation, line graphs are generally better at
comparing changes from period to period.

[Figures: charts "Majors in Soc 549" (Sociology, Criminology, Psychology); vertical axis from 0 to 16.]

Bar vs. line: similarities
Bar and line charts almost equivalent
Connect tops
Remove bottoms
You get a line chart!

[Figures: bar chart and line chart on a vertical axis from 0 to 16; labels: Sociology, Criminology, Psychology; Sophomore, Junior, Senior.]

Bar vs. line: Differences
Line suggests trend more strongly
interval variables
Line eases comparison of groups

[Figures: paired bar and line charts; labels: Sociology, Criminology, Psychology; series: Social statistics, Sociology of Sport.]
Graphical mistakes to avoid with line charts

Axis distortion: start vertical above zero
Or break vertical
Exaggerates trend

Perspective distortion:
Tilt horizontal

Frequency Polygon

A graphical device for understanding the shapes
of distributions.
A good choice for displaying frequency distributions.
Especially helpful for comparing sets of data.

Frequency Polygon

Choose a class interval.
Draw an X-axis representing the values of the scores in your data.
Mark the middle of each class interval with a tick mark.
Label it with the middle value represented by the class.
Draw the Y-axis to indicate the frequency of each class.
Place a point in the middle of each class interval at the height
corresponding to its frequency.
Connect the points. If one empty class interval is included below the
lowest value and one above the highest value, the graph will touch
the X-axis on both sides.
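
The construction can be sketched in a few lines of Python. The scores, the class width of 10, and the lowest class limit of 39.5 are hypothetical values chosen for illustration, and `frequency_polygon` is a helper name introduced here, not something from the lecture.

```python
def frequency_polygon(scores, lowest_limit, width, n_classes):
    """Return (midpoint, frequency) pairs, one per class interval."""
    counts = [0] * n_classes
    for score in scores:
        index = int((score - lowest_limit) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"score {score} falls outside the class intervals")
        counts[index] += 1
    # The point for each class is placed at the middle of the interval.
    midpoints = [lowest_limit + width * (i + 0.5) for i in range(n_classes)]
    return list(zip(midpoints, counts))

# Hypothetical scores; the first and last classes are left empty on purpose
# so that the polygon touches the X-axis on both sides.
scores = [52, 55, 61, 64, 67, 68, 71, 73, 75, 78, 82, 88]
points = frequency_polygon(scores, lowest_limit=39.5, width=10, n_classes=6)

for midpoint, frequency in points:
    print(midpoint, frequency)
```

Joining consecutive points with straight lines gives the polygon.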

Comparing distributions

Goal: move a computer mouse to a target on the screen as fast as
possible.
Small rectangle: 20 trials.
Large rectangle: 20 trials.
The time to reach the target was recorded on each trial.
The two distributions (one for each target) are plotted together.

The figure shows that it generally took longer to move the mouse to
the small target than to the large one.

Cumulative frequency polygon

The Y value for each point is the number of cases in the
corresponding class interval plus all numbers in lower intervals.

Example:
There are no scores in the interval labeled "35," three in
the interval "45," and 10 in the interval "55."
Therefore the Y value corresponding to "55" is 13.
Since 642 students took the test, the cumulative frequency
for the last interval is 642.
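
The running total in this example can be reproduced with a small sketch that uses only the three class frequencies quoted above (0, 3, and 10); the remaining intervals of the test are not given in the text and are left out.

```python
from itertools import accumulate

# Class labels (interval midpoints) and frequencies quoted in the example.
labels = [35, 45, 55]
frequencies = [0, 3, 10]

# Each cumulative value is the class frequency plus all lower frequencies.
cumulative = list(accumulate(frequencies))

for label, value in zip(labels, cumulative):
    print(label, value)
```

The value for the interval labeled 55 comes out as 13, matching the slide.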

Comparing distributions

Box plot and histograms: for
continuous variables

To show the distribution (shape, center,
range, variation) of continuous variables.

Histograms

Useful for displaying the shape of a distribution
when there are a large number of observations.

Histograms

Steps

Create a frequency table. To simplify the table, group values
together.

Class intervals: range of values broken into intervals.

Placing the limits of the class intervals midway between two
numbers (e.g., 49.5) ensures that every score will fall in an
interval rather than on the boundary between intervals.

Count the number of scores falling into each interval (class
frequencies).

Horizontal: can represent equal or unequal class intervals
(“bins”)
Vertical: bars represent class frequencies
The height of each bar corresponds to its class frequency
(for constant class intervals)
The area of each bar corresponds to its class frequency
(for variable class intervals): area and not height

Bin widths = widths of the class intervals.
This choice affects the shape of the histogram.
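
For class intervals of unequal width, the "area, not height" rule can be sketched as follows. The class limits and frequencies are hypothetical, chosen only to show the arithmetic.

```python
# Hypothetical class limits and frequencies with unequal widths.
edges = [0, 10, 20, 50]
frequencies = [4, 10, 6]

widths = [upper - lower for lower, upper in zip(edges, edges[1:])]

# Bar height = frequency per unit of width, so that area = frequency.
heights = [f / w for f, w in zip(frequencies, widths)]
areas = [h * w for h, w in zip(heights, widths)]

for lower, upper, height, area in zip(edges, edges[1:], heights, areas):
    print(f"[{lower}, {upper}): height {height:.2f}, area {area:.1f}")
```

In this made-up example the widest class holds more cases than the first one, yet its bar is lower, because its 6 cases are spread over a width of 30.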

Histograms (choice of bin widths)

“Rules of thumb“
Sturges' rule: set the number of intervals as close as possible
to 1 + Log2(N), where Log2(N) is the base 2 log of the
number of observations.
Rice’s rule: set the number of intervals to twice the cube root
of the number of observations.
Best advice: experiment with different choices of width and
choose a histogram according to how well it communicates the
shape of the distribution.
Histograms (choice of bin widths)

[Figures: "Histogram of x" for N=1000 observations, drawn with different numbers of classes (captions read "number of classes=50"); x-axis from 16 to 24, y-axis "Frequency".]

Histograms (relative frequencies)

Histograms can be based on relative frequencies: the
proportion of scores in each interval rather than the number of
scores.
The Y axis runs from 0 to 1.

A histogram based on frequencies can be changed to one based on
relative frequencies by
dividing each class frequency by the total number of observations,
plotting the quotients on the Y axis (labeled as proportion).

[Figures: "Histogram of x", N=1000, number of classes=11, drawn with y-axis "Frequency" and with y-axis "Density" (0.00 to 0.30); x-axis from 16 to 24.]

Box-plots

Useful for identifying outliers and for comparing distributions.
Steps

Compute the 25th, 50th, and 75th percentiles of the distribution

Lower hinge = 25th percentile

Upper hinge = 75th percentile

Put "whiskers" above and below each box to give additional

Whiskers are vertical lines that end in a horizontal stroke.

Whiskers are drawn from the upper and lower hinges to the upper

Put additional marks beyond the whiskers for outside values (small
o’s or asterisks)

Example
Students in Introductory Statistics were presented with a page containing 30 colored
rectangles.

Task: name the colors as quickly as possible and record their times.

Compare the scores for the 16 men and 31 women who participated in the
experiment by making separate box plots for each gender.

Discuss distribution of the scores for the 31 women:
the 25th percentile is 17, the 50th percentile is 19, and the 75th percentile is 20.


Box Plot: Shock Index

[Figure: box plot of the shock index (SI) in shock index units; vertical axis from 0.0 to 2.0. Annotations:]
maximum (1.7)
Outliers
Q3 + 1.5 IQR = 0.8 + 1.5(0.25) = 1.175
75th percentile (0.8)
median (0.66)
25th percentile (0.55)
minimum (or Q1 − 1.5 IQR)
“whisker”
interquartile range (IQR) = 0.8 − 0.55 = 0.25
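
The numbers on the shock index box plot can be recomputed from the three quartiles shown in the figure. This is a minimal sketch; the variable names are chosen here, and the lower limit is computed by symmetry with the upper one.

```python
# Quartiles and maximum as annotated on the shock index box plot.
q1, median, q3 = 0.55, 0.66, 0.80
maximum = 1.7

# Interquartile range: distance between the 25th and 75th percentiles.
iqr = q3 - q1

# Limits beyond which values are marked as outliers.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# The maximum lies above the upper limit, so it is drawn as an outlier.
has_high_outliers = maximum > upper_fence

print(round(iqr, 3), round(upper_fence, 3), round(lower_fence, 3), has_high_outliers)
```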

Box Plot: Age
More symmetric

[Figure: box plot of AGE; vertical axis from 0.0 to 100.0. Annotations: maximum, 75th percentile, median, 25th percentile, minimum, interquartile range.]

Comparing distributions

Half the scores in a distribution are between the hinges (recall that
the hinges are the 25th and 75th percentiles).

For the men, the 25th percentile is 19, the 50th percentile is 22.5, and
the 75th percentile is 25.5.

Half the women's times are between 17 and 20.
Half the men's times are between 19 and 25.5.

Women generally named the colors faster than the men did,
although one woman was slower than almost all of the men.

Or…
The means are indicated by green lines rather than plus signs.
The mean of all scores is indicated by a gray line.
Individual scores are represented by dots, one dot for each
subject, and the points are jittered.
The box for the women is wider than the box for the men because
the widths of the boxes are proportional to the number of
subjects of each gender (31 women and 16 men).

Time series:
don’t show distributions, show change over time

[Figure: line chart "BAs in social science and history" (National Center for Educational Statistics); y-axis "% women" from 0% to 50%; x-axis "Years" from 1970 to 1995.]

Axis distortion:
start (or break) vertical above zero

[Figure: "BAs in social science and history" with the vertical axis "% women" running from 30% to 46%.]

Squeeze vertical or stretch horizontal
Squeeze horizontal or stretch vertical

[Figures: the same series, "% women" from 0% to 50%, drawn with different aspect ratios.]

Summary: Graphical display of
distributions

[Table: chart types (Pie, Bar, Line, Histogram, Boxplot) against levels of measurement (Nominal, Ordinal, Interval); entries include "Book approves", "Book disapproves", and "if continuous".]

Summary: Common distortions

False perspective
 e.g., tilting a pie chart
Shortening an axis; e.g.,
 not starting the vertical at 0
 breaking the vertical
 squishing the horizontal
Reasons
 Make small differences look big,
Or make big differences look small

Honest aspect ratio is 3:2 (Tufte)

Summary: Graphical distortion

Keep it simple
Don’t stretch axes
Don’t start or break axes above zero
Don’t use 3-D
If you have to use 3D, avoid abuses
With just a few numbers,
consider a table instead of a graph

Maximize differences that serve your purpose
Minimize differences that work against you
Use every trick (3D, distorted axes)

Axis distortion
Squeeze one axis
Start or break vertical axis above zero

Perspective distortion
Add disproportionate areas in a meaningless 3rd
dimension
Use blocking & tilting

Characterizing events/phenomena
by time, place, and person
Who? How many? =Person (variables: age,
race, gender, education, working status,…)
Where?= Place
When? How long?= Time

Person

How many?

Race     Pop. Size    # Salmonella cases
Black    1,450,675    119
White    5,342,532    497

Person representation
Lists (tables)
Graphs
Pie and bar charts
Histograms
Frequency polygon

Time

When? How long?

Time representation

Place

Where?

Place representation

Review of shape

Shapes
Symmetric
Skewed
Positive (right)
Negative (left)

Clustering:
Unimodal, bimodal, multimodal

Symmetry

Symmetry, no skew
Two tails,
or no tails

Important example:
The normal curve
symmetric
bell-shaped
very specific numeric properties

[Figure: histogram labeled "Normal distribution"; x-axis from 58.00 to 82.00, y-axis from 0 to 200.]

Histogram: Age

Not skewed, but not
bell-shaped either…

[Figure: histogram of AGE (Years); x-axis from 0.0 to 100.0, y-axis "Frequency".]

[Figure: histogram "Starting salaries for BAs in sociology, 2000-2001"; x-axis "Starting salary in thousands" from 22.5 to 47.5; N = 96, Mean = 28.7, Std. Dev = 4.31. Source: National Association of Colleges and Employers, survey of college placement offices.]

Shape of distributions:
Positive or right skew

Positive or right skew
Characteristics:
Peak on left
Long right tail
Stretched (skewed) to the right

Common cause
Floor but no ceiling
A few large values

mode around \$27K

[Figure: histogram "Assignment 1 scores, sociology 549, winter 2001"; x-axis "Assignment 1 scores" from 35.0 to 100.0; N = 101, Mean = 75.4, Std. Dev = 15.79.]

Shape of distributions:
Negative or left skew

Negative or left skew
Characteristics
mirror positive skew:
Peak on right
Long left tail
Stretched (skewed) to the left

Common cause
Ceiling but no floor
A few small values

Unimodal distributions

Mode
peak
most common value

Unimodal
one peak
e.g., starting salaries

Interpretation
the most common salaries
are in the high \$20s

Bimodal distributions

Bimodal
two modes
e.g., # children
modes at 0 and 2
Interpretation?

[Figure: bar chart of NUMBER OF CHILDREN (0 to EIGHT OR MORE); y-axis "Count" from 0 to 500.]

Multimodal distributions

Multimodal
more than 2 modes
e.g., hours worked by OSU sociology students
modes at 0, 20, 40
(primary) mode
secondary modes

Symmetry

Positive skew

Negative Skew

bimodal

Most of the scores are in the middle
of the distribution, with fewer scores in
the extremes. The distribution is not
symmetric. Indeed, the scores extend to the
right farther than they do to the left.
The distribution is therefore said to be
skewed.
