# Lecture 3.pdf

Original name: **Lecture 3.pdf**

Title: **Lecture3-M**

Author: **Giuliana Cortese**

This document in PDF 1.3 format was generated by pdftopdf filter / Mac OS X 10.7.5 Quartz PDFContext, and was uploaded to fichier-pdf.fr on 12/10/2015 at 18:38, from IP address 93.34.x.x.
Document size: 5.8 MB (20 pages).

Privacy: public file

### Document preview

Lecture 3

Graphs: pie chart, bar chart, line chart, histogram, frequency polygon, box plot, scatter plots

Common graphs for frequency distributions

Graphs for “quantitative” variables

There are many types of graphs that can be used to portray distributions of quantitative variables:

histograms, best suited for large amounts of data;

frequency polygons;

box plots, good at depicting differences between distributions;

scatterplots, used to show the relationship between two variables.

Graphs for “qualitative” variables

Qualitative data do not come with a pre-established ordering (the way numbers are ordered).

Bar/pie charts

Used for categorical variables to show the frequency or proportion in each category.

They translate the data from frequency tables into a pictorial representation.

Graphs for qualitative variables (example)

When Apple Computer introduced the iMac computer in August 1998, the company wanted to learn whether the iMac was expanding Apple’s market share. 500 iMac customers were interviewed. Each customer was categorized as:

1) previous Macintosh owner;

2) previous Windows owner;

3) new computer purchaser.

There is no natural sense in which the category of previous Windows users comes before or after the category of previous Macintosh users.

Table 1 shows the frequencies and the relative frequencies (proportion of responses in each category).

Graphs for qualitative variables (example)

For example, the relative frequency for "none" is 85/500 = 0.17.

Although most iMac purchasers were Macintosh owners, 12% of purchasers were former Windows users, and 17% of purchasers were buying a computer for the first time.
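The relative frequencies can be checked with a short Python sketch. Only the count of 85 for "none" is stated in the text; the counts 60 (12% of 500) and 355 (the remainder) are derived here and should be read as assumptions about Table 1.

```python
# Counts for the 500 interviewed iMac customers.
# 85 ("None") is stated in the text; 60 is 12% of 500; 355 is the remainder.
counts = {"Macintosh": 355, "Windows": 60, "None": 85}

total = sum(counts.values())

# Relative frequency = category count / total number of responses.
relative = {category: n / total for category, n in counts.items()}

for category, n in counts.items():
    print(f"{category:<10} {n:>4} {relative[category]:.2f}")
```

Multiplying each relative frequency by 100 gives the percentages quoted above.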

Pie charts

Each category is represented by a slice of the pie.

The area of the slice is proportional to the percentage of responses in the category (the relative frequency multiplied by 100).

Graphical mistakes to avoid with pie charts

Pie charts:

can be confusing when used to compare the outcomes of two different surveys or experiments;

are not recommended when you have a large number of categories;

can be misleading with a small number of observations if the slices are labeled with percentages. The slices should then be labeled with the actual frequencies observed instead of with percentages.

For example, if just 5 people had been interviewed by Apple Computer, and 3 were former Windows users, it would be misleading to display a pie chart with the Windows slice showing 60%.

Perspective distortion

Adding a meaningless 3rd dimension and tilting the pie away from the viewer:

the edge adds to the front slices;

perspective shrinks the back slices;

comparisons become even harder.

[Figure: tilted 3-D pie chart of the federal budget, from the website of the War Resisters’ League. Slices: Current Military (26%), Past Military (20%), Human Resources (32%), plus General Government and Physical Resources (the labels 16% and 6%).]

Bar chart (column chart)

It is used to represent the frequencies of different categories.

More common than the pie chart.

Can show order and changes over time.

Appropriate for nominal as well as for ordinal and interval variables.

Easy to compare vertical distances.

[Figure: bar chart of majors in Soc 549 (Sociology, Criminology, Psychology); vertical axis from 0 to 16.]

Bar charts

Categories are shown on the X-axis.

The Y-axis shows the number of observations in each category.

Bar chart (examples)

Example: the bar chart in Figure 2 shows how many purchasers of iMac computers were previous Macintosh users, previous Windows users, and new computer purchasers.

Other uses

The Y-axis need not show a frequency; here it shows the signed quantity “percentage increase”.

Example: Figure 2 shows the percent increases in the Dow Jones, Standard and Poor 500 (S & P), and Nasdaq stock indexes from May 24th 2000 to May 24th 2001. Both the S & P and the Nasdaq had “negative increases”, which means that they decreased in value.

Other uses

Bar charts can also show change over time.

Example: the figure shows the percent increase in the consumer price index (CPI) over four three-month periods. The fluctuation in inflation is apparent in the graph.

Comparing distributions

Often we need to compare the "distributions" of responses between surveys or conditions. Bar charts are often excellent for illustrating differences between two distributions.

Figure 3 shows the number of people playing card games at the Yahoo web site on a Sunday and on a Wednesday in the spring of 2001.

There were more players overall on Wednesday compared to Sunday.

There were about twice as many people playing hearts on Wednesday as on Sunday.

[Figure: bar chart of majors in Soc 549 (Sociology, Criminology, Psychology) with the vertical axis running from 8 to 15.]

Graphical mistakes to avoid with bar charts (axis distortion)

The heights of the pictures accurately represent the number of buyers. Nevertheless, Figure 6 is misleading because the viewer's attention will be captured by areas. This can exaggerate the size differences between the groups.

In terms of percentages, the ratio of previous Macintosh owners to previous Windows owners is about 6 to 1. But the ratio of the two areas in Figure 6 is about 35 to 1.
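The jump from "6 to 1" to "35 to 1" follows if each picture is scaled in width as well as in height, so that its area grows with the square of its height. The sketch below assumes that uniform scaling; the counts 355 and 60 are derived from the percentages in the iMac example (500 customers, 12% Windows, 17% new purchasers) and are not quoted directly in the text.

```python
# Derived counts: previous Macintosh owners and previous Windows owners.
mac, windows = 355, 60

# What the data say: the ratio of the two heights.
height_ratio = mac / windows

# What the eye sees if a picture is scaled uniformly:
# width and height both grow, so area grows with the square.
area_ratio = height_ratio ** 2

print(f"height ratio: {height_ratio:.1f} to 1")
print(f"area ratio:   {area_ratio:.1f} to 1")
```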

Graphical mistakes to avoid with bar charts

Baseline = bottom of the Y-axis, representing the least number of cases in a category.

Normally, this number should be zero: starting the vertical axis above zero exaggerates all differences.

A distortion in bar charts may result from setting the baseline to a value other than zero.

Figure 7 shows the iMac data with a baseline of 50. The number of Windows-switchers seems minuscule compared to its true value of 12%.
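A rough way to quantify the baseline effect is to compare the ratio of the true counts with the ratio of the bar heights actually drawn above the baseline. This is only a sketch: the counts are derived from the percentages in the iMac example, and `drawn_height` is a hypothetical helper name.

```python
# Counts derived from the iMac example (500 customers; 71%, 12%, 17%).
counts = {"Macintosh": 355, "Windows": 60, "None": 85}
baseline = 50  # the Y-axis starts here instead of at zero


def drawn_height(count, baseline):
    """Length of the bar that is actually visible above the baseline."""
    return count - baseline


true_ratio = counts["Macintosh"] / counts["Windows"]
drawn_ratio = (drawn_height(counts["Macintosh"], baseline)
               / drawn_height(counts["Windows"], baseline))

print(f"true ratio:  {true_ratio:.1f} to 1")
print(f"drawn ratio: {drawn_ratio:.1f} to 1")
```

With a baseline of zero the two ratios coincide, which is why zero is the honest default.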

Graphical mistakes to avoid with bar charts (perspective distortion)

Add a meaningless 3rd dimension: reduces differences (the caps are the same size).

Add a 3rd dimension and overlap: exaggerates differences, hides the side of the smaller bars, and also hides part of the top. Rotation would make it worse.

[Figures: 3-D bar charts of the categories Psychology, Sociology, and Criminology.]

Graphs for quantitative variables

Line graph

Line graphs are appropriate only when both the X- and Y-axes display ordinal or interval (rather than qualitative) variables.

Although bar graphs can also be used in this situation, line graphs are generally better at comparing changes from period to period.

[Figures: bar chart and line chart of majors in Soc 549 (Sociology, Criminology, Psychology); vertical axis from 0 to 16.]

Bar vs. line: similarities

Bar and line charts are almost equivalent.

Start with a bar chart.

Connect the tops.

Remove the bottoms.

You get a line chart!

[Figures: a bar chart and the corresponding line chart; category labels Sociology, Criminology, Psychology and Sophomore, Junior, Senior; vertical axis from 0 to 16.]

Bar vs. line: differences

A line suggests a trend more strongly: helpful with ordinal or interval variables, misleading with nominal variables.

A line eases comparison of groups.

[Figures: bar and line charts with the legend entries Social statistics and Sociology of Sport and the categories Sociology, Criminology, Psychology.]

Graphical mistakes to avoid with line charts

Axis distortion: start the vertical axis above zero, or break the vertical axis. This exaggerates the trend.

Perspective distortion: add a meaningless 3rd dimension; tilt the horizontal axis.

Frequency polygon

A graphical device for understanding the shapes of distributions.

Especially helpful for comparing sets of data.

Steps:

Choose a class interval.

Draw an X-axis representing the values of the scores in your data.

Mark the middle of each class interval with a tick mark and label it with the middle value represented by the class.

Draw the Y-axis to indicate the frequency of each class.

Place a point in the middle of each class interval at the height corresponding to its frequency.

Connect the points. Include one class interval below the lowest value and one above the highest value, so that the graph touches the X-axis on both sides.
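The construction steps can be sketched in Python as follows. The function name, the scores, and the class limits are all hypothetical and chosen only for illustration; the sketch returns the points to be connected, padded with one empty class on each side so that the polygon touches the X-axis.

```python
def frequency_polygon_points(scores, start, width, n_classes):
    """Return (midpoint, frequency) pairs, padded with an empty class on each side."""
    freqs = [0] * n_classes
    for s in scores:
        k = int((s - start) // width)  # index of the class interval containing s
        if 0 <= k < n_classes:
            freqs[k] += 1
    # The midpoint of class k is start + (k + 0.5) * width.
    points = [(start + (k + 0.5) * width, f) for k, f in enumerate(freqs)]
    # One extra class with frequency 0 on each side closes the polygon.
    left = (start - 0.5 * width, 0)
    right = (start + (n_classes + 0.5) * width, 0)
    return [left] + points + [right]


scores = [41, 47, 52, 55, 58, 61, 63, 66, 72]  # hypothetical scores
points = frequency_polygon_points(scores, start=39.5, width=10, n_classes=4)
for midpoint, frequency in points:
    print(midpoint, frequency)
```

Plotting these points in order and joining them with straight lines gives the frequency polygon.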

Frequency polygon

A good choice for displaying frequency distributions.

Comparing distributions

Goal: move a computer mouse to a target on the screen as fast as possible.

Small rectangle: 20 trials. Large rectangle: 20 trials.

The time to reach the target was recorded on each trial.

The two distributions (one for each target) are plotted together.

The figure shows that it generally took longer to move the mouse to the small target than to the large one.

Cumulative frequency polygon

The Y value for each point is the number of cases in the corresponding class interval plus all cases in lower intervals.

Example: there are no scores in the interval labeled "35," three in the interval "45," and 10 in the interval "55." Therefore the Y value corresponding to "55" is 13.

Since 642 students took the test, the cumulative frequency for the last interval is 642.
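The running total in this example can be reproduced with `itertools.accumulate` from the Python standard library. Only the first three class intervals are given in the text, so the sketch stops there.

```python
from itertools import accumulate

# The first three class intervals from the example (labels are class midpoints).
midpoints = [35, 45, 55]
frequencies = [0, 3, 10]

# Each cumulative value is the class frequency plus all lower frequencies.
cumulative = list(accumulate(frequencies))

for midpoint, total in zip(midpoints, cumulative):
    print(midpoint, total)
```

Continuing the same running total over all intervals would end at 642, the number of students.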

Comparing distributions

Box plots and histograms: for continuous variables

To show the distribution (shape, center, range, variation) of continuous variables.

Histograms

Useful for displaying the shape of a distribution when there are a large number of observations.

Steps

Create a frequency table. To simplify the table, group values together into class intervals: the range of values is broken into intervals.

Bin widths = widths of the class intervals.

Place the limits of the class intervals midway between two numbers (e.g., 49.5). This ensures that every score will fall in an interval rather than on the boundary between intervals.

Count the number of scores falling into each interval (the class frequencies).

Horizontal axis: can represent equal or unequal class intervals (“bins”).

Vertical axis: bars represent class frequencies.

The height of each bar corresponds to its class frequency (for constant class intervals).

The area of each bar corresponds to its class frequency (for variable class intervals): area and not height.
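For unequal class intervals, the "area and not height" rule means each bar height is the class frequency divided by the class width. A minimal sketch, using hypothetical class limits and frequencies:

```python
# Hypothetical class intervals (lower limit, upper limit) and class frequencies.
classes = [(0, 10), (10, 20), (20, 50)]
frequencies = [20, 30, 30]

heights = []
for (lower, upper), frequency in zip(classes, frequencies):
    width = upper - lower
    # Bar height chosen so that width * height equals the class frequency.
    heights.append(frequency / width)

for (lower, upper), height in zip(classes, heights):
    print(f"[{lower}, {upper}): height {height}")
```

The second and third classes hold the same number of scores, yet the wide class gets a bar one third as tall; drawing both at height 30 would overstate the wide class.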

Histograms (choice of bin widths)

The choice of bin width affects the shape of the histogram.

“Rules of thumb”

Sturges' rule: set the number of intervals as close as possible to 1 + log2(N), where log2(N) is the base 2 log of the number of observations.

Rice’s rule: set the number of intervals to twice the cube root of the number of observations.
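Both rules of thumb are easy to evaluate. The sketch below rounds each result to the nearest whole number of intervals; the rounding is an assumption, since the rules only say "as close as possible". N = 1000 is the sample size of the example histograms in this lecture.

```python
import math


def sturges(n):
    """Number of intervals closest to 1 + log2(n)."""
    return round(1 + math.log2(n))


def rice(n):
    """Number of intervals closest to twice the cube root of n."""
    return round(2 * n ** (1 / 3))


n = 1000
print("Sturges:", sturges(n))
print("Rice:   ", rice(n))
```

For N = 1000 the two rules disagree by almost a factor of two, which is one reason to treat them as starting points only.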

Best advice: experiment with different choices of width and choose a histogram according to how well it communicates the shape of the distribution.

[Figures: histograms of x for N = 1000, drawn with 50 classes and with 11 classes; vertical axes labeled Frequency and Density.]

Histograms (relative frequencies)

Histograms can be based on relative frequencies: the proportion of scores in each interval rather than the number of scores. The Y axis then runs from 0 to 1.

A histogram based on frequencies can be changed to one based on relative frequencies by dividing each class frequency by the total number of observations and plotting the quotients on the Y axis (labeled as proportion).

Box plots

Useful for identifying outliers and for comparing distributions.

Steps

Compute the 25th, 50th, and 75th percentiles in the distribution.

Lower hinge: 25th percentile.

Upper hinge: 75th percentile.

Put "whiskers" above and below each box to give additional information about the spread of data. Whiskers are vertical lines that end in a horizontal stroke. They are drawn from the upper and lower hinges to the upper and lower adjacent values.

Put additional marks beyond the whiskers for outside values (small o’s or asterisks).

Example

Students in Introductory Statistics were presented with a page containing 30 colored rectangles.

Task: name the colors as quickly as possible and record their times.

Compare the scores for the 16 men and 31 women who participated in the experiment by making separate box plots for each gender.

Distribution of the scores for the 31 women: the 25th percentile is 17, the 50th percentile is 19, and the 75th percentile is 20.

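Box plot: Shock Index

[Figure: box plot of Shock Index (SI); vertical axis in Shock Index units from 0.0 to 2.0, with the "whisker" and the outliers marked.]

maximum: 1.7

75th percentile: 0.8

median: 0.66

25th percentile: 0.55

minimum (or Q1 - 1.5 IQR)

interquartile range (IQR) = 0.8 - 0.55 = 0.25

Q3 + 1.5 IQR = 0.8 + 1.5(0.25) = 1.175; values beyond this limit are outliers.

Box plot: Age

[Figure: box plot of AGE; vertical axis from 0.0 to 100.0, marking the minimum, 25th percentile, median, 75th percentile, maximum, and interquartile range. More symmetric than the Shock Index box plot.]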
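The whisker limits for the Shock Index box plot follow directly from the two quartiles. The sketch below recomputes them from the values given on the slide; `box_plot_fences` is a hypothetical helper name.

```python
def box_plot_fences(q1, q3):
    """Return (lower limit, upper limit, IQR) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr, iqr


# Quartiles of the Shock Index as given on the slide.
lower, upper, iqr = box_plot_fences(q1=0.55, q3=0.8)

maximum = 1.7
is_outlier = maximum > upper  # the maximum lies beyond the upper limit

print(f"IQR = {iqr:.2f}")
print(f"upper limit = {upper:.3f}")
print(f"lower limit = {lower:.3f}")
print("maximum is an outlier:", is_outlier)
```

The same function applied to the women's color-naming times (hinges 17 and 20) gives limits of 12.5 and 24.5.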

Comparing distributions

Half the scores in a distribution are between the hinges (recall that the hinges are the 25th and 75th percentiles).

Half the women's times are between 17 and 20; half the men's times are between 19 and 25.5.

For the men the 25th percentile is 19, the 50th percentile is 22.5, and the 75th percentile is 25.5.

Women generally named the colors faster than the men did, although one woman was slower than almost all of the men.

Or…

Individual scores are represented by dots. Jitter the points: one dot for each subject.

The means are indicated by green lines rather than plus signs.

The mean of all scores is indicated by a gray line.

The box for the women is wider than the box for the men because the widths of the boxes are proportional to the number of subjects of each gender (31 women and 16 men).

Time series: don’t show distributions, show change over time

[Figure: line chart of the percentage of women among BAs in social science and history, 1970 to 1995 (National Center for Educational Statistics); vertical axis: % women, 0% to 50%.]

Axis distortion: start (or break) the vertical axis above zero

[Figure: the same series with the vertical axis running from 30% to 46%.]

Squeeze the vertical axis or stretch the horizontal axis

Squeeze the horizontal axis or stretch the vertical axis

[Figures: the same series redrawn with squeezed and stretched axes.]

Summary: Graphical display of distributions

[Table: suitability of pie, bar, line, histogram, and boxplot displays for nominal, ordinal, and interval variables. Entries: √ and "if continuous"; legend: book approves / book disapproves.]

Summary: Common distortions

False perspective, e.g., tilting a pie chart.

Shortening an axis, e.g., not starting the vertical at 0, breaking the vertical, squishing the horizontal.

Reasons: add visual interest; make small differences look big, or make big differences look small.

Honest aspect ratio is 3:2 (Tufte).

Summary: Graphical distortion

Axis distortion: squeeze one axis; start or break the vertical axis above zero.

Perspective distortion: add disproportionate areas in a meaningless 3rd dimension; use blocking & tilting.

Graphics: Good advice

Keep it simple.

Don’t use 3-D. If you have to use 3D, avoid abuses.

Don’t stretch axes.

Don’t start or break axes above zero.

With just a few numbers, consider a table instead of a graph.

Graphics: Evil advice

Maximize differences that serve your purpose.

Minimize differences that work against you.

Use every trick (3D, distorted axes).

Characterizing events/phenomena by time, place, and person

Who? How many? = Person (variables: age, race, gender, education, working status,…)

Where? = Place

When? How long? = Time

Person

How many?

| Race | Pop. size | # Salmonella cases |
|-------|-----------|--------------------|
| Black | 1,450,675 | 119 |
| White | 5,342,532 | 497 |

Person representation

Lists (tables)

Graphs: pie and bar charts, histograms, frequency polygon

Time

When? How long?

Time representation

Place

Where?

Place representation

Review of shape

Shapes

Symmetric

Skewed: positive (right), negative (left)

Clustering: unimodal, bimodal, multimodal

Symmetry

Symmetry, no skew: two tails, or no tails.

Important example: the normal curve

symmetric

bell-shaped

very specific numeric properties

[Figure: normal distribution; histogram of the height of adult males (inches), horizontal axis from 58.00 to 82.00, vertical axis from 0 to 200.]

Histogram: Age

Not skewed, but not bell-shaped either…

[Figure: histogram of AGE (years), horizontal axis from 0.0 to 100.0; vertical axis: frequency.]

Shape of distributions: positive or right skew

Stretched (skewed) to the right.

Characteristics: peak on left, long right tail, a few large values.

Common cause: floor but no ceiling.

[Figure: histogram of starting salaries for BAs in sociology, 2000-2001 (survey of college placement offices, National Association of Colleges and Employers). Horizontal axis: starting salary in thousands, 22.5 to 47.5; vertical axis: percent. N = 96, mean = 28.7, std. dev. = 4.31; mode around $27K.]

Shape of distributions: negative or left skew

Stretched (skewed) to the left.

Characteristics mirror positive skew: peak on right, long left tail, a few small values.

Common cause: ceiling but no floor.

[Figure: histogram of Assignment 1 scores, sociology 549, winter 2001. Horizontal axis: scores from 35.0 to 100.0. N = 101, mean = 75.4, std. dev. = 15.79.]

Unimodal distributions

Mode: peak, the most common value.

Unimodal: one peak, e.g., starting salaries.

Interpretation: the most common salaries are in the high $20s.

[Figure: histogram of starting salaries for BAs in sociology, 2000-2001 (survey of college placement offices, National Association of Colleges and Employers). Horizontal axis: starting salary in thousands, 22.5 to 47.5. N = 96, mean = 28.7, std. dev. = 4.31.]

Bimodal distributions

Bimodal: two modes, e.g., # children, with modes at 0 and 2. Interpretation?

[Figure: bar chart of NUMBER OF CHILDREN (0 to EIGHT OR MORE); vertical axis: count, 0 to 500. Figure labels: (primary) mode, secondary modes.]

Multimodal distributions

Multimodal: more than 2 modes, e.g., hours worked by OSU sociology students, with modes at 0, 20, 40.

[Figure panels: symmetry, positive skew, negative skew, bimodal.]

Most of the scores are in the middle of the distribution, with fewer scores in the extremes. The distribution is not symmetric. Indeed, the scores extend to the right farther than they do to the left. The distribution is therefore said to be skewed.