M. Ramscar et al. / Topics in Cognitive Science 6 (2014)

Importantly, linguistic distributions are skewed at every level of description (Baayen,
2001). Consider the relationship between word types (e.g., dog) and tokens (how often
“dog” occurs; Fig. 1). In English, a few words occur very frequently (the, and), such that
half of the tokens in any large natural sample will come from only 100 or so high-frequency types. The relative frequency of these types decreases rapidly (the most-frequent
word may be twice as frequent as the second-most), and frequency differences between
types decrease as their relative frequency declines. This means that the other half of a
large natural sample will be composed of ever-fewer tokens of a very large number of
types, with ever-smaller frequency differences between them. Typically, around half of
these types occur just once.
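The type/token quantities described above can be computed directly from any tokenized sample. The following is a minimal sketch, not the procedure used to produce Fig. 1: the function name `type_token_profile`, its `top_k` parameter, and the toy sentence are all illustrative assumptions, and a sample this small cannot show the skew found in a large corpus.

```python
from collections import Counter


def type_token_profile(tokens, top_k=100):
    """Summarize how the tokens in a sample are spread across word types.

    `tokens` is a non-empty list of word strings. Hypothetical helper,
    written for illustration only.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)  # maps each type to its token count
    n_tokens = sum(counts.values())
    n_types = len(counts)
    # Share of all tokens contributed by the top_k most frequent types.
    top_coverage = sum(c for _, c in counts.most_common(top_k)) / n_tokens
    # Share of types that occur exactly once in the sample.
    singleton_share = sum(1 for c in counts.values() if c == 1) / n_types
    return {
        "n_tokens": n_tokens,
        "n_types": n_types,
        "top_coverage": top_coverage,
        "singleton_share": singleton_share,
    }


# Toy sample (an assumption, not data from the text).
toy_sample = "the dog and the cat and the bird".split()
profile = type_token_profile(toy_sample, top_k=2)
print(profile)
```

Applied to a large natural sample with `top_k=100`, `top_coverage` corresponds to the "half of the tokens" figure and `singleton_share` to the proportion of types that occur just once.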
This is a very long-tailed distribution: the Corpus of Contemporary American English
(COCA; Davies, 2009) contains 425 million entries sampled from a broad range of written sources. Repetitions of the 100 most frequently used words account for 208 million
of these entries. The remaining 217 million entries represent 2,800,000 word types. Accordingly, although individual low-frequency types are, by definition, rare, their distribution
means the chance of encountering a low-frequency token in any sentence is very high
(Möbius, 2003).
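The arithmetic behind these figures can be checked in a few lines. This is a sketch that uses only the rounded counts quoted above; the variable names are illustrative, and the per-type means are simple averages that conceal the skew within each group.

```python
# Rounded COCA counts as quoted in the text (Davies, 2009).
total_tokens = 425_000_000
top100_tokens = 208_000_000
remaining_types = 2_800_000

# Tokens not accounted for by the 100 most frequent words.
remaining_tokens = total_tokens - top100_tokens

# Proportion of the corpus covered by the 100 most frequent words.
top100_share = top100_tokens / total_tokens

# Average number of tokens per type in each group.
mean_tokens_top100 = top100_tokens / 100
mean_tokens_remaining = remaining_tokens / remaining_types

print(f"remaining tokens: {remaining_tokens:,}")
print(f"share covered by top 100 words: {top100_share:.3f}")
print(f"mean tokens per type, top 100: {mean_tokens_top100:,.0f}")
print(f"mean tokens per type, remainder: {mean_tokens_remaining:.1f}")
```

On these counts, the 100 most frequent words cover just under half of the corpus and average roughly two million tokens each, whereas the remaining types average fewer than a hundred tokens each.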
This distribution ensures both that any English speaker learns only a fraction of the
language’s total vocabulary, and that individual speakers’ vocabularies will grow steadily
across the life span. However, the vocabulary tests that are typically used to control for
the growth of knowledge in studies of cognitive aging (Salthouse & Mandell, in press)
assume vocabulary size is age-invariant in adults (Bowles & Salthouse, 2008; Carroll,
1993; Spearman, 1927), an assumption seemingly confirmed by psychometric vocabulary

Fig. 1. The frequencies of the 1,000 most common words in the Corpus of Contemporary American English
(Davies, 2009) plotted by rank.