Basic structure

(Adapted from the corpus website and from http://davies-linguistics.byu.edu/personal/)

 

Genre distribution

 The major sources for each genre are as follows:


Fiction

Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2010)

Magazine

Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010)
- In each decade, the magazine texts are balanced across at least ten different magazines (with equivalent sub-genres for the 1900s)

Newspaper

PDF-to-text conversions of at least five newspapers (1850-1980), COCA etc. (1990-2010)

Non-fiction

Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010)
- In each decade, the non-fiction is balanced across the Library of Congress classification system

The corpus is balanced by genre across the decades. For example, fiction accounts for 48-55% of the total in each decade (1810s-2000s), and the corpus is also balanced across decades for sub-genres and domains (e.g. by Library of Congress classification for non-fiction, and by sub-genre for fiction: prose, poetry, drama, etc.). This balance across genres and sub-genres allows researchers to examine changes over time and be reasonably certain that the data reflect actual changes in the "real world", rather than artifacts of a changing genre balance.

 

Timeline

1810-2000

DECADE      FICTION  POPULAR MAGAZINES  NEWSPAPERS  NON-FICTION BOOKS        TOTAL  FICTION SHARE
1810s       641,164             88,316           0            451,542    1,181,022           0.54
1820s     3,751,204          1,714,789           0          1,461,012    6,927,005           0.54
1830s     7,590,350          3,145,575           0          3,038,062   13,773,987           0.55
1840s     8,850,886          3,554,534           0          3,641,434   16,046,854           0.55
1850s     9,094,346          4,220,558           0          3,178,922   16,493,826           0.55
1860s     9,450,562          4,437,941     262,198          2,974,401   17,125,102           0.55
1870s    10,291,968          4,452,192   1,030,560          2,835,440   18,610,160           0.55
1880s    11,215,065          4,481,568   1,355,456          3,820,766   20,872,855           0.54
1890s    11,212,219          4,679,486   1,383,948          3,907,730   21,183,383           0.53
1900s    12,029,439          5,062,650   1,433,576          4,015,567   22,541,232           0.53
1910s    11,935,701          5,694,710   1,489,942          3,534,899   22,655,252           0.53
1920s    12,539,681          5,841,678   3,552,699          3,698,353   25,632,411           0.49
1930s    11,876,996          5,910,095   3,545,527          3,080,629   24,413,247           0.49
1940s    11,946,743          5,644,216   3,497,509          3,056,010   24,144,478           0.49
1950s    11,986,437          5,796,823   3,522,545          3,092,375   24,398,180           0.49
1960s    11,578,880          5,803,276   3,404,244          3,141,582   23,927,982           0.48
1970s    11,626,911          5,755,537   3,383,924          3,002,933   23,769,305           0.49
1980s    12,152,603          5,804,320   4,113,254          3,108,775   25,178,952           0.48
1990s    13,272,162          7,440,305   4,060,570          3,104,303   27,877,340           0.48
2000s    14,590,078          7,678,830   4,088,704          3,121,839   29,479,451           0.49
TOTAL   207,633,395         97,207,399  40,124,656         61,266,574  406,232,024           0.51
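
The fiction share reported for each decade is simply the fiction word count divided by that decade's total across the four genres. The following is a minimal Python sketch of that arithmetic, using the word counts from the table. The names COUNTS, decade_total, fiction_share and overall_fiction_share are illustrative assumptions and not part of any corpus tooling.

```python
# Word counts per decade, copied from the table:
# (fiction, popular magazines, newspapers, non-fiction books).
# The dictionary and function names are hypothetical, chosen for this sketch.
COUNTS = {
    "1810s": (641_164, 88_316, 0, 451_542),
    "1820s": (3_751_204, 1_714_789, 0, 1_461_012),
    "1830s": (7_590_350, 3_145_575, 0, 3_038_062),
    "1840s": (8_850_886, 3_554_534, 0, 3_641_434),
    "1850s": (9_094_346, 4_220_558, 0, 3_178_922),
    "1860s": (9_450_562, 4_437_941, 262_198, 2_974_401),
    "1870s": (10_291_968, 4_452_192, 1_030_560, 2_835_440),
    "1880s": (11_215_065, 4_481_568, 1_355_456, 3_820_766),
    "1890s": (11_212_219, 4_679_486, 1_383_948, 3_907_730),
    "1900s": (12_029_439, 5_062_650, 1_433_576, 4_015_567),
    "1910s": (11_935_701, 5_694_710, 1_489_942, 3_534_899),
    "1920s": (12_539_681, 5_841_678, 3_552_699, 3_698_353),
    "1930s": (11_876_996, 5_910_095, 3_545_527, 3_080_629),
    "1940s": (11_946_743, 5_644_216, 3_497_509, 3_056_010),
    "1950s": (11_986_437, 5_796_823, 3_522_545, 3_092_375),
    "1960s": (11_578_880, 5_803_276, 3_404_244, 3_141_582),
    "1970s": (11_626_911, 5_755_537, 3_383_924, 3_002_933),
    "1980s": (12_152_603, 5_804_320, 4_113_254, 3_108_775),
    "1990s": (13_272_162, 7_440_305, 4_060_570, 3_104_303),
    "2000s": (14_590_078, 7_678_830, 4_088_704, 3_121_839),
}


def decade_total(decade: str) -> int:
    """Total number of words in a decade, summed over the four genres."""
    return sum(COUNTS[decade])


def fiction_share(decade: str) -> float:
    """Fiction words as a proportion (0 to 1) of the decade's total."""
    return COUNTS[decade][0] / decade_total(decade)


def overall_fiction_share() -> float:
    """Fiction words as a proportion of the whole corpus."""
    fiction = sum(row[0] for row in COUNTS.values())
    total = sum(sum(row) for row in COUNTS.values())
    return fiction / total


if __name__ == "__main__":
    for decade in COUNTS:
        print(f"{decade}  total={decade_total(decade):>11,}  "
              f"fiction share={fiction_share(decade):.2f}")
    print(f"TOTAL  fiction share={overall_fiction_share():.2f}")
```

Rounded to two decimal places, these proportions should match the last column of the table, which is the sense in which fiction accounts for 48-55% of each decade.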