Basic structure
(Adapted from the corpus website and from http://davies-linguistics.byu.edu/personal/)
Genre distribution
The major sources for each genre are as follows:
| Genre | Major sources |
| --- | --- |
| Fiction | Project Gutenberg (1810-1930), Making of America (1810-1900), scanned books (1930-1990), movie and play scripts, COCA (1990-2010) |
| Magazine | Making of America (1810-1900), scanned and PDF (1900-1990), COCA (1990-2010). In each decade, the magazines are balanced across at least ten magazines (with equivalent sub-genres for the 1900s) |
| Newspaper | PDF > TXT of at least five newspapers (1850-1980), COCA etc. (1990-2010) |
| Non-fiction | Project Gutenberg (1810-1900), www.archive.org (1810-1900), scanned books (1900-1990), COCA (1990-2010). In each decade, the non-fiction is balanced across the Library of Congress classification system |
The corpus is balanced by genre across the decades. For example, fiction accounts for 48-55% of the total in each decade (1810s-2000s). The corpus is also balanced across decades for sub-genres and domains (e.g. by Library of Congress classification for non-fiction, and by sub-genre for fiction: prose, poetry, drama, etc.). This balance across genres and sub-genres allows researchers to examine changes and be reasonably certain that the data reflect actual changes in the "real world", rather than artifacts of a changing genre balance.
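As a rough check of the balance claim, the fiction share can be recomputed from the per-decade word counts given in the timeline table. The Python sketch below uses three decades copied from that table. The dictionary layout, the tuple order, and the function name `fiction_share` are illustrative assumptions made for this example, not part of any corpus tooling.

```python
# Word counts per genre for three decades, copied from the timeline table.
# Assumed tuple order for this sketch:
# (fiction, popular magazines, newspapers, non-fiction books)
COUNTS = {
    "1810s": (641_164, 88_316, 0, 451_542),
    "1920s": (12_539_681, 5_841_678, 3_552_699, 3_698_353),
    "2000s": (14_590_078, 7_678_830, 4_088_704, 3_121_839),
}


def fiction_share(counts):
    """Return fiction words as a fraction of all words in one decade."""
    fiction = counts[0]
    total = sum(counts)
    return fiction / total


# Round to two decimals, matching the precision of the % FICTION column.
shares = {decade: round(fiction_share(c), 2) for decade, c in COUNTS.items()}

for decade, share in shares.items():
    print(f"{decade}: {share:.2f}")
```

Extending `COUNTS` with the remaining rows of the table would allow the same check to be run for every decade.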
Timeline
1810-2000
| DECADE | FICTION | POPULAR MAGAZINES | NEWSPAPERS | NON-FICTION BOOKS | TOTAL | % FICTION |
| --- | --- | --- | --- | --- | --- | --- |
| 1810s | 641,164 | 88,316 | 0 | 451,542 | 1,181,022 | 0.54 |
| 1820s | 3,751,204 | 1,714,789 | 0 | 1,461,012 | 6,927,005 | 0.54 |
| 1830s | 7,590,350 | 3,145,575 | 0 | 3,038,062 | 13,773,987 | 0.55 |
| 1840s | 8,850,886 | 3,554,534 | 0 | 3,641,434 | 16,046,854 | 0.55 |
| 1850s | 9,094,346 | 4,220,558 | 0 | 3,178,922 | 16,493,826 | 0.55 |
| 1860s | 9,450,562 | 4,437,941 | 262,198 | 2,974,401 | 17,125,102 | 0.55 |
| 1870s | 10,291,968 | 4,452,192 | 1,030,560 | 2,835,440 | 18,610,160 | 0.55 |
| 1880s | 11,215,065 | 4,481,568 | 1,355,456 | 3,820,766 | 20,872,855 | 0.54 |
| 1890s | 11,212,219 | 4,679,486 | 1,383,948 | 3,907,730 | 21,183,383 | 0.53 |
| 1900s | 12,029,439 | 5,062,650 | 1,433,576 | 4,015,567 | 22,541,232 | 0.53 |
| 1910s | 11,935,701 | 5,694,710 | 1,489,942 | 3,534,899 | 22,655,252 | 0.53 |
| 1920s | 12,539,681 | 5,841,678 | 3,552,699 | 3,698,353 | 25,632,411 | 0.49 |
| 1930s | 11,876,996 | 5,910,095 | 3,545,527 | 3,080,629 | 24,413,247 | 0.49 |
| 1940s | 11,946,743 | 5,644,216 | 3,497,509 | 3,056,010 | 24,144,478 | 0.49 |
| 1950s | 11,986,437 | 5,796,823 | 3,522,545 | 3,092,375 | 24,398,180 | 0.49 |
| 1960s | 11,578,880 | 5,803,276 | 3,404,244 | 3,141,582 | 23,927,982 | 0.48 |
| 1970s | 11,626,911 | 5,755,537 | 3,383,924 | 3,002,933 | 23,769,305 | 0.49 |
| 1980s | 12,152,603 | 5,804,320 | 4,113,254 | 3,108,775 | 25,178,952 | 0.48 |
| 1990s | 13,272,162 | 7,440,305 | 4,060,570 | 3,104,303 | 27,877,340 | 0.48 |
| 2000s | 14,590,078 | 7,678,830 | 4,088,704 | 3,121,839 | 29,479,451 | 0.49 |
| TOTAL | 207,633,395 | 97,207,399 | 40,124,656 | 61,266,574 | 406,232,024 | 0.51 |