Basic structure

(Source: the PPCEME corpus description.)

The PPCEME is divided into three subcorpora.

  1. The Helsinki directories, consisting of roughly 573,000 words, contain the Helsinki Corpus in parsed, POS-tagged, and unannotated form.
  2. The Penn1 directories, consisting of roughly 615,000 words, contain a first supplement to the Helsinki Corpus. As far as possible, we have used material by the same authors and from the same editions as the material in the Helsinki Corpus. Where the Helsinki Corpus already contains an exhaustive sample of a text, we have added new material.
  3. The Penn2 directories, consisting of roughly 606,000 words, contain a second supplement to the Helsinki Corpus. Again, we have tried to use material by the same authors and from the same editions as the material in the Helsinki Corpus. However, the Penn2 directories contain more new material than the Penn1 directories.

Word counts

Table 1. Word count summary by time period and subcorpus.

Period          Helsinki     Penn1     Penn2     Total
E1 1500-1569      196754    194018    185423    576195
E2 1570-1639      196742    223064    232993    652799
E3 1640-1710      179477    197908    187631    565016
Total             572973    614990    606047   1794010

Figure 1. Word count summary by time period and subcorpus.
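
The marginal totals in Table 1 can be re-derived from the nine cell values. The following is a minimal sketch, assuming the counts are transcribed by hand into a nested dictionary; the names `counts`, `period_totals`, and `subcorpus_totals` are illustrative and not part of the corpus distribution.

```python
# Word counts transcribed from Table 1 (outer keys: periods, inner keys: subcorpora).
counts = {
    "E1": {"Helsinki": 196754, "Penn1": 194018, "Penn2": 185423},
    "E2": {"Helsinki": 196742, "Penn1": 223064, "Penn2": 232993},
    "E3": {"Helsinki": 179477, "Penn1": 197908, "Penn2": 187631},
}


def period_totals(table):
    """Sum each row: total words per time period."""
    return {period: sum(row.values()) for period, row in table.items()}


def subcorpus_totals(table):
    """Sum each column: total words per subcorpus."""
    totals = {}
    for row in table.values():
        for name, n in row.items():
            totals[name] = totals.get(name, 0) + n
    return totals


row_totals = period_totals(counts)
col_totals = subcorpus_totals(counts)
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The row sums should match the Total column, and the column sums should match the Total row, of Table 1.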

Table 2. Word count summary by text genre.

Text genre                Number of words  Percentage
Bible                              134275        7.7%
Travelogue                         125337        7.2%
Diary, private                     123106        7.0%
Drama, comedy                      120428        6.9%
Letters, private                   116915        6.7%
Fiction                            116494        6.7%
Law                                115863        6.6%
Educational treatise               113032        6.5%
Handbook, other                    112419        6.4%
History                            108706        6.2%
Proceedings, trials                105090        6.0%
Sermon                              97400        5.6%
Philosophy                          85107        4.9%
Science, other                      79050        4.5%
Letters, non-private                59868        3.4%
Biography, other                    52755        3.0%
Science, medicine                   41786        2.4%
Biography, autobiography            41379        2.4%
Total                             1749010      100.0%

Figure 2. Word count summary by text genre.
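
The Percentage column of Table 2 is each genre's share of the words listed in that table. The following is a minimal sketch of that arithmetic, assuming the counts are transcribed by hand and that shares are rounded to one decimal place; the total is computed from the listed rows, and the names `genre_counts` and `genre_shares` are illustrative only.

```python
# Word counts per genre, transcribed from Table 2.
genre_counts = {
    "Bible": 134275,
    "Travelogue": 125337,
    "Diary, private": 123106,
    "Drama, comedy": 120428,
    "Letters, private": 116915,
    "Fiction": 116494,
    "Law": 115863,
    "Educational treatise": 113032,
    "Handbook, other": 112419,
    "History": 108706,
    "Proceedings, trials": 105090,
    "Sermon": 97400,
    "Philosophy": 85107,
    "Science, other": 79050,
    "Letters, non-private": 59868,
    "Biography, other": 52755,
    "Science, medicine": 41786,
    "Biography, autobiography": 41379,
}


def genre_shares(table):
    """Return each genre's share of the table total, in percent (one decimal)."""
    total = sum(table.values())
    return {genre: round(100 * n / total, 1) for genre, n in table.items()}


genre_total = sum(genre_counts.values())
shares = genre_shares(genre_counts)

for genre, pct in shares.items():
    print(f"{genre:<26}{genre_counts[genre]:>8}{pct:>7.1f}%")
print(f"{'Total':<26}{genre_total:>8}")
```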