Basic structure of the Corpora of Early English Correspondence

The sampling unit in the CEEC corpora is the individual letter writer. In principle, writers of both sexes and all sections of the social hierarchy are included from each successive 20-year period covered by the corpus. Continuity of regional coverage was ensured by selecting informants from London, East Anglia and the North (the counties north of Lincolnshire), as well as from among members of the Royal Court, many of whom lived in Westminster. These four areas account for about half of the original CEEC.

Where possible, a minimum of ten medium-length letters per informant was aimed at, and more were included especially for writers whose correspondence spanned two or three successive 20-year periods. By contrast, fewer than ten letters were available from many women writers and from writers in the lower social ranks.

To make the CEEC as flexible a tool as possible, its different versions come in two basic file formats. The first format follows the edited letter collections from which the material was selected. Its advantage is that it may offer the corpus user ready access to some sociolinguistically relevant communities, such as extended families and their networks of friends, partners, colleagues and other contacts. For many sociolinguistic studies, however, files representing individual letter writers are preferable.

The original 1998 version of the CEEC has 778 letter writers, retrieved from 96 collections, and it exists both as collection-based files and as files for individual writers. In the latter format it consists of 244 files for individual writers with a minimum of 2,000 words each, and six period-specific files containing the data from the rest of the writers. The CEEC Supplement is based on 19 collections (94 informants), and the 18th-century Extension on 77 collections (308 informants). Both exist in the two formats.
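
The 2,000-word threshold described above can be illustrated with a minimal sketch. It assumes only that a word count is known for each writer. The function name, the writer labels and the counts are invented for illustration and are not taken from the corpus, and the assignment of the remaining writers to the six period-specific files is not modelled here.

```python
from typing import Dict, List, Tuple

# Threshold stated in the text: writers with at least this many words
# get a file of their own.
MIN_WORDS = 2000


def split_writers(word_counts: Dict[str, int],
                  min_words: int = MIN_WORDS) -> Tuple[List[str], List[str]]:
    """Partition writers into those with individual files and the rest.

    `word_counts` maps a (hypothetical) writer label to a word count.
    Returns two alphabetically sorted lists: (individual, pooled).
    """
    individual = sorted(w for w, n in word_counts.items() if n >= min_words)
    pooled = sorted(w for w, n in word_counts.items() if n < min_words)
    return individual, pooled


# Invented example data, not actual CEEC figures.
example_counts = {"writer_a": 5400, "writer_b": 1999, "writer_c": 2000}
individual, pooled = split_writers(example_counts)
```

With the invented counts above, writer_a and writer_c would receive individual files, while writer_b would be pooled with the remaining writers.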

The two published versions of the CEEC come only in collection-based format. The Sampler version, CEECS, consists of 23 collections with 194 informants. It is further divided into two partly overlapping subperiods, CEECS1 (1418–1638) and CEECS2 (1580–1680), which constitute separate files.

The linguistically annotated version, PCEEC, has 666 informants from a total of 84 collections. It comes in collection-based files of three different kinds: plain text files, part-of-speech tagged files, and parsed text files. The collections can be periodized following, for instance, the Helsinki Corpus periods: M3 (1350–1419), M4 (1420–1499), E1 (1500–1569), E2 (1570–1639), and E3 (1640–1710), as this information has been encoded into each letter, together with the year of writing.
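
As an illustration of the periodization mentioned above, the following minimal sketch maps a year of writing to a Helsinki Corpus period label. The period boundaries are those listed in the text. The function and variable names are hypothetical, and the sketch assumes that the year has already been extracted from the letter as an integer. It does not reproduce how the information is actually encoded in the PCEEC files.

```python
from typing import Optional

# Helsinki Corpus periods as listed in the text: (label, first year, last year).
HC_PERIODS = [
    ("M3", 1350, 1419),
    ("M4", 1420, 1499),
    ("E1", 1500, 1569),
    ("E2", 1570, 1639),
    ("E3", 1640, 1710),
]


def hc_period(year: int) -> Optional[str]:
    """Return the Helsinki Corpus period label for a year of writing.

    Returns None if the year falls outside the ranges listed above.
    """
    for label, start, end in HC_PERIODS:
        if start <= year <= end:
            return label
    return None
```

For example, a letter written in 1580 would fall into E2, and a year outside 1350–1710 would yield no period label.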