Corpora of Early English Correspondence

The Corpus of Early English Correspondence (CEEC) has been compiled to facilitate sociolinguistic research into the history of English. The project was originally set up to test how methods developed by sociolinguists of present-day languages could be applied to historical data. The CEEC family of corpora currently covers four hundred years, from 1400 to 1800, and consists of a number of daughter corpora. The original corpus, which spans the decades from 1410 to 1680, was completed in 1998, and its sampler version (CEECS) was made publicly available the same year. The Parsed Corpus of Early English Correspondence (PCEEC), based on the original, was released in 2006 and revised by Beatrice Santorini in 2022. The 18th-century extension (CEECE) and the supplement to the original (CEECSU), together with their attendant sender and letter databases, have been completed. There are also an XML Edition and a standardized-spelling version of the corpora (SCEEC), as well as POS-tagged and sampler versions of the extension (TCEECE, CEECES). The ultimate aim of the compilers is to combine these subcorpora into one structured whole, which will amount to over 5 million running words.

Project leader: Terttu Nevalainen, University of Helsinki
Co-founder of project: Helena Raumolin-Brunberg, University of Helsinki
Time of compilation: 1993– (ongoing)
Size: 5.1 million words
Language: English (Late Middle, Early Modern, Late Modern)
Number of letter collections: 188
Number of letter writers: c. 1,200
Number of letters: c. 12,000
Period: 1402–1800
Released: CEECS 1998, PCEEC 2006/2022, CEECES 2021–2022
Funding: Academy of Finland: 1.9.1993–31.12.1995; University of Helsinki: 1.1.1996–30.6.1998; Academy of Finland, University of Helsinki: 1.1.2000– (National Centre of Excellence funding for the VARIENG Research Unit)
Project home page: https://www.helsinki.fi/en/researchgroups/varieng/research/corpus-of-early-english-correspondence

Table 1. The CEEC family.

             CEEC           CEECE       CEECSU      TOTALS
words        2,597,795      2,219,422   442,484     5,259,701
collections  96             77          19          192
letters      5,961          4,923       829         11,713
writers      778            308         94          1,180
time span    c. 1410–1681   1653–1800   1402–1663   1402–1800

Table 2. Published versions of CEEC corpora.

             CEECS       PCEEC       CEECES
words        450,085     2,159,132   1,140,286
collections  23          84          42
letters      1,123       4,970       2,624
writers      194         666         200
time span    1418–1680   1410–1681   1653–1800

Figure 1. The CEEC family of corpora, with special reference to CEECE and CEECSU (poster by Samuli Kaislaniemi).

Reference lines and copyright

CEEC = Corpus of Early English Correspondence. 1998. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki.

CEECS = Corpus of Early English Correspondence Sampler. 1998. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki.

PCEEC:

Parsed Corpus of Early English Correspondence, parsed version. 2006. Annotated by Ann Taylor, Arja Nurmi, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. Distributed through the Oxford Text Archive.

Parsed Corpus of Early English Correspondence, tagged version. 2006. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. Distributed through the Oxford Text Archive.

Parsed Corpus of Early English Correspondence, text version. 2006. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin, with additional annotation by Ann Taylor. Helsinki: University of Helsinki and York: University of York. Distributed through the Oxford Text Archive.

PCEEC2 = Parsed Corpus of Early English Correspondence 2, parsed version. 2022. Revised and corrected by Beatrice Santorini. Annotated by Ann Taylor, Arja Nurmi, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. https://github.com/beatrice57/pceec2

CEECE:

CEECE = Corpus of Early English Correspondence Extension. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Modern Languages, University of Helsinki.

TCEECE = Tagged Corpus of Early English Correspondence Extension. 2020. Annotated by Lassi Saario & Tanja Säily. Spelling standardized by Mikko Hakala, Minna Palander-Collin, Minna Nevala, Emanuela Costea, Anne Kingma & Anna-Lina Wallraff. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily & Anni Sairio at the Department of Languages, University of Helsinki.

CEECES:

CEECES 1 = Corpus of Early English Correspondence Extension Sampler, part 1. 2021. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily & Anni Sairio at the Department of Languages, University of Helsinki. XML conversion and encoding by Lassi Saario. Helsinki: VARIENG. https://doi.org/10.5281/zenodo.4644243

CEECES 2 = Corpus of Early English Correspondence Extension Sampler, part 2. 2022. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily & Anni Sairio at the Department of Languages, University of Helsinki. XML conversion and encoding by Lassi Saario. Helsinki: VARIENG. https://doi.org/10.5281/zenodo.5887100

TCEECES = Tagged Corpus of Early English Correspondence Extension Sampler. 2022. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily & Anni Sairio at the Department of Languages, University of Helsinki. Spelling standardized by Mikko Hakala, Minna Palander-Collin, Minna Nevala, Emanuela Costea, Anne Kingma & Anna-Lina Wallraff. Annotated by Lassi Saario & Tanja Säily. XML conversion and encoding by Lassi Saario. Helsinki: VARIENG. https://doi.org/10.5281/zenodo.5887230

CEECSU = Corpus of Early English Correspondence Supplement. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Modern Languages, University of Helsinki.

SCEEC = Standardised-spelling Corpora of Early English Correspondence. 2012. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Jukka Keränen, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio. Standardised by Mikko Hakala, Minna Palander-Collin and Minna Nevala. Department of English / Department of Modern Languages, University of Helsinki.

Manual

CEECS: Nurmi, Arja (ed.) (1998) Manual for the Corpus of Early English Correspondence Sampler CEECS. Department of Modern Languages. University of Helsinki. https://icame.info/icame_static/manuals/CEECS/INDEX.HTM

PCEEC: Ann Taylor and Beatrice Santorini (2006) http://www-users.york.ac.uk/~lang22/PCEEC-manual/

PCEEC2 annotation manual: Beatrice Santorini (2022) https://www.ling.upenn.edu:/~beatrice/corpus-ling/annotation-202x

TCEECE: Saario, Lassi and Tanja Säily (2020) POS Tagging the CEECE: A Manual to Accompany the Tagged Corpus of Early English Correspondence (TCEECE). Helsinki: VARIENG. https://varieng.helsinki.fi/CoRD/corpora/CEEC/tceece_doc.html

CEECES: Samuli Kaislaniemi (2022) https://doi.org/10.5281/zenodo.4644243

XML Edition: Saario, Lassi (2020) Conversion of the CEEC-400 into XML: A Manual to Accompany the XML Edition. Helsinki: VARIENG. https://varieng.helsinki.fi/CoRD/corpora/CEEC/xml_doc.html

Compilers

Project leader: Terttu Nevalainen
Senior scholar: Helena Raumolin-Brunberg
CEEC: Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin.
CEECS: Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin.
PCEEC: Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin.
CEECE: Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio.
CEECSU: Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio.

Student assistants

CEEC, CEECS, PCEEC: Kirsi Heikkonen, Alistair Melville-Smith, Taru Nurmi, Arja-Liisa Rossi, Reza Sanatnama, Heli Tissari and Anne Virolainen.
CEECSU, CEECE: Maarit Alanko, Annemieke Bijkerk, Teo Juvonen, Emma Murros, Tuuli Tahko and Eero Timoskainen.

Annotation

PCEEC: Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk and Terttu Nevalainen.
TCEECE: Lassi Saario and Tanja Säily.

File format

The coding system is based on the set of ASCII codes (96 printable characters). File names follow the published letter editions on which the corpus structure is based; they are explained in more detail in the manual.
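
As a rough illustration of what an ASCII-based coding system implies in practice, the following Python sketch reports any character outside the 7-bit ASCII range in a piece of text. The function name find_non_ascii and the sample strings are hypothetical and are not part of any CEEC distribution; the actual coding conventions are those described in the manual.

```python
def find_non_ascii(text):
    """Return (line_number, column, character) for every non-ASCII character.

    Hypothetical helper for illustration only; line and column numbers
    are 1-based.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # outside the 7-bit ASCII range
                hits.append((line_no, col, ch))
    return hits


# Made-up sample strings, not taken from the corpus.
print(find_non_ascii("a made-up line\nnot taken from the corpus"))  # []
print(find_non_ascii("Keränen"))  # [(1, 4, 'ä')]
```

A check of this kind could be run over a file's contents before further processing, on the assumption that any hit points to a character that was not encoded according to the corpus conventions.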

Availability

ICAME CD-ROM (CEECS)

The Oxford Text Archive (CEECS, PCEEC)

GitHub (PCEEC2)

Zenodo (CEECES 1, CEECES 2, TCEECES)