Corpora of Early English Correspondence
The Corpus of Early English Correspondence (CEEC) was compiled to facilitate sociolinguistic research into the history of English. The project was originally set up to test how methods developed by sociolinguists for present-day languages could be applied to historical data. The CEEC family of corpora currently covers four hundred years, from 1400 to 1800, and consists of a number of daughter corpora. The original corpus, which spans the decades from 1410 to 1680, was completed in 1998, and its sampler version (CEECS) was made publicly available the same year. Based on the original, the Parsed Corpus of Early English Correspondence (PCEEC) was released in 2006 and revised by Beatrice Santorini in 2022. The 18th-century extension (CEECE) and the supplement to the original (CEECSU), together with their attendant sender and letter databases, have been completed. There are also an XML Edition and a standardized-spelling version of the corpora (SCEEC), as well as POS-tagged and sampler versions of the extension (TCEECE, CEECES). The ultimate aim of the compilers is to combine these subcorpora into one structured whole, which will amount to over 5 million running words.
Project leader: Terttu Nevalainen, University of Helsinki
Co-founder of project: Helena Raumolin-Brunberg, University of Helsinki
Time of compilation: 1993– (ongoing)
Size: 5.1 million words
Language: English (Late Middle, Early Modern, Late Modern)
Number of letter collections: 188
Number of letter writers: c. 1,200
Number of letters: c. 12,000
Period: 1402–1800
Released: CEECS 1998, PCEEC 2006/2022, CEECES 2021–2022
Funding: Academy of Finland: 1.9.1993–31.12.1995; University of Helsinki: 1.1.1996–30.06.1998; Academy of Finland, University of Helsinki: 1.1.2000– (National Centre of Excellence funding for the VARIENG Research Unit)
Project home page: https://www.helsinki.fi/en/researchgroups/varieng/research/corpus-of-early-english-correspondence
Table 1. The CEEC family.
| | CEEC | CEECE | CEECSU | TOTALS |
|---|---|---|---|---|
| words | 2,597,795 | 2,219,422 | 442,484 | 5,259,701 |
| collections | 96 | 77 | 19 | 192 |
| letters | 5,961 | 4,923 | 829 | 11,713 |
| writers | 778 | 308 | 94 | 1,180 |
| time span | c. 1410–1681 | 1653–1800 | 1402–1663 | 1402–1800 |
Table 2. Published versions of CEEC corpora.
| | CEECS | PCEEC | CEECES |
|---|---|---|---|
| words | 450,085 | 2,159,132 | 1,140,286 |
| collections | 23 | 84 | 42 |
| letters | 1,123 | 4,970 | 2,624 |
| writers | 194 | 666 | 200 |
| time span | 1418–1680 | 1410–1681 | 1653–1800 |
Reference lines and copyright
CEEC = Corpus of Early English Correspondence. 1998. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki.
CEECS = Corpus of Early English Correspondence Sampler. 1998. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin at the Department of Modern Languages, University of Helsinki.
PCEEC:
Parsed Corpus of Early English Correspondence, parsed version. 2006. Annotated by Ann Taylor, Arja Nurmi, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. Distributed through the Oxford Text Archive.
Parsed Corpus of Early English Correspondence, tagged version. 2006. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. Distributed through the Oxford Text Archive.
Parsed Corpus of Early English Correspondence, text version. 2006. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin, with additional annotation by Ann Taylor. Helsinki: University of Helsinki and York: University of York. Distributed through the Oxford Text Archive.
PCEEC2 = Parsed Corpus of Early English Correspondence 2, parsed version. 2022. Revised and corrected by Beatrice Santorini. Annotated by Ann Taylor, Arja Nurmi, Anthony Warner, Susan Pintzuk, and Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. https://github.com/beatrice57/pceec2
CEECE:
CEECE = Corpus of Early English Correspondence Extension. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Modern Languages, University of Helsinki.
TCEECE = Tagged Corpus of Early English Correspondence Extension. 2020. Annotated by Lassi Saario & Tanja Säily. Spelling standardized by Mikko Hakala, Minna Palander-Collin, Minna Nevala, Emanuela Costea, Anne Kingma & Anna-Lina Wallraff. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily & Anni Sairio at the Department of Languages, University of Helsinki.
CEECES:
CEECES 1 = Corpus of Early English Correspondence Extension Sampler, part 1. 2021. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily & Anni Sairio at the Department of Languages, University of Helsinki. XML conversion and encoding by Lassi Saario. Helsinki: VARIENG. https://doi.org/10.5281/zenodo.4644243
CEECES 2 = Corpus of Early English Correspondence Extension Sampler, part 2. 2022. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily & Anni Sairio at the Department of Languages, University of Helsinki. XML conversion and encoding by Lassi Saario. Helsinki: VARIENG. https://doi.org/10.5281/zenodo.5887100
TCEECES = Tagged Corpus of Early English Correspondence Extension Sampler. 2022. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily & Anni Sairio at the Department of Languages, University of Helsinki. Spelling standardized by Mikko Hakala, Minna Palander-Collin, Minna Nevala, Emanuela Costea, Anne Kingma & Anna-Lina Wallraff. Annotated by Lassi Saario & Tanja Säily. XML conversion and encoding by Lassi Saario. Helsinki: VARIENG. https://doi.org/10.5281/zenodo.5887230
CEECSU = Corpus of Early English Correspondence Supplement. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Modern Languages, University of Helsinki.
SCEEC = Standardised-spelling Corpora of Early English Correspondence. 2012. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Jukka Keränen, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio. Standardised by Mikko Hakala, Minna Palander-Collin and Minna Nevala. Department of English / Department of Modern Languages, University of Helsinki.
CEECS: Nurmi, Arja (ed.) (1998) Manual for the Corpus of Early English Correspondence Sampler CEECS. Department of Modern Languages. University of Helsinki. https://icame.info/icame_static/manuals/CEECS/INDEX.HTM
PCEEC: Ann Taylor and Beatrice Santorini (2006) http://www-users.york.ac.uk/~lang22/PCEEC-manual/
PCEEC2 annotation manual: Beatrice Santorini (2022) https://www.ling.upenn.edu:/~beatrice/corpus-ling/annotation-202x
TCEECE: Saario, Lassi and Tanja Säily (2020) POS Tagging the CEECE: A Manual to Accompany the Tagged Corpus of Early English Correspondence (TCEECE). Helsinki: VARIENG. https://varieng.helsinki.fi/CoRD/corpora/CEEC/tceece_doc.html
CEECES: Samuli Kaislaniemi (2022) https://doi.org/10.5281/zenodo.4644243
XML Edition: Saario, Lassi (2020) Conversion of the CEEC-400 into XML: A Manual to Accompany the XML Edition. Helsinki: VARIENG. https://varieng.helsinki.fi/CoRD/corpora/CEEC/xml_doc.html
Compilers
Project leader: Terttu Nevalainen
Senior scholar: Helena Raumolin-Brunberg
CEEC: Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin.
CEECS: Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin.
PCEEC: Jukka Keränen, Minna Nevala, Arja Nurmi and Minna Palander-Collin.
CEECE: Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio.
CEECSU: Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio.
Student assistants
CEEC, CEECS, PCEEC: Kirsi Heikkonen, Alistair Melville-Smith, Taru Nurmi, Arja-Liisa Rossi, Reza Sanatnama, Heli Tissari and Anne Virolainen.
CEECSU, CEECE: Maarit Alanko, Annemieke Bijkerk, Teo Juvonen, Emma Murros, Tuuli Tahko and Eero Timoskainen.
Annotation
PCEEC: Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk and Terttu Nevalainen.
TCEECE: Lassi Saario and Tanja Säily.
File format
The coding system is based on the set of ASCII codes (96 printable characters).
The names of the files follow the published letter editions that form the basis of the corpus structure and are explained in more detail in the manual.
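Because the coding system is ASCII-based, one simple sanity check on a local copy of a corpus file is to look for characters outside the ASCII range. The following is a minimal sketch of such a check, not part of the corpus distribution or its manual; the function name `find_non_ascii` and the sample text are hypothetical and chosen only for illustration.

```python
def find_non_ascii(text):
    """Return (line_number, column, character) for each non-ASCII character.

    Line and column numbers are 1-based. Lines are split on "\n" only, so
    that unusual Unicode line separators are reported rather than swallowed.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # ASCII code points run from 0 to 127
                hits.append((line_no, col, ch))
    return hits


# Hypothetical sample text, not taken from the corpus.
sample = "plain line\nna\u00efve line"
print(find_non_ascii(sample))
```

An empty result suggests the text stays within ASCII. Any reported positions could point to a conversion or encoding problem in the local copy, though how a given release actually encodes special characters should be checked against its manual.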
Availability
ICAME CD-ROM (CEECS)
The Oxford Text Archive (CEECS, PCEEC)
GitHub (PCEEC2)
Zenodo (CEECES 1, CEECES 2, TCEECES)