Corpus of Early English Correspondence

The Corpus of Early English Correspondence (CEEC) was compiled with historical sociolinguistics in mind.

The original aim was to test how methods developed by sociolinguists studying present-day languages could be applied to historical data. The corpus has yielded promising results, and the research team has found many important links between language change and social variables. Some of these findings are reported in Nevalainen & Raumolin-Brunberg (2003), Laitinen (2007), Nevala (2004), Nurmi (1999), and Palander-Collin (1999). The application of sociolinguistic methods is made possible by an extensive database containing background information about the letter writers. This database is currently being extended to cover information on letter recipients as well. A second database contains information on each letter, including data on, e.g., authenticity, in addition to sender and recipient. An interface that allows combined searches of these databases and facilitates the selection of desired social groups for study has been developed and is currently being tested.
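
As a rough illustration of what a combined search over the two databases might involve, the following Python sketch joins letter records to sender records and filters by social variables. It is only a minimal sketch under stated assumptions: the field names (writer_id, gender, rank, authentic), the sample records, and the select_letters function are all hypothetical and do not reflect the actual CEEC database schema, data, or interface. Authenticity in particular is simplified here to a yes/no flag.

```python
from dataclasses import dataclass

# NOTE: every field name and record below is a hypothetical illustration,
# not the actual CEEC database schema or data.


@dataclass(frozen=True)
class Writer:
    writer_id: str
    gender: str
    rank: str


@dataclass(frozen=True)
class Letter:
    letter_id: str
    sender_id: str
    recipient_id: str
    year: int
    authentic: bool  # simplification: authenticity reduced to a flag


def select_letters(letters, writers, *, gender=None, rank=None, authentic_only=True):
    """Return the letters whose sender matches the requested social variables."""
    # Index the writer database by id so each letter can be joined to its sender.
    by_id = {w.writer_id: w for w in writers}
    selected = []
    for letter in letters:
        sender = by_id.get(letter.sender_id)
        if sender is None:
            continue  # no background information for this sender
        if authentic_only and not letter.authentic:
            continue
        if gender is not None and sender.gender != gender:
            continue
        if rank is not None and sender.rank != rank:
            continue
        selected.append(letter)
    return selected


# Invented sample data, for demonstration only.
writers = [
    Writer("W1", "female", "gentry"),
    Writer("W2", "male", "merchant"),
    Writer("W3", "female", "merchant"),
]
letters = [
    Letter("L1", "W1", "W2", 1620, True),
    Letter("L2", "W2", "W1", 1621, True),
    Letter("L3", "W1", "W2", 1625, False),
    Letter("L4", "W3", "W1", 1630, True),
]

female_letters = select_letters(letters, writers, gender="female")
print([letter.letter_id for letter in female_letters])
```

Keeping the writer information separate from the letter information, as the two databases described above do, means that a writer's background is recorded once and can be combined with any number of letters at search time.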

Description

The Corpus of Early English Correspondence is these days a cover term for a family of corpora. Work on the original Corpus of Early English Correspondence (CEEC) began in 1993, and was completed in 1998. The table below gives the data on the various versions of the corpus. 

Corpus   Time covered   Words             Letters   Writers   Collections   Published
CEEC     1410?-1681     2.7 million       6039      778       96            --
CEECS    1418-1680      0.45 million      1147      194       23            1998
PCEEC    1410?-1681     2.2 million       4979      657       84            2006
CEECE    1681-1800      c. 2.2 million    c. 4900   > 300     74            --
CEECSu   1402-1663      c. 0.44 million   c. 900    > 100     20            --

The first published version of the corpus was the Corpus of Early English Correspondence Sampler (CEECS), which contains a selection of letters from the full corpus. Although the texts included in the sampler were chosen because they were no longer under copyright, the CEECS is a fairly accurate small-scale copy of the full CEEC, giving similar results for many linguistic phenomena. CEECS was published in 1998 and is available through the Oxford Text Archive and ICAME. The manual is also available online.

The next released version of the corpus is the Parsed Corpus of Early English Correspondence (PCEEC), which contains the bulk of the collections included in the original CEEC, 2.2 million words in all. The PCEEC has part-of-speech tagging and syntactic parsing, and is also available in a text-only version. The annotation was carried out in collaboration with Professor Anthony Warner, Dr. Ann Taylor and Dr. Susan Pintzuk at the Department of Language and Linguistic Science of the University of York. The tagging was done by Arja Nurmi, and the parsing by Ann Taylor. PCEEC was published in 2006, and is available through the Oxford Text Archive and ICAME. The manual of annotation is available online and the manual of texts will be made available on the Varieng website.

Since 1998 the corpus has been both extended to cover a longer time period and supplemented with further material for the original timespan. The Corpus of Early English Correspondence Extension (CEECE) picks up from 1681, where the original CEEC ends. Compilation work has been completed, and pilot studies are now being carried out on the corpus by the CEEC team. The Corpus of Early English Correspondence Supplement (CEECSu) represents the other way in which the corpus has been supplemented: by the addition of material covering the time period of the original corpus, partly by using less than ideal editions to strengthen groups poorly represented in the corpus (e.g. women), and partly by including editions published since the completion of CEEC in 1998.

The compilation of these corpora has been a team effort from beginning to end. All members of the CEEC team have participated in the selection of material, including numerous library visits in Finland and abroad. Senior members have assumed overall responsibility for the planning of the corpus and data selection, while junior members have been responsible for scanning, coding and proofreading the data. The teams have consisted of the following members:

CEEC, CEECS, PCEEC: Terttu Nevalainen (leader), Jukka Keränen, Minna Nevala (née Aunio), Arja Nurmi, Minna Palander-Collin, and Helena Raumolin-Brunberg.

CEECE, CEECSu: Terttu Nevalainen (leader), Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Helena Raumolin-Brunberg, Anni Sairio (née Vuorinen), and Tanja Säily.