Basic structure of The Helsinki Corpus

The Helsinki Corpus covers a thousand years of English texts, from the eighth to the beginning of the eighteenth century. It consists of about 1.5 million words, comprising more than four hundred samples of continuous text ranging from 2,000 to 20,000 words. Only shorter texts are included in toto. The Corpus is divided into three sections: Old English, Middle English, and Early Modern English. Each section consists of three or four sub-periods of 70, 80, or 100 years. The chronological division and the size of the parts of the corpus can be seen here. For a more detailed general introduction to The Helsinki Corpus, see Rissanen (2005). See also the Manual of the Corpus (Kytö 1996).

The periods and sub-periods are not of equal size because the number and quality of available texts vary considerably between periods. The text samples were keyed in from printed editions because using manuscripts would have meant years of transcription work in foreign libraries, which would have been financially impossible. For the Old English part of the Corpus, the machine-readable transcript produced by the Dictionary of Old English project was used, after it had been checked against source text editions and manuscripts.

At the beginning of each text sample there is a set of parameter codes giving information on the date, author, dialect, genre and other relevant features of the text. No grammatical or syntactic coding is included. There are, however, later versions of the corpus including this kind of coding.
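As a rough illustration of how such a header might be separated from the running text, the sketch below assumes a hypothetical layout in which each parameter sits on its own line as an angle-bracketed `<CODE value>` pair. The code letters, the values, and the function name `split_header` are invented for the example and are not taken from the Manual, which should be consulted for the actual conventions.

```python
import re

# Hypothetical header layout: one "<CODE value>" pair per line.
# Code letters and values are illustrative only, not the Corpus's real ones.
SAMPLE = """\
<A AUTHOR NAME>
<O 1350-1420>
<D EXAMPLE DIALECT>
<T EXAMPLE GENRE>
First line of the text sample itself.
"""

# A parameter line: "<", a code, whitespace, a value, ">".
HEADER_LINE = re.compile(r"^<(\w+)\s+(.*?)>\s*$")


def split_header(sample: str) -> tuple[dict[str, str], str]:
    """Separate the leading parameter lines from the text that follows."""
    params: dict[str, str] = {}
    lines = sample.splitlines()
    for i, line in enumerate(lines):
        match = HEADER_LINE.match(line)
        if match is None:
            # First non-parameter line: everything from here on is text.
            return params, "\n".join(lines[i:])
        params[match.group(1)] = match.group(2)
    return params, ""


params, body = split_header(SAMPLE)
```

With the parameters collected in a dictionary, samples could then be filtered by date, dialect or genre before any linguistic search is run on the text itself.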

To ensure the representativeness of the selection of the corpus texts, three external factors affecting the choice of variant forms were taken into consideration: (1) Regional variation, i.e. dialects; (2) Sociolinguistic variation, including the author’s gender, age, social background and education; (3) Genre variation, i.e. texts of different types written for different purposes, etc.

Dialects are observed only in the Old and Middle English sections of the corpus. The parameters giving sociolinguistic information become relevant in the late Middle and Early Modern English subsections. The genre or text type distribution of the samples is noted across the Corpus, although even this variation becomes richer and more relevant in later sections. The genres are grouped under six major “prototypical” text categories. Samples of Bible translations from the tenth century to the King James Bible (1611) and of five English versions of Boethius’ De Consolatione Philosophiae from Old English to the late 17th century are included. Texts giving information on forms and constructions typical of spoken language include Late Middle English and Early Modern English drama and private letters and Early Modern English sermons, trial records and the dialogue in fictitious anecdotes and jests. [Reference to Kytö and Rissanen in Rissanen & al. 1993].