Basic structure of the corpus

The CLMETEV contains three sub-periods of equal length:

I) 1710–1780: 3 million words, 32 texts by 23 different authors;
II) 1780–1850: 5.7 million words, 64 texts by 46 different authors;
III) 1850–1920: 6.3 million words, 80 texts by 51 different authors.
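
The relative weight of the three sub-periods follows directly from the figures listed above. The sketch below is only an illustration: it treats the rounded word counts as exact, and the names `SUB_PERIODS` and `share_of_corpus` are hypothetical, not part of the corpus distribution.

```python
# Approximate figures as listed above: (label, words, texts).
# The word counts are rounded, so every derived value is approximate too.
SUB_PERIODS = [
    ("I (1710-1780)", 3_000_000, 32),
    ("II (1780-1850)", 5_700_000, 64),
    ("III (1850-1920)", 6_300_000, 80),
]

total_words = sum(words for _, words, _ in SUB_PERIODS)
total_texts = sum(texts for _, _, texts in SUB_PERIODS)


def share_of_corpus(words: int) -> float:
    """Percentage of the whole corpus, rounded to one decimal place."""
    return round(100 * words / total_words, 1)


for label, words, texts in SUB_PERIODS:
    print(f"{label}: {share_of_corpus(words)}% of words, "
          f"about {words // texts:,} words per text")
print(f"Total: about {total_words:,} words in {total_texts} texts")
```

Author counts are deliberately not summed here, since the listing gives them per sub-period only.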

[Figure: pie chart]

The authors contributing to a sub-period were born within a corresponding time-span, as follows:

I) Texts published in 1710–1780, by authors born between 1680 and 1750;
II) Texts published in 1780–1850, by authors born between 1750 and 1820;
III) Texts published in 1850–1920, by authors born between 1820 and 1890.
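
The two conditions above (publication year and author's year of birth) can be read as a simple membership test. The following is a minimal sketch, not the compilers' actual procedure. The function name `sub_period` and the constant names are hypothetical. Because the listed spans share their boundary years (1780, 1850), the sketch assumes inclusive ranges and assigns a text that fits two sub-periods to the earlier one; the source does not say how such cases were handled.

```python
from typing import Optional

# (label, first publication year, last publication year), as listed above.
SUB_PERIODS = [("I", 1710, 1780), ("II", 1780, 1850), ("III", 1850, 1920)]

# In the listing, each birth span is the publication span shifted back 30 years
# (1710 -> 1680, 1780 -> 1750, and so on).
BIRTH_OFFSET = 30


def sub_period(pub_year: int, birth_year: int) -> Optional[str]:
    """Return the sub-period label for a text, or None if it fits none.

    Assumption: ranges are inclusive at both ends, and the first (earliest)
    matching sub-period wins when a boundary year fits two of them.
    """
    for label, start, end in SUB_PERIODS:
        published_ok = start <= pub_year <= end
        born_ok = start - BIRTH_OFFSET <= birth_year <= end - BIRTH_OFFSET
        if published_ok and born_ok:
            return label
    return None
```

Under these assumptions, a hypothetical text published in 1813 by an author born in 1775 falls in sub-period II, while one published in 1860 by an author born in 1812 fits no sub-period, because the author was born too early for sub-period III.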

No author contributes more than 200,000 words of text to the corpus. Where possible, several different texts by a given author were sampled rather than a single long text. With respect to sociolinguistic and genre coverage, biases in the source material mean that the typical text in the corpus is a novel written by an adult, literate, higher-class male. Nevertheless, care has been taken to include as many texts as possible that deviate from this inevitable standard, and to favour those texts over texts that fit the standard description. Consequently, the corpus also represents, to some extent, women's fiction, formal non-fiction (history, religion, science), and more or less informal non-fiction (letters and diaries, though some of these were clearly meant to be literary). No systematic attempt has been made, however, to create a fully balanced corpus with gender/genre/...-matching between sub-periods. As a result, the corpus does represent a variety of text types and goes some way toward redressing possible biases, but it is still not suited to the study of variation itself.