Basic structure of the corpus
The CLMETEV contains three sub-periods of equal length:
I) 1710–1780: 3 million words, 32 texts by 23 different authors;
II) 1780–1850: 5.7 million words, 64 texts by 46 different authors;
III) 1850–1920: 6.3 million words, 80 texts by 51 different authors.
The authors contributing to a sub-period were born within a corresponding time-span, as follows:
I) Texts published in 1710–1780, by authors born between 1680 and 1750;
II) Texts published in 1780–1850, by authors born between 1750 and 1820;
III) Texts published in 1850–1920, by authors born between 1820 and 1890.
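The pairing of publication and birth windows listed above can be illustrated with a short sketch. It is not part of the corpus tooling: the names SUB_PERIODS, BIRTH_OFFSET and assign_sub_period are hypothetical, and the treatment of boundary years (1780 and 1850 appear in two windows, and are resolved here to the earlier sub-period) is an assumption rather than a documented rule.

```python
from typing import Optional

# Publication windows as listed above: (label, first year, last year).
SUB_PERIODS = [("I", 1710, 1780), ("II", 1780, 1850), ("III", 1850, 1920)]

# Each birth window starts and ends 30 years before its publication window
# (e.g. 1680-1750 for texts published in 1710-1780).
BIRTH_OFFSET = 30


def assign_sub_period(pub_year: int, birth_year: int) -> Optional[str]:
    """Return the sub-period label for a text, or None if it does not qualify.

    Assumption: boundary years are inclusive, so a year shared by two
    windows resolves to the earlier sub-period.
    """
    for label, start, end in SUB_PERIODS:
        published_in_window = start <= pub_year <= end
        born_in_window = start - BIRTH_OFFSET <= birth_year <= end - BIRTH_OFFSET
        if published_in_window and born_in_window:
            return label
    return None


if __name__ == "__main__":
    print(assign_sub_period(1749, 1707))  # publication and birth both in window I
    print(assign_sub_period(1860, 1812))  # published in window III, born too early
```

Under these assumptions, a text published in 1860 by an author born in 1812 matches no sub-period: the publication year falls in III, but the birth year falls in the window for II.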
No author contributes more than 200,000 words of text to the corpus. Where possible,
different texts for a given author were sampled rather than including a single long
text. With respect to sociolinguistic and genre coverage, because of biases in the
source material, the typical text in the corpus is a novel written by an adult, literate,
upper-class male. Nevertheless, care has been taken to include as many texts as possible
that deviate from the inevitable standard, and to favour those texts over texts that
answer to the standard description. Consequently, the corpus also represents, to some
extent, women's fiction, formal non-fictional texts (history, religion, science), and
more or less informal non-fiction writing (letters, diaries, though some are clearly
meant to be literate). No systematic attempt has been made, however, to create a fully
balanced corpus with gender/genre/...-matching between sub-periods. As a result,
the corpus represents a variety of text types and goes some way toward redressing
possible biases, but it is still not suited to the study of variation itself.