Corpus of Late Modern English Texts (extended version)

The CLMETEV represents formal, written, British English for the period 1710–1920. It consists entirely of public domain texts that are available through various online archiving projects (Project Gutenberg, The Oxford Text Archive, the Victorian Women Writers Project).

Project leader: Hendrik De Smet
Time of compilation: 2003–2006
Size: 15 million words
Language: English
Domain and genre: multi-genre
Number of texts/samples: 176
Period: 1710–1920
Released: 2006

Reference line and copyright

The Corpus of Late Modern English Texts (Extended Version). 2006. Compiled by Hendrik De Smet. Department of Linguistics, University of Leuven.


The CLMETEV has no manual. However, the corpus comes with an index file that specifies for each text (i) its author, (ii) the author's year of birth, (iii) the year of publication of the text, (iv) whether the text has been sampled or included as a whole, and (v) the size of the text in number of words. Additionally, the basic structure of (an earlier version of) the corpus, as well as some of its strengths and weaknesses, have been discussed in De Smet, H. 2005. A corpus of Late Modern English texts. ICAME-Journal 29: 69-82 (available online A brief description of the corpus is also found at


Initiated as part of the compiler's MA thesis, the compilation of CLMETEV was completed in stages over the course of two subsequent research projects, first on a grant from the Research Fund of the University of Leuven (OT/2003/20/TBA) and later on a PhD grant from the Research Foundation - Flanders.


The CLMETEV is freely available to all interested. The corpus can be downloaded at A user-id and password are required, which can be obtained by contacting the compiler. Users are invited to stay in touch, to provide feedback, or to let the compiler know about publications based on the corpus.

Associated projects

CEN (Corpus of English Novels) is a corpus of novels by 25 British and North American authors. Like the CLMETEV, it is based on public domain texts. It contains some 25 million words of text and covers the period 1881-1922. The corpus is available for download at

CEMET (Corpus of Early Modern English Texts) has been constructed as a precursor to the CLMETEV, covering the period 1640-1710, and is likewise based on public domain texts. It contains about 2 million words of text. Because its genre balance and the reliability of the editions used are inferior to those of the CLMETEV, the corpus can presently only be obtained on special request.


CoRD Entry submitted on January 6, 2009 by Hendrik De Smet, Department of Linguistics, University of Leuven.
Data for the CoRD entry was edited by Hendrik De Smet.