A Corpus of late Modern English Prose

A corpus of informal private letters by British writers, covering the period 1861 to 1919.
All decades in that range are represented: four by about 20,000 words of text each, 1880-89 by only about 6,000, and 1890-99 by about 13,000.
The writers' birth dates, however, span a narrower range: 1837-67.
Corpus constructed 1992-1994 by David Denison with the very considerable assistance of Graeme Trousdale and Linda van Bergen.

Project leader: David Denison
Time of compilation: 1992-1994
Size: approximately 100,000 words
Language: LModE
Period: 1861 to 1919
Released: 1994
Project home page: http://personalpages.manchester.ac.uk/staff/david.denison/lmode_prose.html

Manual

Denison, David. 1994. A corpus of late Modern English prose. In Merja Kytö, Matti Rissanen & Susan Wright (eds.), Corpora across the centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, St Catharine's College Cambridge, 25-27 March 1993 (Language and Computers - Studies in Practical Linguistics 11), 7-16. Amsterdam and Atlanta GA: Rodopi.

For further information on the coding system see:

Kytö, Merja. 1994. Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts, 2nd edn. Helsinki: Helsinki University Press for Department of English, University of Helsinki.

Selection of editions based on the list in:

Nevalainen, Terttu. 1991. BUT, ONLY, JUST: Focusing adverbial change in Modern English 1500-1900 (Mémoires de la Société Néophilologique de Helsinki 51). Helsinki: Société Néophilologique, pp.109-11.

Compilers

David Denison with the assistance of Linda van Bergen and Graeme Trousdale.

Availability

The Corpus has been lodged with the Oxford Text Archive. Scholars can also obtain it on request from David Denison, who will mail two versions in a zip file:

the 7-file "plain" version
a 1-file WordCruncher-indexed version with associated files (see below).

There is also a README file and a file of abbreviations and non-standard spellings.

Technical information

The plain version of the text is stored in 7 files totalling a little under 600 KB. The files are extended (8-bit) ASCII, and the text is coded as far as possible according to the conventions used in the Helsinki Corpus, that is, with COCOA-style brackets giving information on writer, recipient, relationship, date, genre, page, etc., enclosed within carets. Two subperiods are identified: items dated 1860-1889 are coded as L86 and 1890-1919 as L89. Note, though, that the "social" information (on relationship, social status, degree of formality, etc.) is not complete and is often deliberately underspecified. Almost all such caret brackets are on separate lines and start in column 1, apart from embedded editorial comments on, e.g., cancelled text.
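Because almost all reference codes sit on their own lines starting in column 1, they can be separated from the running text quite mechanically. The following is a minimal Python sketch of that idea, not part of the corpus distribution. It assumes that "carets" means angle brackets of the form <KEY value>, as in the Helsinki Corpus, and the sample code lines are hypothetical, invented purely for illustration. Reading the real files would also require choosing an 8-bit encoding, which is not specified here.

```python
import re

# Assumed shape of a COCOA-style reference line: "<KEY value>" alone on a
# line, starting in column 1. This pattern is an assumption, not a spec.
COCOA_LINE = re.compile(r"^<([^\s>]+)(?:\s+(.*?))?>\s*$")

def split_cocoa(lines):
    """Separate COCOA-style reference lines from running text.

    Returns (codes, text): codes is a list of (key, value) pairs,
    text is the list of lines that are not stand-alone code lines.
    Embedded editorial comments inside a text line are left untouched.
    """
    codes, text = [], []
    for line in lines:
        match = COCOA_LINE.match(line)
        if match:
            codes.append((match.group(1), match.group(2) or ""))
        else:
            text.append(line)
    return codes, text

# Hypothetical sample; these code lines are NOT taken from the corpus.
sample = [
    "<Q L86 EXAMPLE>",
    "<A HYPOTHETICAL WRITER>",
    "My dear friend, I write in haste <cancelled word> to thank you.",
]
codes, text = split_cocoa(sample)
```

Here codes holds the two reference lines as key/value pairs, while the letter text, including its embedded editorial comment, stays in text.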

Probably more convenient is the version prepared for the now-obsolete WordCruncher for DOS, which is nevertheless usable with any concordance software. The single 600 KB text file LMODEPRS.BYB has identical text and coding to the plain version, with additional marking of sentence and page boundaries. (Page boundary markers are always made to coincide with a sentence boundary.) Sentence boundaries are marked with |s (vertical bar + s), page boundaries with |p, and books (a WordCruncher concept) with |b.
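For users without WordCruncher, the boundary markers can still be exploited with a few lines of code. The sketch below, in Python, counts the three marker types and splits a string into sentences at |s. It assumes the markers occur literally as the two-character sequences |s, |p and |b with nothing else attached; the sample string is invented, and the real file may place the markers differently.

```python
import re

# Matches any of the three boundary markers described above.
MARKER = re.compile(r"\|([spb])")

def count_markers(text):
    """Count sentence (|s), page (|p) and book (|b) markers in text."""
    counts = {"s": 0, "p": 0, "b": 0}
    for match in MARKER.finditer(text):
        counts[match.group(1)] += 1
    return counts

def split_sentences(text):
    """Split text at |s markers after removing |p and |b markers."""
    without_page_and_book = re.sub(r"\|[pb]", "", text)
    pieces = without_page_and_book.split("|s")
    return [piece.strip() for piece in pieces if piece.strip()]

# Invented sample, not taken from the corpus.
sample = "|b|p|sI arrived on Tuesday. |sThe weather was fine. |p|sWe leave tomorrow."
marker_counts = count_markers(sample)
sentences = split_sentences(sample)
```

Since the splitting drops empty pieces, it behaves the same whether a marker precedes or follows the unit it delimits.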

The WordCruncher software preindexed a text for rapid search and retrieval. In this version of the corpus the spelling and text files are concatenated in such a way that users of WordCruncher Viewer could search either the whole corpus or, by the use of Bookmarks (which we have preset), one of the following subsets:

material from any one of the five editions
letters only, excluding journal entries

The lists of abbreviations/spellings can be referred to from within WordCruncher. They are not indexed, nor is editorial and reference coding within <...> or [...] brackets. The text file, LMODEPRS.BYB, is supplied with associated WordCruncher index and other files.
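A comparable exclusion can be approximated outside WordCruncher by removing bracketed coding before counting or searching words. The following Python sketch is one way to do this under stated assumptions: that coding never nests, that each bracket closes on the same line, and that simple whitespace splitting is an adequate notion of "word". It does not reproduce WordCruncher's own indexing rules, and the sample line is invented.

```python
import re

# Non-nested <...> or [...] spans; nesting is assumed not to occur.
BRACKETED = re.compile(r"<[^<>]*>|\[[^\[\]]*\]")

def strip_coding(text):
    """Remove <...> and [...] coding and normalise whitespace."""
    without_coding = BRACKETED.sub(" ", text)
    return re.sub(r"\s+", " ", without_coding).strip()

def count_words(text):
    """Count whitespace-separated words outside bracketed coding."""
    return len(strip_coding(text).split())

# Invented sample, not taken from the corpus.
sample = "<P 12> I have [sic] been <cancelled: very> busy."
cleaned = strip_coding(sample)
n_words = count_words(sample)
```

Replacing each bracketed span with a space rather than an empty string avoids fusing the words on either side of it.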