Basic structure

(Source: the MEG-C manual)

The Middle English Grammar Corpus (MEG-C) is a text corpus consisting of samples of the texts localised in the Linguistic Atlas of Late Mediaeval English (McIntosh, Samuels and Benskin 1986, henceforth LALME). Shorter texts are included in their entirety and longer ones in 3000-word samples. More than a thousand texts were localised in LALME; the corpus will eventually include samples of all those texts that the project team is able to access. Apart from the LALME texts, the corpus is planned to include, in future, subcorpora of Early Middle English texts (1100–1300) and of late mediaeval texts not included in LALME. The LALME texts are, however, the first priority.
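The sampling rule can be illustrated with a minimal sketch. It is not the project's own procedure: the function name, the whitespace-based word count and the choice to take the sample from the beginning of the text are all assumptions made for illustration, as the manual states only the 3000-word limit.

```python
from typing import List

SAMPLE_LIMIT = 3000  # sample size stated in the manual


def take_sample(text: str, limit: int = SAMPLE_LIMIT) -> List[str]:
    """Return the whole text if it is short, otherwise a sample of `limit` words.

    Assumptions (not from the manual): words are delimited by whitespace,
    and the sample is taken from the start of the text.
    """
    words = text.split()
    if len(words) <= limit:
        return words          # shorter texts are included in their entirety
    return words[:limit]      # longer texts are reduced to a 3000-word sample
```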

The corpus will form the main source material for further work within the Middle English Grammar Project. All the text samples are entered into a database together with information about extralinguistic variables such as date, genre, script type, etc.; each word will be analysed into its spelling components and linked to headwords representing both Present-Day English and the immediate source language before Middle English (e.g. Old English or medieval French).
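The kind of record described here might be modelled roughly as follows. This is only a sketch: the class names, field names, the sample code "X0001" and the segmentation of the example word are hypothetical and do not reflect the project's actual database schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Headwords:
    """Links from a word to its two reference headwords."""
    present_day_english: str   # e.g. "knight"
    source_form: str           # form in the immediate source language
    source_language: str       # e.g. "Old English" or "medieval French"


@dataclass
class WordRecord:
    """One word of a sample, analysed into spelling components."""
    manuscript_form: str
    spelling_components: List[str]
    headwords: Headwords


@dataclass
class TextSample:
    """A text sample with its extralinguistic variables."""
    code: str                        # hypothetical identifier
    date: Optional[str] = None
    genre: Optional[str] = None
    script_type: Optional[str] = None
    words: List[WordRecord] = field(default_factory=list)


# Illustrative record; the segmentation into components is invented.
sample = TextSample(code="X0001", genre="legal document", script_type="anglicana")
sample.words.append(
    WordRecord(
        manuscript_form="knyght",
        spelling_components=["kn", "y", "ght"],
        headwords=Headwords("knight", "cniht", "Old English"),
    )
)
```

Keeping the headword links on each word record is one possible design; it would allow spellings to be compared across samples by looking up either the Present-Day English or the source-language headword.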

Principles of compilation

The selection of texts for MEG-C is, in the first instance, defined by a single external criterion: inclusion as a localized text in LALME. If other texts are included in later versions of the corpus, these will form distinguishable sub-corpora, marked with different code letters and subject to their own criteria of inclusion.

The LALME-based corpus, when finished, will potentially consist of all the texts that are included in the Linguistic Profiles section of LALME, with the exception of two groups: 1) texts localized in Scotland and 2) early texts that fall outside the main chronological span of LALME and are also included in the Linguistic Atlas of Early Middle English (LAEME). The geographical scope of the Middle English Grammar Project is limited to England and Wales. This is not because the Scottish material would be uninteresting, but rather because it is felt to be a whole field of study of its own. For medieval Scots materials, the user is referred to the Linguistic Atlas of Older Scots at the University of Edinburgh.

The chronological scope of the corpus is the same as that of the main LALME material, that is, ca 1325–1500. During the compilation of LALME, a separate survey of earlier materials was not yet envisaged, and thus a small group of thirteenth-century texts was also included, on the grounds that the dialectal material they provided was too important to ignore (LALME I: 3). These texts are not included in the present LALME-based corpus, but will, it is hoped, eventually form part of a subcorpus of Early Middle English texts. Text materials from the earlier period (1150–1325) are now available in the Linguistic Atlas of Early Middle English at the University of Edinburgh.

Apart from these two excluded groups, all texts listed in the LALME Linguistic Profiles section will, as far as possible, be included, whether or not they have been assigned a specific grid reference to the map. Thus, texts simply labelled as ‘Northern’ are included if they are represented by a Linguistic Profile in LALME.

In practice, it is not envisaged that the corpus will ever be able to include every single one of the texts defined above. Shelf marks and repositories have in some cases changed since the LALME survey, and some texts have become difficult or impossible to trace; other texts may simply be difficult or impossible to access. The main principle of compilation for the present corpus must thus be a relatively flexible one: the corpus seeks to represent as large a proportion as possible of the texts localized in LALME, excluding the Scottish and Early Middle English texts.
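The criteria set out in this section can be summarised as a simple filter. The sketch below is illustrative only: the record type, its field names and the example labels are assumptions, not part of LALME or of the project's own tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LalmeText:
    """Hypothetical record for a text listed in LALME."""
    label: str
    has_linguistic_profile: bool
    localized_in_scotland: bool = False
    early_text_in_laeme: bool = False   # early text also covered by LAEME
    has_grid_reference: bool = True     # deliberately not used as a criterion
    accessible: bool = True             # can the team trace and access it?


def is_eligible(text: LalmeText) -> bool:
    """Apply the external criterion and the two exclusions."""
    return (
        text.has_linguistic_profile
        and not text.localized_in_scotland
        and not text.early_text_in_laeme
    )


def select_for_corpus(texts: List[LalmeText]) -> List[LalmeText]:
    """Eligible texts that can in practice be included."""
    return [t for t in texts if is_eligible(t) and t.accessible]


candidates = [
    LalmeText("text with grid reference", True),
    LalmeText("text labelled 'Northern' only", True, has_grid_reference=False),
    LalmeText("Scottish text", True, localized_in_scotland=True),
    LalmeText("thirteenth-century text", True, early_text_in_laeme=True),
    LalmeText("untraceable text", True, accessible=False),
]
selected = [t.label for t in select_for_corpus(candidates)]
```

Here `selected` holds the first two labels only: the text without a grid reference is kept because it has a Linguistic Profile, while the Scottish, early and untraceable texts drop out.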

Within each version of the corpus, it would be desirable to cover as full a range as possible of geographical areas, dates, text genres and script types. The geographical coverage of LALME is in itself far from even: some areas simply provide more material than others. The distribution in terms of genres is similarly uneven. However, in terms of both geography and genre distribution, there is a very wide range, with large groups of texts from many areas and genres. In terms of chronology and script type, the distribution is more skewed still: the great majority of LALME texts are dated to the first half of the fifteenth century and are written in an anglicana script. The interim versions of the corpus should, ideally, reflect these distributions in LALME, if possible erring on the side of a more even overall coverage (for example, including a relatively high proportion of non-anglicana scripts). However, as the selection of texts within the first versions of the corpus is ultimately dependent on the availability of texts and on the practicalities involved in their transcription and proofreading, it may fall somewhat short of this ideal.

Texts in version 2009.1

The present version of the corpus contains 320 text samples. Altogether, these contain ca 450 000 words.

The geographical distribution of texts is skewed towards the Northern and Western parts of England. This reflects the history of the transcription process. When the project work began in 1998, it was agreed to divide the geographical area into three main regions: the East, West and North. The Glasgow team took responsibility for the Eastern part, and, as most of the early transcription work was carried out at Glasgow, the majority of the first texts to be transcribed were Eastern ones. With project funding at Stavanger, large-scale transcription of the Western and Northern materials began in 2006. At this point, the transcription conventions had evolved considerably. Proofreading the later transcriptions has been easier and faster than proofreading the early ones, and facsimile reproductions against which to proofread have also been more readily available to the Stavanger team. As a consequence, the corpus so far contains a relatively low proportion of Eastern texts. For the next version, it will be a priority to rectify this geographical skewing.

Version 2009.1 contains a somewhat higher proportion of legal documents than the LALME material at large. This is partly for reasons of availability, but it should also be useful for the purpose of comparing what may be considered the two main groups of late-medieval texts, viz. legal documents and ‘literary texts’ (cf. LALME I: 39). Otherwise, the genre distribution broadly follows that in LALME. There has been no attempt to select texts with any particular chronological distribution pattern in mind: the texts range from early fourteenth-century ones to texts dated to around 1500; nearly half the texts are dated to the first half of the fifteenth century.

Transcription conventions

The transcriptions reproduce the text at what might be called a rich diplomatic level. This includes the following features:

spelling, distinguishing between 31 letters including the sub-graphemic distinctions between <i/j> and <u/v>, but not other variant forms such as different forms of <r>, single- and double-compartment <a>, and so on
capitalization
abbreviations and some final flourishes/otiose strokes
accents over i’s
punctuation, using the full stop, semicolon, colon and slash for the following types of MS punctuation marks: dot, punctus elevatus (with or without a long top stroke) and virgule
word division
line division, initial large capitals and paraphs
rubrics/headings
folio or page references
some corrections and marginal additions, if plausibly contemporary and helpful for reading the text

Versions of the corpus

The Middle English Grammar Corpus is published in two different flavours.

a) The first flavour is called MEG-C Base. The files in MEG-C Base are in UTF-8 format. MEG-C Base contains the transcriptions that reflect manuscript reality most closely, as well as most of the information and the annotation added by the compilers. Thus this version is the one that the users of MEG-C should consult when in need of more information. The files of this version can either be viewed on-line or downloaded as a .zip archive.

b) The second flavour of the corpus, MEG-C Html, represents the texts as .html files. This version is meant for easy browsing and for reading the texts on screen. The differences between MEG-C Base and MEG-C Html are as follows:

In MEG-C Html, the default case is lower case. Capital letters are represented in CAPS, making the coding for them unnecessary, and abbreviations are expanded in italics.
Words divided from a line to a new line have been joined silently.
All scribal and compilatorial coding has been deleted, so that paraphs, underlining, superscript, deletion etc. are represented iconically.
Compilatorial comments have been kept to a minimum.

In Version 1.1, there will be links to Catalogue entries from the corpus file headers.

MEG-C Html is also available as .pdf files, the links to which are found beneath each corresponding .html link. These files can be viewed on-line, and they are also available in a .zip archive intended for downloading.