Basic structure
(Source: the MEG-C manual)
The Middle English Grammar Corpus (MEG-C) is a text corpus consisting of samples of the texts localised in the Linguistic Atlas of Late Mediaeval English (McIntosh, Samuels and Benskin 1986, henceforth LALME). Shorter texts are included in their entirety and longer ones in 3000-word samples. More than a thousand texts were localised in LALME; the corpus will eventually include samples of all those texts that the project team will be able to access. Apart from the LALME texts, it is planned that the corpus will in the future also contain subcorpora of Early Middle English texts (1100–1300) as well as of late mediaeval texts not included in LALME. The LALME texts are, however, the first priority.
The corpus will form the main source material for further work within the Middle English Grammar Project. All the text samples are entered into a database together with information about extralinguistic variables such as date, genre, script type, etc.; each word will be analysed into its spelling components and linked to headwords representing both Present-Day English and the immediate source language before Middle English (e.g. Old English or medieval French).
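The sampling rule and the kind of record described above can be illustrated with a minimal sketch. All names here (`TextSample`, `take_sample`, the field names and the example values) are assumptions made for illustration and do not reflect the actual database schema. The manual also does not say from which part of a longer text the 3000-word sample is drawn; taking it from the beginning is an assumption of this sketch.

```python
from dataclasses import dataclass

SAMPLE_SIZE = 3000  # word limit for samples of longer texts, as stated above


@dataclass
class TextSample:
    """Hypothetical record pairing a sample with extralinguistic variables."""
    words: list[str]
    date: str
    genre: str
    script_type: str


def take_sample(text: str, limit: int = SAMPLE_SIZE) -> list[str]:
    """Return the whole text if it is short, otherwise the first `limit` words.

    Sampling from the start of the text is an assumption of this sketch.
    """
    words = text.split()
    return words if len(words) <= limit else words[:limit]


# Illustrative values only; they describe no actual corpus text.
short_sample = take_sample("in the name of god amen")
long_sample = take_sample("word " * 5000)
record = TextSample(
    words=short_sample,
    date="first half of the fifteenth century",
    genre="legal document",
    script_type="anglicana",
)
print(len(short_sample), len(long_sample))
```

Whitespace splitting is used here only as a stand-in for word division; the corpus itself records word division as it appears in the manuscript.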
Principles of compilation
The selection of texts for MEG-C is, in the first instance, defined by a single external criterion: inclusion as a localized text in LALME. If other texts are included in later versions of the corpus, these will form distinguishable sub-corpora, marked with different code letters and subject to their own criteria of inclusion.
The LALME-based corpus, when finished, will potentially consist of all the texts that are included in the Linguistic Profiles section of LALME, with the exception of two groups: 1) texts localized in Scotland and 2) early texts that fall outside the main chronological span of LALME and are also included in the Linguistic Atlas of Early Middle English (LAEME). The geographical scope of the Middle English Grammar Project is limited to England and Wales. This is not because the Scottish material would be uninteresting, but rather because it is felt to be a whole field of study of its own. For medieval Scots materials, the user is referred to the Linguistic Atlas of Older Scots at the University of Edinburgh.
The chronological scope of the corpus is the same as that of the main LALME material, that is, ca 1325–1500. During the compilation of LALME, a separate survey of earlier materials was not yet envisaged, and thus a small group of thirteenth-century texts was also included, on the grounds that the dialectal material they provided was too important to ignore (LALME I: 3). These texts are not included in the present LALME-based corpus, but will, it is hoped, eventually form part of a subcorpus of Early Middle English texts. Text materials from the earlier period (1150–1325) are now available in the Linguistic Atlas of Early Middle English at the University of Edinburgh.
Apart from these two excluded groups, all texts listed in the LALME Linguistic Profiles section will, as far as possible, be included, whether or not they have been assigned a specific grid reference to the map. Thus, texts simply labelled as ‘Northern’ are included if they are represented by a Linguistic Profile in LALME.
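The inclusion criterion and the two exclusions described above amount to a simple filter, sketched below. The class `LalmeText`, its field names and the example entries are hypothetical and chosen only for this illustration; they do not correspond to any actual LALME or MEG-C data format.

```python
from dataclasses import dataclass


@dataclass
class LalmeText:
    """Hypothetical catalogue entry; field names are illustrative only."""
    label: str
    has_linguistic_profile: bool
    localized_in_scotland: bool
    early_text_in_laeme: bool  # early text outside the main LALME span, also in LAEME
    has_grid_reference: bool


def is_candidate(text: LalmeText) -> bool:
    """Apply the inclusion criterion and the two exclusions described above."""
    if not text.has_linguistic_profile:
        return False
    if text.localized_in_scotland or text.early_text_in_laeme:
        return False
    # A missing grid reference does not exclude a text.
    return True


texts = [
    LalmeText("text A", True, False, False, True),
    LalmeText("text B, labelled only 'Northern'", True, False, False, False),
    LalmeText("text C, Scottish", True, True, False, True),
    LalmeText("text D, thirteenth century", True, False, True, True),
]
candidates = [t.label for t in texts if is_candidate(t)]
print(candidates)
```

The filter yields only candidates for inclusion: whether a candidate actually enters a given version of the corpus also depends on whether the text can be traced and accessed.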
In practice, it is not envisaged that the corpus will ever be able to include every single one of
the texts defined above. Shelf marks and repositories have in some cases changed since the
LALME survey, and some texts have become difficult or impossible to trace; other texts may
simply be difficult or impossible to access. The main principle of compilation for the present
corpus must thus be a relatively flexible one: the corpus seeks to represent as large a
proportion as possible of the texts localized in LALME, excluding the Scottish and Early
Middle English texts.
Within each version of the corpus, it would be desirable to present as full a range as possible
when it comes to the geographical and chronological distribution of texts, as well as the
representation of text genres and script types. The geographical coverage of LALME is in
itself far from even: some areas simply provide more material than others. The distribution in
terms of genres is similarly uneven. However, in terms of both geography and genre
distribution, there is a very wide range, with large groups of texts from many areas and
genres. In terms of chronology and script type, the distribution is more skewed still: the great
majority of LALME texts are dated to the first half of the fifteenth century and are written in
an anglicana script. The interim versions of the corpus should, ideally, reflect these
distributions in LALME, if possible erring on the side of a more even overall coverage (for
example, including a relatively high proportion of non-anglicana scripts). However, as the
selection of texts within the first versions of the corpus is ultimately dependent on the
availability of texts and on the practicalities involved in their transcription and proofreading,
it may fall somewhat short of this ideal.
Texts in version 2009.1
The present version of the corpus contains 320 text samples. Altogether, these contain ca 450 000 words.
The geographical distribution of texts is skewed towards the Northern and Western parts of
England. This reflects the history of the transcription process. When the project work began
in 1998, it was agreed to divide the geographical area into three main regions: the East, West
and North. The Glasgow team took responsibility for the Eastern part, and,
as most of the early transcription work was carried out at Glasgow, the majority of the first
texts to be transcribed were Eastern ones. With project funding at Stavanger, large-scale
transcription of the Western and Northern materials began in 2006. By this point, the
transcription conventions had evolved considerably. Proofreading the later transcriptions has
been easier and faster than proofreading the early ones, and facsimile reproductions against
which to proofread have also been more readily available to the Stavanger team. As a
consequence, the corpus so far contains a relatively low proportion of Eastern texts. For the
next version, it will be a priority to rectify this geographical skewing.
Version 2009.1 contains a somewhat higher proportion of legal documents than the LALME
material at large. This is partly for reasons of availability, but it should also be useful for the
purpose of comparing what may be considered the two main groups of late-medieval texts,
viz. legal documents and ‘literary texts’ (cf. LALME I: 39). Otherwise, the genre distribution
broadly follows that in LALME. There has been no attempt to select texts with any particular
chronological distribution pattern in mind: the texts range from early fourteenth-century ones
to texts dated to around 1500; nearly half the texts are dated to the first half of the fifteenth
century.
Transcription conventions
The transcriptions reproduce the text at what might be called a rich diplomatic level. This includes the following features:
- spelling, distinguishing between 31 letters including the sub-graphemic distinctions
between <i/j> and <u/v>, but not other variant forms such as different forms of <r>,
single and double compartment <a>, and so on
- capitalization
- abbreviations and some final flourishes/otiose strokes
- accents over i’s
- punctuation, using the full stop, semicolon, colon and slash for the following types of
MS punctuation marks: dot, punctus elevatus (with or without a long top stroke) and
virgule
- word division
- line division, initial large capitals and paraphs
- rubrics/headings
- folio or page references
- some corrections and marginal additions, if plausibly contemporary and helpful for
reading the text.
Versions of the corpus
The Middle English Grammar Corpus is published in two different flavours.
a) The first flavour is called MEG-C Base. The files in MEG-C Base are in UTF-8 format. MEG-C Base contains the
transcriptions that reflect manuscript reality most closely, as well as most of the information
and the annotation added by the compilers. This is therefore the version that users of
MEG-C should consult when they need fuller information. The files of this version can either
be viewed on-line or downloaded as a .zip archive.
b) The second flavour of the corpus, MEG-C Html, represents the texts as .html files. This
version is meant for easy browsing and on-screen reading. The differences between
MEG-C Base and MEG-C Html are as follows:
- In MEG-C Html, the default case is lower case. Capital letters are represented in
CAPS, making the coding for them unnecessary, and abbreviations are expanded in
italics.
- Words divided from a line to a new line have been joined silently.
- All scribal and compilatorial coding has been removed; paraphs, underlining,
superscript, deletion etc. are instead represented iconically.
- Compilatorial comments have been kept to a minimum.
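The kind of conversion implied by these differences can be sketched as follows. The input markers used here (`*` before a capital letter, `{...}` around an abbreviation expansion, `=` at the end of a line for a divided word) are placeholders invented for this sketch and should not be read as the actual MEG-C Base coding, which is not described in this section. The function name `base_to_html` is likewise hypothetical.

```python
import re

# Placeholder markers for this sketch only (NOT the documented MEG-C Base coding):
#   "*x"    -> the letter x is a capital in the manuscript
#   "{...}" -> the enclosed letters expand an abbreviation
#   "="     -> at line end, the word continues on the next line


def base_to_html(lines: list[str]) -> str:
    """Convert placeholder-coded lines into a simple reading text."""
    # Silently join words divided across a line break.
    text = "\n".join(lines).replace("=\n", "")
    # Render coded capitals as actual capital letters.
    text = re.sub(r"\*(\w)", lambda m: m.group(1).upper(), text)
    # Render abbreviation expansions in italics.
    text = re.sub(r"\{([^}]*)\}", r"<i>\1</i>", text)
    return text


print(base_to_html(["*in the name of god a=", "men {and} so forth"]))
```

A real conversion would also need to escape HTML special characters and handle the iconic rendering of paraphs, underlining, superscript and deletion, all of which this sketch leaves out.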
In Version 1.1, there will be links to Catalogue entries from the corpus file headers.
MEG-C Html is also available as .pdf files, the links to which are found beneath each
corresponding .html link. These files can be viewed on-line, and they are also available in a
.zip archive intended for downloading.