The Old Bailey Proceedings, 1674-1834
Evaluating and annotating a corpus of 18th- and 19th-century spoken English

Magnus Huber, Department of English, University of Giessen

1. Introduction

In their search for records of authentic spoken language predating the invention of audio recording technology in the second half of the 19th century, historical linguists have turned to written genres that are believed to be closer to speech than the average written document. Such genres include drama, dialogue in prose fiction, sermons, and trial proceedings.

The proceedings of the Old Bailey, London's central criminal court, were published from 1674 to 1834 and constitute a large body of texts from the beginning of Late Modern English. The Proceedings contain over 100,000 trials, totalling ca. 52 million words, and their verbatim passages are arguably as near as we can get to the spoken word of the period. The material thus offers the rare opportunity of analyzing everyday language in a period that has been neglected both with regard to the compilation of primary linguistic data and the description of the structure, variability, and change of English.

The time span covered by the corpus and the available sociobiographical speaker information allow fine-tuned studies, including historical sociolinguistic approaches. In addition, because of their sheer size the Proceedings are a valuable textual source for the analysis of low-frequency features. For instance, an analysis of the present and past tense forms of the ten most frequent verbs in the Proceedings (know, go, see, say, take, live, come, give, get, tell) shows that overt inflection in the first person singular (e.g. I says, mostly with past reference) has a very low frequency of just over 0.1%. Analysing such marginal phenomena is impossible with most existing historical corpora, which run to a couple of million words at most. By contrast, since the total number of 1sg forms of these ten verbs amounts to over half a million tokens in the Proceedings, 0.1% corresponds to 547 tokens of inflected 1sg forms, enough for a basic multivariate analysis.

The Old Bailey Corpus (OBC) is based on the Proceedings and documents spoken English from the 1720s onward. Digitized transcripts of the Old Bailey Proceedings were obtained from Robert Shoemaker (Department of History, Humanities Research Institute, University of Sheffield) and Tim Hitchcock (Department of History and Social Sciences, University of Hertfordshire). Turning the digitized transcripts into a linguistic corpus consists of two main stages:

(1) localization and tagging of direct speech in the 52 million word pool corpus, and
(2) sociolinguistic mark-up based on sociobiographical speaker data found in the context.

This article will start with an overview of the historical background and the structure of the Proceedings, the 'raw' material that is being turned into the OBC. Section 3 will assess the reliability of the OBC as a linguistic source, relying on linguistic and extra-linguistic evidence. These sections demonstrate that the annotation of early trial proceedings has to be preceded by a consideration of the complex issues surrounding the genesis and the conventions of this genre. The annotation process in compiling the OBC will be described in Section 4.

2. The Old Bailey and the Proceedings

London's Central Criminal Court is still known as the Old Bailey, after the street near St. Paul's Cathedral where the courthouse is located. Its original jurisdiction was London and Middlesex, and up to the early 19th century the court met eight times a year. In 1834 the Old Bailey was renamed the Central Criminal Court, its jurisdiction was enlarged to include parts of the neighbouring counties, and the number of yearly sessions was increased to twelve. Most of the people tried at the Old Bailey came from the capital, although there are also cases of prosecutors, defendants and witnesses from other parts of the country or even from abroad. [1]

The first published accounts of trials at the Old Bailey are from 1674 and were entitled News from Newgate: OR, An Exact and true Accompt of the most Remarkable, TRYALS OF Several Notorious Malefactors: At the Gaol delivery of Newgate, holden for the CITY of LONDON, and COUNTY of MIDDLESEX. In the Old Baily … (16740429). [2] Towards the end of that year, the title was changed to News from the Sessions, OR, A True Relation of all the PROCEEDINGS AT THE Sessions in the Old bayly … (16740909) and the reports continued to be published as the Proceedings until 1834, the end of the period considered in this article.

The Proceedings started as a commercial enterprise. True crime sold well, so publishers sent scribes to the Old Bailey, who recorded the trials in shorthand. Because of their sensationalist stance, early Proceedings are rather judgmental. Gradually, however, the City of London gained more and more control over the publications. As early as 1679, the Court of Aldermen in London ordered that accounts of proceedings at the Old Bailey could only be published with the approval of the Lord Mayor and the other justices, and accordingly the tone becomes more objective. Over the 18th century, the Proceedings develop into official records of the trials at the Old Bailey. In 1778 the City stipulated that they should represent a 'true, fair, and perfect narrative' of what happened in court and from 1787 it paid a subsidy to the publisher to ensure continued publication. Because the Proceedings had come to be considered official records, the publisher was required to supply 320 free copies to the City.

2.1 The Old Bailey Proceedings Online

The 1,219 surviving Proceedings published between 1674 and 1834 were digitized by the Humanities Research Institute, University of Sheffield, and the Higher Education Digitisation Service, University of Hertfordshire, directed by social historians Robert Shoemaker and Tim Hitchcock. The material can be searched and accessed at the Old Bailey Proceedings Online and contains over 100,000 trials, totalling ca. 52 million words. The website also contains images of the original pages of the Proceedings. [3]

The Old Bailey Proceedings Online is an extremely valuable resource. The site contains helpful background information on the publication history of the Proceedings as well as on the historical background of crime and justice in London from the late 17th to the early 19th centuries. Most importantly, it provides access to the unique collection of primary sources through a sophisticated search engine. The Proceedings can be searched by keyword, person name, place, crime, verdict and punishment. There is also an advanced option that combines several parameters for more complex searches.

However, the Old Bailey Proceedings Online was not created for the needs of linguists. While some degree of linguistic analysis is possible, the possibilities to search for high-frequency functional morphemes — a major interest in corpus linguistics — are rather restricted. Since the online Proceedings contain over 50 million words, the search engine initiates a search by consulting Lucene indices rather than performing a full-text search. This makes online searches very quick and returns hundreds to thousands of results in a fraction of a second. These are displayed in the manner of a concordance with links to the individual trials containing the search word.

As mentioned, there are a number of disadvantages for the linguist. First of all, the concordance-like list of hits cannot be manipulated in any way: the search word is not centered, there is only one concordance entry per trial even though the trial may contain several hits, and there is no way of sorting or deleting entries. While this is awkward for linguistic purposes, the most significant drawback is that search words are not allowed to begin with wild cards, which makes searching for inflections impossible. In addition, in order to speed up retrieval, the function words a, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with are not included in the indices and thus cannot be searched for. For the historical sociolinguist, there is the further restriction that the advanced search function does not include a keyword search. To take an example: while it is possible to retrieve all trials that include female defendants between the ages of 20 and 30 (in order to e.g. compare them to other age groups), one cannot search for female defendants between 20 and 30 who use, for example, don't (as opposed to do not). The latter would have to be done manually, by copying the trials (and there are 3,453 in this particular case!) and performing the text search in a different program. One can limit searches to particular parts of the Proceedings (front matter, trials, punishment summaries, etc.), but not to spoken language alone, since that was not tagged in the electronic version of the Proceedings.
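Once trial texts are held locally, restrictions of this kind no longer apply. The following is a minimal sketch and not the procedure used for the OBC: it assumes that the trials are available as plain strings, and the function name and the sample sentences are invented for illustration. It counts contracted and uncontracted negation with do and runs a search that is equivalent to a search word beginning with a wild card.

```python
import re

# Contracted vs. uncontracted negation with "do" (illustrative patterns only).
CONTRACTED = re.compile(r"\bdon't\b", re.IGNORECASE)
UNCONTRACTED = re.compile(r"\bdo not\b", re.IGNORECASE)
# Equivalent of a search word that begins with a wild card: any word in -eth.
ETH_FORMS = re.compile(r"\b\w+eth\b", re.IGNORECASE)


def negation_counts(trials):
    """Return (contracted, uncontracted) token counts over an iterable of texts."""
    contracted = sum(len(CONTRACTED.findall(t)) for t in trials)
    uncontracted = sum(len(UNCONTRACTED.findall(t)) for t in trials)
    return contracted, uncontracted


# Invented sample sentences, not quotations from the Proceedings.
sample = [
    "I don't know the Prisoner. I do not know where he lives.",
    "He did not see it; I don't believe him.",
]
eth_hits = ETH_FORMS.findall("He cometh and goeth as he pleaseth.")

print(negation_counts(sample))
print(eth_hits)
```

A search of this kind presupposes that the relevant trials have already been selected by speaker parameters, which is exactly what the sociolinguistic mark-up is meant to make possible.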

2.2 Sociobiographical characteristics of speakers in the Old Bailey Proceedings

Figure 1, which is based on Table 1, shows the age structure of persons involved in the trials at the Old Bailey. The figures were obtained with the statistics search function of the Old Bailey Proceedings Online:

Table 1. Age structure of participants in the Old Bailey trials.
Age	1670-	1680-	1690-	1710-	1720-	1730-	1740-	1750-	1760-	1770-	1780-	1790-	1800-	1810-	1820-	1830-
0-10	5	14	7	0	4	8	1	5	16	8	37	12	35	47	40	29
11-20	25	36	15	5	13	12	11	32	139	81	210	1388	1931	3845	5298	2633
21-30	0	0	0	0	0	0	2	2	6	4	13	2192	3647	5191	4253	2248
31-40	1	0	0	0	0	0	0	0	4	0	1	1120	2098	2653	1757	856
41-50	2	0	0	0	0	0	0	0	0	1	0	579	1080	1453	966	400
51-60	1	2	0	0	0	0	0	0	0	0	5	297	545	782	464	198
60+	0	4	0	1	1	0	0	1	1	6	10	142	275	353	236	75

Figure 1. Age structure of participants in the Old Bailey trials.

Note that the figures also include the age of persons only mentioned during, but not necessarily present at, the trials. Nevertheless, the above gives a good impression of the situation: only from the 1790s onwards is age more systematically mentioned in the original Proceedings (and therefore tagged in the electronic version), and only from then on is the age structure of society more accurately reflected in the markup. The aim is to provide more age information for the time predating the 1790s during the annotation of the OBC.

The gender structure of the speakers in the Proceedings is indicated in Table 2 and Figure 2. Since speech passages and speakers are not identified in the original electronic version, an indirect approach via the sex of the defendants was taken here.

Table 2. Defendant gender structure in the Old Bailey Proceedings Online.
Gender	1670s	1680s	1690s	1700s	1710s	1720s	1730s	1740s	1750s	1760s	1770s	1780s	1790s	1800s	1810s	1820s	1830s
M	415	2341	2386	472	2150	3640	3517	3051	3154	3501	5614	7374	5441	6589	11719	17786	8968
F	159	916	1695	487	1390	2152	2082	2006	1672	1565	2216	2438	1820	2597	3355	4374	2353

Figure 2. Defendant gender structure in the Old Bailey Proceedings Online.

On average, 72.6% of the defendants are men, and it is expected that the final version of the OBC will contain roughly the same percentage of male speakers. It might be objected that a corpus for sociolinguistic research should aim for a balanced representation of the genders, but this is impossible with the Proceedings since the trials always involve males (judges, prosecutors, lawyers, etc. were exclusively male) but not necessarily females. In addition, the very high number of speakers ensures that even in decades where the percentage of women is low, as in the 1820s and 1830s, their absolute number is still in the thousands.
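The percentage can be checked against Table 2. The sketch below assumes that the average is the pooled share of male defendants over all decades (the text does not state how it was computed); the variable and function names are chosen for illustration. It also yields the share for each decade, so that decades with a low proportion of women can be identified.

```python
# Defendant counts per decade (1670s to 1830s), copied from Table 2.
male = [415, 2341, 2386, 472, 2150, 3640, 3517, 3051, 3154,
        3501, 5614, 7374, 5441, 6589, 11719, 17786, 8968]
female = [159, 916, 1695, 487, 1390, 2152, 2082, 2006, 1672,
          1565, 2216, 2438, 1820, 2597, 3355, 4374, 2353]
decades = [f"{year}s" for year in range(1670, 1840, 10)]


def male_share(m, f):
    """Percentage of male defendants among all defendants."""
    return 100 * m / (m + f)


# Pooled share over the whole period (assumed reading of "on average").
overall = male_share(sum(male), sum(female))
# Share per decade, keyed by decade label.
per_decade = {d: male_share(m, f) for d, m, f in zip(decades, male, female)}

print(f"pooled male share: {overall:.1f}%")
for decade, share in per_decade.items():
    print(decade, f"{share:.1f}%")
```

Under the pooling assumption the result rounds to 72.6%, in line with the figure given above.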

The Old Bailey Proceedings Online lists over 4,000 occupations and status labels of the participants in the trials at the Old Bailey, from accoutrement-maker to yeoman. These labels are given in their original spelling and are sometimes more detailed than one would need them to be (for example we find servants to blacksmiths, to gentlemen, to goldsmiths, to leather-sellers, to midwifes, to poulterers, to public houses, to washerwomen, etc.). After standardization of these labels, the actual number of occupations will be much lower and easier to handle for the end user.
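What such a standardization might look like can be sketched as follows. The rule used here (collapsing all 'servant to ...' labels into one category) and the function name are hypothetical and serve only as an illustration; they are not the scheme applied in the OBC.

```python
import re


def standardize(label):
    """Reduce an occupational label to a coarser category (hypothetical rules)."""
    # Normalize case and spacing first.
    label = " ".join(label.lower().split())
    # Collapse 'servant to a blacksmith', 'servants to public houses', etc.
    if re.match(r"servants? to\b", label):
        return "servant"
    return label


# Invented label variants modelled on the examples mentioned in the text.
labels = ["Servant to a Blacksmith", "servant to gentlemen",
          "servants to public houses", "Yeoman"]
categories = sorted({standardize(label) for label in labels})
print(categories)
```

In this toy example four original labels are reduced to two categories; a real mapping would also have to deal with spelling variation.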

The above-mentioned constraints limit the usefulness of the Old Bailey Proceedings Online for linguistic purposes and were one of the motivations for turning the Proceedings into a linguistic corpus. Since the Proceedings cannot be downloaded from the website in their entirety and individual trials are only displayed as raw text, without tags, a copy of the XML-tagged version was obtained from Tim Hitchcock and Robert Shoemaker. This version is currently being annotated by my team and me. The major task is to identify speech passages and link them to sociobiographical speaker parameters such as sex, age, or profession.

3. The reliability of the Old Bailey Proceedings as a linguistic source

This section starts with a description of how spoken language is distributed in the Proceedings. This will be followed by two subsections assessing the reliability of these trial accounts as a linguistic source, first by considering external factors surrounding the genesis of these texts and by comparing a trial from the Proceedings to an alternative account (3.2), and second, by testing their internal linguistic consistency through a quantitative analysis of negative contraction (3.3).

3.1 Spoken language in the Old Bailey Corpus

Figure 3 shows the number of 1st and 2nd person singular and plural pronouns as a rough measure of the amount of direct speech reported in the first six decades of publication of the Proceedings. Figure 4 relates the number of pronouns to the total number of words by indicating mean frequencies of pronouns. The reason for relying on this indirect approach via pronouns is that the formal text-structuring conventions for marking direct speech varied a lot in the early years and make automatic tagging (see 4.2) almost impossible. The pronoun forms counted are I, my, mine, me, myself, you, your, yours, yourself, yourselves, thou, thy, thine, thee, thyself, we, ours, us and ourselves. Our was excluded because it frequently occurs in 'our Lord the King'. As there are a number of alternative versions of the Proceedings in the early years, only the longer version was included in the count.
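The measure can be illustrated with a short sketch. It is not the script that produced Figures 3 and 4: the tokenization (a word is a run of letters or apostrophes) is a simplifying assumption, and the function name and the sample sentence are invented.

```python
import re

# Pronoun forms listed in the text; "our" is left out, as in the original count.
PRONOUNS = {
    "i", "my", "mine", "me", "myself",
    "you", "your", "yours", "yourself", "yourselves",
    "thou", "thy", "thine", "thee", "thyself",
    "we", "ours", "us", "ourselves",
}


def pronoun_rate(text):
    """Return (pronoun count, pronouns per 100 words) for one text."""
    # Simplifying assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text.lower())
    hits = sum(1 for word in words if word in PRONOUNS)
    rate = 100 * hits / len(words) if words else 0.0
    return hits, rate


# Invented sample sentence: 8 words, 3 of them counted pronouns.
result = pronoun_rate("I saw you take my Watch, said he.")
print(result)
```

The count corresponds to the absolute numbers in Figure 3 and the rate to the frequencies per 100 words in Figure 4. A purely form-based count of this kind will also pick up homographs such as the noun mine, which is one reason why the measure is only a rough one.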

Figure 3. Absolute number of 1st and 2nd person pronouns by year.

Figure 4. Mean frequencies of 1st and 2nd person pronouns per 100 words by year.

The figures show that direct speech became more common only in the 1720s, although there is some measure of spoken language even in earlier trial accounts, particularly in the 1674-1679 and 1692-1695 periods. The comparatively high amount of direct speech in 1678, 1692 and 1706 is due to individual Proceedings, 16781211, 16920406 and 17061206, which report considerably more spoken language than the other proceedings in those years. A closer look at these pre-1734 Proceedings reveals that a good part of the direct speech was not originally uttered in court but is actually embedded in 3rd person narration. That is, the spoken language reported in these early accounts is not that of plaintiffs, defendants or other participants in the lawsuit but that of a third party, as illustrated by the following excerpt:

Watson for himself said, That being ordered by the Plaintiff to Arrest Dorothy Midgley, when he came to the door, he heard the Boy say, I will run my Spit in some of your guts; but putting him aside, he Arrested his Prisoner, and heard some body cry out, I am killed; upon which he run to him … (16781211-23, my emphasis)

These spoken passages are generally short and there is little information on the sociobiography of the speakers. However, the major limitation of their usefulness is the fact that there is a considerable time lapse (weeks or even months) between the original speech event and its recording. The reliability of the data is further diminished due to the intermediary role of the person reporting the utterance in question, who is the immediate source for what the scribe takes down.

Figure 5 gives the total number of words per decade, as well as the proportion of direct speech from 1734 onwards:

Figure 5. Total number of words per decade and proportion of direct speech, 1734-1834.

From the 1730s onward, a relatively high proportion (almost 85%) of the Proceedings is made up of spoken language. The Proceedings therefore constitute a rich source of data for the study of speech in the 18th and 19th centuries.

3.2 Testing the reliability of the Old Bailey Corpus: external factors

It has been argued that from a historian's point of view the material reported in the Proceedings is rather accurate:

Although initially aimed at a popular rather than a legal audience the material reported was neither invented nor significantly distorted. The Old Bailey Courthouse was a public place, with numerous spectators, and the reputation of the Proceedings would have quickly suffered if the accounts had been unreliable. Their authenticity was one of their strongest selling points, and a comparison of the text with other manuscript and published accounts of the same trials confirms that they accurately report what was said in court. (Hitchcock & Shoemaker 2007b)

But Hitchcock & Shoemaker go on to caution that a comparison with alternative accounts of the same trials shows that the Proceedings are not complete (though often the most comprehensive account) and that even the most detailed later Proceedings are only partial transcripts of what was said: 'At the very least, in an attempt to save space, minor details and repetitions, perceived as unimportant, were frequently left out of recorded testimony' (Hitchcock & Shoemaker 2007b). In spite of this, and in the absence of better data, the records of the trials at the Old Bailey are arguably as near as we can get to the spoken word of the 18th and early 19th centuries.

As shown above, about 85% of the text in the Proceedings from the 1730s onwards is direct speech. For a linguist trying to reconstruct the speech of the period, an important development in the Proceedings is the switch from third-person to first-person accounts in the 1710s. The early Proceedings tended to give more or less judgmental — and sometimes sensationalist — accounts of the 'most notable trials', as in the trial of Elizabeth Scot for theft on 16 January 1682:

Extract 1:

Elizabeth Scot was Indicted for stealing Plate, to the value of 30 pounds on the 10 of December from Mr. Comissary of Algare-Parish, which was evidently pro[ven] against her, she being taken with it in her Lap, upon which, [she plea]ded, that she had been drinking, and knew not what she did, but that served not her turn, for she was found Guilty. (u16820116a-1, my emendation)

The detail reported for individual trials increased considerably in the 18th century, when scribes reported witness testimonies, statements and arguments of the prosecution and the defence, cross-examinations, etc. Compare Extract 1 with the following extract from the trial of Elizabeth Whitney on 27 February 1740, which includes monologues as well as shorter question-answer exchanges and amounts to over 1,500 words:

Extract 2:

Elizabeth Whitney alias Dribray, and Mary Nash alias Goulding, were indicted for assaulting George Stacey, in the Dwelling-House of William Needham, putting him in Fear, &c. and taking from him, a Moidore, a Thirty-six Shilling-piece, and 30 Guineas. Nov. 20.

Mr Stacey. On the 20th of November I was going towards Temple-Bar, and the Prisoners followed me, and asked me to give them a Quartern of Brandy, or Gin. I did not value such a little Matter, so I followed them into Needham's House, at the Rose by Temple-Bar, and there we had 3 Quarterns of Gin. […] But getting off my Chair, both of them took hold of me by Force of Arms, and held me so, that I could not help myself. I was a little in Liquor, but not much. I knew very well what I did, and I stood upon my Feet for some Time before they got me down; but at last they did get me down, and I cried Murder! Help, for God's Sake! Whitney told me, if I cried out, she would cut my Throat. They gagged me; and my Mouth was so full of Blood, that I could not speak; and Blood likewise gushed out of my Nose. Then they took from me all my Money; 30 Guineas, and upwards, a Moidore, and a Thirty-six Shilling-Piece. […]

Whitney. Was any Thing found upon me?

Mr Stacey. No, I believe not; but she was a Party concerned: her Hand was in my Pocket, as well as Nash's. They both attacked me; one of them throtled me, and the other robbed me.

Whitney. Was not you in Needham's House before I came in?

Mr Stacey. No; I was not, I followed you in.

Sarah Scot. I have a Sister, (one Murphin) who keeps a Stall at Temple-Bar, and she sent for me that Night, to come and open Oysters for her. […]

The Jury found both the Prisoners Guilty, Death. (17400227-2)

Historical reliability as described by Hitchcock & Shoemaker 2007b is not the same as linguistic reliability. The omission or misrepresentation of factual detail in a historical document does not necessarily mean that the spoken language reported in that same document is unreliable. As an example, I will consider the recording of non-standard features, which could be taken as an indication of linguistic faithfulness. The Proceedings are generally written in standard orthography, but sometimes we find non-standard pronunciation (and morpho-syntax) in individual speakers, such as in the following deposition by an Irishman:

James Fitzgerald depos'd to this Effect: On the 25th of February last, about 11 at Night, O' my Shoul, I wash got pretty drunk, and wash going very shoberly along the Old-Baily, and there I met the Preeshoner upon the Bar, as she wash going before me. I wash after asking her which Way she wash walking: And she made a Laugh upon my Faush, and told me to Newtoner's-Lane. […] (17250407-66)

Non-standard phonological and morpho-syntactic detail of this kind is often found in the speech of Irishmen and other foreigners. A certain degree of stereotyping for comic effect on the part of the scribe cannot completely be ruled out, especially if the speaker in question dominates the trial in terms of length of utterance, as in this case. Incidentally, the publisher 'earned a censure from the City authorities for the "lewd and indecent manner" in which the trial was reported' (Hitchcock & Shoemaker 2007a, Shoemaker forthcoming), which is an indication of the control that the City exerted not only on what was reported but also on the language in which it was reported (see also 3.3.1).

Sometimes, however, non-standard passages are embedded in otherwise completely serious discourse, with no indication of any comic intentions, as in the testimony of Osborn Jones, possibly a Welshman:

I came home to Tinner ant wass coing into my own Room, put the Prissoner's Wife callt to me and sait Here iss your coot Oman. So I hust her a pit, and ask her why a Tiffel she coudn't keep in her own Hapitation when I wanted my Tinner. So the Prissoners Wife prought out a pag with a crate teal of coolt in it. There was a crate many Pieces, a crate teal pigger as Guineas. […] (17350522-1)

The recording of non-standard features seems to be rather unbiased here and the non-standard spelling faithfully indicates a typical feature of Welsh English, the strong aspiration of voiced plosives (therefore perceived as voiceless by speakers of English English, see e.g. Thomas 1994: 122-123).

Nevertheless, even if the Proceedings were a 100% accurate record of the historical facts (which they are not), this would not automatically mean that the direct speech passages are a completely faithful picture of what was said in court. Written representations of spoken language can be several steps removed from the actual speech act and it is the task of the linguist to reconstruct the original speech event on the basis of the written text. This is what Schneider (2002: 68) calls the Principle of Filter Removal:

a written record of a speech event stands like a filter between the words as spoken and the analyst. As the linguist is interested in the speech event itself (and, ultimately, the principles of language variation and change behind it), a primary task will be to 'remove the filter' as far as possible, i.e. to assess the nature of the recording process in all possible and relevant ways and to evaluate and take into account its likely impact on the relationship between the speech event and the record, to reconstruct the speech event itself, as accurately as possible.

After a categorization of text types and their proximity to speech, Schneider (2002: 73) goes on to say that '[d]irect transcripts are clearly the most reliable and potentially the most interesting among all these text types' and names trial proceedings as characteristic examples of this category. Still, it is clear that as written records, even trial proceedings cannot be a completely faithful representation of the speech event and have to be handled with care. In addition to a consideration of the recording conditions, Schneider (2002: 86) lists internal consistency and external fit as important criteria for assessing the validity of written texts representing spoken language: internal consistency refers to the consistent portrayal of variable features across large corpora, ideally deriving from several sources (e.g. different authors), while external fit measures the degree to which results of analyses based on a specific corpus agree with findings of other studies. Culpeper & Kytö (2000) compare four 17th-century speech-related text-types (witness depositions, trial proceedings, prose fiction, and comedies) with the aim of establishing how true they are to the original speech event. Based on the criteria of lexical repetitions, turn-taking features, and single-word interactive features (e.g. demonstrative pronouns), they conclude that 'there is a strong case for drama, but that there is also a case for trial proceedings' (2000: 195). [4]

Kytö & Walker (2003) assess the faithfulness of trial proceedings and witness depositions in representing authentic speech. Although both are purportedly verbatim texts (or at least conventionally assumed to be such), 'one could not expect the same standard of accuracy in quoting spoken interaction as one would when quoting a written text' (224). In addition, they caution that '[e]ven with the most faithful of records, it is to be expected that certain typical features of speech such as false starts, pauses, slips of the tongue, and the like would be filtered out …' (225). This is certainly true for the Old Bailey Proceedings, which for the most part lack some of the non-fluency characteristics of unscripted spoken language, such as hesitations (uhm, er, etc.), unfinished sentences, repetitions, etc. When using sources like trial transcripts, one has to bear in mind that the primary aim of the scribe was not to record linguistic detail but the substance of the trial.

Just as Schneider (2002), Kytö & Walker (2003: 228) acknowledge that 'written records of a speech event are susceptible to interference — whether conscious or inadvertent — throughout the production process'. With this in mind, I will now attempt to assess the faithfulness of the spoken language in the Proceedings. Following the agenda set up by Schneider (2002: 86), I will do this by discussing the recording conditions, external fit, and internal consistency.

3.2.1 The genesis of the Proceedings of the Old Bailey: general remarks and recording conditions

From the original speech event during a trial at the Old Bailey to the printed Proceedings, we can distinguish at least five consecutive stages (where t = time):

t1	speech event
t2	recording (shorthand, orthographic notes)
t3	preparation of MS for printer (e.g. expanding shorthand notes into orthographic text)
t4	proofreading
t5	typesetting

Each of these stages could potentially have altered the linguistic material of the utterance. At present it is still unclear whether the Proceedings actually went through t3 and t4 — it is imaginable, though rather unlikely, that typesetters worked directly from the shorthand manuscript. Be that as it may, we have to remove several layers of filters, imposed by the scribes (first while taking the shorthand notes in court and later when expanding them for the publisher), by the proofreaders as well as printers, by the typesetters and by the publishers (who, in addition to their own idiosyncrasies, might impose a house style).

The accounts were published just a couple of weeks after the trials. For example, the Proceedings of the sessions on 11 and 12 December 1678 were licensed for publication a mere week later, on 18 December. This practice of rapid publication continued with the much longer later Proceedings (cf. e.g. 18281204, which were published before the end of the year). In fact, once the Proceedings came to be regarded as an official record, the City took an interest in ensuring speedy publication, as can be seen on the title page of the Proceedings published in December 1775:

At a Common Council holden in the Chamber of the Guildhall of the City of London on Friday the 17th of November 1775, A MOTION was made and QUESTION put, That the whole Proceedings on the King's Commission of the Peace, Oyer and Terminer, and Gaol Delivery for the City of London, and also the Gaol Delivery for the County of Middlesex, held at Justice Hall in the Old Bailey, be regularly, as soon as possible after every Session, published by the Recorder, and authenticated with his Name: The same was resolved in the Affirmative. (17751206, my emphasis)

Schneider (2002: 72) mentions 'the temporal distance between the speech event itself and the time of recording' as one parameter influencing the accuracy of the written record as a representation of the original utterance: the longer the interval between the two, the higher the risk of misremembrance. In the case of the Proceedings, t1 and t2 are near simultaneous (the scribe took notes during the utterance) and t3-t5 followed shortly after, i.e. the time factor does not pose much of a problem here.

What seems potentially more problematic is the recording technique used at t2: not all techniques (mechanical recording, shorthand, longhand etc.) are equally suitable to record linguistic detail. Kytö & Walker (2003: 228) mention that one of the factors influencing the reliability of a written record in terms of its faithfulness to the speech event is the script (notes or shorthand) used by the scribe. The (somewhat idealizing) implication is that shorthand is more reliable than notes because the latter are by nature sketchy and would have to be expanded later, relying more or less heavily on the memory of the scribe, while the former records the totality of the event in situ.

3.2.2 The uses and limitations of shorthand transcripts

From at least 1749 onwards, but probably from the very beginning in the 1670s, the proceedings at the Old Bailey were recorded in shorthand. A thorough analysis of 18th-century shorthand practices and their influence on the linguistic reliability of the Old Bailey Proceedings would go beyond the scope of this paper, but a brief overview of the possibilities and limitations of stenography with regard to the faithful representation of the original speech event will show the important consequences that the script has for the preservation of linguistic detail.

One of the more influential and popular 18th-century shorthand systems was developed by Thomas Gurney, the scribe who took down the Proceedings from at least 1749 to his death in 1770. Gurney's Brachygraphy or short-writing first appeared in 1752, ran through twelve editions in the 18th century and was reprinted several times in the 19th century. If we assume that in recording the trials Thomas Gurney, and later his son Joseph, who succeeded his father in 1770, used a shorthand system identical or similar to that described in Brachygraphy, then a closer inspection of this system may reveal important clues as to the linguistic reliability of the Proceedings. I will start with a brief characterization of the script and then focus on the implications for the rendering of spoken language in the Proceedings.

3.2.2.1 Gurney’s shorthand: an overview

Gurney's (1752: 3) avowed objective was to enable the shorthand writer 'to take a Speech, or Sermon verbatim, as a Person talks in common'. His script consists of an 'alphabet' of invented symbols for consonants and vowels but has some characteristics of a consonantal writing system in that vowels can be left out. For example, he transcribes <lmntsn> 'lamentation' or <msngr> 'messenger' (p. 11). In spite of this, vowels are often indicated by diacritics (through dots or the vertical position of the following consonant). High-frequency words 'such as Prepositions, & terminations' are represented by 'arbitrary Characters' (3). These logographic elements are mostly derived from symbols of the basic alphabet and extended iconically (e.g. little 'little' vs. large 'large', both from the symbol for l, p. 12) and can represent several words (for instance a dot. can mean 'they, thee, the, thy, of', p. 14). In principle, however, Gurney's shorthand follows conventional orthography, as illustrated by his transcription of loan, which indicates both <o> and <a> (p. 13):


'loan'	l.o.an

However, the script also has some phonological traits in that e.g. <gh> in brought is omitted (p. 13):


'brought'	br.ot

In a similar manner, law is rendered as ˙ (l-a, p. 27).

The following example illustrates the phonological principle in that the <e> in single and line is omitted. It also contains orthographic elements in that (1) – stands for <in> and thus represents both [In] (in single) and [aIn] (in line) and (2) the velar nasal in single is expressed by two symbols, – and . [5]


'single line'	s.in.g.ll.in

I will now consider some implications of Gurney's stenographic script for the faithfulness of his trial transcripts. As mentioned before, this is meant as a first approach to the problem, not as an exhaustive analysis.

3.2.2.2 Capturing morpho-syntax through shorthand

The symbols introduced in Gurney's chapter on 'Persons Moods & Tenses' (19-22) do not distinguish between inflected and uninflected auxiliaries ( stands for 'may' or 'mayst', for 'can' or 'canst', for 'should' or 'shouldst', etc., p. 18-20). One could argue that this renders the Proceedings less reliable as far as the inflection of verbs in the 2sg present tense is concerned, but apart from the fact that the context would disambiguate the possible readings (you may but thou mayst), we know that by 1700 thou and the appropriate -est inflection had undergone functional contraction and were largely restricted to dialects, biblical and archaic language, and the speech of Quakers (Görlach 1991: 85, 88). [6] However, even if 2sg verbal inflection is only marginally relevant in the 18th century, the foregoing is an indication that even shorthand could not have been absolutely accurate in the recording of details like inflections. A further example is provided by the symbol for the indefinite article, a dot placed on the top left of the noun phrase, which stands for the two allomorphs a and an (p. 16). A study on variation in this area (e.g. a ~ an before nouns starting h-) has to proceed very carefully indeed since the shorthand manuscript would not have distinguished between a and an, using a simple dot in both cases. Only when expanding his shorthand notes for the typesetter would the scribe choose a particular allomorph. That is, the form of the indefinite article that we find on the printed page of the Proceedings depended for a good part on the scribe's memory. With high-frequency items like inflections or articles it is very unlikely that the scribe would have remembered the exact variant used in every single instance, not even after only a couple of hours.

This is a rather sobering finding, given that shorthand-based recordings of spoken language have so far been accepted as relatively faithful in the literature (see above). Nevertheless, Gurney's brachygraphy does record other details, including features of spoken language like auxiliary contractions: 'you will' (you w-il) vs. 'you'll' (you-l, p. 27). But even here there are difficulties, e.g. when we turn to proclitic 't (< it): Gurney's shorthand representation of 'twill (< it will) is ambiguous since the symbol for t is also used as an abbreviation for it (p. 11): Gurney transcribes orthographic 'twill ׀ (p. 26-27), with a space between t/it and will (similarly 'tmay ׀ ˙). Because of this space there is no way of knowing, on the basis of the shorthand manuscript alone, whether ׀ represents the full form it or proclitic 't.

The foregoing remarks demonstrate the need for a study establishing whether results based on an analysis of features that can be unambiguously encoded in the shorthand script (e.g. contraction of will/shall) are more reliable than results based on features where the shorthand system is ambiguous (e.g. it ~ 't). In any case, linguists analyzing spoken language recorded in stenographic writing will do well to familiarize themselves with shorthand practices of the period in order to assess the reliability of their material.
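The distinction between recoverable and non-recoverable features can be made concrete with a toy model. The following sketch is not Gurney's system: the symbol names (DOT, T W-IL, YOU-L, etc.) are hypothetical placeholders for the shorthand characters, and only the many-to-one relationships described above are taken from the manual.

```python
# Toy model of a lossy shorthand encoding. The symbol names are
# hypothetical placeholders, not Gurney's actual characters.
ENCODE = {
    "a": "DOT",
    "an": "DOT",              # one dot for both article allomorphs
    "it will": "T W-IL",
    "'twill": "T W-IL",       # t and it share a symbol
    "you will": "YOU W-IL",
    "you'll": "YOU-L",        # the contraction has its own form
}

def expansions(symbol):
    """Return every longhand form that the shorthand symbol could stand for."""
    return sorted(form for form, sym in ENCODE.items() if sym == symbol)

def is_recoverable(form):
    """True if the form is the only possible expansion of its symbol."""
    return expansions(ENCODE[form]) == [form]

for form in ENCODE:
    print(f"{form!r:12} recoverable: {is_recoverable(form)}")
```

In this model, you will and you'll can be recovered from the manuscript, whereas a ~ an and it will ~ 'twill depend on the choice made by whoever expands the notes.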

Unfortunately, the examples in Gurney's manual do not contain instances of negative contractions like can't or doesn't, the feature analyzed in the following sections. Gurney's chapter on 'The Negative not ¬ ' (p. 23) only transcribes the expanded forms would not, cannot, shall not, must not, might not, may not, ought not, and was not. Therefore, although it is theoretically possible to represent negative contractions in Gurney's system, it is impossible to say whether Gurney would have differentiated between cannot and can't, etc. This can only be checked by comparing an original shorthand manuscript with the printed version of the Proceedings, but so far, such manuscripts have not come to light (Tim Hitchcock, p.c. 2007-02-20).

What we do have, however, is interesting evidence concerning the working methods of the scribes in the trial accounts themselves. In the second trial of Elizabeth Canning, Thomas Gurney was asked to report to the court what he had taken down during the first trial. [7] The attorney then proceeded:

Mr. Davy. As far as you have mentioned, are you able to say upon your oath, that that was the evidence that the girl upon her oath then gave in court?

Gurney. The substance of it is the evidence she gave in court.

Note Gurney's use of 'substance' rather than something like 'her very words' (cf. also 17560528-45). Later in the trial, Gurney was asked to compare the testimony of Elizabeth Canning’s mother with what she had said in the first trial:

Mr. Davy to Thomas Gurney. You hear the evidence this woman has given; look at your minutes, and give an account of what she said in her evidence on that trial, as to the state and condition in which her daughter came home, and particularly how she was dress'd.

[Gurney cites the evidence]

Mr. Davy. Were these her own words?

Gurney. I have here mentioned the person she where she said I. I will not take upon me to say these are the very words she made use of, or that she made use of no more words; it is my method, if a question brings out an imperfect answer, and is oblig'd to be ask'd over again, and the answer comes more strong, I take that down as the proper evidence, and neglect the other; […] It is not to be expected I should write every unintelligible word that is said by the evidence. (The Trial of Elizabeth Canning, 1754, pp. 19-20, 104)

Again, Gurney made it clear that while he strove to be faithful to the spoken word, this was not always possible or even desirable (on the role of the scribe, cf. also Shoemaker forthcoming). In a 1758 trial, he was asked to recount the statement of a foreigner, which he did in Standard English, adding that 'I took that to be his meaning which I have printed, he speaking as most of the foreign Jews do, a sort of broken English', making it clear that there was a linguistic difference between the actual speech act and its representation in the Proceedings (17580113-30).

3.2.3 Comparison of a trial in the Proceedings with an alternative account

The left column of the text in Appendix A shows an extract from the trial of John Ayliffe, 17591024-27, and an alternative account of the same trial, entitled The tryal at large of John Ayliffe (henceforth referred to as Tryal) in the right column. [8] One can see at first glance that the account in the Proceedings is considerably shorter than the alternative Tryal (718 as opposed to 1,290 words). It is interesting that both versions were 'Printed, and sold by M. Cooper at the Globe in Pater-noster-Row', so one would suppose that Cooper would either have produced a longer and a condensed version from the same manuscript or produced the longer Tryal first and abridged it for inclusion in the Old Bailey Proceedings. However, things are not that straightforward. There is some overlap between the two versions, as in e.g. lines 3, 30, 52:

	Proceedings	Tryal
3	Thomas. I am clerk to Mr Jones, a Stationer in the Temple.	Henry Thomas. I am clerk to Mr Jones, a Stationer, in the Temple.
30	Prisoner. I should be glad to look at that deed.	Prisoner. I should be glad to look at that deed.
52	Hargrave. It is: I saw Mr Ayliffe sign this receipt for 1700 l.	Walter Hargrave. It is: I saw Mr Ayliffe sign this receipt for 1700 l.

Sometimes the Proceedings simply omit some text of the longer version, either complete speech acts as in lines 21 or 42:

	Proceedings	Tryal
21		John Fannen. I am not sure; but to the best of my remembrance, it was sometime the beginning of December last, at Mr Fox's house.
42		Lord Chief Justice. Let it be read.

or parts of a speech act, e.g. line 48:

	Proceedings	Tryal
48	Hargrave. By Mr Ayliffe: I saw him seal and deliver it.	Walter Hargrave. By Mr Ayliffe. – I saw him sign, seal, and deliver it, as his act and deed.

But there are also more serious differences between the two versions:

The Proceedings contain material not found in Tryal:

	Proceedings	Tryal
19	Fannen. This deed (taking the counterpart in his hand) was executed by Mr Ayliffe in my presence.	John Fannen. This deed was executed by Mr Ayliffe in my presence.

Complete paraphrasis:

	Proceedings	Tryal
60	Hargrave. Because he said he was not willing Mr Fox should know of it?	Walter Hargrave. The reason Mr Ayliffe gave, was, that he would not on any account have it come to Mr Fox's ears.

Lexical differences, e.g. positively > particularly, leave out > leave a blank:

	Proceedings	Tryal
7	Thomas. I can't particularly say that; sometimes we leave a blank by the gentlemens desire, perhaps they may add another covenant, or something of that sort, I can't recollect the reason for that.	Henry Thomas. I cannot positively say. – We sometimes leave out the conclusion by gentlemen's desire, in order that they may add a covenant, or some such thing, if it should be thought necessary; but I cannot particularly recollect the reason why the conclusion was omitted in this case.

Differences in morphology and syntax (e.g. sometimes we leave vs. we sometimes leave line 7, can't vs. cannot, was you? vs. Are you? line 24, two sentences in Tryal fused into one, line 55):

	Proceedings	Tryal
7	Thomas. I can't particularly say that; sometimes we leave a blank by the gentlemens desire, perhaps they may add another covenant, or something of that sort, I can't recollect the reason for that.	Henry Thomas. I cannot positively say. – We sometimes leave out the conclusion by gentlemen's desire, in order that they may add a covenant, or some such thing, if it should be thought necessary; but I cannot particularly recollect the reason why the conclusion was omitted in this case.
24	Q. Was you a subscribing witness to this deed?	Mr Wedderburne. Look upon the back of this deed; are you one of the subscribing witnesses?
55	Q. Do you remember any request being made, and by whom, to keep this mortgage a secret?	Mr Aston. Do you remember any request being made at this time, to keep the mortgage of that lease a secret? and if you do, tell us by whom such request was made, and who were present.

The last point in particular casts doubt on Kytö & Walker's (2003: 234) statement that '[w]hat a 'faithful' or 'verbatim' record is generally expected to convey, to a large extent, is the lexical items and grammatical structures'. The differences between the two versions suggest that they come from two different scribes rather than being an abridged and an expanded version based on the same manuscript. What this shows us yet again is that the Proceedings (just like other early trial accounts) cannot naïvely be taken to contain truly verbatim accounts of the trials at the Old Bailey, even though they were taken down in shorthand. At the same time, however, they are not automatically less reliable than other accounts.

3.3 Testing the reliability of the Old Bailey Corpus: a quantitative analysis of negative contraction

In the following I will test the reliability of the Proceedings as a source of spoken language on the basis of the variation between contracted and uncontracted forms of negated auxiliaries, such as do not vs. don't.

The choice of negative contraction as a diagnostic feature for the linguistic reliability of a text representing speech is motivated by the fact that negative contraction is an established characteristic of present-day spoken language (Greenbaum & Nelson 2002: 211; Mazzon 2004: 105), including legal English. For example, in the 13 courtroom texts of spoken English (127,474 words) included in the web-version of the British National Corpus, contracted forms account for 72.4% of all negated auxiliaries (807/1,115), while in the corresponding written category (non-academic political, legal and educational texts; 4,477,831 words in 93 texts) they make up just 15.3% (1,808/11,852), and many of these actually occur in quotes of spoken language. [9] Given the tendency of contracted forms to predominate in today's spoken English, we proceed on the hypothesis that negative contraction is more frequent in the spoken passages than in the prose text of the Proceedings.

Tables 3 to 6 show the distribution by decade of contracted and uncontracted negation involving auxiliaries in the prose and speech passages of the Old Bailey Corpus (OBC) from 1732 to 1834. [10] The tables subsume orthographical variants under the forms indicated in the first column. Thus haven't includes haven't, ha'n't, han't, and so on. Tables 3 and 4 are based on the speech passages, Table 5 on the prose passages, and Table 6 presents a summary:

Table 3. Negative contraction in the Old Bailey Corpus, speech only.
Auxiliary	1732-	1740-	1750-	1760-	1770-	1780-	1790-	1800-	1810-	1820-	1830-	Total
aren't	1	4	0	0	0	0	0	10	0	1	2	15
can't	625	1081	1209	450	323	202	26	38	217	130	146	4447
don't	962	1043	1530	738	1478	710	4115	1314	789	1161	1388	15228
haven't	12	5	1	0	0	0	0	1	3	0	0	22
shan't	25	2	1	0	3	2	0	3	6	32	35	109
won't	147	54	3	2	27	13	42	48	33	168	115	652
Total	1767	2189	2743	1191	1831	927	4183	1414	1048	1492	1684	20467

Table 4. Uncontracted negation in the Old Bailey Corpus, speech only.
Auxiliary	1732-	1740-	1750-	1760-	1770-	1780-	1790-	1800-	1810-	1820-	1830-	Total
are not	81	133	139	166	245	715	600	423	286	570	422	3780
cannot [11]	271	1035	911	1718	2139	7983	6984	4698	2393	4959	4282	37373
could not	716	1378	1450	1779	1861	4038	3750	3093	3162	5548	3801	30576
dare not	3	4	10	12	7	24	8	7	11	20	15	121
did not	2049	4505	5075	5483	6660	16922	16246	12063	10771	23109	18396	121279
do not	72	709	782	1717	1386	8941	4364	4133	3224	7081	5697	38106
does not	32	133	100	102	113	715	512	259	216	409	264	2855
had not	420	749	910	1201	1209	2732	2933	2587	2528	5137	4193	24599
has not	48	95	95	82	85	360	276	198	136	417	392	2184
have not	115	267	340	318	324	1311	1304	869	682	2019	1780	9329
is not	164	478	550	579	679	2857	2248	1432	1256	2384	1811	14438
may not	8	21	20	18	22	104	90	33	13	21	21	371
might not	53	77	73	70	89	285	227	168	120	189	153	1504
must not	31	40	35	38	59	158	153	103	72	69	36	794
need not	36	44	55	54	72	166	85	76	62	126	88	864
ought not	7	14	2	11	26	74	54	36	24	30	43	321
shall not	25	66	68	120	96	286	190	116	85	132	101	1285
should not	117	253	267	368	373	924	782	634	540	1050	822	6130
will not	53	209	281	335	282	1104	819	439	343	712	570	5147
would not	693	1382	1313	1574	1488	2911	2636	2039	1810	3088	2210	21144
Total	4301	10210	11163	14171	15727	49699	41625	31367	25924	53982	42887	301056

Table 5. Uncontracted negation in the Old Bailey Corpus, prose only.
Auxiliary	1732-	1740-	1750-	1760-	1770-	1780-	1790-	1800-	1810-	1820-	1830-	Total
are not	1	5	1	1	3	16	11	1	3	1	0	43
cannot	3	18	5	4	6	38	18	4	1	3	0	100
could not	52	120	86	86	103	91	40	26	26	56	21	707
did not	69	154	190	209	400	302	187	250	94	487	178	2520
do not	0	3	2	0	1	8	6	0	2	2	0	24
does not	6	18	6	9	13	34	24	6	5	5	1	127
had not	25	51	26	23	41	39	18	25	11	41	32	332
has not	1	6	2	2	4	9	6	7	2	2	1	42
have not	0	4	2	1	1	2	3	1	0	0	1	15
is not	7	34	11	8	24	59	40	12	15	14	3	227
may not	1	3	0	1	1	0	2	0	0	0	0	8
might not	7	4	4	0	2	1	3	0	0	0	0	21
must not	3	1	0	1	2	2	1	0	0	2	0	12
need not	0	1	0	1	2	2	0	0	0	0	0	6
ought not	1	3	0	0	4	5	4	1	1	0	0	19
shall not	0	5	1	0	2	2	4	2	0	0	0	16
should not	16	17	5	12	12	11	9	3	0	10	4	99
was not	55	166	95	70	217	226	88	97	115	84	40	1253
will not	0	1	0	2	2	4	10	1	1	3	0	24
would not	29	44	30	16	46	38	16	14	15	20	7	275
Total	276	658	466	446	886	889	490	450	291	730	288	5870

Table 6. Summary of negative contraction in the Old Bailey Corpus (p ≤ 0.001).
	Speech		Prose
	N	%	N	%
Contracted	20473	6.4	5	0.1
Uncontracted	301056	93.6	5870	99.9

In the speech passages from 1732 to 1834, there are over 20,000 instances of contracted auxiliaries; in other words, 6.4% of all negated auxiliaries in these passages show contracted n't. Most of the tokens are accounted for by don't, can't, and won't (there are only 18 tokens of aren't, 22 of haven't, and 109 of shan't). By contrast, in the prose passages of the same period there are only five contracted forms in total, less than 0.1% of all negated auxiliaries. [12]
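The percentages in Table 6 follow directly from the token counts. A minimal sketch of the arithmetic, using only the totals given in the table (the function and variable names are illustrative):

```python
# Token counts taken from Table 6 (Old Bailey Corpus, 1732-1834).
counts = {
    "speech": {"contracted": 20473, "uncontracted": 301056},
    "prose": {"contracted": 5, "uncontracted": 5870},
}

def contraction_rate(contracted, uncontracted):
    """Share of contracted forms among all negated auxiliaries, in percent."""
    return 100 * contracted / (contracted + uncontracted)

rates = {mode: contraction_rate(**c) for mode, c in counts.items()}
for mode, rate in rates.items():
    print(f"{mode}: {rate:.1f}%")
# speech: 6.4%
# prose: 0.1%
```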

The first solid conclusion we can draw from this is that there is a significant difference in the distribution of contracted and uncontracted negative auxiliaries in the OBC, with the former being almost exclusively confined to spoken language. The corpus therefore reflects the characteristics of spoken language, but it remains to be shown in how far this distributional pattern mirrors the actual spoken language of the period. In comparison to the BNC's 72.4% contraction rate in spoken legal English, the Old Bailey's 6.4% seems rather low. Several factors can account for this discrepancy:

Overall language change. Contraction may simply have been less frequent in the 18th and 19th centuries (overall and with regard to the cliticization of n't to individual auxiliaries).
Genre change. Generic conventions (broadly written vs. spoken) may have made it more difficult for contracted negators to appear in written genres than today.
Register change. There may have been a change in the linguistic choices in the legal register, with negative contraction felt to be more colloquial than today and thus inappropriate for a formal legal register, even when spoken.
The scribal filter effect mentioned above.

To test whether the seemingly low ratio of negative contraction in the OBC results from its possibly unfaithful representation of the characteristics of spoken language, I will first compare the picture we find in the Proceedings with what we know about the general development of negative contraction in the history of English. Contraction of not emerged in the 17th century, maybe as early as 1600 in speech and in the second half of the century in writing (Barber 1997: 180; Brainerd 1989; Strang 1970: 151; Warner 1993: 208). Lass (1999: 180) notes that '[c]litic spellings are uncommon until the 1660s; they are frequent in Restoration comedy, and by the early eighteenth century seem to be the norm in speech'. It is not clear on what evidence Lass bases this last claim, but given that negative contraction got more common in writing only towards the end of the 17th century, the situation presented by the Proceedings does not seem too far off the mark.

3.3.1 External fit

A clearer picture is afforded by comparing negative contraction in the OBC with that in another corpus of spoken English, the Corpus of English Dialogues 1560-1760 (CED). The CED includes five genres: trials, witness depositions, drama comedy, didactic works, and prose fiction (for a full documentation of the CED see Kytö & Walker 2006). Table 7 shows negative contraction in the CED trial texts from 1560 to 1760. [13] For comparison, the table also includes n't-forms in the OBC for the period of overlap with the CED, 1732-1759.

Table 7. Negative contraction in CED trials 1560-1760 and Old Bailey Proceedings 1732-59.
	Corpus of English Dialogues (N = 1147)										Old Bailey
	1560-99		1600-39		1640-79		1680-1719		1720-60		1732-59
Auxiliary	N	%	N	%	N	%	N	%	N	%	N	%
can't	0	0	0	0	6	4.2	169	68.4	60	44.1	2915	56.8
cannot	7		13		138		78		76		2217
don't	0	0	0	0	5	5.6	101	51.5	82	63.6	3535	69.3
do not	6		23		85		95		47		1563
shan't	0	0	0	0	0	0	1	14.3	0	0	23	12.6
shall not	1		100		19		6		3		159
won't	0	0	1	0	0	0	11	26.2	3	21.4	204	27.3
will not	7		13		49		31		11		543

The CED corroborates the claim in the literature that negative contraction started in the first half of the 17th century but became frequent only in the last decades of that century. Comparing the last sub-period of the CED (1720-1760) with the first sub-period of the OBC considered here (1732-1759), a Chi-square test shows a significant difference only for can't/cannot (p ≤ 0.01). The differences involving the other auxiliaries are not significant (don't/do not: p ≤ 0.2, shan't/shall not: p ≤ 1, won't/will not: p ≤ 1). In other words, for the period of overlap, the rate of negative contraction is rather similar in the two corpora. Even for the significant difference with regard to can't, there is only a 12.7% gap between the CED and the OBC. There are two further parallels: (1) both corpora show an absence of negative contraction with past forms of auxiliaries, [14] and (2) for the period of overlap, the OBC shows negative contraction with the same auxiliaries that attract n't in the CED. [15] Thus, at least as far as the period of overlap is concerned, the distribution of contracted vs. uncontracted negatives in the speech passages of the OBC is similar to that of a sampled corpus, the CED. The OBC can therefore be taken to be just as representative of spoken language as other trial texts.
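The comparison of the two corpora can be sketched as a series of 2x2 tests on the counts in Table 7. The text does not state which variant of the test was used; the sketch below assumes an uncorrected Pearson Chi-square and derives the p-value for one degree of freedom from the standard library alone. The shan't pair is left out because the CED cells (0 and 3 tokens) are too small for the approximation.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and the p-value for one degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1 the upper tail of the chi-square distribution
    # equals erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# (contracted, uncontracted) from Table 7: CED 1720-60 vs. OBC 1732-59.
table7 = {
    "can't": ((60, 76), (2915, 2217)),
    "don't": ((82, 47), (3535, 1563)),
    "won't": ((3, 11), (204, 543)),
}

results = {}
for aux, (ced, obc) in table7.items():
    results[aux] = chi_square_2x2(*ced, *obc)
    stat, p = results[aux]
    print(f"{aux}: chi2 = {stat:.2f}, p = {p:.3f}")
```

Under these assumptions only the can't/cannot comparison falls below the 0.01 level, which agrees with the pattern reported above.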

The comparatively low rate of 6.4% negative contraction in the OBC is due to several factors: first of all, negative contraction is only attested with non-past auxiliary forms, while the percentage was calculated on the basis of non-past and past expanded forms. Secondly, in the OBC, n't attaches only to six auxiliaries and not to the others (n't has a larger range today). If one factors out past forms and those auxiliaries that do not attract enclitic negatives, then the picture looks more familiar from the perspective of present-day English (though still not identical, which one would not expect it to be, given the time difference of ±200 years). Note that a strictly formal approach has been taken in Table 8: starting from negative contraction, only those uncontracted auxiliary forms are included that match the contracted counterpart.

Therefore, since be shows negative contraction only in the form aren't (there are no tokens of isn't), the figures for e.g. uncontracted are not exclude am not, is not, was not, and were not. Similarly, do not excludes does not and did not, and have not excludes has not and had not:

Table 8. Negative contraction and uncontracted forms in the Old Bailey Corpus, speech only.
Auxiliary	1732-	1740-	1750-	1760-	1770-	1780-	1790-	1800-	1810-	1820-	1830-
aren't	1	4	0	0	0	0	0	10	0	1	2
are not	81	133	139	166	245	715	600	423	286	570	422
can't	625	1081	1209	450	323	202	26	38	217	130	146
cannot	271	1035	911	1718	2139	7983	6984	4698	2393	4959	4282
don't	962	1043	1530	738	1478	710	4115	1314	789	1161	1388
do not	72	709	782	1717	1386	8941	4364	4133	3224	7081	5697
haven't	12	5	1	0	0	0	0	1	3	0	0
have not	115	267	340	318	324	1311	1304	869	682	2019	1780
shan't	25	2	1	0	3	2	0	3	6	32	35
shall not	25	66	68	120	96	286	190	116	85	132	101
won't	147	54	3	2	27	13	42	48	33	168	115
will not	53	209	281	335	282	1104	819	439	343	712	570

Figure 6 is based on Table 8 and shows the percentage of the contracted negatives can't, don't, ha'n't, shan't, won't, and aren't vs. their uncontracted counterparts in the speech passages.

Figure 6. Percentage of negative contraction for six auxiliaries in the Old Bailey Corpus.

In comparison to the overall figure of 6.4% the figure shows a considerably higher percentage of negative contraction when only the six affected auxiliaries are considered, as can easily be seen from the average (dotted line). Do not even shows a contraction rate of 93% in the 1730s, though this admittedly is an extreme case. Nevertheless, there is also a noticeable overall drop in negative contraction over the 100 years considered here, from an average of 74% in the 1730s to as low as 12% in the 1830s. This is rather surprising, since one would have expected a steady rise, given that negative contraction is so common today. It could mean that the early Proceedings are more representative of spoken language, possibly because the language became more and more formal as the City of London gained control over the publication and it became an official document (see Section 2). Further studies will have to show whether other corpora show a similar kind of behaviour with regard to negative contraction in the 18th century.
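The per-decade percentages plotted in Figure 6 can be recomputed from Table 8. The sketch below does this for two of the six auxiliary pairs only (don't/do not and can't/cannot); the remaining pairs would be handled in the same way, and the names used are illustrative.

```python
# Contracted / uncontracted counts per decade, copied from Table 8.
decades = [1732, 1740, 1750, 1760, 1770, 1780, 1790, 1800, 1810, 1820, 1830]
pairs = {
    "don't": ([962, 1043, 1530, 738, 1478, 710, 4115, 1314, 789, 1161, 1388],
              [72, 709, 782, 1717, 1386, 8941, 4364, 4133, 3224, 7081, 5697]),
    "can't": ([625, 1081, 1209, 450, 323, 202, 26, 38, 217, 130, 146],
              [271, 1035, 911, 1718, 2139, 7983, 6984, 4698, 2393, 4959, 4282]),
}

def rates(contracted, uncontracted):
    """Percentage of contracted forms per decade."""
    return [100 * c / (c + u) for c, u in zip(contracted, uncontracted)]

by_aux = {aux: rates(c, u) for aux, (c, u) in pairs.items()}
for aux, series in by_aux.items():
    print(aux, [f"{d}: {r:.0f}%" for d, r in zip(decades, series)])
```

For the 1730s this yields 93% for don't, the extreme case mentioned above.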

The decline in negative contraction may also be a function of the increasing control that the City authorities exerted over the Proceedings in the course of the 18th century. Robert Shoemaker (p.c. 2007-05-20) suggests that 'the character and audience of the Proceedings changed significantly between 1720 and 1778, when they entered the period of close City control. As they became longer and more expensive during this period, the language became more respectable'.

3.3.2 Internal consistency

Figure 6 shows a strong fluctuation of contraction rates, especially as far as don't is concerned. To a certain degree this is the result of the small intervals chosen here (decades), but there may also be other reasons: as mentioned before (Section 4), there are several layers of filters that stand between the speech event at the Old Bailey and the linguist trying to reconstruct the spoken language of the period. These are the filters imposed by the scribes, by the proofreaders, the typesetters and printers, and by the publishers. It is to the influence of these persons that I will turn in the following. From the late 1730s, the title pages of the Proceedings regularly mention the scribe and/or printer. [16] Table 9 gives an overview of this information:

Table 9. Scribes and printers in the Proceedings, as indicated on the title pages.
From	to	Scribe	Printer
16781211			G. Hills
16920406			Thomas Braddyl
17261207			J. Read
17381206	17401015		T. Cooper
17410405	17411014		J. Roberts
17411204	17420603		T. Payne
17420714	17430114		T. Cooper
17430223	17451016		M. Cooper
17451204	17460117		C. Nutt
17471209	17481207		M. Cooper
17490113	17551022	Thomas Gurney	M. Cooper
17551204	17571026	Thomas Gurney	J. Robinson
17571207	17591024	Thomas Gurney	M. Cooper
17591205	17601022	Thomas Gurney	G. Kearsley
17601204	17611021	Thomas Gurney	J. Scott
17730707	17750712	Joseph Gurney
17751206	17771015	Joseph Gurney	William Richardson
17771203	17811017	Joseph Gurney
17811205	17820410	William Blanchard
17820515	17820703	Joseph Gurney
17820911	17921031	E. Hodgson
17921215	17951028	Manoah Sibly	Henry Fenwick
17951202	17970215	Marsom & Ramsey	W. Wilson
17970426	18011028	William Ramsey	W. Wilson
18011202	18051030	Ramsey & Blanchard	W. Wilson
18051204	18150510	Job Sibly	R. Butters
18150621	18160918	J.A. Dowling	R. Butters
18161204	18280110	Henry Buckler	T. Booth
18280221	18300415	Henry Buckler	Henry Stokes
18300527	18310512	Henry Buckler	Henry Stokes & George Titterton
18310630	18330411	Henry Buckler	George Titterton
18330516	18331128	Henry Buckler	William Johnston
18340102	18341016	Henry Buckler	William Tyler

Joseph Gurney is the first scribe to be mentioned in the Proceedings, on the title page of 17730908, but we know that his father had been taking shorthand notes of the sessions since at least 1749, when his name first appears in an advertisement at the end of the Proceedings: 'SHORT-HAND Taught in an easy and expeditious Method, by Thomas Gurney, the Writer of these Proceedings' (17490113, my emphasis). Similar advertisements appeared in 129 further issues, so Thomas Gurney was responsible for recording the trials from the mid-18th century onwards. After his death in 1770, his bookbinder son Joseph took over the business of recording and publishing the Proceedings (see the advertisement in 17700711: 'By the late Mr. THOMAS GURNEY, upwards of Twenty Years Writer of these PROCEEDINGS'). For 85 years, therefore, we know who the scribes were and, for an even longer period, who printed the Proceedings. [17]

3.3.2.1 Micro-study of the scribes

With this information we can now proceed to a micro-study of the material in the OBC in order to establish whether there is a significant correlation between the scribes/printers and the linguistic detail captured in the Proceedings. Perhaps the most noticeable fluctuation in the development of negative contraction shown in Figure 6 is the sudden drop of contraction from an average of 29% in the 1770s to a mere 4% in the 1780s, only to rise again to 23% in the 1790s. This could be an indication of an internal inconsistency of the corpus and will therefore be the focus of the first case study. Table 10 splits the Proceedings in the 1780s up by scribe and indicates the respective figures for negative contraction:

Table 10. Negative contraction in speech, in the 1780s, by scribe.
	Gurney I 17800112-	Blanchard 17811205-	Gurney II 17820515-	Hodgson 17820911-
can't	78	39	14	71
cannot	725	181	47	7030
don't	211	163	35	301
do not	919	83	50	8604
shan't	2	0	0	0
shall not	24	6	4	252
won't	5	0	1	7
will not	107	34	6	957

Joseph Gurney took down 15 proceedings from January 1780 to October 1781 and 3 proceedings from May 1782 to July 1782. The four sessions from December 1781 to April 1782 were transcribed by William Blanchard. However, E. Hodgson was the scribe who was responsible for the bulk of the Proceedings in the 1780s, 67 in all. For the present purposes we will have to assume that apart from the change in the person of the scribe all other parameters remain equal, i.e. we will idealize and assume that there is no significant language change within one decade and that the sociobiographical composition of the trial participants remained the same throughout. Figures 7 and 8 chart the percentages of negative contraction for can't and don't:

Figure 7. Negative contraction of can not, in the 1780s, by scribe.

Overall distribution: p ≤ 0.001. The difference B↔GII is not significant (p ≤ 1). The difference GI↔GII is significant (p ≤ 0.01). The differences between all other pairs are highly significant (p ≤ 0.001).
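The pairwise comparisons reported for Figure 7 can be sketched as follows, using the can't/cannot counts from Table 10. As the test variant is not specified in the text, the sketch assumes an uncorrected Pearson Chi-square with one degree of freedom; the scribe labels follow the abbreviations used in the note above, with H for Hodgson.

```python
import math
from itertools import combinations

# (can't, cannot) per scribe in the 1780s, from Table 10.
cant = {"GI": (78, 725), "B": (39, 181), "GII": (14, 47), "H": (71, 7030)}

def p_value(x, y):
    """p-value of an uncorrected Pearson chi-square test on two count pairs."""
    (a, b), (c, d) = x, y
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of the chi-square distribution for one degree of freedom.
    return math.erfc(math.sqrt(stat / 2))

pairwise = {(s, t): p_value(cant[s], cant[t]) for s, t in combinations(cant, 2)}
for (s, t), p in pairwise.items():
    print(f"{s} vs {t}: p = {p:.4f}")
```

Under these assumptions the B↔GII comparison is the only one that is not significant at the 0.05 level.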

Figure 8. Negative contraction of do not, in the 1780s, by scribe.

Overall distribution: p ≤ 0.001. The differences between all pairs are highly significant (p ≤ 0.001).

Chi square tests show that, except for Blanchard↔Gurney II in Figure 7, all differences in the rate of negative contraction between the scribes are significant. The first finding is that there is a much lower rate of negative contraction in Hodgson's Proceedings than in those of Gurney and Blanchard in the 1780s. Averaging Gurney and Blanchard and setting them against Hodgson yields the following picture:

can't Gurney-Blanchard 12.1% ↔ Hodgson 1.0%
don't Gurney-Blanchard 28.0% ↔ Hodgson 4.3%
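
As a cross-check, the can't percentages above and two of the pairwise results reported under Figure 7 can be recomputed from the counts in Table 10. The following is only a minimal sketch: it assumes an uncorrected Pearson chi-square test on a 2×2 table (the text does not say whether a continuity correction was applied), and the function and variable names are illustrative rather than part of the original analysis.

```python
import math

# Counts from Table 10 as (can't, cannot) per scribe.
TABLE_10_CAN = {
    "GI": (78, 725),    # Gurney I
    "B": (39, 181),     # Blanchard
    "GII": (14, 47),    # Gurney II
    "H": (71, 7030),    # Hodgson
}

def contraction_rate(counts):
    """Percentage of contracted forms, pooled over (contracted, full) pairs."""
    contracted = sum(c for c, _ in counts)
    full = sum(f for _, f in counts)
    return 100 * contracted / (contracted + full)

def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the table [[a, b], [c, d]].

    With one degree of freedom the p-value is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

pooled = contraction_rate([TABLE_10_CAN[s] for s in ("GI", "B", "GII")])
hodgson = contraction_rate([TABLE_10_CAN["H"]])
print(f"can't: Gurney-Blanchard {pooled:.1f}%, Hodgson {hodgson:.1f}%")

for left, right in (("B", "GII"), ("GI", "GII")):
    chi2, p = chi_square_2x2(*TABLE_10_CAN[left], *TABLE_10_CAN[right])
    print(f"{left} vs {right}: chi2 = {chi2:.2f}, p = {p:.4f}")
```

Under these assumptions the B↔GII comparison stays well above the 0.05 threshold, while GI↔GII falls between 0.001 and 0.01, in line with the significance levels quoted under Figure 7.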

It seems clear that the drop in negative contraction in can not and do not in the 1780s is due to Hodgson. This scribal effect is much more pronounced in do not contraction, since the difference between Gurney/Blanchard and Hodgson is more than twice as large (23.7 percentage points) as in can not (11.1 percentage points). Note also with regard to the latter that there is a tendency towards lower significance in the three Gurney↔Blanchard pairs, with B↔GII not significant and GI↔GII significant 'only' at the 0.01 level. There is therefore some, albeit small and rather variable, measure of agreement between Gurney and Blanchard insofar as negative contraction with can is concerned, but no agreement between the two on the one side and Hodgson on the other, which could be an indication of Hodgson's lower faithfulness with regard to the recording of instances of can't.

Unfortunately, the Proceedings do not indicate the printers in the period considered here, so it is impossible to test whether they may also have played a role in the differences. What is needed, then, are short periods in the Proceedings in which the printer stays the same but the scribes change and, conversely, a period where the scribe is the same but the printers change. [18] As to the first case, W. Wilson printed the 17951202-18051030 Proceedings, recorded by three different scribes in succession, Marsom & Ramsey, William Ramsey, and Ramsey & Blanchard. The other case is afforded by Thomas Gurney, who recorded, among others, the ten years between 17511204 and 17611021, with M. Cooper, J. Robinson, G. Kearsley, and J. Scott as printers.

3.3.2.2 Micro-study of 'same printer/different scribes'

I will start with a study of a 'same printer/different scribes' period. Figures 9-11 illustrate this case for 17951202-18051030 with regard to negative contraction of can not, do not, and will not:

Figure 9. Negative contraction of can not, 1795-1805, by scribe.

Overall distribution: p ≤ 0.01. The differences M&R↔R&B and WR↔R&B are not significant (p ≤ 0.2). The difference M&R↔WR is significant (p ≤ 0.01).

Figure 10. Negative contraction of do not, 1795-1805, by scribe.

Overall distribution: p ≤ 0.001. The differences between all pairs are highly significant (p ≤ 0.001).

Figure 11. Negative contraction of will not, 1795-1805, by scribe.

Overall distribution: p ≤ 0.05. The differences M&R↔WR (p ≤ 0.2) and M&R↔R&B (p ≤ 1) are not significant. The difference WR↔R&B is significant (p ≤ 0.025).

All three scribes show a very low percentage of can't, a trend which had already begun with E. Hodgson in the 1780s. The chi-square test shows a significant difference only in M&R↔WR, but this difference is very small (1.0%), meaning that the scribes represented negative contraction of can not in a similar way. The scribes also agree on the percentage of won't, around 10%, the only significant difference being WR↔R&B (at the 0.025 level). Interestingly, don't offers a different picture, with (1) a generally higher contraction rate than can not and will not, and (2) a pronounced and highly significant difference between the three sub-periods/scribes. One way to interpret this is that scribes can be more faithful in the representation of some linguistic features (percentages of can't and won't in this case) than of others (don't). Presumably, several factors play a role in this 'differential faithfulness', including the linguistic and social salience of the variants in question. Unfortunately, there is the complicating factor that Ramsey had a hand in all three sub-periods, and he may well have been the dominating factor in the teams with Marsom and with Blanchard.

3.3.2.3 Micro-study of 'different printers/same scribe'

The last case to be considered is that of 'different printers/same scribe'. For this, I will consider the 17511204-17611021 period, transcribed by Thomas Gurney and printed in turn by M. Cooper, J. Robinson, G. Kearsley, and J. Scott:

Figure 12. Negative contraction of can not, 1751-1761, by printer.

Overall distribution: p ≤ 0.001. The difference CII↔K is significant (p ≤ 0.025). The differences between all other pairs are highly significant (p ≤ 0.001).

Figure 13. Negative contraction of do not, 1751-1761, by printer.

Overall distribution: p ≤ 0.001. The differences CI↔R (p ≤ 1), CII↔K (p ≤ 1), and CII↔S (p ≤ 0.2) are not significant. The difference K↔S is significant (p ≤ 0.05). The differences between all other pairs are highly significant (p ≤ 0.001).

Figures 12 and 13 are interesting in that they show a difference between a relatively high rate of negative contraction with printers CI and R, but a comparatively low rate with CII, K, and S. In other words, with Thomas Gurney as the scribe throughout, the differences in contraction rates must be due to the printers. Note also that in both figures there is a significant difference between the two sub-periods printed by Cooper. This demonstrates that the Proceedings show variation in the representation of a linguistic variable even though scribe and printer are the same.

3.3.2.4 Summary

In sum, the variation in negative contraction presented by the three test cases considered above can perhaps best be captured with what I have called 'differential faithfulness'. On the intra-scribal level this means that individual scribes and printers can be more faithful with regard to the representation of some linguistic variants (maybe because of the variants' greater social or linguistic salience or indexical function) than with regard to others. On the inter-scribal level there may be agreement between different scribes/printers only on certain variables and not on others, as the Marsom/Ramsey/Blanchard test case has shown (agreement on a low contraction rate for can not, but disagreement on the rate for do not). An idealizing cross-figure comparison of negative contraction as represented by different scribes is given in Table 11, where a − stands for a very low rate of contraction and a + for a comparatively high rate:

Table 11. Summary: negative contraction, 1780-1805, by scribe.
Auxiliary	Gurney/Blanchard 1780-1782	Hodgson 1782-1789	Marsom/Ramsey/Blanchard 1795-1805
can't	+	-	-
don't	+	-	+

In a comparatively short period of time (25 years) we find considerable variation between three groups of scribes: Gurney/Blanchard show a relatively high percentage of negative contraction, Hodgson a very low rate, and Marsom/Ramsey/Blanchard occupy an intermediate position with little can't contraction but a rather high percentage of contraction in don't.

4. Creating the (linguistic) Old Bailey Corpus [19]

The Old Bailey Corpus will be searchable online with unrestricted access. The reason for not freely disseminating the corpus itself is that the copyright in the electronic text version is owned by the University of Hertfordshire and the University of Sheffield. The following is a graphic representation of how the online search function will operate:

Figure 14. The Old Bailey Corpus online search function.

(1) The web-based user interface

The web-based user interface will allow the researcher to specify a search string to be found in the spoken passages of the Proceedings. The user will also be able to limit/sort hits according to sociobiographical parameters. There will also be an advanced search option, where users can set up their own multidimensional searches: for instance, one may search for more than one variant of a sociobiographical variable, or form one's own social classes by grouping certain professions together.

(2) MySQL Database 1: sociobiographical data

Because of considerations of time, a full-text search through 40+ million words for every query is out of the question. We therefore decided to take a database approach: from the user interface, the sociobiographical specifications are sent to a MySQL Database 1, a list of all speakers in the Proceedings and their sociobiographical characteristics, together with a unique speaker-ID. Those speakers whose characteristics fit the parameters specified in the query are identified and their IDs memorized.

(3) MySQL Database 2: speech

The memorized speaker IDs are then sent to Database 2, which groups all utterances that individual speakers made under the corresponding speaker ID. The software then retrieves the search string and searches the utterances of only those speakers selected in step 2.

(4) Concordance

The results will be output as a concordance and/or a numerical table, generated according to the sociobiographical parameters chosen in step 1.
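
The two-stage lookup described here can be illustrated with a small sketch. It is not the actual implementation: Python's built-in sqlite3 module stands in for MySQL, both "databases" are modelled as tables in one in-memory database, and all table names, column names, and speaker IDs are hypothetical. The utterances are taken from the 1733 trial quoted in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for Database 1: one row per speaker (hypothetical schema).
    CREATE TABLE speakers (speaker_id TEXT PRIMARY KEY, gender TEXT, profession TEXT);
    -- Stand-in for Database 2: utterances grouped by speaker ID.
    CREATE TABLE utterances (speaker_id TEXT, text TEXT);
""")
conn.executemany("INSERT INTO speakers VALUES (?, ?, ?)", [
    ("t17330510_0001", "m", None),        # John Underwood (ID invented)
    ("t17330510_0002", "f", "Servant"),   # Sarah Sanders (ID invented)
])
conn.executemany("INSERT INTO utterances VALUES (?, ?)", [
    ("t17330510_0001", "The Prisoner was my Servant"),
    ("t17330510_0002", "My Box had no Lock nor Hinges"),
    ("t17330510_0002", "I fetched it out of pawn with my own Money."),
])

def search(conn, search_string, gender):
    """Return (speaker_id, utterance) pairs for one sociobiographical filter."""
    # First stage: collect the IDs of all speakers matching the filter.
    ids = [row[0] for row in conn.execute(
        "SELECT speaker_id FROM speakers WHERE gender = ?", (gender,))]
    if not ids:
        return []
    # Second stage: search only the utterances of those speakers.
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT speaker_id, text FROM utterances "
        f"WHERE speaker_id IN ({placeholders}) AND text LIKE ? ORDER BY rowid",
        (*ids, f"%{search_string}%"))
    return rows.fetchall()

for speaker_id, text in search(conn, "Box", "f"):
    print(speaker_id, text)
```

The point of the design is that the full-text comparison only ever touches the utterances of pre-selected speakers. With both tables in a single database the two stages could equally be written as one JOIN.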

The digitized transcripts of the Old Bailey Proceedings obtained from Robert Shoemaker and Tim Hitchcock are already heavily annotated, as illustrated by the following excerpt, from the beginning of a trial in 1733:

<unit id="t17330510-1"> <trial> <info> <identifier>t17330510-1</identifier> <source>173305100002</source> <header>Sarah Sanders, theft: specified place, 10 May 1733.</header> <pfro>17330510</pfro> <ntrial>2</ntrial> <psession>17330404</psession> <nsession>17330628</nsession> </info>

1. <person gender="f"> <defend gender="f"> <given>Sarah </given> <surname>Sanders </surname> </defend> </person>, was indicted for <off> <theft type="specified place">stealing a Portugal Piece of Gold, value 36 s. a Gold Ring, value 10 s. a Gold Ring set with Vermillion Stones, value 7 s. 6d. a Silver Girdle Buckle, value 10 s. three Aprons, a Shirt, a Shift, and 2 Ells of Holland, the Goods of <person gender="m"> <victim gender="m"> <given>John </given> <surname>Underwood </surname> </victim> </person>, in his House</theft> </off>, <cd>March 4</cd>.

John Underwood. The Prisoner was my <deflabel>Servant</deflabel>, she came to me very well recommended, but had not staid above ten Weeks before several [. . .] (17330510-1)

Spoken language is not tagged in this version, but there are a number of sociolinguistically useful tags, most importantly

the name of the speaker: <given>Sarah</given> <surname>Sanders</surname>
the year of the trial: <identifier>t17180110-1</identifier>
the speaker age: <age>43</age>
the speaker gender: <defend gender="f">
the speaker profession or status: <deflabel>Servant</deflabel>
the origin of the speaker: <crimeloc>Tottenham</crimeloc>

Speaker origin is probably the most unreliable parameter since it has to be established indirectly, through the crime location, <crimeloc>. The vast majority of speakers in the Proceedings resided in or around London anyway, the jurisdiction of the Old Bailey, and there is often no way of telling whether the crime location corresponds to the speaker's area of residence. Also, London's population was characterized by immigration both from the British Isles and elsewhere in the world. Most immigrants arrived as young adults and may have lost some of the characteristics of their original varieties after a few years of residence in the capital. However, since the origin of the speakers within London or Middlesex may be interesting for micro-sociolinguistic studies of the geographical diversity of London English in the 18th and 19th centuries, it was decided to keep this parameter.
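
To make the use of these tags concrete, the following sketch pulls the year, name, gender, and status label out of a shortened, well-formed fragment modelled on the 1733 excerpt above. It relies on Python's standard xml.etree.ElementTree module; the fragment has been closed off by hand, and the function name and the layout of the returned record are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened fragment modelled on trial t17330510-1; closing tags added by hand.
SAMPLE = """<unit id="t17330510-1"><trial>
<info><identifier>t17330510-1</identifier></info>
<person gender="f"><defend gender="f"><given>Sarah </given><surname>Sanders </surname></defend></person>
<deflabel>Servant</deflabel>
</trial></unit>"""

def defendant_record(xml_text):
    """Collect the sociolinguistically useful tags for the defendant."""
    root = ET.fromstring(xml_text)
    identifier = root.findtext(".//identifier")
    defend = root.find(".//defend")
    status = root.findtext(".//deflabel")
    return {
        # The four digits after the leading 't' give the year of the trial.
        "year": int(identifier[1:5]),
        "given": defend.findtext("given").strip(),
        "surname": defend.findtext("surname").strip(),
        "gender": defend.get("gender"),
        "status": status.strip() if status else None,
        # <age> is absent from this fragment, so the value is None here.
        "age": root.findtext(".//age"),
    }

record = defendant_record(SAMPLE)
print(record)
```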

4.2 Automatic identification of spoken language in the Proceedings

Identifying and tagging direct speech was the first step in turning the Proceedings into the OBC. The objective was to extract the spoken passages from the Proceedings and assign them to individual speakers (i.e. compiling the information stored in MySQL Database 2; see Figure 14).

Because of the size of the material (52 million words) and limited resources, manual identification of spoken language in the Proceedings was out of the question. In search of alternatives, I first considered developing a software program that identifies and tags speech passages on the basis of morpho-syntactic features common in spoken language, such as first and second person pronouns. It soon turned out that such an approach would have been far too complex, error-prone and time-consuming. Instead, it was decided to base the process on formal rather than linguistic patterns: we created a Perl script that tagged spoken language on the basis of keywords and patterns in the xml-structure/layout of the Proceedings. [20] The complete algorithm of the tasks performed by this tagger can be found in Appendix B, but the following paragraphs will give a general impression of the approach taken here, illustrated by selected examples.

A promising procedure to identify spoken language is to look for metalinguistic information that is present in any printed text (like new paragraphs to indicate speaker changes). Obviously, the first strategy that came to mind was to tag everything between inverted commas as speech, but it turned out that inverted commas are extremely rare in the Proceedings. Compare the plain-text excerpt from 17330510-1, the trial already cited above:

1. Sarah Sanders, was indicted for stealing a Portugal Piece of Gold, value 36 s. a Gold Ring, value 10 s. a Gold Ring set with Vermillion Stones, value 7 s. 6d. a Silver Girdle Buckle, value 10 s. three Aprons, a Shirt, a Shift, and 2 Ells of Holland, the Goods of John Underwood, in his House, March 4.

John Underwood. <speech>The Prisoner was my Servant, she came to me very well recommended, but had not staid above ten Weeks before several Things were missing. I examined her about them, […] and then she confessed further, that she had taken from me a Portugal Piece of Gold that goes for 36 s. I was more surprised at this, than at all the rest, because I could not imagine which Way she could come at it.</speech>

Mrs. Underwood. <speech>Missing my Stockings and Aprons, I asked the Prisoner if she knew any thing of them? […] she said, the Aprons were put in her Box by Mistake.</speech>

Prisoner. <speech>My Box had no Lock nor Hinges, it always stood open; I had been Ironing late at Night, and put some of my Mistress's Things among my own, by Mistake: As for the Ring, […] I fetched it out of pawn with my own Money.</speech>

Henry Glover. <speech>The Prisoner was my Servant above eight Years, she all that Time bore a very good Character; and I had such an Opinion of her Honesty, that I was in the greatest Surprize when I heard of this Charge against her.</speech>

There are no inverted commas here, but one regularity in this text (and elsewhere in the Proceedings) is that every speech act occupies one paragraph. In other words, a speaker's utterance ends at a paragraph break, and </speech> tags can accordingly be inserted in this position, as shown in the excerpt above.

The situation is more complicated when we look for the start of the speech act, because it does not necessarily coincide with the beginning of a paragraph. Compare the second paragraph, where John Underwood's statement starts with the third word, the first two ('John Underwood.') simply identifying him as the speaker. That is, the <speech> tag has to be inserted after the speaker name. Again, paragraphs starting with the speaker's name are a fairly common pattern in the Proceedings and were used for tagging purposes. For the purposes of the Perl script, a name was defined simply as a string of letters. The script assumes that paragraphs can start with either one name followed by a full stop ('Smith. I was walking …') or two names followed by a full stop ('John Smith. I was walking …'). While this yields correct results in a good number of cases, there are also a variety of exceptions that have to be taken into account. For instance, paragraph 3 in the excerpt above begins 'Mrs. Underwood.' The standard routine identifies 'Mrs.' as a name followed by a full stop. However, <speech> should be inserted after 'Underwood.', not after 'Mrs.' To avoid misplacement of the tag in such cases, the script checks a list of exceptions to the standard routine before the tag is inserted. This list includes strings at the beginning of paragraphs such as 'Defendant.' or 'Prisoner.' The Perl script also makes use of tags in the electronic version of the Proceedings prepared by Tim Hitchcock and Robert Shoemaker. In this version, many names are tagged as given names and surnames, so the presence of <given> and <surname> at the beginning of a paragraph is of help in placing the <speech> tag.
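
The name-based routine can be approximated in a few lines. The sketch below is a Python re-imagining for illustration, not the Perl script itself (whose full algorithm is given in Appendix B): the list of titles is a hypothetical stand-in for the script's exception list, and the existing <given>/<surname> tags are ignored.

```python
import re

# Hypothetical stand-in for the script's exception list: abbreviations that
# end in a full stop but do not complete the speaker label.
TITLES = ("Mrs", "Mr", "Dr")
_title = "|".join(TITLES)

# Optional title, then one or two 'names' (strings of letters), then a full stop.
SPEAKER_PREFIX = re.compile(
    rf"^(?:(?:{_title})\.\s+)?[A-Za-z]+(?:\s+[A-Za-z]+)?\.\s+"
)

def tag_paragraph(paragraph):
    """Wrap everything after a paragraph-initial speaker label in <speech> tags.

    Paragraphs that do not start with a speaker label are returned unchanged.
    """
    match = SPEAKER_PREFIX.match(paragraph)
    if match is None:
        return paragraph
    prefix, rest = paragraph[:match.end()], paragraph[match.end():]
    return f"{prefix}<speech>{rest}</speech>"

for par in (
    "John Underwood. The Prisoner was my Servant.",
    "Mrs. Underwood. Missing my Stockings and Aprons.",
    "Prisoner. My Box had no Lock nor Hinges.",
    "1. Sarah Sanders, was indicted for stealing a Portugal Piece of Gold.",
):
    print(tag_paragraph(par))
```

A purely formal rule like this will also fire on any paragraph that happens to open with one or two words and a full stop, which is one reason why a list of exceptions and the existing name tags are needed in practice.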

Question-answer sequences constitute a frequent exception to the general rule that one utterance occupies one paragraph: in the Proceedings, both turns of the adjacency pair are often found in the same paragraph. The approach described so far cannot identify the beginnings and ends of two utterances in the same paragraph, but question-answer sequences also show regular patterns that can be of help in doing this. In the original Proceedings, most of these sequences are marked, with slight variations, 'Q. – A.' and the Perl script uses this metalinguistic information to insert the <speech> tags at the appropriate places:

Q. <speech>Are you sure that was your horse that you lost?</speech> - A. <speech>Yes.</speech>

Q. <speech>Had you at all known the prisoner before?</speech> - A. <speech>I had seen him in my yard two years before; I believe his friends live in the next parish.</speech>

Q. <speech>What was he?</speech> - A. <speech>I heard that he was a shoe-maker.</speech> (18050220-4)
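
A comparable sketch for the question-answer pattern is given below. Again this is an illustrative Python approximation rather than the original script: it only handles the plain 'Q. … - A. …' layout shown above, not the "slight variations" found in the Proceedings, and it normalizes the separator to a hyphen.

```python
import re

# A dash surrounded by whitespace and followed by the answer marker 'A.'.
QA_SEPARATOR = re.compile(r"\s+[-\u2013]\s+(?=A\.\s)")
# A turn consists of its marker ('Q.' or 'A.') followed by the utterance.
TURN = re.compile(r"^([QA]\.)\s+(.*)$", re.S)

def tag_question_answer(paragraph):
    """Tag both turns of a 'Q. ... - A. ...' paragraph as speech.

    Paragraphs that do not fit the pattern are returned unchanged.
    """
    tagged = []
    for turn in QA_SEPARATOR.split(paragraph.strip()):
        match = TURN.match(turn)
        if match is None:
            return paragraph
        marker, utterance = match.groups()
        tagged.append(f"{marker} <speech>{utterance}</speech>")
    return " - ".join(tagged)

print(tag_question_answer("Q. What was he? - A. I heard that he was a shoe-maker."))
```

Because the separator must be surrounded by whitespace and followed by 'A.', hyphens inside words such as shoe-maker do not split the turn.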

4.3 Semi-automatic sociobiographical annotation

The main task in creating the OBC is to gather sociobiographical speaker data and to link these with the speech sections in the Proceedings, as described in Figure 14. Again, because of the large size of the corpus a completely manual annotation was impracticable. Instead, an annotation tool was developed that automates this process as far as possible. [21] With some adaptations this tool will also be useful for similar annotation purposes in other corpora. Figure 15 shows a screenshot of the Old Bailey Tool, highlighting its main components and functions.

The text window and the tag assistant are the main components of the Old Bailey Tool. First, an xml-file is loaded into the text window. For easier reading, tags can be faded to grey or highlighted in various colours/styles, as shown in the screenshot. The next step is to have the speaker-ID generator in the tag assistant extract all tagged names from the xml-file and assign a unique speaker-ID to them. An alphabetical list of names is shown in the bottom left window of the tag assistant ("Names: alphabetical"), with the speaker-IDs next to them. For example, in the screenshot, Francis Perry has the ID 69. Clicking the genderizer button then automatically assigns the sex to the speakers by consulting a list of about 7,300 male and female first names and their orthographical variants. This captures more than 95% of the names in the Proceedings; the rest can be added manually.

The annotation process starts after these preparatory steps. The buttons "next (down)" and "next (up)" will let the user jump from one <speech> tag to the next in the xml-file. In the screenshot, the current position is the <speech> tag in red ("Prisoner's Defence"). The tag assistant automatically checks for names around the current position and shows them in the "Names found near current tag" window. The reason for this is that there is a high likelihood that the speaker at the current position is identical with one of these persons, which saves the time of scrolling up and down in the alphabetical list. By double-clicking on the appropriate name, either in the alphabetical list or in the "names near current tag" list, a unique speaker-ID will be inserted into the <speech> tag, consisting of the file name (t17750426 in this case) followed by an underscore and the ID shown next to the speaker in the alphabetical name list (_0069 in the case of Francis Perry). As one goes along, the "Recent selections" name list will display the names of speakers whose ID was inserted previously, again because the likelihood is high that the current speaker is again one of these.
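
The preparatory steps and the format of the inserted identifier can be summarized in a short sketch. It is not the code of the annotation tool: the three-entry name list is a toy stand-in for the list of about 7,300 first names, numbering the names in alphabetical order is an assumption, and all function names are hypothetical. Only the ID format (file name, underscore, four-digit number) follows the description above.

```python
# Toy stand-in for the genderizer's list of first names and variants.
FIRST_NAMES = {"francis": "m", "henry": "m", "sarah": "f"}

def assign_numeric_ids(tagged_names):
    """Give every distinct name a number (alphabetical order is assumed here)."""
    return {name: number
            for number, name in enumerate(sorted(set(tagged_names)), start=1)}

def guess_gender(full_name):
    """Look up the first name; None means the sex has to be added manually."""
    first_name = full_name.split()[0].lower()
    return FIRST_NAMES.get(first_name)

def speaker_id(file_name, number):
    """Combine file name and zero-padded number, e.g. t17750426_0069."""
    return f"{file_name}_{number:04d}"

numbers = assign_numeric_ids(["Henry Dixon", "Francis Perry", "Henry Dixon"])
for name, number in numbers.items():
    print(speaker_id("t17750426", number), name, guess_gender(name))
```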

The context often contains sociobiographical information on the speakers. For example, witnesses frequently begin their statements with 'I am a (profession label)', as does Henry Dixon in the screenshot ('I am a pawnbroker'). While inserting speaker-IDs into the xml-file, this information has to be gathered and inserted in the age, profession, and location fields next to the alphabetical list. Speaker-IDs and the sociobiographical details are stored in a database which can be exported for further processing. In addition to this information, the database will contain the names of the scribe and printer of the respective Proceedings, to help corpus users assess the validity of their findings, for instance as demonstrated in Section 3.3.2.

Figure 15. Tagging sociobiographical data: the Old Bailey Tool (OBT).

5. Summary and conclusion

The Proceedings of the Old Bailey constitute a large body of texts, whose speech passages are arguably as near as we can get to the everyday language predating the invention of audio recording technology.

This article started with an overview of the historical background and structure of the OBC, a 50+ million word linguistic corpus based on the Proceedings. Direct speech becomes more common in the 1720s, from which point on almost 85% of the text is spoken language. Age is regularly mentioned only from the 1790s, but it is hoped that more information for earlier years can be added during annotation of the corpus. More than 70% of the speakers are men, but this imbalance is remedied by the size of the OBC, which ensures that even with a low percentage, the absolute number of women is still high enough for historical sociolinguistic studies. In addition, the large variety of occupation and status labels of the participants in the trials will be useful in forming social classes.

Section 3 dealt with assessing the reliability of the OBC as a source of spoken English in the 18th and 19th centuries. It looked at the filters imposed in the different stages of the genesis of the Proceedings and investigated external fit and internal consistency. With regard to the filter effect, the simultaneity of the speech event and its recording as well as rapid publication after the sessions at the Old Bailey are arguments in favour of a rather accurate portrayal of spoken language in the Proceedings. On the other hand, the investigation of Gurney's shorthand system showed that even a supposedly verbatim mode of recording did not in all cases result in an absolutely faithful representation of the speech event.

The external fit of the Proceedings was examined by comparing a sample trial with an alternative account of the same court case. Although there is some verbal overlap between the two versions, there are also substantial differences, including omissions as well as verbal, morphological, and syntactic divergences. While this in itself does not necessarily discredit the Proceedings as a source of authentic spoken language (it could well be that the alternative account is the less reliable one), it shows that trial accounts cannot simply be taken at face value but have to be evaluated carefully. In a second step to assess the external fit, negative contraction was chosen as a diagnostic linguistic feature of spoken language. An important finding was that negative contraction is (almost) exclusively found in the speech passages of the Proceedings, demonstrating that the scribes did systematically differentiate between speech and prose, which lends some credibility to their portrayal of spoken language. A comparison of the OBC findings with the CED showed further that both corpora agree in the auxiliaries that n't cliticizes to and show comparable (though not identical) rates of contraction for these.

Internal consistency was tested by micro-studies of negative contraction in three short sub-periods of the Proceedings — 1751-1761, 1780-1782, and 1795-1805 — in which either the scribe or the printer varied. The result was that scribes can differ from each other in the rate of negative contraction either across the board or with regard to individual auxiliaries. This differential faithfulness was seen as an indication that scribes can be more accurate in the representation of some linguistic features than of others. The third micro-study showed that variation in the rate of negative contraction can also be due to the influence of the printer. The conclusion to be drawn from all this is that the representation of linguistic features in the OBC, as in other trial collections, can be distorted by scribal and/or printer interference. Corpora including trial proceedings and studies based on such corpora have to take account of the fact that what looks like language variation and change may in fact be due to the influence of scribes and printers.

In compiling and annotating the OBC, the major task was to identify speech passages and link them to sociobiographical speaker parameters such as sex, age, or profession. Digitized and xml-encoded transcripts of the Proceedings were kindly provided by Robert Shoemaker and Tim Hitchcock, and identification of spoken language was achieved with the help of a Perl script that tagged these passages on the basis of keywords and patterns in the xml-structure/layout of the Proceedings. The Old Bailey Tagger was developed for speaker annotation. This is a tool that automates speaker identification and the collection of sociobiographical data. With some adaptations this tool will also be useful for similar annotation purposes in other corpora.

In spite of these caveats, trial proceedings are still among the few and best sources we have of spoken language before the advent of mechanical recording. Some studies suggest that comedy drama presents an even more faithful picture, but trial accounts have the advantage that they are based on a real, not an imagined, speech event. Even if they are not completely true to that speech event, they are at least guided by it, whereas dialogue in drama is for the most part simply invented.

Notes

[1] For detailed background information on the Old Bailey and the publication history of the Proceedings consult the excellent Old Bailey Proceedings Online, from where the information presented in this section has been taken.

[2] Eight-digit references are to the Proceedings reference number as used in the Old Bailey Proceedings Online. The first four digits indicate the year of the trial, followed by two digits each for the month and day. A hyphen followed by a number indicates the particular trial in a particular issue in the Proceedings.

[3] Digitization of the 1834-1913 Proceedings of the Central Criminal Court is under way. They will be launched in the spring of 2008. My plan is to integrate this material in the Old Bailey Corpus, which will then span almost 200 years of spoken Modern English.

[4] In the Corpus of English Dialogues 1560-1760, negative contraction in don't (1600-1639), shan't (1640-1679), and won't (1640-1679) is attested one subperiod earlier and with a generally higher relative frequency in the genre comedy than in the genre trial. Comedy also includes (rather infrequent) contractions in mayn't and mustn't, which are totally absent from CED (and OBC) trials.

[5] Note that <ll> (from the last sound in single plus the first sound in line) is indicated by boldening , the symbol for 'l'. This could be interpreted as both phonological — if we assume that assimilation and elision processes like /ll/ > [l] are not accounted for in the script but instead citation forms of the words are transcribed — and orthographic — if we assume that the primary guide is the conventional orthography and that double-l results from contraction of the last and first letter across a word boundary.

[6] Still, there are about half a dozen tokens of thou hast between 1749 and 1782, the period dominated by Thomas and Joseph Gurney.

[7] I am grateful to Robert Shoemaker for bringing these to my attention. At least from the middle of the 18th century on, the court officially relied on scribes to report minutes of previous sessions. Thus, Joseph Gurney was called to testify in 17710220-82, 17710410-64, 17711023-94, 17730113-78, 17780715-92, 17780715-93, and 17820515-60.

[8] An extract from this alternative account is also included in the file D5TAYLIF in the Corpus of English Dialogues 1560-1760.

[9] In the corresponding BNC academic law texts, the rate of contracted negative auxiliaries is as low as 3.2%.

[10] Only records from the year 1732 onwards will be analyzed in the following, since only in these has spoken language already been identified and tagged. 1732- and 1830- are incomplete decades.

[11] By far the largest number are spelt in one word, cannot (there are only 62 tokens of can not in the OBC 1732-1834 period). Brainerd (1989: 180-181) treats cannot as an early form of contraction analogous to shall not > shannot and will not > wonnot. However, since the spelling of cannot does not suggest any phonological change, it will here be treated as an uncontracted form.

[12] The contracted forms in the prose passages are all tokens of can't, one instance each found in 17520218, 17380906, 17400116, 17401015, and 17431012.

[13] Note that although the CED also contains some accounts of trials at the Old Bailey, these are alternative accounts and not identical with those in the Proceedings. The CED trials can therefore be compared with the OBC without danger of circularity.

[14] Interestingly, a quotation search in the Oxford English Dictionary on CD ROM shows the following first attestations of negative contraction with past forms of auxiliaries: couldn't 1800, didn't 1705, hadn't 1775, mightn't 1865, shouldn't 1628 (!), wasn't 1797, weren't 1845, and wouldn't 1794.

[15] In addition, the OBC has further contractions involving are and have, but these are rather marginal, which might also explain their absence from the much smaller CED.

[16] More information on who transcribed, printed and published the Proceedings can be gleaned from a close inspection of the main text and the advertisements, but for the present purpose Table 9 is sufficient.

[17] There are some gaps for the printers, though. The gap from 17771203 to 17921031 is due to the fact that the Proceedings were 'printed for' the respective scribes in this period. The actual printers are not identified.

[18] As there is no information about who actually typeset the Proceedings, the analysis has to proceed by printer.

[19] I would like to thank Eva Kapp, Manuel Müller, Magnus Nissel, Andreas Reuter, Ulrike Schneider, Tracy Sutphin, and Alexandra Tran, my student helpers at Giessen University, as well as my research assistants Thorsten Brato and Svetla Rogatcheva for tagging and annotating the OBC.

[20] Thanks go to Sumithra Velupillai, University of Stockholm, for developing the Perl script.

[21] I am grateful to Magnus Nissel for programming the Old Bailey Tagger.

Sources

BNC = British National Corpus. CQP-edition (Version 3.0), developed by Sebastian Hoffmann (University of Zurich) and Stefan Evert (University of Osnabrück), http://www.natcorp.ox.ac.uk/

CED = Corpus of English Dialogues 1560-1760, compiled under the Supervision of Merja Kytö (Uppsala University) and Jonathan Culpeper (Lancaster University), http://www.helsinki.fi/varieng/CoRD/corpora/CED/

OBC = Old Bailey Corpus (OBC), http://www.oldbaileyonline.org/

References

Barber, Charles. 1997. Early Modern English, second ed. Edinburgh: Edinburgh University Press.

Brainerd, Barron. 1989. "The contractions of not: A historical note". Journal of English Linguistics 22: 176-196.

Culpeper, Jonathan & Merja Kytö. 2000. "Data in historical pragmatics. Spoken interaction (re)cast as writing". Journal of Historical Pragmatics 1: 175-199.

Görlach, Manfred. 1991. Introduction to Early Modern English. Cambridge: Cambridge University Press.

Greenbaum, Sidney & Gerald Nelson. 2002. An Introduction to English Grammar, second ed. Harlow: Longman.

Gurney, Thomas. 1752. Brachygraphy: Or Short-writing, second ed. London: [no publisher].

Hitchcock, Tim & Robert Shoemaker. 2007a. "Publishing history of the Proceedings from their inception to 1834". Old Bailey Proceedings Online. http://www.hrionline.ac.uk/oldbailey/proceedings/publishinghistory.html, accessed 4 April 2007 [https://www.oldbaileyonline.org/static/Publishinghistory.jsp]

Hitchcock, Tim & Robert Shoemaker. 2007b. "The value of the Proceedings as a historical source". Old Bailey Proceedings Online. http://www.hrionline.ac.uk/oldbailey/proceedings/value.html, accessed 15 January 2007 [https://www.oldbaileyonline.org/static/Value.jsp]

Kytö, Merja & Terry Walker. 2003. "The linguistic study of Early Modern English speech-related texts. How "bad" can "bad" data be?" Journal of English Linguistics 31: 221-248.

Kytö, Merja & Terry Walker. 2006. Guide to A Corpus of English Dialogues 1560-1760. Uppsala: Uppsala University.

Lass, Roger. 1999. "Phonology and morphology". The Cambridge History of the English Language, vol. III: 1476-1776, ed. by Roger Lass, 56-186. Cambridge: Cambridge University Press.

Mazzon, Gabriella. 2004. A History of English Negation. London: Longman.

Schneider, Edgar W. 2002. "Investigating variation and change in written documents". The Handbook of Language Variation and Change, ed. by J. K. Chambers, Peter Trudgill & Natalie Schilling-Estes, 67-96. Oxford: Blackwell.

Shoemaker, Robert. (forthcoming). "The Old Bailey Proceedings and the representation of crime and criminal justice in eighteenth-century London". Journal of British Studies.

Strang, Barbara. 1970. A History of English. London: Methuen.

The Trial of Elizabeth Canning, Spinster, for Wilful and Corrupt Perjury; at Justice Hall in the Old-Bailey … 1754. London: John Clarke.

The tryal at large of John Ayliffe, Esq; for forgery; at Justice-Hall in the Old-Bailey, London: on Thursday the 25th day of October 1759. … 1759. London: M. Cooper.

Thomas, Alan. 1994. "English in Wales". The Cambridge History of the English Language, vol. V: English in Britain and Overseas: Origins and Development, ed. by Robert Burchfield, 94-147. Cambridge: Cambridge University Press.

Warner, Anthony. 1993. English Auxiliaries: Structure and History. Cambridge: Cambridge University Press.

Appendix A

Extract from the trial of John Ayliffe, Proceedings 17591024, and an alternative account of the same trial, The tryal at large of John Ayliffe (Tryal).

	Proceedings (718 words)	Tryal (1290 words)
1	Henry Thomas sworn.	Cryer. The evidence you shall give, between our Sovereign Lord the King and the prisoner at the bar, shall be the truth, the whole truth, and nothing but the truth; so help you God. Kiss the book. (Which he did.)
2		Mr Serjeant Davy. Mr Thomas, pray tell my Lord, and the Jury, what your business is.
3	Thomas. I am clerk to Mr Jones, a Stationer in the Temple.	Henry Thomas. I am clerk to Mr Jones, a Stationer, in the Temple.
4	Counsel for Crown. Look at this (giving him the counterpart of the genuine lease into his hand) and tell us whose hand-writing this engrossment is. The counsel for the Crown were Mr Aston, Mr Serjeant Davy, and Mr Wedderburn; and for the prisoner Mr Serjeant Hayward, Mr Stow, and Mr Lane.	Mr Serjeant Davy. Look upon this deed (giving him the counterpart executed by the prisoner.) Do you know the hand-writing of that engrossment; if so, tell us whose it is.
5	Thomas. This is my hand-writing, as far as the words (In witness whereof).	Henry Thomas. This is my hand-writing as far as to the words (In witness whereof,) and so on.
6	Q. How came you not to engross the whole?	Mr Serjeant Davy. How came you not to engross the whole?
7	Thomas. I can't particularly say that; sometimes we leave a blank by the gentlemens desire, perhaps they may add another covenant, or something of that sort, I can't recollect the reason for that.	Henry Thomas. I cannot positively say. – We sometimes leave out the conclusion by gentlemen's desire, in order that they may add a covenant, or some such thing, if it should be thought necessary; but I cannot particularly recollect the reason why the conclusion was omitted in this case.
8	Q. How many parts did you engross?	Mr Serjeant Davy. How many parts of this deed did you engross?
9	Thomas. I engrossed this part, and as it appears by our day-book, another part.	Henry Thomas. I engrossed this; and, as appears by our day-book, another part.
10	Q. By whose instructions did you engross it?	Mr Serjeant Davy. From whom did you receive the instructions for these engrossments?
11	Thomas. It was brought me by Mr Jones's son.	Henry Thomas. From Mr Jones's son.
12		Mr Serjeant Davy. Were the parts you engrossed agreeable to the draft?
13		Henry Thomas. They were.
14		Mr Serjeant Davy. Look at the words (thirty-five pounds) engrossed at length in the reddendum and covenant, for payment of the rent in this counterpart; were they part of the original ingrossment, or were they afterwards inserted in blanks left for that purpose?
15		Henry Thomas. I engrossed them at the same time that I engrossed the rest, agreeable to the draft.
16		Lord Chief Justice. Prisoner, would you ask this witness any question?
17	Serjeant Davy. Now we will prove this to be executed by Mr Ayliffe. John Fannen sworn.	Mr Serjeant Davy. We will now prove the execution of this counterpart by the prisoner. – Call Mr Fannen. (Who appeared, and was sworn.)
18		Alexander Wedderburne, Esq; of counsel for the Crown. Mr Fannen, look upon this deed; (giving him the counterpart) did you see it executed, and by whom?
19	Fannen. This deed (taking the counterpart in his hand) was executed by Mr Ayliffe in my presence.	John Fannen. This deed was executed by Mr Ayliffe in my presence.
20		Mr Wedderburne. Tell us, if you can recollect, when this was, and where?
21		John Fannen. I am not sure; but to the best of my remembrance, it was sometime the beginning of December last, at Mr Fox's house.
22	King's Counsel. Was there any other deed executed at the same time?	Mr Wedderburne. Was there any other deed executed at the same time?
23	Fannen. There was another part executed by Mr Fox.	John Fannen. There was another part executed by Mr Fox.
24	Q. Was you a subscribing witness to this deed?	Mr Wedderburne. Look upon the back of this deed; are you one of the subscribing witnesses?
25	Fannen. I was; and also to the other part executed by Mr Fox.	John Fannen. I am; and so I was to the other part, which was executed by Mr Fox.
26	Q. Had you any conversation with the prisoner after the execution of this lease?	Mr Serjeant Davy. Had you any conversation with the prisoner touching this lease immediately after the execution of it?
27	Fannen. I had; but I cannot say, whether it was immediately after, or a day or two after.	John Fannen. I had; but I cannot say, whether it was immediately, or a day or two after the lease was executed.
28	Q. Where was it?	Mr Serjeant Davy. Where was this?
29	Fannen. In my room at Mr Fox's house. I asked Mr Ayliffe, whether Rusley Park was it? he said, Mr Fox had been so good as to it him at the rent of 30 l. or 35 l. a year, I am not sure which, and he expressed Mr Fox's great goodness in so doing.	John Fannen. It was in my room at Mr Fox's house; I was not sure, though I guessed, that the deed executed by Mr Fox and witnessed by me, was a lease of Rusley park; and, therefore, I asked Mr Ayliffe, whether Rusley park was let; to which he answered,that Mr Fox had been so good to let it him at the rent of thirty or thirty five pounds a year – I am not positive which. – And he expressed great obligations to Mr Fox for his goodness in so doing.
30	Prisoner. I should be glad to look at that deed.	Prisoner. I should be glad to look at that deed.
31	Court. You shall see it as soon as it has been read.	Lord Chief Justice. You shall see it as soon as it has been read. In the mean time, would you ask this witness any question?
32		Prisoner. No, my Lord.
33	Eilis Dawe, sworn.	Mr Aston. My Lord, we will now call a witness to prove the concluding words of this counterpart (In witness whereof the parties to these presents have hereunto interchangeably set their hands and seals, the day and year first above written) to be of the prisoner's own hand-writing – Call Mr Ellis Dawe. (Who appeared and was sworn.)
34	King's Counsel. Look at the concluding words in this deed, from the words [In witness whereof] He takes it in his hand.	Mr Wedderburne. Mr Dawe, look at the concluding words in this deed, beginning with the words (In witness whereof) and tell us if you know the hand-writing.
35	Dawe. I know the prisoner, Mr Ayliffe; these words are his hand-writing.	Ellis Dawe. I do: I am well acquainted with the prisoner Mr Ayliffe; and believe these words to be of his hand-writing.
36	Q. Look upon the endorsement; I mean the title of the deed written upon the back of it; whose hand-writing is that?	Mr Wedderburne. Look upon the endorsement; I mean the title of the deed written upon the back of it; Whose hand-writing do you take that to be?
37	Dawe. That is Mr Ayliffe's hand-writing.	Ellis Dawe. I believe it to be Mr Ayliffe's hand.
38	Q. Are you acquainted with his hand-writing?	Mr Wedderburne. Are you well acquainted with his hand-writing?
39	Dawe. I am.	Ellis Dawe. I am.
40	Q. Have you seen him write?	Mr Wedderburne. Have you seen him write?
41	Dawe. I have often seen him write.	Ellis Dawe. I have often seen him write. Mr Serjeant Davy. We have now fully established this deed, and desire it may be read.
42		Lord Chief Justice. Let it be read.
43	The lease read, which bore date the 27th of November; and wherein the rent reserved was thirty-five pounds a year. After which the jury took the lease and suspected it, and then it was shewn to the prisoner.	[Then the Clerk of the arraigns read the deed, being a counterpart of a lease dated the 27th day of November 1758; from the Right Honourable Henry Fox to the prisoner, of a farm called Rusley-Park, in the parish of Bishopstone in the county of Wilts, for 99 years, if the prisoner, Sarah his wife, and John their son, or any of them should so long live, at the yearly rent of 35 l. (in words at length) with a covenant for payment of the aforesaid rent of 35 l. (in words at length). After which the same was delivered to the jury for their inspection, and then shewn to the prisoner agreeable to his request.]
44	Mr Aston. My Lord, we will now proceed to prove the publication of the forged deed in question; and for that purpose we will begin with a mortgage thereof, executed by the prisoner to Mr Clewer. Walter Hargrave, sworn.	Mr Aston. My Lord, we will now proceed to prove the publication of the lease mortgaged by the prisoner to William Clewer, Esq; and which bears date the 22d of Nov. and is at the yearly rent of 5 l. and for that purpose we will first prove the mortgage, wherein that lease is recited. – Call Walter Hargrave. (Who appeared and was sworn.)
45	King's Counsel. Look upon this deed (giving in the mortgage) see whether your name, upon the back, as a subscribing witness, is of your hand-writing.	Mr Aston. Mr Hargrave, look at your name set upon the back of this mortgage, as a witness to the execution of it; and tell us, whether that is your hand-writing?
46	Hargrave. It is.	Walter Hargrave. This is my hand.
47	Q. By whom did you see it executed?	Mr Aston. By whom did you see this mortgage executed?
48	Hargrave. By Mr Ayliffe: I saw him seal and deliver it.	Walter Hargrave. By Mr Ayliffe. – I saw him sign, seal, and deliver it, as his act and deed.
49	Q. Where?	Mr Aston. Where did you see it executed?
50	Hargrave. At Mr Priddle's chambers, in the King's-Bench-Walks, in the Temple.	Walter Hargrave. At Mr Priddle's chambers, in the King's-Bench-Walks, in the Inner-Temple.
51	Q. Is your name there, as a subscribing witness to the receipt for the consideration-money, likewise of your hand-writing?	Mr Aston. Is your name, set as a subscribing witness to the receipt for the consideration-money, likewise of your hand writing?
52	Hargrave. It is: I saw Mr Ayliffe sign this receipt for 1700 l.	Walter Hargrave. It is: I saw Mr Ayliffe sign this receipt for 1700 l.
53	Q. At the time of the execution of that mortgage, did Mr Ayliffe deliver any title-deeds to Mr Clewer?	Mr Aston. Did you see the prisoner, at the time of the execution of this mortgage, deliver any title-deeds to Mr Clewer?
54	Hargrave. He did: and this lease (pointing to the forged lease in question) was one.	Walter Hargrave. Yes, he delivered several deeds to Mr Clewer, and that lease (pointing to the lease set out in the indictment) was one of them.
55	Q. Do you remember any request being made, and by whom, to keep this mortgage a secret?	Mr Aston. Do you remember any request being made at this time, to keep the mortgage of that lease a secret? and if you do, tell us by whom such request was made, and who were present.
56	Hargrave. Mr Ayliffe desired Mr Bradley, Mr Green, Mr Clewer, and myself, to swear to keep it a secret.	Walter Hargrave. Mr Ayliffe desired Mr Bradley, Mr Green, Mr Clewer, and myself, to swear that we would keep it secret.
57	Q. What answer was given to that?	Mr Aston. What answer was assigned to this?
58	Hargrave. We told him we would take no such oath.	Walter Hargrave. We told him we would take no such oath.
59	Q. What reason did he give for this request?	Mr Aston. Was any reason given for such request?
60	Hargrave. Because he said he was not willing Mr Fox should know of it?	Walter Hargrave. The reason Mr Ayliffe gave, was, that he would not on any account have it come to Mr Fox's ears.

Appendix B: Speech tagger algorithm

Introduction

The Perl script developed for identifying and tagging spoken language in the Proceedings treats the files as ordinary text files, not as XML hierarchies, because this makes creating regular expressions easier. Anything tagged as <front>, <back>, <summary>, or <advert> is disregarded in the tagging process since these parts of the Proceedings do not contain speech. The following is a rough sketch of the patterns analyzed by the tagging software. A more detailed algorithm is given in the next section.

The first patterns searched for are 'Q - A'-sequences in different forms. If one is found, the speech-tag is added in the right place (steps 1-13).
The second patterns searched for are '? -' sequences in different forms (steps 14-17).
The third patterns searched for are paragraphs that start with "names". A name is simply defined as any sequence of the characters [A-Za-z']. Lines starting with Mr. or Mrs. are included. The line can start with either one "name" or two "names" followed by a dot. Example: 'Smith. I was walking...' or 'John Smith. I was walking...' or 'Mr. John Smith. I didn't see...'. Special cases that should not be tagged are listed explicitly and left unchanged, for example 'First Indictment' or lines starting with 'Before' (steps 18-37).
The fourth patterns searched for are paragraphs that open with a <person>-tag whose closing </person>-tag is followed by a '.' or by ', (profession label)'. Example: ' <person>John Smith</person>. I was in my...' or '<person>John Smith</person>, a Watchman. I was looking...' (steps 38-48).
The fifth patterns searched for are paragraphs that end with a '?'. These are probably direct questions. Example: 'Did you see the Prisoner?' (step 49).
The last patterns searched for are paragraphs that contain a first or second person pronoun such as 'I', 'you', 'yourself', 'our' etc. If one is found, the speech-tag is added in the right place. Some cases should not be tagged, such as paragraphs containing the string 'our Lord the King' (steps 50-60).
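
The order in which these six pattern families are tried can be illustrated with a short sketch. It is written in Python rather than Perl and is not the original script: the regular expressions, the labels, and the function names (strip_non_speech, first_matching_pattern) are hypothetical simplifications that only approximate the rules described above.

```python
import re

# Sections that contain no speech and are disregarded (as stated above).
NON_SPEECH = re.compile(r"<(front|back|summary|advert)\b[^>]*>.*?</\1>", re.S)

# A "name" is any sequence of the characters [A-Za-z'].
NAME = r"[A-Za-z']+"

# Hypothetical approximations of the six pattern families, in search order.
PATTERNS = [
    ("q_a", re.compile(r"^Q[. ]")),
    ("question_dash", re.compile(r"\? - ")),
    ("name_start", re.compile(rf"^(?:Mrs?\. )?{NAME}(?: {NAME})?\. ")),
    ("person_tag", re.compile(r"^\s*<person>[^<]*</person>(?:\.|, )")),
    ("ends_question", re.compile(r"\?\s*$")),
    ("pronoun", re.compile(r"\b(?:I|me|my|we|us|our|you|your|yourself)\b")),
]

# Two of the special cases mentioned above; the real list is longer.
NOT_NAMES = re.compile(r"^(?:First Indictment|Before)\b")
NOT_SPEECH = ("our Lord the King",)


def strip_non_speech(text):
    """Remove <front>, <back>, <summary> and <advert> sections."""
    return NON_SPEECH.sub("", text)


def first_matching_pattern(paragraph):
    """Return the label of the first pattern family that applies, else None."""
    for label, regex in PATTERNS:
        if not regex.search(paragraph):
            continue
        if label == "name_start" and NOT_NAMES.match(paragraph):
            continue  # looks like a name but is a known exception
        if label == "pronoun" and any(s in paragraph for s in NOT_SPEECH):
            continue  # formulaic text, not speech
        return label
    return None


if __name__ == "__main__":
    for example in (
        "Q. Did you see him? - A. Yes.",
        "John Smith. I was walking home.",
        "Did you see the Prisoner?",
    ):
        print(first_matching_pattern(example), "|", example)
```

Because the families are tried in a fixed order, a paragraph is claimed by the most specific pattern available; the pronoun check acts only as a last resort.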

Perl Speech Tagger algorithm

1. If the line contains 'Q.' or 'Q ' with no character before it, the line is checked further.
2. All occurrences of 'Q.' or 'Q ' get </speech> before them and <speech> after them.
3. If the Q. is within a <person>-tag, it is a name and should not get a <speech>-tag. The <speech>-tag is deleted.
4. Some files are wrongly tagged in the version obtained from Tim Hitchcock and Robert Shoemaker, in that a single Q. was tagged as a person. In these cases the <speech>-tag is moved and placed after the <person>-tag.
5. The first sequence of Q. should not have </speech> before it; this is deleted.
6. Once the <speech>-tags are added for the Q.s, the rest of the line is checked for answer-sequences. If the line contains '- A.' or '- A,' or '- A ', this substring gets </speech> before it and <speech> after it.
7. If the A. is within a <person>-tag, it is a name and is not processed further.
8. If the line contains 'A.', this substring gets </speech> before it and <speech> after it.
9. If the line contains ' - ', this substring gets </speech> before it and <speech> after it.
10. If there is no </speech> before the end -tag, this is added.
11. If a <speech>-tag does not have a -tag after it, this is added.
12. After all string substitutions are done, the line is written to the output file.
13. Otherwise the line is printed without processing.
14. If the line contains a question - answer sequence using '? - ' or '? - A.', the speech-tags are added in the right place.
15. If the question is preceded by the name of the speaker, the speech-tag is added after this.
16. If no name is used at the start of the line, the whole sequence is included in the speech-tag.
17. If one of the above was true, the line is printed in the output file.
18. Lines containing the <off>-tag (=offence) should not be processed.
19. If the paragraph ends with 'sworn.', no speech-tag should be added. If the line contains ' I.', this is probably not the pronoun but a number, so no speech-tag is added.
20. If the paragraph starts with 'before', 'for', 'in', or 'guilty', this is not a name, no speech-tag added.
21. If the paragraph contains the string 'conducted the prosecution', no speech-tag added.
22. If numbers follow and then <person>, no speech-tag added.
23. If the paragraph starts with 'XX Defence.', XX being any number of characters, the rest of the line is tagged with <speech>.
24. If the paragraph starts with 'A Justice of the Peace.', the rest of the line is tagged with <speech>.
25. If the paragraph starts with 'N.B' or 'N. B.', this is probably not a name, no <speech>-tag added.
26. If the paragraph starts with 'Mr. XX', 'Mrs. XX', 'Mr. XX XX' or 'Mrs. XX XX', XX being any letter character or the apostrophe, the <speech>-tag is added.
27. If the paragraph starts with a name followed by 'Mr.' or 'Mrs.', 'Mr.'/'Mrs.' should be included in the <speech>-tag. An example: 'Smith. Mrs. Johnson and I went to...'
28. If the paragraph starts with a name followed by 'yes' or 'no', 'yes'/'no' should be included in the <speech>-tag.
29. If the second word in the paragraph is 'indictment' or 'Count', no <speech>-tag is added.
30. If the paragraph contains the string 'Cross Examination', no <speech>-tag added.
31. If the paragraph starts with an <img>-tag (=image) followed by a name or speaker-label such as 'Prisoner', the <speech>-tag is placed after the speaker.
32. If the paragraph starts with a name or speaker-label such as 'Prisoner', the <speech>-tag is placed after the speaker.
33. If the paragraph ends with a 'the jury' announcing a verdict, </speech> is placed before this.
34. If the paragraph ends with a <verdict>-tag, the </speech>-tag is placed before this.
35. If the paragraph starts with the sequence 'XX to YY', the <speech>-tag is placed after this. XX and YY are defined as any sequence of characters. Example: 'Prisoner to Witness. I didn't see what …'
36. All other cases where the line starts with a name/speaker.
37. If true in any of the cases above, the line is printed in the output file.
38. Else if a <person>-tag follows directly after a -tag, this is probably a speaker-indication.
39. If the paragraph starts with a <person>-tag, the <speech>-tag is placed after this. If the paragraph ends with a 'the jury' announcing a verdict, </speech> is placed before this.
40. If the paragraph starts with a <person>-tag, the <speech>-tag is placed after this. If the paragraph ends with a verdict, the </speech> is placed before this.
41. If the paragraph starts with an <img>-tag, directly followed by a <person>-tag, the <speech>-tag is placed after this.
42. If the paragraph starts with 'Mr./Mrs.', directly followed by a <person>-tag, the <speech>-tag is placed after this.
43. If 'sworn.' follows directly after the <person>-tag, the <speech>-tag is placed after this.
44. If the <person>-tag follows directly after the -tag, ending with '.', the <speech>-tag is added after this. Example: ' <person> John Smith </person>. I was watching...'
45. If the <person>-tag follows directly after the -tag, followed by ', and...', the <speech>-tag is not added. Example: ' <person> John Smith </person>, and <person>Mary Smith </person>...'
46. If the string 'was indicted' is found in the paragraph, no <speech>-tag is added.
47. If the <person>-tag follows directly after the -tag, followed by ', ', the <speech>-tag is added. Example: '<person> John Smith </person>, a Watchman. I was working...'
48. The line, if true in any of the cases above, is printed in the output file.
49. If the paragraph ends with a '?', the <speech>-tag is added. If the <speech>-tag is placed before a <person>-tag it is moved and placed after </person>. Example: ' Did you see the man?'
50. If none of the above criteria is matched, the line is checked for personal pronouns. The <speech>-tag is added in the right place.
51. Some files were wrongly tagged with two <person>-tags in a row; this is dealt with.
52. If the line starts with a <person>-tag, the <speech>-tag is placed after this.
53. If the line contains 'our said Lord', no <speech>-tag is added.
54. If the line contains 'our Lord the King', no <speech>-tag is added.
55. If the line contains 'our Sovereign', no <speech>-tag is added.
56. The <speech>-tag is placed after 'sworn.'.
57. The <speech>-tag is placed before a verdict.
58. Otherwise the <speech>-tag is placed after
59. The line is printed in the output file.
60. In case none of the above are true, the line is printed in the output file.
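
The first group of rules in this list, the 'Q - A' handling, can be sketched as follows. This is a simplified Python illustration, not the original Perl code: the function name tag_question_answer_line and the helper names are hypothetical, only the 'Q.'/'Q ' and '- A.'/'- A,'/'- A ' markers are covered, and the closing tag is appended at the end of the line rather than before a paragraph end tag.

```python
import re

Q_MARK = re.compile(r"\bQ[. ]")   # 'Q.' or 'Q '
A_MARK = re.compile(r"- A[., ]")  # '- A.', '- A,' or '- A '


def _inside_person(text, pos):
    """True if position pos lies between <person> and </person>."""
    return text.rfind("<person>", 0, pos) > text.rfind("</person>", 0, pos)


def _wrap(match):
    """Put </speech> before and <speech> after a marker, unless it is a name."""
    if _inside_person(match.string, match.start()):
        return match.group(0)  # e.g. an initial inside a <person>-tag
    return "</speech>" + match.group(0) + "<speech>"


def tag_question_answer_line(line):
    """Add speech tags to a line that starts with 'Q.' or 'Q '."""
    if not re.match(r"Q[. ]", line):
        return line  # not a 'Q - A' line: print without processing
    tagged = Q_MARK.sub(_wrap, line)
    tagged = A_MARK.sub(_wrap, tagged)
    # The first Q. has no preceding speech, so its </speech> is deleted.
    tagged = tagged.replace("</speech>", "", 1)
    # Close the last speech segment if it is still open.
    if not tagged.endswith("</speech>"):
        tagged += "</speech>"
    return tagged


if __name__ == "__main__":
    print(tag_question_answer_line(
        "Q. Did you see him? - A. Yes. Q. Where? - A. In the street."
    ))
```

Wrapping every marker first and then deleting the surplus closing tag mirrors the order of the rules above; the speaker labels 'Q.' and '- A.' stay outside the tagged speech.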
