Corpus annotation: a welcome addition or an interpretation too far?

Dawn Archer
University of Central Lancashire


‘[...] annotation schemes and tools have become
an important dimension of corpus linguistics...’

The above quotation is taken from the Helsinki Corpus Festival website. The website goes on to highlight how annotation schemes and tools can help us to address a variety of issues … syntactic, semantic … pragmatic and sociolinguistic. The implicit message, then, seems to be that annotation – ‘the practice of adding interpretative linguistic information to a corpus’ (Leech 1997: 2), whether automatically or by manual means – will result in a “value added” corpus (Leech 2005). The value of annotation is further echoed by McEnery and Wilson’s (2001: 32) statement that ‘a corpus, when annotated’ becomes ‘a repository of linguistic information’ which, rather than being left implicit, is ‘made explicit’, and their later proposal that such ‘concrete annotation’ makes retrieving and analysing information contained in the corpus “quicker” and “easier”.

There are dissenting voices, of course. The late John Sinclair (2004: 191) believed corpus annotation to be ‘a perilous activity’, which somehow adversely affected the “integrity” of the text. He was particularly aggrieved by the possibility that researchers would only observe their corpus data ‘through the tags’ and hence miss ‘anything the tags [were] not sensitive to’ (p191). Susan Hunston is less critical than Sinclair. However, she does suggest that one of the strengths of corpus annotation – the ability to retrieve specific data systematically – might also prove to be a weakness if the researcher remains oblivious to the possibility that their research questions are being shaped, to some extent, by the categories used during the retrieval process (2002: 93).

In this article, I will provide some background in regard to the ongoing “annotation debate” before moving on to assess the strengths and weaknesses of my own use of annotation – manual and automatic – at the part-of-speech, semantic and (socio-)pragmatic level, using datasets which represent both present-day English and English of times past. My aim is to explore/suggest ways in which our texts are being enriched and might be further enriched – as opposed to being “contaminated” – by the annotation process (cf. Sinclair 2004); such that we become “sensitive annotation users” (and/or developers). I will also offer some blue-skies thinking in regard to an automated Historical Semantic Tagger which allows you to “shape the tags”, to some extent; and will ask whether such exploits – if achievable – might serve to transcend a distinction currently fuelling the “annotation debate”… that of corpus-based vs. corpus-driven.

1. Introduction

A special panel on Corpus Annotation was organised for the 2011 Helsinki Corpus Festival. The webpage relating to this special panel stated that,

[although t]he original Helsinki Corpus was created at a time when annotation was not yet a reality, ... annotation schemes and tools ha[d since] become an important dimension of corpus linguistics (Helsinki Corpus Festival Website, 2011).

By so doing, the organisers signalled a common belief within corpus linguistics (especially amongst corpus-based researchers): that adding ‘interpretative linguistic information to a corpus’ (Leech 1997: 2), whether automatically or by manual means, results in a value-added corpus (Leech 2005). [1]

There are dissenting voices, however; most notably the late John Sinclair’s. Sinclair (2004) believed corpus annotation constituted ‘a perilous activity’, which somehow adversely affected the integrity of the text. He was particularly aggrieved by the possibility that researchers would only observe their corpus data ‘through the tags’ and hence miss ‘anything the tags [were] not sensitive to’ (2004: 191). Susan Hunston is less critical than Sinclair. However, she does suggest that one of the strengths of corpus annotation – the ability to retrieve specific data systematically – might also prove to be a weakness, if the researcher remains oblivious to the possibility that their research questions are being shaped, to some extent, by the categories used during the retrieval process (Hunston 2002: 93).

My intention, in this paper, is to outline existing annotation practices at the part-of-speech, semantic and (socio-)pragmatic level, drawing on both my own research and the work of others, so that I might address a number of pertinent issues. They include whether texts are being enriched and might be further enriched – as opposed to being contaminated – by the annotation process (cf. Sinclair 2004); and how we can ensure that we are developing/using annotation diligently. I will also offer some blue-skies thinking in regard to an automated Historical Semantic Tagger which will allow end users to shape the tags, to some extent; and will suggest that such exploits transcend the corpus-based/corpus-driven distinction evident within corpus linguistics (see also Rayson 2008).

2. Annotation: an example of attempting the absurd?

I would like to begin my paper by posing a question: is the value of annotation similar to the famous lithograph print, Ascending and Descending, by M.C. Escher (1960), which depicts people walking up and down a never-ending staircase (for an image of which, see the link in Sources)?

As the question might seem a little odd, allow me to elucidate: the people (who are depicted in Escher’s print) think they are climbing up the stairs, but – unbeknown to them – they are actually within a loop that takes them back to their starting point. Is it possible that annotation constitutes a similar illusion? By this I mean, does it add nothing that is not already there, in the data, if one takes the time to look closely?

To help me to answer this question, I want to introduce a famous quotation, which is also ascribed to Escher – Only those who attempt the absurd ... will achieve the impossible – and use it to pose a contrary view to that which is evident within my starting question(s): namely, that annotation can seem absurd – when viewed out of its context of use – yet, its saving grace is its ability to get us to see things in new ways.

An example of attempting the absurd which I can offer at this point is an Archer and Rayson (2004) investigation involving the UCREL Semantic Annotation System (USAS) and 432,317 words of refugee-related data, which became known as the one-month challenge (Deegan et al, 2004). Rayson and I were asked to tag refugee data provided by the Forced Migration Online team using USAS, a software package for automatic dictionary-based content analysis, and then read the results to a room full of refugee experts in London – drawing from the USAS semantic field tags alone. [2] To elucidate, we were specifically asked not to read any of the refugee-related texts ourselves during the (automated) tagging phase and, during the presentation itself, we were expected to read the SEMTAGs (as opposed to reading from the texts themselves, word by word, paragraph by paragraph).

The purpose of the one-month challenge was to see whether we could uncover any nuances of difference – in terms of meaning, initially – between the different types of refugee-related texts (UNHCR, Federation of the Red Cross, Government agencies, NGOs and Academic – mostly FMO grey – literature). This proved to be possible. In fact, we were able to infer differences between the ideologies that were underpinning these texts. Some organisations such as the Red Cross, for example, were concerned most about living conditions – and, in particular, that people had enough food and shelter to sustain them. Documents relating to government organisations, in contrast, often picked up on ethical and legal issues (including piracy – an issue that was not commonly reported by media outlets in 2004).

I am not suggesting that this is the way to go for everyone, annotationally speaking. But I would venture that it may be one way to go, for some, in special circumstances. After all, are there not parallels, here, with the methodologies of the semantic web? (cf. Breitman et al. 2007).

3. Approaches within corpus linguistics: corpus-based/corpus-driven to data-driven?

The above one-month challenge touches on one purpose of this paper: exploring the “middle ground” between the corpus-based and corpus-driven positions within corpus linguistics. Although these two corpus-linguistic positions share similarities in practice (see McEnery and Hardie 2011, especially Chapter 6, for a detailed discussion), my purpose here is to emphasise their differences, so that I can summarise their (stereo)typical responses to annotation. Suffice it to say, corpus-based researchers typically begin with a research question, which they then test using an existing corpus – or, if necessary, a new corpus (which they will construct). To make it possible to retrieve results that are as objective as possible, they will also engage in annotation, when deemed relevant (Leech 1992, 1993, 1997). Once they have their annotated corpus, the corpus-based adherent tends to engage in qualitative and quantitative investigative procedures as a means of gaining the information which addresses their research question (or research questions). If they are the tool developer, however, they might be more interested in the extent to which the results they have gleaned via the retrieval processes confirm or disconfirm the accuracy of their tool and its approach.

As Tognini-Bonelli (2001: 85) has pointed out, corpus-driven researchers believe existing theories/assumptions about language, which are based on work that pre-dates the emergence of the electronic dataset, need revising and/or replacing, and hence that annotations/tools based on such erroneous theories/assumptions are in danger of adversely skewing any results that they retrieve. Not surprisingly, corpus-driven researchers tend to reject the (automated) use of annotation, in particular. But Rayson (2008) has suggested that this may be going too far: in part – it must be said – so that he might offer a third approach – a data-driven one – which, he argues, combines the best of both the corpus-based and the corpus-driven methods. For example, rather than rejecting things like part-of-speech taggers (as is the wont of corpus-driven researchers), Rayson makes use of them within his approach. But his starting point is not a research question, per se (cf. the corpus-based approach, above). Instead, he begins with the compilation of two datasets; one of which is the dataset of interest, and the other, a comparative (or normative) corpus. He then automatically annotates these texts for part-of-speech and semantic-field information, using CLAWS (Garside et al. 1987) and USAS (Rayson 2003), and statistically compares the results, via the web-based interface Wmatrix. It is at this point that the researcher comes back into the equation, according to Rayson: his or her role is to ‘qualitatively examine concordance examples of the significant words, POS and semantic domains’ (2008: 528), which have been highlighted as key statistically speaking, as a means of determining topics/features worthy of further investigation. The approach, then, is not completely automatic – and any research questions which arise come from the interrogation of the data itself (thereby alleviating some of the concerns of Sinclair 2004 and Tognini-Bonelli 2001).
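The statistical comparison step in this data-driven approach can be sketched in a few lines. The example below is a minimal illustration only: it assumes the log-likelihood ratio as the keyness statistic (a measure commonly used for comparing two corpora) and a cut-off of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom). The function names are hypothetical, and nothing here should be read as a description of how Wmatrix itself is implemented. The items being compared may equally be words, POS tags or semantic tags.

```python
import math
from collections import Counter


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood ratio for one item observed in two corpora."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    # Frequencies we would expect if the item were used equally in both corpora.
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def key_items(target_items, ref_items, threshold=3.84):
    """Return (item, score, 'over'/'under') for items above the threshold."""
    target, ref = Counter(target_items), Counter(ref_items)
    n_target, n_ref = len(target_items), len(ref_items)
    results = []
    for item in set(target) | set(ref):
        score = log_likelihood(target[item], n_target, ref[item], n_ref)
        if score >= threshold:
            overused = target[item] / n_target > ref[item] / n_ref
            results.append((item, round(score, 2), "over" if overused else "under"))
    # Highest-scoring (most "key") items first.
    return sorted(results, key=lambda row: row[1], reverse=True)
```

The output of such a comparison is only a ranked list of candidates: in keeping with the approach described above, the researcher would still need to examine concordance lines for each key item before drawing any conclusions.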

This does not mean to say that researchers cannot come with a question they might want to address using USAS, however. They can. Before I demonstrate this, via my own research, I will briefly outline the most common types of manual and (semi-)automatic annotation schemes.

4. Corpus annotation: manual and (semi-)automatic

4.1 Automatic approaches

Automatic approaches, within corpus linguistics, include part-of-speech annotation, lemmatisation and semantic annotation. Part-of-speech (henceforth POS) annotation is particularly useful because it increases the specificity of data retrieval from corpora, and also forms an essential foundation for further forms of analysis, such as syntactic parsing and semantic field annotation (both of which are outlined below). Thanks to early inter-rater-consistency studies (such as Baker 1997), we know that automatic POS taggers can be extremely accurate. This said, some manual or semi-automatic amendments will always be necessary, if the tagging result is to be (near-)perfect. [3] The lemmatisation process is closely allied to the identification of parts-of-speech. It involves the reduction of the words in a corpus to their respective lexemes. Lemmatisation, then, allows the researcher to extract and examine all the variants of a particular lexeme without having to input all the possible variants, and to produce frequency and distribution information for that lexeme. As Hickey (1995: 146) highlights, lemmatisation can be extremely useful when working with historical data especially, not least because it helps to minimize typographical errors. Spelling variation is also a common problem when it comes to historical datasets, of course. However, tools have been developed recently which help to eradicate this problem without the need for lemmatisation. They include the VARD, a tool I describe in Section 5.
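The retrieval benefit of lemmatisation can be illustrated with a small sketch. It assumes a hand-built form-to-lemma lookup (the entries below are hypothetical and far smaller than any real lemma list) and simply counts every variant under its lexeme, so that a single query retrieves all of the forms.

```python
from collections import Counter

# Hypothetical form-to-lemma lookup, for illustration only.
LEMMA_LOOKUP = {
    "go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go",
    "say": "say", "says": "say", "said": "say", "saying": "say",
}


def lemma_frequencies(tokens, lookup=LEMMA_LOOKUP):
    """Count tokens by lemma; forms missing from the lookup count as themselves."""
    counts = Counter()
    for token in tokens:
        form = token.lower()
        counts[lookup.get(form, form)] += 1
    return counts
```

A real lemmatiser would also consult the POS tag of each token, since the same written form can belong to different lexemes; the sketch leaves that step out for brevity.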

Semantic annotation involves assigning each word within a text a word sense code. A good example of this is USAS, the system Rayson and I used to mine the refugee data (mentioned in Section 2). Very briefly, USAS makes use of POS information, provided by CLAWS, when assigning its 232 SEMTAG categories (on the basis of pattern matching between the text and several computer dictionaries), and is reported to have a 91% accuracy rate in regard to modern general English (Rayson et al. 2004).
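The general idea of dictionary-based semantic tagging that draws on POS information can be sketched as follows. This is a toy illustration under stated assumptions: the lexicon entries, the POS labels and the semantic labels (LAW, FOOD and so on) are all invented for the example and are not the USAS tagset or the CLAWS tagset, and the single-lookup logic is much simpler than the pattern matching a real system performs.

```python
# Hypothetical lexicon: (word form, POS label) -> semantic field label.
LEXICON = {
    ("court", "NOUN"): "LAW",
    ("court", "VERB"): "RELATIONSHIP",
    ("bread", "NOUN"): "FOOD",
    ("shelter", "NOUN"): "HOUSING",
}
UNMATCHED = "UNMATCHED"  # label for items the lexicon does not cover


def tag_semantic_fields(pos_tagged_tokens, lexicon=LEXICON):
    """Attach a semantic field label to each (word, POS) pair."""
    return [
        (word, pos, lexicon.get((word.lower(), pos), UNMATCHED))
        for word, pos in pos_tagged_tokens
    ]
```

Keying the lexicon on the word and its POS label together shows why POS tagging is a useful foundation for this kind of annotation: the noun and the verb "court" receive different semantic labels even though the written form is identical.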

Other automatic techniques – used on very large datasets – include named-entity recognition (NER) and (data) visualization. Visualization techniques allow users to explore data through a graphic interface instead of – or more commonly, these days, in addition to – exploring them using word or phrase searches, whereas NER techniques allow users to structure or model their data, using grammatical information (= rule based) or statistical information. People, organizations, locations, time expressions, quantities, monetary values and percentages tend to be the most common predefined categories within named-entity recognition tools to date; which may help to explain why, within the Humanities and Social Sciences, it is probably the historians, sociologists and media scholars who make most use of named-entity recognition techniques. [4]

4.2 Manual approaches

Manual types of annotation tend to be used when dealing with much smaller datasets – in part because of the time it takes to apply such annotation. The types of annotation which are possible include discoursal annotation, problem-oriented annotation and sociopragmatic annotation.

Discoursal annotation is not particularly common, even now. This is most likely because the linguistic categories are context-dependent and their identification in texts is a greater source of dispute than other forms of linguistic phenomena. One of the best-known systems – by Stenström (1984) – has been applied to the London-Lund spoken corpus. The 16-strong tagset captures things like apologies, greetings, hedges, politeness phenomena and feedback responses.

I have been involved in the development of coding systems via which to systematically capture similar features to Stenström (1984), but using historical (English) datasets. For example, I have annotated requests in the Corpus of English Dialogues, in terms of their in/directness, with my colleague Jonathan Culpeper (Culpeper and Archer 2008), and also developed annotation systems (using the same dataset) which capture initiations, responses and follow ups as well as the form and functions of questions, and their corresponding answers (Archer 2005). I provide some examples of these schemes in Section 6.3. I also discuss the sociopragmatic annotation system we have devised (Section 6.2): this particular scheme accounts for features such as gender, age, status and interlocutor role at the Utterance level (Archer and Culpeper 2003).

Research specific/problem-oriented tagging is the phenomenon whereby users will take an un-annotated or annotated corpus and add to it their own form of annotation. As these schemes are applied with a specific research goal in mind, they tend not to be exhaustive. By this I mean, only those words/sentences or utterances which are directly relevant to the research tend to be tagged. [5] Interestingly, the use of research specific/problem-oriented tagging seems to have been more acceptable to Sinclair (2004) – than (automatic) POS tagging, for example – as (in his view) the researcher was still free to ‘observe’ the patterns of the texts themselves; assuming the tagging was applied to an un-annotated corpus (see Section 6).

4.3 Semi-automated techniques

Capturing pragmatic information relating to speech acts using semi-automated techniques is possible using modern data, as Leech and Weisser (2003) have shown, when working with telephone dialogues which, importantly, are task-oriented. It is more difficult to use semi-automatic techniques on historical data – not least because speech act forms and/or their frequency will tend to change over time. This said, there are some interesting papers in Jucker and Taavitsainen’s (2008) edited collection which suggest that we are not too far away from developing algorithms via which we might capture relatively fixed or highly formulaic speech acts (such as compliments and directives).

Parsing relates to the time-intensive procedure of bringing basic morpho-syntactic categories into high-level syntactic relationships with one another. Parsing is probably the most commonly encountered form of corpus annotation after POS tagging – although it is used less when dealing with historical data than it is when dealing with modern data. Two of the best known parsed historical corpora are the Penn Parsed Corpora of Historical English and the Parsed Corpus of Early English Correspondence. The developers of these corpora have also developed corpus-specific search engines; my mention of which serves as a useful reminder that annotation systems are meaningless if you have no means of searching for and retrieving the tags which have been incorporated into a particular dataset.

5. Retrieval tools

More general search tools used by linguists include the aforementioned Wmatrix (Section 3) and WordSmith Tools (Scott 2008). These particular tools allow users to first create and then search through concordances, identify the collocates of a given search item, and also identify the key words and/or key domains that are an important part of Rayson’s (2008) data-driven approach – i.e. those lexical items which tend to be over- or under-used, statistically speaking, when compared to a comparative or normative corpus. It is worth noting that these and similar tools work best on historical texts which have been lemmatised (Section 4.1), or which have had variant spellings normalized. One such normalizing tool – the VARD – came about when Rayson and I recognised the need for a pre-processing step within the Historical Semantic Tagger, so that the system would not fall down because of all the variant spellings within historical texts. That VARD tool has been substantially improved by Alistair Baron over the past few years (see this link in the Sources section for details). Baron has also created a second tool – DICER – to interrogate the results captured by the VARD tool. When used together, such tools therefore make possible a systematic investigation of spelling variants over time/across different genres and text-types (Baron et al. 2009, 2011).
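A spelling-normalisation pre-processing step of the general kind described here might look like the sketch below. It is an assumption-laden illustration and not a description of how the VARD works: the variant list, the modern word list and the single u/v substitution rule are all hypothetical. The original spelling is kept alongside the normalised form, so that the variants themselves remain available for study.

```python
# Hypothetical resources, for illustration only.
KNOWN_VARIANTS = {"sayd": "said", "doe": "do", "bee": "be"}
MODERN_WORDLIST = {"said", "do", "be", "up", "have", "judge", "prisoner"}


def normalise(token, variants=KNOWN_VARIANTS, wordlist=MODERN_WORDLIST):
    """Return a modern spelling for a token, or the token itself if none is found."""
    form = token.lower()
    if form in wordlist:
        return form
    if form in variants:
        return variants[form]
    # Simple letter-replacement rule: try swapping u and v in either direction.
    for candidate in (form.replace("u", "v"), form.replace("v", "u")):
        if candidate in wordlist:
            return candidate
    return form  # no normalisation found: leave the spelling untouched


def normalise_tokens(tokens):
    """Pair each original spelling with its normalised form."""
    return [(token, normalise(token)) for token in tokens]
```

Leaving unrecognised forms untouched, rather than guessing, is a deliberate choice in this sketch: it keeps the pre-processing step from silently altering the text.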

6. Trust the text?

Corpus-driven researchers – most notably Sinclair (2004: 191) – are concerned that corpus users, who are reliant on annotation, will only observe their corpus data ‘through the tags’ and hence miss ‘anything the tags are not sensitive to’. This does not mean they have a complete aversion to annotation (as I highlight above). Rather, their preference is for minimal – (usually) non-permanent, (and preferably) hand-written – annotation. Their reasoning? Such an approach helps to ensure (i) texts can ‘speak for themselves’ (cf. Sinclair 1994: 12), and (ii) corpus users can remain as sensitive as possible to their (interpretation of the) data (Sinclair 2004). Notice the implicit assumption within (ii) – that corpus-based researchers cannot be as text-sensitive as they need be (presumably because the tags get in the way): I’ll discuss the validity of this assumption in Section 6.6. First, I want to explore the implicit assumption within (i) – that the myriad of texts in digital/electronic form are somehow correct enough that we can trust them unequivocally. For, thanks to people like Matti Rissanen, historical linguists have long been aware of not “trusting the text” without first doing some additional detective work.

6.1 Rissanen’s prophetic voice

Rissanen (1989) identifies three problems that we might need to address when working with diachronic corpora – the Philologist’s Dilemma, God’s Truth Fallacy and the Mystery of Vanishing Reliability – only the last of which is directly linked to the issue of annotation. As I will demonstrate, however, all three of the problems are applicable. Consider, for example, Rissanen’s warning in respect to the Philologist’s Dilemma: the idea that researchers can fall into the trap of trusting a corpus too much, to the extent that they do not work at all – or, at least, do not work enough – with the original materials themselves. The dangers of this are obvious, especially when the corpus contains extracts of texts rather than complete texts. I have discussed this at length in regard to courtroom data (see, e.g., Archer 2007). We often find, for example, that representative corpora containing courtroom texts tend to include only the question-answer sequences between examiner and examinee. This can give us the false impression that the courtroom is characterised by questions and answers only. Those who study the courtroom (modern and/or historical), however, know that this particular macro speech event is made up of a number of mini activity types (Levinson 1992), some of which are not necessarily characterised by the Q-and-A format. Examples relating to the historical English courtroom include Judge and lawyers interacting at the Bar, defendants interacting with witnesses, defendants interacting with judges, and the interaction around the arraignment section and sentencing.

How might Rissanen’s warning in regard to the Philologist’s Dilemma also be applicable to annotation? It brings to mind Sinclair’s concern that researchers will use annotation without referring back to the text. That is, they will trust the tags unquestioningly, as opposed to trusting the text. As Leech’s (1993) article on corpus annotation makes clear, however, annotation developers have been pointing out for some time that their schemes do not represent ‘God’s truth’ but, rather, a potentially useful interpretation based on ‘principles’ which are – or should be – ‘accessible to the end user’. Rissanen goes even further than both Sinclair and Leech, of course, when he reminds us not to automatically trust the text either, when that text is part of a corpus. This leads us nicely to Rissanen’s second dilemma when working with diachronic corpora: God’s Truth Fallacy – the idea that a corpus cannot tell us everything (just as annotation cannot tell us everything). [6] Let’s return to my previous example by way of an illustration: a courtroom text which is based on Q-and-A sequences – however interesting that text may be – will not tell us very much about the formulaic nature of the English sentencing procedure (historical or modern). As such, if we want to study these verdictives (Austin 1962), we need to explore some of the extant records to be found in libraries and other archives. Alternatively, we might build a corpus which focuses on sentencing or – my preferred option – contains complete trials (see Archer 2007).

Another myth in regard to transcripts, in particular, is that they constitute accurate recordings of, for example, speech events (Archer et al, 2012). Kytö and Walker (2003: 227–8) provide us with an extremely useful approach, when it comes to historical transcripts: checking whether there are various versions of the same speech event; and, if there are, determining how they differ. I would add that, where they do differ, we should also be seeking to determine whether/how those differences are affecting our understanding of the speech event. A great courtroom example of this is the Trial of Charles I (1649). Kytö and Walker (2003) have investigated the differences between four versions of this trial, some of which were amended by those involved. Lord President Bradshaw – the man who acted as judge in this infamous trial – is thought to have “improved” his own contributions in the version which bears his name, for example. This version, known as Bradshaw’s Journal, is the only known extant manuscript and, as such, is regarded as the official version, even though it contains indicators of Bradshaw’s ideological standpoint.

6.2 Beware framing!

The judges were not the only participants, within a courtroom context, who strategically framed their turns (for perpetuity). Examples of the defendants’ perspective are to be found in Quaker Trial Records, some of which can be accessed via the State Trials collection. These records were written a priori, and often contained sections which emphasised the bad treatment defendants felt they had experienced at the hands of their judges. The following example relates to a Quaker named John Crooks. In the utterance which immediately precedes this extract, Crooks reports his attempt to get the judges to act as his counsel (i.e. inform him of his legal rights). This occasioned a reproach from one of the judges, who, on thinking that Crooks was trying to tell them their job, used the slur sirrah. As the following extract makes clear, Crooks was undeterred. In fact, he admonished the Chief Justice for using the abusive term!

CJ: […] Sirrah, you are too bold.

Sirrah is not a Word becoming a Judge; for I am no Felon; neither ought you to menace the Prisoner at the Bar…Therefore you ought to hear me to the full what I can say in my own Defence, according to Law, and that in its season, as it is given to me to speak […]

Then the Judge interrupted me, saying Sirrah, with some other Words I do not remember: But I answered, You are not to threaten me, neither are those Menaces fit for the Mouth of a Judge: for the Safety of a Prisoner stands in the Indifferency of the Court; And you ought not behave yourselves as Parties; seeking all advantages against the Prisoner […]. The Judge again interrupted me, saying

CJ: Sirrah you are to take the Oath, and here we tender it to you [bidding me to read]

The above could constitute poetic licence, especially given the fact that most defendants were far less eloquent than Crooks reports himself to have been during his trial (Archer 2005). However, we do know that Crooks knew the Law extremely well: for, prior to becoming a Quaker, he had been a judge himself (for more details, see Cecconi 2011).

I am focussing in some detail on this framing technique as a means of highlighting that texts were written/shaped by people – and sometimes edited/shaped by people – who had specific agendas. This is especially true of collections of texts such as the State Trials, and some Old Bailey trials. It is also likely to be true of some political texts, medical treatises and letters (i.e. the material we tend to find in historical/representative corpora). This, in turn, points to an extremely important issue which links the two concerns discussed thus far (The Philologist’s Dilemma and God’s Truth Fallacy): I ardently believe that, in addition to familiarising ourselves with texts in their original form – and where two or more accounts of the same event exist, comparing those texts – we should learn about the political, social, religious, cultural and linguistic issues which may have affected a given text’s construction. For how are we to glean a better understanding of what was deemed to be appropriate and inappropriate communicative behaviour in times past unless we appreciate the period in which a given text was produced?

Let me explain this more fully, drawing from my own experience: after reading a few Early Modern English courtroom trials, I had a suspicion that directives were a feature of the historical English courtroom more so than people had previously claimed or believed. It was not until I began to annotate the texts, however, that I realised that directives were not just the preserve of the powerful in the Early Modern courtroom (i.e. the questioners). For defendants like Charles I and Titus Oates also used them. Interestingly, because I had co-developed an annotation scheme that captured sociopragmatic information, such as an interlocutor’s gender, age, status and role (Archer and Culpeper 2003), I was able to confirm that all six of the defendants who used directives in my corpus of courtroom trial texts (three of whom – Charles I, Hewitt, Mordant – were tried within a 10-year period) shared one thing in common beyond their defendant role: they were of a gentleman status or higher. Would I have discovered this without annotation? If I had managed to read certain trials in detail, and in a certain sequence, then I might. But it would have taken me a while (I should think) to realise the link between status and directives – not least because all that had been written previously seemed to focus on directives being the preserve of the powerful in court – i.e. had suggested such usage was role-dependent; and, as far as I am aware, no-one had linked such usage to sociological variables such as the status of the participants beyond the courtroom.
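The kind of retrieval that utterance-level sociopragmatic annotation makes possible can be sketched as follows. The record structure, field names and field values below are hypothetical simplifications chosen for the example; they are not the tag format of the Archer and Culpeper (2003) scheme, and any data passed to the function in testing would be invented rather than drawn from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One annotated utterance (hypothetical, simplified fields)."""
    speaker: str
    role: str        # e.g. "defendant", "judge", "witness"
    status: str      # e.g. "gentry", "commoner"
    speech_act: str  # e.g. "directive", "question", "inform"
    text: str


def speakers_using(utterances, role, speech_act):
    """Return the (speaker, status) pairs for a given role and speech act."""
    return {
        (u.speaker, u.status)
        for u in utterances
        if u.role == role and u.speech_act == speech_act
    }
```

Because role and status are recorded separately on every utterance, a query such as `speakers_using(corpus, "defendant", "directive")` brings the two variables together in one step, which is the sort of cross-tabulation that would be slow to arrive at through sequential reading alone.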

6.3 Annotation – how much detail is enough?

This brings us to the problem which Rissanen directly linked to annotation – the Mystery of Vanishing Reliability (i.e. the statistical unreliability of annotation that is too detailed). Simply put, the more detailed our annotation schemes, then the less they will tend to tell us in regard to more general patterns of language usage. In the worst case scenario, moreover, potentially skewed results may serve to ‘fetter research for decades’ (Rissanen 1989: 17; cf. Sinclair 2004: 191).

There is another side to this particular coin, of course, as Nunnally points out in Rissanen et al.’s co-edited collection, History of Englishes – and that is that ‘maximally applicable’ schemes tend to be ‘the least revealing about particular text types’ (1992: 371). This suggests, then, that the level of detail within an annotation scheme should be dependent – to some extent – on both the text type to be investigated and the research questions to be asked. Simply put, if your intention is to arrive at generalisable statements in regard to a particular period of English, do not use an annotation scheme that is too detailed. However, if your intention is to better understand a particular feature within a particular text type, your research project would most likely be better served by an annotation scheme which is developed for this purpose.
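The trade-off between detail and statistical reliability can be made concrete with some simple arithmetic. The sketch below assumes, purely for illustration, that tagged items are spread evenly across every combination of categories; the function name and any figures fed to it are hypothetical.

```python
import math


def mean_items_per_cell(n_items, *category_sizes):
    """Average number of items per combination of categories.

    Each argument after n_items is the number of values one annotation
    dimension can take; with no dimensions there is a single cell.
    """
    cells = math.prod(category_sizes)
    return n_items / cells
```

With a hypothetical 600 tagged items, one dimension of 6 categories leaves an average of 100 items per category, whereas crossing that dimension with two further dimensions of 12 and 5 categories leaves fewer than two items per combination, which is too few to support general statements.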

For my own studies relating to the historical English courtroom, for example, I developed quite detailed annotation schemes as a means of capturing the characteristics of questions and their answers as follows:

Macro SA categories: Counsel; Question; Request; Require; Sentence; Express; Inform

‘Question’ macro speech act: ‘S wants A to supply a missing variable by saying/confirming/clarifying something about X [X = an action/event, behaviour/person]’

Question functions: ‘ask about’; ‘inquire into’; ‘question, ascertain’; ‘interrogate’; ‘query, call into question’; ‘entreat’

Question: Are, can, could, could not, did, did not, do, do not, don’t, had, has, has not, how, how came, how come, how far, how long, how many, how much, how near, how often, is, might not, might, must, must not, never, or, shall, was, were, what, when, where, whether, which, who, why, why did not, would, etc.

Table 1. The characteristics of questions.
(See Archer 2005: 127–131, for detailed definitions of the above, and the tags used to denote them)

‘Inform’ macro speech act: ‘S wants to communicate something (about X) to A’

Answer functions:
(in)validate    confirm (proposition)
identify        do not confirm/oppose proposition
imply           disclaim
supply          evade
elaborate       refuse to answer
expand          Multiple

Table 2. The characteristics of answers
(See Archer 2005: 130–131, for detailed definitions of the above, and the tags used to denote them)

Given that questions are/were (often) used as a means to control/manipulate in a courtroom context, I believed that it was particularly important to have a system which could capture the nuance of meanings between the different question functions, as well as the answer functions they procured.

Although my annotation schemes have allowed me to uncover some illuminating facts about the Early Modern English courtroom, in regard to questions and answers specifically (and also speech acts such as directives), it could prove to be the case that I have looked at trials that, in the final instance, turn out to be unrepresentative of the historical English courtroom as a whole. Hence the need for future research to in/validate both: (i) my current findings [7] and also (ii) the strength/statistical reliability of the annotation schemes themselves. [8]

6.4 Retrieving data systematically via annotation – a strength and/or a weakness?

Hunston (2002: 93) has suggested that one of the perceived strengths of corpus annotation – the ability to retrieve specific data systematically – can also prove to be a weakness, when the researcher remains oblivious to the possibility that their research questions are being shaped, to some extent, by the categories used during the retrieval process (Hunston 2002: 93, my italics). Some corpus-driven researchers have gone even further, accusing corpus-based researchers of being theory-blind – to the extent that, when their datasets suggest that pre-existing theory may be wrong (or when their annotation schemes do not fit), they ignore or discard the offending data, as opposed to amending the theory. My own response to such criticism is to stress that it is always best to use/develop annotation schemes in such a way that the annotation schemes themselves are not merely a means of finding data which supports one’s theory or stance – or someone else’s theory or stance for that matter. Rather, they should be seen as a means via which to examine evidence and, in examining evidence, determine whether that evidence fits one’s pre-conceived notions or challenges them. Those pre-conceived notions might relate to ideas in regard to the efficacy of specific tagsets, theory/ies, or additional interpretative frames (such as one’s knowledge of the period/activity type, etc.). And when pre-conceived notions are challenged in this way, I would argue that it is important to detail the “how” (as a means of uncovering the “why”).

It is worth noting, at this point, that there are several well-known annotation devices which help to alleviate the problem of precise “fit”. They include:

  • Using “fuzzy” tags, where it is proving difficult to capture a specific language feature using existing categories or tags;
  • Using multiple tags, when, say, an utterance contains more than one identifiable speech act;
  • Using portmanteau tags, where a particular meaning transcends two semantic domains, for example; and
  • Using “problematic” tags, when it is not possible to assign a language feature or sociopragmatic variable to an existing category or tag.

Let’s explore a historical example taken from the Trial of King Charles, which is tagged according to the Archer (2005) macro speech act scheme, and the Archer and Culpeper (2003) sociopragmatic scheme. The utterance is made by Bradshaw, the Lord President.

<u stfunc="ini" force="m" force1="w" speaker="s" spid="s3tcharl001" spsex="m" sprole1="j" spstatus="1" spage="9" addressee="s" adid="s3tcharl002" adsex="m" adrole1="d" adstatus="0" adage="9">The Court hath considered of their Jurisdiction, and they have already affirmed their Jurisdiction; if you will not answer, we shall give order to record your default</u>

(Trial of Charles I, 1649)

“The Court hath considered of their Jurisdiction, and they have already affirmed their Jurisdiction” works first as a statement. As we read on, however, we quickly become aware that “they have already affirmed their Jurisdiction” indicates intent, as does the conditional clause which follows – “if you will not answer, we shall give order to record your default”. The if-then clause also has another function – it acts as a warning and, perhaps (in line with the intent reading), a kind of promise. The multiplicity of meaning is captured in my scheme via the tag force="m".
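To illustrate how an end user might read such an annotated utterance programmatically, the following is a minimal sketch in Python. It assumes that the utterance is stored as well-formed XML exactly as quoted above. The grouping of attributes by their "sp"/"ad" prefixes, and the function name split_fields, are my own assumptions for illustration and are not part of the Archer (2005) or Archer and Culpeper (2003) schemes. No interpretation of the individual value codes is attempted beyond force="m", which the text identifies as marking multiple forces.

```python
import xml.etree.ElementTree as ET

# The annotated utterance exactly as quoted above (Trial of Charles I, 1649).
utterance = (
    '<u stfunc="ini" force="m" force1="w" speaker="s" spid="s3tcharl001" '
    'spsex="m" sprole1="j" spstatus="1" spage="9" addressee="s" '
    'adid="s3tcharl002" adsex="m" adrole1="d" adstatus="0" adage="9">'
    'The Court hath considered of their Jurisdiction, and they have already '
    'affirmed their Jurisdiction; if you will not answer, we shall give order '
    'to record your default</u>'
)

def split_fields(element):
    """Group attributes by prefix (an assumption made for this sketch):
    'sp...' = speaker fields, 'ad...' = addressee fields, rest = speech act."""
    groups = {"speech_act": {}, "speaker": {}, "addressee": {}}
    for name, value in element.attrib.items():
        if name.startswith("sp"):      # includes the 'speaker' attribute itself
            groups["speaker"][name] = value
        elif name.startswith("ad"):    # includes the 'addressee' attribute itself
            groups["addressee"][name] = value
        else:
            groups["speech_act"][name] = value
    return groups

u = ET.fromstring(utterance)
fields = split_fields(u)

# force="m" is the tag the text describes as capturing multiplicity of meaning.
has_multiple_force = u.get("force") == "m"
```

Separating the speech act fields from the speaker and addressee fields in this way would allow, for example, all multi-force utterances to be retrieved and then cross-tabulated against interlocutor role or status.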

Moving on now to a modern example, taken from Leech (1983: 23–4):

If I were you, I’d leave town straight away

Because of its inherent fuzziness, this particular utterance can be heard as advice, a threat or a warning (depending on the context-of-utterance, and the relationship at that time between S and H). The fuzziness also affords the speaker a level of plausible deniability (were the hearer to respond rather badly). A fuzzy tag provides the annotator with a means of maintaining such fuzziness, and the end user, a means of studying (linguistic/pragmatic) ambiguity.

6.5 Can annotation capture ‘the integrity of the data as a whole’ (Tognini-Bonelli 2001: 85)?

Tognini-Bonelli (2001: 85) has stressed that corpus-driven researchers are committed to ‘the integrity of the data as a whole’ and as a result, their ‘theoretical statements are fully consistent with, and reflect directly, the evidence provided by the corpus’ (Tognini-Bonelli 2001: 85). Are we meant to infer, from this, that corpus-driven researchers are much more sensitive to and appreciative of their data (and all its particular foibles) than corpus-based researchers? I hope not. For linguists who have developed annotation, or adopted algorithms to find particular speech acts in different electronic datasets, have consistently shown themselves to be very sensitive to and appreciative of their datasets. For example, the form-based approach used by Taavitsainen and Jucker (2008) to study compliments was completely reliant upon the researchers knowing their data (taken mainly from literature texts), not least because: they had to consider what forms to look for before searching for them electronically; to interrogate their data to determine whether some of the results they were retrieving via their search strings were actually ‘false friends’ (Kohnen 2002); and to add to their original/starting list as they noticed compliment strings in their data they had not identified prior to their investigations. Simply put, it was not enough to retrieve all examples of more beautiful, really nice, great, well done, etc.: Taavitsainen and Jucker also had to go through that list and work out what was and what was not functioning as a compliment, given what they knew about compliments in this period (cf. Section 6.2). This required careful/detailed readings of all of the hits, in their wider co-text and context.

In Archer (2005), I used a different method of speech act identification: I categorised utterances within my courtroom data according to whether they functioned as a counsel, question, request, require, sentence, express and/or inform. I also added a more detailed annotation scheme in respect to the functions/forms of questions and the functions of answers (see Section 6.3); and Culpeper and I have since added a more detailed annotation scheme in respect to the functions/forms of directives (Culpeper and Archer 2008; see also Section 4.2). That annotation scheme initially drew on the work of Blum-Kulka et al. (1989). However, our interrogation of the data quickly revealed that we could not apply the categories used by them without some amendment. For example, we found examples of locution derivable / obligation statements where the obligation imposed was on S, or on both S and H (as opposed to H only, as is the case in Blum-Kulka et al.’s tagset). [9] A new category we had to introduce (as opposed to merely adapting) is that of prediction/intention. This category relates to S’s prediction of or intention to perform individual or joint action, and captures examples such as “thee and I will make a visit”. [10]

7. Annotation – issues to address

I believe that we have come a long way in terms of the annotation journey. But there are still things to address. Let’s focus, for a moment, on pragmatic annotation.

7.1 Issues relating to pragmatic annotation

An important consideration, here, is balancing the time it takes to annotate a dataset, with the results you are likely to achieve; not least because texts need to be manually annotated in the main (in spite of Jurafsky and colleagues’ impressive inroads in respect to automatic approaches to speech act identification: see, e.g., Jurafsky and Martin 2009). Pragmatic annotation can end up being quite complex too – which brings into play one of Rissanen’s (1989) concerns, specifically, the mystery of vanishing reliability. For example, my question type scheme for the historical courtroom consists of 15 tags (if we include the tag relating to ‘questions occurring as part of a narrative report’ and the ‘problematic’ tag, which I used when other tags failed to capture the example). This said, this is one aspect of my question and answer scheme (Archer 2005) that could have been incorporated semi-automatically, via the use of search strings which make use of S/V inversion (when used), for example (thereby partly alleviating the ‘time taken’ issue). [11] Mention of Rissanen’s vanishing reliability problem also brings to mind the issue of whether a categorisation scheme for question types should aim to be representative of different activity types – for example, the courtroom and the classroom – or be specific to one activity type only (Section 6.3). I do believe that there is a lot of cross-over when it comes to my scheme(s) – but I have not tested this empirically, beyond the work undertaken by one of my postgraduates in connection with the classroom (Davenport 2006).
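By way of illustration, a semi-automatic first pass of the kind just described might be sketched as follows, in Python. This is a minimal sketch only, and not the procedure used in Archer (2005). It assumes that each utterance is available as a plain string, and it flags as a candidate question any utterance that opens with one of the question forms listed in Table 1 (only a subset is used here). It deliberately does not rely on question marks, given that punctuation was not systematically applied in the period. The sample utterances are invented for the purpose of the example and are not taken from the corpus. Every hit would still require manual checking.

```python
import re

# A subset of the question forms listed in Table 1; extend as required.
QUESTION_FORMS = [
    "how came", "how many", "how much", "how long", "why did not",
    "did not", "do not", "what", "when", "where", "which", "who", "why",
    "how", "did", "do", "was", "were", "is", "are", "can", "could", "would",
]

# Longest forms first, so that "how came" is preferred over "how".
_alternatives = "|".join(
    re.escape(form) for form in sorted(QUESTION_FORMS, key=len, reverse=True)
)
# A candidate opens (after optional whitespace) with one of the forms.
CANDIDATE = re.compile(rf"^\s*({_alternatives})\b", re.IGNORECASE)

def flag_candidates(utterances):
    """Return (utterance, matched form) pairs for later manual checking."""
    flagged = []
    for text in utterances:
        match = CANDIDATE.match(text)
        if match:
            flagged.append((text, match.group(1).lower()))
    return flagged

# Invented utterances, for illustration only.
sample = [
    "How came you to know the prisoner",
    "Did you see him that night",
    "I was at home all that evening",
    "Was it before or after the Fact was committed",
]
candidates = flag_candidates(sample)
```

An utterance-initial match is, of course, a crude proxy for S/V inversion: it will miss questions that open with other material, and it will retrieve some ‘false friends’ (such as exclamatory what or why), which is precisely why the approach can only ever be semi-automatic.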

A second potential issue is whether my own approach to annotation has a specific “theory bias”. Having explored some work within Experimental Pragmatics recently, I was struck by the difference between Neo-Griceans and Relevance Theorists in the evaluation of their results for a forced-choice judgment task (for an outline, see Archer 2011: 474–5). And I have begun to wonder whether my work also shows a theory bias. I know, for example, that a lot of my work assumes that speech acts like requests and questions occurred in the Early Modern English period through to today – what I query is the extent to which they were realised in the same way. This seems to be quite a safe assumption, especially if one accepts the linguistic version of the Uniformitarian Principle (see Labov 1994: 21). But are there times when I am assuming things – because of my training and educational experience – which are not safe to assume? I do not think I am. However, that’s the problem with theoretical biases – we can be blind to things ourselves which are obvious to those who do not share such biases; or even regard something as a fact that turns out to be an interpretation. A remarkable example of this is the recent finding that some things may well travel faster than the speed of light (see the article "Speed-of-light results under scrutiny at Cern" on the BBC News website).

7.2 Issues relating to USAS

Let’s focus now, albeit briefly, on more automatic approaches to annotation, such as the USAS system. The historical version of USAS (the Historical Semantic Tagger) is still very much reliant on the CLAWS POS tagger (which was designed for modern English). Currently, we can envisage three potential ‘ways’ of coping with differences in syntax, when trying to annotate historical texts at the POS level with an acceptable level of accuracy: (i) creating period-specific POS taggers, (ii) amending the existing CLAWS tagger, so that it ‘thinks’ it is dealing with historical data, (iii) amending ‘mistakes’ semi-manually during a post-tagging phase (see also Rayson et al. 2007). We are facing similar issues in regard to the USAS SEMANTIC tagset (as, like the POS tagger, it was designed with general Modern English in mind). Our answer, in this regard, is to map the (232) SEMTAGs to categories within the Historical Thesaurus of the Oxford English Dictionary – so that we have a tool that can apply semantic tags to texts in a way that is much more sensitive to meaning (and meaning change) over time. This has meant developing different lexicon sets (i.e. POS/SEMTAG-encoded word lists and multi word lists) for the Semantic Tagger, and hence will ultimately require the inclusion of a facility which allows users to identify their period of interest. Our intention is that this automated system will also allow users to shape the “new” historical SEMTAGs to some extent – by mapping the SEMTAG/HTOED semantic hybrids to their own (research specific) tagset. This is theoretically possible now – that is to say, there is a means by which users can map the existing SEMTAGs to their own research specific tagset. But without the addition of the HTOED, we cannot really do this in a way that is historically relevant, and hence useful for historical linguists.
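The idea of users mapping SEMTAGs to their own research-specific tagset can be sketched very simply in Python. The following is an assumption-laden illustration and does not describe how Wmatrix or the Historical Semantic Tagger implements such a facility. The SEMTAG codes (F1, H4, B5, G1, Q2.2) are ones named in this article; the right-hand labels, the function name remap and the (word, SEMTAG) pairs are all invented for the example and do not represent real tagger output.

```python
# SEMTAG codes as named in this article (F1 food, H4 housing, B5 clothes,
# G1 government). The right-hand labels are an invented user tagset.
USER_TAGSET = {
    "F1": "DOMESTIC",
    "H4": "DOMESTIC",
    "B5": "DOMESTIC",
    "G1": "GOVERNANCE",
}

def remap(tagged_tokens, mapping, default="UNMAPPED"):
    """Replace each token's SEMTAG with the user's own label.

    Tags absent from the mapping receive `default`, so that gaps in the
    user's scheme remain visible rather than being silently dropped.
    """
    return [(word, mapping.get(tag, default)) for word, tag in tagged_tokens]

# Invented (word, SEMTAG) pairs, for illustration only.
tagged = [
    ("bread", "F1"),
    ("coat", "B5"),
    ("Parliament", "G1"),
    ("promise", "Q2.2"),
]
remapped = remap(tagged, USER_TAGSET)
```

Keeping unmapped tags visible in this way is one means of ensuring that the user's own categories do not quietly hide the data they were not designed to capture.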

As an example of what may be possible in the near future, I offer my work in using the SEMTAGs and HTOED combined to investigate verbal aggression within the courtroom. [12] This work is very much in its early stages. The aim, however, is to plot an aggression space across time, making use of data-driven, research-specific tags once I have found relevant sections of text using a combination of SEMTAGs/HTOED categories and, where opportune, key word and key domain results. My initial investigations suggest that the SEMTAG category Q2.2 (which relates to speech acts), when it co-occurs with the tags A5.1- (relating to negative evaluation), E3- (relating to violence or anger), S1.2.4- (relating to impoliteness) and S7.2- (relating to a lack of respect), is likely to prove useful and hence worthy of specific exploration (see Archer forthcoming for a more detailed discussion). What I have yet to determine is how best to map these particular SEMTAGs to the HTOED categories which capture verbal aggression. These categories include one which relates to ‘Putting forward for discussion’ (and is to be found inside ‘Debate/argument’), and ‘Insult’ (which is found under ‘Disrespect’ and ‘Contempt’). There is also a HTOED category, under ‘Conduct’, which specifically relates to impoliteness – ‘Discourtesy’.
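The co-occurrence search described above might be sketched as follows, in Python. This is a minimal illustration under stated assumptions, and not the procedure reported in Archer (forthcoming). It assumes that each utterance has already been reduced to the list of SEMTAGs assigned to its words, and that tags are matched exactly as written (so any variant polarity marking would need to be normalised first). The utterance identifiers and tag lists are invented for the example; the function name aggression_candidates is hypothetical.

```python
# The speech act tag and the four negatively marked tags named above.
SPEECH_ACT = "Q2.2"
AGGRESSION_CUES = {"A5.1-", "E3-", "S1.2.4-", "S7.2-"}

def aggression_candidates(utterances):
    """Return (utterance id, cues found) wherever Q2.2 co-occurs with a cue."""
    results = []
    for utt_id, tags in utterances.items():
        tag_set = set(tags)
        cues = tag_set & AGGRESSION_CUES
        if SPEECH_ACT in tag_set and cues:
            results.append((utt_id, sorted(cues)))
    return results

# Invented example: utterance id -> SEMTAGs assigned to its words.
tagged_utterances = {
    "u1": ["Q2.2", "S7.2-", "M1"],
    "u2": ["Q2.2", "A1.1.1"],
    "u3": ["E3-", "A5.1-"],
    "u4": ["Q2.2", "E3-", "A5.1-"],
}
hits = aggression_candidates(tagged_utterances)
# u2 has no cue and u3 has no Q2.2, so only u1 and u4 are returned.
```

The utterances retrieved in this way are candidates only: they identify sections of text worth reading closely, in their wider co-text and context, before any research-specific tag is applied.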

Notice how it is difficult to do any one-to-one mapping when connecting these tags. What we have to decide, then, is whether it is best to map from the HTOED to SEMTAGs or vice versa, from SEMTAGs to the HTOED categories.

8. Annotation – a welcome addition

I began this paper by asking, somewhat controversially, whether annotation may take us no further than our starting point – that is to say, adds nothing that is not already there, in the data, if one takes the time and puts in the effort to look closely. I want to explicitly reiterate, in closing, what you have no doubt inferred from my paper: that I believe annotation schemes do take us on a journey, which is almost cyclical in many ways; that is to say, it will involve a recurring sequence of interpretation and evaluation. Yet, as my own experience testifies, the annotation I have used, and indeed the tools I have helped to develop, have always enabled me to see things in new ways. To expand on Escher’s “ascending and descending” (visual) metaphor, there is a point at which the stairs begin to move for me so that I am no longer in the loop, treading the same ground, but moving upwards – or downwards – or even outwards.

My answer to the question I pose in the title, then, is that annotation is a welcome addition. This said, it is important that we take seriously – so that we can address – the criticisms of Sinclair and, especially, the concerns of Rissanen. For the annotating process can be a perilous activity – not least because it is a form of interpretation and hence requires us to make explicit why we are interpreting our texts in a certain way (Leech 1993). We also need to remain alert to the problems inherent in trusting our annotations – or, indeed, our texts – completely (cf. Sinclair 2004); and to understand how texts are shaped by their authors and editors and, importantly, the period in which they were produced (and sometimes re-produced).

Another question I might have asked – but have not until this point – is whether we can do without annotation? The answer to that question would be “yes” for most researchers. It is also “yes” for me, on occasion, for my use of annotation often depends on my research focus. This said, I do not believe that I would have found out what I have, especially in regard to the historical courtroom, as easily or as quickly without annotation, or the tools I have exploited. My use of annotation, moreover, has helped me to know my data well – and to think about asking questions of that data that seemed contrary to what had been written in regard to this particular activity type (see, especially, my point regarding the use of directives being linked to status beyond the historical courtroom).

I believe it worth mentioning a caveat, at this juncture, for those of us who do make use of annotation and/or corpus linguistic tools; namely, that those schemes and tools will likely require further improvement, expansion and, possibly, transformation (and replacement?) in the future. Given such a caveat, some might ask whether the journey is still worth it, in practice? For me, the answer is a resoundingly positive one.


[1] The value of annotation is further echoed by McEnery and Wilson’s (2001: 32) statement that ‘a corpus, when annotated’ becomes ‘a repository of linguistic information’ which, rather than being left implicit, is ‘made explicit’, and their later proposal that such ‘concrete annotation’ makes retrieving and analysing information contained in the corpus quicker and easier.

[2] As Appendix 1 reveals, the SEMTAGs (as they are known) relate to semantic fields such as “food” [F1], “housing” [H4], “clothes” [B5], “criminality” [G2.2-], “government” [G1], etc.

[3] My own experience suggests that researchers who work with historical data or with learner corpora are more willing to discuss their manual amendments of (incorrectly assigned) annotation. The value of such publications should not be underestimated by those who focus on theoretical definitions for POS.

[4] These techniques have been used by researchers involved in the With Criminal Intent project, for example.

[5] Justifications for this type of ad-hoc approach include cost implications (if you are fortunate enough to be able to employ people to annotate texts on your behalf), as well as making the most efficient use of the researcher's time.

[6] Rissanen (1989) also draws our attention to the fact that any comments we can make regarding a given text may prove to be specific to that text only, in the final instance – so that we avoid the error of thinking one text provides sufficient evidence to make generalisations about the language period or activity type it is said to represent.

[7] This is something that Cecconi has done in respect to my argument that powerless as well as powerful people used directives in the courtroom, for example (see, e.g., Cecconi 2011).

[8] This is currently being undertaken: (1) by Lutzky (2011), who is using the SPC status categories to study the discourse markers, why and what, in Early Modern English data, and (2) by Rama-Martinez (2011), who is using the question and answer categories highlighted here to study the eighteenth- and nineteenth-century trial data sources from Old Bailey Proceedings Online.

[9] Specific examples included the following: “we must go to the city”; “I must speak to you”.

[10] Further work needs to be done to determine whether these – and other directive types found within our dataset – are specific to particular time periods and/or particular activity types.

[11] I say semi-automatically, as punctuation was not systematically applied at this time.

[12] Like Rayson (2008), I am adopting a combination of both corpus-driven and corpus-based approaches. Yet it is also a step beyond this, in the sense that my intention is not to stay with the pre-existing tags (which, in a historical context at least, may not best capture a given dataset). In some ways, I am addressing Sinclair’s issue of a corpus-based researcher not being able to be as sensitive to her dataset as s/he needs to be (cf. Section 6).


M.C. Escher's Ascending and Descending on Wikipedia:

Website for the history of VARD 2:

Wmatrix Tools homepage: Link to the online interface (username and password required):

"Speed-of-light results under scrutiny at Cern" by Jason Palmer, BBC News:


Archer, D. 2005. Historical Sociopragmatics: Questions and Answers in the English Courtroom (1640-1760). Pragmatics and Beyond New Series. Amsterdam & Philadelphia: John Benjamins.

Archer, D. 2007. “Developing a more detailed picture of the Early Modern English courtroom: Data and methodological issues facing historical pragmatics”. Methods in Historical Pragmatics. Recovering speaker meaning and reader inference, ed. by S.M. Fitzmaurice & I. Taavitsainen, 185–218. Mouton.

Archer, D. 2011. “Theory and Practice in Pragmatics”. The Pragmatics Reader, ed. by D. Archer & P. Grundy, 471–81. Routledge.

Archer, D. (forthcoming). “Exploring pragmatic phenomena in English historical texts using USAS: the possibilities and the problems”. Historical Corpus Pragmatics, ed. by Irma Taavitsainen & A. H. Jucker. Amsterdam & Philadelphia: John Benjamins.

Archer, D. & J. Culpeper. 2003. “Sociopragmatic annotation: New directions and possibilities in historical corpus linguistics”. Corpus Linguistics by the Lune: Studies in honour of Geoffrey Leech ed. by A. Wilson, P. Rayson, & T. McEnery, 37–58. Frankfurt: Peter Lang.

Archer, D. & P. Rayson. 2004. “Using the UCREL automated semantic analysis system to investigate differing concerns in refugee literature.” Presentation as part of the “One Month Challenge” Refugee Workshop, King's College London, February 2004.

Archer, D., K. Aijmer, & A. Wichmann. 2012. Advanced Pragmatics Textbook. Oxon & New York: Routledge.

Austin, J.L. 1962. How To Do Things With Words: the William James Lectures delivered at Harvard University in 1955. Oxford: Oxford University Press

Baker, P. 1997. “Consistency and accuracy in correcting automatically tagged corpora”. Corpus Annotation: Linguistic Information From Computer Text Corpora, ed. by R. Garside, G. Leech & A. McEnery, 243–250. London: Longman.

Baron, A., P. Rayson, & D. Archer. 2009. “Word frequency and key word statistics in historical corpus linguistics”. Journal of English Studies 20(1): 41–67.

Baron, A., P. Rayson, & D. Archer. 2011. “Quantifying Early Modern English spelling variation: Change over time and genre”. Paper presented at the Conference on New Methods in Historical Corpora, University of Manchester, April 29–30, 2011.

Blum-Kulka, S., J. House, & G. Kasper. 1989. “The CCSARP coding manual”. Cross-cultural pragmatics: requests and apologies, ed. by S. Blum-Kulka, J. House and G. Kasper, 273–94. Norwood, NJ: Ablex.

Breitman, K.K., M.A. Casanova, & W. Truszkowski. 2007. Semantic Web: Concepts, Technologies and Applications. London: Springer-Verlag.

Cecconi, E. 2011. “Power confrontation and verbal duelling in the arraignment section of XVII century trials”. Journal of Politeness Research 7(1): 101–22.

Culpeper, J., & D. Archer. 2008. “Requests and directness in Early Modern English trial proceedings and play texts, 1640-1760”. Speech Acts in the History of English, ed. by A. Jucker & I. Taavitsainen, 45–84. Amsterdam & Philadelphia: John Benjamins.

Davenport, J. 2006. Pupil Engagement in the Questioning Process During Numeracy Problem-Solving Sessions. MA (by research). University of Central Lancashire.

Deegan, M., H. Short, D. Archer, P. Baker, T. McEnery, & P. Rayson. 2004. “Computational Linguistics meets Metadata, or the automatic extraction of key words from full text content”. Online posting.

Escher, M.C. 1960. Ascending and Descending. Lithograph print.

Hickey, R. 1995. Tracing the Trail of Time: Proceedings of the Second Diachronic Corpora Workshop. New College, Toronto: University of Toronto.

Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.

Jucker, A.H. & I. Taavitsainen, eds. 2008. Speech Acts in the History of English. Amsterdam & Philadelphia: John Benjamins.

Jurafsky, D. & J.H. Martin. 2009. Speech and language processing: an introduction to natural language processing, computational linguistics and speech recognition. Prentice Hall.

Kohnen, T. 2002. “Methodological problems in corpus based historical pragmatics. The case of English directives”. Language and Computers: Advances in Corpus Linguistics. Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23) Göteborg 22–26 May, ed. by K. Aijmer & B. Altenberg, 237–247. Amsterdam: Rodopi.

Kytö, M. & T. Walker. 2003. “The Linguistic Study of Early-Modern English Speech-Related Texts: How ‘Bad’ Can ‘Bad’ Data Be?”. Journal of English Linguistics 31(3): 221–48.

Labov, W. 1994. Principles of Linguistic Change: Internal Factors. Wiley-Blackwell.

Leech, G. 1983. Principles of Pragmatics. London: Longman.

Leech, G. 1992. “Corpus linguistics and theories of linguistic performance”. Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4–8 August 1991, ed. by J. Svartvik, 105–122. Berlin: Mouton.

Leech, G. 1993. “Corpus Annotation Schemes”. Literary and Linguistic Computing 8(4): 275–281

Leech, G. 1997. “Introducing Corpus Annotation”. Corpus Annotation: Linguistic Information From Computer Text Corpora, ed. by R. Garside, G. Leech & A. McEnery, 1–18. London: Longman.

Leech, G. 2005. “Adding Linguistic Annotation”. Developing Linguistic Corpora: a Guide to Good Practice, ed. by M. Wynne, 17–29. Oxford: Oxbow Books.

Leech, G. & M. Weisser. 2003. “Generic speech act identification for task-oriented dialogues”. Proceedings of the Corpus Linguistics 2003 Conference, ed. by D. Archer, T. McEnery, P. Rayson, & A. Wilson. Lancaster University: UCREL Technical Papers.

Levinson, S.C. 1992 [1979]. “Activity types and language”. Talk at Work, ed. by P. Drew & J. Heritage, 66–100. Cambridge: Cambridge University Press.

Lutzky, U. 2011. “Why, what do you take me for a Ghost, Sir?” Paper presented at the Helsinki Corpus Festival: The Past, Present and Future of English Historical Corpora, 28 September–2 October 2011.

McEnery, T. & A. Hardie. 2011. Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press.

McEnery, T. & A. Wilson. 2001. Corpus Linguistics. Edinburgh: Edinburgh University Press.

Nunnally, T.E. 1992. “Man’s son/son of man”. History of Englishes: New Methods and Interpretations in Historical Linguistics, ed. by M. Rissanen, O. Ihalainen, T. Nevalainen, & I. Taavitsainen, 359–72. Berlin: Mouton.

Rama-Martinez, E. 2011. “On the dynamics of (cross-)examination in the Old Bailey courtroom (1760-1860)”. Paper presented at the Helsinki Corpus Festival: The Past, Present and Future of English Historical Corpora, 28 September–2 October 2011.

Rayson, P. 2003. Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Ph.D. dissertation, Lancaster University.

Rayson, P. 2008. “From key words to key semantic domains”. International Journal of Corpus Linguistics. 13(4): 519–549.

Rayson, P., D. Archer, S.L. Piao, & T. McEnery. 2004. “The UCREL semantic analysis system”. Proceedings of the workshop on Beyond Named Entity Recognition Semantic labelling for NLP tasks in association with 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 25 May 2004, 7–12

Rayson, P., D. Archer, A. Baron, J. Culpeper, & N. Smith. 2007. “Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora”. Proceedings of the Corpus Linguistics Conference: CL2007, ed. by M. Davies, P. Rayson, S. Hunston, & P. Danielsson, 27–30 July 2007. University of Birmingham, UK.

Rissanen, M. 1989. “Three problems associated with the use of diachronic corpora”. ICAME Journal 13: 16–19.

Scott, M. 2008. WordSmith Tools. Version 5. Liverpool: Lexical Analysis Software.

Sinclair, J. 1994. “Trust the text”. Advances in written text analysis, ed. by M. Coulthard, 12–25. Routledge.

Sinclair, J. 2004. Trust the Text: Language, Corpus and Discourse. London: Routledge.

Stenström, A-B. 1984. “Discourse Tags”. Corpus Linguistics: Recent Developments in the Use of Computer Corpora in English Language Research, ed. by J. Aarts & W. Meijs. Amsterdam: Rodopi.

Tognini-Bonelli, E. 2001. Corpus Linguistics at Work. Benjamins.



A1 General | I1 Money generally | S1.1.2 Reciprocity
A1.1.1 General actions, making etc. | I1.1 Money: Affluence | S1.1.3 Participation
A1.1.2 Damaging and destroying | I1.2 Money: Debts | S1.1.4 Deserve etc.
A1.2 Suitability | I1.3 Money: Price | S1.2 Personality traits
A1.3 Caution | I2 Business | S1.2.1 Approachability and Friendliness
A1.4 Chance, luck | I2.1 Business: Generally | S1.2.2 Avarice
A1.5 Use | I2.2 Business: Selling | S1.2.3 Egoism
A1.5.1 Using | I3 Work and employment | S1.2.4 Politeness
A1.5.2 Usefulness | I3.1 Work and employment: Generally | S1.2.5 Toughness; strong/weak
A1.6 Physical/mental | I3.2 Work and employment: Professionalism | S1.2.6 Sensible
A1.7 Constraint | I4 Industry | S2 People
A1.8 Inclusion/Exclusion | K ENTERTAINMENT, SPORTS & GAMES | S2.1 People: Female
A1.9 Avoiding | K1 Entertainment generally | S2.2 People: Male
A2 Affect | K2 Music and related activities | S3 Relationship
A2.1 Affect: Modify, change | K3 Recorded sound etc. | S3.1 Relationship: General
A2.2 Affect: Cause/Connected | K4 Drama, the theatre & show business | S3.2 Relationship: Intimate/sexual
A3 Being | K5 Sports and games generally | S4 Kin
A4 Classification | K5.1 Sports | S5 Groups and affiliation
A4.1 Generally kinds, groups, examples | K5.2 Games | S6 Obligation and necessity
A4.2 Particular/general; detail | K6 Children’s games and toys | S7 Power relationship
A5 Evaluation | L LIFE & LIVING THINGS | S7.1 Power, organizing
A5.1 Evaluation: Good/bad | L1 Life and living things | S7.2 Respect
A5.2 Evaluation: True/false | L2 Living creatures generally | S7.3 Competition
A5.3 Evaluation: Accuracy | L3 Plants | S7.4 Permission
A5.4 Evaluation: Authenticity | M MOVEMENT, LOCATION, TRAVEL & TRANSPORT | S8 Helping/hindering
A6 Comparing | M1 Moving, coming and going | S9 Religion and the supernatural
A6.1 Comparing: Similar/different | M2 Putting, taking, pulling, pushing, transporting &c. | T TIME
A6.2 Comparing: Usual/unusual | M3 Movement/transportation: land | T1 Time
A6.3 Comparing: Variety | M4 Movement/transportation: water | T1.1 Time: General
A7 Definite (+ modals) | M5 Movement/transportation: air | T1.1.1 Time: General: Past
A8 Seem | M6 Location and direction | T1.1.2 Time: General: Present; simultaneous
A9 Getting and giving; possession | M7 Places | T1.1.3 Time: General: Future
A10 Open/closed; Hiding/Hidden; Finding; Showing | M8 Remaining/stationary | T1.2 Time: Momentary
A11 Importance | N NUMBERS & MEASUREMENT | T1.3 Time: Period
A11.1 Importance: Important | N1 Numbers | T2 Time: Beginning and ending
A11.2 Importance: Noticeability | N2 Mathematics | T3 Time: Old, new and young; age
A12 Easy/difficult | N3 Measurement | T4 Time: Early/late
A13 Degree | N3.1 Measurement: General | W THE WORLD & OUR ENVIRONMENT
A13.1 Degree: Non-specific | N3.2 Measurement: Size | W1 The universe
A13.2 Degree: Maximizers | N3.3 Measurement: Distance | W2 Light
A13.3 Degree: Boosters | N3.4 Measurement: Volume | W3 Geographical terms
A13.4 Degree: Approximators | N3.5 Measurement: Weight | W4 Weather
A13.5 Degree: Compromisers | N3.6 Measurement: Area | W5 Green issues
A13.6 Degree: Diminishers | N3.7 Measurement: Length & height | X PSYCHOLOGICAL ACTIONS, STATES & PROCESSES
A13.7 Degree: Minimizers | N3.8 Measurement: Speed | X1 General
A14 Exclusivizers/particularizers | N4 Linear order | X2 Mental actions and processes
A15 Safety/Danger | N5 Quantities | X2.1 Thought, belief
B THE BODY & THE INDIVIDUAL | N5.1 Entirety; maximum | X2.2 Knowledge
B1 Anatomy and physiology | N5.2 Exceeding; waste | X2.3 Learn
B2 Health and disease | N6 Frequency etc. | X2.4 Investigate, examine, test, search
B3 Medicines and medical treatment | O SUBSTANCES, MATERIALS, OBJECTS & EQUIPMENT | X2.5 Understand
B4 Cleaning and personal care | O1 Substances and materials generally | X2.6 Expect
B5 Clothes and personal belongings | O1.1 Substances and materials generally: Solid | X3 Sensory
C ARTS & CRAFTS | O1.2 Substances and materials generally: Liquid | X3.1 Sensory: Taste
C1 Arts and crafts | O1.3 Substances and materials generally: Gas | X3.2 Sensory: Sound
E EMOTIONAL ACTIONS, STATES & PROCESSES | O2 Objects generally | X3.3 Sensory: Touch
E1 General | O3 Electricity and electrical equipment | X3.4 Sensory: Sight
E2 Liking | O4 Physical attributes | X3.5 Sensory: Smell
E3 Calm/Violent/Angry | O4.1 General appearance and physical properties | X4 Mental object
E4 Happy/sad | O4.2 Judgement of appearance (pretty etc.) | X4.1 Mental object: Conceptual object
E4.1 Happy/sad: Happy | O4.3 Colour and colour patterns | X4.2 Mental object: Means, method
E4.2 Happy/sad: Contentment | O4.4 Shape | X5 Attention
E5 Fear/bravery/shock | O4.5 Texture | X5.1 Attention
E6 Worry, concern, confident | O4.6 Temperature | X5.2 Interest/boredom/excited/energetic
F1 Food | P1 Education in general | X7 Wanting; planning; choosing
F3 Cigarettes and drugs | Q1 Communication | X9 Ability
F4 Farming & Horticulture | Q1.1 Communication in general | X9.1 Ability: Ability, intelligence
G GOVT. & THE PUBLIC DOMAIN | Q1.2 Paper documents and writing | X9.2 Ability: Success and failure
G1 Government, Politics & elections | Q1.3 Telecommunications | Y SCIENCE & TECHNOLOGY
G1.1 Government etc. | Q2 Speech acts | Y1 Science and technology in general
G1.2 Politics | Q2.1 Speech etc: Communicative | Y2 Information technology and computing
G2 Crime, law and order | Q2.2 Speech acts | Z NAMES & GRAMMATICAL WORDS
G2.1 Crime, law and order: Law & order | Q3 Language, speech and grammar | Z0 Unmatched proper noun
G2.2 General ethics | Q4 The Media | Z1 Personal names
G3 Warfare, defence and the army; Weapons | Q4.1 The Media: Books | Z2 Geographical names
H ARCHITECTURE, BUILDINGS, HOUSES & THE HOME | Q4.2 The Media: Newspapers etc. | Z3 Other proper names
H1 Architecture, kinds of houses & buildings | Q4.3 The Media: TV, Radio & Cinema | Z4 Discourse Bin
H2 Parts of buildings | S SOCIAL ACTIONS, STATES & PROCESSES | Z5 Grammatical bin
H3 Areas around or near houses | S1 Social actions, states & processes | Z6 Negative
H4 Residence | S1.1 Social actions, states & processes | Z7 If
H5 Furniture and household fittings |  | Z8 Pronouns etc.
 |  | Z9 Trash can
 |  | Z99 Unmatched
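
The tag codes in the table above appear to encode their own hierarchy: a capital letter names the top-level field, and each dotted number narrows the category (so S1.2.4 Politeness sits under S1.2 Personality traits, which sits under S1, which sits under S). The following is a minimal sketch of that reading in Python. The function name `tag_ancestry` and the regular expression are illustrative assumptions made here; they are not part of any tagger described in this article, and the sketch assumes that every code follows the letter-plus-dotted-numbers pattern seen in the table.

```python
import re

# Assumed shape of a tag code, read off the table: one capital letter
# (the top-level field), optionally followed by a number and further
# dot-separated subdivisions, e.g. "T", "Z99", "A1.1.1".
TAG_PATTERN = re.compile(r"^([A-Z])(\d+(?:\.\d+)*)?$")


def tag_ancestry(tag):
    """Return the chain of codes from the top-level field down to `tag`.

    Hypothetical helper for illustration only.
    """
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a recognisable tag code: {tag!r}")
    field, numbers = match.groups()
    chain = [field]
    if numbers:
        parts = numbers.split(".")
        # Rebuild each successively longer prefix: S1, S1.2, S1.2.4, ...
        for depth in range(1, len(parts) + 1):
            chain.append(field + ".".join(parts[:depth]))
    return chain


print(tag_ancestry("S1.2.4"))  # ['S', 'S1', 'S1.2', 'S1.2.4']
```

A lookup of this kind would allow an analyst to group retrieved items at whatever level of granularity suits the research question (field, category or subcategory) rather than only at the level at which the tag happened to be assigned.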