ICAME-27 Pre-Conference Workshop
Helsinki, 24 May 2006, 2–6 pm

Plenary

Playing tag with category boundaries

ICAME-27, Workshop on Corpus Annotation
David Denison
University of Manchester/Université Paris 3
david.denison@manchester.ac.uk

Lexical categories (word classes) play an important role in many branches of linguistics and in many theoretical approaches. Traditional grammar tends to work with a fixed number of categories on (roughly) a Latin model, albeit with some dispute as to the precise number and the most suitable labels for them (cf. for instance Huddleston & Pullum 2002: 22 and passim). Many modern theories also take an axiomatic view of categories, regarding them as givens on which the syntax is constructed. On the other hand, there is a large literature on the underlying typology and semantics of categories, and in some cases even generative grammars offer the possibility of feature marking to soften the crude generalisations offered by partitioning an entire vocabulary among some half a dozen lexical categories.

However categories are defined, problems at the boundaries are well known. They include items which seem to straddle two adjacent categories; items whose category membership is unclear; items undergoing lexicalisation whose actual extent is unclear, let alone the category of the string involved. Many such problematic areas of synchronic analysis are equivalent to transitions which lexical items have made in historical time. One aim of my current work is to classify the various kinds of category boundary problem which arise in the analysis of English data, especially in historical change. For the purposes of this abstract I mention a selection of examples, starting with the transition from N to A beginning to be seen in a word like draft:

(1) This is really quite draft at the moment (draft is A)
(2) She might have seen them (have = V)
(3) She might of seen them (of = ?Adv or V or even P)
(4) If I'd've known ('ve = ?Adv)
(5) His various proposals (various = A)
(6) Various of the proposals (various = D)
(7) wæs toweard to alysenne ealne middangeard
'was about to redeem all earth'
(to is P)
(8) yet be such workys … apte to corrupt
and infecte the reder
(to is infinitive marker)
(9) I don't want to (to is M)
(10) One sort of red wine (sort is N, of is P)
(11) Those sort of people (sort of is D)
(12) I sort of resent it (sort of is Adv)

I think that all of the provisional category assignments made in (1)–(12) are defensible, if in some cases eccentric, though I would not necessarily want to defend them: the examples are meant to illustrate the difficulties of assigning categories to the most ordinary language. They also cover a range of structural situations, from cases where there are only two possible categorisations and the structural parse would not change (node labels apart) according to which is chosen, to others where more is at stake than a mere node label, as both category and structure depend crucially on the analysis selected.

Tagsets can be seen as an attempt to marry (some version of) categorisation with the practical aim of offering efficient retrieval of corpus data. Accordingly, tags are typically a superset of the standard categories, so that data sharing a common tag are often a subset of some standard category (proper nouns, for instance, or comparative adjectives). The finer classes of a tagset may be explicitly organised to show their hierarchical relationships.
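The roll-up from fine tags to standard categories can be sketched as a simple mapping table. The following is a minimal illustration: the tag mnemonics are modelled loosely on CLAWS-style tags (NN1, NP0, AJC, …), but the mapping itself is a hypothetical example, not any actual corpus tagset.

```python
# Illustrative sketch: grouping fine-grained tags under coarse categories.
# The mnemonics are loosely CLAWS-style; the mapping is invented for illustration.
COARSE = {
    "NN1": "noun", "NN2": "noun", "NP0": "noun",          # singular, plural, proper
    "AJ0": "adjective", "AJC": "adjective", "AJS": "adjective",
    "VVB": "verb", "VVD": "verb",
}

def group_by_category(tagged_tokens):
    """Collect tokens by the coarse category their fine tag rolls up to."""
    groups = {}
    for word, tag in tagged_tokens:
        groups.setdefault(COARSE.get(tag, "other"), []).append(word)
    return groups

tokens = [("proposals", "NN2"), ("various", "AJ0"), ("Helsinki", "NP0")]
print(group_by_category(tokens))
# {'noun': ['proposals', 'Helsinki'], 'adjective': ['various']}
```

A user who disagrees with a tagging decision can, in principle, regroup the data simply by editing such a mapping, provided the fine tags themselves are applied consistently.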

Now uncertainty is clearly not welcome in tagging: not only do decisions have to be taken, but automatic taggers have to be guided to assign the categories approved by their human masters for the words in particular patterns. Well-structured corpora make their annotation principles entirely explicit, so that a user who wishes to group data differently — that is, follow a different linguistic analysis from that of the corpus constructors — can generally retrieve the data needed. In this paper I propose to look at some of the tagsets and taggings of familiar corpora in the light of the kinds of troublesome (or at least historically unstable) data discussed above. How well do they cope?

Papers

► Dawn Archer (University of Central Lancashire)

Annotating historical texts

There has been a rapid growth in historical digital resources over the past decade. This paper highlights practical issues that need to be overcome if we are to add semantic/pragmatic annotation to such resources, before discussing the modifications made to the UCREL Semantic Annotation System, a tool that adds part-of-speech and semantic field information to modern English texts (written and spoken), so that texts dating from Shakespeare to the present day can now be annotated automatically.

► Joan Beal (University of Sheffield), Karen Corrigan (University of Newcastle), Paul Rayson (University of Lancaster) and Nicholas Smith (University of Lancaster)

Writing the vernacular: Transcribing and tagging the Newcastle Electronic Corpus of Tyneside English (NECTE)

The NECTE corpus presented a number of problems not encountered by those producing corpora of standard varieties. The primary material consisted of audio recordings which needed to be orthographically transcribed and grammatically tagged. Preston (1985), Macaulay (1991), Kirk (1997) and Beal (2005) all note that representing vernacular Englishes orthographically, e.g. by using 'eye dialect', can be problematic on various levels. Apart from unwelcome negative racial or social connotations, there are theoretical objections to devising non-standard spellings to represent certain groups of vernacular speakers, since these make their speech appear more differentiated from mainstream colloquial varieties than is warranted. In the first half of this paper, we outline the principles and methods adopted in devising an Orthographic Transcription Protocol for a vernacular corpus, and the challenges faced by the NECTE team in practice. Protocols for grammatical tagging have likewise been devised with standard varieties in mind. In the second half, we relate how existing tagging software (Garside & Smith 1997) was adapted to take account of the non-standard grammar of Tyneside English.

References

Beal, J. C. 2005. "Dialect representation in texts". The Encyclopedia of Language and Linguistics, 2nd edition, 531-538. Amsterdam: Elsevier.
Garside, Roger & Nicholas Smith. 1997. "A hybrid grammatical tagger: CLAWS4". Corpus Annotation: Linguistic Information from Computer Text Corpora, ed. by Roger Garside, Geoffrey Leech & Anthony McEnery, 102-121. London: Longman.
Kirk, John. 1997. "Irish-English and contemporary literary writing". Focus on Ireland, ed. by Jeffrey L. Kallen, 190-205. Amsterdam: John Benjamins.
Macaulay, Ronald K. S. 1991. "Coz it izny spelt when they say it: displaying dialect in writing". American Speech 66: 280-291.
Preston, Dennis R. 1985. "The Li'l Abner syndrome: Written representations of speech". American Speech 60(4): 328-336.

► Ylva Berglund (Oxford University Computing Services)

'Why is it full of funny characters?' — converting the BNC into XML

When the BNC was released, over ten years ago, its annotation was one of the aspects that set the corpus apart from other generally available corpora at the time. The corpus was marked up for linguistic features (part of speech) and provided with extra-linguistic information (such as details of speakers, text type, target audience and situational context) and much more. This information was supplied in Corpus Document Interchange Format (an application of SGML) following the TEI guidelines, to make the corpus usable with any SGML-compliant software.

Over the years since the corpus was first made available, many users have benefited from the corpus annotation. At the same time, many have complained about it. Many complaints relate to the fact that users find the corpus texts 'full of funny characters' and cannot easily read them without a specialist tool. Although the corpus was delivered with such a tool (SARA), it did not meet every user's needs. Since other suitable SGML-compliant software was rare, the potential of the annotation was not fully realised.

Since the release of the corpus, work has been carried out to improve aspects of the search tool. At the same time, work has been done to make it easier to use the corpus with other tools (without losing the information provided through the annotation). The result is a version of the BNC in XML, a format that is becoming standard in many contexts.

This paper will present the reasoning behind converting the BNC into XML, weighing the perceived benefits and drawbacks of the format for use in corpus annotation. It will also discuss aspects of the conversion work, such as unexpected problems and successful solutions, and suggest what the change will mean for the end user.

► Magnus Huber and Sumithra Velupillai (Justus-Liebig-Universität Giessen)

Identifying and annotating spoken English in the Proceedings of the Old Bailey

The Proceedings of the Old Bailey, London's central criminal court, constitute the largest body of texts detailing the lives of non-elite people between 1674 and 1834. They contain over 100,000 trials, totalling ca. 52 million words. XML-encoded transcripts of the Old Bailey Proceedings were obtained from Robert Shoemaker (Head of Department of History, Humanities Research Institute, University of Sheffield) and Tim Hitchcock (Department of History and Social Sciences, Head of Arts and Humanities Research Institute, University of Hertfordshire). Our paper will focus on three steps in the annotation and further refinement of this previously tagged corpus: 1. automatic localisation and XML-tagging of stretches of direct speech, 2. sociolinguistic mark-up based on socio-biographical speaker data found in the context, and 3. development of a software tool geared to the linguistic and sociolinguistic analysis of the Old Bailey Corpus.

For Step 1, we created a Perl script that automatically inserts speech tags in the Old Bailey Corpus. We manually searched for patterns that indicate speech sections, such as names or question-answer sequences. These patterns were used to create regular expressions that capture indicators of direct speech in the corpus and add tags in the appropriate position. The regular expressions capture sequences of direct speech from 1732 onwards, with ca. 40 million words falling within the generated speech tags.
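The tagging step described above can be sketched roughly as follows. This is a minimal illustration in Python (the project itself used Perl with a much richer pattern set), and it assumes, hypothetically, that direct speech is introduced by "Q."/"A." turn markers of the kind found in many of the later Proceedings; the `<sp>` element and its attribute are likewise invented for the sketch.

```python
import re

# Hypothetical sketch: one of the patterns indicating direct speech is a
# question-answer sequence introduced by "Q." or "A." at the start of a line.
TURN = re.compile(r"^(Q\.|A\.)\s*(.+)$", re.MULTILINE)

def tag_speech(text):
    """Wrap each Q./A. turn in an <sp> element, recording the turn type."""
    def wrap(match):
        role = "question" if match.group(1) == "Q." else "answer"
        return f'<sp type="{role}">{match.group(2)}</sp>'
    return TURN.sub(wrap, text)

sample = "Q. Did you see the prisoner?\nA. I did, at the Bell tavern."
print(tag_speech(sample))
```

The real script would need many more such patterns (witness names, reported-speech formulae), which is why the manual pattern-collection stage precedes the automatic tagging.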

Semi-automatic sociolinguistic mark-up of the material in the speech tags in Step 2 is performed with the help of an interactive tool programmed in Delphi. This automatically scans the corpus for names, generates unique speaker-IDs for each of these and creates a dataset of previously annotated socio-biographical characteristics for each speaker. This dataset can be edited and refined to include information that was not captured by the original corpus compilers at the universities of Sheffield and Hertfordshire. The program then cycles through the corpus, allowing the insertion of speaker-IDs in the speech tags.

While the software used in Steps 1 and 2 is meant for the corpus creators, the tool produced in Step 3 is directed towards the corpus user. Most corpus software (e.g. WordSmith) only offers the possibility of searching for a given text string in the entire text(s). By contrast, our tool allows the user to search in spoken language only and to limit hits by choosing any combination of sociolinguistic attributes of the speakers. For example, one can search for the which as used by males and females, social classes 1 and 2 only, 25+ years of age, in the years 1740-1760, aggregated in 5-year steps. Users will also be able to set up their own categories, e.g. by creating social classes through allocating professions to each class or by specifying the time periods for which hits should be aggregated (e.g. 1-year or 15-year steps). Results are output in a table that can be exported for further processing. This tool represents a considerable, time-saving improvement over existing software.
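The query logic of such a tool can be sketched as a filter-then-bin operation. The sketch below is hypothetical: the field names (cls, age, year) are invented for illustration, and the actual tool was written in Delphi with a richer attribute set.

```python
from collections import Counter

def count_hits(hits, classes=None, min_age=None, years=None, step=5):
    """Count hits passing every supplied filter, binned into step-year periods.

    Each hit is a dict of (invented) speaker attributes: cls, age, year.
    """
    bins = Counter()
    for h in hits:
        if classes is not None and h["cls"] not in classes:
            continue
        if min_age is not None and h["age"] < min_age:
            continue
        if years is not None and not (years[0] <= h["year"] <= years[1]):
            continue
        bins[h["year"] - h["year"] % step] += 1  # bin start, e.g. 1742 -> 1740
    return dict(bins)

hits = [
    {"cls": 1, "age": 30, "year": 1742},
    {"cls": 3, "age": 40, "year": 1743},   # excluded: wrong class
    {"cls": 2, "age": 28, "year": 1748},
]
print(count_hits(hits, classes={1, 2}, min_age=25, years=(1740, 1760), step=5))
# {1740: 1, 1745: 1}
```

User-defined categories (allocating professions to classes, changing the bin width) amount to swapping out the filter sets and the `step` parameter.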

► Susan Mandala (University of Sunderland)

Finding speech acts in spoken corpora: Present and future possibilities

As Aijmer (1996: 3-4) has noted, the use of machine-readable spoken data in studies that involve speech acts in discourse is a relatively new method of investigation. Pragmatic coding in general, and discourse coding in particular, have lagged behind other forms of corpus tagging (McEnery & Wilson 1996), with work in this area tending to be either hypothetical in nature or not readily applicable to truly spontaneous speech. As a result, the current state of corpus annotation for conversational material does not typically support the consistent identification of structures such as acts or moves in discourse. Automatic search and retrieval systems are generally geared towards the lexical item and syntactic phrase, and systems that allow for the automatic search and retrieval of discourse categories at or above the level of the conversational act or exchange are virtually unknown for corpora of spontaneous conversation. The existing tagging methods and search systems thus favour the study of lexically and syntactically evident (Kennedy 1998: 178) discourse features, such as back-channel devices and tag questions, and searches for speech act units in corpora have so far been limited to highly formulaic acts with relatively fixed lexical indicators (e.g. Aijmer 1996). In this paper, I argue that it is possible to consistently and reliably locate less formulaic speech act units in syntactically and lexically tagged conversational corpora, and demonstrate this with a set of results for advice-giving obtained from the Cobuild corpus. I further argue that a comprehensive speech act coding of conversational corpora is theoretically possible, since many of the arguments levelled against such an enterprise do not stand up to closer scrutiny.

References

Aijmer, Karin. 1996. Conversational Routines in English: Convention and Creativity. London & New York: Longman.
Kennedy, Graeme. 1998. An Introduction to Corpus Linguistics. London & New York: Longman.
McEnery, Anthony & Andrew Wilson. 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.