Tagging old with new: an experiment with corpus customisation
Atro Voutilainen
Department of Modern Languages, University of Helsinki
Abstract
There is a growing number of solutions for the linguistic annotation of corpora that represent a present-day standard written variety of a natural language. Corpus linguists interested in diachronic or dialect corpora have a more limited choice of annotation solutions: annotation software built for a standard variety of a language generally provides unacceptably low analysis accuracy when applied to a diachronic or dialect corpus. This paper outlines a simple semiautomatic solution for providing a more accurate annotation for a diachronic corpus while using a tagger originally built for a standard variety of the language: problematic words in the corpus are identified and translated (“standardised”), the tagger is run on the standardised corpus, and the resulting more accurate analysis is combined with the original corpus. An informal evaluation on an 18th Century English letter corpus is reported, with promising results.
1. Introduction
Production of representative text corpora with high-quality linguistic annotation (e.g. with lexical or syntactic tagging) to serve as empirical data for corpus linguistic research is a costly, expertise-intensive effort. Such efforts are not usually made without a large number of expected users. The largest annotated English-language corpora represent present-day written English, e.g. the Penn Treebank (Marcus et al. 1993), the Corpus of Contemporary American English “COCA” (Davies 2009), the British National Corpus (Burnard and Aston 1998), and the Prague English Dependency Treebank (Mikulova et al. 2006).
Corpus linguists interested in other varieties of language, e.g. diachronic or dialectal varieties, are likely to experience a lack of suitable annotated corpora, as well as a lack of resources for organising a work-intensive corpus annotation project without substantial delay. A number of options for meeting the automatic annotation requirement appear to be available, but each of the following apparent options turns out to be problematic for this kind of low-budget `do-it-yourself' corpus annotation scenario:
- there are several freely or commercially available taggers and parsers with embedded statistical language models or linguistic parsing grammars for analysing standard present-day written English with varying degrees of accuracy (see the Sources section for a long and fairly up-to-date list), but their accuracy is very likely too low for annotating diachronic corpora for corpus linguistic research because of lexical, morphological and syntactic differences between Present-Day English and earlier varieties of English;
- customising a tagger or a parser of Present-Day English (by modifying its language models to better account for the lexicon and grammar of the variety of English of interest) is problematic because it requires effort and expertise beyond the scope of most corpus linguists: the workings and architecture of the parser's existing formal grammars need to be understood well enough to modify them in a controlled way. If the tagger or parser uses a statistical or closed-source language model, manual fine-tuning is not an option even for an expert in computational linguistics;
- there are also freely available development environments for generating language models and executable software for statistical and linguistic tagging and parsing, but their effective use requires either a kind of expertise not generally possessed by a corpus linguist (i.e. building, testing and documenting formal lexical, morphological and syntactic analysis grammars for integration with the compiler/interpreter software into a tagger/parser) or substantial amounts of morphologically or syntactically annotated text representing the object language (diachronic English, in this case) to enable automatic training of a statistical language model for tagging and parsing.
Since large-scale grammatical annotation of corpora by hand is too work-intensive for a corpus linguist with small resources and an immediate need for a relevant corpus with grammatical annotation, and since creating or customising an automatic solution is impracticable for lack of expertise, a practical option for a corpus linguist might be to manipulate the corpus itself to enable higher-quality automatic annotation with an existing tagger/parser.
In this paper, we outline a method based on (i) light-weight mechanical word-by-word translation (or modernisation) of the corpus to make it better resemble the kind of language the available tagger or parser was made to analyse, and (ii) restoration of the original corpus tokens to the tagger's output, so that the original words are combined with the higher-accuracy annotation. Using the method requires the availability of a tagger and the corpus; the other text processing tools we mention are included in Linux distributions. A related method for spelling variation standardisation (but without an evaluation in a syntactic annotation task) is presented in Baron and Rayson (2011).
A sample application of the method is presented step by step, showing how an extract from a diachronic corpus can be annotated with a word-class tagger made for Present-Day English. We also present a small-scale evaluation of the method by showing how much its use contributes to the analysis accuracy of the word-class tagger on an extract from a diachronic English corpus.
2. Method
In this section, we present a method that enables the corpus linguist to annotate a text corpus with an automatic tagger or parser originally built for analysing a different variety of the language. We illustrate the method with a step-by-step sample application in Section 3, and point out some of its expected benefits and limitations in Section 4.
The method contains three main steps:
- generate a translated version of the corpus to make it better resemble the object language of the tagger or parser (i.e. the language variety that the tagger or parser was made to analyse). The translation should be done on a token-by-token (word-by-word) basis to facilitate later steps in the process. (In this paper, we do not consider the more challenging task of managing multiword units and changes in tokenisation.)
- apply the tagger/parser on the original corpus and the translated corpus.
- extract the tokens (words, punctuation marks) from the analysed version of the original corpus and the descriptors (e.g. tags) from the analysed version of the translated corpus, and join them into the final version.
3. Sample application
Next, we process a concrete example sentence through the three main steps (as well as some additional minor processes for text cleaning, etc.), and mention some simple Unix tools and commands that can be used with minimal knowledge of linguistic computing (code examples are available from the author upon request). For practical examples and exercises on using simple Unix tools for text manipulation, the interested reader is also referred to the downloadable classic Unix for Poets by Ken Church (see the Sources section for the download link).
(i) Corpus cleaning. As a result of editorial work, the corpus may contain material that interferes with the grammatical annotation process. For instance, in the following extract the strings “=”, “(^” and “^)” have been added; page numbering is also included (“<P 236>”).
When I was at Bright=n=, I was too happy to
write; Time flew as smooth & swift as the (^Glid^) we saw in our
last ride. One Peg lower w=d= perhaps have made me wish
<P 236>
for Pen Ink & Paper
Here we simply remove the added coding as well as lengthy foreign-language expressions (expressions in German, Latin and French) from the corpus; a more complete solution would contain a facility for restoring this material into the final version of the annotated corpus. The “sed” string editor is convenient for manipulating single-line character sequences. The additional material shown above can be deleted with a simple script (a sketch is given after the cleaned extract below), with the following result:
When I was at Brightn, I was too happy to
write; Time flew as smooth & swift as the Glid we saw in our
last ride. One Peg lower wd perhaps have made me wish
for Pen Ink & Paper
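A minimal sketch of such a cleaning script, assuming GNU sed and exactly the markup conventions seen in the extract above (the file names are placeholders for this sketch only), might be:

# remove the editorial "=" marks and the "(^" and "^)" brackets,
# and delete page-number lines such as "<P 236>" (assumed to stand on lines of their own)
sed -E -e 's/=//g' \
       -e 's/\(\^//g' \
       -e 's/\^\)//g' \
       -e '/^<P [0-9]+> *$/d' \
       <original-corpus.txt >cleaned-original-corpus.txt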
(ii) Extraction of tokens for translation. Depending on the lexical similarity of the language of the corpus with the object language of the tagger/parser, there are alternative methods to generate a list of word-forms from the corpus for use as translation candidates.
- An exhaustive word-list (maybe thousands of word-forms) can be considered when there is a substantial difference between the language of the corpus and the object language of the tagger.
- In case the lexical difference is not pervasive, a more selective method for extracting translation candidates may provide a reasonable balance between human effort and annotation quality gains. For instance, a spell-checker is a useful tool for extracting tokens that do not belong to the core vocabulary of the standard modern language variety; the Unix “spell” program can be used as a command-line tool for this purpose. In case the number of results is very large, the Unix commands “sort” and “uniq -c” can be used after the “spell” command in a pipeline (a sketch is given after the candidate list below). For the sample fragment, the Unix “spell” program produces the following list of translation candidates:
Brightn
Glid
wd
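A candidate-extraction pipeline along these lines, assuming a Unix “spell” implementation is installed, might look as follows; if the local “spell” prints each unknown word only once, the frequency counts would instead have to be taken from a one-word-per-line version of the corpus:

# list tokens unknown to the spell-checker, ranked by frequency (most frequent first)
spell <cleaned-original-corpus.txt | sort | uniq -c | sort -rn >translation-candidates.txt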
(iii) Translation and string replacement to generate a “standard” version of the corpus. The extracted translation candidates can be provided to the translator as a tabular two-column list, e.g.
Brightn, Brightn
Glid, Glid
wd, wd
so the translator can translate the first column by modifying the second column. With some automatic modification, the translated table can be converted into a script that performs the translation for each relevant token in the corpus (the input file name comes after the “<” sign; output is directed to another file with “>”):
sed -e 's/\<Brightn\>/Brighton/g' \
....
-e 's/\<wd\>/would/g' <cleaned-original-corpus.txt >cleaned-translated-corpus.txt
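One way of performing the automatic modification mentioned above is sketched below: the translator's two-column list is assumed to be saved in a file (here hypothetically named translation-table.txt) with lines of the form “Brightn, Brighton”, and it is mechanically rewritten into sed substitution commands:

# turn each "old, new" line into a sed command of the form s/\<old\>/new/g
awk -F', *' '{ printf("s/\\<%s\\>/%s/g\n", $1, $2) }' translation-table.txt >translation-table.sed
# apply the generated substitutions to the cleaned corpus
sed -f translation-table.sed <cleaned-original-corpus.txt >cleaned-translated-corpus.txt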
After translation, our sample would look like this:
When I was at Brighton, I was too happy to
write; Time flew as smooth & swift as the Glid we saw in our
last ride. One Peg lower would perhaps have made me wish
for Pen Ink & Paper
(iv) Annotation of the original and translated corpus versions. Running the selected tagger or parser on the two versions of the corpus results in two analysed corpora with an identical number of tokens for easier alignment later on during the process. Our fragments would look like this in a vertical form – original:
One/Num
Peg/N
lower/ACmp
wd/N
perhaps/Adv
have/VPres
made/VEn
me/PronAcc
wish/VInf
for/Prep
Pen/N
Ink/N
&/CC
Paper/N
and translated:
One/Num
Peg/N
lower/ACmp
would/VMod
perhaps/Adv
have/VInf
made/VEn
me/PronAcc
wish/VInf
for/Prep
Pen/N
Ink/N
and/CC
Paper/N
As we can see, “wd” is misanalysed as a noun in the original version (the tagger's lexical analyser contains a heuristic component for analysing tokens not represented in the lexicon itself); “have” is also misanalysed (the infinitive reading is discarded because no “licence” for an infinitive, e.g. a modal auxiliary, is found in the context; VPres is provided as an analysis, probably because it is the last analysis to survive contextual disambiguation). In the translated version, both “would” and “have” are analysed correctly. (Note in passing that in both versions the adverb “lower” is misanalysed as an adjective.)
(v) Combining original tokens with analyses of the modern version. To extract the tokens in the analysed original corpus and to match them with the relevant analyses, the analysed versions of the corpora need to be in a token-per-line (vertical) format, as shown above. The Unix tool “sdiff” displays its two arguments (in this case, the two versions of the tagged corpus) side by side, as follows (the vertical bar “|” marks lines that differ between the two columns):
One/Num          One/Num
Peg/N            Peg/N
lower/ACmp       lower/ACmp
wd/N           | would/VMod
perhaps/Adv      perhaps/Adv
have/VPres     | have/VInf
made/VEn         made/VEn
me/PronAcc       me/PronAcc
wish/VInf        wish/VInf
for/Prep         for/Prep
Pen/N            Pen/N
Ink/N            Ink/N
&/CC           | and/CC
Paper/N          Paper/N
The final step, combining the original tokens with the analyses in the second column, can be done with the string editor program “sed”, by replacing everything from the first slash “/” up to the last slash on each line with a single slash (a sketch of this step is given after the result below):
One/Num
Peg/N
lower/ACmp
wd/VMod
perhaps/Adv
have/VInf
made/VEn
me/PronAcc
wish/VInf
for/Prep
Pen/N
Ink/N
&/CC
Paper/N
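A minimal sketch of this alignment and combination step, assuming the two tagged, token-per-line files have been saved under the hypothetical names tagged-original.txt and tagged-translated.txt, might be:

# show the two files side by side; "|" marks the lines that differ
sdiff tagged-original.txt tagged-translated.txt >sdiff-output.txt
# keep the token from the left column and the tag from the right column:
# everything from the first "/" up to the last "/" on a line is replaced by a single "/"
sed -e 's|/.*/|/|' <sdiff-output.txt >final-tagged-corpus.txt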
4. Expected benefits and limitations
Provided that an automatic tagger or parser is available for a standard present-day variety of the language, but not for the particular, mainly lexically different variety for which the corpus linguist needs grammatical annotation, the method outlined is expected to enable nearly automatic corpus annotation with an analysis accuracy close to that achieved with the available tagger/parser for its proper object language. In particular, the expertise that the method requires – mainly word-level translation, and some basic linguistic computing skills or ready-made scripts – is probably available to most small-resource corpus annotation efforts. In other words, non-trivial corpus annotation efforts should be possible without the time-consuming effort of organising a multidisciplinary research project, and the actual corpus linguistic research project – linguistic study of the language variety of interest – can be started with less delay.
An expected limitation of this method is that word-level translation or modernisation is not likely to prevent misanalyses due to syntactic dissimilarities between the standard language and its (diachronic) variety. The resulting annotated corpus is expected to contain syntax-based misanalyses; at the end of this paper, we will tentatively outline other methods to correct syntactically motivated misanalyses.
5. Experiments
In this section, we report two experiments with an English part-of-speech tagger made for analysis of present-day standard written English.
- The first experiment is an accuracy comparison based on a 2200-word corpus extract from which the editorial markup has been removed: one analysis is based on applying the tagger to the corpus without the combined translation and result merging; the other analysis uses the combined translation and merging routine. From both analysis results, the number of correctly analysed words (excluding punctuation) is calculated to get an indication of how much the method contributes to analysis accuracy on this sample.
- The second experiment aims at a more informed view of the kinds of analysis errors that survive the translation method. In this experiment, we use a larger corpus extract (close to 8000 words) and report findings with examples, to gain a tentative understanding of what kinds of analysis problems remain to be solved, possibly with other techniques.
Before moving on to the results, we describe the tagger, the diachronic corpus extracts and the translation task.
5.1 The tagger
The English tagger, based on the successful Constraint Grammar framework (Karlsson et al. 1995), uses a sequence of analysis routines:
- a tokeniser to identify words, punctuation marks and sentence boundaries;
- a lexical lookup module with a morphological grammar, a large lexicon, as well as a guessing component for otherwise unrecognised words, to provide each token with one or more alternative morphological analyses (part of speech, inflection); and
- a reductionistic hand-written disambiguation grammar to remove contextually illegitimate morphological analyses from morphologically ambiguous words.
- The last module of the tagger simplifies the tagger's output into a structure where each lexical analysis is represented as a single tag (see the examples above) from a palette of 38 tags, described in Appendix 1. In case an ambiguity remains unresolved by the disambiguation component, the lexical form gets two or more alternative tags. Such remaining ambiguities can be mechanically identified and manually resolved.
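For instance, if a remaining ambiguity is written with the alternative tags appended to the same token (say, wd/VMod/N; this is an assumption about the output format made purely for illustration), such tokens can be listed mechanically with a command like the following:

# list lines whose token carries more than one tag (i.e. more than one "/")
grep -n '/.*/' tagged-corpus.txt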
5.2 Corpus
The corpus is a sample from the CEECE corpus (Corpus of Early English Correspondence Extension), which contains 18th Century English correspondence by educated writers. The tagger's accuracy is calculated from several randomly selected extracts. Before the other processing steps, the editorial markup was automatically removed from the corpus.
5.3 Translation units
For the experiment, we used a restricted translation option: only tokens not recognised by the Unix “spell” program were subjected to translation. Here are the twenty most frequent tokens extracted by spell; their translation is also given (after “/”):
wd/would att/at tis/this wth/with shd/should wch/which favour/favour Streatham/Streatham shoud/should rejoyce/rejoice Madm/Madam coud/could cd/could Brightn/Brighton askt/asked neighbour/neighbour humour/humour honour/honour de/de Clavering/Clavering
5.4 Metrics and experiments
We calculated the percentage of tokens with a correct morphological analysis. The correctness of each analysis was determined by manual inspection of the tagged samples; the documentation of the underlying grammatical representation was also consulted in cases of uncertainty. A measure of subjectivity is involved in this evaluation (some misanalyses may have gone unnoticed); however, since the same method was used in examining both the translated and the untranslated corpora, the differences in analysis accuracy are expected to be indicative of the usability (or otherwise) of the translation method. In any case, an objective evaluation with a larger corpus and controlled benchmark creation (as described e.g. in Voutilainen and Järvinen 1995) will become feasible once a more mature system prototype is available.
Here are the results from the comparative experiment with a 2200-word extract. The observed tagging accuracies:
- tagged corpus without translation and merging: 95.2% (106 tagging errors)
- tagged corpus with translation and merging: 98.0% (45 tagging errors)
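(With an extract of approximately 2,200 word tokens, these percentages follow directly from the error counts: (2200 - 106) / 2200 ≈ 95.2% and (2200 - 45) / 2200 ≈ 98.0%.)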
Even the restricted speller-based translation option appears to be useful: more than 50% of the tagging errors made by the “baseline” system were avoided.
Next, we have a more qualitative look at the performance of the translation-based method with a larger extract, 7423 words (excluding punctuation marks) from the corpus.
In the tagged corpus, we found 117 mistagged words (i.e. words without a correct analysis). We classified the tagging errors into three main categories: (i) tagging errors that could have been avoided if the word had been included among the translation candidates – 22 errors (19% of all tagging errors); (ii) tagging errors due to other linguistic differences between 18th Century English and Present-Day English – 32 errors (27% of all tagging errors); (iii) tagging errors due to shortcomings in the tagger's language model (tokenisation, lexicon or disambiguation grammar) – 63 errors (54% of all tagging errors). The tagger, like CG taggers in general, left some ambiguity unresolved; the pending ambiguities (close to 200) were resolved manually as a postprocessing step.
We look at each main tagging error source with examples.
Missing translation candidate
The “spell” spelling checker recognised the following words, but the tagger's lexical description does not account for them properly:
yr, y, thats, twill, lye, writ, dos, SIC, writ, 2d, outmost, O
so they were not included in the list of translation candidates. The tagger misanalysed the words, e.g. the pronouns “y” and “yr” were tagged as nouns (subcategory: abbreviation):
-/Pun
do/VPres
y/NAbb <<<<<<<<<<<
,/Pun
if/CS
you/Pron
can/VMod
-/Pun
&/CC
I/Pron
'll/VMod
stick/VInf
…
I/Pron
am/VPres
ashamed/A
to/TO
keep/VInf
yr/NAbb <<<<<<<<<<<<<<<
Messenger/N
longer/AdvCmp
or/CC
wd/VMod
carry/VInf
If “y” had been translated as “you” and “yr” as “your”, many tagging errors of this kind would probably have been avoided.
One of the 22 tagging errors in this category was due to a translation error: “preverse” was translated as “perverse” (an adjective); a more appropriate translation would have been “preserve”:
God/N
preverse/A <<<<<<<<<<<<<
SIC/Adv
his/PronGen
life/N
,/Pun
so/Adv
neccessary/A
to/Prep
his/PronGen
family/N
./Pun
Other linguistic differences between 18th Century and 20th Century English
The other observed linguistic differences between 18th Century and 20th Century English can be subcategorised into three classes; we give examples of each.
(i) Syntactic differences: a syntactic construction untypical of Present-Day English was used in the 18th Century corpus. Some examples:
– “pray” + Imperative. In the tagger's language model, “pray” was described like “please”, which can be followed by a verb in the Imperative but also by other verb forms. In the corpus extract, the verb form following “pray” was always an imperative; here is one example:
dear/A
Madam/N
,/Pun
pray/Adv
do/VPres <<<<<<<<
me/PronAcc
the/Det
honour/N
…
– Questions without periphrastic DO were more common in older varieties of English than in Present-Day English. The tagger's language model did not account for this construction, hence “you” was tagged as an accusative pronoun instead of receiving the appropriate “Pron” tag in the following:
things/NPl
go/VPres
well/Adv
-/Pun
what/Pron
think/VPres
you/PronAcc <<<<<<<<<
of/Prep
taking/VIng
a/Det
peep/N
at/Prep
the/Det
old/A
Castle/N
&/CC
Philosopher/N
of/Prep
Chesington/N
?/Pun
– The tagger's language model did not account for the formulaic expression “God bless you”; hence “bless” was tagged as a Present-tense verb (VPres) instead of the proper Subjunctive verb tag (VSbj):
Well/Adv
,/Pun
God/N
bless/VPres <<<<<<<<<<<<
you/PronAcc
,/Pun
dearest/ASup
Madam/N
,/Pun
(ii) Morphological differences: a word-form is used as a finite verb in 18th Century English as well as in Present-Day English, but in 18th Century English it also serves a participial function. Here is an example: “tore” should have been tagged as participle (VEn), not as past-tense verb (VPast):
She/Pron
will/VMod
be/VInf
tore/VPast <<<<<<<<<<<<
in/Prep
peices/NPl
by/Prep
wild/A
horses/NPl
(iii) Capitalisation, punctuation: use of upper, lower case and punctuation in the 18th Century corpus differs from its usage in standard Present-Day English, which sometimes results in a tagging error (the disambiguation grammar used by the tagger uses information about capitalisation in its context tests). In the following example, the tagger analysed the verbal participle “Airing” as a nominal Ing-form (Ing); the analysis decision was in part based on the use of capitalisation, which usually indicates the presence of a nominal:
possession/N
of/Prep
the/Det
Comfortable/A
bed/N
wch/PronRel
I/Pron
had/VPast
the/Det
honour/N
of/Prep
Airing/Ing <<<<<<<<<<<<<<<<
for/Prep
him/PronAcc
./Pun
Tagging errors due to shortcomings in the tagger's language models
About half of the 117 tagging errors were not due to a linguistic difference between 18th Century and Present-Day English, but rather to more general `leaks' in the tagger's language models. Most of the tagging errors in this category were due to a misprediction made by the contextual disambiguation grammar (some of them can probably be corrected without compromising the performance of the disambiguation grammar in other respects); there were also some tagging errors resulting from incorrect sentence (boundary) identification.
(i) Here are some tagging errors due to a mispredicting disambiguation grammar:
– the corpus contained several participial clauses with a genitival subject; three of these were mistagged: the participle was analysed as nominal (Ing), cf. “being” in the following example.
I/Pron
need/VMod
not/Neg
fear/VInf
his/PronGen
being/Ing <<<<<<<<<
over-dosed/A
./Pun
– An adverb was sometimes mistagged as adjective (ACmp), cf. “lower” in the following:
One/Num
Peg/N
lower/ACmp <<<<<<<<<
wd/VMod
perhaps/Adv
have/VInf
made/VEn
me/PronAcc
wish/VInf
– As in Present-Day English, also in 18th Century English “but” can be used as a coordinating conjunction, as a preposition and as an adverb. In the CEECE corpus sample, preposition and adverb usage appeared to be much more common than e.g. in Present-Day English newspaper text. Sometimes the tagger over-recognised coordinating conjunction readings, as in the following example:
Who/Pron
,/Pun
but/CC <<<<<<<<
a/Det
Swan/N
,/Pun
cd/VMod
sing/VInf
so/Adv
sweetly/Adv
,/Pun
when/Adv
dying/A
?/Pun
(ii) The following example shows a sentence that was erroneously divided in two by the tagger's tokenisation module (at the exclamation mark); as a result, the tagger lost view of the main predicate verb of the that-clause. In the absence of a verb, the subordinating conjunction reading of “that” (CS) was discarded; the determiner (Det) analysis as the last surviving reading was proposed by the tagger:
I/Pron
forgot/VPast
to/TO
tell/VInf
you/PronAcc
that/Det <<<<<<<<<<<<<
your/PronGen
'/Pun
Good/A
Night/N
'/Pun
!/Pun
is/VPres
got/VEn
about/Prep
Similar sentence identification challenges also occur in Present-Day English; this motivates a redesign of the tokenisation logic in the tagging system.
6. Discussion
We set out by motivating the need for an efficient low-resource method for the corpus linguist to annotate text corpora representing types of language for which no automatic annotation solutions are available. Assuming morphosyntactic taggers or parsers are available for a (Present-Day Standard) variety of the language, we proposed a method based on limited mechanical translation and the combination of different tagging results to improve analysis quality, at least in cases where the language varieties differ only lexically from each other. To test the usefulness of the method, we reported an experiment with a small corpus of 18th Century English and a linguistic word-class tagger modelled on Present-Day English. Our experiments indicated that the translation-based method is useful: the tagger's error rate decreased by more than half, even though the translation task was restricted to a minimum (only those tokens not recognised by a spell-checker of Present-Day English were translated). Largely similar experiments showing substantial improvements in tagging accuracy are reported by Rayson et al. (2007); a quantitative comparison with our experiments does not seem feasible due to differences in corpora, taggers and tag sets.
Additional evidence on the usefulness (or lack thereof) of the method proposed can be gained by experimenting with other (typologically different) languages and with (methodologically different) analysers for tagging and parsing.
As anticipated, the method does not resolve problems due to syntactic differences between the language of the corpus and the object language of the tagger/parser (including changes in tokenization). A potential improvement to the proposed method is to combine it with a syntactic postprocessor that identifies at least the most regular annotation error types and proposes a correct analysis as a replacement. Similar nonlexical improvements can be built for the preprocessing phase, e.g. to account for the use of `strong' punctuation (e.g. full stop, exclamation mark) sentence-internally. Syntactic pre- and postprocessing e.g. for error correction has been studied earlier (e.g. Leech et al. 1994; Manning 2011); empirical study of their use in combination with translation-based techniques may provide useful information on methods to facilitate the work of the corpus linguist and to increase the usability of high-quality software.
Acknowledgements
The research has benefited from earlier collaboration with Merja Kytö. Terttu Nevalainen kindly offered me a chance to make a presentation about my experiments at the Helsinki Corpus Festival. I would also like to thank Paul Rayson for a pointer to substantial earlier work, and Tanja Säily for providing me with the CEECE corpus extracts for experimentation. I also wish to thank two anonymous referees of the present volume for their constructive feedback.
Sources
A list of taggers and parsers can be found here: http://www-nlp.stanford.edu/links/statnlp.html#Taggers
Unix for Poets by Ken Church can be found here: http://www.stanford.edu/class/cs124/kwc-unix-for-poets.pdf (PDF)
Linux script
The Linux script used in this article is available for download as a tar.gz file.
References
Baron, Alistair & Paul Rayson. 2011. “Automatic standardization of texts containing spelling variation, how much training data do you need?”. Proceedings of the Corpus Linguistics Conference CL2009, ed. by Michaela Mahlberg, Victorina González-Díaz, Catherine Smith. University of Liverpool, UK.
Burnard, Lou & Guy Aston. 1998. The BNC handbook: exploring the British National Corpus. Edinburgh: Edinburgh University Press.
Davies, Mark. 2009. “The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights”. International Journal of Corpus Linguistics. John Benjamins.
Karlsson, Fred, Atro Voutilainen, Juha Heikkilä & Arto Anttila, eds. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Berlin & New York: Mouton.
Leech, Geoffrey, Roger Garside & Michael Bryant. 1994. “CLAWS4: The tagging of the British National Corpus”. Proceedings of the 15th International Conference on Computational Linguistics (COLING 94). Kyoto, Japan.
Manning, Christopher D. 2011. “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?”. Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I, ed. by Alexander Gelbukh. Lecture Notes in Computer Science 6608. Springer.
Marcus, Mitchell P., Beatrice Santorini & Mary Ann Marcinkiewicz. 1993. “Building a Large Annotated Corpus of English: The Penn Treebank”. Computational Linguistics 19(2): Special Issue on Using Large Corpora: II.
Mikulova, Marie, Alevtina Bemova, Jan Hajic, Eva Hajicova, Jiri Havelka, Veronika Kolarova, Lucie Kucova, Marketa Lopatkova, Petr Pajas, Jarmila Panevova, Magda Razimova, Petr Sgall, Jan Stepanek, Zdenka Uresova, Katerina Vesela, and Zdenek Zabokrtsky. 2006. “Annotation on the Tectogrammatical Level in the Prague Dependency Treebank. Annotation Manual”. Technical Report 30, UFAL MFF UK, Prague, Czech Rep.
Rayson, Paul, Dawn Archer, Alistair Baron, Jonathan Culpeper and Nicholas Smith. 2007. “Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora”. Proceedings of Corpus Linguistics 2007. University of Birmingham, UK.
Voutilainen, Atro & Timo Järvinen. 1995. “Specifying a shallow grammatical representation for parsing purposes”. Proc. EACL-1995. Dublin.
Appendix 1: Tagset
The distinctions made by our tagset are largely similar to those made by other known tagsets for English, but there are also some differences. Compared with the BNC Basic Tagset (C5), for example: (i) C5 distinguishes between common and proper nouns, while our tagset subsumes both under the noun category; (ii) regarding non-inflected verb forms, our tagset distinguishes between Present tense, Imperative, Subjunctive and Infinitive categories, while C5 subsumes the first three under a single tag (but provides a separate tag for Infinitives); (iii) C5 represents genitive endings as separate tokens (and tags), while our system keeps inflectional endings as part of the word and uses a Gen subtag to show the case, e.g. NGen; (iv) regarding demonstratives (this, these, that, those), C5 uses a single Determiner category, while our tagset distinguishes between a (premodifying) determiner reading and a pronominal reading (with a nominal head function).
Here is a list of our tags in tabular form.
TAG         GLOSS                                        EXAMPLES
A           adjective                                    happy, tired
ACmp        adjective, comparative form                  happier
ASup        adjective, superlative form                  happiest
Adv         adverb                                       when, well, so, please
AdvCmp      adverb, comparative form                     more, less
AdvSup      adverb, superlative form                     most, least
CC          coordinating conjunction                     and, &, either, or
CS          subordinating conjunction                    though, if
Det         determiner                                   a, this, whose, no
Ing         nominal ing-form or name                     erring, feeling, Clavering
Int         interjection                                 Hey
N           noun                                         coat, Madam, Martin
NPl         noun, plural form                            suggestions, feelings
NGen        noun or abbreviation, genitive form          Bolt's, man's
NGenPl      noun or abbreviation, plural genitive form   sisters'
NAbb        noun, abbreviated form                       Mr.
Neg         not                                          not
Num         numeral, cardinal                            31, one
NumOrd      numeral, ordinal                             third, 2nd
Prep        preposition                                  for, in, in_spite_of
Pron        pronoun                                      I, they, someone_else
PronPl      pronoun, plural form                         we, you
PronAcc     pronoun, accusative form                     him, her
PronAccPl   pronoun, accusative plural                   us, them
PronGen     pronoun, genitive form                       my, his
PronGenPl   pronoun, genitive plural                     our
PronRel     relative pronoun                             that, who
Pun         punctuation mark                             “.”, “,”, “?”
TO          infinitive marker                            to, in_order_to
VPres       verb, present tense                          walks
VSbj        verb, subjunctive                            forbid
VImp        verb, imperative                             come
VInf        verb, infinitive                             come
VMod        verb, modal auxiliary                        shall, may, will
VPast       verb, past tense                             walked, fell
VEn         verb, EN-form                                fallen, walked
VIng        verb, ING-form                               walking, being
Appendix 2: Analysis examples
Here are two tagged samples from the CEECE Corpus extract. First, an extract from “burney.txt”:
St/NAbb Martin's/NGen Street/N ./Pun
Dear/A Madam/N ./Pun
If/CS my/PronGen long/A Silence/N has/VPres traust/VEn into/Prep
your/PronGen head/N any/Det suggestions/NPl of/Prep my/PronGen
Friendship/N being/VIng like/Prep that/Pron of/Prep many/Det
others/PronPl ,/Pun only/Adv local/A &/CC temporary/A ,/Pun
believe/VImp them/PronAccPl not/Neg ;/Pun for/CC something/Pron
tells/VPres me/PronAcc Daily/A that/CS however/Adv business/N ,/Pun
disagreeable/A Situations/NPl ,/Pun or/CC a/Det relaxed/A &/CC
flaccid/A Mind/N may/VMod prevent/VInf me/PronAcc from/Prep
writing/Ing ,/Pun you/Pron will/VMod never/Adv be/VInf forgotten/VEn
./Pun
When/Adv I/Pron was/VPast at/Prep Brightn/N ,/Pun I/Pron was/VPast
too/Adv happy/A to/TO write/VInf ;/Pun Time/N flew/VPast as/Prep
smooth/A &/CC swift/A as/Prep the/Det Glid/N we/PronPl saw/VPast
in/Prep our/PronGenPl last/Det ride/N ./Pun
One/Num Peg/N lower/ACmp wd/VMod perhaps/Adv have/VInf made/VEn
me/PronAcc wish/VInf for/Prep Pen/N Ink/N &/CC Paper/N ;/Pun but/CC
two/Num or/CC three/Num lower/ACmp have/VPres made/VEn me/PronAcc
shudder/VInf at/Prep the/Det thoughts/NPl of/Prep them/PronAccPl ./Pun
So/Adv it/Pron is/VPres -/Pun our/PronGenPl Reason/N ,/Pun
Resolution/N ,/Pun &/CC the/Det Proofs/NPl ,/Pun even/Adv of/Prep
our/PronGenPl Affection/N ,/Pun are/VPres the/Det Slaves/NPl of/Prep
Circumstance/N !/Pun
That/CS we/PronPl are/VPres Journaliers/N in/Prep the/Det
performance/N of/Prep mental/A as_well_as/CC bodily/A Feats/NPl ,/Pun
every/Det retailer/N of/Prep Saws/NPl will/VMod allow/VInf ;/Pun
but/CC that/CS there/Adv are/VPres certain/A Diavolini/N
degl'Impedimenti/N ,/Pun or/CC mischievous/A Sylphs/NPl &/CC
Gnomes/NPl that/PronRel successfully/Adv forge/VPres Fetters/NPl
for/Prep Resolution/N ,/Pun even/Adv wise/A Folks/NPl will/VMod
deny/VInf ;/Pun &/CC yet/Adv ,/Pun I/Pron seem/VPres surrounded/VEn
with/Prep an/Det Army/N of/Prep them/PronAccPl ,/Pun that/PronRel
prevent/VPres me/PronAcc from/Prep doing/VIng every/Det thing/N I/Pron
wish/VPres &/CC intend/VPres ./Pun
'/Pun This/Det erring/Ing Mortals/NPl levity/N may/VMod call/VInf
,/Pun '/Pun O/NAbb blind/A to/Prep Truth/N !/Pun
the/Det Sylphs/NPl contrive/VPres it/PronAcc all/Pron ./Pun '/Pun
You/Pron were/VPast very/Adv good/A (/Pun '/Pun but/CC 'tis/VPres
a/Det way/N you/Pron have/VPres '/Pun )/Pun to/TO try/VInf to/TO
comfort/VInf poor/A Madam/N after/Prep her/PronGen unfortunate/A
Campaign/N on/Prep the/Det Continent/N ./Pun
She/Pron changed/VPast her/PronGen Resolution/N ,/Pun &/CC came/VPast
to/Prep London/N the/Det day/N after/Prep her/PronGen Landing/Ing
,/Pun &/CC the/Det Day/N following/VIng we/Pron went/VPast
together/Adv into/Prep Surry/N ,/Pun for/Prep a/Det Week/N ./Pun
– Here is the second sample from “clavering.txt”:
As_to/Prep the/Det rest/N I/Pron will/VMod in/Prep right/N to/Prep
my/PronGen self/N be/VInf silent/A when/Adv I/Pron consider/VPres
they/PronPl are/VPres subjects/NPl he/Pron has/VPres already/Adv
touch'd/VEn upon/Prep ./Pun
So/Adv you/Pron in/Prep your/PronGen disappointment/N of/Prep the/Det
4th/NumOrd instant/A ./Pun
I/Pron was/VPast in/Prep great/A expectations/NPl to/TO hear/VInf
of/Prep your/PronGen success/N ,/Pun having/VIng had/VEn so/Adv
many/Det assurances/NPl of/Prep the/Det sincerity/N of/Prep the/Det
new/A friendship/N with/Prep the/Det Governor/N ./Pun Tho'/CS att/Prep
the/Det same/A time/N I/Pron can/VMod but/CC own/A I/Pron
thought/VPast it/PronAcc impossible/A for/Prep men/NPl to/Prep putt/N
of/Prep their/PronGenPl natures/NPl ./Pun
But/CC ,/Pun good/A Vaudois/N ,/Pun do/VPres not/Neg frett/VInf
to/Prep much/Pron ./Pun
I/Pron 've/VPres just/Adv been/VEn a/Det sufferer/N by/Prep it/PronAcc
myself/Pron ,/Pun for/CC fretting/VIng I/Pron am/VPres sure/A
did/VPast me/PronAcc no/Det service/N and/CC perhaps/Adv helpt/VPast
on/Prep my/PronGen illness/N ./Pun
Rest/VImp you/Pron satisfyed/VEn ./Pun
My/PronGen neighbour/N ,/Pun honest/A man/N ,/Pun comforts/NPl
himself/Pron a/Adv little/Adv with/Prep the/Det thoughts/NPl the/Det
Whigs/NPl will/VMod make/VInf a/Det glorious/A campaign/N against/Prep
Sacheverell/N ./Pun Perticulars/NPl I/Pron leave/VPres to/Prep
him/PronAcc ./Pun
So/Adv now/Adv I/Pron come/VPres to/Prep our/PronGenPl pupill/N ./Pun
I/Pron must/VMod be/VInf not/Neg well/Adv now/Adv and/CC then/Adv
to/TO try/VInf the/Det affections/NPl of/Prep my/PronGen friends/NPl
./Pun
My/PronGen girls/NPl cry'd/VPast and/CC fretted/VPast so/Adv that/CS
I/Pron had/VPast Ratclif/N ,/Pun as_if/CS they/PronPl were/VPast mad/A
./Pun
Johny/N askt/VPast after/Prep me/PronAcc ,/Pun which/PronRel was/VPast
more/Pron than/CS he/Pron did/VPast to/Prep his/PronGen own/A
sisters/NPl ;/Pun and/CC of/Prep Sunday/N last/Adv my/PronGen aunt/N
had/VPast her/PronGen first/NumOrd admittance/N into/Prep the/Det
house/N he/Pron was/VPast in/Prep ./Pun
He/Pron came/VPast to/Prep her/PronAcc ,/Pun asked/VPast how/Adv
I/Pron did/VPast and/CC imediatly/Adv askt/VPast his/PronGen
sisters'/NGenPl leave/N to/TO come/VInf to/TO see/VInf me/PronAcc
./Pun
He/Pron was/VPast in/Prep an/Det excellent/A humour/N and/CC we/PronPl
talkt/VPast politicks/N till/Prep ten/Num a/Det clock/N ./Pun
This/Pron you/Pron must/VMod own/VInf a/Det favour/N not/Neg usuall/A
,/Pun which/PronRel incourages/VPres me/PronAcc to/TO persue/VInf
his/PronGen affairs/NPl with/Prep more/Det pleasure/N ./Pun
You/Pron mistake/VPres me/PronAcc as_to/Prep Weatherley/N ./Pun