POS Tagging the CEECE
A Manual to Accompany the Tagged Corpus of Early English Correspondence Extension (TCEECE)
2020
Lassi Saario and Tanja Säily
Research Unit for Variation, Contacts and Change in English (VARIENG)
Faculty of Arts
University of Helsinki
- Introduction
- People
- Normalisation
- Abbreviations
- Tildes and superscripts
- Full stops, colons and ‘bare’ abbreviations
- Redundant punctuation
- Indefinite pronouns and adverbs
- Reflexive pronouns
- Miscellaneous
- XML annotation
- Overlapping with the normalisation
- Structure of the corpus
- Parameter coding
- Text-level coding
- Textual structure
- Special characters
- Page numbers
- Headings
- Emendations
- Comments
- Type changes
- Foreign language
- CLAWS POS tagging
- Configuration of the SGML table
- Error messages
- Post-processing
- Conversion to final formats
- Final XML format
- Tagging accuracy
- Sample
- Calculation principles
- Checking guidelines
- Accuracy rate
- Accuracy by tags
- Corrected collection
- Known issues
- Directory structure
- References
- Appendices
Introduction
POS tagging the 18th-century Extension to the Corpus of Early English Correspondence (CEECE) was motivated by a desire to make the corpus more usable for corpus linguists by enabling more sophisticated queries. We chose CLAWS as the annotation system because we wanted to make the corpus comparable with CLAWS-tagged Present-day English corpora such as the British National Corpus (BNC). Moreover, the annotation system used for the earlier Parsed Corpus of Early English Correspondence had some serious drawbacks identified in previous research (Säily et al. 2011, 2017). For more information on our technological choices and the justification behind them, see Saario et al. (submitted).
The steps we took are described in Sections 3–5 below. To improve the performance of CLAWS, which was developed for Present-day English, we used the standardised-spelling version of the corpus produced with VARD and added further normalisation as needed. We then converted the corpus from the ancient COCOA format into XML. Tokenisation and POS tagging were performed by CLAWS, after which we did some post-processing to prepare the final formats.
As POS tagging the CEECE was clearly of interest, a tagging project was set up in Helsinki as early as 2013 (see Säily 2013). This involved a small team with no funding specifically devoted to the project. Consequently, progress was slow and limited to the normalisation stage, tackled by various research assistants in turn. After a hiatus of several years, one of the project members, Tanja Säily, received a grant from the Faculty of Arts in 2018. She used this to hire a research assistant, Lassi Saario, who had the technical know-how required for the XML conversion and other steps involved, and with Lassi’s help the project was finally completed.
People
Team: Terttu Nevalainen, Tanja Säily, Mikko Hakala, Samuli Kaislaniemi
Assistants: Anna-Lina Wallraff, Emanuela Costea, Anne Kingma, Lassi Saario
Thanks to: Paul Rayson (Lancaster/UCREL), Jukka Suomela (Aalto), Arja Nurmi (Tampere), Turo Hiltunen (Helsinki), Gerold Schneider (Zürich)
We gratefully acknowledge the support of the Faculty of Arts at the University of Helsinki, the Research Unit for Variation, Contacts and Change in English (VARIENG), and the Academy of Finland.
Normalisation
The letters in the CEECE cover a time span from 1653 to 1800. The letter writers range from almost illiterate paupers to well-educated royals. Given this sociohistorical range, there is a great deal of spelling variation in the corpus. The CLAWS tagger, however, is designed for Present-Day English. In order to improve tagging accuracy, we decided to normalise the language as much as we could before it was fed to CLAWS.
The normalisation was based on the standardised version of the CEECE, so the corpus had already been normalised to some extent prior to our project. The standardisation had been done using VARD 2 (Variant Detector), a piece of software that is specifically designed to be used on historical corpora as a pre-processor to other linguistic tools (such as a POS tagger) in order to improve the accuracy of those tools. See the poster on the standardised CEEC for more information about the ‘VARDing’, as we call it.
We continued the normalisation from where the VARDing had ended. Normalisable features were first searched for using a corpus analysis toolkit. The KWIC output was exported to a tab-delimited text file that was opened in Microsoft Excel. We went through the tables row by row and decided for each row whether the expression in question was to be normalised. If it was, we wrote down the replacement in a separate column; otherwise we removed the row. Finally, the Excel files were converted back to tab-delimited text files and fed to a Python script that carried out the specified replacements on the corpus files.
The corpus analysis toolkit we first used was WordSmith Tools, a commercial product by Lexical Analysis Software. Later we changed to AntConc (v. 3.5.6), a freeware alternative developed by Laurence Anthony. The Python scripts that were used to execute the replacements were written by Jukka Suomela.
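The replacement scripts themselves are not distributed with the corpus, but the general idea can be illustrated with a short Python sketch. The sketch below is an illustration only, not one of the actual scripts; in particular, the column layout of the tab-delimited file (left context, hit, right context, replacement) is an assumption made for the example.

import csv
import sys

def load_replacements(tsv_path):
    """Read a tab-delimited file of replacements. Assumed layout:
    left context, hit, right context, replacement (one row per decision)."""
    rows = []
    with open(tsv_path, encoding='utf-8', newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            if len(row) >= 4 and row[3]:
                rows.append(tuple(row[:4]))
    return rows

def apply_replacements(text, rows):
    """Carry out the specified replacements on the text of a corpus file.
    The surrounding context is used to locate the exact occurrence."""
    for left, hit, right, replacement in rows:
        text = text.replace(left + hit + right, left + replacement + right, 1)
    return text

if __name__ == '__main__':
    # usage: python replace.py replacements.tsv corpusfile.txt > corpusfile.normalised.txt
    replacements = load_replacements(sys.argv[1])
    with open(sys.argv[2], encoding='utf-8') as f:
        sys.stdout.write(apply_replacements(f.read(), replacements))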
How did we decide what features to normalise? First, we normalised certain aspects of the language that we knew for sure CLAWS would not understand (this is the so-called 1st cycle referred to in the appendix). We then extracted a sample of c. 100 words from the beginning of each collection, ran the samples through CLAWS and checked the output. Many more or less systematic factors reducing the tagging accuracy stood out. While most of them would have been too laborious to fix (see Known issues), some could be solved rather straightforwardly by further normalisation (the 2nd cycle referred to in the appendix).
The following subsections explain what features were normalised and how. Each subsection covers several searches, each of which is identified by a unique codename such as abbr1 and indef2. The regular expression, the total number of hits and the number of replacements for each search are summarised in the appendix. Codenames are given in square brackets in the body text, linking the normalised features to the respective data in the appendix and the Excel files of the same name.
Abbreviations
The original CEECE was full of abbreviations. When the corpus was standardised in VARD, a frequency cut-off was set to decide which abbreviations were to be expanded. Only those that exceeded the cut-off were expanded, meaning that there were many abbreviations left in the text after standardisation. Some of them would have been recognised by CLAWS, but not all. We set out to expand as many of them as we could.
Tildes and superscripts
Many abbreviations in the CEECE are marked with a tilde, as in altera~ons (short for alterations) and tho~ (short for though). The corresponding notation in the source edition is usually a tilde placed over the preceding letter, as in alterãons and thõ.
Another notation that is frequently used with abbreviations is superscript, encoded in the CEECE with two surrounding equality signs. The word w=ch= (short for which), for instance, corresponds to wch in the source edition.
Some abbreviations involve both a tilde and a superscript, such as Ma~t=y= (short for Majesty). There may also be some additional notation that is there to mark the abbreviation, such as the colon in Eng~: (short for English) and the full stop in serv=t.= (short for servant).
Not all tildes and superscripts mark abbreviations, however. Some superscripts have nothing to do with abbreviations (e.g. 10=th= and you=rs=). Sometimes a tilde in the corpus encodes an actual tilde in the source edition, as in Espan~ola which is code for Española.
Keeping this variety in mind, we searched for all words involving tildes and superscripts and normalised them accordingly:
- Abbreviations were expanded (e.g. p~'use → peruse, rece=d~.= → received), except for those in foreign language (e.g. ex Aula~ Edmundi). The tildes that remained in such passages were later removed by CLAWS (see the section on error messages).
- Tildes that encoded actual tildes were converted to XML entities, e.g. Espan~ola → Española (see the section on special characters).
- Superscripts that did not have anything to do with abbreviations were simply removed (10=th= → 10th, you=rs= → yours).
[pre1]
Full stops, colons and ‘bare’ abbreviations
In addition to tildes and superscripts, there are abbreviations that are marked by a full stop (e.g. Oct. for October), a colon (e.g. Capt: for Captain) or by nothing at all (e.g. Honble for Honourable). These kinds of abbreviations are much harder to find. One cannot simply search for every single word followed by a full stop or a colon, as this would include far too many unabbreviated words in the results. Nor can one search for all those abbreviations that are not marked in any way. The only feasible way to search for these kinds of abbreviations is to determine the set of particular abbreviations to be searched for.
We chose to include certain common abbreviations we had encountered in the course of the preliminary checking, knowing that many more abbreviations remain. The abbreviations that we expanded include various abbreviations for month names (Apr, Sepbr, 9br etc.) and affectionate(ly) (aff, affec etc.) as well as Hond, Honble, Sr, Capt, cd, cod, sh, shod, wd, wld and yt (for that). Some non-standard spellings of some of the abbreviated words (e.g. Jully, affectionett) were also normalised together with the abbreviations. [abbr1–3]
Redundant punctuation
To expand an abbreviation is not as simple as it may seem. If the abbreviation is followed by a full stop or a colon, one has to decide what to do with it once the abbreviation is expanded. If the full stop or colon marks something else in addition to the abbreviation (e.g. end of sentence), it should be kept; otherwise it should be removed.
When abbreviations had been expanded in VARD, all the subsequent full stops and colons had been left intact. As a result, there were plenty of full stops which marked neither abbreviations nor ends of sentences but which CLAWS nevertheless took to mark ends of sentences, resulting in considerable tagging errors. As for colons, they were not recognised by CLAWS as abbreviation markers at all. This affected not so much tagging as it did tokenisation, a colon being treated as a single token even when part of an abbreviation.
The clause I shall send notice to Ld. Weymouth, for instance, had been transformed in VARD into I shall send notice to Lord. Weymouth. Now, CLAWS treated the redundant full stop as a single token and inserted a sentence break afterwards even though it did not end a sentence. What is more, the word Lord was given the incorrect tag NP1 instead of the correct tag NNB, which could have been avoided had the full stop been removed prior to tagging.
It was clear that something had to be done about the problem. It was equally clear that going through all the full stops and colons in the corpus would have required hundreds of hours of extra work. We decided to concentrate on some common words only.
We chose to include all the month names (January, February etc.) as well as Sir, Lord, Captain, Princess, Duchess, Brother, Cousin, dear, affectionate, yours, your, which, that, would, could, should and the. Certain cases of ordinal numbers (1st, 2nd etc.) marked by unnecessary full stops or colons were also included. [punc1–3]
We searched for all such instances of those words that were followed by a full stop or a colon and decided for each, judging from the context, whether the full stop or colon was to be removed. In e.g. the following passage,
He declared that. this was very noble in me…
the full stop after that should be removed, whereas in
But enough of that. The Saxon we see flourishes here…
it ought to be kept.
Indefinite pronouns and adverbs
Another group of normalised features includes the indefinite pronouns everybody, something, anyone and the like. To put it more formally, it includes all the two-part compounds such that the first part is either every, some, any or no and the second part is either body, thing or one. There are 4 x 3 = 12 combinations in total.
The standard spelling of these pronouns in Present-Day English is well established. In the CEECE, the standard is only just emerging. Variation occurs on three levels:
- The pronoun parts may be either joined (e.g. something), separated (e.g. some thing) or hyphenated (some-thing). In present-day usage, the joined form is standard for every pronoun in the group except no one (or no-one).
- There may be variation in the first part (evry), the second part (bodey, bodie, boddey, think or thin) or both parts, although the last option turns out to be merely theoretical in the case of the CEECE.
- The capitalisation may diverge from present-day practice, as in he stands as firm as any One whatsoever.
A particular indefinite pronoun may diverge from the standard spelling on any of these levels—or none of them, in which case it does not need to be normalised at all. A rough regex search reveals that the CEECE contains 6,124 indefinite pronouns in total, 4,150 of which are normal, i.e., represent the present-day standard spelling.
The challenge for us was to write a search that would include all the variant instances, exclude all the normal ones and distinguish between different kinds of variants so as to automate the normalisation as much as possible. We divided them into three groups according to the type of the required normalisation:
- Both parts are standard and only need to be joined [indef1]
- E.g. no body → nobody, some-thing → something
- Note that no one and no-one were excluded from this search.
- First part non-standard, second part standard, whether joined or not [indef2]
- E.g. evrything → everything, Evry one → everyone
- First part standard, second part non-standard, whether joined or not [indef3]
- E.g. some Boddey → somebody, any think → anything
Separate searches were made for abbreviated variants (e.g. noth. → nothing) and variants that ended with an s. The latter ones were normalised by inserting the missing apostrophe in front of the s (e.g. somebodys → somebody's). [indef4–5]
In addition to the aforementioned pronouns, spelling variants of the related adverbs everywhere, somewhere, anywhere, nowhere and sometimes were also normalised accordingly. [indef6]
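The exact search formulae are documented in the appendix; as a rough illustration of an indef1-type search (both parts standard, only needing to be joined), one could use a regular expression along the following lines in Python. The pattern below is a simplification for illustration, not the formula actually used.

import re

FIRST = r'(every|some|any|no)'     # standard first parts
SECOND = r'(body|thing|one)'       # standard second parts

# Both parts standard, written separately or hyphenated, so that they
# only need to be joined. (In the actual indef1 search, "no one" and
# "no-one" were excluded; this simplification does not exclude them.)
pattern = re.compile(rf'\b{FIRST}[ -]{SECOND}\b', re.IGNORECASE)

text = 'I doubt that any body would say some-thing against it.'
for match in pattern.finditer(text):
    joined = (match.group(1) + match.group(2)).lower()
    print(match.group(0), '->', joined)
# any body -> anybody
# some-thing -> something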
Reflexive pronouns
Reflexive pronouns (myself, yourselves etc.) were searched for and normalised in much the same way as the indefinite pronouns.
The standard singular reflexive pronouns are defined here as the two-part compounds where the first part is either my, your, him, her, it or one and the second is self. The plural ones are defined as those where the first part is either our, your or them and the second is selves. As with indefinite pronouns, spelling variation in the CEECE occurs on many levels:
- The pronoun parts may be either joined (e.g. herself), separated (e.g. her self) or hyphenated (her-self). In present-day usage, the joined form is standard for every reflexive pronoun.
- Theoretically, there could be variation in the number of the second part even if the spelling as such is standard, as in myselves or themself. We did find two variants of this sort but they both turned out to be instances of the ‘royal we’, such as we assure ourself that your Grace will pay ready obedience to our command, spoken by the Duke of Monmouth in reference to himself. Both instances were thus left intact.
- There may be variation in the first part (thy, you, his, herr, its, ones, one's, won's, their), the second part (selfe, selff, selffe, selfes) or both parts of the pronoun, although the last option turns out to be merely theoretical in this case.
- The capitalisation may diverge from present-day practice, as in the clause I could reflect on no person but My Self for it.
According to a regex search, there are 6,078 reflexive pronouns in the corpus, 4,787 of which are normal. Again, we had to define our search formulae so that the search would include as many normalisable instances and as few extraneous ones as possible. The grouping of the searches follows that of the indefinite pronouns:
- Both parts are standard and only need to be joined [refl1]
- E.g. my self → myself, your-self → yourself
- First part non-standard, second part standard, whether joined or not [refl2]
- E.g. one's self → oneself, their selves → themselves
- Note that thy has not been normalised to your, since the form thyself is recognised by CLAWS as a reflexive pronoun.
- First part standard, second part non-standard, whether joined or not [refl3]
- E.g. her-selfe → herself, him Selff → himself
Miscellaneous
- Words where the letter v was used instead of u were normalised (e.g. hvsband → husband, vndertoke → undertook). [pre2]
- Words that began or ended with 't or ended with 'd were normalised (e.g. follow'd → followed, on't → on it, 'twill → it will). [pre3]
- The instances of ye where it means the were normalised in order to separate them from those instances where ye means you. [pre4]
- Uncapitalised weekday and month names were capitalised (monday → Monday, january → January etc.), except for march and may, which are far too frequently used as verbs. [misc1–2]
- Some common variants of some common words were normalised (dos → does, don → done, cant → can't, wont → won't, tho → though, i → I). [misc3–5]
- Instances of Ly that had been accidentally turned into Lie in VARD were changed to Lady (e.g. Ly. Charlotte → Lie. Charlotte → Lady Charlotte). [misc6]
XML annotation
The original version of the CEECE was written in the ancient COCOA (Word COunt and COncordance on Atlas) format, based on that of the Helsinki Corpus of English Texts. Each collection and letter contained both parameter and text-level coding for metadata of various kinds. Some of the coded parts were to be POS tagged and some were not. In order for CLAWS to be able to separate the former from the latter, the corpus had to be converted into XML format before it could be tagged.
Luckily, we were not the first to be confronted by this issue. The Helsinki Corpus had already been converted to XML, based on a custom TEI XML schema designed by Ville Marttila. From that schema, we derived our own simplified XML format for the CEECE. The actual conversion was performed by a Java program, written by Lassi Saario. The resulting XML files were validated against the Document Type Definition by XmlStarlet (version 1.6.1), a command line XML toolkit developed by Mikhail Grushinskiy.
It has to be emphasised, however, that the main goal of our project was to obtain a POS tagged version of the corpus, not an XML version of it. The XML conversion was only a means to a greater end. Even though all the information that was encoded in the original corpus is preserved in the XML edition (except for some of the normalised parts), the new format offers plenty of possibilities for further annotation that we have not even tried to implement. We hope that our having largely complied with the TEI standards will leave the door open for future contributors who might want to extend our schema to realise its full potential.
Overlapping with the normalisation
Another thing that should be emphasised is that the corpus was converted into XML only after it had been normalised. Some stretches of text that needed normalisation contained corpus codes that would later be converted into XML. Matters are further complicated by the fact that the overlapping codes were treated differently in the two cycles of normalisation (see the appendix).
In the first cycle, any overlapping code was removed along with the normalisation: Fran[\cesc\]=o= was changed to Francesco, fin[{is{]h'd to finished, etc. This means that some of the coded information has been lost in the process and has to be manually retrieved from the normalisation files if so desired.
In the second cycle, coded parts of text were simply excluded from the searches. Because of this exclusion, there may remain some instances in the tagged corpus that have not been normalised even though they instantiate features that were normalised elsewhere. The reflexive pronoun (^my^) self, for instance, was missed by the normalisation because of the code and converted into <hi rend="type">my</hi> self, whereas my self was normalised to myself.
Structure of the corpus
The original corpus was divided into 78 text files, one for each letter collection. We decided to preserve this division. Each COCOA-encoded collection file was converted into an XML-encoded file of the same name.
In the original corpus, each collection was preceded by a header followed by the individual letters. Each letter was likewise preceded by a header followed by the contents. The XML version follows the same overall structure, illustrated below.
Each XML document begins with the same two lines. The first line specifies the XML version and the character encoding. The second line defines the document type by a reference to an external DTD file. The entities are also given an internal declaration, since omitting it would cause errors in browsers that do not load external DTDs.
Each document has teiCollection as its root element. It is made up of a teiHeader element, containing header information about the collection, and a series of TEI elements, representing the individual letters. Each TEI element is likewise made up of a teiHeader element which includes header information about the letter, and a text element which includes the actual contents of the letter.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE teiCollection SYSTEM "../CEEC.dtd" [...]>
<teiCollection xml:id="F2FLEMIN">
  <teiHeader>...</teiHeader>
  <TEI xml:id="FLEMIN2_001">
    <teiHeader>...</teiHeader>
    <text type="letter" xml:lang="eng">...</text>
  </TEI>
  <TEI xml:id="FLEMIN2_002">
    <teiHeader>...</teiHeader>
    <text type="letter" xml:lang="eng">...</text>
  </TEI>
  ...
</teiCollection>
Parameter coding
In the original CEECE, each file begins with the identifier of the collection (same as the file name), followed by source information:
<B F2FLEMIN>
[^THE FLEMINGS IN OXFORD BEING DOCUMENTS SELECTED FROM THE RYDAL
PAPERS IN ILLUSTRATION OF THE LIVES AND WAYS OF OXFORD MEN
1650-1690. VOL. II 1680-1690. ED. BY MAGRATH, JOHN RICHARD.
OXFORD HISTORICAL SOCIETY 62. 1913.^]
In the XML conversion, the identifier is put in the xml:id attribute of the teiCollection opening tag, and the source information is included in a titleStmt element in the teiHeader:
<teiCollection xml:id="F2FLEMIN">
  <teiHeader>
    <fileDesc>
      <titleStmt>THE FLEMINGS IN OXFORD BEING DOCUMENTS SELECTED FROM THE RYDAL PAPERS IN ILLUSTRATION OF THE LIVES AND WAYS OF OXFORD MEN 1650-1690. VOL. II 1680-1690. ED. BY MAGRATH, JOHN RICHARD. OXFORD HISTORICAL SOCIETY 62. 1913.</titleStmt>
    </fileDesc>
  </teiHeader>
A letter header in the original CEECE consists of an L-line, a Q-line, an X-line and a P-line:
<L FLEMIN2_001>
<Q A 1681 T TDIXON>
<X THOMAS DIXON>
<P II,2>
- The L-line gives the letter identifier.
- The Q-line specifies the authenticity of the letter, the year of writing, the relationship between the writer of the letter and the addressee, and the identifier of the writer, respectively (see Kaislaniemi 2018, 54–56 for details).
- The X-line contains the name of the writer in full. (In some collections there is an A-line instead of an X-line, but the content is the same nevertheless.)
- The P-line includes the number of the page on which the letter begins in the source edition. Similar lines appear amidst the body whenever the page changes.
The contents of the lines are included in the XML header as follows:
<TEI xml:id="FLEMIN2_001">                        <!-- from the L-line -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title key="A 1681 T TDIXON"></title>     <!-- from the Q-line -->
        <author key="THOMAS DIXON"></author>      <!-- from the X- (or A-) line -->
      </titleStmt>
    </fileDesc>
  </teiHeader>
The P-line is included in the XML body along with the other P-lines. See the section on page numbers.
The S-lines like <S SAMPLE 1> that sometimes occur between letters mark samples taken from different source editions in the original corpus. They have been converted into XML comments like <!-- SAMPLE 1 -->.
Text-level coding
Textual structure
A letter body in the original CEECE is divided into lines, the maximum length of which is limited to 65 characters. Some of them are P-lines that annotate page breaks; for the rest there is no fixed format. Paragraphs and sentences flow rather freely from one line to another along with code brackets for headings, emendations etc. See the example below.
[} [\705 FROM MARY HOWE TO MR RANKING IN COOPERSALE (THEYDON
GARNON), 13 JANUARY 1731\] }]
Jenaw 13 day 1731
Mr ranking this is to let you know that the doxtor have done
what he can for me but my iees are never the better but rather
worse I am to be discharged next wandsday I
hope you will be so kind as to send me word how I must come home
by next wandsday morning so with humble service to you and your
good wife
sir I hope you will exquese me in writing of a letter but I
did not know no other way So I rest your humble servant
mary how patient in
[\CONTINUED CROSSWISE IN LEFT-HAND MARGIN\] peter ward
Unfortunately, the line and page divisions that are so explicit in the original CEECE are irrelevant for the purposes of POS tagging. Much more relevant is the paragraph division, which is also much more implicit. Lines that start paragraphs are usually indented with three spaces, but not always: sometimes the only clue that a line starts a paragraph is that the previous line is shorter than usual. Matters are further complicated by the fact that P-lines sometimes appear in the middle of a paragraph and sometimes between paragraphs. Code brackets often appear in the middle of a paragraph, but sometimes they continue across paragraph breaks, and sometimes they even seem to form paragraphs of their own.
Recognising such delicate divisions may be easy for the human eye, but it is far from easy for a computer. We wanted to try it anyway. The rule of thumb that we gave to our converter is that a line starts a new paragraph if it is indented or if the previous line is shorter than 40 characters. Lines that were recognised to form a paragraph were then merged and, provided that the paragraph in question was indeed a proper paragraph (as opposed to a single P-line or a bunch of code), put inside a p element. Code bracket sequences that continued across paragraph or page breaks were split at the breaking point in order to guarantee the well-formedness of the element tree.
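As an illustration of this rule of thumb, the following Python sketch shows the heuristic in a simplified form (the actual converter was a Java program and also had to handle P-lines and code brackets, which are omitted here):

def starts_new_paragraph(line, previous_line):
    """A line starts a new paragraph if it is indented
    or if the previous line is shorter than 40 characters."""
    indented = line.startswith('   ')          # three leading spaces
    short_previous = previous_line is not None and len(previous_line.rstrip()) < 40
    return indented or short_previous

def group_into_paragraphs(lines):
    """Merge consecutive lines into paragraphs according to the heuristic."""
    paragraphs, current, previous = [], [], None
    for line in lines:
        if current and starts_new_paragraph(line, previous):
            paragraphs.append(' '.join(part.strip() for part in current))
            current = []
        current.append(line)
        previous = line
    if current:
        paragraphs.append(' '.join(part.strip() for part in current))
    return paragraphs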
No further division was carried out at this stage. It was not necessary, as CLAWS would later on perform the tokenisation into sentences, words and punctuation marks together with the POS tagging. When the CLAWS output was finally converted back to XML, the tokens were put into w elements (see the section on the final XML format).
Special characters
Grave accent symbols that were used to annotate accents (not only graves but acutes and circumflexes as well) in the original CEECE have all been stripped from the XML edition. Equality symbols that annotated superscripts and tildes that annotated abbreviations have also been removed (see the section on tildes and superscripts).
Certain special characters have been converted into XML entities according to the following table. The entity names follow the TEI codes used in the BNC and declared in the default SGML table of CLAWS. (This is why the lower case yogh has not been named as an entity.)
| Source edition | Original corpus | XML corpus | Description |
| --- | --- | --- | --- |
| & | & | & | ampersand |
| ð | +d | ð | lower case eth |
| ȝ | +g | ȝ | lower case yogh |
| þ | +t | þ | lower case thorn |
| £ | +L | £ | pound sign |
| ñ | n~ | ñ | tilded n |
Page numbers
Page changes were annotated as P-lines, e.g. <P 45>, in the original CEECE. They are converted into pb elements, the n attribute of which contains the page number, e.g. <pb n="45" />. Note that pb elements may appear both inside and outside p elements.
Headings
Headings, annotated with the code [}...}] in the original CEECE, are annotated with the code <head>...</head> in the XML edition.
Note that all headings in the CEECE are in fact editorial, i.e. they are double-coded, as in
[} [\98. TO FANNY BURNEY\] }]
where the inner brackets stand for the editorial remark. The double-coding is preserved in the XML conversion so that the given example is converted into
<head> <note resp="editor" value="98. TO FANNY BURNEY" /> </head>
See the section on comments for more information.
Emendations
Emendations are annotated with the code [{...{] in the original CEECE. These have been converted into supplied elements in the XML version. When the emendation consists of complete words, we simply put the content of the brackets in between the XML tags. In e.g. the following passage,
I turned so sick that I [{could{] hardly speak
the [{could{] is converted into
<supplied>could</supplied>
The case is a bit trickier when the emendation contains partial words, as in [{the th{]ing . CLAWS only tags complete words, and yet we would like the XML version to preserve the information about what was emended to what. We have converted these emendations so that the complete emended expression stands between the XML tags, while the original code is included in an orig attribute:
<supplied range="0,5" orig="[{the th{]ing">the thing</supplied>
The range attribute specifies the extent of the emendation. The expression between the XML tags is indexed so that the first character has the index 0, the second has the index 1 etc. The first number in the range value is the index of the first character in the range, and the second number is the index of the first character not in the range. Note that whitespace does not count as a character here. When there are several ranges in the same expression, they are delimited by a semicolon:
<supplied range="2,16;19,23" orig="wi[{th a different re{]por[{t eve{]ry">with a different report every</supplied>
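To illustrate how the range values are computed, the following Python sketch derives both the emended expression and the range attribute from the original [{...{] code (an illustration of the indexing scheme only, not the conversion program itself):

def emendation_ranges(orig):
    """Return the emended expression and the value of the range attribute.
    Characters are indexed over the emended expression with whitespace
    skipped; each range is 'start,end', where end is the index of the
    first character not covered by the emendation."""
    ranges, chars = [], []
    index, start, i = 0, None, 0
    while i < len(orig):
        if orig.startswith('[{', i):          # emendation opens
            start = index
            i += 2
        elif orig.startswith('{]', i):        # emendation closes
            ranges.append(f'{start},{index}')
            i += 2
        else:
            chars.append(orig[i])
            if not orig[i].isspace():         # whitespace does not count
                index += 1
            i += 1
    return ''.join(chars), ';'.join(ranges)

print(emendation_ranges('[{the th{]ing'))
# ('the thing', '0,5')
print(emendation_ranges('wi[{th a different re{]por[{t eve{]ry'))
# ('with a different report every', '2,16;19,23')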
Comments
The original version of the CEECE contains two types of comments added to the body text. Comments by compilers of the corpus are annotated with the code [^...^], e.g. [^LIST OF NAMES OMITTED^]. Comments by editors of source editions are annotated with the code [\...\], e.g. [\TORN\].
In the TEI XML edition of the Helsinki Corpus, both codes are converted into a note element. The author of the comment is specified by a resp attribute which points to his/her name in the document header. For our purposes, however, it is sufficient to separate the editors’ comments from the compilers’ and not specify the individual commentator. The attribute is simply given the value compiler for compilers’ comments and editor for editors’ comments.
Comments, whether they are written by editors or compilers, are actually used for two different purposes. One is a ‘proper’ comment, such as the two previous examples. The other is more like an emendation, as in
an order [\was made\] at his Lordship's instance
The difference between the two kinds is that a proper comment is a comment about the surrounding text, whereas an emendation-like comment is more like a part of the text. They are rather easily distinguished by the fact that a proper comment usually involves several consecutive upper case letters while an emendation-like comment does not. This holds true even when a proper comment contains text that is parallel to the preceding text, as in
it will [\be DELETED\] come safe hither
When the brackets contain a proper comment, we do not want CLAWS to assign POS tags to it. Yet it is not possible to configure CLAWS to skip the contents of certain XML elements. The XML tags themselves may be declared so as to be ignored by the POS tagger, but the text between them will be POS tagged nevertheless. The only workaround we could think of was to put their contents in an attribute inside an XML tag:
<note resp="compiler" value="LIST OF NAMES OMITTED" />
<note resp="editor" value="be DELETED" />
On the other hand, we do indeed want to POS tag the contents of the brackets when they are used to annotate emendations. When this is the case, we follow the same principle as with the [{...{] code explained above. Emendations of complete words are encoded as
<note resp="editor">was made</note>
whereas emendations that contain partial words are encoded as
<note resp="editor" range="3,7" orig="Jan[\uary\]">January</note>
Type changes
Changes of typeface in the printed source editions were annotated as (^...^) in the original corpus. In the XML edition, they are annotated as <hi rend="type">...</hi> . (This usually corresponds to an underlined passage in the original letter.)
When the change of typeface concerns partial words as in Theo(^log^) , the original coding is preserved in the orig attribute of the hi element as in
<hi rend="type" range="4,7" orig="Theo(^log^)">Theolog</hi>
Foreign language
Passages in foreign language were annotated with the code (\...\) in the original CEECE. In the XML edition, they are annotated with the code <foreign>...</foreign> .
Note that once the POS tagging had been completed by CLAWS, all the POS tags of words inside <foreign>...</foreign> tags were changed to FW by our post-processor.
CLAWS POS tagging
Once converted into XML, the corpus files were POS tagged by CLAWS4 (v. 24) using the C7 tagset. We thank Paul Rayson, director of UCREL at Lancaster University, for kindly providing us with a free licence for CLAWS.
Configuration of the SGML table
The SGML tags and entities that are included in the input must be declared in an SGML table, i.e. the file SGML_table.c7 in the installation directory of CLAWS. The sample file that comes with the CLAWS distribution contains the TEI codes used in the BNC. As our XML edition largely complies with the TEI standards, almost all the tags and entities used in it were already declared with appropriate instructions in the sample file. We only had to make some minor adjustments.
The elements supplied and foreign were added to the SGML table with the ‘ignore_symbol’ instruction in the 2nd column. For the note element which was already declared, the ‘break’ instruction was removed from the 3rd column. This was because it would have made CLAWS insert a sentence break after each note tag even though in our corpus it often appears in the middle of a sentence. Apart from these changes, the SGML table used in POS tagging the CEECE is the same as the sample file.
Error messages
CLAWS reported a total of 46,781 lines read and a total of 3,538 errors, none of which were fatal. The individual errors are listed in the separate .error files. They can be grouped into the following types, listed here in ascending order of frequency. Beyond what is reported here, we do not know why these errors occurred or how CLAWS treated them.
- invalid SGML symbol ȝ at ref 0000271 in FWANLEY (3 errors)
- no tags on syntactic unit (3 errors)
- denies at ref 0000266 in FBURNEYF
- Gracious at ref 0000042 in FNORTH
- us at ref 0000075 in FNORTH
- word too long (4 errors)
- ten-thousand-Epithet-Epi at ref 0000798 in FGARRICK
- Septuagenarian-Petrarchi at ref 0000723 in FHURD
- Secretary-Marshall-Gener at ref 0000229 in FLENNOX
- Deputy-Quarter-Master-Ge at ref 0000679 in FLENNOX
- character '~' ignored (11 errors)
- all tags filtered off suffix (199 errors)
- Top four:
- iam (21)
- men (21)
- di (15)
- ev (12)
- error in ditto sequence (246 errors)
- These errors seem to have been caused by an XML element appearing in the middle of the sequence, as in
a_AT1 few_DA2 sheep_NN out_II21 <pb n="41" /> of_II22 Derbyshire_NP1
- It seems that this type of error does not affect the accuracy of the tagging and can therefore be considered harmless.
- all tags filtered off word (3,072 errors)
- Top ten:
- dutiful (207)
- me (98)
- lye (82)
- waggon (77)
- felicity (61)
- tt (60)
- mr (46)
- mortification (42)
- providence (38)
- acknowledgements (37)
- Some of these words seem to have been tagged correctly nevertheless.
For more details, see the associated .error files for each collection. (The directory structure is explained in another section.)
Post-processing
All POS tags inside foreign tags (except for punctuation tags) were changed to FW by a Java program, written by Lassi Saario.
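The actual post-processor was a Java program operating on the CLAWS output; purely as an illustration of its effect, the following Python sketch performs the equivalent change on the final XML format described below (w elements with a pos attribute inside foreign elements):

import xml.etree.ElementTree as ET

def retag_foreign(xml_path):
    """Set pos="FW" on every word inside a <foreign> element, leaving
    punctuation tags (which contain no letters) untouched. A sketch of
    the effect only; not the program actually used."""
    tree = ET.parse(xml_path)
    for foreign in tree.iter('foreign'):
        for w in foreign.iter('w'):
            pos = w.get('pos', '')
            if any(c.isalpha() for c in pos):
                w.set('pos', 'FW')
    return tree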
The native output of CLAWS is the so-called vertical output, where there is one row for each token. Words longer than 25 characters and SGML tags which contain a space are stored in a supplementary (.supp) file.
Conversion to final formats
We wanted to provide the end-users of our corpus with two additional alternative formats. One is the so-called horizontal format; the other is XML. What is more, we wanted to provide both a C7- and a C5-tagged version in each of these formats, making a total of six versions of each collection.
The conversions were performed by Paul Rayson’s convert software (v. 7), freely available on GitHub. For each collection, the vertical C7 format was converted into vertical C5 with the v2vmap option. Both the vertical C7 format and the vertical C5 format were then converted into respective horizontal formats with the v2hsupp option. Finally, the vertical C7 was converted into C7 tagged XML with the v2x option and into C5 tagged XML with the v2xmap option. The resulting XML files were validated against the DTD and formatted by XmlStarlet (v. 1.6.1).
What all of the six versions have in common is that they all are versions of a certain collection of letters. The letters in the collection are all included in the same file, which may not be ideal for the purposes of an end-user who wants to perform global searches. This is why we wrote one more program to split the horizontal collection files into separate files for each letter. Two directories were created: one contains all the individual letters in horizontal C7 format and the other in C5 format.
See the section on directory structure for more information.
Final XML format
The XML format as it is documented in the section on XML annotation above concerns the ‘pre-tagging’ format of the corpus, i.e. the format that the original COCOA format was converted into before POS tagging. The ‘post-tagging’ format, i.e. the XML format of the final POS tagged corpus, is equivalent to the pre-tagging format except for the w elements that were added only after POS tagging.
Each token (whether a word or a punctuation mark) has been put inside a w element, which has two attributes. The id attribute contains the token identifier (e.g. 7.18), where the first number identifies the sentence and the second the token within that sentence. The pos attribute contains the POS tag of the token. It may be an ordinary tag or a ditto tag, such as RR31, which means that the word is the first part of a three-part multiword adverb, e.g. none in none the less (see the note at the end of this page).
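For instance, the three parts of the multiword adverb none the less might appear along the following lines (a constructed example; the token identifiers are illustrative):

<w id="7.18" pos="RR31">none</w>
<w id="7.19" pos="RR32">the</w>
<w id="7.20" pos="RR33">less</w>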
Tagging accuracy
The accuracy rate of POS tagging is probably the single most important piece of information for the end-user of a POS tagged corpus. CLAWS is reported to have consistently achieved an accuracy of 96–97 % on Present-day English, but the rate can be expected to drop on historical corpora such as the CEECE (see Schneider, Hundt & Oppliger 2016).
Sample
In order to estimate the accuracy of the POS tagged CEECE (from now on referred to as the TCEECE), we decided on a more or less random sample of 15 letters, all from different collections. The length of the letters varies between 300 and 400 words, which is justified given that the median length of all the letters in the corpus is 353 words. Out of the 15 letters, five are written by women (34 % of the total word count) and two in the 17th century (12 % of the total word count). This corresponds roughly to the distribution in the whole corpus, in which 27 % of the words are by women and 13 % from the 17th century. The social ranks are rather evenly distributed even though some rare ones are missing. The total number of words in the sample is 5,245, which is approximately 0.24 % of the words in the whole corpus.
| Letter ID | Sender Name | Sender Gender | Sender Rank | Year | Word Count |
| --- | --- | --- | --- | --- | --- |
| BENTHAJ_056 | Alleyne Fitzherbert | Male | Nobility | 1792 | 339 |
| BURNEY_039 | Charles Burney | Male | Professional | 1784 | 349 |
| BURNEYF_013 | Frances (Fanny) Burney | Female | Professional | 1779 | 340 |
| CARTER_023 | Elizabeth Carter | Female | Clergy (Lower) | 1740 | 391 |
| CLIFT_027 | Joanna Clift | Female | Other | 1795 | 355 |
| DUKES_052 | Charles Lennox | Male | Nobility | 1741 | 340 |
| FLEMIN2_133 | George Fleming | Male | Gentry (Lower) | 1691 | 341 |
| GARRICK_032 | David Garrick | Male | Gentry (Lower) | 1753 | 393 |
| GIBBON_007 | Edward Gibbon | Male | Gentry (Lower) | 1758 | 304 |
| PEPYS3_035 | Samuel Pepys | Male | Professional | 1680 | 312 |
| SANCHO_024 | (Charles) Ignatius Sancho | Male | Other | 1778 | 307 |
| SWIFT_059 | Elizabeth Germain née Berkeley | Female | Nobility | 1731 | 334 |
| TWINING_010 | Elizabeth Twining née Smythies | Female | Clergy (Lower) | 1765 | 362 |
| WEDGWOO_023 | Josiah Sr Wedgwood | Male | Professional | 1769 | 379 |
| WENTWO2_146 | Richard Wardman | Male | Other | 1734 | 399 |
See the CEECE metadata for more details on the letters and their writers.
Foreign passages, i.e. parts of text that were annotated with the code (\...\) in the original corpus and <foreign>...</foreign> in the XML edition, were excluded from the sample since their tagging had been automatically corrected by our post-processor.
Calculation principles
For each letter in the sample, we checked the vertical C7 output token by token and decided for each token whether the POS tag assigned by CLAWS to that token was correct or not. The accuracy rate was calculated by simply dividing the number of accurately tagged tokens by the total number of tokens.
By a ‘token’ in this context we mean both words and punctuation marks, as both are assigned POS tags by CLAWS (even though the punctuation tags tend to be correct by default). We chose to count punctuation marks as tokens because this is the usual practice, and we wanted our statistics to be comparable with previously reported figures. What we do not count as tokens are the XML elements, tagged as NULL by CLAWS, and the lines of hyphens that are inserted by CLAWS between sentences. In other words, every row of the vertical output that is not a sentence break or NULL is counted as a token and taken into account in the calculation.
The problem with our straightforward approach is that it presupposes perfect tokenisation. Unfortunately, tokenisation is not perfect. CLAWS does double duty in that it first splits the input into individual tokens (i.e. words and punctuation marks) and only then assigns POS tags to those tokens, meaning that errors may happen in tokenisation as well as tagging. Take the following part of the vertical C7 output, for instance:
0000292 240 and 93 CC
0000292 250 when 93 [RRQ/55] CS/45
0000292 260 twill 93 NN1
0000292 270 be 93 VBI
0000292 280 better 93 [JJR/98] RRR/2 NN1%/0 VV0%/0
The problem with the third row is that the assigned tag is neither correct nor incorrect. There simply is no correct tag for that token. That is because there are actually two tokens (twill = it + will) instead of one, each of which should be given its own tag.
Quite common is the opposite circumstance where there are two tokens when there should be only one:
0000271 010 Boot 93 [NN1/64] VV0/36
0000271 020 is 93 VBZ
0000271 030 making 93 [VVG/95] NN1@/5
0000271 040 Tritons 04 NP2
0000271 050 & 03 CC
0000271 060 Sphinx > 93 NN1
0000271 061 's < 03 [VBZ/50] GE/48 VHZ@/2
0000271 062 , 03 ,
0000271 070 & 03 CC
0000271 080 does 93 VDZ
0000271 090 them 93 PPHO2
0000271 100 very 97 RG
0000271 110 well 96 [RR/97] JJ@/3
On the 6th and 7th rows, the plural noun Sphinx's, which should be tagged as NN2, is mistaken for a contraction of Sphinx is and is consequently split into two tokens, both of which are given their own tags.
To deal with cases like these, we have adopted the following principle. Whenever tagging accuracy is affected by a tokenisation error, every token that is affected by the error is counted as one incorrectly tagged token (even though there is no correct alternative in this case). In the former of the previous examples, the 3rd row is thus counted as one error, whereas in the latter, both the 6th and 7th row are counted as errors.
The same rule applies when CLAWS confuses a word token and a punctuation token for one token, e.g. workman_NN1 ._. for workman._NNU. In the opposite situation, however, we only count the punctuation tag and not the word tag as an error, given that the word token can be tagged independently as well. This is the case when e.g. Sam:_NP1, which is short for Samuel, is mistaken for Sam_NP1 :_:.
In all other cases than the aforementioned, we have assumed (rather unrealistically, perhaps) that there always exists a single correct POS tag for each token. Whenever a token has been assigned more than one tag by CLAWS, we have taken the one inside square brackets to be the ‘chosen’ one. (It turns out that the tag inside square brackets is almost always, but not always, the tag with the greatest likelihood value.)
If the tag thus ‘chosen’ by CLAWS differs from the tag we judge to be the correct one, the token is counted as an error. However, note that the so-called ditto tags (see the end of this page) are converted into ordinary tags prior to comparison. Even if CLAWS had chosen, say, RR21 while we had chosen RR22, the tag chosen by CLAWS would be counted as correct.
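The principles above can be summarised in code form. The following Python sketch shows how the tag ‘chosen’ by CLAWS could be extracted from the tag portion of a vertical-output row and compared with the manually assigned tag; it is a simplified illustration under the assumptions stated in the comments, not the script we actually used.

import re

def chosen_tag(tag_field):
    """Return the tag 'chosen' by CLAWS from the tag portion of a row,
    e.g. '[RRQ/55] CS/45' -> 'RRQ', 'CC' -> 'CC'. If no tag is given in
    square brackets, the first tag listed is taken."""
    bracketed = re.search(r'\[([A-Z0-9]+)[@%]*/', tag_field)
    if bracketed:
        return bracketed.group(1)
    return re.split(r'[/@%\s]', tag_field.strip())[0]

def base_tag(tag):
    """Convert a ditto tag into an ordinary tag by stripping its two
    digits, e.g. 'RR21' -> 'RR' (assumes the usual two-digit convention)."""
    return re.sub(r'\d\d$', '', tag)

def is_correct(claws_tag_field, manual_tag):
    """A token counts as correct if the chosen tag equals the manually
    assigned tag once ditto tags have been reduced to ordinary tags."""
    return base_tag(chosen_tag(claws_tag_field)) == base_tag(manual_tag)

# Accuracy rate = correctly tagged tokens / all tokens, where the lines
# of hyphens between sentences and the XML elements tagged NULL are not
# counted as tokens.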
Checking guidelines
Checking POS tagging requires many difficult decisions as to what is the correct POS tag of a token. One needs guidelines to resolve ambiguities between alternative tags. This is even more so when one not only checks but also corrects some of the tagging. For the purposes of checking (i.e., calculating the accuracy rate), it may sometimes be sufficient to know that the tag assigned by CLAWS is incorrect without actually knowing what the correct alternative is, but for the purposes of correcting (i.e., replacing incorrect tags by correct ones), this escape route is obviously unavailable.
We have followed the tagging guidelines of the BNC sampler whenever possible. The Oxford English Dictionary has been consulted in many unclear cases. When those sources have not provided us with definitive answers, we have made guidelines of our own, the most noteworthy of which are given here. The same guidelines have been applied in correcting with some minor additions.
One common problem concerns the tagging of parts of multiwords that would be joined in standard English but have been separated in the CEECE. Take the adverb likewise, for instance. When written as such, it is correctly tagged by CLAWS as RR; but when the parts are separated, as in
I came away very bare in apparel and my child like wise
like wise is tagged by CLAWS as like_II wise_JJ, which is obviously incorrect. As for the correct tagging, there are at least three alternatives: one could tag the parts conservatively as like_JJ wise_NN1; one could use ditto tags like_RR21 wise_RR22 to indicate that the individual words are parts of a multiword adverb; or one could maintain that the words should be joined and tagged together as likewise_RR. We have decided to go with the first option, not only in this case but in all similar cases.
The mirror image of this problem is a multiword that would be separated in standard English but has been joined in the CEECE. The correct tagging for incase, for instance, could be either in_II case_NN1, in_CS21 case_CS22 or incase_CS. Here we have chosen the last option.
When it comes to quasi-nominal uses of present participles, we have decided that they should be tagged as verbs, as in
Nothing but the indisposition I have been in could have prevented my returning_VVG you thanks for the favour of yours
More as a post-nominal quantifier is considered an adverb and not a determiner:
I must beg one favour more_RRR of you
But: I must beg one more_DAR favour of you
On the other hand, all and both are considered before-determiners even when they appear after the nominal head:
They are all_DB riding about
(cf. All_DB of them are riding about)
I wish you both_DB2 success of your own
(cf. I wish both_DB2 of you success of your own)
Please and pray are interpreted to be adverbs in all except clearly verbal contexts:
Will you please_RR forward my letters to them
Pray_RR send an answer by the next post
I should be glad to receive your answer please_RR to direct for me…
But:
You certainly will do as you please_VV0
I shall ever be bound to pray_VVI for you
Accuracy rate
Accuracy by letter and in total, both in the C7 and the C5 tagset:
| Letter ID | (a) Tokens in total | (b) Accurately tagged tokens, C7 | (c) Accuracy rate, C7 (b / a) | (d) Accurately tagged tokens, C5 | (e) Accuracy rate, C5 (d / a) |
| --- | --- | --- | --- | --- | --- |
| BENTHAJ_056 | 387 | 372 | 96.1 % | 372 | 96.1 % |
| BURNEY_039 | 412 | 384 | 93.2 % | 385 | 93.4 % |
| BURNEYF_013 | 410 | 383 | 93.4 % | 383 | 93.4 % |
| CARTER_023 | 427 | 392 | 91.8 % | 394 | 92.3 % |
| CLIFT_027 | 362 | 328 | 90.6 % | 329 | 90.9 % |
| DUKES_052 | 369 | 352 | 95.4 % | 352 | 95.4 % |
| FLEMIN2_133 | 378 | 349 | 92.3 % | 350 | 92.6 % |
| GARRICK_032 | 452 | 437 | 96.7 % | 437 | 96.7 % |
| GIBBON_007 | 331 | 315 | 95.2 % | 317 | 95.8 % |
| PEPYS3_035 | 340 | 322 | 94.7 % | 322 | 94.7 % |
| SANCHO_024 | 375 | 359 | 95.7 % | 359 | 95.7 % |
| SWIFT_059 | 367 | 340 | 92.6 % | 340 | 92.6 % |
| TWINING_010 | 425 | 405 | 95.3 % | 408 | 96.0 % |
| WEDGWOO_023 | 424 | 412 | 97.2 % | 413 | 97.4 % |
| WENTWO2_146 | 430 | 416 | 96.7 % | 416 | 96.7 % |
| TOTAL | 5,889 | 5,566 | 94.5 % | 5,577 | 94.7 % |
These figures can be combined with metadata about the letters and their writers in order to calculate accuracy rates by various social and historical variables, such as gender or century:
| Variable | Accuracy rate (C7) | Accuracy rate (C5) |
| --- | --- | --- |
| Gender: Male | 95.4 % | 95.5 % |
| Gender: Female | 92.8 % | 93.1 % |
| Century: 17th | 93.5 % | 93.6 % |
| Century: 18th | 94.7 % | 94.9 % |
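The rates in the table above are pooled over the relevant letters rather than averaged over the letter-specific rates: for example, the 17th-century C7 rate comes from the two 17th-century letters in the sample, (349 + 322) / (378 + 340) = 671 / 718 ≈ 93.5 %.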
Accuracy by tags
The accuracy rates are given here for C7 tags only, grouped according to the first letter of the tag. Rates by particular tags and the most frequent incorrect–correct tag pairs, both for C7 and C5, are given in the appendix.
For each tag group (a), column (b) contains the total number of tags in (a) assigned by CLAWS (‘selected assignments’), (c) contains the total number of tags in (a) assigned by us (‘relevant assignments’), (d) contains the number of tags in (a) on which we agree with CLAWS (‘true assignments’), (e) contains ‘precision’ (d / b) and (f) contains ‘recall’ (d / c).
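For example, for the articles (A-) in the table below, CLAWS assigned 443 article tags and we assigned 439, and the two agree on 438, giving a precision of 438 / 443 ≈ 98.9 % and a recall of 438 / 439 ≈ 99.8 %.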
| (a) Tag group | (b) Selected assignments | (c) Relevant assignments | (d) True assignments | (e) Precision (d / b) | (f) Recall (d / c) |
| --- | --- | --- | --- | --- | --- |
| Punctuation marks | 570 | 562 | 562 | 98.6 % | 100.0 % |
| A- (Articles) | 443 | 439 | 438 | 98.9 % | 99.8 % |
| BCL (Before-clause marker) | 2 | 2 | 2 | 100.0 % | 100.0 % |
| C- (Conjunctions) | 384 | 393 | 359 | 93.5 % | 91.3 % |
| D- (Determiners) | 179 | 168 | 158 | 88.3 % | 94.0 % |
| EX (Existential there) | 11 | 11 | 11 | 100.0 % | 100.0 % |
| F- | 6 | 8 | 6 | 100.0 % | 75.0 % |
| GE (Genitive marker) | 23 | 23 | 22 | 95.7 % | 95.7 % |
| I- (Prepositions) | 535 | 527 | 511 | 95.5 % | 97.0 % |
| J- (Adjectives) | 259 | 265 | 237 | 91.5 % | 89.4 % |
| M- (Numbers) | 103 | 96 | 95 | 92.2 % | 99.0 % |
| N- (Nouns) | 1,032 | 1,017 | 944 | 91.5 % | 92.8 % |
| P- (Pronouns) | 639 | 644 | 636 | 99.5 % | 98.8 % |
| R- (Adverbs) | 385 | 412 | 358 | 93.0 % | 86.9 % |
| TO (Infinitive marker) | 106 | 108 | 106 | 100.0 % | 98.1 % |
| UH (Interjection) | 2 | 1 | 1 | 50.0 % | 100.0 % |
| V- (Verbs) | 1,140 | 1,123 | 1,063 | 93.2 % | 94.7 % |
| XX (Negation) | 57 | 58 | 57 | 100.0 % | 98.3 % |
| ZZ- (Letters of alphabet) | 13 | 0 | 0 | 0.0 % | |
Corrected collection
In addition to the overall accuracy rate, we also wanted to know what collections have the worst accuracy. Based on the hypothesis that the spelling variation is greatest in authentic letters written by uneducated women, corroborated somewhat by more or less targeted spot checks, we came to the conclusion that the collections with probably the worst accuracy are Pauper and Clift.
The poor tagging accuracy of such collections makes it difficult to study the language of lower-class women and calls for manual correction. While the Clift collection was too large to be corrected within our project, the Pauper collection was corrected in full. As a by-product, we learned that the accuracy rate of the uncorrected collection is 87.9 %, whereas the corrected collection can be assumed to be almost error-free.
The accuracy was calculated following the same guidelines as with the sample. Additional normalisation and changes to tokenisation were made whenever it seemed appropriate, even when not strictly required by the guidelines. The multiword like wise, for instance, which would be tagged as like_JJ wise_NN1 according to the guidelines, was nevertheless corrected into likewise so that it could be given the more natural tag RR.
Known issues
Despite our efforts to normalise the variability of the language, there remain many non-standard features in the corpus that have led to systematic tokenisation and tagging errors, such as the following:
- Sentence breaks
- Unmarked sentence breaks
- E.g. My Aunt and myself return you Thanks for the Venison which was vastly good at the same time I own I was very glad that you was not here to partake of it for I am told you always show the Veneration you have for it by eating as much as three or four other people if this is the case (which I am much inclined to believe) we should have had but a Small share which would have been a little hard as it was the first we had seen for the Season, indeed I have heard such an extraordinary account of your stomach that I am in some fear for my poor tame Deer least you should take a fancy to cut a slice out of them while they are living if you think you can't resist the temptation I beg you will forbear going into the park when you next come here.
- Consecutive sentences that have not been explicitly separated from each other are interpreted by CLAWS as a single sentence: no line of hyphens is inserted between them in the vertical output to mark the boundary. (This might also have an effect on tagging in some cases.)
- Double-marked sentence breaks
- E.g. As to brother Osborne, his harvest has, I hope, been plentiful and well got in. - My friend, poor Spink…
- Here the sentence break is marked both by a full stop and a hyphen. CLAWS interprets in. as one token, tags it as NNU, and fails to insert a line of hyphens.
- One word written as two, or vice versa
- One word written as two
- He has left me a_AT1 gain_NN1
- Should be: again_RT
- Two words written as one
- Ihave_NN1 not got wherewith to defray the expense
- Should be: I_PPIS1 have_VH0
- Redundant, missing or ambiguous apostrophes
- Genitives
- I am my old friends most faithful humble servant
- Tagged as: friends_NN2
- Should be: friend_NN1 's_GE
- In 3 weeks time
- Tagged as: weeks_NNT2
- Should be: weeks_NNT2 '_GE
- Possessive pronouns
- It's adoption would be a most important public benefit
- Tagged as: It_PPH1 's_GE
- Should be: Its_APPGE
- Verb contractions
- Its not in my power to subsist with my own labour
- Tagged as: Its_APPGE
- Should be: It_PPH1 's_VBZ
- Next time Ill write it better
- Tagged as: Ill_NP1
- Should be: I_PPIS1 'll_VM
- Plural nouns
- Boot is making Tritons & Sphinx's, & does them very well
- Tagged as: Sphinx_NN1 's_VBZ
- Should be: Sphinxes_NN2
- Abbreviations
- Unrecognised abbreviations
- E.g. Kitty may rd. it to you
- The abbreviation rd. (short for read) is tagged as NN1 instead of VVI
- Unrecognised abbreviation markers
- E.g. Sam: Crisp Esquire
- Tagged as: Sam_NP1 :_:
- Should be: Sam:_NP1
- Redundant abbreviation markers
- E.g. We often walk over & dine with. Mr. Boys
- CLAWS tags the full stop after with as a separate token and inserts a sentence break thereafter.
- The full stop is a remnant from the abbreviation wth. that has been expanded in the semi-automated VARDing.
- Some of the redundant abbreviation markers have been removed, but not all (see the section on redundant punctuation).
- Capitalisation (or lack thereof)
- Uncapitalised proper nouns are tagged as common nouns
- E.g. Leave it at the bear and wheat Sheaf in thames_NN2 street in london_NN1 for mr Tallbutt hyeman_NN1 to cantubery_NN1
- Uncapitalised titles are tagged as units of measurement
- E.g. mr_NNU Tallbutt hyeman
- Infinitive verb forms are tagged as finite when capitalised
- E.g. I hope you Gentlemen will be so kind to Advance_VV0 the money
- Sentence breaks are not recognised when the first word of the new sentence is not capitalised
- E.g. she deserves his poetry in her praises._NNU your friend Mrs Barber has been here
- Variants that have been normalised incorrectly or not at all
- Confusing variants
- E.g. dye instead of die, on instead of one, to instead of two
- A news letter wrote by a servant
- Tagged as: wrote_VVD
- Should be: wrote_VVN (or written_VVN)
- Then as a variant of than
- E.g. I find myself much better then_RT when I came
- Should be tagged as CSN
- Than as a variant of then
- E.g. You must know than_CSN a little how to keep a family
- Should be tagged as RT
- Unrecognised verb inflections
- E.g. I shall number it to the many other kindnesses conferd_NN1 on Sir your obliged & most humble servant
- Should be: conferd_VVN (or conferred_VVN)
- This has not been caught by normalisation [pre3] since there is no apostrophe in front of the d
- Variants that were incorrectly standardised in the semi-automated VARDing
- E.g. Having been absent much longer in usual, I thought it my duty to beg a continuance of the same.
- The in was originally yn that should have been normalised into than
There are also many errors that could hardly have been avoided by means of normalisation. Some of the most frequent and characteristic ones are listed here:
- Past tenses (--D), past participles (--N) and adjectives (JJ) are confused with one another when they all take the same (-ed) form
- They must neither be given nor sold_VVD
- The damndest pen I ever handled_VVN
- Whoever expected_JJ advancement should appear much in public
- Too long a distance between the auxiliary and the main verb
- E.g. They will_VM get_VVI an order & bring_VV0 me home
- CLAWS fails to recognise that will is the auxiliary verb for bring which should thus be tagged as infinitive (VVI)
- Bare infinitives are not recognised
- E.g. I couldn't help but notice_VV0, I see the time approach_NN1 in which…
- Should be tagged as VVI
- Be is tagged as VBI (infinitive) even when it is imperative or subjunctive and should be tagged as VB0 (base form)
- Be_VBI so kind to order the payment
- I beg of you to send me an answer whether he be_VBI there or near
- Adverbs without the -ly suffix are tagged as adjectives
- I got safe_JJ to Canterbury
- My hands must labour hard_JJ
- Please and pray are tagged sometimes as verbs, sometimes as adverbs (cf. our checking guidelines)
- Please_RR to direct for me the Bull in Coggeshall Essex
- Pray_VV0 send me word by the bearer
- Else as a post-nominal adverb
- Always tagged by CLAWS as RR, although, according to the C7 tagset, it should be tagged as RA
- E.g. Nothing else_RR can ever show my gratitude
- Common nouns after Dear are tagged as proper nouns
- E.g. Dear Brother_NP1, Dear Sir_NP1
- But: My dear Cousin_NN1
- The first person pronoun is tagged as a letter of the alphabet
- E.g. I_ZZ1 and my wife must be obliged to come
- Slashes, when used parallel to commas (e.g. between the verses of a poem or the lines of an address), are tagged as FO (formula)
- E.g. To Mr. Robert Dodsley /_FO at Tully's Head /_FO Pall Mall.
- Ideally, they would be tagged as punctuation marks, but there is no punctuation tag for a slash in the C7 tagset.
Directory structure
The final POS tagged corpus is distributed in a directory named tceece. It contains three subdirectories: tceece-collections, tceece-letters-c5 and tceece-letters-c7.
The directory tceece-collections contains 78 subdirectories, one for each letter collection. Each collection directory has a subdirectory orig which contains the following files (where COLLECTION stands for the name of the collection directory, e.g. FAUSTEN, FDEFOE):
COLLECTION.xml is the output from the conversion of the original collection file from COCOA into XML. It is the input file that was fed to CLAWS to be POS tagged.
COLLECTION.xml.c7 is the original vertical output from CLAWS.
COLLECTION.xml.c7.errors contains the error messages given by CLAWS.
COLLECTION.xml.c7.supp is a supplement to the vertical output. It contains the words that are longer than 25 characters and the SGML tags that contain a space.
NB: There is an exception to the aforementioned rule. In FPAUPER/orig, the file FPAUPER_orig.xml.c7 is the original output from CLAWS, whereas FPAUPER.xml.c7 is the corrected version of the former. The additional Excel files record intermediate stages in the correction process.
The root of each collection directory contains the following versions of the final POS tagged collection:
- Vertical versions
COLLECTION.xml.c7 is the post-processed version of orig/COLLECTION.xml.c7, i.e. the final POS tagged version of the collection in vertical C7 format.
COLLECTION.xml.c5 is the C5 equivalent of COLLECTION.xml.c7.
- Horizontal versions
COLLECTION_horiz.c7 is the horizontal version of the collection in the C7 tagset, obtained from the vertical version COLLECTION.xml.c7 and the associated supplement file orig/COLLECTION.xml.c7.supp.
COLLECTION_horiz.c5 is the C5 equivalent of COLLECTION_horiz.c7.
- XML versions
COLLECTION_c7.xml is the XML version of the collection in the C7 tagset, obtained from the vertical version COLLECTION.xml.c7 and the associated supplement file orig/COLLECTION.xml.c7.supp.
COLLECTION_c5.xml is the C5 equivalent of COLLECTION_c7.xml.
COLLECTION_c7_formatted.xml and COLLECTION_c5_formatted.xml are the formatted versions (i.e. versions where line division and indentation have been systematised to make the XML syntax more readable to the human eye) of COLLECTION_c7.xml and COLLECTION_c5.xml, respectively.
The directories tceece-letters-c5 and tceece-letters-c7 contain all 4,923 letters as text files in horizontal C5 and C7 formats. They have been obtained from the files COLLECTION_horiz.c7 and COLLECTION_horiz.c5 for each collection. The files are named according to the letter identifiers (ADDISON_001, ADDISON_002 etc.).
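If you work with the individual letter files, a small script can recover the collection structure from the file names alone. The following Python sketch counts the letters per collection in tceece-letters-c7; the local path is an assumption, the file extension is left unspecified since it is not documented here, and the sketch assumes that the collection name is everything before the final underscore in each identifier (as in ADDISON_001).

```python
from collections import Counter
from pathlib import Path

# Assumed path to your local copy of the distribution; adjust as needed.
letters_dir = Path("tceece/tceece-letters-c7")

# Count letters per collection, assuming each file name starts with the
# collection name followed by an underscore and a running number
# (e.g. ADDISON_001), as described above.
letters_per_collection = Counter(
    path.stem.rsplit("_", 1)[0]
    for path in letters_dir.iterdir()
    if path.is_file()
)

for collection, count in sorted(letters_per_collection.items()):
    print(f"{collection}\t{count}")
```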
In addition to the aforementioned directories, the root of the tceece directory contains the file TCEECE.dtd, the external Document Type Definition of the XML files.
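As a minimal illustration of how the DTD can be used, the sketch below validates one of the formatted XML files against TCEECE.dtd with lxml. The paths and the choice of FAUSTEN as the example collection are assumptions; any collection file should work the same way.

```python
from lxml import etree

# Assumed local paths; FAUSTEN serves only as an example collection.
dtd = etree.DTD("tceece/TCEECE.dtd")
tree = etree.parse("tceece/tceece-collections/FAUSTEN/FAUSTEN_c7_formatted.xml")

if dtd.validate(tree):
    print("Valid against TCEECE.dtd")
else:
    # Print the validation errors collected by lxml.
    for error in dtd.error_log.filter_from_errors():
        print(error)
```

If the XML files declare the DTD in their DOCTYPE, you could alternatively parse them with etree.XMLParser(dtd_validation=True) and let the parser validate against the declared DTD directly.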
Which version you should use depends on your application. The individual letters are the most convenient choice for use with a corpus analysis toolkit (such as AntConc). The formatted XML versions are mainly meant to be used with an XML tool, together with the DTD. If you want to make exact calculations token by token and tag by tag, you can try importing the vertical versions into a spreadsheet (such as Excel).
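If a spreadsheet feels cumbersome, the vertical files can also be processed directly with a short script. The sketch below counts C7 tags in one vertical collection file; since the exact column layout of the post-processed vertical output is not specified in this section, the assumption that the tag is the last whitespace-separated field on each token line (as well as the path, the example collection and the encoding) should be checked against your copy of the corpus before relying on the counts.

```python
from collections import Counter

tag_counts = Counter()

# Assumed path and encoding; FAUSTEN serves only as an example collection.
with open("tceece/tceece-collections/FAUSTEN/FAUSTEN.xml.c7", encoding="utf-8") as f:
    for raw in f:
        line = raw.strip()
        # Skip blank lines, the lines of hyphens that mark sentence breaks,
        # and any markup lines carried over from the XML input.
        if not line or set(line) <= {"-"} or line.startswith("<"):
            continue
        fields = line.split()
        # Assumption: the C7 tag is the last field on each token line.
        tag_counts[fields[-1]] += 1

for tag, count in tag_counts.most_common(10):
    print(f"{tag}\t{count}")
```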
References
BNC = British National Corpus. http://www.natcorp.ox.ac.uk
BNC Sampler Corpus: Guidelines to Wordclass Tagging. UCREL, University of Lancaster. Updated on 16 Sep, 1997. http://ucrel.lancs.ac.uk/bnc2sampler/guide_c7.htm
CEECE = Corpus of Early English Correspondence Extension. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio at the Department of Modern Languages, University of Helsinki. http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/ceece.html
Kaislaniemi, Samuli. 2018. “The Corpus of Early English Correspondence Extension (CEECE)”. In Patterns of Change in 18th-century English: A Sociolinguistic Approach. (Advances in Historical Sociolinguistics 8), ed. by Terttu Nevalainen, Minna Palander-Collin & Tanja Säily. Amsterdam: John Benjamins, 45–59. https://doi.org/10.1075/ahs.8
Marttila, Ville. 2011. Helsinki Corpus TEI XML Edition Documentation. Helsinki: Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki. https://helsinkicorpus.arts.gla.ac.uk/display.py?fs=100&what=manual
A Post-Editor's Guide to CLAWS7 Tagging. 1996. Written by the UCREL team. UCREL, University of Lancaster. http://www.natcorp.ox.ac.uk/docs/claws7.html
Rayson, Paul. 2011. CLAWS4 readme for Unix and Windows. Distributed with the CLAWS4 (v. 24) package. UCREL, University of Lancaster. http://ucrel.lancs.ac.uk/claws/
Saario, Lassi, Tanja Säily, Samuli Kaislaniemi & Terttu Nevalainen. Submitted. “The burden of legacy: Producing the Tagged Corpus of Early English Correspondence Extension (TCEECE)”. Research in Corpus Linguistics, special issue on “Challenges in combining structured and unstructured data in corpus development”.
Säily, Tanja. 2013. “Progress in POS tagging the CEECE”. Seminar presentation, From Correspondence to Corpora: A Seminar on Digital Processing of Historical Letter Compilations, Helsinki, Finland, November 2013. https://www.cs.helsinki.fi/u/tsaily/presentations/letterseminar2013-11-15_ts.pdf
Säily, Tanja, Terttu Nevalainen & Harri Siirtola. 2011. “Variation in noun and pronoun frequencies in a sociohistorical corpus of English”. Literary and Linguistic Computing 26(2): 167–188. https://doi.org/10.1093/llc/fqr004
Säily, Tanja, Turo Vartiainen & Harri Siirtola. 2017. “Exploring part-of-speech frequencies in a sociohistorical corpus of English”. In Exploring Future Paths for Historical Sociolinguistics (Advances in Historical Sociolinguistics 7), ed. by Tanja Säily, Arja Nurmi, Minna Palander-Collin & Anita Auer. Amsterdam: John Benjamins, 23–52. https://doi.org/10.1075/ahs.7.02sai
SCEEC = Standardised-spelling Corpora of Early English Correspondence. 2012. Compiled by Terttu Nevalainen, Helena Raumolin-Brunberg, Samuli Kaislaniemi, Jukka Keränen, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Tanja Säily and Anni Sairio. Standardised by Mikko Hakala, Minna Palander-Collin and Minna Nevala. Department of English / Department of Modern Languages, University of Helsinki. http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/standardized.html
Schneider, Gerold, Marianne Hundt & Rahel Oppliger. 2016. “Part-of-speech in historical corpora: tagger evaluation and ensemble systems on ARCHER”. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016). (Bochumer Linguistische Arbeitsberichte 16), ed. by Stefanie Dipper, Friedrich Neubarth & Heike Zinsmeister. Bochum: Ruhr-Universität Bochum, 256–264. https://doi.org/10.5167/uzh-135065
Appendices
- Key to normalisation
- Accuracy by tags