Introduction to the Helsinki Corpus TEI XML edition

The sections 'Background' and 'Conversion project in detail' were included on the Helsinki Corpus TEI XML CD-ROM distributed at the Helsinki Corpus Festival in September 2011.

Brief introduction

A new XML-annotated version of the Helsinki Corpus was released in September 2011. The latest version is currently 0.96. The corpus is available at https://helsinkicorpus.arts.gla.ac.uk, where it is kindly hosted by the University of Glasgow.

This introduction gives some background information on the conversion project, describes the process undertaken, and provides a list of the project members.

The reference line for the XML version of the Helsinki Corpus reads:

Helsinki Corpus TEI XML Edition. 2011. First edition. Designed by Alpo Honkapohja, Samuli Kaislaniemi, Henri Kauhanen, Matti Kilpiö, Ville Marttila, Terttu Nevalainen, Arja Nurmi, Matti Rissanen and Jukka Tyrkkö. Implemented by Henri Kauhanen and Ville Marttila. Based on The Helsinki Corpus of English Texts (1991). Helsinki: The Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki.

Background

It is difficult to pinpoint exactly when the idea to produce an updated version of the Helsinki Corpus was born. Following the successful release of the original corpus, the compilers went on to work on various more specialised corpora such as the Corpus of Early English Correspondence (CEEC) and the Corpus of Early English Medical Writing (CEEM). The Helsinki Corpus of Older Scots (HCOS) was compiled alongside the Helsinki Corpus by Anneli Meurman-Solin following largely identical annotation standards (see http://icame.uib.no/ij19/solin.pdf). Over the two decades that followed, the experience gained in compiling the Helsinki Corpus was always present, informing the compilers’ new projects. The Research Unit for Variation, Contacts and Change in English (VARIENG), funded by the Academy of Finland as a National Centre of Excellence from 2000 to 2011, was largely composed of members of the original team and their younger colleagues.

Over the years, the Helsinki Corpus team kept records of new information about the sample texts in the corpus. Gradually, a folder full of errata, most of it minor corrections of typos but some more substantial, was compiled. Likewise, developments in the field of corpus linguistics and corpus annotation were beginning to threaten the format of the original corpus with obsolescence. Although several later corpus projects came to adopt some of the practices used in the Helsinki Corpus, such as the time periodisation which was developed specifically for the corpus, the COCOA annotation scheme did not become a universal standard in corpus linguistics. As years went by, new annotation standards appeared on the corpus linguistic scene and, encouragingly, many projects adopted variations of widely used metalanguages such as XML. The idea of converting the Helsinki Corpus to a more universal encoding format was mentioned every now and again, but for a while practicalities prevented that from happening.

Members of the original team, and VARIENG as an institution, always kept an eye on how the Helsinki Corpus was used around the world. VARIENG maintains an ever-growing bibliography of research conducted with the corpus as part of the Helsinki Corpus entry in the Corpus Resource Database (CoRD), and every now and again we hear of other projects making use of the corpus or inspired by it. Our typical reaction has been to contact the researcher or team in question, inquire about their work, and wish them success. One particular type of activity involved conversions to other encoding and annotation formats, some of them XML. Although these conversions were certainly of interest, it turned out that none of them had attempted a comprehensive conversion of the entire corpus and all its rich data.

Our decision to produce an updated version turned from idea to action in the autumn of 2010, prompted by two facts: the second funding period of the VARIENG research unit was coming to an end at the end of 2011, and the same year would also mark the 20th anniversary of the original release of the corpus. The latter of these two events had already inspired VARIENG to put the wheels in motion for an anniversary conference called the Helsinki Corpus Festival to be held in September–October 2011, and it did not take long for us to realise that the conference would provide the perfect opportunity to release the new XML edition to the research community. With somewhat less than a year to go before the conference, it was time to get busy.

Why convert the Helsinki Corpus into XML and not some other scheme, and why use the specific scheme promoted by the Text Encoding Initiative (TEI)? The answer is twofold. As for XML, the main reason for the conversion was to ensure the preservation of all the information encoded into the original corpus. One of the dangers of proprietary encoding systems, a definition we may take to apply to all standards not in sufficiently wide use to be considered universal, is that over time, as new systems emerge, older ones in limited use are gradually forgotten and the data is rendered effectively inaccessible. Because XML is the de facto universal markup language in the world of computing today, it makes sense to convert existing corpora to it rather than produce yet another new markup language. As for TEI, the TEI Guidelines for Electronic Text Encoding and Interchange define an XML schema developed by and for humanities computing that provides an extraordinarily rich set of elements and attributes for all manner of documents. Corpora are catered for as well, and the wealth of documentation makes TEI particularly attractive as a universal scheme for corpus linguists to adopt. Another distinct benefit of XML is the option of using standoff annotation, which will allow users to add new layers of data without disrupting the base text. Finally, while it must be noted that very few corpus tools available today can make full use of the XML format, we believe that it is simply a matter of time before most tools support it.

The initial plan was to make as much use as possible of existing XML conversions of the Helsinki Corpus and possibly to outsource some of the conversion work. A number of colleagues who had carried out XML conversions of the Helsinki Corpus were contacted at this early stage, but it soon emerged that the conversions, although indeed to XML, had been carried out automatically with the great majority of encoded data omitted. Outsourcing the conversion project, on the other hand, proved prohibitively expensive, and it was becoming increasingly clear that the conversion would require not only technical ability but also substantial understanding of the philological make-up of the corpus, not all of which was recorded in the manual. With almost all members of the original team still active at VARIENG, and a growing number of younger researchers experienced in XML, the decision was made to look into performing the conversion work in-house.

Conversion project in detail

The project started in earnest in late autumn 2010, when Terttu Nevalainen, the Director of VARIENG, asked the Planning Officer Jukka Tyrkkö to compile a team of experts to start looking into the challenge of carrying out a conversion into a modern encoding scheme. The conversion team came to comprise Tyrkkö, Arja Nurmi and the three founding members of the Digital Editions for Corpus Linguistics (DECL) project, Ville Marttila, Alpo Honkapohja, and Samuli Kaislaniemi.

The inaugural meeting of the team took place on November 11, 2010. With Tyrkkö acting as project coordinator, each member of the project took responsibility for a specific area of encoding. Nurmi tackled the encoding of text-internal metadata, encoded in the original corpus by means of various parenthetical codes; Marttila looked into the text parameters encoded in the file header; Kaislaniemi looked into special characters unavailable in the original ASCII-based corpus; and Honkapohja sorted out the rest of the text-internal issues such as those to do with drama and verse texts.

In roughly a month, the team members put together the fruits of their labour and it was generally agreed that the project was feasible. All text-internal features could be accounted for using a scheme compliant with the TEI P5 Guidelines, and the parameter information could also be converted without major problems. Meetings were held with members of the original team, in particular Nevalainen, Matti Rissanen, and Matti Kilpiö, to get to the bottom of certain problematic parameters and their precise meanings in the corpus. These lively meetings not only provided answers to many of the questions, but also brought forth a number of ideas for future development. The important decision was made to consider the current conversion project Phase I, and to save the improvements for Phase II, to be carried out at a later time. While Phase I would faithfully reproduce the features and information encoded in the original corpus, Phase II would include improvements and further developments. The only changes to be made to the corpus in Phase I were the correction of known errata, to be done in such a fashion that the original form would also be preserved and, if necessary, recovered. The team considered it a matter of utmost importance that any research carried out using the new XML edition be fully compatible with earlier work done using the original corpus.
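The principle of correcting errata while keeping the original form recoverable can be illustrated with a small sketch. TEI provides the choice element, which can pair a sic (original reading) with a corr (corrected reading); whether the edition records its corrections in exactly this way is an assumption here, and the function names and the sample word are purely hypothetical.

```python
import xml.etree.ElementTree as ET


def make_correction(original: str, corrected: str) -> ET.Element:
    """Wrap an erratum so that both readings survive in the markup."""
    choice = ET.Element("choice")
    ET.SubElement(choice, "sic").text = original    # reading in the old corpus
    ET.SubElement(choice, "corr").text = corrected  # corrected reading
    return choice


def reading(choice: ET.Element, prefer: str = "corr") -> str:
    """Return either the corrected ('corr') or the original ('sic') reading."""
    return choice.find(prefer).text


# Hypothetical erratum, not taken from the corpus.
c = make_correction("teh", "the")
print(ET.tostring(c, encoding="unicode"))
print(reading(c, "sic"))
```

A user who needs results comparable with earlier work on the original corpus would select the sic reading; everyone else would read the corr reading.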

The decision was made that the best approach would be to start with an automated conversion, and then to correct the errors by a combination of manual proofreading and tweaks of the conversion script. Marttila took responsibility for the scripting, and over the Christmas holidays came up with a script that produced a very promising first run.

It soon became apparent, however, that there would be several challenges before the conversion work was finished. A number of discrepancies were discovered between what the original manual stated and the reality, particularly when it came to topics such as the encoding of text structure and representation of samples. It was becoming clear that considerable amounts of manual work could not be avoided. Many of the challenges came about as a result of the hierarchical principles of TEI XML clashing with the way the corpus was encoded or constructed. In such cases it was not sufficient to simply replace existing codes with new XML-compliant ones, but rather it became necessary to add encoding where none existed before. In the majority of cases, this had to be done manually.
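To give a rough idea of what adding hierarchical encoding to flat text involves, the following sketch groups a flat sequence of verse lines into TEI lg (line group) elements containing l (line) elements. It is not the project's conversion script: the input convention (a blank line separating groups) and the function name are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET


def group_lines(flat_lines):
    """Nest flat verse lines into <lg>/<l> containers under a <body>."""
    body = ET.Element("body")
    group = None
    for raw in flat_lines:
        text = raw.strip()
        if not text:
            group = None  # a blank line closes the current line group
            continue
        if group is None:
            group = ET.SubElement(body, "lg")  # open a new line group
        ET.SubElement(group, "l").text = text
    return body


# Hypothetical input, not taken from the corpus.
flat = ["first line", "second line", "", "third line"]
body = group_lines(flat)
print(ET.tostring(body, encoding="unicode"))
```

Even in this toy case the container boundaries have to be inferred from something; where the source gives no reliable cue, the boundaries must be decided by a human reader, which is why so much of this work was manual.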

An important part of the conversion involved adding transparency to the coded header parameters in the original corpus. The new edition spells out parameters and their values, and the manual provides more information about their specific meanings. Accomplishing this required going back to the original compilers and asking questions about principles followed in the compilation of the corpus.
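The idea of spelling out coded parameters can be sketched as follows. The sketch assumes header lines of the form <CODE value>, with a single-letter code, and uses a small lookup table whose labels are illustrative rather than quoted from the manual; the sample values are invented.

```python
import re

# Assumed shape of a coded header line: "<X some value>".
HEADER_LINE = re.compile(r"^<(?P<code>[A-Z]) (?P<value>[^>]*)>$")

# Illustrative labels only; the authoritative list is in the corpus manual.
LABELS = {"N": "text name", "A": "author", "O": "date of original"}


def expand_header(lines):
    """Map coded header lines to human-readable parameter names."""
    expanded = {}
    for line in lines:
        match = HEADER_LINE.match(line.strip())
        if match is None:
            continue  # not a header line; skip it
        code = match.group("code")
        label = LABELS.get(code, f"unknown parameter {code}")
        expanded[label] = match.group("value").strip()
    return expanded


# Hypothetical header, not taken from the corpus.
sample = ["<N EXAMPLE TEXT>", "<A DOE JANE>", "<O 1500-1570>", "not a header line"]
print(expand_header(sample))
```

The lookup table is the part that cannot be automated: deciding what each code and value actually means is exactly what required consulting the original compilers.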

Some additional work was also done on localisation of the Middle English texts with the aid of the Linguistic Atlas of Early Middle English (LAEME). Thirteen texts from the ME period had previously been localised using the Linguistic Atlas of Late Mediaeval English (LALME). Honkapohja read through the editions and their introductions, and in the case of editions based on single manuscripts, localised them using LAEME. His efforts identified 14 source texts from LAEME, while 8 of the manuscripts in HC appear not to be listed there. The great majority of Helsinki Corpus ME texts are based on a single manuscript.

A new member joined the XML team in early 2011, when Henri Kauhanen’s job description as VARIENG’s web editor was expanded to include work on the Helsinki Corpus. Together with Marttila, Kauhanen performed various automated and semi-automated conversion tasks to do with bibliographical data and the annotation of line groups and speaker turns missing in the original corpus. Although sufficient, bibliographic information in the original corpus was not recorded in a structured form, which meant that additional annotation, based on a careful analysis of the structure of the original entries, needed to be added to make the information machine-readable. A particular problem came about as a result of the annotation of correspondence samples, where a single text in the original corpus could contain more than one sample with partially overlapping header information.

One of the major undertakings in the new version was the correction of errata, collected over the preceding two decades and now filling two large cardboard folders. Kauhanen and Kilpiö, a member of the OE section of the original Helsinki Corpus team, went through the folders one note after another, often having to do further research to ascertain a correct letter form or to evaluate the veracity of editorial comments. All in all, Kauhanen and Kilpiö ended up making 558 corrections to the corpus.

In the case of one text, the Durham Ritual, Kilpiö and Kauhanen found the extent of editorial intervention in the original source edition to be so substantial that making the required number of corrections was not feasible. Luckily, the Dictionary of Old English project in Toronto had produced a new digital edition, in TEI-compliant XML no less, and the Dictionary’s editor Antonette diPaolo Healey graciously provided us with the new text. Although the principle behind Phase I of the conversion project remains that the XML edition matches the old Helsinki Corpus, in this one case it was deemed acceptable to replace the original text with a far more accurate edition.

The release of the new TEI XML edition naturally required more than simply the corpus. For one thing, we wanted to document the conversion project in order to preserve information about who did what and how. Accordingly, the XML file makes use of responsibility statements and a change log recording every stage of the conversion and the names of the editors responsible. Indeed, one of the many lessons we have learned over the more than two decades of corpus compilation work in Helsinki is the value of such historiography of compilation projects, and we would urge other projects to adopt the habit of keeping records not only of decisions made, but also of who in actual fact performs a given piece of work. This not only makes it easier to go back to the right person when more information is needed, but also affords credit to those who produce corpus resources. The particular application of the TEI XML schema for this corpus and its relationship to the original version of the corpus are documented in the manual written by Marttila.
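For projects wishing to adopt the same habit, the following sketch shows how responsibility statements and change-log entries might be generated with the TEI elements respStmt and revisionDesc/change. The element names come from the TEI Guidelines; the helper functions, the editor identifiers, the dates and the descriptions are all hypothetical and do not reproduce the actual header of the edition.

```python
import xml.etree.ElementTree as ET


def resp_stmt(role: str, name: str) -> ET.Element:
    """Build a responsibility statement: who carried out which task."""
    stmt = ET.Element("respStmt")
    ET.SubElement(stmt, "resp").text = role
    ET.SubElement(stmt, "name").text = name
    return stmt


def add_change(revision_desc: ET.Element, when: str, who: str, note: str) -> ET.Element:
    """Append one dated, attributed entry to the change log."""
    change = ET.SubElement(revision_desc, "change", {"when": when, "who": who})
    change.text = note
    return change


# Placeholder people, dates and notes, invented for illustration.
stmt = resp_stmt("Conversion scripting", "Editor A")
revision_desc = ET.Element("revisionDesc")
add_change(revision_desc, "2011-01-15", "#editorA", "Automated conversion, first run")
add_change(revision_desc, "2011-03-01", "#editorB", "Manual correction of errata")

print(ET.tostring(stmt, encoding="unicode"))
print(ET.tostring(revision_desc, encoding="unicode"))
```

Recording the person alongside each change is the point of the exercise: it makes it possible to trace any stage of the work back to whoever actually carried it out.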

It was also decided that the corpus would be distributed with a browsing tool. The browser, developed by Kauhanen, was originally envisioned as a simple reader, but eventually grew to the point where it incorporates a search function and fairly extensive documentation on the corpus and the XML conversion process. The browser gives the user access to the original Helsinki Corpus files, the new XML version, a slightly stylised view of the XML file, and an HTML version for easy reading.

The XML conversion project taught the team many things. Chief among them was the value of establishing firm and well-documented guidelines for annotation. Automatic conversions from one encoding scheme to another are relatively straightforward when the source document is strictly adherent to a set of guidelines, but a veritable nightmare if undocumented exceptions have been made over the course of the compilation work. Similarly, the principles governing descriptive parameters ought to be defined clearly and unambiguously. While the original compilers may well share common ground when it comes to what the individual parameters denote or the rules by which samples have been assigned to a given descriptive class, over the subsequent years or decades such information is easily corrupted or lost. In the case of the Helsinki Corpus the extensive manual allowed us largely to avoid any major issues in this regard, but the occasional challenges that did surface already gave the team an idea of how things might have been had the documentation been less informative.

Project team

Helsinki Corpus XML conversion team chair

Terttu Nevalainen

Design

Alpo Honkapohja
Samuli Kaislaniemi
Henri Kauhanen
Matti Kilpiö
Ville Marttila
Terttu Nevalainen
Arja Nurmi
Matti Rissanen
Jukka Tyrkkö

Implementation

Ville Marttila
Henri Kauhanen

XML annotation design

Alpo Honkapohja
Samuli Kaislaniemi
Ville Marttila
Arja Nurmi
Jukka Tyrkkö

Conversion script in Python

Ville Marttila

XSLT authoring

Henri Kauhanen
Ville Marttila

Manual XML editing

Henri Kauhanen
Ville Marttila

Errata corrections

Henri Kauhanen
Matti Kilpiö

New localisations from LAEME

Alpo Honkapohja

Documentation of the TEI XML version

Ville Marttila

Desktop user interface for the XML version

Henri Kauhanen