Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources

Jukka Tyrkkö, Matti Kilpiö, Terttu Nevalainen & Matti Rissanen
Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki

Over the last twenty years, corpora have become invaluable tools in the study of language history. By making it quick and relatively easy to find specific linguistic features, the corpora of today provide researchers with both qualitative and quantitative data in volumes that would have been unfathomable only a decade ago. Along the way, corpora have ushered in new methodological paradigms such as statistical and corpus-driven approaches, which help us pinpoint important moments of language change and make us more aware than ever before of synchronic variation across social strata, regions, text types and genres. As the years pass and new methods are developed, the outposts of historical corpus linguistics move on to ever more exciting new frontiers.

The Helsinki Corpus Festival was held at the University of Helsinki at the end of September 2011. The venue was the House of Sciences (Tieteiden talo), in the centre of our capital. The conference was organized by the Research Unit for Variation, Contacts and Change in English (VARIENG) in celebration of the 20th anniversary of the Helsinki Corpus, the first historical corpus of English to have attempted a comprehensive representation of the language over a long time span. To our delight, colleagues and friends from all over the world joined us to mark the occasion and to discuss the state of the art today, twenty years on.

The subtitle of the Festival, ‘the past, present, and future of English historical corpora’, was more than justified by the wide scope of the presentations. Over four days of extraordinarily pleasant weather, papers were presented representing the entire field of historical corpus linguistics. Some dealt with specific research questions or theoretical issues, while others demonstrated new tools and resources. Plenary papers by Dawn Archer, Andreas Jucker, Merja Kytö and Geoffrey Leech identified major issues and challenges in the discipline and gave us all plenty of food for thought.

The history of English covers well over a thousand years, and the papers read at the Festival demonstrated that research is being done on that entire timeline. A strong selection of papers was read on Old English, Anglo-Latin and Middle English. Research on these periods has long traditions in Helsinki, and the thematic session showed that corpora and corpus methods have an important role to play in keeping these research topics alive and well. Issues in Early and Late Modern English were likewise covered by many papers, with the latter showing a particular revival of interest. Much of the innovative new research presented at the conference comes from early-career scholars. Some use established corpora like the Helsinki Corpus or ARCHER, while others have compiled new resources specific to particular research questions.

Naturally, the papers could be grouped together in more thematic ways as well. With one of the special sessions devoted to the theme, areal and regional variation came across as one of the growth areas, resonating strongly with research carried out today on Present-Day English and charting the historical developments of areal varieties of English around the world. Focusing on earlier periods, researchers were also engaged in developing innovative methods for the study of Middle English dialectology. Another trend was the creation of new historical data sources, such as dictionary and grammar databases, to supplement the information provided by corpora and to enable new research questions, for example, about the relation between language change and standardization.

The challenges of corpus annotation were raised throughout the conference, with topics ranging from theoretical issues concerning the value of annotating corpora to notes on how best to encode specific linguistic or paratextual features. The need for, and challenges of, spelling standardization were raised in several papers, as was the call for tools better equipped to deal with the many features particular to historical texts. Several contributors stressed the importance of including typographic and layout features in corpora, as well as the need for universal standards. Methodological issues to do with parsing historical texts were also addressed, as were the various statistical tests used in evaluating the significance of research findings.

We, the editors, would like to express our gratitude to all the participants of the Helsinki Corpus Festival, to the organizing committee, to our student helpers and, in particular, to the authors who contributed to this volume and to the referees who reviewed the papers with great expertise and dedication. The articles in this edited collection are in many cases substantially revised and expanded in comparison to the papers presented at the conference. The schedule was tight and the anonymous refereeing was comprehensive, but throughout the process we were delighted by the positive spirit of cooperation and good will of all those involved. We are particularly grateful to Joe McVeigh, our web editor, for his hard work and positive attitude, and for coming up with many good ideas on how to make the best use of the many opportunities of online publishing.

The Helsinki Corpus of English Texts

The idea of creating the Helsinki Corpus of English Texts (HC) saw daylight in the first half of the 1980s. The time was particularly favourable for an ambitious project of this kind, as three factors coincided. First, the University of Helsinki decided to start a generous new funding programme supporting innovative research projects in all fields of research. Second, computers (even “portable” computers weighing twenty kilos or so) were rapidly becoming popular in the academic world in Finland. Third, the English Department had scholars specializing in the history of English and, most importantly, a number of exceptionally talented post-graduate students who were writing their dissertations on historical topics and were eager to apply corpus methodology in their research.

The first source of funding for the project was the Finnish Cultural Foundation, and the grant awarded made it possible to employ Merja Kytö as the project secretary. Her role in the compilation and completion of the corpus was most valuable. Soon after, the University and the Academy of Finland gave substantial financial support to the project. As a result of this funding and the enthusiastic work of the twenty or so project members, HC was completed and released in 1991.

The project team was divided into three groups, concentrating on Old, Middle, and Early Modern English, respectively. The most active members of the teams were Leena Kahlas-Tarkka and Matti Kilpiö for Old English, Saara Nevanlinna, Irma Taavitsainen and Päivi Pahta for Middle English and Terttu Nevalainen and Helena Raumolin-Brunberg for Early Modern English. The project leader was Matti Rissanen.

Two features made HC different from the very few earlier historical corpora. One was that it was a long-diachrony corpus covering a millennium of English texts, from the eighth to the early eighteenth century. The other, and even more innovative, was that it was structured, although rather loosely, not only chronologically but also according to the sociolinguistic, dialectal and genre-based characteristics of the texts. This structuring was inspired by the increasing interest in the variationist approach to historical linguistics in the late 1960s and the 1970s (see, e.g., Weinreich et al. 1968; Samuels 1975; Romaine 1982).

The text samples of HC consist of both complete short texts and extracts from longer texts. All samples are equipped with a set of parameters giving information on, among other things, the date, dialect and genre of the text. Sociolinguistic information (age, gender, social status, etc.) on the author and, where appropriate, the receiver is included in the description of Late Middle and Early Modern English texts. Flexibility was one of the leading principles in the structuring of the corpus. The sub-periods are roughly one century long, but their length varies where historical events marking notable changes in language development were taken into account. The labelling of the genre, dialect and sociolinguistic factors was based on extralinguistic factors and not on the internal analysis of the text, to avoid circular reasoning. For detailed information on the contents and structure of HC, see the Corpus Resource Database (CoRD), Rissanen et al. (1993) and Kytö (1996).
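
As a rough illustration of how such parameters support variationist queries, the sketch below models a sample description as a small record and selects samples by parameter values. It is not a description of the actual HC coding scheme: the field names, the period labels ("P3", "P4") and all sample values are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleParameters:
    """Hypothetical description of one text sample (illustrative fields only)."""
    sample_id: str
    period: str                        # invented sub-period label
    dialect: Optional[str]             # None where no dialect is recorded
    genre: str
    author_sex: Optional[str] = None   # sociolinguistic fields may be absent
    author_rank: Optional[str] = None


def select(samples, **criteria):
    """Return the samples whose parameters equal every given criterion."""
    return [
        s for s in samples
        if all(getattr(s, key) == value for key, value in criteria.items())
    ]


# Invented sample descriptions, used only to exercise the query.
samples = [
    SampleParameters("s1", "P3", "East Midland", "sermon"),
    SampleParameters("s2", "P4", None, "private letter", "female", "gentry"),
    SampleParameters("s3", "P4", None, "sermon", "male", "clergy"),
]

later_sermons = select(samples, period="P4", genre="sermon")
all_sermons = select(samples, genre="sermon")
```

Because the parameters are assigned on extralinguistic grounds, a selection of this kind can be made before any linguistic feature is counted, which is what keeps the comparison across periods or genres from becoming circular.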

Now, after two decades, HC is still used for research on the history of English all over the world. But what is more important, and rewarding to its compilers, is that it has inspired and triggered a number of corpus projects, both in Helsinki and elsewhere, resulting in new, more focused and more sophisticated corpora, large and small. Some of the most outstanding are the Parsed Corpus of Early English Correspondence, the Corpus of Early English Medical Writing, the Helsinki Corpus of Older Scots, A Corpus of English Dialogues, the Innsbruck Corpus of Middle English Prose, the York-Toronto-Helsinki Parsed Corpus of Old English Prose, and the Penn-Helsinki family of annotated corpora. Information on these and other English corpora can be found in CoRD.

This flood of excellent corpora does not necessarily diminish the importance of HC as a “diagnostic” corpus for tracing trends and developments in the history of the English language. Its structure and parameter coding are an asset in the text-based study of change through variation, emphasizing the communicative aspect of language. It can be described as one of the “small and smart” historical corpora. It is important, however, that the results it gives are supplemented by results from more focused and specialized corpora and from much larger, “huge and handsome” corpora.

The Helsinki Corpus TEI XML edition

Over the twenty years since the Helsinki Corpus was released in 1991, the face of computing has undergone radical changes. The computers of today are not only much more efficient in storing and processing information, but there has also been a real sea-change when it comes to the need to ensure that data is stored in formats that translate from one system to another and can be searched, converted and transformed according to the needs of users all across the world. This need is nothing new to corpus linguists. Indeed, it is fair to say that one of the characteristic features of the discipline over the last two decades has been the ever-increasing number of different annotation standards and encoding systems. Although easily explained by the many different needs of researchers working in various fields, the lack of universal standards has also led to problems. Old encoding systems quickly become difficult to use, if not incomprehensible, and automatic processing of less-than-systematic standards can easily lead to misreadings and statistical errors. A big challenge for the future lies in the development of repositories and large databases, which by their nature often simplify and obfuscate carefully considered systems of metadata.

In 2010, a new team was assembled at the VARIENG research unit to update the encoding of the Helsinki Corpus. Consisting of both veterans of the original team (Terttu Nevalainen, Matti Rissanen, Matti Kilpiö, Anneli Meurman-Solin, Arja Nurmi) and PhD students (Ville Marttila, Henri Kauhanen, Jukka Tyrkkö, Samuli Kaislaniemi, Alpo Honkapohja), the team set to work with the aim of finalizing the new version by the Helsinki Corpus Festival in September 2011. The new version of the corpus was to be in XML and follow the universal TEI P5 standard, developed by the Text Encoding Initiative, an international consortium that develops standards for the representation of texts in machine-readable format. Dubbed the Helsinki Corpus TEI XML edition, the new version was to be completely faithful to the original version, merely updated to a well-documented and widely used modern encoding system. Predictably, challenges soon emerged. The structure of the Helsinki Corpus had to be partly re-envisioned to fit the requirements of TEI, and it was soon discovered that a fully automatic conversion would be impossible. Errata compiled over the last twenty years were integrated into the new version by Matti Kilpiö and Henri Kauhanen, and structure was added for metadata that was previously given as a single text field. Due in great part to the efforts of Ville Marttila and Henri Kauhanen, the new version did indeed see the light of day in time for the conference. The corpus even came with its own browser, developed by Kauhanen.

The team learned many valuable lessons during the process, some of them technical but many more fundamental. The need for precision in corpus compiling was emphasized time and again, as was the value of thorough and transparent documentation of compilation and annotation principles. Similarly, the philological understanding of old text types proved to be invaluable in helping the team avoid mistakes which would otherwise have easily occurred during automatic text processing. Despite the challenges, or perhaps because of them, the team was delighted to work on the corpus and now wants to encourage the compilers of other corpora to consider similar projects.

There is no doubt that the next two decades will bring about ever more exciting developments in computing and corpus linguistics. By adopting a universally known standard for the new Helsinki Corpus, the HC XML team wanted to ensure that the corpus would live on and remain a viable diagnostic resource for decades to come. For this to happen, we must make sure that when the time comes to update the Helsinki Corpus to reflect the standards of the new day, there will once again be a team dedicated to old texts and new technologies in equal measure.

References:

Kytö, Merja (comp.). 1996 (1991). Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts. (3rd ed.). Helsinki: Department of English, University of Helsinki.

Rissanen, Matti, Merja Kytö and Minna Palander-Collin (eds.). 1993. Early English in the Computer Age: Explorations through the Helsinki Corpus. Berlin and New York: Mouton de Gruyter.

Romaine, S. 1982. Sociohistorical Linguistics: Its Status and Methodology. Cambridge: Cambridge University Press.

Samuels, M. L. 1975. Linguistic Evolution, with Special Reference to English. Cambridge: Cambridge University Press.

Weinreich, U., W. Labov and M. Y. Herzog. 1968. ‘Empirical foundations for a theory of language change’, in: W. P. Lehmann and Y. Malkiel (eds.), Directions for Historical Linguistics: A Symposium. Austin, Texas: University of Texas Press, 95–195.