Towards multimedia in corpus studies: introduction

Päivi Pahta, Jukka Tyrkkö, Terttu Nevalainen & Irma Taavitsainen
Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki

The nine articles in this online volume, Towards Multimedia in Corpus Studies, are based on papers presented at the 27th annual conference of the International Computer Archive for Modern and Medieval English (ICAME). The conference was hosted by the Research Unit for Variation, Contacts and Change in English (VARIENG) at the Department of English, University of Helsinki, and took place on the small island of Hanasaari just outside Helsinki on May 24-28, 2006. The selection in this volume focuses on new corpora, new methods and tools, and new empirical findings in corpus linguistics that lend themselves naturally to multimodal hypertext presentation. This volume is the second in the new e-publication series launched by VARIENG in 2007, Studies in Variation, Contacts and Change in English. The first volume of the series, Annotating Variation and Change, edited by Anneli Meurman-Solin and Arja Nurmi, was published on December 19, 2007. It contains a set of papers from the pre-conference workshop, which focused on the theme of corpus annotation. In addition to these two collections, a third volume, The Dynamics of Linguistic Variation, containing a selection of papers from the main conference, will be published in traditional book format in 2008.

Like all ICAME conferences, ICAME 27 was a rewarding and thought-provoking event, with presentations representing a wide cross-section of the latest research in corpus linguistics. Inspired by the research foci of the host unit, the themes of the conference were variation, contact and change in English. In addition to papers presenting new corpus linguistic research, the conference participants had a chance to hear presentations unveiling exciting new developments in corpus tools, statistical methods, and the emerging fields of web and multimedia corpora.

New developments in corpus-linguistic software have integrated an ever-increasing range of functions into a single tool. Methodological advances and increasingly sophisticated automated processes for tagging and parsing open up new avenues and research questions that would hitherto have been unthinkable to tackle, given the sheer volume of labour that more traditional methods would require. Advanced statistical methods have become more readily available and user-friendly. The compilation of corpora from newly available resources featured in several papers, reflecting the exponential increase of material available in digital format from sources such as the Internet. These papers discussed the representativeness and composition of corpora and the various methods that are needed when the volume of material becomes too vast for human compilers to prepare.

ICAME conferences are always much more than simply a venue for presenting new research and new software. In the comfortable surroundings of the Hanasaari Nordic Cultural Center, the participants enjoyed the congenial atmosphere that ICAME is known for and that brings the same people together year after year. It is in this same spirit of collegial sharing of ideas that VARIENG wishes to promote open-access publishing. We believe that success in research comes about as a result of the free flow of ideas between colleagues and the fearless adoption of new technologies. Corpus-linguistic research seems to us particularly suitable for electronic publication, and we hope that this volume will serve as a stepping stone for the promotion of these ideals within our research community.

This is the first time in the history of ICAME that a collection of papers presented at the annual conference is being published in a multimedia open-access forum. The advantages of multimedia presentation, illustrated in several ways in these spearhead studies, include the capacity to use full colour in illustrations and charts, the option of including large amounts of data in the final product, and the ability to hyperlink directly from references to resources available online. The ability to present raw data in formats such as online Excel tables promotes the scientific goals of retrievability and objectivity, and marks a step forward in this respect. The zest with which the authors of the present volume took to the task bodes well for the future of multimedia publishing in the humanities. For its part, VARIENG is grateful for the opportunity to launch the new electronic series with two volumes of excellent contributions by some of the leading scholars in the field.

The benefits of multimedia publishing for the presentation of evidence are obvious in Sue Blackwell's article, which examines variation and its possible causes in the speech of mothers to their young children. The material comes from three corpora in the CHILDES database, including speech directed to two groups of normally developing children, one acquiring Dutch (Groningen corpus) and the other English (Manchester corpus), and to two groups of English-speaking children in the USA, one with Down syndrome and one with autism (Flusberg corpus). The study concentrates on pronoun usage in child-directed speech. The detailed analysis of the evidence, accessible to the readers in hyperlinked spreadsheets, reveals interesting and unexpected patterns across the data, and makes it possible to identify strategies that may be risky for the long-term language development of language-impaired children. The inclusion of primary research data in an easily accessible format adds a completely new level of verifiability to scholarly publishing.

In a similar vein, Stefan Th. Gries and Caroline V. David investigate variation in the use of the near-synonymous hedging expressions "kind of" and "sort of" in the British National Corpus World Edition. The study emphasizes the necessity of analyzing the same set of corpus data on multiple levels of corpus organization. The multilevel statistical study of variation patterns in the use of the two hedges draws on distinctive collocate/collexeme analysis and multidimensional contingency tables, and provides evidence for a set of language-external and language-internal preference patterns governing the choice of the items. These include previously unnoticed register- or text-type-specific trends, part-of-speech-specific distributions, and semantic preferences.

A new corpus-analytic approach is proposed by Antoinette Renouf and Jayeeta Banerjee, who advocate the notion of "repulsion". This force, the authors claim, operates in the opposite way to collocation, i.e. it is a tendency for certain pairs of words not to occur together. The paper describes and discusses the methods that can be used to establish how repulsion operates in the organization of text and to measure it; the ultimate aim of the research into this phenomenon is to develop an objective "lexical repulsion" measure, capable of providing insights into text creation which will be useful in lexicology, language pedagogy and NLP. The study is based on 800 million words of journalistic text from the Independent and Guardian newspapers from 1989 to 2006.

The opportunities afforded by electronic multimedia resources are also shaping research in historical corpus linguistics. Thomas Kohnen gives an overview of the design and development of English diachronic corpora to date. Kohnen's critical state-of-the-art survey discusses a number of computerized historical datasets, including the pioneering multi-genre and multi-purpose Helsinki Corpus, several later, more focused single-genre corpora, full-text databases and dictionary corpora, and the Corpus of English Religious Prose, currently being compiled at the University of Cologne under Kohnen's direction. Looking ahead, his desiderata for future historical corpus design include the creation of new focused corpora for various domains and periods and the implementation of effective links between existing and planned corpora, facilitating the gradual creation of a general mega-corpus covering the whole history of the English language.

The advantages of the electronic medium for combining corpus linguistics with sound philological methods are amply illustrated in the article by Merja Kytö, Peter Grund and Terry Walker, who emphasize the need for accurate, large-scale electronic databases compiled directly from original manuscripts as a basis for historical linguistics. They provide support for this claim by presenting their ongoing work on an electronic edition of English witness depositions from a variety of regions across England from the period 1560-1760. The edition will also serve as a corpus, facilitating advanced automated searches. They also demonstrate the value of this new corpus of depositions to research on regional variation in Early Modern English; for example, the corpus provides empirical evidence of the role of geography in the varying use of third-person pronouns and past tense forms of "be" during a period which has often been neglected in studies on historical dialectology.

Manfred Markus introduces another new tool for studying regional variation from a historical perspective. Inspired by the new avenues for research offered by electronic dictionaries, Markus reports on the first phases of his project on Spoken English in Early Dialects (SPEED). The project investigates the possibilities of and limitations on computerizing the regional data collected in the late nineteenth century, culminating in Joseph Wright's English Dialect Dictionary (1898-1905). In this volume, Markus describes the organization and structure of the digital databank version of Wright's dictionary, which will facilitate research on dialectology, the historical linguistics of spoken English, and historical lexicology/phraseology.

The new Chemnitz Corpus of Specialised and Popular Academic English (SPACE) is presented by Josef Schmied. SPACE is a parallel corpus, consisting of English academic texts from the 2000s and representing various academic disciplines and audiences. The texts are presented in pairs, each pair comprising a scholarly article targeted at professionals and a derived version where roughly the same content is presented for a general audience. Among other things, the new corpus makes it possible to compare syntactic, semantic and lexical complexity at different levels of audience design. Schmied's contribution introduces the rationale behind the SPACE Corpus, its context and set-up, and uses several small case studies to illustrate the usefulness of the corpus in analyzing complex phenomena across texts and domains.

Andrew Kehoe and Matt Gee discuss the first phases in the development of the WebCorp Linguist's Search Engine, a further elaboration of the WebCorp system developed at the Research and Development Unit for English Studies at Birmingham City University. The new tool, still under development, promises to be a major step towards taming the Internet for systematic corpus-type analysis. It enables users to search the web as a vast corpus without the limitations of the commercial search engines, on which WebCorp relies, and to build web corpora of known size and composition. In this paper, the authors provide a detailed account of the nature of text on the web, including the HTML, PDF and MS Word formats and their processing for corpus-linguistic analysis.

Web data is also the focus of Vincent B.Y. Ooi, Peter K.W. Tan and Andy K.L. Chiang, who examine the new online genre of weblogs in Singapore English. The material consists of personal blogs written by Singaporean teenagers and undergraduates, whose linguistic styles the authors analyze using Wmatrix, an integrated corpus linguistic tool developed by Paul Rayson. On the basis of word frequency profiles, lexico-grammatical patterning, part-of-speech annotation and semantic content analysis of the two datasets, Ooi, Tan and Chiang are able to pinpoint creative linguistic patterns that can lead to a deeper understanding of the various cultural online identities of the different social groups at play. The study also points out some directions for the further development of tools to handle computer-mediated patterning and the use of global varieties of English in web communication.

Acknowledgements

We would like to thank the authors of the articles for accepting our proposal to take advantage of the opportunities provided by electronic publishing and for developing the idea in their contributions. Thanks are due to the peer reviewers of the articles, whose contribution to quality in scholarly publishing is always anonymous but invaluable. We are grateful to the conference secretary, Minna Korhonen, for handling much of the correspondence between the editors and authors. We would especially like to thank our editorial assistant Tanja Säily for the painstaking work she did in converting the articles submitted into HTML, not to mention the onerous task of making all the charts, tables and images conform to our style sheet. She also went beyond the call of duty in spotting typos and correcting inconsistencies that others had missed. We would also like to extend our gratitude to our VARIENG colleagues who helped in organizing the conference, and to the participants, who are always the heart of the conference. For financial support we are grateful to the Academy of Finland, the City of Espoo, and the University of Helsinki and its Department of English. And, of course, our thanks go to Lordi, for the unforgettable Hard Rock Hallelujah concert on the Market Square in Helsinki, which the ICAMErs had a chance to attend along with 90,000 other fans.