Introduction

Paul Rayson, University Centre for Computer Corpus Research on Language, Lancaster University
Sebastian Hoffmann, Department of English Studies, University of Trier
Geoffrey Leech, Department of Linguistics and English Language, Lancaster University

This online volume presents a selection of papers from the landmark 30th meeting of the International Computer Archive of Modern and Medieval English (ICAME). The conference was jointly organised by the University of Lancaster and the University of Central Lancashire and held in the Lancaster House Hotel, Lancaster, from 27 to 31 May 2009. At the 30th anniversary conference, participants looked back on the extraordinary growth and progress made in corpus linguistics over the last three decades, and looked ahead to future developments.

We received a significant number of submissions for the proceedings of ICAME30 and so decided to split them into two separate volumes. This online volume focuses principally on methodological and historical dimensions of corpus linguistics while a further volume, appearing in the Language and Computers series published by Rodopi, comprises research on Present-day English and very recent change in the 20th century.

Andersen uses a large, continually updated corpus of Norwegian newspapers to investigate the potential – but also the limitations – of the automated detection of Anglicisms. He discusses the challenges posed by various types of neologisms (e.g. through their level of morphological integration or because of their status as loan translations with little or no formal resemblance to their original English sources) and presents thoughtful insights on the degree to which these challenges can be tackled via automated methods. While full automation is clearly beyond the capabilities of currently available tools, Andersen's paper convincingly demonstrates that corpus-linguistic methodology can fruitfully complement the more manual – and hence time-consuming – procedures employed in traditional lexicography.

Brekke's paper extends the existing methodology for the automatic extraction of terminological units from parallel corpora to the investigation of comparable texts. On the basis of data from the economic-administrative domain, he demonstrates how the introduction of fuzzy matching of candidate terms selected from comparable texts can reveal many more relevant related terms than would be found by a method that relies on direct matches only. In doing so, he moves the successful use of readily available – and much more extensive – repositories of comparable texts a step closer to the reach of the research community.

Like Brekke, Damascelli is interested in features of specialised text types. In her case, however, the focus is on the application of corpus-linguistic methodology to the area of language teaching. In this context, Damascelli reports on progress made in creating an e-learning platform aimed at supporting students of the degree course in “Social Services” at the University of Turin in attaining the required language competence in English at a minimum of level B1 of the Common European Framework of Reference for Languages (CEFR). She provides a detailed description of the system, which is built around a specialised corpus of texts (3 million words) relating to the domain of social services.

Kehoe and Gee's paper deals with the topic of social (or collaborative) tagging, i.e. the keyword classifications provided by users of websites rather than their creators through such social bookmarking services as Delicious. The authors' aim is to test to what extent social tagging can offer a new perspective on the ‘aboutness’ of texts by comparing the tags provided by Delicious users with information obtained with the help of traditional corpus-linguistic methods such as keyword analyses. Among other aspects, the authors see their paper as a contribution towards the development of more adequate tagging practices, e.g. supported by automated evaluations of the usefulness of tags that are submitted by users.

The paper by Laitinen reports on a corpus compilation project underway at the Research Unit for Variation, Contacts and Change in English (VARIENG), which aims to document and investigate the use of English in Finland (covering the period from 2005 to 2015), thereby contributing to the study of English as a lingua franca in an increasingly globalised world. Texts are carefully selected to ensure that they are genuine – i.e. unedited – samples of language use (e.g. blogs) by non-native speakers of English who live in Finland. As far as possible, socio-demographic information about the authors of the texts is also included in the corpus. In the second part of the paper, Laitinen presents three case studies investigating morphological and grammatical variability on the basis of the part of the corpus that has been completed so far.

Lehmann and Schneider present an extension to a purely window-based approach (i.e. n words to the left and/or right of a selected node word) to detecting aspects of fixedness by exploring co-occurrences of verbs and attached prepositional phrases in a large syntactically parsed corpus (approx. 240 million words). They employ a combination of statistical significance (measured by t-scores) and a measure of surprise (O/E, the ratio of observed to expected frequency) to evaluate and rank possible candidates retrieved from the data. Interestingly, even though the typical inaccuracies of syntactically parsed data (approx. 80% precision) could be expected to result in the retrieval of a considerable number of false positives, these are effectively filtered out from the ranked lists produced by the authors.
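
As a rough illustration of the two measures named above, the following Python sketch computes a t-score and an O/E ratio for a few verb and preposition pairs. The counts, the corpus size and all function names are invented for illustration. The expected frequency is estimated under a simple independence assumption, and sorting by t-score is only one possible way of combining the two measures; none of these details is taken from Lehmann and Schneider's paper.

```python
import math


def expected_frequency(f_verb, f_prep, n_tokens):
    """Expected co-occurrence count if verb and preposition were independent."""
    return f_verb * f_prep / n_tokens


def t_score(observed, expected):
    """Common corpus-linguistic t-score approximation: (O - E) / sqrt(O)."""
    return (observed - expected) / math.sqrt(observed)


def oe_ratio(observed, expected):
    """Observed over expected frequency, a simple measure of surprise."""
    return observed / expected


def rank_candidates(counts, n_tokens):
    """Return (pair, t-score, O/E) rows sorted by descending t-score."""
    rows = []
    for pair, (observed, f_verb, f_prep) in counts.items():
        expected = expected_frequency(f_verb, f_prep, n_tokens)
        rows.append((pair, t_score(observed, expected), oe_ratio(observed, expected)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical counts: (observed co-occurrences, verb frequency, preposition frequency)
N_TOKENS = 1_000_000
COUNTS = {
    ("rely", "on"): (900, 1_000, 50_000),
    ("go", "to"): (2_000, 20_000, 80_000),
    ("see", "in"): (300, 10_000, 60_000),
}

ranked = rank_candidates(COUNTS, N_TOKENS)
for pair, t, oe in ranked:
    print(f"{pair[0]} + {pair[1]}: t = {t:.2f}, O/E = {oe:.2f}")
```

In this toy data set the strongly associated pair comes out on top on both measures, while a pair that co-occurs less often than chance would predict receives a negative t-score and an O/E value below 1.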

Meijs and Blackwell's contribution combines a critical discourse analytical approach with corpus linguistic methodology by making use of the Bank of English (BoE) and WebCorp to investigate how words like anti-semitism as well as related terms such as zionism have changed their semantics over time. Their analysis is complemented by an overview of longer-term changes as represented in dictionaries in the 20th and 21st centuries. With the help of collocation analyses, the authors trace differential developments in the US and Britain, finding explanations for such differences in political – and potentially also societal – developments.

Finally, Mudraya and Rayson exploit a corpus of web-derived data to gain insights into the linguistic nature of online dating ads. In particular, their attention is focused on ads aimed at the over-50s – an age group that has not previously been included in studies of the genre. The authors base their discussion on a quantitative analysis of the semantic categories represented in the ads, which were for this purpose automatically annotated with the help of Wmatrix and its semantic tagging component USAS. The findings largely confirm previous studies conducted in the areas of psychology and social anthropology; however, certain differences also emerge as a result of the types of categories implemented in USAS.

Of the four historical papers contained in this collection, Gardner's is the only one to focus exclusively on Middle English data. In her analysis of the Linguistic Atlas of Early Middle English, 1150–1325 (LAEME), she investigates aspects of productivity in word-formation, placing special emphasis on abstract formations involving suffixes such as -dom, -hood, -lak and -reden. Her careful analysis of the data reveals some developments (e.g. diatopic and dialectal differentiation) that had previously escaped scholarly attention.

In Hiltunen and Tyrkkö's contribution, the data analysed spans the periods of Middle English and Early Modern English, but the text type is much more restricted than in Gardner's case: their study exclusively focuses on medical writing. However, the findings are similar on at least one level in that Hiltunen and Tyrkkö, too, make the point that interesting tendencies can emerge from the data when sub-categories of genres are investigated in detail. Their object of investigation is the existential there construction and its frequencies as well as its phraseological characteristics. The authors interpret the diachronic changes observed in the light of general changes in the genre of scientific writing towards a more informational and less narrative style.

The last two papers of the volume to be briefly summarised here deal with the genre of religious writing from approximately 1500 onwards. In their analysis of Early Modern English data, Kohnen, Rütten and Marcoe put to the test the commonly held view that religious prose represents a conservative register. For this purpose, they trace the development of several well-known features of language change in their data (e.g. the replacement of ‘thou’ by ‘you’ and of ‘the which’ by ‘which’) and compare their findings to those reported in previous studies of these phenomena on the basis of more general data sources. When looked at as a whole, religious prose can indeed be seen to lag somewhat behind other genres, but again the picture is more complex than that: while much of the genre in fact appears to develop along the lines of other genres of the time, it is prayers in particular, as well as Bible passages quoted in other sub-genres of the data (e.g. catechisms or sermons), that seem to be largely responsible for the prototypically conservative nature of religious writing.

In her study of object dislocation in English hymns, finally, Gather breaks new ground by investigating a sub-genre of religious writing for which no previous linguistic findings are available. Her analysis reveals the high frequency of the phenomenon under consideration in the data and traces some of the formal changes that can be observed over the four centuries covered in the study (1500–1900). She also describes how the use of object dislocation must be interpreted against the backdrop of both syntactic and prosodic factors (i.e. metre and rhyme).

In conclusion, we believe that this volume represents a promising range of new developments in corpus linguistics, both in the methodologies employed and in the types of data investigated, including little-studied genres.

Lancaster and Trier, May 2011.