Aspects of corpus linguistics: compilation, annotation, analysis
Signe Oksefjell Ebeling, Jarle Ebeling, Hilde Hasselgård
University of Oslo
Introduction
The International Computer Archive of Modern (and Medieval) English (ICAME) organization was founded in Oslo in February 1977. [1] The fact that this first ICAME meeting took place in Oslo was due not least to the role played by Stig Johansson, University of Oslo, as co-ordinator of the initiative (see further Leech & Johansson 2009). When it was decided that the annual ICAME conference was to return to Oslo in 2011, we found it fitting to hold the conference in honour of Stig Johansson. In June 2011, around 200 delegates from all over the world gathered in Oslo for the 32nd ICAME conference. The event, entitled Trends and Traditions in English Corpus Linguistics, was co-organized by the University of Oslo, the Norwegian School of Economics and Business Administration and Uni Research AS.
In the call for papers prior to the conference, three of Stig Johansson’s main fields of interest – corpus development, contrastive analysis and English grammar – were selected as topics for papers, in addition to the general theme of trends and traditions in English corpus linguistics.
With regard to the field of corpus development, Stig Johansson pointed out on several occasions that “[w]e need bigger corpora, better corpora, corpora with a wider range of languages, and we need to learn to exploit the corpora in the best possible manner” (Johansson 2009: 37–38). This call for bigger and better corpora and corpus tools is reflected in Johansson’s own contribution to the field, particularly through his work on the Lancaster-Oslo/Bergen (LOB) Corpus, for which he secured funding and established cooperation with the Norwegian Computing Centre for the Humanities, both of which were crucial for the completion of the corpus. The connection between the LOB Corpus and ICAME is also evident; indeed, “the beginning of ICAME was intimately connected with the work on the LOB Corpus” (Leech & Johansson 2009: 12). Johansson’s concern that a corpus should be put together in a principled way is reflected in the sampling of the LOB Corpus: “[l]ike its American counterpart [the Brown Corpus], the British corpus is intended to be a representative sample of the texts printed in 1961. The texts were selected by stratified random sampling” (Johansson 1978). Similarly, other corpora compiled under his direction, most notably the English-Norwegian Parallel Corpus, are well planned in terms of representativeness and sampling, although admittedly “[i]t follows then that [e.g.] the LOB Corpus is not representative in a strict statistical sense. It is, however, an illusion to think that a million-word corpus of English texts selected randomly from the texts printed during a certain year can be an ideal corpus” (Hofland & Johansson 1982: 3). Johansson’s involvement as chair of the Text Encoding Initiative’s committee on text representation (1990–1993) further illustrates his belief that the compilation of corpora should follow specific guidelines and standards. [2]
Another area where Stig Johansson was a forerunner is contrastive analysis; he was instrumental in bringing about the renewed interest in contrastive analysis that we have witnessed over the past couple of decades, through his work on the English-Norwegian Parallel Corpus and the Oslo Multilingual Corpus. The following quote from his monograph Seeing through Multilingual Corpora (2007) bears witness to the great potential offered by corpora following Johansson’s bidirectional model:
They offer unique possibilities of comparison in that they combine translation corpora and comparable corpora within the same overall framework [...] We can reveal cross-linguistic correspondences, we can ‘see’ meanings, we can identify translation effects. The two types of corpora complement each other and at the same time serve as a means of control. (Johansson 2007: 304)
Moreover, on the basis of such corpora, Johansson contends that “[w]e can see how languages differ, what they share and – perhaps eventually – what characterises language in general” (Johansson 2007: 1).
Finally, Johansson’s interest in language – and more specifically the English language – was always the driving force behind his professional commitment, something that is well documented in his co-authorship of the Longman Grammar of Spoken and Written English (Biber et al. 1999) and of a grammar particularly aimed at Norwegian students of English entitled English Grammar: Theory and Use (Hasselgård et al. 1998 [2nd ed. 2012]).
Following the call for papers for the 32nd ICAME conference, we received a large number of proposals for full papers, work-in-progress reports and posters, many of which reported precisely on research in the three fields outlined above.
The conference also hosted three pre-conference workshops, kindly organised by (1) Karin Aijmer & Bengt Altenberg, (2) Merja Kytö & Irma Taavitsainen and (3) Ylva Berglund-Prytz & Martin Wynne. While the latter was organised as a debate for and against the motion “Language corpora are no longer necessary for linguistic research”, the former two included papers on contrastive analysis and historical linguistics, respectively.
The more than 150 papers presented at the conference have resulted in three separate publications in addition to the present one:
- Aijmer & Altenberg (eds), which is a volume in memory of Stig Johansson deriving from the workshop on contrastive analysis;
- Andersen & Bech (eds), which focuses on language variation in time, space and genre;
- Hasselgård, Ebeling & Ebeling (eds), in which the contributions are all devoted to lexical and phraseological issues.
As reflected in the title of this volume, each of the nine contributions focuses on one or more of the following aspects of corpus linguistics: compilation, annotation, analysis. While the contributors in the first part are mainly concerned with the analysis of contrastive/translation data, the contributors in the second part discuss compilation and annotation issues more widely, in addition to performing focused case studies.
In the first contribution, Egan explores how translation corpora may be exploited in the study of synonymy and polysemy. His starting point for the synonymy part of the study is the English verbs begin and start, while the English preposition at is the item under study in the part focusing on polysemy. Using the English-Norwegian Parallel Corpus, Egan shows how translations from English into Norwegian by several different translators serve as indicators of the degree to which begin and start can be said to be synonymous and in what ways at can be said to be polysemous. In this manner, he illustrates that translation corpora combine authentic data with the advantages of elicitation experiments in that “they contain the intuitive linguistic responses of competent language users [i.e. translators] to a series of linguistic prompts [i.e. the items under study]”. The translators’ choices, Egan claims, “point to differences in semantics of the expressions” both in terms of synonymy and polysemy. The underlying meaning of these three English items is seen to be brought into relief as a result of the insight gained from the translation data.
In her article on delexicalised constructions and nominalizations in English and Spanish, Labrador focuses on the cross-linguistic relationship between English delexical thing and Spanish lo-nominalizations. Her choice of topic was triggered by the observation that Spanish learners of English often go wrong in the use of delexical thing. Influence from their L1 will in many cases lead to infelicitous sequences such as “The amazing about this love story”, where English requires a noun, e.g. thing as head of the NP, while Spanish can resort to lo-nominalization: lo asombroso ‘the amazing’. Conversely, when Spanish students translate from English into Spanish, an overuse of cosa ‘thing’ in similar expressions is often observed. On the basis of both comparable and translation data, Labrador’s study reveals quite clearly that the two contrasted constructions operate as good cross-linguistic correspondences of each other, illustrating how non-congruent constructions in the two languages are used to express the same meaning. Moreover, she illustrates that although the delexical strategy is possible in Spanish, it is much more common in English. The fact that there seems to be a particular set of adjectives that enter into the constructions in the two languages suggests that they have phraseological tendencies that need to be documented for applied purposes, e.g. second language learning and translation studies. Labrador’s study thus offers new cross-linguistic insights into the way in which nominal content is naturally and idiomatically expressed in the languages compared.
In the article entitled “Who’s afraid of ... what – in English and Portuguese”, Maia & Santos explore how the semantic domain of fear is expressed in the two languages. The study draws on data from a range of corpora, notably the BNC for English and the AC/DC corpora for Portuguese. While the BNC is tagged only for part of speech, the AC/DC corpora have in addition been parsed and tagged for the fear domain. Examining two European languages within the same cultural sphere, the authors confirm their hypothesis that the two languages have similar expressions available to express this emotion. Moreover, both languages tend to focus on the Senser in fear situations. Thus it is tempting to speculate that this is a shared tendency across European languages and mindsets in general. At a more detailed level, however, some minor discrepancies can be observed; Portuguese operates with three closely related expressions, all of which can be translated into English as John was frightened, viz. O João assustou-se ‘John frightened REFL’, O João estava assustado ‘John was frightened’ and O João ficou assustado ‘John became frightened’. This raises the question of how the three Portuguese expressions differ cognitively and what implications this has for the ways in which they are perceived. Maia & Santos give some ideas of how their contribution can shed light on the notions of universal grammar and linguistic relativity and conclude their article with some suggestions for future research, including further studies of (other) emotions in a variety of languages.
Reichardt investigates the relationship between the valency patterns of the verb consider and its correspondences in German. On the basis of EuroParl, supplemented with data from the monolingual reference corpora the Bank of English and the Deutsches Referenz Corpus, she finds that correspondences, or translation equivalents, in German vary according to the valency patterns of consider, “indicating a congruence between local grammar and the TEs [translation equivalents], i.e. the meaning”. In the case of German prüfen, for instance, the valency pattern <subj obj> is central; of the 50 cases investigated where consider and prüfen correspond to each other, consider is found in the valency pattern <subj obj> in 62% of the cases. Even though many of the German translation equivalents tend to prefer one particular valency pattern of consider, it also becomes evident that they are spread across several of the patterns, indicating that the choice of a German equivalent is not “a rule-based construction process”, i.e. there is some degree of freedom in the translations. In this connection, it is important to stress that the choice also relies on meaning, or the degree of synonymy. Thus, Reichardt points to the importance of analysing language at the grammar-lexis interface. Not only do valency patterns provide information about the syntactic environment of an item, they also give information about the usage of words, i.e. their semantic environment.
With reference to the language pair English-Norwegian, Thunes relates the concept of translational complexity to the notion of computability within two text types, viz. fiction vs. law texts. In so doing, she seeks answers to two questions in particular: is it possible to automatically compute the translational relation between the text pairs in question, and is there a difference in translational complexity between the text types under investigation? For this purpose, Thunes outlines a system for measuring translational complexity which is based on a hierarchy of correspondence types, operating with four types on a scale from least complex to most complex in terms of pragmatic, semantic and syntactic correspondence across languages. Her findings suggest that, although law texts contain a higher proportion of computable translational correspondences, i.e. the degree of complexity is lower, it is not a given that they lend themselves better to machine translation than do fiction texts. However, following a more detailed discussion on the issue, Thunes concludes that, since the non-computable strings in law texts contain minimal cross-linguistic differences, the potential post-editing cost of applying automatic translation to law texts is relatively modest compared to that of fiction. Thunes’s study also uncovers certain tendencies with regard to how finite clauses in Norwegian are seen to correspond to non-finite clauses in English, thus providing food for thought not only in the fields of rule-based translation and translation studies but also in that of contrastive analysis.
Kehoe & Gee’s article on the Birmingham Blog Corpus has two main parts. In the first part they introduce the corpus and discuss issues related to the compilation process in some detail. After a brief discussion of the blog format and web corpora in general, the authors outline the steps involved in the process of building a blog corpus, including some examples of how the extraction of text from hosting sites was done. The corpus is made up of several sub-corpora of blog posts and reader comments, i.e. data from different blog hosting sites, and currently (October 2012) stands at approx. 600 million words. It is freely available through the WebCorp Linguist’s Search Engine interface. The second part of the article is devoted to a case study focusing on textual aboutness. By analysing reader comments to blogs in one of the sub-corpora, Kehoe & Gee illustrate that such comments may be good aboutness indicators, and as such “could be used to improve document indexing on the web”. In a comparison of word lists from the body of the posts themselves and word lists from the comments, a list of more than 100 potential aboutness indicators unique to the comments emerged. Although some manual intervention was required to produce the actual list of aboutness-indicating words from the comments, the method of going beyond the information provided in the blog post itself proved useful in the indexing of blog posts. Kehoe & Gee also draw attention to the growth and popularity of blogs in recent years, leading to the conclusion that blogs can no longer be looked upon as a single, uniform genre of ‘online journal or diary’.
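The core of this method can be pictured as a comparison of frequency lists. The sketch below is only a minimal illustration of that idea, not Kehoe & Gee’s actual pipeline; the function names, the tokenisation and the frequency threshold are assumptions made for the example.

```python
from collections import Counter
import re

def word_frequencies(texts):
    """Lower-cased word frequency counts for a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def candidate_aboutness_indicators(post_bodies, comments, min_freq=3):
    """Words that occur in the reader comments but not in the post bodies;
    such words are candidate aboutness indicators and would still need
    manual filtering, as in the study described above."""
    post_freq = word_frequencies(post_bodies)
    comment_freq = word_frequencies(comments)
    return sorted(
        (word for word, freq in comment_freq.items()
         if freq >= min_freq and word not in post_freq),
        key=lambda w: -comment_freq[w],
    )

# Hypothetical usage with one blog post and its reader comments
posts = ["Today I baked sourdough bread using a rye starter."]
comments = ["Lovely crumb! What hydration did you use?",
            "My starter always over-proofs; any tips on fermentation?"]
print(candidate_aboutness_indicators(posts, comments, min_freq=1))
```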
Lehmann & Schneider present a dependency treebank for the BNC, the BNC Dependency Bank, offering a description of the syntactic annotation scheme applied as well as a demonstration of the query interface. The paper starts with a step-by-step introduction to how the dependency parser Pro3Gres works on input annotated by the LT-TTT2 framework, in the form of both tagging and chunking. With such a robust parser the authors demonstrate how dependency relations operate between nuclei (typically between noun and verb chunks) within an s-unit. The annotated data are stored in a database of dependency relations. Once the result sets from this database are generated, these serve as input to the web-based interface, i.e. the BNC Dependency Bank proper. The article offers a series of screenshots that take the reader through the types of dependency queries that can be performed. Although the authors issue some words of caution with regard to the success of the system, it becomes clear that this is an excellent resource for linguists who want to delimit their data syntactically. Lehmann & Schneider conclude by suggesting directions for further development of the dependency bank framework, particularly how the parser can be improved. Furthermore, an important aspect of their effort is the re-usability of the framework in the annotation of other corpora, thus enabling direct comparisons of a variety of corpora.
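To make the idea of a database of dependency relations more concrete, here is a toy sketch: it stores (s-unit, relation, head, dependent) tuples in SQLite and retrieves all subjects of a given verb. The schema, field names and example rows are assumptions for illustration only and do not reflect the actual BNC Dependency Bank or the Pro3Gres output format.

```python
import sqlite3

# Toy schema: one row per dependency relation, keyed by the s-unit it occurs in.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dependency (
                    s_unit    TEXT,  -- identifier of the s-unit
                    relation  TEXT,  -- e.g. 'subj', 'obj', 'pobj'
                    head      TEXT,  -- lemma of the governing nucleus
                    dependent TEXT   -- lemma of the dependent nucleus
                )""")
conn.executemany(
    "INSERT INTO dependency VALUES (?, ?, ?, ?)",
    [("A01 23", "subj", "consider", "committee"),
     ("A01 23", "obj",  "consider", "proposal"),
     ("B12 7",  "subj", "run",      "dog")],
)

# Retrieve all subjects of the verb 'consider': roughly the kind of
# syntactically delimited query such an interface supports.
for row in conn.execute(
        "SELECT s_unit, dependent FROM dependency "
        "WHERE relation = 'subj' AND head = 'consider'"):
    print(row)
```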
Another innovative resource, the CorTrad corpus, is described in the contribution by Santos, Tagnin & Teixeira. It is one of the first semantically tagged parallel corpora, if not the first; moreover, it is unusual in containing source texts in both English and Portuguese with multi-version translations (i.e. three translation stages) into the other language. The authors begin by describing the annotation process, where the semantic field of colour was singled out. Even with a step-by-step annotation scheme, it is evident that semantic tagging of this kind is not an easy enterprise, illustrated not least by the many choices that have to be made with regard to the classification of colour terms. The article moves on to survey the use of colours and colour terms in the three genres represented in CorTrad, viz. fiction (English-Portuguese), scientific magazine (English-Portuguese), and a cookbook (Portuguese-English), with a view to determining how colour terms are used in the two languages, what type of meaning they encode, to what extent they are used as translations of each other, and what role genre seems to play. The findings point to cross-linguistic differences as well as genre differences in the use of colour terms. For instance, in the case of the Portuguese lemma dourar ‘to become golden/to gild’, it was speculated that there is a lexical gap between the two languages, as English seems to favour brown where Portuguese uses dourado ‘golden’. With reference to differences between the genres, variation is seen not only in the frequency with which colours are used, but also in their collocations. In their concluding remarks, Santos, Tagnin & Teixeira suggest avenues for future research, including comparisons with data from other and larger corpora.
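As a rough illustration of how a semantically tagged parallel corpus can be queried for colour correspondences, the following sketch pairs colour-tagged tokens across aligned units. The data structure and tag labels are invented for the example and do not reflect CorTrad’s actual annotation format.

```python
# Each aligned unit is assumed to carry token-level semantic tags, so that
# colour-tagged tokens can be picked out on both sides of the alignment.
aligned_units = [
    {"en": [("the", None), ("golden", "colour"), ("crust", None)],
     "pt": [("a", None), ("crosta", None), ("dourada", "colour")]},
    {"en": [("brown", "colour"), ("the", None), ("onions", None)],
     "pt": [("doure", "colour"), ("as", None), ("cebolas", None)]},
]

def colour_correspondences(units):
    """Pair up colour-tagged tokens within each aligned unit (many-to-many)."""
    pairs = []
    for unit in units:
        en_colours = [tok for tok, tag in unit["en"] if tag == "colour"]
        pt_colours = [tok for tok, tag in unit["pt"] if tag == "colour"]
        for en in en_colours:
            for pt in pt_colours:
                pairs.append((en, pt))
    return pairs

print(colour_correspondences(aligned_units))
# [('golden', 'dourada'), ('brown', 'doure')]
```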
Sotillo’s study of illocutionary acts in SMS text messages starts by introducing the researcher’s SMS corpus, of which she uses a subsample of 1,271 sent and received text messages in the current study. Using Speech Act Theory (SAT) as her theoretical framework, Sotillo classifies the text messages according to Searle’s five types of illocutionary acts, viz. assertives, directives, commissives, expressives, and declarations/declaratives. These categories were also used in the annotation of the corpus. SAT was chosen because it equips the researcher to explain the intended meaning of the SMSs by uncovering how illocutionary acts are instantiated in text messages. Sotillo seeks answers to the following questions: (1) what types of illocutionary acts are found in the data, and (2) what is the communicative intent and functional orientation of these? In order to answer the latter question, a further coding scheme, modelled after Thurlow & Brown’s (2003) communicative intent-functional orientation framework, was devised, and the illocutionary acts previously identified were classified accordingly. With reference to research question (1), Sotillo found that a majority of the illocutionary acts were of the assertive type, followed by expressives, directives and commissives. No instances of declarations/declaratives were recorded. As regards research question (2), the analysis uncovered that, although the communicative intent-functional orientation was most often found to be ambiguous (34%), information sharing and requests were the preferred categories overall, accounting for 24% and 19% of the cases, respectively – a finding that “conforms to Searle’s (1969) conceptualization of language as a type of social activity”. More detailed analyses of the various contributions in the SMS corpus substantiate claims made in previous research in addition to revealing new insights that could be pursued in future studies, e.g. the use of expressive emoticons and lexical shortenings, the impact of a texter’s occupation, etc. Sotillo concludes by encouraging further research in the field of text messaging on the basis of larger collections of data in order to be able to “advance our understanding of this variety of naturally occurring language”.
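The quantitative side of such a study amounts to tallying the annotated labels and reporting their distribution. The snippet below is a minimal sketch of that step with invented messages and labels; it is not Sotillo’s coding scheme or data.

```python
from collections import Counter

# Hypothetical annotated messages: (text, illocutionary act label)
annotated_sms = [
    ("meeting moved to 3pm", "assertive"),
    ("pls bring the charger", "directive"),
    ("i'll be there by 5", "commissive"),
    ("so happy for u!! :)", "expressive"),
    ("train is delayed again", "assertive"),
]

# Tally the labels and print the distribution as counts and percentages.
counts = Counter(label for _, label in annotated_sms)
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label:12s} {n:3d}  ({n / total:.0%})")
```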
The nine articles that make up this volume demonstrate how vibrant the field of corpus linguistics is and what a wide variety of issues it lends itself to. In the first part, five papers sharing an interest in contrastive analysis and translation studies illustrate how the comparison of different language pairs may serve as an eye-opener in the study of synonymy and polysemy in English and Norwegian (Egan), nominalizations in English and Spanish (Labrador), emotional domains in English and Portuguese (Maia & Santos), valency patterns in English and German (Reichardt) and translational complexity in English and Norwegian (Thunes). Similarly, in the second part, where corpus development, corpus tools and annotation are central, a number of different corpora and corpus-related issues are discussed by the authors. Corpora of relatively recent text types such as blogs (Kehoe & Gee) and SMS text messages (Sotillo) are described and analysed in terms of aboutness in the former case and in terms of illocutionary acts in the latter. A description of a new syntactic (dependency) annotation scheme and query system as applied to the British National Corpus is also among the contributions (Lehmann & Schneider). The CorTrad English-Portuguese parallel corpus with semantic tagging is also introduced and explored for the semantic field of colour (Santos, Tagnin & Teixeira).
We extend our sincere thanks to the authors for their contributions both to the ICAME32 conference and to this volume. Thanks are also due to the many reviewers who devoted their time and expertise to the careful review of the contributions.
Stig Johansson’s (2012: 64) expressed wishes for the future of multilingual corpora nicely capture the many aspects of corpus linguistics represented in this volume:
- we need to widen the range of languages;
- we need multi-register corpora;
- we need corpora with annotation of features which cannot be easily found in raw, unannotated text;
- we need to learn more about how we can best exploit them.
We believe that most of these points are also valid for monolingual corpora and that the authors represented in the present volume have contributed towards the fulfilment of these wishes.
Sources
ICAME homepage: http://icame.uib.no/.
ICAME founding document (PDF): http://clu.uni.no/icame/history/founding_document_1977.pdf.
Stig Johansson’s role in the origin of ICAME (PDF): http://clu.uni.no/icame/history/Leech_Johansson.pdf.
Trends and Traditions in English Corpus Linguistics: https://blogs.it.ox.ac.uk/ota/2011/06/01/icame-2011/.
Norwegian School of Economics and Business Administration homepage: https://www.nhh.no/en/.
Uni Research AS homepage: http://uni.no/?lang=en.
LOB Corpus manual: http://clu.uni.no/icame/manuals/LOB/INDEX.HTM.
English-Norwegian Parallel Corpus: http://www.hf.uio.no/ilos/english/services/omc/enpc/.
Text Encoding Initiative homepage: http://www.tei-c.org/index.xml.
Oslo Multilingual Corpus: http://www.hf.uio.no/ilos/english/services/omc/.
“Language corpora are no longer necessary for linguistic research”: http://blogs.it.ox.ac.uk/ota/2011/06/01/need_corpora/.
Notes
[1] In 1996 Medieval was added to the name of the organization, which had originally been called the International Computer Archive of Modern English (Leech & Johansson 2009: 18).
[2] Cf. Sperberg-McQueen & Burnard (1994).
References
Aijmer, K. & B. Altenberg, eds. forthcoming. Advances in Corpus-based Contrastive Linguistics. Studies in Honour of Stig Johansson. Amsterdam & Philadelphia: John Benjamins.
Andersen, G. & K. Bech, eds. forthcoming. English Corpus Linguistics: Variation in Time, Space and Genre. Rodopi.
Biber, D., S. Johansson, G. Leech, S. Conrad & E. Finegan. 1999. Longman Grammar of Spoken and Written English. Harlow: Longman.
Hasselgård, H., J. Ebeling & S.O. Ebeling, eds. forthcoming. Corpus Perspectives on Patterns of Lexis. Amsterdam & Philadelphia: John Benjamins.
Hasselgård, H., S. Johansson & P. Lysvåg. 1998. English Grammar: Theory and Use. Oslo: Universitetsforlaget.
Hasselgård, H., P. Lysvåg & S. Johansson. 2012 [2nd ed.]. English Grammar: Theory and Use. Oslo: Universitetsforlaget.
Hofland, K. & S. Johansson. 1982. Word Frequencies in British and American English. Bergen: The Norwegian Computing Centre for the Humanities.
Johansson, S. in collaboration with G. Leech and H. Goodluck. 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Department of English, University of Oslo. http://clu.uni.no/icame/manuals/LOB/INDEX.HTM (27 September 2012).
Johansson, S. 2007. Seeing through Multilingual Corpora: On the Use of Corpora in Contrastive Studies. Amsterdam & Philadelphia: John Benjamins.
Johansson, S. 2009. “Which way? On English way and its translations”. International Journal of Translation 21(1–2): 15–40.
Johansson, S. 2012. “Cross-linguistic perspectives”. English Corpus Linguistics: Crossing Paths, ed. by M. Kytö. Amsterdam etc.: Rodopi. 45–68.
Leech, G. & S. Johansson. 2009. “The coming of ICAME”. ICAME Journal 30. Bergen: Aksis. 5–20. Available at http://clu.uni.no/icame/history/Leech_Johansson.pdf (17 October 2012).
Sperberg-McQueen, C.M. & L. Burnard, eds. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3), Volume 1. Chicago & Oxford: Text Encoding Initiative.