Studies in Variation, Contacts and Change in English

Volume 20 – Corpus Approaches into World Englishes and Language Contrasts

Edited by Hanna Parviainen, Mark Kaunisto & Päivi Pahta

Abstracts

Hasselgård, Hilde
The nature of the essays: The colligational framework ‘the N of the N’ in L1 and L2 novice academic English
http://www.helsinki.fi/varieng/series/volumes/20/hasselgard/

This study investigates the use of the colligational framework ‘the N1 of the N2’ in novice academic English. The material comes from the English literature discipline in the Norwegian component of the VESPA corpus (Varieties of English for Specific Purposes dAtabase) and the BAWE corpus (British Academic Written English). The research questions concern the frequency, distribution, lexical realizations and meaning of the colligation in L1 and L2 novice academic English. Although the L2 users were expected to underuse the colligation due to cross-linguistic differences between English and Norwegian as well as the fact that phrasal complexity requires a high level of proficiency, the frequency of the pattern was found to be relatively similar between the two corpora. The second noun in the colligation shows more discipline-specificity than the first. The colligation has a high degree of variability in both corpora: the most recurrent lexical pattern is that in which the N1 represents a part and the N2 a literary work, as in the end of the novel. However, there were qualitative differences between the two datasets, the most important of which concerned cases in which the N1 is a nominalization. These were used more in L1 than in L2 along with support nouns, while the learners used the pattern with partitive and possessive meaning slightly more than the native speakers.

Chen, Yu-Hua, Simon Harrison and Robert Weekly
“I don’t have communicate ability”: Deviations in an L2 multimodal corpus of academic English from an EMI university in China – errors or ELF?
http://www.helsinki.fi/varieng/series/volumes/20/chen_harrison_weekly/

Language forms that deviate from the norm (commonly understood as native-speaker standards) are often labelled as ‘errors’ by language teachers or researchers in the areas of second language acquisition or language learning. Similar non-standard forms, however, are referred to as ‘features’ in other contexts such as English as a Lingua Franca (ELF). In this paper, we argue that the notions of ‘error’ and ‘ELF’ are not always mutually exclusive, and that the attribution depends heavily on the context. Non-standard use of part-of-speech forms, for example, is one of the most common deviation types we identify in an L2 corpus (e.g. “I don’t have communicate ability” or “they will lead to the bad influence on the economic”). Similar ‘non-codified’ examples are also found in the VOICE corpus (e.g. “do you arrived there”, “the rest are protect area”), one of the most well-known ELF corpora. By presenting a selection of such examples extracted from the written, spoken, and multimodal components of an L2 corpus (the Corpus of Chinese Academic Written and Spoken English) from an EMI (English Medium Instruction) university in China, this paper discusses how we, as researchers and practitioners, can reconcile different views of deviation, and considers the implications for teaching, learning and assessment. We argue that ‘errors’ do not play as important a role in spontaneous speech as they do in academic writing, and that in many respects the difference between an L2 English learner and an ELF speaker is contextual: when learners leave the classroom and use English, they immediately become ELF speakers, proficient or not.

Romasanta, Raquel P.
Variability in verb complementation: Determinants of grammatical variation in indigenized L2 varieties of English
http://www.helsinki.fi/varieng/series/volumes/20/romasanta/

Verb complementation is one of the areas where variability and change in indigenized L2 varieties of English are frequently observed. As such, it has been addressed from a semantic and pragmatic point of view, and this study does likewise. Here the focus is the complementation profile of the verb regret, a verb which has historically shown non-categorical variation between declarative finite that/zero-complement clauses and gerunds (e.g. I regret that I said that / saying that). The database comprises four different varieties of English (American English, British English, Hong Kong English, and Nigerian English) as represented in the Corpus of Global Web-Based English (GloWbE).

An analysis of the distribution of the two available patterns in the English varieties and the different substrate languages (Cantonese in Hong Kong English, and Hausa, Igbo, Yoruba and French in Nigerian English) suggests that both cognitive effects derived from language contact situations and second language acquisition processes (mainly increased isomorphism and transparency) and influence of substrate languages serve as possible explanations for the higher proportions of declarative finite that/zero-complement clauses in the L2 varieties here. The binary logistic regression analysis of other intra- and extra-linguistic factors drawn from the literature shows that the choice of declarative finite that/zero-complement clauses is determined by factors such as inanimate and non-coreferential subjects, presence of negative markers, passive voice, action verbs, text type General, simultaneous temporal relation, and an increase in the number of words in the complement clause and in the intervening material between the two clauses.

Ronan, Patricia
Silly much? Tracing the spread of a new expressive marker in recent corpora
http://www.helsinki.fi/varieng/series/volumes/20/ronan/

This qualitative and quantitative corpus-based study traces the use of a recently evolving expressive marker, the ‘expressive much’ or X-much construction. This typically consists of an adjective, often negatively connoted, that is postmodified by much, is used extra-syntactically, and is typically presented with question intonation or, in written language, a question mark. Examples are Silly much? or Paranoid much? This structure is traced through recent corpora of American and international varieties of English. The study uses data from the Corpus of Contemporary American English (COCA) and the Corpus of Global Web-Based English (GloWbE), and extracts examples semi-automatically with the help of the search interfaces of the Brigham Young Corpus suite. The study finds that previously observed features of the structure such as extrasyntactic structure, question format and negatively connoted adjectives are still frequent, but that further extensions towards non-question structures, embedding and positively connoted adjectives can also be found. The study further shows that the distribution of the X-much construction varies strongly across the varieties of English represented in the corpus materials, and that geographic and varietal preferences can be observed: the structure is well-attested in American English and in a number of Pacific varieties of English, but little attested in a number of African varieties and varieties around the Indian subcontinent.

Andersen, Gisle
Phraseology in a cross-linguistic perspective: Introducing the diachronic-contrastive corpus method
http://www.helsinki.fi/varieng/series/volumes/20/andersen/

The inventory of phrasemes in a language is not static: new patterns of lexical co-occurrence evolve over time, and such new patterns may be the result of external influence due to language contact. Thus, a cross-linguistically parallel phraseme such as English to go for X and Norwegian å gå for X, in the sense of ‘choose among several options’, for instance from a menu, may – but need not – be the result of indirect borrowing (Backus 2014). In this paper I investigate ‘the largely unexplored area of phraseological borrowing’ (Fiedler 2017: 90). I first present a typological survey that draws on the work of Granger and Paquot (2008) and Fiedler’s (2017) recent work on phraseological Anglicisms in German. Next, I show how a diachronic-contrastive corpus method can be devised to investigate whether cross-linguistically parallel phrasemes are the result of borrowing or of parallel developments, and how it can serve as a vehicle for rejecting preconceived ideas about a form’s alleged origin in English. The approach is based on diachronic and synchronic corpora of English (COCA and COHA) and Norwegian (the Norwegian Newspaper Corpus and the National Library’s Text Archive).

Mandal, Antorlina & Leonie Wiemeyer
Foreign elements in EFL students’ term papers – communicative strategy or display of multilingual competence?
http://www.helsinki.fi/varieng/series/volumes/20/mandal_wiemeyer/

The present study explores the use of foreign elements in linguistic research papers written by L1 German EFL learners from the Corpus of Academic Learner English (CALE; Callies & Zaytseva 2013). By definition, learner corpora contain texts produced by multilinguals. As a result, learner texts are likely to contain elements from other languages (Callies & Wiemeyer 2017). Research into codeswitching has shown that even advanced learners resort to their mother tongue for bridging lexical gaps and for self-repair (Liebscher & Dailey-O’Cain 2005). Several studies analysing interviews from the LINDSEI (Gilquin et al. 2010; Nacey & Graedler 2013; De Cock 2015a, 2015b) established that recourse to the L1 is a typical communication strategy. Moreover, lexical gap-filling strategies are also found in L2 writing (Agustín Llach 2010). However, the specialised nature of academic texts is likely to bring about different multilingual practices. Thus far there is no corpus-linguistic evidence of learners’ use of foreign elements in their academic writing. This research gap is addressed in the present contribution. It was found that the majority of foreign elements were individual words and phrases, usually from the writers’ L1 German. The learner texts in the CALE, like those in the LINDSEI, also contained cultural bridges, though they were generally a minor phenomenon. Unlike in spoken language, which has so far been the focus of codeswitching studies in learner corpus research, foreign elements were not employed to fill lexical gaps. Instead, in accordance with the specialised nature of the texts, learners used discipline-specific terminology, examples, and illustrations from languages other than English. They exploited their multilingual skills to compare linguistic phenomena in their L1 German and other L2s. The findings show that EFL academic writers’ use of foreign elements at advanced levels of proficiency is not a communicative strategy, but fulfils academic goals.

Mehl, Seth
Measuring lexical co-occurrence statistics against a part-of-speech baseline
http://www.helsinki.fi/varieng/series/volumes/20/mehl/

Analysing strength of lexical co-occurrence using Mutual Information (MI) and Pearson’s chi-square test is standard in corpus linguistics; typically, such analyses are conducted using a statistical baseline of all tokens in the data set (cf. Manning and Schuetze 1999). That is, the probability of a given type or lemma is measured as the number of occurrences of that type or lemma against the total number of tokens in the data. This baseline, however, is not ideal as a measure of linguistic probability: the denominator representing all tokens is artificially high because each token does not represent an opportunity for the given lemma to occur (cf. Wallis 2012). This high denominator in turn results in an artificially low probability and suggests an artificially high degree of confidence in the measurement. This paper reports an experiment in employing a grammatical part of speech (POS) baseline for calculating statistical probability of co-occurrence, asking: In what ways does a POS-baseline differ from a traditional baseline of all tokens, when calculating chi-square and MI? The experiment is conducted in the context of a major research project studying meaning through lexical co-occurrence in Early Modern English texts, and the data is drawn from Early English Books Online (Text Creation Partnership edition). I demonstrate that the traditional baseline of all tokens yields higher MI values and more ‘significant’ results than a POS-baseline. I argue that the traditional baseline of all tokens can be interpreted as yielding artificially high MI values; and as yielding an artificially high number of significant results – but I also illustrate that the improvements of the POS-baseline may be negligible for the typical task of ranking the top ten co-occurrence pairs for a given node word.
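The effect of the baseline choice described in this abstract can be illustrated with a minimal sketch. This is not the author's implementation: the function names, the 2×2 contingency layout, and all counts are hypothetical, and the POS-baseline is simplified here to "the number of tokens that could in principle have been the collocate", ignoring window size. Only the denominator changes between the two calculations.

```python
import math


def mutual_information(pair_count, node_count, collocate_count, baseline):
    """Pointwise MI: log2 of observed over expected co-occurrence.

    `baseline` is the denominator used to turn counts into probabilities
    (all tokens, or only tokens of the relevant part of speech).
    """
    expected = node_count * collocate_count / baseline
    return math.log2(pair_count / expected)


def chi_square(pair_count, node_count, collocate_count, baseline):
    """Pearson's chi-square for a 2x2 co-occurrence table (no correction)."""
    a = pair_count                                   # node with collocate
    b = node_count - pair_count                      # node without collocate
    c = collocate_count - pair_count                 # collocate without node
    d = baseline - node_count - collocate_count + pair_count  # neither
    numerator = baseline * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Invented counts, for illustration only.
PAIR, NODE, COLLOCATE = 30, 200, 500
ALL_TOKENS = 1_000_000    # traditional baseline: every token in the data
POS_TOKENS = 150_000      # assumed POS-baseline: tokens of the relevant POS

mi_all = mutual_information(PAIR, NODE, COLLOCATE, ALL_TOKENS)
mi_pos = mutual_information(PAIR, NODE, COLLOCATE, POS_TOKENS)
chi_all = chi_square(PAIR, NODE, COLLOCATE, ALL_TOKENS)
chi_pos = chi_square(PAIR, NODE, COLLOCATE, POS_TOKENS)

print(f"MI   all tokens: {mi_all:.2f}   POS baseline: {mi_pos:.2f}")
print(f"chi2 all tokens: {chi_all:.1f}   POS baseline: {chi_pos:.1f}")
```

With these invented counts, both measures come out lower under the smaller baseline, in line with the direction of the effect reported in the abstract. Because the baseline change is applied to every pair for a given node word, it is also plausible that rankings shift less than the raw values do.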

Weisser, Martin
ICEweb 2 – a new way of compiling high-quality web-based components for ICE corpora
http://www.helsinki.fi/varieng/series/volumes/20/weisser/

Recent years have seen a renewed interest in the compilation of next-generation or new ICE sub-corpora, possibly also including new genres or data. Because corpus compilation via the web has today become much more convenient than the traditional sampling employed in creating the original ICE corpora, it makes sense to try to compile as much material as possible for new or updated written ICE components from online sources. This article introduces ICEweb 2, a new and considerably advanced version of a tool designed to collect written data for such purposes, as well as to process and analyse them in similar ways to those offered by most concordance packages, thus, at least to some extent, obviating the need to switch between tools.

Alissandrakis, Aris, Nico Reski, Mikko Laitinen, Jukka Tyrkkö, Jonas Lundberg & Magnus Levin
Visualizing rich corpus data using virtual reality
http://www.helsinki.fi/varieng/series/volumes/20/alissandrakis_et_al/

We demonstrate an approach that utilizes immersive virtual reality (VR) to explore and interact with corpus linguistics data. Our case study focuses on the language identification parameter in the Nordic Tweet Stream corpus, a dynamic corpus of Twitter data where each tweet originated within the Nordic countries. We demonstrate how VR can provide previously unexplored perspectives into the use of English and other non-indigenous languages in the Nordic countries alongside the native languages of the region and showcase its geospatial variation. We utilize a head-mounted display (HMD) for a room-scale VR scenario that allows 3D interaction by using hand gestures. In addition to spatial movement through the Nordic areas, the interface enables exploration of the Twitter data based on time (days, weeks, months, or time of predefined special events), making it particularly useful for diachronic investigations.

In addition to demonstrating how the VR methods aid data visualization and exploration, we briefly discuss the pedagogical implications of using VR to showcase linguistic diversity. Our empirical results detail students’ reactions to working in this environment. The discussion part examines the benefits, prospects and limitations of using VR in visualizing corpus data.

University of Helsinki