Corpora as lexicographical basis – The case of anglicisms in Norwegian

Gisle Andersen, Norwegian School of Economics and Business Administration

Abstract

This paper reports on the status of ongoing corpus building and lexicographical work within the framework of the Norwegian Newspaper Corpus project. Specifically, it describes the work flow, tools and methods used in the identification and analysis of new anglicisms in Norwegian. Surveying lexical borrowing from English serves a variety of purposes, including lexical acquisition, the extraction of terminology, language technology, and more general linguistic purposes such as surveying the amount and inventory of English loan words in various usage domains. Observations from such a survey may form an empirical basis for language policy decisions, such as considering efforts to prevent domain loss. While previous work in Norwegian lexicography has generally relied on manual methods for excerpting new words, and for identifying anglicisms among the new words, the current project is an effort to develop tools which automatise the process of identifying, segmenting and analysing new loan words from English. A system for corpus-based language monitoring has been set up at Uni Digital (formerly Unifob AKSIS), in close cooperation with lexicographers at the University of Oslo. The paper briefly describes the overall workflow and focuses especially on alternative methods for identifying anglicisms (lexicon-based, n-gram-based, combinatory methods). The paper also presents some main trends and statistics regarding the use of anglicisms, as well as future plans for exploitation of this material.

1. Introduction

Among the significant events in recent corpus development, we find the emergence of dynamic corpora and web-based corpora, as well as the exploitation of corpora in lexicographic projects (Renouf 2007). Initiated by Atkins and Sinclair in the 1970s, the Collins COBUILD project at the University of Birmingham was the first lexicography initiative which systematically used a corpus as its main source of knowledge about words and their use in the language (Sinclair 1987). This effort has been described as a revolution which “changed the principles and methods of dictionary making” (Pulcini 2008: 189) and which enabled lexicographers to “view the evidence of how a word was used without the arbitrary filter of who thought what was an interesting example of a word” (Kilgarriff & Tugwell 2002: 125). Corpus lexicography provides a methodology “for measuring hard evidence of the lexical behaviour of words since this will (arguably) result in a more representative, coherent and consistent output than a lexicon produced from conventional means” (Ooi 1998: 37). With respect to corpus building, “the first ‘dynamic’ corpus of unbroken chronological text” (Renouf 2007: 36) was established in 1990 as part of the AVIATOR project in Birmingham, using the Times newspaper as its source. This was followed by other large monitor corpora such as the ACRONYM project corpus (Renouf 1996) and more recently the Corpus of Contemporary American English (Davies 2009). Since the turn of the millennium, it has become increasingly common to develop and explore web-based corpora, also known as ‘cyber-corpora’ (Renouf 2007), resulting in a growing body of corpus-based studies using the web as their prime source of data (Kilgarriff & Grefenstette 2003; Hundt, Biewer & Nesselhauf 2007). Corpus-building efforts such as WebCorp (Renouf, Kehoe & Banerjee 2005) and the WaCky initiative (Baroni, Bernardini, Ferraresi & Zanchetta 2009) use web crawler technology for making the web serve as a linguistic corpus.

Inspired by such technological innovations, the Norwegian Newspaper Corpus project [1] is an ongoing effort to create a large monitor corpus representing present-day Norwegian in both its written standard varieties, Bokmål and Nynorsk, and to develop associated language processing tools particularly tailored for lexicographical work (Hofland 2000, Wangensteen 2002, Andersen 2005, 2010). The web-based corpus is compiled by daily harvesting and processing of published texts from the web editions of several Norwegian newspapers. It is a ‘modern diachronic corpus’, in the sense described by Renouf (2007), enabling the study of language change, neologistic usage and lexical productivity and creativity as it unfolds in written language. The open-ended monitor corpus can be used to study language as a changing phenomenon at the diachronic micro-level, by comparing time frames calibrated on a daily, weekly, monthly or yearly basis within the time-span the corpus represents. This makes it particularly well suited for studying ongoing lexical and stylistic innovation and variation, in addition to other types of innovation, such as grammatical change. Although the main rationale of the Norwegian Newspaper Corpus has been linked to the need for updated text for lexicography purposes, a much wider usage can be envisaged (Andersen & Hofland forthcoming).

The current paper gives an outline of the methods and language processing tools that have been developed for corpus-based lexicographical work within the framework of the Norwegian Newspaper Corpus project. It focuses in particular on the work flow, tools and methods used in the identification and analysis of English loan words that occur in the Norwegian language. The methods described below include tools for neologism extraction, anglicism detection, collocation analysis and frequency profiling.

2. The system architecture of the Norwegian Newspaper Corpus

The text collection for the Norwegian Newspaper Corpus began in 1998. As of September 2010, the corpus consists of about 850 million words, which makes it the largest searchable corpus of Norwegian. The daily growth is on average approximately 230,000 words. The corpus consists of the full web version of about 25 Norwegian newspapers. The selection of sources has been limited to newspapers that also have a printed counterpart. This includes large, national newspapers like Aftenposten, Dagbladet and VG, major regional newspapers like Bergens Tidende and Stavanger Aftenblad and local newspapers like Sogn Avis and Gudbrandsdølen/Dagningen. Most of the newspapers are of general interest, but a few niche publications are also included, specifically the newspaper Nationen, with a special focus on agricultural issues, Vårt Land, which is explicitly religious and Morgenbladet, a weekly newspaper with a strong academic profile. The full political spectrum is also represented, ranging from the business newspaper Dagens Næringsliv to Klassekampen representing the political left. The selection of newspapers has resulted in a large corpus with a wide topical coverage containing relatively homogeneous data, despite the fact that it is harvested from the web. Although maximal efforts have been made to ensure a balance between the two language varieties, the Bokmål variety is massively larger than Nynorsk. This discrepancy reflects the degree to which Bokmål and Nynorsk are used in newspapers on the web, but it is not necessarily representative of the use of the two varieties in other contexts.

The system involves several stages of processing, most of which run automatically by means of a self-executing batch file. Its architecture is visualised as a data flow diagram in Figure 1 and involves the following main steps:

  1. harvesting: two different web-crawler programmes, w3mir and wget, download the full internet version of Norwegian newspapers
  2. boilerplate removal: a set of specifically designed programs automatically select the core text, including headlines, introductory and main text, and image texts, but discarding advertisements, navigation menus, metatext, html code, etc.
  3. language classification: the texts are classified as either Bokmål or Nynorsk, while English texts are discarded
  4. text annotation: metadata concerning date, author and source are extracted from the source texts, and the texts are machine classified according to topic and morphosyntactically tagged by the Oslo-Bergen tagger
  5. new word form extraction: the inventory of word forms of newly harvested texts is compared with an accumulated list of word forms, and a list of forms not previously recorded is extracted and added to the accumulated word list
  6. new word form classification: new word forms are classified according to orthographical criteria, and anglicism candidates are identified
  7. frequency profiling: statistical filters are used to identify neologisms that are most relevant for lexicography
  8. lexical database entry: selected neologisms are registered in the Norwegian Word Bank with relevant morphosyntactic and semantic information

Steps 1–6 are performed automatically on a daily basis, while steps 7–8 require manual intervention and are performed in less regular batches. The development of tools for boilerplate removal has been a particularly time-consuming and complicated task (Andersen & Hofland forthcoming), and this procedural step also includes algorithms for removal of duplicate texts. The current paper is mostly concerned with steps 5–7, to be described in more detail below.
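The comparison in step 5 can be illustrated by a minimal sketch. It is not the project's implementation: the function name `extract_new_word_forms` is hypothetical, splitting on whitespace is a simplifying assumption, and the small `known` set merely stands in for the accumulated word list.

```python
def extract_new_word_forms(text, known_forms):
    """Return forms in `text` that are absent from `known_forms`.

    Hypothetical helper: each new form is also added to `known_forms`,
    so that it is reported only the first time it is encountered.
    """
    new_forms = []
    for form in text.split():  # assumption: word forms are delimited by whitespace
        if form not in known_forms:
            new_forms.append(form)
            known_forms.add(form)
    return new_forms


# Toy accumulated list; the real list is far larger.
known = {"det", "er", "en", "ny"}
print(extract_new_word_forms("det er en ny blogg om en ny podcast", known))
# ['blogg', 'om', 'podcast']
```

Because the accumulated list is updated in place, running the same text through the function a second time yields no new forms, which mirrors the daily, incremental character of the procedure.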

3. Extraction of new words

The tangible output of the processing described above is a substantive daily growth in terms of types and tokens in the corpus. Of the c. 230,000 running words that are daily added to the text database, on average 1,300 are new word forms. It should be pointed out that, in line with the corpus linguistic tradition, a ‘word’ is a technical concept that is defined as any sequence of graphemes found between two spaces in authentic running text. A ‘new word’ is any word that is not included in a large, accumulated reference word list (cf. step 5 above), against which all harvested text is checked. As of January 2010, this reference list consists of about 3.9 million word forms, including a full-form lexicon derived from the comprehensive dictionary Bokmålsordboka. Table 1 shows the variety of items that are considered new words by this definition, including some categories that are highly relevant for lexicography and others which are not.

Table 1. New word categories in the Norwegian Newspaper Corpus.

Category                            Examples                          #        %
General neologisms/spelling errors  tidsklemma, pingle, ektremistisk  895336   46.0
Anglicism candidates                whistleblower, blogg, subprime    104217   5.4
Abbreviations                       omg.                              2838     0.1
Names                               al-Duwasa, CanJet, Olsweek        477828   24.6
Hyphenated names etc.               Pan-Arctic                        104960   5.4
Hyphenated compounds                e-meter-tester, blokk-bleik       212413   10.9
Compounds with other marking        Fabian/Wikimedia                  62510    3.2
Digits and abbreviations            KOMMUNE26                         16802    0.9
Pure digits                         88,500                            358      0.0
URLs and e-mail addresses           kickoff.com                       22968    1.2
Garbage                             bokmål‰rart, Rekdal.-             44790    2.3
Total                                                                 1945020  100.0

In the first category of Table 1, we find words that lack a special orthographic feature (capital letter, hyphen or the like) but consist of lower-case letters only, accounting for about half of the new words. Lexicographers are interested in real neologisms – new, linguistically motivated and authentic lexical items – and these are typically found in this category, represented here by forms like tidsklemma, a new compound meaning ‘the time squeeze’, and pingle, a new lexical item meaning ‘weak, cowardly person’. But the first category also includes lower-case spelling errors not previously recognised, such as ektremistisk (ekstremistisk, ‘extremist’), irrelevant from a lexicographer’s point of view, but nevertheless relevant to the developer of spell checking systems or to the psycholinguist focusing on error studies, etc. Some new words, about 5%, are classified as anglicism candidates, including whistleblower, blogg and subprime. The procedure for classification is described below. Some new words have special orthographic features which make them less relevant for inclusion in dictionaries. About 10% of the new words are productive, hyphenated compounds. Newspaper language refers widely to names of people, places, companies and products, and, unsurprisingly, a substantial proportion of new words, about 30%, are orthographically distinguishable as name candidates (including hyphenated/compound names). In addition, the new words include abbreviations, digits, URLs and e-mail addresses, and a small proportion of garbage (2.3%), that is, letter/symbol combinations that do not conform to any recognisable pattern, such as bokmål‰rart.
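A classification by orthographic criteria of the kind described here might be sketched as follows. The rules, their order and the category labels are assumptions made for illustration only, since the project's actual criteria are not spelled out; in particular, the sketch does not attempt to reproduce every assignment in Table 1, and it leaves the separation of anglicism candidates from other lower-case forms to a later step.

```python
import re

# Hypothetical rules, applied in order; not the project's actual criteria.
# A well-formed item: runs of letters/digits joined by hyphens or slashes,
# optionally followed by one final full stop.
WELL_FORMED = re.compile(r"[^\W_]+(?:[-/][^\W_]+)*\.?")
URL_OR_EMAIL = re.compile(r"@|^www\.|\.(?:com|no|org|net)$", re.IGNORECASE)


def classify_new_form(form):
    """Assign a new word form to a coarse orthographic category."""
    if re.fullmatch(r"\d[\d.,]*", form):
        return "pure digits"
    if URL_OR_EMAIL.search(form):
        return "URL or e-mail address"
    if not WELL_FORMED.fullmatch(form):
        return "garbage"
    if form.endswith("."):
        return "abbreviation"
    if any(ch.isdigit() for ch in form):
        return "digits and abbreviations"
    if "/" in form:
        return "compound with other marking"
    if "-" in form:
        # A capitalised element is taken as a sign of a name.
        if any(part[0].isupper() for part in form.split("-")):
            return "hyphenated name"
        return "hyphenated compound"
    if form[0].isupper():
        return "name candidate"
    return "lower-case form"


for example in ["tidsklemma", "omg.", "CanJet", "Pan-Arctic", "blokk-bleik",
                "Fabian/Wikimedia", "KOMMUNE26", "88,500", "kickoff.com", "Rekdal.-"]:
    print(example, "->", classify_new_form(example))
```

Ordering matters in such a cascade: the URL test precedes the well-formedness test because an internal full stop would otherwise send kickoff.com to the garbage category.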

4. Anglicisms in Norwegian

It is a well-known fact that English words thrive in many languages, including Norwegian (Graedler 1998; Görlach 2001). They may either represent new concepts like podcast or they may be new and vogue words for existing concepts, such as cap. Although not a comprehensive list, the following examples give an impression of how English words can be used in Norwegian contexts:

(1)

Det finnes også en egen kategori for podcast. (AP080115)
There is also a separate category for podcast.

(2)

De oppdaget en mann i hvit cap, mørk jakke og mørk bukse (BT090422)
They saw a man in a white cap, dark jacket and dark trousers

(3)

Prøv den i en smoothie. (VG090208)
Try it in a smoothie.

(4)

Avtalen er forutsatt av due diligence. (DN080611)
The agreement depends on due diligence.

(5)

En übercool snowboard-dude med franske foreldre fra Stavanger. (DB0301)
An übercool snowboard dude with French parents from Stavanger.

(6)

Stilsikkert i croonertradisjonen (DB050111)
True to the style of the crooner tradition

(7)

Likevel var hun 67 sekunder bak Simone Luder (24) under det første parkverdenscupløpet. (AP020511)
Nevertheless, she was 67 seconds behind Simone Luder (24) in the first park world cup race.

(8)

Raser mot barnereality (DB050613)
Raging against child reality

(9)

Vi leier bare ut rom til Manson og crewet hans. (VG990707)
We only rent out rooms to Manson and his crew.

(10)

Man går til boka med skepsis, fordi man er vant til den danske formen for coolhet og København-slang i Turèlls dikt (DB071022)
One approaches the book with scepticism, because one is used to the Danish type of coolness and Copenhagen slang in Turèll’s poetry

(11)

Siden Tolkien satt i kveldinga og nerda med alvespråkene sine, er verden blitt okkupert av… (SA011219)
After Tolkien sat in the evenings and [nerded] with his elfish languages, the world has been occupied by…

Anglicisms in Norwegian come as monomorphemic words like smoothie, or as multiword units. Multiword anglicisms may be lexicalised phrases (lexical collocations; cf. below) imported as one unit, like due diligence, a common business language term, or they may be non-lexicalised clusters of anglicisms, like übercool snowboard-dude in (5). They partake in regular Norwegian-based word formation and may undergo various morphological processes after adoption. In addition to constituting words in their own right, it is common that anglicisms are used as part of mixed compounds (Haugen 1950), where the English component may appear in initial, medial or final position in a word, as shown in (6)–(8). Inflectional processes and morphological integration are illustrated by crewet in (9) with the clitic singular definite article -et attached to an English stem. Anglicism stems may also combine with derivational affixes like -het in coolhet (‘coolness’) as in (10), and they may be affected by syntactic change, for example conversion to a different word class, illustrated by nerda (‘nerded’) in (11), which is a verb containing a Norwegian past tense ending -a. It seems likely that the conversion in this case is a post-borrowing process, although nerd as verb is found in English internet usage (but not in the Oxford English Dictionary).

The Norwegian Newspaper Corpus project devotes particular attention to anglicisms for a variety of reasons. Firstly, recent language policy documents, such as Norsk i hundre! [2] and Mål og meining [3] have expressed a worry that Norwegian is losing ground, that it is in need of protection from English influence and that there is a risk of domain loss. The proposed corpus-based method allows us to assess foreign influence on Norwegian language both quantitatively and qualitatively. Relevant questions that might be posed include: How large is the influx from English? Is it constant or varying over time? To what extent is the influx dependent on domain, and which domains are in the greatest danger of domain loss? What other variables have a bearing on anglicism density in written language, such as type of newspaper, genre, author’s gender, etc? Secondly, the lexicon is in constant change and modern dictionaries are in need of updated word lists. The project effort is clearly relevant for lexicographical work, by providing a daily updated vocabulary and information about the etymology of words, identifying imported words among neologisms. Thirdly, despite valuable contributions like Graedler (1998) and Johansson & Graedler (2002), anglicisms in Norwegian represent an understudied field of linguistics. A brief mention of some of the topics that should be investigated seems justified. Morphosyntactic variation of imported words is one relevant topic. There is a need to study variation in lemma form in imported words like cap, whose inflectional paradigm can be realised in two different ways, either as seen in Alternative 1 below with a singular form cap as its imported stem, to which Norwegian endings are attached, or as seen in Alternative 2 with a plural form as its imported stem:

Alternative 1:  cap    capen     caper           capene
Alternative 2:  caps   capsen    capser (caps)   capsene
                cap    the cap   caps            the caps

Although we know that such variation occurs (Graedler 1998), we know little about the extent and nature of this variation, let alone the semantic or cognitive motivation for importing a plural form as stem. There is a need to investigate the phonological and morphological nature of words with alternative morphological realisations, what characterises the ‘competition’ between the two paradigms, constraints of usage, mutual exclusivity, etc. Morphosyntactic variation is also seen in adjectives ending in -y, like trendy and crazy, which the most comprehensive Norwegian reference grammar (Faarlund, Lie & Vannebo 1997) regards as uninflectable, but which turn out to have definite and plural inflected forms trendye and crazye in the corpus. Moreover, imported verbs are usually inflected according to the morphosyntactic pattern of the paradigmatic verb kaste ‘throw’ (Graedler 1998), but the corpus shows signs of the other major class of Norwegian verbs, the lyse ‘shine’ class, as evidenced by the verb form rulte ‘ruled’. This leads to two competing inflectional paradigms for finite forms of imported verbs:

Alternative 1:  rule   ruler        rulet   rulet   (kaste)
Alternative 2:  rule   ruler        rulte   rult    (lyse)
                rule   rule/rules   ruled   ruled

The resources developed in the Norwegian Newspaper Corpus project enable us to embark on systematic empirical studies of these and related phenomena, moving beyond the intuitional approach via quantitative explorations of a large and continuously updated corpus.

Occasionally, English loan words get a normalised ‘norwegified’ spelling, which corresponds more closely to Norwegian pronunciation than the English orthography does. This is the case with the common adjective døll ‘dull’, which, in fact, grossly outnumbers the original English spelling dull in the corpus; cf. (12). Sometimes the ‘norwegification’ is only partial in compounds, as shown in jønkfood ‘junk food’ in (13).

(12)

Det var en skikkelig døll plass. (DB010810)
It was a really dull place.

(13)

Det hjelper ikke enslige mødre eller arbeidsledig ungdom med kropper som blir vandaliserte av jønkfood og sinn som sultefores av voksen tafatthet. (SA030328)
It does not help single mothers or unemployed youth with bodies which are vandalised by junk food and minds which are starved by adult indolence.

Norwegified spelling may be the result of top-down or bottom-up processes. The former process applies in the cases where the Norwegian Language Council has proposed an alternative spelling to the English original, as in the case of gaid/guide (Sandøy 1997). The latter process is illustrated by the examples given, where the normalised spelling is initiated by the language users themselves, in this case, newspaper journalists whose articles may go through an in-house editing process. There is a need for studies which investigate the effect and result of both of these processes and which explore the nature of non-normative spelling normalisation (Andersen forthcoming).

Anglicisms need not contain overt source language material but may consist of Norwegian material only, in which case they represent loan shifts (Haugen 1950). This is the case for loan translations (calques), in which only a meaning is imported but the forms used to express this meaning are native. An example would be nedlasting, a direct translation of downloading. Besides, semantic loans, in which the meaning of a word is extended without the import of any lexical material, are represented by karakter ‘character’ (in the new sense of a fictitious person in a play or a film), which seems to be a recent semantic loan from English. Naturally, both types occur widely in the corpus.

(14)

Men når det kommer til nedlasting av filer, er jeg direkte imponert [4]. (SA990831)
But when it comes to downloading of files, I am really impressed.

(15)

Nå var han ikke noen god far, denne karakteren i novellen. (DB0430)
He was not a good father, this character in the short story.

The question is, then, how can we use automatic or semi-automatic methods for retrieving English-based inventory in Norwegian texts? The answer depends on the orthographical characteristics of the words in question, and there are various methods associated with the different types of anglicisms described above. We have developed an in-house language processing tool which tries to identify those anglicisms that are non-adapted (Furiassi 2008), that is, those with an English orthography, i.e. not loanshifts. The module, described in Andersen (2005), uses a hybrid method combining n-gram statistics, dictionary look-up and regular expressions. English and Norwegian have rather different orthographies, and this can be exploited in the machine-based search for anglicisms. By considering grapheme typicality for English or Norwegian, it is a relatively uncomplicated task to extract words like crew, quiz, comeback, chat and shotsene ‘the shots’ as anglicisms, as they all contain character n-grams (chargrams) that do not occur in words of Norwegian origin. The algorithm that we have developed uses grapheme typicality to pick out candidates that are most likely from English, based on n-gram statistics from the BNC. For each neologism, the program considers the orthographic inventory and looks for letter combinations (bigrams) that are typically English and atypically Norwegian. A word such as crew contains the following bigrams (^ and $ signify word beginning and word end):

^c + cr + re + ew + w$

Since, for example, a word-initial cr- and a word-final -ew are not found in domestic Norwegian words, crew is picked out as an anglicism candidate. However, due to the typological closeness of English and Norwegian, many anglicisms cannot be identified using this method, since they have an orthography which is not specifically English-looking. This applies to a word such as date, with the following bigrams:

^d + da + at + te + e$

None of these bigrams is atypical for domestic Norwegian words. For this reason, the anglicism identifier also uses dictionary lookup to identify words of English origin. Specifically, it uses a targeted word list that consists of all the word forms found in the BNC that are not found in the Norwegian dictionary Bokmålsordboka. This algorithm identifies date as an anglicism candidate. Finally, words that involve productive morphemes like reality and temptation are picked out as anglicism candidates by means of a set of regular expressions. The notion of “anglicism candidate” is intended to convey that one cannot be sure as to the origin of the words automatically picked out. This could be because the words are of non-English foreign origin, such as capo, because a word represents a Norwegian-English homograph, such as dull, which, in theory, could be the imperative form of a Norwegian verb dulle ‘pamper’ but which coincidentally does not occur as such in the corpus, or because the programming rules over-generate and pick out domestic words that may look like anglicisms (e.g. korthekken ’110 metre hurdles’). Therefore, a manual check of retrieved candidates is needed. Nevertheless, the semi-automatic method reduces the need for manual work in anglicism detection.
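The three sources of evidence described above (character n-grams, dictionary look-up and regular expressions) can be combined along the following lines. This is a minimal sketch rather than the module described in Andersen (2005): the names are hypothetical, and the three tiny resources are toy stand-ins for the BNC-derived n-gram statistics, the targeted word list and the project's set of regular expressions.

```python
import re

# Toy stand-ins (assumptions); the real resources are derived from the BNC
# and Bokmålsordboka and are far larger.
ATYPICAL_BIGRAMS = {"cr", "ew"}          # bigrams treated as English-looking
ENGLISH_ONLY_FORMS = {"date"}            # forms in the English list but not the Norwegian one
MORPHEME_PATTERN = re.compile(r"reality|temptation")


def char_bigrams(word):
    """Character bigrams of a word, with ^ and $ marking word beginning and end."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def anglicism_evidence(word):
    """List which of the three methods flag `word` as an anglicism candidate."""
    evidence = []
    if any(bigram in ATYPICAL_BIGRAMS for bigram in char_bigrams(word)):
        evidence.append("n-gram")
    if word.lower() in ENGLISH_ONLY_FORMS:
        evidence.append("lexicon")
    if MORPHEME_PATTERN.search(word.lower()):
        evidence.append("regex")
    return evidence


print(char_bigrams("crew"))   # ['^c', 'cr', 're', 'ew', 'w$']
for word in ["crew", "date", "barnereality", "tidsklemma"]:
    print(word, anglicism_evidence(word))
```

A word is a candidate if the returned list is non-empty. Keeping track of which method fired, rather than returning a bare yes or no, would make it easier to see where over-generation arises during the manual check.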

The machine-based detection of anglicisms that do not contain overt English orthography is a much more complicated task. Norwegified anglicisms like døll, podkast, and jønkfood (cf. above) and new loan translations like nedlasting are identified as new words according to the routines described above. However, it is more problematic to identify semantic loans, since they involve no formal neology but a change in use of existing forms, like the use of karakter in the sense of ‘character in a film or literary work’, or the new sense of the verb disse, ‘to dis’, from English discredit, in addition to its original sense ‘to swing’. Although not currently extracted by machine methods, these are in principle also machine-retrievable. A possible way to extract them would be in the fashion of the AVIATOR project, also based on a corpus of journalism, where the developers “have been able to discover changes in sense relations in text over time by monitoring the change of collocational profiles” (Renouf 2007: 39) of words.

At the current stage, the anglicism detection module is a work in progress, and experiments with alternative methods are ongoing, using supervised machine learning techniques based on TiMBL, in combination with the Java-based data mining software Weka (Losnegaard & Lyse forthcoming). The current precision of the module is about 75%, based on a gold standard of 10,000 manually identified anglicisms. The module identifies 128,588 anglicism candidates among 1,469,925 unique word forms in the corpus, amounting to 8.7%. Table 2 shows the result of applying the anglicism detection module to the words in the corpus, listing the most common unigrams overall in the left column and the most common anglicisms in the Norwegian Newspaper Corpus in the right column.

Table 2. Most frequent words and anglicisms in the Norwegian Newspaper Corpus.

Word form  Frequency   Anglicism       Frequency
i          24850781    scoret          72520
og         19158595    keeper          28288
er         14432730    sex             20711
til        11937335    manager         19606
[?]        11930944    scoring         18042
som        11530249    score           17348
det        10948982    verdenscupen    15564
å          10475491    scoringer       13337
av         10146256    rock            12199
en         10040356    ishockey        9957
for        10029477    scorer          9828
at         9562804     toppscorer      9823
har        8808917     comeback        9482
med        8462376     cupen           9095
ikke       6147810     jazz            8552
de         6031542     headet          8296
om         5095203     mobbing         7524
den        4948322     scoringen       6862
et         4437130     corner          6611
fra        4217612     keeperen        5874
var        4027751     canadiske       5289
han        3608843     cupfinalen      5188
seg        3420451     matchvinner     5175
ble        3017390     back            4900
sier       2959084     that            4880

The most common anglicism, the past tense form of the verb score, ranks 772nd on the unigram frequency list. We note that vocabulary from sports (scoret, keeper, manager), music (rock, jazz) and general vocabulary (sex, mobbing) are represented among the frequent anglicisms.

A manual classification of the most frequent anglicisms gives the result presented in Figure 2 and Table 3.

Table 3. Most frequent anglicisms by topical category.

Category            Anglicisms
Sports              score/-r/-t; scoring/-er; headet, cupen, corner, match, volley
General vocabulary  sexy, hint, servicen, must, audition, tagging, matching
Music               musical, rocka, medley, rockeband, soul, country, blues, jazz, rock
Popular culture     science, fiction, trailer/-e/-en, action, thriller
Travel              campingplassen, campingvogner, sightseeing, charter, cruiseskip/-et, booket, cruise
Food/drink          cola, bacon, whisky, pizza
Technology/ICT      blogg/-er/en; iPhone; mail
Business/economy    business, shipping, offshore


Figure 2. Distribution of anglicisms by domain.

5. Collocation analysis

As mentioned above, anglicisms are often imported as lexicalised phrases which constitute multiword collocations. Examples of such multiword anglicisms are due diligence, easy listening, break even and a set of attitudinally salient discourse markers such as get a life or the ironical yeah right. Given the occurrence of borrowed phrases such as these, the correct analysis of anglicisms requires processing of multiword units. Again, the notion of n-gram becomes crucial, defined as ‘recurrent strings of uninterrupted word forms’ (Stubbs 2007: 90), and statistical measures of association provide a key to the identification of such units in running text. The correct identification and segmentation of multiword collocations is crucial for lexicography and language technology purposes, such as improvement of word class taggers, disambiguation of homographic words etc. This part of the analysis relies on well established methods for collocation analysis, with a view to identifying tight collocations (Renouf & Sinclair 1991), i.e. sequences of words with a strong tendency to collocate. We have tested various means for identifying collocations (statistical associations of word forms in the text), including mutual information, Z-score etc. The rationale for these calculations is that they are important for lexical acquisition, in that these scores “can help the lexicographer decide which collocate should be included in the lexicon” (Ooi 1998: 83).
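To make the notion of an association measure concrete, the sketch below computes several common scores for a single bigram from its contingency table of observed and expected frequencies. The formulas follow standard textbook definitions; whether they coincide in every detail with the variants used in the project (for instance the corrected z-score) is not established by the text, and the discounted log odds ratio is an assumed variant. The function name and the frequency counts in the example are invented for illustration.

```python
import math


def association_scores(pair_freq, first_freq, second_freq, corpus_size):
    """Association scores for a bigram (w1, w2); assumes pair_freq > 0.

    pair_freq   = occurrences of the bigram w1 w2
    first_freq  = occurrences of w1 as first element
    second_freq = occurrences of w2 as second element
    corpus_size = total number of bigram tokens
    """
    o11 = pair_freq                                   # w1 followed by w2
    o12 = first_freq - pair_freq                      # w1 followed by another word
    o21 = second_freq - pair_freq                     # another word followed by w2
    o22 = corpus_size - first_freq - second_freq + pair_freq
    e11 = first_freq * second_freq / corpus_size      # expected under independence
    return {
        "pointwise_mi": math.log2(o11 / e11),
        "t_score": (o11 - e11) / math.sqrt(o11),
        "z_score": (o11 - e11) / math.sqrt(e11),
        "dice": 2 * o11 / (first_freq + second_freq),
        "jaccard": o11 / (o11 + o12 + o21),
        # 0.5 is added to each cell so that empty cells do not cause division by zero
        "log_odds_ratio": math.log(((o11 + 0.5) * (o22 + 0.5)) / ((o12 + 0.5) * (o21 + 0.5))),
    }


# Invented counts: a rare pair whose words occur only together,
# and a frequent pair of otherwise common words.
rare = association_scores(78, 78, 78, 1_000_000)
frequent = association_scores(50_000, 200_000, 150_000, 1_000_000)
print(rare["pointwise_mi"] > frequent["pointwise_mi"])   # True
print(frequent["t_score"] > rare["t_score"])             # True
```

The two comparisons show why the choice of measure matters: pointwise MI rewards exclusivity regardless of frequency, whereas the t-score is dominated by sheer frequency.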

This part of the project is more fully described in Lyse & Andersen (forthcoming). We first produced n-gram statistics for the entire corpus and calculated rank scores using a variety of association measures. For bigrams, collocational strength was measured by nine different calculations. The rank order and frequency of n-grams reveal information about their degree of lexicalisation. The results of the different association measures are seen in Table 4, which lists the top ten bigrams for each method (Evert 2004).

Table 4. Results of different association measures of bigrams.

Odds ratio discr.       Z-score corr.           Chi-squared           Pointwise MI
corned beef             corned beef             graderte dokumenter   vilkårsett skattefritaking
etc. etc.               etc. etc.               uttørkete elveleier   varannan damernas
practical jokes         practical jokes         dår lige              unilaterally destroyed
lorem lipsum            lorem lipsum            ankende part          twam asi
lipsum lorem            lipsum lorem            hellige kuer          suvas bohciidit
journ anm               kaustisk soda           mistankens skjerpede  slow starters
eines fahrenden         journ anm               fulladet batteri      skrimmi nimmi
commedia dellarte       eines fahrenden         lytter oppmerksomt    rollon rolloff
haemophilus influenzae  mørkets frambrudd       innoverskrudd corner  respiratory infection
%. stem.                commedia dellarte       anaerobe terskelen    redu sert

T-score   Local MI  Likelihood ratio  jaccard                        dice
til å     til å     til å             bekjente sitter                bredere parti
for å     for å     for å             bør ansette                    fremdeles parkert
å få      å få      i i               ene leggbenet                  mars avsluttes
i en      i en      å få              forbeholdt utenlandske         overraskende foreslo
å ha      å ha      å i               fremstille meg                 unngå fiendtlig
om å      om å      å ha              fulle treninger                bør nyanseres
i den     å være    i en              ganske prekær                  fra lavkostland
med å     å bli     å være            hindre stridsvognene           blir akkompagnert
i det     i den     å bli             hjemlige filmmiljø             all skiten
å være    med å     i å               kommunalministeren fremholder  direkte appeller

We have also preliminarily evaluated their usefulness in the identification of lexicalised phrases, by analysing the top of the ranked lists and the degree to which they contain multiword collocations, such as technical terms. The most striking observation from the comparison of alternative association measures is the considerable difference between the various rankings, and consequently the difference in suitability of the methods to the task of identifying lexicalised phrases. A more thorough investigation is planned (Lyse & Andersen forthcoming), but some general observations can be made from our preliminary investigation. The association measures t-score, local-MI and log likelihood ratio give a high rank to highly frequent formulaic sequences such as det er, til å, for å, i en, å komme and millioner kroner that are not lexicalised units. These are of little importance to lexicography, although they may well be interesting from a phraseological or other point of view. It should also be pointed out that a few frequent lexicalised phrases like i tillegg (‘in addition’), i går (‘yesterday’) and i fjor (‘last year’) receive a high rank with these measures. Other measures are much better suited for isolating relevant multiwords by giving a high rank to lexicalised phrases, specifically chi-square, z-score-corrected, odds-ratio-discriminative and pointwise MI, which pick out many technical terms like anaerobe terskelen, eneggede tvillingene, honorære konsuler and amyotrofisk lateralsklerose, and multiword anglicisms like lucky loosers, corned beef, practical jokes, slow starters, jumpers knee, consumer confidence, honky tonk, splendid isolation, due diligence, extreme makeover and danish dynamite. Of these, odds-ratio-discriminative and z-score-corrected favour bigrams with a low frequency whose word forms are used exclusively as collocates of each other and not in other contexts, such as practical jokes, which occurs 78 times in the corpus. Finally, two association measures, dice and jaccard, appear unable to pick out collocations relevant for lexicography, since there are no lexicalised multiwords among the top-ranked items, nor do they seem particularly apt for phraseological purposes.
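The contrast between frequency-sensitive and exclusivity-sensitive measures can be illustrated with a minimal sketch. It uses the standard textbook formulas for pointwise MI and t-score and is not the project's own implementation. The function name, the corpus size and all counts are hypothetical, except the bigram frequency of 78 for practical jokes, which is taken from the text above.

```python
import math

def association_scores(o11, f1, f2, n):
    """Score a bigram (w1, w2).

    o11: observed count of the bigram
    f1:  count of w1 in first position
    f2:  count of w2 in second position
    n:   total number of bigram tokens (hypothetical here)
    """
    e11 = f1 * f2 / n  # expected count if w1 and w2 were independent
    return {
        "pmi": math.log2(o11 / e11),              # pointwise mutual information
        "t_score": (o11 - e11) / math.sqrt(o11),  # t-score
    }

N = 1_000_000  # assumed corpus size, for illustration only

# A frequent formulaic sequence such as "i en" (counts invented).
frequent = association_scores(o11=5000, f1=20000, f2=15000, n=N)

# A rare pair whose words occur only together, such as "practical jokes".
rare = association_scores(o11=78, f1=78, f2=78, n=N)

print("frequent:", frequent)
print("rare:    ", rare)
```

With these assumed counts, the t-score ranks the frequent sequence above the rare pair, whereas pointwise MI reverses the order, which is the kind of divergence between rankings described above.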

A manual check of the 500 most highly ranked bigrams according to one of the most promising association measures, the odds ratio calculation, showed that approximately 18.6% of them were anglicism candidates, as opposed to 5.4% of all neologisms in the corpus; cf. Table 1 above. Another important observation is that many highly ranked non-anglicism bigrams are multiword borrowings from other languages, many of them of Latin origin, such as commedia dell’arte, abortus provocatus, solar plexus, annus horribilis, notarius publicus, lingua franca, tabula gratulatoria, tabula rasa and mea culpa. Among the highly ranked bigrams we also find idiomatic Norwegian phrases like navns nevnelse, flammenes rov, rangen stridig, tenners gnissel and bange anelser. Examples like cage aux and erat demonstrandum illustrate an interesting methodological point, namely that the investigation of longer n-grams should precede that of shorter ones, so that phrases like cage aux folles and quod erat demonstrandum are picked out as trigrams rather than as truncated bigrams.
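The point about processing longer n-grams first can be sketched as a simple pruning step: a bigram candidate is dropped when nearly all of its occurrences fall inside a single trigram. This is only one possible way of operationalising the idea. The function name, the 0.9 threshold and all counts below are assumptions made for illustration.

```python
def prune_subsumed_bigrams(bigram_counts, trigram_counts, min_share=0.9):
    """Keep only bigrams that are not mostly covered by one trigram.

    bigram_counts:  dict mapping (w1, w2) to a corpus count
    trigram_counts: dict mapping (w1, w2, w3) to a corpus count
    min_share:      assumed share of a bigram's occurrences that must lie
                    inside one trigram for the bigram to be dropped
    """
    kept = {}
    for bigram, count in bigram_counts.items():
        # Highest count of any trigram that starts or ends with this bigram.
        covering = max(
            (c for tri, c in trigram_counts.items()
             if tri[:2] == bigram or tri[1:] == bigram),
            default=0,
        )
        if covering < min_share * count:
            kept[bigram] = count
    return kept

# Invented counts for illustration.
bigrams = {
    ("cage", "aux"): 40,
    ("erat", "demonstrandum"): 25,
    ("corned", "beef"): 60,
}
trigrams = {
    ("cage", "aux", "folles"): 40,
    ("quod", "erat", "demonstrandum"): 25,
    ("corned", "beef", "hash"): 3,
}

kept = prune_subsumed_bigrams(bigrams, trigrams)
print(kept)
```

Under these assumed counts only corned beef survives as a bigram candidate, while cage aux and erat demonstrandum are left to be handled as parts of their trigrams.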

That our proposed n-gram-based method is capable of identifying relevant multiword units that are English loan words is seen from the following inventory of forms from the top-500 list, ranked by the odds ratio statistic:

jumpers knee; consumer confidence; corned beef; practical jokes; honky tonk; splendid isolation; due diligence; extreme makeover; danish dynamite; final countdown; devil wears; adult contemporary; lame duck; whiter shade; coincident indicators; wishful thinking; shetland sheepdog; european currency; whistle blowers; passenger cabins; metabolic activators; guilty pleasures; flip flops; ferocious freshwater; cheap seats; irish coffee; tribal peoples; scroll lock; predatory pricing; paralytic shellfish; corpore sano; cole slaw; mountain bike; certain regard; unilaterally destroyed; rollon rolloff; respiratory infection; border collie; parental advisory; main objectives; compounds resembling; combined ratio; jobless claims; banana split; slow motion; unit linked; attention deficit; black metal; tabbed browsing; shabby chic; french fries; boogie woogie; incidents involving; brown sugar; corporate governance; technical visits; crystal meth; corn flakes; pancreas disease; negro spirituals; yellow submarine; smooth operator; electric boogie; wide receiver; thousand island; graphic novel; gentlemans agreement; clotted cream; sudden death; hey hey; whole lotta; ooh aah; documentary evidence; brain drain; impartial investigation; worst case; alaska pollack; instant messaging; stay behind; sore shins; lonely hearts; ethical treatment; conspicious consumption; cash flow; driving range; plastic fantastic; plea bargaining; movie awards; manchester united; spin doctors; big business; royal flush; early works; tutti frutti; lucky loosers; electoral votes; pole position; stormy weather

6. Frequency profiling

Given the vast number of new words that occur in the corpus, it is particularly important to study frequency, as this information holds the key as to whether or not a particular form or meaning should be included in the dictionary, and in which orthographic form (Pulcini 2008). Overall frequency and frequency development over time have a direct bearing on the relevance for lexicographic inclusion of individual word forms and neologistic phrases. Generally, new words with a high frequency are more relevant than lower-frequency words, and words with a steady or gradually increasing frequency in the period represented in the data are more relevant than those with a fluctuating or decreasing frequency. To illustrate, the lexicographer may be faced with the choice of whether to include the neologism weblogg as well as its synonym blogg in a Norwegian dictionary. Corpus frequency provides an important clue and reduces reliance on intuition when making such considerations, and sometimes the corpus frequencies speak for themselves (Figure 3):

Figure 3. Frequency profiles of weblogg.* and blogg.* in the Norwegian Newspaper Corpus (the notation .* indicates truncation).

However, other factors also come into play, and infrequent items may also be worth including. In the selection process one cannot rely exclusively on frequency; subjective considerations made by the lexicographer are sometimes necessary (Pulcini 2008).

Lexicographers and other users of the Norwegian Newspaper Corpus can retrieve usage statistics of individual words as images created on the fly. For more systematic considerations of frequency developments, we have developed frequency filters which extract frequency profiles for new words and select the words that look the most interesting from the point of view of frequency development over time. The filters are based on linear regression statistics calibrated with the least squares method (Fjeld & Nygaard forthcoming). The first filter picks out infrequent words (n < 10), which are discarded and not included among the neologism candidates that are manually edited by lexicographers. The second filter picks out words with a high and stable frequency. These are manually checked by a lexicographer, and the most relevant words are included in the Norwegian Word Bank. The third filter identifies words with a high and increasing frequency, most of which are included in the Norwegian Word Bank.
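A minimal sketch of how such filters could work is given below. It fits a least squares line to a word's per-period counts and sorts the word into one of the three groups. It is not the implementation described in Fjeld & Nygaard: only the cut-off of 10 occurrences comes from the text, while the function names, the relative slope threshold and the example counts are assumptions, and the sketch does not attempt to model what counts as a "high" frequency.

```python
def least_squares_slope(counts):
    """Slope of the least squares line through (0, c0), (1, c1), ...

    Requires at least two time periods.
    """
    n = len(counts)
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx

def classify(counts, min_total=10, rel_slope=0.1):
    """Assign a frequency profile to a filter group.

    min_total: words with fewer occurrences are discarded (n < 10 in the text)
    rel_slope: assumed threshold; slope relative to the mean count per period
    """
    total = sum(counts)
    if total < min_total:
        return "discard"
    mean = total / len(counts)
    slope = least_squares_slope(counts)
    if slope > rel_slope * mean:
        return "increasing"
    if abs(slope) <= rel_slope * mean:
        return "stable"
    return "other"  # decreasing profiles

# Invented per-period counts for four hypothetical words.
print(classify([1, 0, 2, 1]))       # too rare
print(classify([50, 52, 49, 51]))   # flat profile
print(classify([5, 20, 60, 120]))   # rising profile
print(classify([120, 60, 20, 5]))   # falling profile
```

Expressing the slope threshold relative to the mean count is a design choice of this sketch, intended to make frequent and less frequent words comparable; an absolute threshold or a significance test on the slope would be equally plausible alternatives.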

Examples of some of the words that have been picked out by the various filters are given in Table 5.

Table 5. Result of frequency-based filtering of new words.

Not included in Norwegian Word Bank: alpindamene, bøllefri, dettestår, vikig, viestad

Included but not new: akilles, bakfull, lukeparkere

Included and new: dyssosial, fistet, flatskjermen, foodprosessor, hacker, hajj, lagbygging, omrokeringer, zappe

In the category of words not relevant for inclusion we find spelling errors (dettestår, viestad, vikig) and non-lexicalised compounds (alpindamene, bøllefri). Among forms which proceed to manual lexicographical editing, we find real neologisms like hajj and foodprosessor, but also, interestingly, lexicographical lacunae, that is, words which are not new but are nevertheless missing from existing dictionaries, like akilles, bakfull (‘hung over’) and the verb lukeparkere (‘parallel park’). This shows that the inductive method for neologism extraction has an advantage over manual methods, in that it may lead to a better coverage of existing words in the language.

7. Concluding remarks

In line with earlier projects like COBUILD and ACRONYM, I have attempted to show in this paper that a corpus-based approach to lexicography is a useful one and, more specifically, that there are several advantages of using a large monitor corpus as a basis for studying neologisms and anglicisms. The Norwegian Newspaper Corpus is a self-expanding dynamic corpus of considerable size and coverage, and it appears to be a good resource for such purposes. First, it captures micro-level developments such as new word forms and new uses of old words. Second, it allows for large-scale quantitative studies. Third, it provides a continuously updated inventory. Fourth, it involves a continuous monitoring of language development by statistical comparison between different time sections of the corpus, thus enabling the study of ‘short-term change in diachrony’ (Kytö, Rudanko & Smitterberg 2000: 92). Thus, our Norwegian-based experience so far corroborates that of earlier projects.

Thanks to recent advances in corpus building and technology, word formation and neology can be studied in empirical quantitative detail. The monitor corpus allows us to become less dependent on our intuitions and to rely on statistical facts. The corpus-based approach is a valuable supplement to traditional lexicography/terminography, which involves manual extraction of words. It does not offer the full answer as to which forms to include and which forms to leave out, but it promises a systematic and empirically based proposal of where to start looking. This will hopefully lead to a significant reduction of manual work and a radical simplification of the task of looking for the needle in the linguistic haystack.

Notes

[1] The Norwegian Newspaper Corpus project is funded by the Research Council of Norway; cf. http://avis.uib.no/.

[2] See http://www.sprakradet.no/upload/9832/norsk_i_hundre.pdf.

[3] See https://www.regjeringen.no/no/dokumenter/stmeld-nr-35-2007-2008-/id519923/.

[4] In fact, this example contains another salient loan translation of a structural kind: når det kommer til is a vogue phrase used as a discourse marker and a direct translation of ‘when it comes to’.

Electronic references

Bokmålsordboka. 27 March 2011. http://www.dokpro.uio.no/ordboksoek.html.

The British National Corpus (BNC). 27 March 2011. http://www.natcorp.ox.ac.uk/.

The Corpus of Contemporary American English (COCA). 27 March 2011. http://corpus.byu.edu/coca/. [link updated 25 Nov 2016]

The Norwegian Newspaper Corpus. 27 March 2011. http://avis.uib.no/.

Tilburg Memory-Based Learner (TIMBL). 27 March 2011. http://ilk.uvt.nl/timbl/ [link no longer available, see https://languagemachines.github.io/timbl/ 25 Nov 2016]

WebCorp. 27 March 2011. http://www.webcorp.org.uk/.

References

Andersen, Gisle. 2005. “Assessing algorithms for automatic extraction of anglicisms in Norwegian texts”. Proceedings of Corpus Linguistics 2005. Birmingham: University of Birmingham.

Andersen, Gisle. 2010. “Halvautomatisk ekserpering av anglisismer i norsk” [Semi-automatic extraction of anglicisms in Norwegian]. Nordiska studier i lexikografi 10: 72-85.

Andersen, Gisle. Forthcoming. “A corpus-based study of adaptations of English import words in Norwegian”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.

Andersen, Gisle & Knut Hofland. Forthcoming. “Building a large monitor corpus based on newspapers on the web”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi & Eros Zanchetta. 2009. “The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora”. Language Resources and Evaluation 43: 209-226.

Davies, Mark. 2009. “The 385+ million word Corpus of Contemporary American English (1990-2008+): Design, architecture, and linguistic insights”. International journal of corpus linguistics 14: 159-190.

Evert, Stefan. 2004. “The Statistics of Word Cooccurrences: Word Pairs and Collocations”. IMS. University of Stuttgart.

Faarlund, Jan Terje, Svein Lie & Kjell Ivar Vannebo. 1997. Norsk referansegrammatikk. Oslo: Universitetsforl.

Fjeld, Ruth Vatvedt & Lars Nygaard. Forthcoming. “Lexical neography in modern Norwegian”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.

Furiassi, Cristiano. 2008. “What dictionaries leave out: new non-adapted Anglicisms in Italian”. Investigating English with Corpora, ed. by Aurelia Martelli & Virginia Pulcini, 153-169. Monza: Polimetrica.

Graedler, Anne-Line. 1998. Morphological, semantic and functional aspects of English lexical borrowings in Norwegian. Oslo: Faculty of Arts Scandinavian University Press.

Görlach, Manfred. 2001. A Dictionary of European anglicisms: a usage dictionary of anglicisms in sixteen European languages. Oxford: Oxford University Press.

Haugen, Einar. 1950. “The analysis of linguistic borrowing”. Language 26: 210-231.

Hofland, Knut. 2000. “A self-expanding corpus based on newspapers on the Web”. The Second International Language Resources and Evaluation Conference (LREC) Paris: European Language Resources Association (ELRA).

Hundt, Marianne, Carolin Biewer & Nadja Nesselhauf. 2007. Corpus linguistics and the web. Amsterdam: Rodopi.

Johansson, Stig & Anne-Line Graedler. 2002. Rocka, hipt og snacksy: om engelsk i norsk språk og samfunn. Kristiansand: Høyskoleforl.

Kilgarriff, Adam & Gregory Grefenstette. 2003. “Introduction to the Special Issue on Web as Corpus”. Computational Linguistics 29: 1-15.

Kilgarriff, Adam & David Tugwell. 2002. “Sketching words”. Lexicography and Natural Language Processing – A Festschrift in Honour of B.T.S. Atkins, ed. by Marie-Hélène Corréard, 125-137. Gothenburg: EURALEX.

Kytö, Merja, Juhani Rudanko & Erik Smitterberg. 2000. “Building a bridge between the present and the past: A corpus of 19th-century English”. ICAME Journal 24: 85-97.

Losnegaard, Gyri & Gunn Inger Lyse. Forthcoming. “A data-driven approach to anglicism identification in Norwegian”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.

Lyse, Gunn Inger & Gisle Andersen. Forthcoming. “Collocations and statistical analysis of n-grams – multiword expressions in newspaper text”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.

Ooi, Vincent B. Y. 1998. Computer corpus lexicography. Edinburgh: Edinburgh University Press.

Pulcini, Virginia. 2008. “Corpora and lexicography: the case of a dictionary of Anglicisms”. Investigating English with corpora: studies in honour of Maria Teresa Prat, ed. by Aurelia Martelli & Virginia Pulcini, 189-203. Monza: Polimetrica.

Renouf, Antoinette. 1996. “The ACRONYM Project: Discovering the textual thesaurus”. Synchronic corpus linguistics, ed. by Carol E. Percy, Charles F. Meyer & Ian Lancashire, 171-187. Amsterdam & Atlanta: Rodopi.

Renouf, Antoinette. 2007. “Corpus development 25 years on: from super-corpus to cyber-corpus”. Corpus linguistics 25 years on, ed. by Roberta Facchinetti, 27-49. Amsterdam, New York: Rodopi.

Renouf, Antoinette, Andrew Kehoe & Jay Banerjee. 2005. “The WebCorp Search Engine: A holistic approach to web text search”. Corpus Linguistics 2005. Birmingham: University of Birmingham.

Renouf, Antoinette & John McH. Sinclair. 1991. “Collocational frameworks in English”. English Corpus Linguistics - Studies in Honour of Jan Svartvik, ed. by Karin Aijmer & Bengt Altenberg, 128-143. London, New York: Longman.

Sandøy, Helge. 1997. Lånte fjører eller bunad? Om norsk skrivemåte av importord. Oslo: Kulturdepartementet/Norsk språkråd.

Sinclair, John McH., ed. 1987. Looking up. London & Glasgow: Collins ELT.

Stubbs, Michael. 2007. “An example of frequent English phraseology: distributions, structures and functions”. Corpus Linguistics 25 Years on, ed. by Roberta Facchinetti, 89-105. Amsterdam, New York: Rodopi.

Wangensteen, Boye. 2002. “Nettbasert nyordsinnsamling”. Språknytt, 17-19.