This paper reports on the status of ongoing corpus building and lexicographical work within the framework of the Norwegian Newspaper Corpus project. Specifically it describes the work flow, tools and methods used in the identification and analysis of new anglicisms in Norwegian. Surveying the lexical borrowing from English serves a variety of purposes, including lexical acquisition, the extraction of terminology, language technology, and more general linguistic purposes such as surveying the amount and inventory of English loan words in various usage domains. Observations from such a survey may form an empirical basis for language policy decisions, such as considering efforts for preventing domain loss. While previous work in Norwegian lexicography has generally relied on manual methods for excerpting new words – and for identifying anglicisms among the new words, the current project is an effort to develop tools which automatise the process of identifying, segmenting and analysing new loan words from English. A system for corpus-based language monitoring has been set up at Uni Digital (formerly Unifob AKSIS), in close cooperation with lexicographers at the University of Oslo. The paper describes briefly the overall workflow and focuses especially on alternative methods for identifying anglicisms (lexicon-based, n-gram-based, combinatory methods). The paper also presents some main trends and statistics regarding the use of anglicisms, as well as future plans for exploitation of this material.
1. Introduction
Among the significant events in recent corpus development, we find the emergence of dynamic corpora and web-based corpora, as well as the exploitation of corpora in lexicographic projects (Renouf 2007). Initiated by Atkins and Sinclair in the 1970s, the Collins COBUILD project at the University of Birmingham was the first lexicography initiative which systematically used a corpus as its main source of knowledge about words and their use in the language (Sinclair 1987). This effort has been described as a revolution which “changed the principles and methods of dictionary making” (Pulcini 2008: 189) and which enabled lexicographers to “view the evidence of how a word was used without the arbitrary filter of who thought what was an interesting example of a word” (Kilgarriff & Tugwell 2002: 125). Corpus lexicography provides a methodology “for measuring hard evidence of the lexical behaviour of words since this will (arguably) result in a more representative, coherent and consistent output than a lexicon produced from conventional means” (Ooi 1998: 37). With respect to corpus building, “the first ‘dynamic’ corpus of unbroken chronological text” (Renouf 2007: 36) was established in 1990 as part of the AVIATOR project in Birmingham, using the Times newspaper as its source. This was followed by other large, monitor corpora such as the ACRONYM project corpus (Renouf 1996) and more recently the Corpus of Contemporary American English (Davies 2009). Since the turn of the millennium, it has become increasingly common to develop and explore web-based corpora, aka. ‘cyber-corpora’ (Renouf 2007), resulting in a growing body of corpus-based studies using the web as its prime source of data (Kilgarriff & Grefenstette 2003; Hundt, Biewer & Nesselhauf 2007). Corpus-building efforts such as WebCorp (Renouf, Kehoe & Banerjee 2005) and the Wacky initiative (Baroni, Bernardini, Ferraresi & Zanchetta 2009) use web crawler technology for making the web serve as a linguistic corpus.
Inspired by such technological innovations, The Norwegian Newspaper Corpus project [1] is an ongoing effort to create a large monitor corpus representing present-day Norwegian in both its written standard varieties, Bokmål and Nynorsk, and developing associated language processing tools particularly tailored for lexicographical work (Hofland 2000, Wangensteen 2002, Andersen 2005, 2010). The web-based corpus is compiled by daily harvesting and processing of published texts from the web edition of several Norwegian newspapers. It is a ‘modern diachronic corpus’, in the sense described by Renouf (2007), enabling the study of language change, neologistic usage and lexical productivity and creativity as it unfolds in written language. The open-ended monitor corpus can be used to study language as a changing phenomenon at the diachronic micro-level, by comparing time frames calibrated at a daily, weekly, monthly or yearly basis within the time-span the corpus represents. This makes it particularly well suited for studying ongoing lexical and stylistic innovation and variation, in addition to other types of innovation, such as grammatical change. Although the main rationale of the Norwegian Newspaper Corpus has been linked to the need for updated text for lexicography purposes, a much wider usage can be envisaged (Andersen & Hofland forthcoming).
The current paper gives an outline of the methods and language processing tools that have been developed for corpus-based lexicographical work within the framework of the Norwegian Newspaper Corpus project. It focuses in particular on the work flow, tools and methods used in the identification and analysis of English loan words that occur in the Norwegian language. The methods described below include tools for neologism extraction, anglicism detection, collocation analysis and frequency profiling.
2. The system architecture of the Norwegian Newspaper Corpus
The text collection for the Norwegian Newspaper Corpus began in 1998. As of September 2010, the corpus consists of about 850 million words, which makes it the largest searchable corpus of Norwegian. The daily growth is on average approximately 230,000 words. The corpus consists of the full web version of about 25 Norwegian newspapers. The selection of sources has been limited to newspapers that also have a printed counterpart. This includes large, national newspapers like Aftenposten, Dagbladet and VG, major regional newspapers like Bergens Tidende and Stavanger Aftenblad and local newspapers like Sogn Avis and Gudbrandsdølen/Dagningen. Most of the newspapers are of general interest, but a few niche publications are also included, specifically the newspaper Nationen, with a special focus on agricultural issues, Vårt Land, which is explicitly religious and Morgenbladet, a weekly newspaper with a strong academic profile. The full political spectrum is also represented, ranging from the business newspaper Dagens Næringsliv to Klassekampen representing the political left. The selection of newspapers has resulted in a large corpus with a wide topical coverage containing relatively homogeneous data, despite the fact that it is harvested from the web. Although maximal efforts have been made to ensure a balance between the two language varieties, the Bokmål variety is massively larger than Nynorsk. This discrepancy reflects the degree to which Bokmål and Nynorsk are used in newspapers on the web, but it is not necessarily representative of the use of the two varieties in other contexts.
The system involves several stages of processing, most of which run automatically due to a self-executing batch file. Its architecture is visualised as a data flow diagram in Figure 1 and involves the following main steps:
1) harvesting: two different web-crawler programmes, w3mir and wget, download the full internet version of Norwegian newspapers
2) boilerplate removal: a set of specifically designed programs automatically select the core text, including headlines, introductory and main text, and image texts, but discarding advertisements, navigation menus, metatext, html code, etc.
3) language classification: the texts are classified as either Bokmål or Nynorsk, while English texts are discarded
4) text annotation: metadata concerning date, author and source are extracted from the source texts, and the texts are machine classified according to topic and morphosyntactically tagged by the Oslo-Bergen tagger
5) new word form extraction: the inventory of word forms of newly harvested texts is compared with an accumulated list of word forms, and a list of forms not previously recorded is extracted and added to the accumulated word list
6) new word form classification: new word forms are classified according to orthographical criteria, and anglicism candidates are identified
7) frequency profiling: statistical filters are used to identify neologisms that are most relevant for lexicography
8) lexical database entry: selected neologisms are registered in the Norwegian Word Bank with relevant morphosyntactic and semantic information
Steps 1–6 are performed automatically on a daily basis, while steps 7–8 require manual intervention and are performed in less regular batches. The development of tools for boilerplate removal has been a particularly time-consuming and complicated task (Andersen & Hofland forthcoming), and this procedural step also includes algorithms for removal of duplicate texts. The current paper is mostly concerned with steps 5–7, to be described in more detail below.
3. Extraction of new words
The tangible output of the processing described above is a substantive daily growth in terms of types and tokens in the corpus. Of the c. 230,000 running words that are daily added to the text database, on average 1,300 are new word forms. It should be pointed out that, in line with the corpus linguistic tradition, a ‘word’ is a technical concept that is defined as any sequence of graphemes found between two spaces in authentic running text. A ‘new word’ is any word that is not included in a large, accumulated reference word list (cf. step 5 above), against which all harvested text is checked. As of January 2010, this reference list consists of about 3.9 million word forms, including a full-form lexicon derived from the comprehensive dictionary Bokmålsordboka. Table 1 shows the variety of items that are considered new words by this definition, including some categories that are highly relevant for lexicography and others which are not.
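The comparison against the accumulated reference list can be illustrated with a minimal sketch. It assumes plain whitespace splitting as the working definition of a 'word' and an in-memory set as the reference list; the function name `extract_new_word_forms` and the toy data are hypothetical and do not reproduce the project's actual code.

```python
def extract_new_word_forms(text, known_forms):
    """Return the word forms in `text` that are not yet in `known_forms`.

    Each new form is also added to `known_forms`, so that it is
    reported only the first time it is seen (cf. step 5).
    """
    new_forms = []
    for form in text.split():          # a 'word' = graphemes between spaces
        if form not in known_forms:
            known_forms.add(form)      # accumulate for later harvests
            new_forms.append(form)
    return new_forms


# Toy reference list (hypothetical); the real list holds millions of forms.
reference = {"Det", "var", "en", "plass."}
day1 = extract_new_word_forms("Det var en skikkelig døll plass.", reference)
day2 = extract_new_word_forms("en skikkelig døll smoothie", reference)
print(day1)
print(day2)
```

Because the reference set is updated in place, a form harvested on one day is no longer new on the next, which mirrors the accumulating word list described above. Punctuation handling is deliberately left out of the sketch.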
In the first category of Table 1, we find words that lack a special orthographic feature (capital letter, hyphen or the like) but consist of lower-case letters only, accounting for about half of the new words. Lexicographers are interested in real neologisms – new, linguistically motivated and authentic lexical items – and these are typically found in this category, represented here by forms like tidsklemma, a new compound meaning ‘the time squeeze’ and pingle, a new lexical item meaning ‘weak, cowardly person’. But the first category also includes lower-case spelling errors not previously recognised, such as ektremistisk (ekstremistisk, ‘extremist’), irrelevant from a lexicographer’s point of view, but nevertheless relevant to the developer of spell checking systems or to the psycholinguist focusing on error studies, etc. Some new words, about 5%, are classified as anglicism candidates, including whistleblower, blogg and subprime. The procedure for classification is described below. Some new words have special orthographic features which make them less relevant for inclusion in dictionaries. About 10% of the new words are productive, hyphenated compounds. Newspaper language refers widely to names of people, places, companies and products, and, unsurprisingly, a substantial proportion of new words, about 30%, are orthographically distinguishable as name candidates (including hyphenated/compound names). Besides, the new words include abbreviations, digits, URLs and e-mail addresses, and a small proportion of garbage (2.3%), that is, letter/symbol combinations that do not conform with any recognisable pattern, such as bokmål‰rart.
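A classification by orthographical criteria of this kind could be sketched as an ordered list of regular expressions. The category labels, the patterns and their order are assumptions made for illustration only; the project's actual rules are not given in this paper.

```python
import re

LOWER = "a-zæøåéèü"  # assumed inventory of lower-case letters

# Ordered rules: the first matching pattern decides the category.
RULES = [
    ("url_or_email", re.compile(r"^(https?://|www\.)\S+$|^\S+@\S+\.\S+$")),
    ("digits", re.compile(r"^\d[\d.,:]*$")),
    ("name_candidate", re.compile(r"^[A-ZÆØÅ][\w-]*$")),
    ("hyphenated_compound", re.compile(rf"^[{LOWER}]+(-[{LOWER}]+)+$")),
    ("lower_case", re.compile(rf"^[{LOWER}]+$")),
]


def classify(form):
    """Assign a new word form to the first orthographic category it fits."""
    for label, pattern in RULES:
        if pattern.match(form):
            return label
    return "garbage"  # no recognisable pattern


for form in ["tidsklemma", "snowboard-dude", "København-slang",
             "2010", "www.vg.no", "bokmål‰rart"]:
    print(form, classify(form))
```

Only the `lower_case` category would then be passed on as the main pool of neologism candidates; spelling errors such as ektremistisk also end up there and cannot be separated by orthography alone.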
4. Anglicisms in Norwegian
It is a well known fact that English words thrive in many languages, including Norwegian (Graedler 1998; Görlach 2001). They may either represent new concepts like podcast or they may be new and vogue words for existing concepts, such as cap. Although not a comprehensive list, the following examples give an impression of how English words can be used in Norwegian contexts:
(1)
Det finnes også en egen kategori for podcast. (AP080115)
There is also a separate category for podcast.
(2)
De oppdaget en mann i hvit cap, mørk jakke og mørk bukse (BT090422)
They saw a man in a white cap, dark jacket and dark trousers
(3)
Prøv den i en smoothie. (VG090208)
Try it in a smoothie.
(4)
Avtalen er forutsatt av due diligence. (DN080611)
The agreement depends on due diligence.
(5)
En übercool snowboard-dude med franske foreldre fra Stavanger. (DB0301)
An übercool snowboard dude with French parents from Stavanger.
(6)
Stilsikkert i croonertradisjonen (DB050111)
True to the style of the crooner tradition
(7)
Likevel var hun 67 sekunder bak Simone Luder (24) under det første parkverdenscupløpet. (AP020511)
Nevertheless, she was 67 seconds behind Simone Luder (24) in the first park world cup race.
(8)
Raser mot barnereality (DB050613)
Raging against child reality
(9)
Vi leier bare ut rom til Manson og crewet hans. (VG990707)
We only hire rooms for Manson and his crew.
(10)
Man går til boka med skepsis, fordi man er vant til den danske formen for coolhet og København-slang i Turèlls dikt (DB071022)
One approaches the book with scepticism, because one is used to the Danish type of coolness and Copenhagen slang in Turèll’s poetry
(11)
Siden Tolkien satt i kveldinga og nerda med alvespråkene sine, er verden blitt okkupert av… (SA011219)
After Tolkien sat in the evenings and [nerded] with his elfish languages, the world has been occupied by…
Anglicisms in Norwegian come as monomorphemic words like smoothie, or as multiword units. Multiword anglicisms may be lexicalised phrases (lexical collocations; cf. below) imported as one unit, like due diligence, a common business language term, or they may be non-lexicalised clusters of anglicisms, like übercool snowboard-dude in (5). They partake in regular Norwegian-based word formation and may undergo various morphological processes after adoption. In addition to constituting words in their own right, it is common that anglicisms are used as part of mixed compounds (Haugen 1950), where the English component may appear in initial, medial or final position in a word, as shown in (6)–(8). Inflectional processes and morphological integration are illustrated by crewet in (9) with the clitic singular definite article -et attached to an English stem. Anglicism stems may also combine with derivational affixes like -het in coolhet (‘coolness’) as in (10), and they may be affected by syntactic change, for example conversion to a different word class, illustrated by nerda (‘nerded’) in (11), which is a verb containing a Norwegian past tense ending -a. It seems likely that the conversion in this case is a post-borrowing process, although nerd as verb is found in English internet usage (but not in the Oxford English Dictionary).
The Norwegian Newspaper Corpus project devotes particular attention to anglicisms for a variety of reasons. Firstly, recent language policy documents, such as Norsk i hundre![2] and Mål og meining[3] have expressed a worry that Norwegian is losing ground, that it is in need of protection from English influence and that there is a risk of domain loss. The proposed corpus-based method allows us to assess foreign influence on Norwegian language both quantitatively and qualitatively. Relevant questions that might be posed include: How large is the influx from English? Is it constant or varying over time? To what extent is the influx dependent on domain, and which domains are in the greatest danger of domain loss? What other variables have a bearing on anglicism density in written language, such as type of newspaper, genre, author’s gender, etc? Secondly, the lexicon is in constant change and modern dictionaries are in need of updated word lists. The project effort is clearly relevant for lexicographical work, by providing a daily updated vocabulary and information about the etymology of words, identifying imported words among neologisms. Thirdly, despite valuable contributions like Graedler (1998) and Johansson & Graedler (2002), anglicisms in Norwegian represent an understudied field of linguistics. A brief mention of some of the topics that should be investigated seems justified. Morphosyntactic variation of imported words is one relevant topic. There is a need to study variation in lemma form in imported words like cap, whose inflectional paradigm can be realised in two different ways, either as seen in Alternative 1 below with a singular form cap as its imported stem, to which Norwegian endings are attached, or as seen in Alternative 2 with a plural form as its imported stem:
| Alternative 1 | Alternative 2 | English gloss |
|---|---|---|
| cap | caps | cap |
| capen | capsen | the cap |
| caper | capser (caps) | caps |
| capene | capsene | the caps |
Although we know that such variation occurs (Graedler 1998), we know little about the extent and nature of this variation, let alone the semantic or cognitive motivation for importing a plural form as stem. There is a need to investigate the phonological and morphological nature of words with alternative morphological realisations, what characterises the ‘competition’ between the two paradigms, constraints of usage, mutual exclusivity, etc. Morphosyntactic variation is also seen in adjectives ending in -y, like trendy and crazy, which the most comprehensive Norwegian reference grammar (Faarlund, Lie & Vannebo 1997) regards as uninflectable, but which turn out to have the definite and plural inflected forms trendye and crazye in the corpus. Moreover, imported verbs are usually inflected according to the morphosyntactic pattern of the paradigmatic verb kaste ‘throw’ (Graedler 1998), but the corpus shows signs of the other major class of Norwegian verbs, the lyse ‘shine’ class, as evidenced by the verb form rulte ‘ruled’. This leads to two competing inflectional paradigms for finite forms of imported verbs:
| Alternative 1 (kaste) | Alternative 2 (lyse) | English gloss |
|---|---|---|
| rule | rule | rule |
| ruler | ruler | rule/rules |
| rulet | rulte | ruled |
| rulet | rult | ruled |
The resources developed in the Norwegian Newspaper Corpus project enable us to embark on systematic empirical studies of these and related phenomena, moving beyond the intuitional approach via quantitative explorations of a large and continuously updated corpus.
Occasionally, English loan words get a normalised ‘norwegified’ spelling, which corresponds more closely to Norwegian pronunciation than the English orthography does. This is the case with the common adjective døll ‘dull’, which, in fact, grossly outnumbers the original English spelling dull in the corpus; cf. (12). Sometimes the ‘norwegification’ is only partial in compounds, as shown in jønkfood ‘junk food’ in (13).
(12)
Det var en skikkelig døll plass. (DB010810)
It was a really dull place.
(13)
Det hjelper ikke enslige mødre eller arbeidsledig ungdom med kropper som blir vandaliserte av jønkfood og sinn som sultefores av voksen tafatthet. (SA030328)
It does not help single mothers or unemployed youth with bodies which are vandalised by junk food and minds which are starved by adult indolence.
Norwegified spelling may be the result of top-down or bottom-up processes. The former process applies in the cases where the Norwegian Language Council has proposed an alternative spelling to the English original, as in the case of gaid/guide (Sandøy 1997). The latter process is illustrated by the examples given, where the normalised spelling is initiated by the language users themselves, in this case, newspaper journalists whose articles may go through an in-house editing process. There is a need for studies which investigate the effect and result of both of these processes and which explore the nature of non-normative spelling normalisation (Andersen forthcoming).
Anglicisms need not contain overt source language material but may consist of Norwegian material only, in which case they represent loan shifts (Haugen 1950). This is the case for loan translations (calques), in which only a meaning is imported but the forms used to express this meaning are native. An example would be nedlasting, a direct translation of downloading. Besides, semantic loans, in which the meaning of a word is extended without the import of any lexical material, are represented by karakter ‘character’ (in the new sense of a fictitious person in a play or a film), which seems to be a recent semantic loan from English. Naturally, both types occur widely in the corpus.
(14)
Men når det kommer til nedlasting av filer, er jeg direkte imponert [4]. (SA990831)
But when it comes to downloading of files, I am really impressed.
(15)
Nå var han ikke noen god far, denne karakteren i novellen. (DB0430)
He was not a good father, this character in the short story.
The question is, then, how can we use automatic or semi-automatic methods for retrieving English-based inventory in Norwegian texts? The answer depends on the orthographical characteristics of the words in question, and there are various methods associated with the different types of anglicisms described above. We have developed an in-house language processing tool which tries to identify those anglicisms that are non-adapted (Furiassi 2008), that is, those with an English orthography, i.e. not loanshifts. The module, described in Andersen (2005), uses a hybrid method combining n-gram statistics, dictionary look-up and regular expressions. English and Norwegian have rather different orthographies, and this can be exploited in the machine-based search for anglicisms. By considering grapheme typicality for English or Norwegian, it is a relatively uncomplicated task to extract words like crew, quiz, comeback, chat and shotsene ‘the shots’ as anglicisms, as they all contain character n-grams (chargrams) that do not occur in words of Norwegian origin. The algorithm that we have developed uses grapheme typicality to pick out candidates that are most likely from English, based on n-gram statistics from the BNC. For each neologism, the program considers the orthographic inventory and looks for letter combinations (bigrams) that are typically English and atypically Norwegian. A word such as crew contains the following bigrams (^ and $ signify word beginning and word end):
^c + cr + re + ew + w$
Since, for example, a word-initial cr- and a word-final -ew are not found in domestic Norwegian words, crew is picked out as an anglicism candidate. However, due to the typological closeness of English and Norwegian, many anglicisms cannot be identified using this method, since they have an orthography which is not specifically English-looking. This applies to a word such as date, with the following bigrams:
^d + da + at + te + e$
None of these bigrams is atypical for domestic Norwegian words. For this reason, the anglicism identifier also uses dictionary lookup to identify words of English origin. Specifically, it uses a targeted word list that consists of all the word forms found in the BNC that are not found in the Norwegian dictionary Bokmålsordboka. This algorithm identifies date as an anglicism candidate. Finally, words that involve productive morphemes like reality and temptation are picked out as anglicism candidates by means of a set of regular expressions. The notion of “anglicism candidate” is intended to convey that one cannot be sure as to the origin of the words automatically picked out. This could be because the words are of non-English foreign origin, such as capo, because a word represents a Norwegian-English homograph, such as dull, which, in theory, could be the imperative form of a Norwegian verb dulle ‘pamper’ but which coincidentally does not occur as such in the corpus, or because the programming rules over-generate and pick out domestic words that may look like anglicisms (e.g. korthekken ’110 metre hurdles’). Therefore, a manual check of retrieved candidates is needed. Nevertheless, the semi-automatic method reduces the need for manual work in anglicism detection.
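The hybrid procedure described above (grapheme bigrams, dictionary lookup and regular expressions) can be sketched as follows. Only the bigram segmentation with ^ and $ follows the text directly; the set `ATYPICAL_BIGRAMS`, the word list `ENGLISH_ONLY_FORMS` and the pattern `ENGLISH_MORPHEMES` are tiny hypothetical stand-ins for the BNC-derived statistics, the BNC-minus-Bokmålsordboka list and the project's rule set, none of which are reproduced here.

```python
import re


def bigrams(word):
    """Character bigrams with ^ and $ marking word beginning and end."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


# Hypothetical toy resources, for illustration only.
ATYPICAL_BIGRAMS = {"cr", "ew", "qu"}      # typically English, atypically Norwegian
ENGLISH_ONLY_FORMS = {"date", "dull"}      # stand-in for BNC forms absent from the dictionary
ENGLISH_MORPHEMES = re.compile(r"reality|tion$")


def anglicism_evidence(word):
    """Return the kinds of evidence that make `word` an anglicism candidate."""
    evidence = []
    if any(bg in ATYPICAL_BIGRAMS for bg in bigrams(word)):
        evidence.append("bigram")
    if word.lower() in ENGLISH_ONLY_FORMS:
        evidence.append("lexicon")
    if ENGLISH_MORPHEMES.search(word.lower()):
        evidence.append("pattern")
    return evidence


print(bigrams("crew"))
for w in ["crew", "date", "barnereality", "tidsklemma"]:
    print(w, anglicism_evidence(w))
```

A word with an empty evidence list is not proposed as a candidate. Returning the kind of evidence, rather than a yes/no answer, is a design choice of this sketch that would make the manual check of candidates easier to prioritise.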
The machine-based detection of anglicisms that do not contain overt English orthography is a much more complicated task. Norwegified anglicisms like døll, podkast, and jønkfood (cf. above) and new loan translations like nedlasting are identified as new words according to the routines described above. However, it is more problematic to identify semantic loans, since they involve no formal neology but a change in use of existing forms, like the use of karakter in the sense of ‘character in a film or literary work’, or the new sense of the verb disse, ‘to dis’, from English discredit, in addition to its original sense ‘to swing’. Although not currently extracted by machine methods, these are in principle also machine-retrievable. A possible way to extract them would be in the fashion of the AVIATOR project, also based on a corpus of journalism, where the developers “have been able to discover changes in sense relations in text over time by monitoring the change of collocational profiles” (Renouf 2007: 39) of words.
At the current stage, the anglicism detection module is work in progress, and experiments with alternative methods are ongoing, using supervised machine learning techniques based on TIMBLE, in combination with the Java-based data mining software Weka (Losnegaard & Lyse forthcoming). The current precision of the module is about 75%, based on a gold standard of 10,000 manually identified anglicisms. The module identifies 128,588 anglicism candidates from 1,469,925 unique word forms in the corpus, amounting to 8.7%. Table 2 shows the result of applying the anglicism detection module on the words in the corpus, listing the most common anglicisms in the Norwegian Newspaper Corpus, seen in the right column, and the most common unigrams overall in the left column.
The most common anglicism, the past tense form of the verb score, ranks 772nd on the unigram frequency list. We note that vocabulary from sports (scoret, keeper, manager), music (rock, jazz) and general vocabulary (sex, mobbing) are represented among the frequent anglicisms.
A manual classification of the most frequent anglicisms gives the result presented in Figure 2 and Table 3.
Table 3. Most frequent anglicisms by topical category.
As mentioned above, anglicisms are often imported as lexicalised phrases which constitute multiword collocations. Examples of such multiword anglicisms are due diligence, easy listening, break even and a set of attitudinally salient discourse markers such as get a life or the ironical yeah right. Given the occurrence of borrowed phrases such as these, the correct analysis of anglicisms requires processing of multiword units. Again, the notion of n-gram becomes crucial, defined as ‘recurrent strings of uninterrupted word forms’ (Stubbs 2007: 90), and statistical measures of association provide a key to the identification of such units in running text. The correct identification and segmentation of multiword collocations is crucial for lexicography and language technology purposes, such as improvement of word class taggers, disambiguation of homographic words etc. This part of the analysis relies on well established methods for collocation analysis, with a view to identifying tight collocations (Renouf & Sinclair 1991), i.e. sequences of words with a strong tendency to collocate. We have tested various means for identifying collocations (statistical associations of word forms in the text), including mutual information, Z-score etc. The rationale for these calculations is that they are important for lexical acquisition, in that these scores “can help the lexicographer decide which collocate should be included in the lexicon” (Ooi 1998: 83).
This part of the project is more fully described in Lyse & Andersen (forthcoming). We first produced n-gram statistics for the entire corpus and calculated rank scores using a variety of association measures. For bigrams, collocational strength was measured by nine different calculations. The rank order and frequency of n-grams reveal information about their degree of lexicalisation. The results of the different association measures are seen in Table 4, which lists the top ten bigrams for each method (Evert 2004).
Table 4. Results of different association measures of bigrams.
| Odds ratio discr. | Z-score corr. | Chi-squared | Pointwise MI |
|---|---|---|---|
| corned beef | corned beef | graderte dokumenter | vilkårsett skattefritaking |
| etc. etc. | etc. etc. | uttørkete elveleier | varannan damernas |
| practical jokes | practical jokes | dår lige | unilaterally destroyed |
| lorem lipsum | lorem lipsum | ankende part | twam asi |
| lipsum lorem | lipsum lorem | hellige kuer | suvas bohciidit |
| journ anm | kaustisk soda | mistankens skjerpede | slow starters |
| eines fahrenden | journ anm | fulladet batteri | skrimmi nimmi |
| commedia dellarte | eines fahrenden | lytter oppmerksomt | rollon rolloff |
| haemophilus influenzae | mørkets frambrudd | innoverskrudd corner | respiratory infection |
| %. stem. | commedia dellarte | anaerobe terskelen | redu sert |

| T-score | Local MI | Likelihood ratio | jaccard | dice |
|---|---|---|---|---|
| til å | til å | til å | bekjente sitter | bredere parti |
| for å | for å | for å | bør ansette | fremdeles parkert |
| å få | å få | i i | ene leggbenet | mars avsluttes |
| i en | i en | å få | forbeholdt utenlandske | overraskende foreslo |
| å ha | å ha | å i | fremstille meg | unngå fiendtlig |
| om å | om å | å ha | fulle treninger | bør nyanseres |
| i den | å være | i en | ganske prekær | fra lavkostland |
| med å | å bli | å være | hindre stridsvognene | blir akkompagnert |
| i det | i den | å bli | hjemlige filmmiljø | all skiten |
| å være | med å | i å | kommunalministeren fremholder | direkte appeller |
We have also preliminarily evaluated their usefulness in the identification of lexicalised phrases, by analysing the top of the ranked lists and the degree to which they contain multiword collocations, such as technical terms. The most important and striking observation from the comparison of alternative association measures is the considerable differences between the various rankings, and consequently the differences in suitability of the methods to the task of identifying lexicalised phrases. A more thorough investigation is planned (Lyse & Andersen forthcoming), but some general observations can be made from our preliminary investigation. The association measures t-score, local-MI and log likelihood ratio give a high rank order to highly frequent formulaic sequences such as det er, til å, for å, i en, å komme and millioner kroner that are not lexicalised units. These are of little importance to lexicography, although they may well be interesting from a phraseological or other point of view. It should also be pointed out that a few frequent lexicalised phrases like i tillegg (‘in addition’), i går (‘yesterday’) and i fjor (‘last year’) receive a high rank score with these measures. Other measures are much more suited for isolating relevant multiwords by giving a high rank to lexicalised phrases, specifically chi square, z-score-corrected, odds-ratio discriminative and pointwise MI, which pick out many technical terms like anaerobe terskelen, eneggede tvillingene, honorære konsuler and amyotrofisk lateralsklerose, and multiword anglicisms like lucky loosers, corned beef, practical jokes, slow starters, jumpers knee, consumer confidence, honky tonk, splendid isolation, due diligence, extreme makeover and danish dynamite.
Of these, odds-ratio-discriminative and z-score corrected favour bigrams with a low frequency and where the word forms of the bigrams are used exclusively as collocates and not in other contexts, such as practical jokes, which occurs 78 times in the corpus. Finally, two association measures, dice and jaccard, appear not to be able to pick out collocations relevant for lexicography since there are no lexicalised multiwords among these tokens, nor do they seem particularly apt for phraseology purposes.
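The contrast between frequency-sensitive and exclusivity-sensitive measures can be made concrete with a small sketch of two of them, pointwise MI and t-score, in their standard textbook formulations. The frequency of practical jokes (78) and the approximate corpus size are taken from this paper; the assumption that practical and jokes occur only together, and all counts for til å, are invented for illustration.

```python
import math


def expected(f1, f2, n):
    """Expected co-occurrence frequency under independence."""
    return f1 * f2 / n


def pointwise_mi(o11, f1, f2, n):
    """log2 of observed over expected bigram frequency."""
    return math.log2(o11 / expected(f1, f2, n))


def t_score(o11, f1, f2, n):
    """(observed - expected) / sqrt(observed)."""
    return (o11 - expected(f1, f2, n)) / math.sqrt(o11)


N = 850_000_000  # approximate corpus size in running words

# (bigram frequency, frequency of word 1, frequency of word 2)
rare_exclusive = (78, 78, 78)                       # 'practical jokes', assumed exclusive
frequent_pair = (1_000_000, 8_000_000, 10_000_000)  # 'til å', hypothetical counts

for name, (o11, f1, f2) in [("practical jokes", rare_exclusive),
                            ("til å", frequent_pair)]:
    print(name,
          round(pointwise_mi(o11, f1, f2, N), 2),
          round(t_score(o11, f1, f2, N), 2))
```

Under these assumptions pointwise MI ranks the rare, exclusive pair far above the frequent function-word pair, whereas t-score reverses the order, which is the pattern observed in the rankings above.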
A manual check of the 500 most highly ranked bigrams according to one of the most promising association measures, the odds ratio calculation, showed that approximately 18.6% of them were anglicism candidates, as opposed to 5.4% of all neologisms in the corpus; cf. Table 1 above. Another important observation is that many highly ranked non-anglicism bigrams are multiword borrowings from other languages – many of Latin origin – such as commedia dell’arte, abortus provocatus, solar plexus, annus horribilis, notarius publicus, lingua franca, tabula gratulatoria, tabula rasa and mea culpa. Among the highly ranked bigrams we also find idiomatic Norwegian phrases like navns nevnelse, flammenes rov, rangen stridig, tenners gnissel and bange anelser. Examples like cage aux and erat demonstrandum display an interesting methodological point, namely that the investigation of the longer n-grams should precede the shorter ones, in order to pick out phrases like cage aux folles and quod erat demonstrandum as trigrams.
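The methodological point about processing longer n-grams first can be sketched as a simple filter: once a trigram has been accepted as a phrase, the bigrams it contains are removed from the bigram candidate list. The function name and the toy candidate lists are hypothetical, and the sketch ignores the possibility that a contained bigram also occurs independently, which a fuller treatment would handle by adjusting counts.

```python
def drop_subsumed(bigram_candidates, accepted_trigrams):
    """Remove bigram candidates that are part of an accepted trigram."""
    covered = set()
    for w1, w2, w3 in accepted_trigrams:
        covered.add((w1, w2))   # left-hand bigram of the trigram
        covered.add((w2, w3))   # right-hand bigram of the trigram
    return [bg for bg in bigram_candidates if bg not in covered]


trigrams = [("cage", "aux", "folles"), ("quod", "erat", "demonstrandum")]
candidates = [("cage", "aux"), ("erat", "demonstrandum"), ("corned", "beef")]
remaining = drop_subsumed(candidates, trigrams)
print(remaining)
```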
The following inventory of forms from the top-500 list, ranked by the odds ratio statistic, shows that our proposed n-gram-based method is capable of identifying relevant multiword units that are English loan words:
Given the vast number of new words that occur in the corpus, it is particularly important to study frequency, as this information holds the key to whether or not a particular form or meaning should be included in the dictionary, and in which orthographic form (Pulcini 2008). Overall frequency and frequency development over time have a direct bearing on the relevance of individual word forms and neologistic phrases for lexicographic inclusion. Generally, new words with a high frequency are more relevant than lower-frequency words, and words with a steady or gradually increasing frequency in the period represented in the data are more relevant than those with a fluctuating or decreasing frequency. To illustrate, the lexicographer may be faced with the choice of whether to include the neologism weblogg as well as its synonym blogg in a Norwegian dictionary. Corpus frequency provides an important clue and avoids reliance on intuition when making such considerations, and sometimes, the corpus frequencies speak for themselves (Figure 3):
Figure 3. Frequency profiles of weblogg.* and blogg.* in the Norwegian Newspaper Corpus (the notation .* indicates truncation).
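A frequency profile of this kind amounts to counting, per time period, the word forms that match a truncated pattern. The following Python sketch shows the idea under simplifying assumptions: the corpus is represented as (year, token) pairs, the truncation notation is interpreted as a regular expression that must match the whole token, and the sample data are invented and do not reflect actual corpus counts. The function name frequency_profile is hypothetical.

```python
import re
from collections import Counter

def frequency_profile(dated_tokens, pattern):
    """Count tokens per year whose full form matches the truncated pattern."""
    regex = re.compile(pattern)
    profile = Counter()
    for year, token in dated_tokens:
        if regex.fullmatch(token.lower()):
            profile[year] += 1
    return dict(sorted(profile.items()))

# Invented sample, for illustration only.
sample = [
    (2003, "weblogg"), (2003, "weblogger"),
    (2004, "blogg"),
    (2005, "bloggen"), (2005, "blogger"), (2005, "weblogg"), (2005, "bok"),
]

print(frequency_profile(sample, r"weblogg.*"))  # {2003: 2, 2005: 1}
print(frequency_profile(sample, r"blogg.*"))    # {2004: 1, 2005: 2}
```

Because the pattern must match the whole token, blogg.* does not also count forms of weblogg, which keeps the two profiles separate.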
However, other factors also come into play, and non-frequent items may also be worth including. In the selection process one cannot rely exclusively on frequency; subjective considerations made by the lexicographer are sometimes necessary (Pulcini 2008).
Lexicographers and other users of the Norwegian Newspaper Corpus can retrieve usage statistics of individual words as images created on-the-fly. For more systematic considerations of frequency developments, we have developed frequency filters which systematically extract frequency profiles for new words and select the words that look the most interesting from the point of view of frequency development over time. The filters are based on linear regression statistics calibrated with the least squares method (Fjeld & Nygaard forthcoming). The first filter picks out non-frequent words (n < 10), which are discarded and not included among the neologism candidates which are manually edited by lexicographers. The second filter picks out words with a high and stable frequency. These are manually checked by a lexicographer, and the most relevant words are included in the Norwegian Word Bank. The third filter identifies words with a high and increasing frequency, of which most are included in the Norwegian Word Bank.
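The logic of the three filters can be sketched in Python as follows. This is a minimal illustration and not the project's implementation: only the cut-off n < 10 comes from the description above, while the function names, the threshold for "high" frequency, the tolerance that separates "stable" from "increasing", and the frequency profiles themselves are assumptions made for the example.

```python
def least_squares_slope(counts):
    """Slope of the least-squares line through (0, c0), (1, c1), ...

    Requires at least two time periods.
    """
    n = len(counts)
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx

def classify(counts, min_total=10, high_total=100, rel_slope=0.1):
    """Assign a frequency profile to one of the filters (thresholds assumed)."""
    total = sum(counts)
    if total < min_total:
        return "discard"                  # first filter: non-frequent words
    if total < high_total:
        return "no filter"                # neither rare nor clearly frequent
    slope = least_squares_slope(counts)
    tolerance = rel_slope * total / len(counts)  # slope relative to mean count
    if slope > tolerance:
        return "high and increasing"      # third filter
    if abs(slope) <= tolerance:
        return "high and stable"          # second filter
    return "no filter"                    # high but decreasing

# Invented yearly counts, for illustration only.
profiles = {
    "rare": [1, 0, 2, 1, 0, 1],
    "stable": [50, 52, 49, 51, 50, 48],
    "rising": [5, 20, 40, 80, 120, 200],
    "falling": [200, 120, 80, 40, 20, 5],
}
for name, counts in profiles.items():
    print(name, classify(counts))
```

Expressing the tolerance relative to the mean count makes the distinction between stable and increasing profiles independent of the absolute frequency level; other calibrations are equally conceivable.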
Examples of some of the words that have been picked out by the various filters are given in Table 5.
Table 5. Result of frequency-based filtering of new words.
In the category of words not relevant for inclusion we find spelling errors (dettestår, viestad, vikig) and non-lexicalised compounds (alpindamene, bøllefri). Among forms which proceed to manual lexicographical editing, we find both real neologisms like hajj and foodprosessors, but also, interestingly, lexicographical lacunae, that is, words which are not new, but nevertheless missing in existing dictionaries, like akilles, bakfull (‘hung over’) and the verb lukeparkere (‘parallel park’). This shows that the inductive method for neologism extraction has an advantage over manual methods, in that it may lead to a better coverage of existing words in the language.
7. Concluding remarks
In line with earlier projects like COBUILD and ACRONYM, I have attempted to show in this paper that a corpus-based approach to lexicography is a useful one, and more specifically that there are several advantages to using a large monitor corpus as a basis for studying neologisms and anglicisms. The Norwegian Newspaper Corpus is a self-expanding dynamic corpus of considerable size and coverage, and it appears to be a good resource for such purposes. First, it captures micro-level developments such as new word forms and new uses of old words. Second, it allows for large-scale quantitative studies. Third, it provides a continuously updated inventory. Fourth, it allows continuous monitoring of language development by statistical comparison between different time sections of the corpus, thus enabling the study of ‘short-term change in diachrony’ (Kytö, Rudanko & Smitterberg 2000: 92). Thus, our Norwegian-based experience so far corroborates that of earlier projects.
Thanks to recent advances in corpus building and technology, word formation and neology can be studied in empirical quantitative detail. The monitor corpus allows us to become less dependent on our intuitions and to rely instead on statistical evidence. The corpus-based approach is a valuable supplement to traditional lexicography/terminography, which involves manual extraction of words. It does not offer the full answer as to which forms to include and which forms to leave out, but it promises a systematic and empirically based proposal of where to start looking. This will hopefully lead to a significant reduction of manual work and a radical simplification of the task of looking for the needle in the linguistic haystack.
[4] In fact, this example contains another salient loan translation of a structural kind: når det kommer til is a vogue phrase used as a discourse marker and a direct translation of ‘when it comes to’.
Andersen, Gisle. 2005. “Assessing algorithms for automatic extraction of anglicisms in Norwegian texts”. Proceedings of Corpus Linguistics 2005. Birmingham: University of Birmingham.
Andersen, Gisle. 2010. “Halvautomatisk ekserpering av anglisismer i norsk” [Semi-automatic extraction of anglicisms in Norwegian]. Nordiska studier i lexikografi 10: 72-85.
Andersen, Gisle. Forthcoming. “A corpus-based study of adaptations of English import words in Norwegian”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.
Andersen, Gisle & Knut Hofland. Forthcoming. “Building a large monitor corpus based on newspapers on the web”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.
Baroni, Marco, Silvia Bernardini, Adriano Ferraresi & Eros Zanchetta. 2009. “The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora”. Language Resources and Evaluation 43: 209-226.
Davies, Mark. 2009. “The 385+ million word Corpus of Contemporary American English (1990-2008+): Design, architecture, and linguistic insights”. International Journal of Corpus Linguistics 14: 159-190.
Evert, Stefan. 2004. “The Statistics of Word Cooccurrences: Word Pairs and Collocations”. IMS. University of Stuttgart.
Faarlund, Jan Terje, Svein Lie & Kjell Ivar Vannebo. 1997. Norsk referansegrammatikk. Oslo: Universitetsforl.
Fjeld, Ruth Vatvedt & Lars Nygaard. Forthcoming. “Lexical neography in modern Norwegian”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.
Furiassi, Cristiano. 2008. “What dictionaries leave out: new non-adapted Anglicisms in Italian”. Investigating English with Corpora, ed. by Aurelia Martelli & Virginia Pulcini, 153-169. Monza: Polimetrica.
Graedler, Anne-Line. 1998. Morphological, semantic and functional aspects of English lexical borrowings in Norwegian. Oslo: Faculty of Arts Scandinavian University Press.
Görlach, Manfred. 2001. A Dictionary of European anglicisms: a usage dictionary of anglicisms in sixteen European languages. Oxford: Oxford University Press.
Haugen, Einar. 1950. “The analysis of linguistic borrowing”. Language 26: 210-231.
Hofland, Knut. 2000. “A self-expanding corpus based on newspapers on the Web”. The Second International Language Resources and Evaluation Conference (LREC) Paris: European Language Resources Association (ELRA).
Hundt, Marianne, Carolin Biewer & Nadja Nesselhauf. 2007. Corpus linguistics and the web. Amsterdam: Rodopi.
Johansson, Stig & Anne-Line Graedler. 2002. Rocka, hipt og snacksy: om engelsk i norsk språk og samfunn. Kristiansand: Høyskoleforl.
Kilgarriff, Adam & Gregory Grefenstette. 2003. “Introduction to the Special Issue on Web as Corpus”. Computational Linguistics 29: 1-15.
Kilgarriff, Adam & David Tugwell. 2002. “Sketching words”. Lexicography and Natural Language Processing – A Festschrift in Honour of B.T.S. Atkins, ed. by Marie-Hélène Corréard, 125-137. Gothenburg: EURALEX.
Kytö, Merja, Juhani Rudanko & Erik Smitterberg. 2000. “Building a bridge between the present and the past: A corpus of 19th-century English”. ICAME Journal 24: 85-97.
Losnegaard, Gyri & Gunn Inger Lyse. Forthcoming. “A data-driven approach to anglicism identification in Norwegian”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.
Lyse, Gunn Inger & Gisle Andersen. Forthcoming. “Collocations and statistical analysis of n-grams – multiword expressions in newspaper text”. Exploring Newspaper Language – Corpus Compilation and Research based on the Norwegian Newspaper Corpus, ed. by Gisle Andersen. To be published by John Benjamins.
Ooi, Vincent B. Y. 1998. Computer corpus lexicography. Edinburgh: Edinburgh University Press.
Pulcini, Virginia. 2008. “Corpora and lexicography: the case of a dictionary of Anglicisms”. Investigating English with corpora: studies in honour of Maria Teresa Prat, ed. by Aurelia Martelli & Virginia Pulcini, 189-203. Monza: Polimetrica.
Renouf, Antoinette. 1996. “The ACRONYM Project: Discovering the textual thesaurus”. Synchronic corpus linguistics, ed. by Carol E. Percy, Charles F. Meyer & Ian Lancashire, 171-187. Amsterdam & Atlanta: Rodopi.
Renouf, Antoinette. 2007. “Corpus development 25 years on: from super-corpus to cyber-corpus”. Corpus linguistics 25 years on, ed. by Roberta Facchinetti, 27-49. Amsterdam, New York: Rodopi.
Renouf, Antoinette, Andrew Kehoe & Jay Banerjee. 2005. “The WebCorp Search Engine: A holistic approach to web text search”. Corpus Linguistics 2005. Birmingham: University of Birmingham.
Renouf, Antoinette & John McH. Sinclair. 1991. “Collocational frameworks in English”. English Corpus Linguistics - Studies in Honour of Jan Svartvik, ed. by Karin Aijmer & Bengt Altenberg, 128-143. London, New York: Longman.
Sandøy, Helge. 1997. Lånte fjører eller bunad? Om norsk skrivemåte av importord. Oslo: Kulturdepartementet/Norsk språkråd.
Sinclair, John McH., ed. 1987. Looking up. London & Glasgow: Collins ELT.
Stubbs, Michael. 2007. “An example of frequent English phraseology: distributions, structures and functions”. Corpus Linguistics 25 Years on, ed. by Roberta Facchinetti, 89-105. Amsterdam, New York: Rodopi.