Discovering new verb-preposition combinations in New Englishes
Gerold Schneider and Lena Zipp
English Department, University of Zurich
Abstract
The grammatical description of New Englishes is a relatively young field but at the same time one that benefitted much from recent developments in corpus linguistics. Standard reference corpora such as the International Corpus of English (ICE) have made it possible to research grammatical phenomena even in smaller outer circle varieties of English. In the field of grammar, innovations typically start out at the intersection of grammar and lexis. We investigate verb-preposition combinations in four corpora of first and second language varieties of English, among them the preliminary version of the written component of ICE Fiji. Our focus is on what has been termed ‘new prepositional verbs’ (cf. Mukherjee 2009, Nesselhauf 2009), i.e. novel combinations of verbs and prepositions.
We compare a manual and a semi-automated approach to the study of new verb-preposition combinations. The manual approach consists of a surface search for prepositions followed by a careful manual filtering process. The semi-automated approach is a corpus-driven investigation using parsed corpora and detecting variation-specific prepositional collocations. Typically, the advantage of manual searches is that precision is very high; the disadvantage is that the investigation is time-consuming and recall can be incomplete, because the scope of investigations may have to be restricted. The advantage of automatic, parse-based methods is that they are fast and corpus-driven, which may increase recall; the disadvantage is that error-rates are high, which seriously affects precision. We discuss similarities and differences in the results of the two approaches and show examples of new verb-preposition combinations from ICE India and ICE Fiji that the two approaches deliver. We conclude that both methods validate, but also complement each other.
1. Introduction
1.1 Corpora for New Englishes
The detailed grammatical description of New Englishes is a comparatively recent trend. Previous descriptive approaches of grammatical phenomena in second language varieties of English (ESL) relied largely on anecdotal evidence (see e.g. Foley 1988, Bautista and Gonzales 2006), or at best on the (mainly manual) analysis of sociolinguistic data (see Schreier 2003) rather than on representative collections of text. The International Corpus of English (ICE) project began to compile standard reference corpora of first and second language varieties of English in the early 1990s, and regional components are now becoming available for many ESL varieties: ICE Philippines was released in 2005, ICE Jamaica came out in 2009, and ICE Fiji (amongst others) is currently being compiled (Biewer et al. 2010). This set of comparable corpora provides the basis for corpus-linguistic studies that complement sociolinguistic investigations; it makes representative data available that ranges across various text-types, including the upper end of the stylistic spectrum. Furthermore, its matching design allows for comparative studies of a quantitative nature.
1.2 Corpus-based and corpus-driven approaches
Overall, recent corpus-based descriptions of ESL varieties (see Sand 2004, Schneider 2004 or Sedlatschek 2009) have been conducted on orthographic, i.e. not annotated, corpora, as work on tagging and parsing the ICE components is still under way. As a result, these descriptions have to rely on more or less sophisticated searches based on lexical items. It may be possible, however, to arrive at a partly corpus-driven description of grammatical phenomena in New Englishes: Mukherjee and Hoffmann (2006), for example, make use of tagged web-derived collections of text, and Xiao (2009) employs a tagger on five ICE corpora to conduct a multidimensional register analysis. In our approach, we use more richly annotated corpus material: The required corpora are annotated syntactically and our aim is to explore in how far this annotation may yield useful information for the description of New Englishes. In turn, the New Englishes databases might be exploited in fine-tuning the annotation tools to the structural challenges that second language varieties of English present; this very approach is described and tested on selected phenomena in Schneider and Hundt (2009). In the present study, we compare the traditional method of carefully analysing orthographic corpora manually with a corpus-driven approach based on automatically parsed corpus data; we focus on the case of lexico-grammatical phenomena in the verb phrase.
1.3 Lexico-grammar
Studies investigating patterns located at the lexis-grammar interface can in principle draw from the two main characteristics of the field in question. Starting a search query from the purely lexical end soon reveals that all content words express semantic differences and are very specific to the topics that happen to be discussed in the selected texts. As described by Zipf’s law, most content words are rare, which leads to a sparse-data-problem even when using large corpora. If corpora of appropriate size are used, the results are dominated by regional differences such as place names and semantic differences. Function words are a more promising starting place for lexical search queries: Because they do not express semantic concepts directly, they are less affected by regional or semantic differences. Furthermore, while being comparatively frequent, they also form closed lists, which facilitates manual search, especially in languages with as little morphology as English. We will explore this point of departure in section 2.
When starting investigations at the grammatical end, for example by comparing frequencies of part-of-speech tags, phrase types, or grammatical relations, it can be noticed that regional differences are relatively small while genre differences are often bigger (Biber, Conrad and Reppen 1998). While not impossible (and a crucial task for future research), it will be very difficult to disentangle genre differences from regional variation. Variational differences are often too subtle to leave a visible impact in frequency counts. In fact, the vast majority of sentences in e.g. ICE India or ICE Fiji could just as well have been produced by a British or American speaker, there is nothing ‘unusual’ in them.
While one-dimensional investigations of only the lexicon or of only the grammar may lead to limited success, it has been observed that the crucial variationist differences happen in the interaction of lexis and grammar. Schneider (2004: 229) for example states that in World Englishes,
distinctive phenomena tend to concentrate at the interface between grammar and lexicon, concerning structural preferences of certain words (like the complementation patterns that verbs allow), co-occurrence and collocational tendencies of words in phrases, and also patterns of word formation.
It will thus be revealing to investigate the lexical material that is used in syntactic relations. As a first approximation to words in syntactic relations, one can investigate surface word or word-tag sequences. For example, investigating trigrams that are frequent in ICE India but absent in the one-hundred-times larger British National Corpus (BNC) leads to the list given in figure 1, after filtering trigrams containing proper names and punctuation. Besides text selection, Indian features like archaic spellings (now a days), formal language (the honourable minister), unusual verb complementation with prepositional phrases (is called as), and written numbers (sixty-six and half) appear in this list. Examples that show the unusual verb complementation trigram is called as are:
(1) |
A substance which is helping in chemical reaction is called as a reagent. (ICE-IND:S1B-004) |
(2) |
Thus the intermediate state between crystaline and isotopic state is called as the mesophase or liquid crystals. (ICE-IND:W1A-020) |
Trigram |
f(ICE-India) |
now_RB a_DT days_NN |
42 |
special_JJ P_NN P_NN |
35 |
canvassed_VBN before_IN this_DT |
32 |
statement_NN was_VBD recorded_VBN |
28 |
learned_JJ special_JJ P_NN |
28 |
is_VBZ called_VBN as_IN |
27 |
scene_NN of_IN offence_NN |
26 |
the_DT honourable_JJ minister_NN |
23 |
for_IN grain_NN yield_NN |
22 |
the_DT learned_JJ special_JJ |
21 |
in_IN the_DT cyclone_NN |
19 |
delay_NN in_IN reply_NN |
18 |
best_JJS feature_NN film_NN |
18 |
avoid_VB delay_NN in_IN |
18 |
small_JJ circle_NN to_TO |
17 |
of_IN solid_JJ wastes_NNS |
17 |
general_JJ body_NN meeting_NN |
17 |
evidence_NN of_IN P_NN |
17 |
feature_NN film_NN in_NN |
16 |
crores_NNS of_IN rupees_NNS |
16 |
in_IN the_DT nodules_NNS |
15 |
has_VBZ also_RB canvassed_VBN |
15 |
sixt-six_NN and_CC half_NN |
14 |
Figure 1. Unusual trigrams in ICE India.
Is called as in the examples above belongs to a field of lexico-grammar that has been noted to hold great potential for studies of variety-specific usage patterns: verb complementation. For example, Olavarría de Ersson and Shaw (2003: 138) state that “Verb complementation is an all-pervading structural feature of language and thus likely to be more significant in giving a variety its character than, for example, lexis.” A number of previous studies have focussed on the variability of verb-particle combinations (with both prepositions or adverbial particles) in particular (see Mukherjee and Hoffmann 2006, Mukherjee 2009, Nesselhauf 2009, Zipp forthcoming). All of these studies investigate the occurrence of ‘new prepositional verbs’, i.e. novel combinations of verbs and prepositions that are triggered by a process of ‘semantico-structural analogy’, “a process by means of which non-native speakers of English as a second language are licensed to introduce new forms and structures into the English language because corresponding semantic and formal templates already exist in the English language system” (Mukherjee and Hoffmann 2006: 166-167). In most cases, the result of this process is a verb-particle combination with a redundant, i.e. additional preposition attached to a verb that does not usually combine with a particle (this is the classic case investigated in most of the previous studies). However, other divergent types could be the following: cases of different preposition, missing preposition, or un-idiomatic usage of existing verb-particle combinations. The first type is also included in the analyses here: verb-particle combinations in which we find a different preposition than the ones that are codified. The second type, missing preposition, could not be investigated by our research method. Schneider (2004) searches for a small set of specific verbs. While this allows the retrieval of missing particles, the complete set of verbs would be forbiddingly cumbersome. To complement Schneider (2004) we use a more corpus-driven all-inclusive method here. The third type however, un-idiomatic usage of existing verb-particle combinations, is addressed in the manual analysis in section 2. Regarding combinations of verbs and prepositional particles, traditional grammars distinguish between phrasal, prepositional and phrasal-prepositional verbs. For the investigation of verb complementation in this study, however, we purposely leave the distinction between preposition and verbal particle underspecified. All verb-preposition constructions are included, irrespective of whether they are specified or unspecified, continous or discontinuous. The manual approach only includes complements, whereas the automatic approach also delivers adjuncts.
1.4 Data
The data used for both methodological parts of this study comes from the same set of varieties of English: Fiji English, Indian English, New Zealand English and British English. This selection is justified by the original research object of previous studies, Fiji English, and its geographical, historical and cultural relations to India, New Zealand and Great Britain. The Fiji, Indian and New Zealand data are all part of the International Corpus of English (ICE) project; for the purposes of the present study, however, it was necessary to work with different datasets due to the methods we report on: The manual method is only concerned with the respective written parts of the corpora; ICE Fiji is still in the process of compilation at the moment of writing, and therefore represented by a part of the written subcorpus only. For the corpus-driven parsing method, all corpora were automatically parsed; this includes the Fiji corpus and the complete (i.e. spoken and written) regional components for Indian and New Zealand English. For reasons of corpus size, the basis of comparison for the parsing method is the written BNC corpus with approximately 90m running words (see Table 1).
Manual method |
|
ICE written components |
Fiji |
India |
NZ |
GB |
number of 2,000 word files |
140 |
200 |
200 |
200 |
Parsing method |
|
ICE |
Fiji |
India |
NZ |
BNC written |
number of 2,000 word files |
140 |
500 |
500 |
90m words |
Table 1: Corpora used
2. Manual method
This section reports on the processes that we followed and the results that we achieved on the task of investigating new verb-particle combinations in untagged corpora of New Englishes. All technical levels of this traditional corpus-linguistic study, which is based on lexical search queries, serve as the matrix against which the corpus-driven, parser-based method described below (section 3) is evaluated. As mentioned above, lexico-grammatical phenomena are claimed to be very good indicators of variety-specific structural nativisation. From a formal point of view, the grammatical elements of lexico-grammatical phenomena are represented by function words, including e.g. determiners (see Schneider and Hundt 2009), auxiliaries, modal verbs (see Biewer 2009), complementizers, and prepositions, which we will investigate here. In the specific case of verb complementation by particles, verb lexemes in all their inflectional forms combine with prepositions, i.e. function words. Prepositions are members of a closed class and not subjected to morphological operations; they can thus easily and exhaustively be found with simple lexical searches.
The first step of the manual investigation was therefore a lexical surface search for prepositions. For reasons of efficiency, the search queries were limited to the five most productive particles in verb-particle combinations (see Villavicencio 2006), up, out, down, off and away, and two prepositions that have repeatedly been claimed to be productive in the formation of new particle verbs in New Englishes, into and about. This was followed by a careful manual filtering process, as a result of which all instances were eliminated in which the prepositions did not occur within the verb phrase, or constituted false positives (e.g. due to linebreaks, within editorial comments, or added with the help of corrective mark-up). Whenever a verb-preposition combination was not recorded in the following sources, it was considered as ‘unrecorded’: The Collins COBUILD Phrasal Verbs Dictionary (2002), the Collins COBUILD Advanced Learner’s Dictionary (Resource Pack CD) (2003), the Oxford English Dictionary Online (2009) and, in selected cases, the Internet by means of Google search in selected cases.
|
ICE Fiji |
ICE IND |
ICE GB |
ICE NZ |
V + up |
38 (11) |
31 (13) |
5 (2) |
5 (2) |
V + out |
14 (4) |
31 (13) |
- |
9 (4) |
V + down |
10 (3) |
2 (1) |
- |
5 (2) |
V + off |
7 (2) |
24 (10) |
- |
2 (1) |
V + away |
3 (1) |
24 (10) |
11 (5) |
16 (7) |
V + into |
52 (15) |
24 (10) |
9 (4) |
9 (4) |
V + about |
28 (8) |
17 (7) |
2 (1) |
7 (3) |
total pmw (raw) |
152 (44) |
153 (64) |
27 (14) |
53 (23) |
Table 2: Distribution of unrecorded verb-preposition combinations across varieties per million words and (raw)
Table 2 shows the total number and normalized distribution of unrecorded verb-preposition combinations in each of the written subcorpora used for the manual analysis. The difference between the total number of unrecorded combinations is statistically significant at the p<0.001 level (chi-square contingency test, d.f.=3). Indian and Fiji English exhibit the highest number of innovations, followed by New Zealand English and British English. Across varieties, the prepositions up, away, into and about are consistently used in new verb-preposition combinations. From a variety-specific perspective, the first language varieties (GB and NZ) exhibit the greatest productivity in combination with the preposition away, which in most cases is used to add an aspectual dimension of continuity to a procedural verb (with the particle establishing the notion of ‘persistent action’, see Quirk et al. 1985: 1162). On the other hand, the prepositions into, up, about and out are used most often in unrecorded verb-preposition combinations in the second language varieties under observation. Note that unrecorded verb-preposition combinations were found in all corpora, despite the limited data size and the partly lexical nature of the phenomena investigated.
In order to shed light on the types of verb-preposition combinations that were detected on the basis of the manual method, we will now present a selection of examples. As mentioned in section 1.3, there are four types of possible divergence in verb-particle combinations, of which three can be investigated with the help of the manual method: redundant particle, different particle, and un-idiomatic usage.
2.1 Redundant particle
Examples 3 to 5 show four instances in which a redundant preposition is added to a simple verb. This process of creating ‘new prepositional verbs’ by analogy to existing, semantically related particle verbs is described in detail by Mukherjee (2009) and Nesselhauf (2009), and discussed and applied by Zipp (forthcoming). Below, we present a range of typical examples: The combination cope up with has been noted before in the context of many New Englishes, explaining about and discussing about belong to a relatively homogenous group of disquisition verbs combining with this preposition (such as talk about or speak about), and listed down might be triggered by analogy to the verb put down.
(3) |
As a result some or nearly most women in the world have now turned to becoming prostitutes in order to cope up with poor living standard, they may be experiencing. (ICE-FJ:W1A-016) |
(4) |
First, I would be explaining about the gender inequality, which often leads to the high incidence of poverty amongst women, which is what I would be discussing about in the second part of this essay. (ICE-FJ:W1A-016) |
(5) |
Adi Asenaca said an Asian Development Bank poverty participation survey listed down forms of poverty in the country and her ministry was following up on the recommendations. (ICE-FJ:W2C-013) |
2.2 Different particle
Sentences 6 to 11 are examples from the Fiji and Indian data in which unusual prepositions are used in combination with various verbs; a common phenomenon is the use of off for of, and into instead of in. The former may also arise from a typing mistake, but the frequency of its occurrence renders such an interpretation unlikely. It might be argued that the distinction between these prepositions is comparatively fine-grained and thus a pre-determined point of confusion in English. However, we do not aim at explaining the motivation of the phenomena we describe here; further research on the cognitive processes linked to the semantic perception of these two sets of prepositions will have to be undertaken.
(6) |
One of the side effects of alcohol is that it rids our body off nutrients, and the reason we feel like a truck has rolled over our heads is because we need vitamins to function. (ICE-FJ:W2D-012) |
(7) |
We have allowed racism to manifest itself into the education system directly or indirectly through our actions or through the examples we have set to our students as role models. (ICE-FJ:W2B-007) |
(8) |
Some of these are; women involving themselves into prostitution, selling their infants, migration across world, low standard in society. (ICE-FJ:W1A-018) |
(9) |
In some situations, however, waste can be a big health hazard and must be disposed off properly, for example by sanitary land fill. (ICE-IND:W2A-031) |
(10) |
Raju’s work has eased out a bit. (ICE-IND:W1B-014) |
(11) |
This resulted into a deep sense of growing loneliness which affected the individual life. (ICE-IND:W2A-005) |
2.3 Un-idiomatic usage
The last type of new verb-preposition combinations occurs in two sub-types: combinations that were used in contexts that do not match the interpretations given by dictionaries (examples 12 to 14), and combinations in which an existing particle verb is used where the simple verb would be the more appropriate choice (examples 15 to 18). These instances of un-idiomatic usage benefit from semantic evaluation of the context in which the verb occurs; the combination itself is recorded and thus difficult to detect by automatic retrieval methods. At best, they could be automatically detected based on the increased frequency of the particular verb-preposition combination.
(12) |
By the by, we’re looking for a media person, someone who can front up to the hacks without crumbling. (ICE-FJ:W2F-016) |
(13) |
Tukania picked up competitive football in 1980 and a year later forced his way into national coach late Billy Singh’s South Pacific Games squad. (ICE-FJ:W2C-019) |
(14) |
When the migrant races want to dominate us economically and now politically, through the 1997 Constitution, even though we have a higher population distribution of 51 per cent, the so-called democratic system does not stack up for our rights and differences. (ICE-FJ:W2B-012#34:1) |
(15) |
Coming over to play Fiji is an experience no one can rob them of, it’s not about the win but the exposure and the pride to play up against one of the world’s best is all that counts,” she said. (ICE-FJ:W2C-007) |
(16) |
Indian women were mostly dressed up in sarees. Even the woman indentured labourers came to work on plantations in sarees. (ICE-FJ:W2B-008) |
(17) |
Exceed that and it will be hello hangover the following day. Drink up water in betweens, it will fill you up very quickly, and the many trips to the bathroom will flush out the alcohol. (ICE-FJ:W2D-012) |
(18) |
The women participation in labour-force, more than doubled up between 1960-90. (ICE-IND:W2A-005) |
2.4 Results
The manual method described above consists of combining a lexical surface search for the function words in multi-word verbs with a manual filtering process and analysis of the hits. It produced a significant number and range of results, i.e., a variety of new particle-verb combinations from all national varieties of English under observation. It has to be stressed, however, that this method is only concerned with detecting possible new combinations, not with assessing their status. Whether a new combination finally enters the lexicon of a particular variety of English and becomes standardised will have to be investigated on the basis of larger amounts of data, or by follow-up investigations of a diachronic nature. For the time being, it cannot be ruled out that a considerable number of new combinations are potential nonce formations (see example 19 and 20).
(19) |
Manju was in Goyle, a nearby village, and Manju’s parents were always coconut wirelessed about her health and happenings. (ICE-FJ:W2F-013) |
(20) |
One of the creepers had tubers the size of large turnips that we had to tomahawk out. (ICE-NZ:W1B-008) |
However, the value of this analysis for determining potential starting points for further investigation of structural nativisation cannot be denied. Along the same line of argumentation, we refrain from judging whether the phenomena we report are indeed instances of variety-specific usage or performance errors. We believe that this distinction is above all a question of ideology; phenomena that are interpreted as instances of structural nativisation by variationist linguists are often seen as learner errors or substratum interference within the paradigm of second language acquisition, or slips of the tongue in the field of psycholinguistics. In the future, studies based on larger amounts of text will hopefully give a clearer picture of the respective frequencies; slips of the tongue will remain singular or at least rare occurrences, while structural nativisation phenomena will report more hits.
3. Parsing method
3.1 Using parsers for descriptive linguistics
Parsing technology has made considerable advances recently, opening new perspectives for descriptive linguistics. Van Noord and Bouma (2009: 37) state that “[k]nowledge-based parsers are now accurate, fast and robust enough to be used to obtain syntactic annotations for very large corpora fully automatically.” We apply parsed corpora as a new resource for linguists. Automatically parsed treebanks, also called tree jungles, have been used for e.g. Danish (Bick 2003) and French (Bick 2010). No treebanks for English regional varieties or World Englishes exist yet. In this situation, automatically parsed corpora can be used as a stopgap to Treebanks. We have parsed the available ICE corpora and many other large corpora like the BNC using a dependency parser (Schneider 2008).
The semi-automated corpus-driven approach using parsed corpora is described in detail in Schneider and Hundt (2009). Here we apply it to the detection of variety-specific prepositional collocations. Advantages of (semi-)automatic, parse-based methods are that they are fast and corpus-driven, which may increase recall. A disadvantage is that error-rates are still relatively high in automatic parsing, which seriously affects precision. The small size of the ICE corpora poses an additional challenge: The detection of rare collocations is particularly difficult due to the low counts.
We have used a probabilistic dependency parser, Pro3Gres (Schneider 2008), which is quite fast (the BNC parses in 24 hours) and which has been evaluated on several genres and varieties (Haverinen et al. 2008, Lehmann and Schneider 2009). The grammar can be adapted manually to genres and varieties. We have used the same grammar on all ICE corpora, in order not to risk adding skews. The parser is suitable for parsing different varieties of English, because it is robust and because its output has been evaluated on a number of English varieties (Schneider and Hundt 2009). For example, it does not enforce subject-verb agreement, it allows zero-determiners everywhere, it uses statistical preferences instead of strict subcategorisation frames. This entails for example that non-ditransitive verbs can act as ditransitive, and that prepositional phrases with divergent prepositions get attached, a feature that we need for our investigation here. The parser outputs intuitive dependency relations. A subset of them is given in table 3. Verb-PP (prepositional phrase) attachment is expressed by the dependency relation pobj.
RELATION
|
LABEL
|
EXAMPLE
|
verb–subject
|
subj
|
he sleeps
|
verb–direct object
|
obj
|
sees it
|
verb–second object
|
obj2
|
gave (her) kisses
|
verb–adjunct
|
adj
|
ate yesterday
|
verb–subord. clause
|
sentobj
|
saw (they) came
|
verb–pred. adjective
|
predadj
|
is ready
|
verb–prep. phrase
|
pobj
|
slept in bed
|
noun–prep. phrase
|
modpp
|
draft of paper
|
noun–participle
|
modpart
|
report written
|
verb–complementizer
|
compl
|
to eat apples
|
noun–preposition
|
prep
|
to the house
|
Table 3. Important dependency relations that are output by Pro3Gres
3.2 Parser evaluation
An evaluation of the performance on subject, object PP-attachment and subordinate clause relations, using the 500 sentence GREVAL gold standard (Carroll et al. 2003), is given in table 4. Compared to others parsers, these rates are competitive (Schneider 2008). While some of the performance values may appear low at the first sight, the following facts alleviate the impact of errors: first, only precision errors indicate a wrong assertion, while recall errors entail that an instance has been missed, the signal remains unaffected. Second, errors are largely unsystematic, which means that the signal is weakened but skewed much less than by the error rate. Third, PP-attachment performance on complements (which is what we mainly need for this application) is better than on adjuncts.
|
Subject |
Object |
PP-attachment |
clausal |
Precision |
92.3% (865/937) |
85.3% (353/414) |
76.9% (702/913) |
74.3% (451/607) |
Recall |
78.0% (865/1095) |
82.5% (353/428) |
68.6% (702/1023) |
61.7% (451/731) |
Table 4. Performance on the GREVAL gold standard corpus
In order to assess if performance is affected by variational differences, we have manually evaluated 100 random sentences from ICE GB and ICE Fiji and found similar performance (Lehmann and Schneider 2009, Schneider and Hundt 2009).
3.3 Detecting rare PP-collocations semi-automatically
As our method for detecting PP-collocations, we use the “surprise about” finding specific verb-preposition or verb-particle combinations. We use O / E (Observed / Expected) as measurement. We decided to use O / E instead of the t-test or log-likelihood, which are frequently used for the detection of collocations (see e.g. Evert 2009) for the following reasons: First, O / E is a measure of surprise, not of statistical significance. Collocation significance does not directly correspond to collocation strength, a measure of surprise may serve as a better proxy to measuring collocation strength. Second, O / E has the characteristic that it tends to give particularly high scores to rare events, which is beneficial for our purpose, as many of the new verb-PP combinations which we are investigating are very rare. In fact, they are often too rare to reach statistical significance. As considerable manual interaction is needed in our approach, manual validation of the suggestions made by the computer replaces the need for statistical significance. Third, O / E has been shown to work well for rare collocations, particularly if relatively clean data is used. Lehmann and Schneider (2009) use parsed data from the BNC and other large corpora to detect PP-collocations with O / E. While windows-based methods using O / E typically report a large amount of garbage in the top-ranked positions, O / E on parsed data delivers considerably better results (Lehmann and Schneider 2009). Windows-based methods (e.g. Stubbs 1995) are still commonly used for collocation detection. They use an observation window from N words before to N words after a key word (e.g. a verb) and count all words inside the window as co-occurrence. N is typically about 3. The distinction between different types of collocations (e.g. subject-verb, verb-object and verb-PP) is often left underspecified.
Windows-based methods typically lead to relatively many errors, both precision errors (false positives) and recall errors (false negatives). They suffer from precision errors due to the lack of implicit head extraction and due to the fact that words appearing close together are often not syntactically related. In the example sentence We report on the Epstein Barr virus will spread windows-based methods typically also report report on Epstein and report on Barr as verb-PP collocation counts due to the lack of head extraction. In the example sentence The virus we reported on last week has dangerous consequences windows-based methods typically report week and possibly, depending on N, consequences, as verb-PP collocation counts.
Recall is intrinsically low with windows-based methods because many of the dependencies appear further then N words away. Recall can be increased by increasing N, but at a forbidding cost of decreasing precision. We do not use the O / E measure directly, but we compare O / E obtained from an individual ICE corpus to O / E measures obtained from the BNC, in order to express how much more surprising the frequency of a verb-PP combination is in the ICE corpus under investigation, i.e. how much stronger a collocation is in an ICE corpus in comparison to the BNC. We compare to the BNC instead of ICE GB, because with ICE GB we experienced a serious sparse data problem. Very many verb-PP combinations, also many that are perfectly acceptable in British English, do not occur in ICE GB, while most of them appear in the BNC. As the ICE corpora are relatively small for our investigation, using a sufficiently large base of comparison can partly alleviate the sparse data problem.
We compare O / E measures by calculating a ratio. For the example of ICE Fiji, the formula is:
where N is corpus size, R is the verb-PP attachment relation (pobj, see table 3), w1 the head verb, w2 the preposition or verbal particle.
This formula assigns a value to the hundreds of verb-PP combinations that are seen in both corpora. The O / E ratio is above 1 if the collocation is more frequent in ICE Fiji (or whichever ICE corpus we apply), and below 1 if it is more frequent in the BNC. We are particularly interested in very high ratios, so we filter the list of verb-PP combinations, for example only to O / E ratio > 10 (i.e. at least ten times more surprising in ICE Fiji). The list thus obtained contains some surprising collocations and some collocations that are also acceptable and frequent in British English. The latter usually have high O / E values in the BNC, and due to coincidence, small corpus size, text selection, semantic content, etc. end up being more frequent in ICE Fiji. In order to filter them out, we also set a threshold on O / E values from the BNC: only O / E values below a certain threshold (we have used 3 in table 5), i.e. combinations that are not strong collocations also in the BNC are allowed.
We have also looked at verb-PP combinations that are present in an ICE corpus but absent in the BNC.
3.4 New verb-PP combinations in ICE Fiji
If we set the filter to O / E ratio > 10 and O / E in the BNC < 3 we get the list shown in table 5.
O / E ratio
|
Head
|
Prep
|
f (Fiji)
|
O / E (Fiji)
|
O / E (BNC)
|
manual inspection comment
|
14.4021
|
regard
|
to
|
7
|
41.9521
|
2.91292
|
serendipitous: he or she will be reading in regards to a bigger picture
|
14.616
|
cause
|
on
|
3
|
34.3407
|
2.34952
|
yes: The thought of how much anxiety he had caused on his parents ...
|
19.7136
|
stick
|
as
|
2
|
42.1458
|
2.1379
|
no
|
10.9451
|
pick
|
to
|
2
|
11.5253
|
1.05301
|
yes: allow me to pick my team to the world cup
|
33.9525
|
join
|
into
|
2
|
52.5526
|
1.54783
|
yes: Women by joining into these organisation benefit a lot
|
11.1615
|
involve
|
into
|
2
|
24.255
|
2.17311
|
yes: women involving themselves into prostitution
|
33.3689
|
include
|
into
|
2
|
65.2377
|
1.95505
|
yes: they have included rare ... species ... into the displays
|
22.3632
|
implicate
|
for
|
2
|
46.4807
|
2.07845
|
no
|
472.801
|
gather
|
upon
|
2
|
895.141
|
1.89327
|
yes, adjunct: upon evaluating the ... Education Act, it was gathered that ....
|
15.2663
|
explain
|
from
|
2
|
40.2206
|
2.6346
|
no, consistent parsing error
|
81.3601
|
engage
|
through
|
2
|
167.625
|
2.06028
|
no
|
31.246
|
concentrate
|
from
|
2
|
54.5852
|
1.74695
|
no
|
48.866
|
capable
|
in
|
2
|
14.2045
|
0.290684
|
yes, adjective: are capable in committing themselves to work
|
61.3927
|
arrive
|
into
|
2
|
43.9975
|
0.716656
|
yes: Megan Simpson is expected to arrive into the country
|
Table 5. Results from ICE Fiji. For O / E ratio > 10 verbs and O / E (BNC) < 3
The first column displays the O / E ratio as given in the formula. The second column contains the verb, and the third column the preposition in the PP-attachment relation. The fourth column, f (Fiji) reports how often the verb-PP combination is seen in ICE Fiji. Note that most of these values are very low, too low to reach statistical significance, which is one of the reasons why we have chosen O / E. In fact, we have also tested log-likelihood measures and obtained slightly worse results. Columns 5 and 6 show O / E from the two corpora. The last column contains our manual assessment (‘yes’ meaning this is a new Fijian verb-PP combination, ‘no’ meaning probably not) and an example for the cases where we have a typically Fijian verb-PP collocation. False positives are due to many different reasons; we have observed two as particularly frequent: first, consistent parsing errors. Second, the parser as we have used it here underspecifies the distinction between PP-argument and PP-adjunct, in order to increase recall: Unusual verb-PP argument combinations would hardly ever be recognised by the parser otherwise. This entails that frequent adjuncts, for example “concentrated ... from”, which occurs repeatedly in scientific texts, appears in the list, or “upon ... it was gathered” which appears in judicial texts. We have decided to report the latter as it may be a candidate for very formal, seemingly slightly archaic expressions, which are generally more frequent in Asian English than in today’s British English.
From a semantic perspective, many of these examples have been noted to display the “tendency to make the direction expressed in verbs of movement more explicit, even if this is already present in the meaning of the verb” (Nesselhauf 2009: 20, also see Zipp forthcoming). Some of these combinations of directional nature have been described before in selected New Englishes (e.g. Mukherjee 2009: 123, Nesselhauf 2009: 18, Sedlatschek 2009); the following are examples found in our data: arrive into, include into, join into. They can be seen as supporting the image of entering into a framed container or clearly framed status. Others may be seen as further specifying the verb meaning, for example pick to (restricting the meaning to select, which partly overlaps with pick), or as supporting the verb meaning, for example cause on (the preposition to is fairly neutral as in give to, offer to, the preposition on is negative, conjuring up exert on, put on, looming on, impending on).
As we can see in table 5, about half of the reported verb-PP collocations are false positives, so-called “garbage”. Our approach does not intend to be fully automatic, and we are not aware of a fully automatic approach. Since the counts are too low to reach statistical significance, and since the corpus linguist is interested in assessing and interpreting the results anyway, the manual filtering involved is usually acceptable and less work-intense than reading the whole corpus. In applications where the focus is on recall, less strict filters are used and a linguistic annotator has to filter more false positives. For example, with the very high O / E ratio > 40, but no O / E (BNC) threshold we get the list in table 6 from ICE-Fiji. We have selected thresholds that deliver interesting and particularly different results.
O / E ratio
|
Head
|
Prep
|
f (Fiji)
|
O / E (Fiji)
|
O / E (BNC)
|
manual inspection comment
|
381.198
|
reduce
|
amongst
|
7
|
1412.92
|
3.70652
|
no, consistent parsing error
|
89.4924
|
educate
|
than
|
3
|
763.081
|
8.52677
|
no
|
132.128
|
wrap
|
over
|
2
|
916.23
|
6.93443
|
no
|
169.581
|
tread
|
because
|
2
|
2287.58
|
13.4896
|
no
|
86.2943
|
renew
|
through
|
2
|
670.498
|
7.7699
|
no
|
121.186
|
renew
|
because
|
2
|
1715.69
|
14.1574
|
no
|
91.7332
|
poll
|
as
|
2
|
358.24
|
3.90523
|
no
|
51.3641
|
miss
|
without
|
2
|
474.255
|
9.2332
|
no
|
52.5289
|
know
|
behind
|
2
|
176.812
|
3.366
|
no, parsing error
|
50.3294
|
influence
|
towards
|
2
|
511.696
|
10.1669
|
yes: Leadership is defined as the ability to influence people towards the attainment of goals
|
472.801
|
gather
|
upon
|
2
|
895.141
|
1.89327
|
yes, see table 5
|
81.3601
|
engage
|
through
|
2
|
167.625
|
2.06028
|
no
|
130.402
|
enable
|
despite
|
2
|
5555.56
|
42.6032
|
no
|
124.367
|
award
|
over
|
2
|
732.984
|
5.89372
|
no
|
61.3927
|
arrive
|
into
|
2
|
43.9975
|
0.716656
|
yes, see table 5
|
429.654
|
anticipate
|
within
|
2
|
3381.64
|
7.87063
|
|
Table 6. Results from ICE Fiji with O / E ratio > 40, but no O / E (BNC) threshold
While returning more false positives this list also contains a new finding, in which the preposition also seems to re-iterate the verb meaning : influence someone towards something. Repetitions of similar constructions affect the results. As several student essays in the ICE Fiji corpus are on the same topic, some combinations appear often: “reduce poverty amongst women”, “... are more educated than ... ”, and “Through interpretation, tourist begins to engage” appear in more than one student essay. In essence, these are sparse data problems.
3.5 New verb-PP combinations in ICE India
We have applied the same formula on other ICE corpora, particularly on other L2 corpora, where exonormative standardisation can be expected
For ICE India, using O / E ratio > 35 and O / E (BNC) < 3 as thresholds we obtain the results listed in table 7.
O / E ratio
|
Head
|
Prep
|
f (India)
|
O / E (India)
|
O / E (BNC)
|
manual inspection comment
|
80.6962
|
discuss
|
about
|
10
|
148.012
|
1.83419
|
yes: You come we will discuss about it.
|
51.3664
|
study
|
about
|
7
|
67.7127
|
1.31823
|
yes: Today we are studying about rotation and revolution of the earth.
|
705.33
|
advise
|
into
|
7
|
279.731
|
0.396597
|
no, consistent parsing error
|
39.8306
|
result
|
into
|
5
|
55.3685
|
1.3901
|
yes: This resulted into a deep sense of growing loneliness
|
78.7867
|
burst
|
of
|
5
|
234.214
|
2.97276
|
no
|
53.0517
|
arrest
|
from
|
5
|
59.374
|
1.11917
|
yes: five more terrorists were arrested from his home
|
93.5978
|
etch
|
at
|
3
|
147.232
|
1.57303
|
no
|
67.2343
|
withstand
|
to
|
2
|
139.353
|
2.07265
|
no
|
46.6381
|
significant
|
on
|
2
|
33.1642
|
0.711096
|
no
|
45.8399
|
nice
|
on
|
2
|
70.0133
|
1.52734
|
no
|
84.4974
|
line
|
of
|
2
|
120.453
|
1.42552
|
no
|
47.4123
|
land
|
into
|
2
|
102.124
|
2.15396
|
yes: Atul’s tendency of worrying too much ... landed him into trouble
|
107.968
|
exciting
|
on
|
2
|
315.06
|
2.9181
|
no
|
214.685
|
benefit
|
out
|
2
|
128.156
|
0.596949
|
yes: So they’ll benefit out of the faculty teaching
|
Table 7. Examples from ICE-India. For O / E ratio > 35 verbs and O / E (BNC) < 3
Again, there is a considerable amount of false positives that need to be filtered. The verb-PP combination discuss about is frequent in L2 and learner English as a whole (see above).
The parser output for one of the example sentences is given in figure 2. Result into is another example where the prepositional semantics is used to support the verb meaning. What is special in this case is that the existing phrasal verb result in which contains an opaque, semantically non-compositional particle in is rendered more transparent by the use of the preposition or particle into. This leads to a very similar construction which is probably ungrammatical in Standard English but has transparent semantics.
For the experiments of section 3.4, we have hitherto used the entire ICE India, including the spoken part. This leads to a biased comparison, both compared to ICE Fiji in section 3.3 and to the manual method in section 2, where only the written sub-corpora of ICE India are used.
For O / E ratio > 8 and O / E (BNC) < 3 we get the short list given in table 8.
O / E ratio
|
Head
|
Prep
|
f (India)
|
O / E (India)
|
O / E (BNC)
|
manual inspection comment
|
17.7654
|
value
|
to
|
3
|
49.0517
|
2.76108
|
no
|
9.70496
|
such
|
in
|
3
|
1.52625
|
0.157265
|
no
|
20.5223
|
result
|
into
|
3
|
28.528
|
1.3901
|
yes, see table 7
|
69.9399
|
line
|
of
|
2
|
99.7009
|
1.42552
|
no
|
12.5842
|
issue
|
out
|
2
|
31.1721
|
2.47708
|
yes: A thesis will not be issued out of the Library
|
11.2238
|
influence
|
on
|
2
|
30.9885
|
2.76097
|
yes: There is also some political factor which also influences on cultural ...
|
74.3361
|
exciting
|
on
|
2
|
216.92
|
2.9181
|
no, almost identical sentence twice in same
|
9.33135
|
add
|
into
|
2
|
19.613
|
2.10184
|
no, parsing error
|
Table 8. Results on ICE India subpart, O / E ratio > 8 and O / E (BNC) < 3
We get fewer hits and fewer true positives, as the sparse data problem is considerably more acute. There is probably less structural nativisation in written texts than in spoken texts, which also contributes to the better results we get when including the spoken part. Repetitions of similar sentences also affect our findings. Interestingly, we also get two new findings.
3.6 Further verb-PP combinations
As mentioned, if we use less strict thresholds we get longer lists with much lower precision, but more instances are recalled. Going through longer lists lead to the following additional findings.
In ICE Fiji:
(21) |
Papua New Guinea where its Constitution emphasises on equal participation by women citizens (ICE-FJ:W1A-016) |
(22) |
... today the indigenous Fijians are still marginalised from the development process (ICE-FJ:W2B-012) |
(23) |
The downloaded data was collated, analyzed and summarized into Table III. (ICE-FJ:W2A-033) |
(24) |
I can’t sleep from worrying. (ICE-FJ:W2F-017) |
Example 24 contains an adjunct, which we have included in the automatic approach. Typically, collocations are complements, but many adjuncts can also be found, for example sigh with relief, roar with laughter, appear before magistrate, prove beyond doubt (Lehmann and Schneider 2011).
In ICE India: Written only:
(25) |
Of course modern technology is improving the quality and hence even the hardened antagonists are switching over to them. (ICE-IND:W2D-019) |
(26) |
You had the guts of your blighted mother to complain against us to the Governor. (ICE-IND:W2F-018) |
(27) |
Wings are absent to apterygotes. (ICE-IND:W1A-019) |
(28) |
The rule is that the company is the right person to sue and that it is not open to the individual members to assume to themselves the right of suing in the name of the company (ICE-IND:W2A-016) |
Including Spoken:
(29) |
He was using the stones and preparing instruments out of it (ICE-IND:S1A-072) |
(30) |
he has described all about that. (ICE-IND:S1A-092) |
(31) |
the government of late has decided to slash down the export target for the year. (ICE-IND:S1B-056) |
(32) |
... retro-rockets were automatically fired to slow off the spaceship. (ICE-IND:S1B-006) |
(33) |
he tried to enlighten the people and be aware towards all these irregularities. (ICE-IND:S1A-007) |
We have also conducted experiments on verb-PP combinations that occur several times in ICE Fiji but are entirely absent in the BNC. The lists are dominated by adjuncts and by repetitions. The list of unseen verb-PP combinations that appear at least twice in ICE Fiji are given in table 9.
O / E ratio*
|
Head
|
Prep
|
f (Fiji)
|
manual comment
|
6273 |
strengthen |
along |
4 |
no, repetitions |
3310 |
thump |
up |
2 |
no, repetitions |
555 |
nest |
around |
2 |
no |
330 |
download |
by |
2 |
no, internet age |
1029 |
discriminate |
since |
2 |
no, repetitions |
3463 |
cut |
onto |
2 |
no, repetitions |
52 |
crosslink |
with |
2 |
no |
2298 |
collaborate |
with |
2 |
no |
117 |
choreograph |
for |
2 |
no |
137 |
chirp |
in |
2 |
no |
348 |
bag |
from |
2 |
no, parsing error |
*A frequency of 0.1 was assumed for all unseen events, which makes it possible to calculate O / E for unseen combinations. Such smoothing techniques are standardly used in statistics.
Table 9. Verb-PP combinations unseen in the BNC while occurring at least twice in ICE Fiji
Including hapax legomena leads to a considerably longer lists, many false positives but also a few true positives, which are given in the following.
In ICE India: Written only:
(34) |
Adi Asenaca said an Asian Development Bank poverty participation survey listed down forms of poverty in the country and her ministry was following up on the recommendations. (ICE-FJ:W2C-013) |
(35) |
Many still insist that they can get formal education due to insufficient funds and how to indulge into such activities where they get easy money and feed themselves. (ICE-FJ:W1A-020) |
(36) |
Ravi watched horrified as his mother crashed towards the floor. (ICE-FJ:W2F-012) |
(37) |
As a result some or nearly most women in the world have now turned to becoming prostitutes in order to cope up with poor living standard, they may be experiencing. (ICE-FJ:W1A-016) |
4. Comparison of methods and conclusion
Based on the results that we obtained and our experiences with the processes of both the manual and the semi-automatic method, we compare advantages and disadvantages of each method in this section. The two methodological approaches to verb-preposition combinations both presented viable options with a number of results.
The particular advantages of the manual method described in section 2 of the present paper are the following: The analysis is very fine-grained, with high precision and recall. It is self-contained within each corpus under observation, which entails that phenomena can be detected for each variety without the need for a database for comparison. Furthermore, the method grants control over the scope of analysis; only manual analysis allows for a context-based semantic examination (see section 2.3). The disadvantages of the manual method, on the other hand, are first and foremost that it is very tedious and time-consuming. Therefore, it is technically barely possible to conduct it in connection with large corpora or highly frequent prepositions. Second, this method relies on a predetermined starting point, i.e. a set of prepositions, as well as on codified and standardised dictionaries to assess the status of unrecorded verb-preposition combinations.
The semi-automatic method has advantages and disadvantages as well. A first advantage is that the method is corpus-driven, no prior set of prepositions needs to be assumed to start with, and theoretically, findings that are entirely different from those reported previously could be found. Second, the method scales well, not only to all prepositions, but also e.g. to adjective-PP combinations or to much larger texts. As sparse data is a serious issue, this method can only use its full potential when applied to much larger corpora. We will take a step into this direction in section 5. A first disadvantage of the semi-automatic method is that it misses many instances; it has relatively low recall, especially of semantically fine-grained distinctions. A second disadvantage is that manual interaction is still needed, the suggested results contain very many false positives. Furthermore, the method is particularly sensitive to duplicates, i.e. the same construction occurring several times in the same or a thematically related text.
We have partly found the same new verb-PP constructions using two diametrically opposed methods, and partly found different verb-PP constructions. Considering the different results, the methods complement each other precisely as they are very different in nature: they allow a researcher to attain much higher recall than either of the two methods on its own. The large overlap in results validates both approaches and gives one an assessment of the recall of each method.
5. Outlook: Scaling up with the Statesman Corpus (Semi-Automatic)
We are using existing ICE corpora for this pilot study, but the aim is to apply the same methodology to larger, web-derived (and thus somewhat ‘messier’) data. Using larger texts in the future will hopefully allow us to get a clearer distinction between production errors (slips of the tongue, typos, etc.) and structural nativisation phenomena: production errors remain nonce or rare occurrences, while structural nativisation phenomena report more hits. Concerning the semi-automatic method, we are using larger corpora, for example a subset of the Indian The Statesman newspaper and again compare to the BNC. A 3 million words excerpt of The Statesman Newspaper Archive excerpt has for example given us the findings listed in Table 10. We show the results obtained after manual filtering. We generally get more hits than on the small ICE corpora, but we also get many near-duplicates as newspaper articles may be repetitive. The results are also affected by genre differences: in comparison to the BNC, all or almost all texts come from the news genre.
verb
|
prep
|
f
|
Example
|
arrest
|
from
|
128
|
Seventeen contractual workers were arrested from the spot.
|
emphasise
|
on
|
12
|
He was a great reformer and throughout his life emphasised on the concepts of women’s education and women’s empowerment.
|
attach
|
with
|
8
|
... they would have to exercise caution in attaching themselves with projects they are not comfortable with.
|
aware
|
about
|
5
|
Tara Cancer Foundation, an NGO has been set up ... to make people aware about cancer ...
|
alert
|
about
|
5
|
However, we have to be alert about any possible attack .
|
aspire
|
for
|
4
|
It is Bollywood and not serious film makers that aspires too much for the Oscar glory these days.
|
list
|
out
|
4
|
I just can not understand your logic, he said and listed out statistics on funds allocated for various rural development projects.
|
discuss
|
about
|
3
|
The two also discussed about the entry of foreign educational institutions in India ...
|
dismiss
|
off
|
3
|
... but the South African had the last laugh by dismissing him off the last ball of the over.
|
devote
|
for
|
3
|
... Dr Ambedkar devoted his life for social justice for backward classes in the country.
|
rid
|
off
|
2
|
... and eight balls later Ajantha Mendis got rid off Sarwan ...
|
blind
|
into
|
2
|
the ... government in France had been blinded by supposed French interests in the region into siding with radical ... Hutu groups.
|
Table 10. New Verb-PP combinations found using the semi-automatic method on an excerpt of the The Statesman Newspaper Archive
Sources
Indian The Statesman newspaper: http://www.thestatesman.com
Bibliography
Bautista, Maria Lourdes S. & Andrew B. Gonzales. 2006. “Southeast Asian Englishes.” The Handbook of World Englishes, ed. by Braj B. Kachru, Yamuna Kachru & Cecil L. Nelson, 130–44. Malden, MA: Blackwell.
Biber, Douglas, Susan Conrad & Randi Reppen. 1998. Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.
Bick, Eckhard. 2003. “A CG & PSG hybrid approach to automatic corpus annotation”. Proceedings of SProLaC2003, ed. by Kiril Simow & Petya Osenova, 1–12. Lancaster: Lancaster University. http://www.bultreebank.org/SProLaC03Proceedings.html
Bick, Eckhard. 2010. “FrAG, a hybrid constraint grammar parser for French”. Proceedings of LREC 2010, ed. by Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias. Valletta: European Language Resources Association (ELRA).
Biewer, Carolin. 2009. “Modals and semi-modals of obligation and necessity in South Pacific Englishes”. Anglistik 20(2): 41–55.
Biewer, Carolin, Marianne Hundt & Lena Zipp. 2010. “How a Fiji corpus? Challenges in the compilation of an L2 ICE component.” ICAME Journal 34: 5–23
Carroll, John, Guido Minnen & Edward Briscoe. 2003. “Parser evaluation: using a grammatical relation annotation scheme”. Treebanks: Building and Using Parsed Corpora, ed. by Anne Abeillé, 299–316. Dordrecht: Kluwer.
Collins COBUILD Phrasal Verbs Dictionary. 2002. John Sinclair, ed. Glasgow: HarperCollins
Collins COBUILD Advanced Learner’s Dictionary (Resource Pack CD) - Lingea Lexicon. 2003. Glasgow: HarperCollins.
Evert, Stefan. 2009. “Corpora and collocations”. Corpus Linguistics. An International Handbook, article 58, ed. by Anke Lüdeling & Merja Kytö, 1212–1248. Berlin: Mouton de Gruyter.
Foley, Joseph A., ed. 1988. New Englishes: The Case of Singapore. Singapore: Singapore University Press.
Haverinen, Katri, Filip Ginter, Sampo Pyysalo & Tapio Salakoski. 2008. “Accurate conversion of dependency parses: targeting the Stanford scheme”. Proceedings of Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland, 2008.
Lehmann, Hans Martin, & Gerold Schneider. 2009. “Parser-Based Analysis of Syntax-Lexis Interaction”. Corpora: Pragmatics and Discourse. Papers from the 29th International conference on English language research on computerized corpora (ICAME 29) (Language and computers 68), Ascona, Switzerland, 14–18 May 2008, ed. by Andreas H. Jucker, Daniel Schreier & Marianne Hundt, 477–502. Amsterdam: Rodopi.
Lehmann, Hans Martin – Gerold Schneider. 2011. “A large-scale investigation of verb-attached prepositional phrases”. Methodological and Historical Dimensions of Corpus Linguistics, ed. by Paul Rayson, Sebastian Hoffmann & Geoffrey Leech. (Studies in Variation, Contacts and Change in English 6). Helsinki: Research Unit for Variation, Contacts, and Change in English. http://www.helsinki.fi/varieng/series/volumes/06/lehmann_schneider/
Mukherjee, Joybrato & Sebastian Hoffmann. 2006. “Describing verb-complementational profiles of New Englishes. A pilot study of Indian English.” English World-Wide 27(2): 147–173.
Mukherjee, Joybrato. 2009. “The lexicogrammar of present-day Indian English”. Exploring the Lexis-Grammar Interface, ed. by Ute Römer & Rainer Schulze, 117–135. Amsterdam: John Benjamins.
Nesselhauf, Nadja. 2009. “Co-selection phenomena across New Englishes. Parallels (and differences) to foreign learner varieties”. English World-Wide 30(1): 1–26.
van Noord, Gertjan & Gosse Bouma, 2009. “Parsed Corpora for Linguistics”. Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or Vacuous? Athens, Greece, 33–39.
Olavarría de Ersson, Eugenia, & Shaw, Philip. 2003. “Verb Complementation Patterns in Indian Standard English”. English World-Wide 24(2): 137–161.
Oxford English Dictionary Online. 2009. Oxford University Press. http://www.oed.com
Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech & Jan Svartvik. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Sand, Andrea. 2004. “Shared Morpho-syntactic features in contact varieties of English: Article use”. World Englishes 23(2): 281–298.
Schneider, Edgar W. 2004. “How to trace structural nativization: particle verbs in world Englishes”. World Englishes 23(2): 227–249.
Schneider, Edgar W. 2007. Postcolonial English. Varieties around the world (Cambridge Approaches to Language Contact). Cambridge: Cambridge University Press.
Schneider, Gerold. 2008. Hybrid Long-Distance Functional Dependency Parsing. Ph.D. dissertation, Institute of Computational Linguistics, University of Zurich.
Schneider, Gerold & Marianne Hundt. 2009. “Using a parser as a heuristic tool for the description of New Englishes.” Proceedings of the Fifth Corpus Linguistics Conference, Liverpool, 20–23 July 2009. http://ucrel.lancs.ac.uk/publications/cl2009/
Schreier, Daniel. 2003. Isolation and Language Change: Contemporary and Sociohistorical Evidence from Tristan da Cunha English (Palgrave Studies in Language Variation 1). Houndmills/Basingstoke & New York: Palgrave Macmillan.
Sedlatschek, Andreas. 2009. Contemporary Indian English: Variation and Change. Amsterdam & Philadelphia: John Benjamins.
Stubbs, Michael, 1995. “Collocations and semantic profiles: on the cause of the trouble with quantitative studies”. Functions of Language 2(1): 23–55.
Villavicencio, Aline. 2006. “Verb-Particle Constructions in the Wold Wide Web”. Syntax and Semantics of Prepositions, ed. by Patrick Saint-Dizier, 115–130. Dordrecht: Springer.
Xiao, Richard. 2009. “Multidimensional analysis and the study of world Englishes”. World Englishes 28(4): 421–450
Zipp, Lena. forthcoming. Exo- and endonormative models in Fiji – A corpus-based study on the dynamics of first and second language varieties with a focus on Indo-Fijian English. Ph.D. dissertation, English Department, University of Zurich.
|
|