The search for repulsion: a new corpus analytical approach *

Antoinette Renouf & Jayeeta Banerjee
Research and Development Unit for English Studies, Birmingham City University

Abstract

In the last two decades, our research has centred on word collocation and its role in the construction of meaning in text. In this paper, we propose that there is a 'force', which we call 'repulsion', that operates in an opposing way to that of lexical collocation. By 'repulsion', we mean the intuitively-observed tendency in conventional language use for certain pairs of words not to occur together. We write within the context of a large-scale study, which has the goal of establishing how repulsion operates in text and whether it has the status of an objective and measurable 'force'. We are interested in identifying the process of actual distancing between words, rather than just enumerating the instances where word co-occurrence is prohibited by other factors such as grammatical norms, and we further wish to make a clear distinction between cases of 'indifference' and of active repulsion. This is a hitherto unexplored aspect of language in use, and we hope to develop an objective 'lexical repulsion' measure, capable of providing insights into text creation which will be of use in lexicology, language pedagogy and NLP.

1. Introduction

Until now, our research focus at the Research and Development Unit for English Studies (RDUES) has been on collocation (Firth 1951), the way in which certain words significantly prefer each other's company, whether in adjacent pairings or in discontinuous phrasal frameworks. This information is of intrinsic interest to the linguist, but we have found that we can also exploit the resultant inventory of word pairs in various practical ways; for instance, in automating the finding of sense-related words on the basis of their collocational similarities (e.g. Renouf 1996); or in identifying changes in word use across time through changes in their collocational behaviour (e.g. Renouf 1993; Pacey et al. 1998).

In this paper, we propose to introduce and explore the notion that there is another 'force', which we shall call 'repulsion' [1], that operates on the construction of text in an opposing way to that of collocational 'attraction'. We start from an intuition, based on long-term observation of textual organisation, that there is a tendency in conventional language use to avoid putting certain pairs of words together - for instance, it is conventional in English to say impeccable + manners, but not spotless + manners. We describe and discuss the method by which we try to establish how this phenomenon operates, and whether it is measurable and sufficiently robust to merit being posited as a 'force' in the organisation of text.

We are interested in articulating the nature of active distancing, rather than in just identifying instances where words are prohibited from co-occurring for well-known reasons, such as grammatical norms. We also hope to differentiate between indifference, where words do not co-occur because they have no relationship with each other, and cases of actual repulsion.

2. Background to study

The English native-speaker routinely says Merry Christmas, Happy Christmas and Happy Birthday but not Merry Birthday, so we applied our existing z-score collocational statistics (set at span +/-1, and with case insensitivity) to this apparent phenomenon of avoidance, and found that it was measurable. As shown in Table 1, the measures for the four word pairs above revealed that, while Pairs 1, 2 and 3 collocate strongly, Pair 4 does not collocate at all and produces a negative z-score:

Table 1. Collocation of merry, happy, christmas and birthday.

Word1    Corp freq    Word2        Corp freq    Collocates    Collocate score
merry    2326         christmas    90670        450           393.205
happy    8323         christmas    90670        299           196.010
happy    8323         birthday     2416         526           516.16
merry    2326         birthday     2416         0             -1.014

This finding encouraged us to develop a more detailed hypothesis about the existence and nature of 'repulsion' in text, and to test it systematically. Given that our methodology is based on collocational considerations, we focus on 'lexical' repulsion, i.e. the intuitively-observed tendency in conventional language use for certain pairs of words not to occur together, for no apparent reason other than convention. We also decided to extend our investigation to semantic repulsion, based on our experience that semantic phenomena in text are accessible by collocational means, at least to some extent.

3. 'Constraint' versus 'repulsion'

Of course, we already know that a type of repulsion exists in text at several levels of description and generality. To the extent that 'repulsion' means 'restricted co-occurrence', this has been shown in the literature to occur at the levels of grammar, semantics (e.g. Resnik 1997), morphology (e.g. Aronoff 1976; Yip 1998) and phonology (e.g. Kim 1998). On the other hand, little or no investigation seems to have been carried out into the nature either of actual 'repulsion' or of 'lexical repulsion', in our terms. The closest approaches we have found are, firstly, in computational linguistics, where Beeferman et al. (1997) refer to repulsion, but only in terms of the 'lexical exclusion principle': they state that exact word repetition 'occurs less frequently' within shorter text spans, for stylistic and syntactic reasons. Meanwhile, in the fields of translation (e.g. Laviosa-Braithwaite 1996) and foreign language teaching (e.g. Bonci 2002), 'collocational constraints', 'collocational restrictions', and 'collocational clashes' are concepts which are intuitively applied. However, they are never systematically treated or quantified.

4. Data

The corpus data used in the study comprised 700 million words of 'broadsheet' journalistic text, drawn from the Independent and Guardian newspapers from 1989-2006. This had the benefit of providing us with sufficient information to allow statistical measures of significance to be applied sensibly. It should, on the other hand, be borne in mind in interpreting our results and findings that they only reflect the facts of the language as found in this particular type of written discourse, though we are confident that it furnishes reasonably relevant and 'mainstream' language use. The newspapers, though covering a time-period, were processed as a synchronic entity for the purpose of our research goals. Article boundaries were noted, since the phenomenon of repulsion can be assumed to function within the confines of a single article.

5. Studying lexical repulsion

As a starting hypothesis, it was assumed that lexical repulsion is a matter of convention in the English native-speaking community; we are concerned with cases where two words simply do not collocate with each other in text for no obvious reason other than habit: that is, where the repulsion is not due to the fact that they are semantically incompatible, or grammatically disallowed, or morphologically or phonologically blocked. Our chosen remit involved the investigation of both words and phrases, of both contiguous and discontinuous word relations, and also of the aberrant collocation of words which are conventionally found to repel each other.

5.1 Lexical repulsion between synonymous word pairs

As many collocational studies including our own have shown, an individual word actually 'attracts' relatively few words to any significant degree. Given the extensive inventory of the English vocabulary, this means that there remains a very large number of word types to which a word is either indifferent or, according to our hypothesis, 'hostile'. In order to pre-empt the vast lists of unsurprising output which any valid repulsion methodology could accordingly be expected to generate for a word, we began our study of repulsion with a focus on the clearest and most manageable case of the phenomenon. This was deemed to lie in the behaviour of synonymous word pairs. Synonyms can reasonably be expected by virtue of their shared meanings to share significant collocates in text. This, so our thinking went, would mean that the instances of repulsion would be particularly surprising and informative, and that the list of the other synonym's collocates repelled by each synonym would be reduced. To limit our initial studies further, we decided to focus only on the contiguous collocates of each synonym pair, in order to eliminate any cases where an ambiguity could arise about whether it was repulsion or simply close collocation that we were witnessing. Our concept of repulsion is represented diagrammatically in Figure 1 below.

Graphical representation of repulsion, showing two overlapping circles and reading from left to right: the left-hand crescent repels Word B and attracts Word A; next are words that strongly attract Word A and weakly collocate with Word B; the central overlapping area has words that strongly attract both Word A and Word B; next are words that strongly attract Word B and weakly collocate with Word A; the right-hand crescent repels Word A and attracts Word B.

Figure 1. Graphic representation of 'repulsion'.

Word A and Word B in Figure 1 are synonyms, shown within their respective 'collocational spaces'. The circular area on the left represents the significant collocate set for Word A, and the circular area on the right represents the significant collocate set for Word B. The overlap between these two sets of collocates represents their shared significant collocates. The middle area of the circle for each word represents those collocates which might also collocate with the other word, but only weakly or insignificantly. Meanwhile, the extreme outer crescents represent areas of actual repulsion by Word B of Word A's collocates, and by Word A of Word B's collocates.

5.2 Measuring lexical repulsion

The statistical measures of repulsion which we applied were based on relative frequency in relation to the 700 million word corpus as a whole. We have built 'collocational profiles' containing this information for each word in newspaper text, and used a z-score cut-off to identify only the most significant collocates. Collocational z-scores are a measure of the strength of a relationship based on comparing a) the frequency with which two observed words collocate within a given span with b) their expected frequency in a body of text if the occurrence of one word of the pair was at random relative to the other word. The statistical thresholds (z-score cut-offs) were as follows:

  • for 'repulsion', the strength of association was set at < -2. That is, the words road + life are deemed to exhibit repulsion because their strength of association is -9.466
  • for 'weak collocation', the strength of association lay between -2 and 2. That is, the words road + assaults exhibit weak collocation, with a strength of association: 0.254; and the words street + becomes with a strength of association: -0.125
  • for 'strong collocation', the strength of association was set at > 2. That is, the words main + street have a collocation strength of 331.407; while the words main + road have one of 232.256

It should be noted that our methodology rests upon the notion that both collocational attraction and lexical repulsion sit at different points on the same scale, and the statistical thresholds are set accordingly.
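The three-way classification above can be sketched in a few lines of Python. This is only an illustration of the thresholds as stated: the function name `classify` is ours, and assigning scores of exactly -2 or 2 to 'weak collocation' is an assumption, since the text specifies only '< -2' and '> 2'.

```python
# Cut-offs taken from the list above.
REPULSION_CUTOFF = -2.0
ATTRACTION_CUTOFF = 2.0


def classify(score: float) -> str:
    """Place a collocational z-score on the single attraction/repulsion scale."""
    if score < REPULSION_CUTOFF:
        return "repulsion"
    if score > ATTRACTION_CUTOFF:
        return "strong collocation"
    # Assumption: boundary values (-2 and 2) count as weak collocation.
    return "weak collocation"


# Scores quoted in the text for span 1.
examples = {
    ("road", "life"): -9.466,
    ("road", "assaults"): 0.254,
    ("street", "becomes"): -0.125,
    ("main", "street"): 331.407,
    ("main", "road"): 232.256,
}

for (word1, word2), score in examples.items():
    print(f"{word1} + {word2}: {score} -> {classify(score)}")
```

Because both phenomena sit on one scale, a single score per word pair is enough to decide the category; no separate repulsion statistic is needed at this stage.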

5.3 Selection of synonym candidates

Our synonym selection process is still underway, but we began by building up a working inventory of word pairs based i) on intuition, ii) on automatically extracted 'nymic' thesaural output from our ACRONYM system (Renouf 1996), and iii) on an assortment of traditional pre-corpus-linguistic categories of synonym (e.g. Palmer 1981); that is to say, categorised according to parameters along which the synonyms can be said to differ in some respect (with reference to region, etymology, level of formality, and so on). We selected good synonym candidates on the basis of their intuitively greater or lesser semantic proximity. We have progressively selected from and modified this list in the light of new questions arising from the observation of each successive synonym pair.

5.4 Results

We expected that the type of repulsion which would emerge from the comparison of synonyms would be occurring for collocational, and thus purely arbitrary and conventional, reasons. We discovered in the course of our iterative processing and observation that there were in fact other, quite systematic, types of repulsion involved. This will be illustrated in a series of cases below.

5.4.1 Synonyms: road and street

The synonyms road and street, seen in Table 2, were selected out of curiosity. Palmer (1981) sees them as embodying fundamental referential distinctions, while we know that their referential spheres are not consistent in text, and that their meaning distinctions can be fuzzy (even disregarding their proper name functions). They also have the merit of being relatively equi-frequent (with 86,298 and 69,657 occurrences respectively) given the overall corpus size, thus presenting less of a statistical balancing act from the outset.

Table 2. Road and street repulsion cross-tabulation.

road street cross-tabulation

A cross tabulation (Table 2, shown above) was used to display the joint distribution of the three categories of association: repulsion, weak collocation and strong collocation, between the collocates of road and street. Table 2 shows that though street occurs fewer times than road in the corpus, street strongly collocates with more word types (850) than does road (794). This implies that road operates within a more restricted domain and consequently repels more word types than street in the collocate space. The two words share many strong collocates (129, in fact), but on the other hand road actively repels 124 of street's collocates, almost as many as it shares. The word street repels 92 of road's collocates. Referring to the graphic representation in Figure 1, 129 collocates occupy the common shaded area between the two circles, whereas 124 and 92 items occupy the crescents on either edge of the circles respectively, representing repulsion. There are 573 strong collocates of road which only weakly collocate with street, and conversely 597 strong collocates of street which only weakly collocate with road. The other boxes are empty because we have been considering only the significant collocate space of the two words, and in this space, we are not likely to find any output of words that are either repelled by both road and street or weakly collocating with each of them. Only when we consider all collocates of road and street can we expect these boxes to be filled. We shall later in the project also consider the contents of these boxes. The considerable number of repulsion candidates indicates that road and street not only have semantic and referential differences, but also different textual uses. Table 3 lists some of the words that either road or street repel or attract.
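A cross-tabulation of this kind can be sketched as follows. The function names and the toy scores are hypothetical and serve only to show the mechanics; the sketch assumes each word has a score against both synonyms and that, as described above, only words that are a strong collocate of at least one synonym enter the table. The final lines simply check that the cell counts reported above add up to the stated totals of 794 and 850.

```python
from collections import Counter


def category(score, low=-2.0, high=2.0):
    """Three-way category for a z-score (cut-offs as in Section 5.2)."""
    if score < low:
        return "repels"
    if score > high:
        return "strong"
    return "weak"


def cross_tabulate(scores_a, scores_b):
    """Count words by (category with word A, category with word B).

    Assumption: only words scored against both A and B are considered, and
    only those that are a strong collocate of at least one of them are kept.
    """
    table = Counter()
    for word in scores_a.keys() & scores_b.keys():
        cell = (category(scores_a[word]), category(scores_b[word]))
        if "strong" in cell:
            table[cell] += 1
    return table


# Hypothetical illustrative scores, not corpus output.
road = {"main": 232.3, "cred": -3.1, "rage": 40.0, "quiet": 12.0, "assaults": 0.3}
street = {"main": 331.4, "cred": 55.0, "rage": -2.8, "quiet": 15.0, "assaults": 6.0}

toy_table = cross_tabulate(road, street)
for cell, count in sorted(toy_table.items()):
    print(cell, count)

# Consistency check on the counts reported in the text.
shared, road_only_weak, street_repels = 129, 573, 92
street_only_weak, road_repels = 597, 124
print(shared + road_only_weak + street_repels)    # strong collocates of road
print(shared + street_only_weak + road_repels)    # strong collocates of street
```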

Table 3. Collocates exhibiting different textual uses of road and street.

road repels           street repels         both attract
(street attracts)     (road attracts)

cred                  rage                  layout
corners               pricing               sweeper
vendors               haulage               cobbled
robberies             trunk                 deserted
urchins               hauliers              adjacent
banks                 junctions             urban
retailer              congestion            protests
robbery               rocky                 quiet
retailers             blocks                pedestrian
civvy                 junction              littered
demonstrations        toll                  opposite
cleaning              coastal

5.4.2 Synonyms: royal and regal

The synonyms royal and regal shown in Table 4 were picked from Palmer's list of 'etymological synonyms', and our working assumption was that there was no obvious difference in meaning [2] between them.

Table 4. Royal and regal repulsion cross-tabulation.

royal regal cross-tabulation

In fact, we discover in the output that the two words share only 27 collocates, while regal actively repels 658 of royal's collocates. The explanation for this is two-fold. Firstly, the word royal is much more frequent in the corpus than the term regal, in a ratio of 24,036:1,262 occurrences, and thus has more general currency, while the rarer term regal is by definition more selective about its collocates. The second reason is that the output shows that there is clearly a distinction in the sense of each of the words when used in text: regal is differentiated from royal in meaning 'not actually royal, but acting in a manner imitating or associated with royalty' (see Table 5 below).

Table 5. Collocates exhibiting different textual uses of royal and regal.

royal repels          regal repels          both attract
(regal attracts)      (royal attracts)

disdain               assent                style
suitably              palace                dignity
pelargoniums          charter               touch
poise                 yacht                 authority
appropriately         household             procession
splendidly            wedding               robes
vice                  Saudi                 progress
gesture               infirmary             purple
hauteur               family's              tour
surroundings          prerogative           status
arrogance             commission            figure
setting               warrant               treatment
quality               blue                  presence
look                  jelly                 power
power

5.4.3 Synonyms: immaculate and impeccable

The immaculate-impeccable pairing shown in Table 6 was chosen out of curiosity because it forms part of a set of semantically similar words, including also flawless, spotless and faultless, which had been suggested by another writer (Laviosa-Braithwaite 1996). This pair was judged by us to be particularly similar semantically; the two words also had the advantage of being equi-frequent, in the ratio of 3,723:3,355.

Table 6. Immaculate and impeccable repulsion cross-tabulation.

immaculate impeccable cross-tabulation

In fact, in Table 6 it can be seen that, despite their being rather low-frequency words, immaculate and impeccable share 68 strong collocates, thus confirming a high degree of semantic overlap between the two words. Nevertheless, impeccable also actively repels 46 of immaculate's collocates, fewer than it shares, while immaculate actively repels 60 of impeccable's collocates.

Table 7. Collocates exhibiting different textual uses of immaculate and impeccable.

immaculate repels        impeccable repels        both attract
(impeccable attracts)    (immaculate attracts)

credentials              conception               timing
logic                    garden                   English
source                   lawn                     display
liberal                  gardens                  taste
behaviour                grass                    technically
character                dark                     otherwise
sources                  flat                     performance
integrity                coiffure                 technique
academic                 coercion                 control
connections              bungalow                 playing
references               nails                    sense
speaks                   copy                     normally
judgment                 kitchen                  detail
provenance               passing                  taste
politeness               waiters                  handling
craftsmanship            home                     manners
                                                  black

When we look at the actual elements of attraction and repulsion in question, shown in Table 7, the reason becomes clear. The word impeccable is associated with abstract qualities, while the word immaculate characterises physical states of perfection (primarily appearance).

5.4.4 Synonyms: ideas and plans

The synonym pair ideas and plans seen in Table 8 was chosen out of curiosity because it forms part of a larger semantic set, with concepts, proposals, projects, and propositions. The particular pair was selected because it was intuited to be closer semantically than some of the others and so likely to generate more surprising repulsion.

Table 8. Ideas and plans repulsion cross-tabulation.

ideas plans cross-tabulation


Table 9. Collocates exhibiting different textual uses of ideas and plans.

ideas repels          plans repels          both attract
(plans attracts)      (ideas attracts)

contingency           preconceived          innovative
best-laid             bright                radical
shelved               abstracts             half-baked
expansion             bounce                revolutionary
announced             exchanging            imaginative
equity                swap                  grandiose
afoot                 philosophical         definite
shelve                progressive           floated
finalising            wacky                 original
spending              sharing               submit

However, the collocational-repulsion output in Table 9 does reveal some clear differences. The word ideas seems to sit at the provisional, abstract, ideational, conceptual end of a process, which then progresses to the plans stage, which becomes concrete and ordered and executable.

5.4.5 Synonyms: expert and specialist

We then took the words expert and specialist, in the belief that they were even closer semantically and would therefore show low repulsion scores. In fact, we found that the opposite was true.

Table 10. Expert and specialist repulsion cross-tabulation.

expert specialist cross-tabulation


Table 11. Collocates exhibiting different textual uses of expert and specialist.

expert repels            specialist repels        both attract
(specialist attracts)    (expert attracts)

new, company, course, school, record, programme, five, role, position, centre, companies, small, sales, press, business, biggest, bank, shows, areas, paper, jobs, firm, hospital, area, event

cover, site, coach, clubs, seven, largest, selling, station, commercial, practice, troops, cars, magazine, department, products, community, student, interests, sport, interest, policies, programmes, network, agencies, squad

world, great, evidence, view, report, real, patients, hands, touch, comment, views, space, telling, Lord, relationship

witnesses, cancer, advice, weapons, panel, opinion, leading, explosives, computer, fertility, medical, fitness, forensic, knowledge, independent, legal, Professor, eye, pensions, tuition, foremost, advisers, security, relations, advisory

writer, judges, care, staff, seeking, turnaround, assessment, supervision, design, recruitment, engineering, offering, mortgage, development, technology, radiation, ethics, resources, infertility, psychiatric, Computer, welfare, accountancy, investment, advisors

Both expert and specialist have similar frequencies in the corpus and share 123 strong collocates (see Table 10 above). The proportion of shared collocates is considerably bigger than the proportion that they repel in their joint collocate space. This implies that both words operate in a very similar semantic sense. However, the highest repulsion scores for expert are with new, company, course and school, whilst those for specialist are with world, great, evidence, view and report (see Table 11). The data thus show a general trend by which specialist can be applied to both human and inanimate objects and concepts, whereas expert tends to be restricted to people. The difference between the two words seems, as with road and street, to come down to specific semantic attributions within the journalistic medium.

6. Next stage tasks

We have given a taste of the early findings with relation to the lexical repulsion obtaining between synonyms, findings which are proving intuitively promising, and certainly intriguing. But the project has run only for four months at the time of writing. Over the next twenty months, the plan is to move on to investigate the following areas: modified statistical thresholds, repulsion spans, directionality of repulsion in fixed phrases, semantic repulsion, as well as text at phrase level, repulsion across sentence boundaries, and the effect of case sensitivity.

6.1 Modifying statistical measures and thresholds

Although we have been observing some clear examples of lexical repulsion between the collocates of synonyms, the statistical measures and thresholds applied were giving preference to the more frequent words in the corpus. To begin with, the statistical formula for assigning repulsion (Equation 1) was related to traditional ideas of collocation whereby having an occurrence greater than that expected by chance automatically generates a high score. These statistical measures are not particularly appropriate for dealing with rare events.

Equation 1

For example, in the case of road and street (Table 12), words like life, market, children and London, which are highly frequent in the corpus, show the strongest repulsion (column 4 of the same table) with road, even though road sometimes collocates with these very words that it repels. This is because, although a word like life collocates with road 4 times in the corpus, the event is considered rare, since life occurs 399,518 times (column 2 of the same table) in the whole corpus. The high frequency of such words in the corpus raises their repulsion scores and lowers their attraction scores with road. There are also words like sales, art and bank in the repulsion list which do not collocate at all with road, but show different (lower) repulsion scores based purely on their lower frequency in the corpus.

Table 12. Road and street repulsion scores using initial formula.

Column 1: target word; Column 2: corpus frequency of target word; Column 3: collocate frequency of target word with road over span=1; Column 4: repulsion score between target word and road; Column 5: collocate frequency of target word with street over span=1; Column 6: attraction score between target word and street

initial scores of road and street

In order to counteract the dominating effect of word frequency on the repulsion score, we tested by modifying the formula to:

Equation 2

Equation 2 shows that the square-root calculation in the denominator has been removed. This means that when the observed collocational frequency of one word with another is zero, the repulsion score will always be -1; irrespective of the frequency of the word in the corpus. The new modified scores (Table 13) show that we can now identify the more relevant candidates like cred, corners, vendors, robberies, in a relationship of repulsion with road which all have a score of -1 because they never collocate with road. The new scores have been ordered by the attraction scores for street, so that words that are more strongly attracted to street are the ones that show highest repulsion with road and sit furthest away from road, according to the graphical representation proposed for collocate space in a diagram at the start of the paper (Figure 1).
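Since Equations 1 and 2 are not reproduced in this text, the following Python sketch rests on stated assumptions rather than on the authors' exact formulae: it takes the initial measure to have the common z-score form (O - E) / sqrt(E) and the modified measure to be (O - E) / E, which is at least consistent with the statement that a zero observed frequency always yields -1. The function names and the chance estimate in `expected_frequency` are likewise hypothetical.

```python
import math


def expected_frequency(freq_node, freq_collocate, corpus_size, window=2):
    """Assumed chance co-occurrence estimate.

    window=2 stands for one position on either side of the node (span 1).
    """
    return freq_node * freq_collocate * window / corpus_size


def initial_score(observed, expected):
    """Assumed form of the initial measure: square root in the denominator."""
    return (observed - expected) / math.sqrt(expected)


def modified_score(observed, expected):
    """Assumed form of the modified measure: square root removed."""
    return (observed - expected) / expected


CORPUS_SIZE = 700_000_000   # corpus size as given in Section 5.2
ROAD_FREQ = 86_298          # corpus frequency of road, from Section 5.4.1

# Two collocates that never co-occur with road (observed = 0); the
# frequencies are hypothetical and chosen only to contrast common and rare.
for label, freq in [("frequent word", 400_000), ("rarer word", 4_000)]:
    e = expected_frequency(ROAD_FREQ, freq, CORPUS_SIZE)
    print(label, round(initial_score(0, e), 3), modified_score(0, e))
```

Under these assumptions the initial score for a non-co-occurring pair grows more negative as the collocate becomes more frequent, whereas the modified score stays at -1 whatever the frequency, which is the behaviour described above.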

Table 13. New road and street repulsion scores using modified formula.

Column 1: target word; Column 2: corpus frequency of target word; Column 3: collocate frequency of target word with road over span=1; Column 4: repulsion score between target word and road; Column 5: collocate frequency of target word with street over span=1; Column 6: attraction score between target word and street

modified scores of road and street

6.2 Modifying repulsion spans

We have always expected to discover the strongest repulsion occurring within span 1, because observation has shown us that a word exerts the greatest influence on its immediate neighbours, and vice versa. We already know that lexical attraction decreases as the span between two words increases. A wider span brings in other influencing factors on the position of the word in the corpus.

Table 14. Repulsion between immaculate and impeccable over span 4 and span 1.

Column 1: target word; Column 2: corpus frequency of target word; Column 3: collocate frequency of target word with immaculate; Column 4: repulsion/attraction score between target word and immaculate; Column 5: collocate frequency of target word with impeccable; Column 6: repulsion/attraction score between target word and impeccable

Span 4

word           corpusfreq    collfreq1    immaculate    collfreq2    impeccable
Labour         249743        0            -3.078        19           2.619
political      243483        0            -3.039        22           3.671
social         143061        0            -2.334        12           2.428
credentials    7864          4            2.644         365          335.699
defence        108517        10           2.844         15           4.834
character      58785         4            0.950         29           16.000
flat           52404         15           8.367         5            1.602
kitchen        27304         7            12.149        2            0.245

Span 1

word           corpusfreq    collfreq1    immaculate    collfreq2    impeccable
political      243483        0            -1.601        12           5.536
social         143061        0            -1.352        10           5.799
credentials    7864          0            -1.019        97           94.007
defence        108517        0            -1.265        5            2.566
character      58785         0            -1.142        24           19.568
flat           52404         9            6.863         0            -1.141
kitchen        27304         6            4.566         0            -1.073

In the example of immaculate and impeccable, Table 14 shows words ordered by highest repulsion with immaculate. The word credentials collocates strongly with impeccable at span 4, but credentials also collocates 4 times with immaculate and shows no repulsion. At span 1, however, credentials does not collocate with immaculate at all, but still retains strong collocation with impeccable, so repulsion is evident with immaculate. High-frequency lexical words such as political and social, which collocate strongly with impeccable over span 4 and not at all with immaculate, exaggerate their repulsion with immaculate, a skew which can be balanced by applying tighter spans. Labour, a high-frequency word in the corpus, collocates 19 times (span 4) with impeccable though semantically unrelated. This effect is overcome at span 1, where the word drops from the list. The results show that when considering tighter spans, particularly span 1, word pairs like immaculate + credentials, immaculate + defence, and immaculate + character never occur and actively repel each other. Similarly impeccable + flat and impeccable + kitchen never occur either. This repulsion is not strongly evident when the span is increased to 4.
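The effect of span on the raw collocate frequencies can be illustrated with a minimal counting routine. This is a sketch only: the function name `cooccurrences` and the example sentence are invented, matching is case-insensitive as in our collocational statistics, and article boundaries are ignored for simplicity.

```python
def cooccurrences(tokens, node, collocate, span):
    """Count occurrences of `collocate` within `span` tokens either side of `node`."""
    tokens = [t.lower() for t in tokens]
    node, collocate = node.lower(), collocate.lower()
    count = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]       # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]      # up to `span` tokens after the node
        count += left.count(collocate) + right.count(collocate)
    return count


# Invented example text, for illustration only.
text = "his credentials are impeccable and he has impeccable academic credentials".split()

span1 = cooccurrences(text, "impeccable", "credentials", span=1)
span4 = cooccurrences(text, "impeccable", "credentials", span=4)
print(span1, span4)
```

In this toy text the pair is never adjacent, so the count is zero at span 1 but not at span 4, which mirrors the way a wider span admits co-occurrences that mask repulsion between immediate neighbours.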

6.3 Directionality in repulsion

Our manual trial confirmed that statistics can show strong positive collocation scores in one direction and negative scores in the other, for instance in the case of so-called 'irreversible bi-nomial' and 'irreversible tri-nomial' phrases. The scores for right and left handed span of 4 were as shown in Table 15, where wine and dine, and calm and collected, respectively, are clearly shown to be the preferred ordering within their phrases.

Table 15. Directionality scores for irreversible phrases.

wine and dine                       (cool,) calm and collected
wine R+/-4 coll dine    -0.117      calm R+/-4 coll collected    -1.235
wine L+/-4 coll dine    31.951      calm L+/-4 coll collected    57.890

We conclude from this that it should be possible to tailor a repulsion calculation, using a +/-2 span, which will be useful for identifying that subset of the phraseology of English intuitively characterised as 'irreversible bi- and tri-nomials', and we shall include this in the next phase of investigation.
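A directional count of the kind such a calculation would need can be sketched as below. The function name `directional_counts` and the example sentence are hypothetical; the sketch only separates left-hand from right-hand co-occurrences and does not compute the scores shown in Table 15.

```python
def directional_counts(tokens, node, collocate, span):
    """Return (left, right): occurrences of `collocate` before and after `node`,
    each within `span` tokens."""
    tokens = [t.lower() for t in tokens]
    node, collocate = node.lower(), collocate.lower()
    left = right = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left += tokens[max(0, i - span):i].count(collocate)
        right += tokens[i + 1:i + 1 + span].count(collocate)
    return left, right


# Invented example text, for illustration only.
text = "we wine and dine clients while they wine and dine us".split()

counts_span2 = directional_counts(text, "wine", "dine", span=2)
counts_span4 = directional_counts(text, "wine", "dine", span=4)
print(counts_span2, counts_span4)
```

With a span of 2 every instance of dine falls to the right of wine, whereas a span of 4 also picks up the dine of the preceding phrase on the left; a tighter span therefore gives a cleaner picture of the preferred ordering.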

7. Semantic repulsion

Semantic repulsion is a relationship which we define as obtaining between two words which do not collocate with each other in conventional text, and which also do not share significant collocates. We would have in mind here such repulsion pairs as pineapple+scarf, or happy+murder. We would try to factor out word pairs such as happy and death, which are in principle semantically incompatible, but which are juxtaposed fairly regularly if counter-intuitively to achieve a stylistic effect.

As a starting point for our assessment of semantic disparateness, and thence semantic repulsion, we selected the word pair bus and butter. On the basis of our intuition, it seemed reasonable to assume that these were not semantically related, and that this would be indicated by the absence of shared collocates (and presence of common repelled, or weakly collocating, items) between them.

Table 16. Bus and butter repulsion cross-tabulation.

bus butter cross-tabulation

The cross-tabulation shown in Table 16 confirms our expectation that most of the words in the shared collocate space are weakly collocating or showing repulsion with either bus or butter, unlike the examples of synonyms we have shown earlier, where most words strongly collocated with both. However, it also emerged that, whilst bus and butter are shown by their repulsion profiles to be largely different in semantic terms, they do in fact share a few (8 out of a total of 978) strong collocates and thus can be said to be partially related.

Table 17 shows both the lists of words they each repel on account of their semantic differences, and the list of collocates which they share. The lists are obvious when one sees them, but the content of the third column: words constituting the collocates shared by the two unrelated words, is not intuitively accessible even to a native speaker (or, to give a practised few the benefit of the doubt, at least not without lengthy reflection).

It will be apparent to the keen eye that the particular scoring method used for this particular study favours high-frequency words. For example, a word like white (corpus frequency: 120013) which only collocates with bus and butter 22 and 11 times respectively, appears as a strong collocate of both. Future work (as explained in the above section) will involve applying a variety of statistical measures to address issues like these.

Table 17. Collocates showing bus and butter repulsion.

bus repels          butter repels       both attract
(butter attracts)   (bus attracts)
put                 new                 rolls
real                left                white
season              hour                extra
words               small               substitute
paper               public              instead
oil                 went                orange
issues              Manchester          cooking
little              UK                  yellow
products            coming
fresh               last
rich                school
life                next
stuff               market
add                 car
wine                coach
cut                 main
sea                 special
cold                private
served              taking
Zealand             old
spread              central
remaining           group
                    attack
                    back
                    city
                    system
                    full

8. Findings so far

In this paper, we have introduced the notion of 'repulsion' in text, as an opposing 'force' or tendency to that of collocation or 'attraction'. With our background in the study of the relationship between surface features of text and meaning, our focus has been on 'lexical repulsion' and briefly also on 'semantic repulsion', rather than on, say, 'grammatical repulsion', since these types can both be viewed and to some extent hopefully also explained in terms of collocation. We are interested in identifying repulsion as an active phenomenon, as opposed to one of collocational indifference, and as a measurable phenomenon.

In these early stages, we have made initial sorties into different aspects of the topic, as reflected in the structure of this paper. We have presented an analysis and discussion of the lexical repulsion revealed by a range of synonym pairs based on research so far; and drawn conclusions about what the next stages of the research should be, with reference to such things as the refinement of statistical measures and thresholds, specific aspects of repulsion such as directionality and span, and importantly, also to semantic repulsion.

We have so far confirmed our intuition that the most interesting, while also most manageable, aspect of lexical repulsion obtains between sense-related pairs. Interesting, because it is in principle surprising that two related words would repel each other's collocates, and manageable, because the output is inevitably a much reduced subset of the total lexicon. As we know, synonymy is only a partial phenomenon, and synonyms differ in meaning according to the particular functions they each fulfil, their frequency of occurrence in text, their range of senses, and the types of context in which they typically occur. In our studies so far, we have confirmed that synonyms actively repel certain of each other's collocates wherever they differ in these aspects.

Furthermore, we have been able to discover through our repulsion output that collocational differences in the behaviour of two synonyms, which have until now been thought just to be arbitrary and conventional, are in fact systematic and explicable, primarily in terms of detailed semantics. This discovery is important for language teaching, research and NLP, because it will certainly lead to the provision of hitherto unavailable information about the lexicon which is finer-grained, objective and accessible.

Notes

* We should like to acknowledge with thanks the support of the Engineering and Physical Sciences Research Council for the Repulsion project, under grant No. EP/D502551/1, as well as the helpful comments of the reviewers of this paper.

[1] The fact that, under different magnetic circumstances, electrically-charged particles can also repel each other, conveniently allows us to extend the metaphor to characterise the converse linguistic phenomenon which we argue exists, the relationship of 'repulsion' into which word pairs may under other circumstances enter.

[2] There is an element of sophistry in our selectional methodology, since one cannot gaze at words in texts for decades without developing an awareness and a predictive ability of the differences in meaning and use between so-called synonyms in English.

Sources

ACRONYM (Automatic Collocational Retrieval of NYMs), http://rdues.bcu.ac.uk/acronym.shtml

Research and Development Unit for English Studies (RDUES), http://rdues.bcu.ac.uk

References

Aronoff, M. 1976. Word Formation in Generative Grammar. Cambridge, Mass.: MIT Press.

Beeferman, D., A. Berger & J. Lafferty. 1997. "A Model of Lexical Attraction and Repulsion". Proceedings of the 35th Annual Meeting of the ACL and 8th Conference of the EACL, Madrid, Spain, 7-12 July 1997, 373-380. Morristown, N.J.: Association for Computational Linguistics.

Bonci, A. 2002. "Collocational Restrictions in Italian as a Second Language". Tuttitalia 26: 314.

Firth, J.R. 1957 [1951]. "Modes of meaning". Papers in Linguistics, 1934-1951, by J.R. Firth, 190-215. London: Oxford University Press.

Kim, D.W. 1998. "Finding the Reader in Literary Computing". Computing in the Humanities Working Papers A.11, April 1998. Jointly publ. with TEXT Technology 8.1, Wright State University. http://projects.chass.utoronto.ca/chwp/kim/

Laviosa-Braithwaite, S. 1996. "Comparable Corpora: Towards a Corpus Linguistic Methodology for the Empirical Study of Translation". Proceedings of the Maastricht Session of the 2nd International Maastricht-Lodz Duo Colloquium on 'Translation and Meaning', ed. by M. Thelen & B. Lewandowska-Tomaszczyk, Part 3, 153-163. Maastricht: Hogeschool Maastricht.

Pacey, M., A.J. Collier & A.J. Renouf. 1998. "Refining the Automatic Identification of Conceptual Relations in Large-scale Corpora". Proceedings of the Sixth Workshop on Very Large Corpora, at COLING-ACL, Montreal, 15-16 August 1998, ed. by E. Charniak, 76-84. University of Montreal & Morgan Kaufmann Publishers.

Palmer, F.R. 1981. Semantics. Cambridge: Cambridge University Press.

Renouf, A.J. 1993. "Making Sense of Text: Automated Approaches to Meaning Extraction". Proceedings of the 17th International Online Information Meeting, London, 7-9 December 1993 (= Online Information, 93), ed. by D.I. Raitt & B. Jeapes, 77-86. Oxford: Learned Information.

Renouf, A.J. 1996. "The ACRONYM Project: Discovering the Textual Thesaurus". Papers from English Language Research on Computerized Corpora (ICAME 16), ed. by I. Lancashire, C. Meyer & C. Percy, 171-187. Amsterdam & New York: Rodopi.

Resnik, P. 1997. "Selectional Preference and Sense Disambiguation". Presented at the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington, D.C., April 4-5, 1997. http://www.aclweb.org/anthology/W97-0209.pdf

Yip, M. 1998. "Identity avoidance in phonology and morphology". Morphology and its Relation to Phonology and Syntax, ed. by S. LaPointe, D. Brentari & P. Farrell, 216-246. Stanford, Calif.: CSLI Publications.