The search for repulsion: a new corpus analytical approach *
Antoinette Renouf &
Jayeeta Banerjee
Research and Development Unit for English Studies,
Birmingham City University
In the last two decades, our research
has centred on word collocation and its role in the construction of meaning in
text. In this paper, we propose that there is a 'force', that we call
'repulsion', which operates in an opposing way to that of lexical collocation.
By 'repulsion', we mean the intuitively-observed tendency in conventional
language use for certain pairs of words not to occur
together. We write within the context of a large-scale study, which has the goal
of establishing how repulsion operates in text and whether it has the status of
an objective and measurable 'force'. We are interested in identifying the
process of actual distancing between words, rather than just enumerating the
instances where word co-occurrence is prohibited by other factors such as
grammatical norms, and we further wish to make a clear distinction between cases
of 'indifference' and of active repulsion. This is a hitherto unexplored aspect
of language in use, and we hope to develop an objective 'lexical repulsion'
measure, capable of providing insights into text creation which will be of use
in lexicology, language pedagogy and NLP.
Until now, our research focus at the Research and Development Unit for English Studies
(RDUES) has been on collocation (Firth 1951), the way in which certain words significantly prefer each other's
company, whether in adjacent pairings or in discontinuous phrasal frameworks.
This information is of intrinsic interest to the linguist, but we have found
that we can also exploit the resultant inventory of word pairs in various
practical ways; for instance, in automating the finding of sense-related words
on the basis of their collocational similarities (e.g. Renouf 1996); or in
identifying changes in word use across time through the changes in the
collocational behaviour (e.g. Renouf 1993; Pacey et al. 1998).
In this paper, we propose to introduce
and explore the notion that there is another 'force', which we shall call
'repulsion' [1], that operates on the construction of text in an
opposing way to that of collocational 'attraction'. We start from an intuition,
based on long-term observation of textual organisation, that there is a tendency
in conventional language use to avoid putting certain pairs of words together -
for instance, it is conventional in English to say impeccable + manners,
but not spotless + manners. We describe and discuss the method by which
we try to establish how this phenomenon operates, and whether it is measurable
and sufficiently robust to merit being posited as a 'force' in the organisation
of text.
We are interested in articulating the
nature of active distancing, rather than in just identifying instances where
words are prohibited from co-occurring for well-known reasons, such as
grammatical norms. We also hope to differentiate between indifference, where
words do not co-occur because they have no relationship with each other, and
cases of actual repulsion.
The English native-speaker routinely says
Merry Christmas, Happy Christmas and Happy Birthday but not Merry Birthday, so we applied our existing
z-score collocational statistics (set at span +/-1, and with case insensitivity)
to this apparent phenomenon of avoidance, and found that it was
measurable. As shown in Table 1, the measures for the four word pairs above
revealed that, while Pairs 1, 2 and 3 collocate strongly, Pair 4 does not collocate
at all and produces a negative z-score:
Table 1. Collocation of merry, happy, christmas and birthday.
Pair | Word1 (corpus freq) | Word2 (corpus freq) | Collocates | Collocate score
1 | merry (2326) | christmas (90670) | 450 | 393.205
2 | happy (8323) | christmas (90670) | 299 | 196.010
3 | happy (8323) | birthday (2416) | 526 | 516.16
4 | merry (2326) | birthday (2416) | 0 | -1.014
This finding encouraged us to develop a
more detailed hypothesis about the existence and nature of 'repulsion' in text,
and to test it systematically. Given that our methodology is based on
collocational considerations, we focus on 'lexical' repulsion, i.e. the intuitively-observed
tendency in conventional language use for certain pairs of words not to occur together,
for no apparent reason other than convention.
We also decided to extend our
investigation to semantic repulsion, based on our experience that semantic
phenomena in text are accessible by collocational means, at least to some
extent.
Of course, we already know that a type of
repulsion exists in text at several levels of description and generality. To the
extent that 'repulsion' means 'restricted co-occurrence', this has been shown in
the literature to occur at the levels of grammar, semantics (e.g. Resnik 1997),
morphology (e.g. Aronoff 1976; Yip 1998) and phonology (e.g. Kim 1998). On
the other hand, little or no investigation seems to have been carried out either
into the nature of actual 'repulsion', or of 'lexical repulsion', in our terms.
The closest approaches we have found are, firstly, in computational linguistics,
where Beeferman et al. (1997)
refer to repulsion, but only in terms of the 'lexical exclusion principle': they
state that exact word repetition 'occurs less frequently' within shorter text
spans, for stylistic and syntactic reasons. Meanwhile, in the fields of
translation (e.g. Laviosa-Braithwaite 1996) and foreign language teaching (e.g.
Bonci 2002), 'collocational constraints', 'collocational restrictions', and
'collocational clashes' are concepts which are intuitively applied. However,
they are never systematically treated or quantified.
The corpus data used in the study
comprised 800 million words of 'broadsheet' journalistic text, the
Independent and Guardian newspapers from 1989-2006. This had the
benefit of providing us with sufficient information to allow statistical
measures of significance to be applied sensibly. It should, on the other
hand, be borne in mind in interpreting our results and findings that they only
reflect the facts of the language as found in this particular type of written
discourse, though we are confident that it furnishes reasonably relevant and
'mainstream' language use. The newspapers, though covering a time-period, were
processed as a synchronic entity for the purpose of our research goals. Article
boundaries were noted, since the phenomenon of repulsion can be assumed to
function within the confines of a single article.
As a starting hypothesis, it was assumed
that lexical repulsion is a matter of convention in the English native-speaking
community; we are concerned with cases where two words simply do not collocate
with each other in text for no obvious reason other than habit: that is, where
the repulsion is not due to the fact that they are semantically incompatible, or
grammatically disallowed, or morphologically or phonologically blocked. Our
chosen remit involved the investigation of both words and phrases, of both
contiguous and discontinuous word relations, and also of the aberrant
collocation of words which are conventionally found to repel each
other.
As many collocational studies including
our own have shown, an individual word actually 'attracts' relatively few words
to any significant degree. Given the extensive inventory of the English
vocabulary, this means that there remain a very large number of word types to
which a word is either indifferent or, according to our hypothesis, 'hostile'.
In order to preempt the predicted generation of vast lists of unsurprising
output which any valid repulsion methodology could accordingly be assumed to
generate for a word, we began our study of repulsion with a focus on the
clearest and most manageable case of the phenomenon. This was deemed to lie in
the behaviour of synonymous word pairs. Synonyms can reasonably be expected by
virtue of their shared meanings to share significant collocates in text. This,
so our thinking went, would mean that the instances of repulsion would be
particularly surprising and informative, and that the list of the other synonym's
collocates repelled by each synonym would be reduced. To limit our initial studies further,
we decided to focus only on the contiguous collocates of each synonym pair, in
order to eliminate any cases where an ambiguity could arise about whether it was
repulsion or simply close collocation that we were witnessing. Our concept of
repulsion is represented diagrammatically in Figure 1 below.
Figure 1. Graphic representation of 'repulsion'.
Word A and Word B in Figure 1 are
synonyms, shown within their respective 'collocational spaces'. The circular
area on the left represents the significant collocate set for Word A, and the
circular area on the right represents the significant collocate set for Word B.
The overlap between these two sets of collocates represents their shared
significant collocates. The middle area of the circle for each word represents
those collocates which might also collocate with the other word, but only weakly
or insignificantly. Meanwhile, the extreme outer crescents represent areas of
actual repulsion by Word B of Word A's collocates, and by Word A of Word B's
collocates.
The statistical measures of repulsion
which we applied were based on relative frequency in relation to the 700 million
word corpus as a whole. We have built 'collocational profiles' containing this
information for each word in newspaper text, and used a z-score cut-off to
identify only the most significant collocates. Collocational z-scores are
a measure of the strength of a relationship based on comparing a) the frequency
with which two observed words collocate within a given span with b) their
expected frequency in a body of text if the occurrence of one word of the pair
was at random relative to the other word. The statistical thresholds (z-score
cut-offs) were as follows:
- for 'repulsion', the strength of
association was set at < -2. That is, the words road + life are
deemed to exhibit repulsion because their strength of association is
-9.466
- for 'weak collocation', the strength
of association lay between -2 and 2. That is, the words road + assaults
exhibit weak collocation, with a strength of association: 0.254; and the words
street + becomes with a strength of association: -0.125
- for 'strong collocation', the strength
of association was set at > 2. That is, the words main + street have a
collocation strength of 331.407; while the words main + road have one
of 232.256
It should be noted that our methodology
rests upon the notion that both collocational attraction and lexical repulsion
sit at different points on the same scale, and the statistical thresholds are
set accordingly.
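The three bands above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's own implementation: the `z_score` function uses a common textbook form of the collocational z-score, which is an assumption here, and the names `z_score`, `classify` and `window_size` are hypothetical. Only the cut-offs (-2 and 2) and the example scores are taken from the text.

```python
import math


def z_score(observed, freq_node, freq_collocate, corpus_size, window_size):
    """Illustrative z-score for a node/collocate pair.

    Assumed textbook form (not taken from the paper):
        z = (O - E) / sqrt(E * (1 - p))
    where p = freq_collocate / corpus_size is the chance of meeting the
    collocate at any one position, and E = p * freq_node * window_size is
    the co-occurrence count expected if the two words were independent.
    """
    p = freq_collocate / corpus_size
    expected = p * freq_node * window_size
    return (observed - expected) / math.sqrt(expected * (1 - p))


def classify(score, cutoff=2.0):
    """Place a score in one of the three bands (cut-offs at -2 and 2)."""
    if score < -cutoff:
        return "repulsion"
    if score > cutoff:
        return "strong collocation"
    return "weak collocation"


# Scores quoted in the text above
print(classify(-9.466))   # road + life      -> repulsion
print(classify(0.254))    # road + assaults  -> weak collocation
print(classify(331.407))  # main + street    -> strong collocation
```

Because attraction and repulsion are treated as points on one scale, a single signed score and one symmetric cut-off are enough to separate the three cases.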
Our synonym selection process is still
underway, but we began by building up a working inventory of word pairs based i)
on intuition, ii) on automatically extracted 'nymic' thesaural output from our
ACRONYM system (Renouf 1996), and iii) an assortment of traditional
pre-corpus-linguistic categories of synonym (e.g. Palmer 1981); that is to
say, categorised according to parameters along which the synonyms can be said to
differ in some respect (with reference to region, etymology, level of formality,
and so on). We selected good synonym candidates on the basis of their
intuitively greater or lesser semantic proximity. We have progressively selected
from and modified this list in the light of new questions arising from the
observation of each successive synonym pair.
We expected that the type of repulsion
which would emerge from the comparison of synonyms would be occurring for
collocational, and thus purely arbitrary and conventional, reasons. We
discovered in the course of our iterative processing and observation that there
were in fact other, quite systematic, types of repulsion involved. This will be
illustrated in a series of cases below.
The synonyms road and
street, seen in Table 2, were selected out of curiosity. Palmer (1981) sees them
as embodying fundamental referential distinctions, while we know that their
referential spheres are not consistent in text, and that their meaning
distinctions can be fuzzy (even disregarding their proper name functions). They
also have the merit of being relatively equi-frequent (with 86,298 and 69,657
occurrences respectively) given the overall corpus size, thus presenting less of
a statistical balancing act from the outset.
Table 2. Road and street repulsion
cross-tabulation.
A cross tabulation (Table 2, shown above) was used to
display the joint distribution of the three categories of association:
repulsion, weak collocation and strong collocation, between the collocates of
road and street. Table 2 shows that though street occurs
fewer times than road in the corpus, street strongly collocates
with more word types (850) than does road (794). This implies that
road operates within a more restricted domain and consequently repels
more word types than street in the collocate space. The two words share
many strong collocates (129 in fact), but on the other hand road
actively repels 124 of street's collocates, almost as many as it shares.
The word street repels 92 of road's collocates. Referring to the
graphic representation in Table 2, 129 collocates occupy the common shaded area
between the two circles, whereas 124 and 92 items occupy the crescents on either
edge of the circles respectively, representing repulsion. There are 573 strong
collocates of road which only weakly collocate with street, and
conversely 597 strong collocates of street which only weakly collocate
with road. The other boxes are empty because we have been considering
only the significant collocate space of the two words, and in this space, we are
not likely to find any output of words that are either repelled by both road
and street or weakly collocating with each of them. Only when we
consider all collocates of road and street can we expect
these boxes to be filled. We shall later in the project also consider the contents
of these boxes. The considerable number of repulsion candidates indicates that
road and street not only have semantic and referential
differences, but also different textual uses. Table 3 lists some of the words
that either road or street repel or attract.
Table 3. Collocates exhibiting different textual uses of road and street.
road repels (street attracts) | cred, corners, vendors, robberies, urchins, banks, retailer, robbery, retailers, civvy, demonstrations, cleaning
street repels (road attracts) | rage, pricing, haulage, trunk, hauliers, junctions, congestion, rocky, blocks, junction, toll, coastal
both attract | layout, sweeper, cobbled, deserted, adjacent, urban, protests, quiet, pedestrian, littered, opposite
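The cross-tabulation described for Table 2 can be sketched as follows. This is a hedged illustration, assuming each word's collocational profile is available as a dictionary from collocate to z-score; the function names and the toy profiles are hypothetical, and only the two scores for main are taken from the text. As in the study, only the significant collocate space is counted, i.e. words that are strong collocates of at least one of the two synonyms.

```python
from collections import Counter


def band(score, cutoff=2.0):
    """Return the association band for a z-score (cut-offs at -2 and 2)."""
    if score < -cutoff:
        return "repulsion"
    if score > cutoff:
        return "strong"
    return "weak"


def cross_tabulate(scores_a, scores_b, cutoff=2.0):
    """Count collocates by (band with word A, band with word B).

    Words that are not a strong collocate of either word are skipped,
    which leaves the repulsion/weak cells among themselves empty.
    """
    table = Counter()
    for word in scores_a.keys() & scores_b.keys():
        cell = (band(scores_a[word], cutoff), band(scores_b[word], cutoff))
        if "strong" in cell:
            table[cell] += 1
    return table


# Toy profiles: the 'main' scores come from the text, all others are invented.
road = {"main": 232.256, "haulage": 40.0, "cred": -3.0, "quiet": 5.0, "the": 0.1}
street = {"main": 331.407, "haulage": -2.5, "cred": 60.0, "quiet": 6.0, "the": 0.3}

table = cross_tabulate(road, street)
print(table[("strong", "strong")])     # shared strong collocates
print(table[("strong", "repulsion")])  # road attracts, street repels
print(table[("repulsion", "strong")])  # road repels, street attracts
```

Dropping the `"strong" in cell` filter would fill the remaining cells, which corresponds to considering all collocates of the two words rather than the significant collocate space alone.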
The synonyms royal and regal
shown in Table 4 were picked from Palmer's list of 'etymological synonyms',
and our working assumption was that there was no obvious difference in
meaning [2] between them.
Table 4. Royal
and regal repulsion cross-tabulation.
In fact, we discover in the output that
the two words share only 27 collocates, while regal actively repels 658
of royal's collocates. The explanation for this is two-fold. Firstly, the
word royal is much more frequent in the corpus than the term
regal, in a ratio of 24,036:1,262 occurrences, and thus has more general
currency, while the rarer term regal is by definition more selective
about its collocates. The second reason is that the output shows that there is
clearly a distinction in the sense of each of the words when used in text:
regal is differentiated from royal in meaning 'not actually royal,
but acting in a manner imitating or associated with royalty' (see Table 5 below).
Table 5. Collocates exhibiting different textual uses of royal and regal.
royal repels (regal attracts) | disdain, suitably, pelargoniums, poise, appropriately, splendidly, vice, gesture, hauteur, surroundings, arrogance, setting, quality, look
regal repels (royal attracts) | assent, palace, charter, yacht, household, wedding, Saudi, infirmary, family's, prerogative, commission, warrant, blue, jelly
both attract | style, dignity, touch, authority, procession, robes, progress, purple, tour, status, figure, treatment, presence, power
The immaculate-impeccable pairing
shown in Table 6 was chosen out of curiosity because it forms part
of a set of semantically similar words, including also flawless, spotless and
faultless, which had been suggested by another writer (Laviosa-Braithwaite 1996).
This pair was judged by us to be
particularly similar semantically; they also had the advantage of being
equi-frequent, in the ratio of 3,723:3,355.
Table 6.
Immaculate and impeccable repulsion
cross-tabulation.
In fact, in Table 6 it can be seen that,
despite their being rather low-frequency words, immaculate and
impeccable share 68 strong collocates, thus confirming a high degree of
semantic overlap between the two words. Nevertheless, impeccable also
actively repels 46 of immaculate's collocates, fewer than it shares,
while immaculate actively repels 60 of impeccable's
collocates.
Table 7. Collocates exhibiting different textual uses of immaculate and impeccable.
immaculate repels (impeccable attracts) | credentials, logic, source, liberal, behaviour, character, sources, integrity, academic, connections, references, speaks, judgment, provenance, politeness, craftsmanship
impeccable repels (immaculate attracts) | conception, garden, lawn, gardens, grass, dark, flat, coiffure, coercion, bungalow, nails, copy, kitchen, passing, waiters, home
both attract | timing, English, display, taste, technically, otherwise, performance, technique, control, playing, sense, normally, detail, taste, handling, manners, black
When we look at the actual elements of
attraction and repulsion in question, shown in Table 7, the reason becomes
clear. The word impeccable is associated with abstract qualities, while
the word immaculate characterises physical states of perfection
(primarily appearance).
The synonym pair ideas and
plans seen in Table 8 was chosen out of curiosity because it forms part
of a larger semantic set, with concepts, proposals,
projects, and propositions. The particular pair was selected because
it was intuited to be closer semantically than some of the others and so likely
to generate more surprising repulsion.
Table 8. Ideas
and plans repulsion cross-tabulation.
Table 9. Collocates exhibiting different textual uses of ideas and plans.
ideas repels (plans attracts) | contingency, best-laid, shelved, expansion, announced, equity, afoot, shelve, finalising, spending
plans repels (ideas attracts) | preconceived, bright, abstracts, bounce, exchanging, swap, philosophical, progressive, wacky, sharing
both attract | innovative, radical, half-baked, revolutionary, imaginative, grandiose, definite, floated, original, submit
However, the collocational-repulsion
output in Table 9 does reveal some clear differences. The word ideas
seems to sit at the provisional, abstract, ideational, conceptual end of a
process, which then progresses to the plans stage, which becomes concrete
and ordered and executable.
We then took the words expert and
specialist, in the belief that they were even closer semantically and
would therefore show low repulsion scores. In fact, we found that the opposite
was true.
Table 10. Expert and specialist repulsion cross-tabulation.
Table 11. Collocates exhibiting different textual uses of expert and specialist.
expert repels (specialist attracts) | new, company, course, school, record, programme, five, role, position, centre, companies, small, sales, press, business, biggest, bank, shows, areas, paper, jobs, firm, hospital, area, event, cover, site, coach, clubs, seven, largest, selling, station, commercial, practice, troops, cars, magazine, department, products, community, student, interests, sport, interest, policies, programmes, network, agencies, squad
specialist repels (expert attracts) | world, great, evidence, view, report, real, patients, hands, touch, comment, views, space, telling, Lord, relationship
both attract | witnesses, cancer, advice, weapons, panel, opinion, leading, explosives, computer, fertility, medical, fitness, forensic, knowledge, independent, legal, Professor, eye, pensions, tuition, foremost, advisers, security, relations, advisory, writer, judges, care, staff, seeking, turnaround, assessment, supervision, design, recruitment, engineering, offering, mortgage, development, technology, radiation, ethics, resources, infertility, psychiatric, Computer, welfare, accountancy, investment, advisors
Both expert and specialist
have similar frequencies in the corpus and share 123 strong collocates (see
Table 10 above). The proportion of shared collocates is considerably bigger than the
proportion that they repel in their joint collocate space. This implies that
both words operate in a very similar semantic sense. However, the highest
repulsion scores for expert are with new, company, course and
school, whilst those for specialist are with world, great,
evidence, view and report (see Table 11). The data thus show a
general trend by which specialist can be applied to both human and
inanimate objects and concepts, whereas expert tends to be restricted to
people. The difference between the two words seems, as with road and
street, to come down to specific semantic attributions within the
journalistic medium.
We have given a taste of the early
findings with relation to the lexical repulsion obtaining between synonyms,
findings which are proving intuitively promising, and certainly intriguing. But
the project has run only for four months at the time of writing. Over the next
twenty months, the plan is to move on to investigate the following areas:
modified statistical thresholds, repulsion spans, directionality of repulsion in
fixed phrases, semantic repulsion, as well as text at phrase level, repulsion
across sentence boundaries, and the effect of case
sensitivity.
Although we have been observing some
clear examples of lexical repulsion between the collocates of synonyms, the
statistical measures and thresholds applied were giving preference to the more
frequent words in the corpus. To begin with, the statistical formula for
assigning repulsion (Equation 1) was related to traditional ideas of collocation
whereby having an occurrence greater than that expected by chance automatically
generates a high score. These statistical measures are particularly appropriate
for dealing with rare events.
For example, in the case of road
and street (Table 12), words like life, market, children and
London that are highly frequent in the corpus show the strongest repulsion
(column 4 of the same table) with road, even if road sometimes
collocates with these very words that it repels. This is because, even though
a word like life collocates with road 4 times in the
corpus, the event is considered rare, since life occurs 399,518 times (column 2 of the same table) in
the whole corpus. The high frequency of such words in the corpus raises their repulsion scores and lowers their attraction scores with road. There are also words like sales, art and bank
in the repulsion list which do not collocate at all with road,
but show different (lower) repulsion scores based purely on their lower frequency in the
corpus.
Table 12.
Road and street repulsion scores using initial
formula.
Column 1: target word; Column 2: corpus
frequency of target word; Column 3: collocate frequency of target word with road
over span=1; Column 4: repulsion score between target word and road; Column 5:
collocate frequency of target word with street over span=1; Column 6: attraction
score between target word and street
In order to counteract the dominating
effect of word frequency on the repulsion score, we tested by modifying the
formula to:
Equation 2 shows that the square-root calculation in the
denominator has been removed. This means that when the observed
collocational frequency of one word with another is zero, the repulsion score
will always be -1; irrespective of the frequency of the word in the corpus. The
new modified scores (Table 13) show that we can now identify the more relevant
candidates like cred, corners, vendors, robberies, in a relationship of
repulsion with road which all have a score of -1 because they never
collocate with road. The new scores have been ordered by the attraction
scores for street, so that words that are more strongly attracted to
street are the ones that show highest repulsion with road and
sit furthest away from road, according to the graphical representation
proposed for collocate space in a diagram at the start of the paper (Figure 1).
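The effect of removing the square root can be illustrated numerically. Neither equation is reproduced in this text, so the two forms below are assumptions chosen only because they match the prose: a score of (O - E) / sqrt(E) grows with the expected frequency E when the observed frequency O is zero, whereas (O - E) / E always gives -1 in that case. The function names are hypothetical.

```python
import math


def initial_score(observed, expected):
    """Assumed form of the initial measure: (O - E) / sqrt(E)."""
    return (observed - expected) / math.sqrt(expected)


def modified_score(observed, expected):
    """Assumed form of the modified measure: (O - E) / E."""
    return (observed - expected) / expected


# A collocate that never co-occurs with the node (O = 0), at three
# different expected frequencies, i.e. three different corpus frequencies.
for expected in (0.25, 4.0, 400.0):
    print(expected, initial_score(0, expected), modified_score(0, expected))
```

Under these assumed forms, the initial score for a never-occurring pair becomes more negative as the collocate gets more frequent, while the modified score stays at -1, so ranking has to come from another column, such as the attraction score for the other synonym.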
Table 13. New road
and street repulsion scores using modified
formula.
Column 1: target word; Column 2: corpus
frequency of target word; Column 3: collocate frequency of target word with road
over span=1; Column 4: repulsion score between target word and road; Column 5:
collocate frequency of target word with street over span=1; Column 6: attraction
score between target word and street
We have always expected to discover the
strongest repulsion occurring within span 1, because observation has shown us
that a word exerts the greatest influence on its immediate neighbours, and vice
versa. We already know that lexical attraction decreases as the span between two
words increases. A wider span brings in other influencing factors on the
position of the word in the corpus.
Table 14. Repulsion between immaculate and impeccable over span 4 and span 1.
Column 1: target word; Column 2: corpus
frequency of target word; Column 3: collocate frequency of target word with immaculate;
Column 4: repulsion/attraction score between target word and immaculate; Column 5:
collocate frequency of target word with impeccable; Column 6: repulsion/attraction
score between target word and impeccable
Span 4
word | corpusfreq | collfreq1 | immaculate | collfreq2 | impeccable
Labour | 249743 | 0 | -3.078 | 19 | 2.619
political | 243483 | 0 | -3.039 | 22 | 3.671
social | 143061 | 0 | -2.334 | 12 | 2.428
credentials | 7864 | 4 | 2.644 | 365 | 335.699
defence | 108517 | 10 | 2.844 | 15 | 4.834
character | 58785 | 4 | 0.950 | 29 | 16.000
flat | 52404 | 15 | 8.367 | 5 | 1.602
kitchen | 27304 | 7 | 12.149 | 2 | 0.245
Span 1
word | corpusfreq | collfreq1 | immaculate | collfreq2 | impeccable
political | 243483 | 0 | -1.601 | 12 | 5.536
social | 143061 | 0 | -1.352 | 10 | 5.799
credentials | 7864 | 0 | -1.019 | 97 | 94.007
defence | 108517 | 0 | -1.265 | 5 | 2.566
character | 58785 | 0 | -1.142 | 24 | 19.568
flat | 52404 | 9 | 6.863 | 0 | -1.141
kitchen | 27304 | 6 | 4.566 | 0 | -1.073
In the example of immaculate and
impeccable, Table 14 shows words ordered by highest repulsion with
immaculate. The word credentials collocates strongly with
impeccable at span 4, but credentials also collocates 4 times with
immaculate and shows no repulsion. At span 1, however, credentials
does not collocate with immaculate at all, but still retains strong
collocation with impeccable, so repulsion is evident with
immaculate. High frequency lexical words such as political and
social, which collocate strongly with impeccable over span 4 and
not at all with immaculate, exaggerate their repulsion with
immaculate, a skew which can be balanced by applying tighter spans.
Labour, a high frequency word in the corpus, collocates 19 times (span 4)
with impeccable though semantically unrelated. This effect is overcome at
span 1, where the word drops from the list. The results show that when
considering tighter spans, particularly span 1, word pairs like immaculate
+ credentials, immaculate + defence, and immaculate +
character never occur and actively repel each other. Similarly
impeccable + flat and impeccable + kitchen never occur
either. This repulsion is not strongly evident when the span is increased to
4.
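The dependence of the collocate counts on span can be sketched with a simple window count. This is a minimal illustration under stated assumptions: the function name and the toy sentence are hypothetical, matching is case-insensitive as in the study, and article boundaries are ignored for brevity.

```python
def count_within_span(tokens, node, collocate, span):
    """Count occurrences of `collocate` within +/- `span` tokens of `node`."""
    tokens = [t.lower() for t in tokens]
    node, collocate = node.lower(), collocate.lower()
    count = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]     # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]    # up to `span` tokens after the node
        count += left.count(collocate) + right.count(collocate)
    return count


# Invented example sentence, not corpus data
text = "his impeccable credentials and her immaculate kitchen".split()

print(count_within_span(text, "impeccable", "credentials", 1))  # adjacent pair
print(count_within_span(text, "immaculate", "credentials", 1))  # not adjacent
print(count_within_span(text, "immaculate", "credentials", 4))  # caught by wider span
```

In the toy sentence, immaculate and credentials co-occur only when the span is widened to 4, which mirrors how a wider span can introduce co-occurrences that mask repulsion between immediate neighbours.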
Our manual trial confirmed that
statistics can show strong positive collocation scores in one direction and
negative scores in the other, for instance in the case of so-called
'irreversible bi-nomial' and 'irreversible tri-nomial' phrases. The scores for right- and left-handed spans of 4 were
as shown in Table 15, where wine and dine, and calm and
collected, respectively, are clearly shown to be the preferred ordering
within their phrases.
Table 15. Directionality scores for irreversible phrases.
wine and dine | (cool,) calm and collected
wine R+/-4 coll dine -0.117 | calm R+/-4 coll collected -1.235
wine L+/-4 coll dine 31.951 | calm L+/-4 coll collected 57.890
We conclude from this that a repulsion
calculation, using a +/-2 span, should be possible to tailor, which will be
useful for identifying that subset of the phraseology of English intuitively
characterised as 'irreversible bi- and tri-nomials', and we shall include this
in the next phase of investigation.
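Directional counting of the kind that underlies such scores can be sketched as follows. This is an illustration only: the function name and the toy sentence are hypothetical, "left" and "right" are defined here relative to the node word, and no claim is made about how these map onto the R and L labels in Table 15.

```python
def directional_counts(tokens, node, collocate, span=4):
    """Return (left, right): how often `collocate` falls within `span`
    tokens before the node and within `span` tokens after it."""
    tokens = [t.lower() for t in tokens]
    node_positions = [i for i, t in enumerate(tokens) if t == node.lower()]
    coll_positions = [i for i, t in enumerate(tokens) if t == collocate.lower()]
    left = right = 0
    for i in node_positions:
        for j in coll_positions:
            offset = j - i              # negative: collocate precedes the node
            if 0 < offset <= span:
                right += 1
            elif -span <= offset < 0:
                left += 1
    return left, right


# Invented example, not corpus data
text = ("they wine and dine clients , stay calm and collected , "
        "then wine and dine again").split()

print(directional_counts(text, "wine", "dine"))       # dine only follows wine
print(directional_counts(text, "dine", "wine"))       # wine only precedes dine
print(directional_counts(text, "calm", "collected"))
```

Feeding the left and right counts separately into a collocation measure would give one score per direction, so an irreversible phrase would show attraction in one direction and indifference or repulsion in the other.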
Semantic repulsion is a relationship
which we define as obtaining between two words which do not collocate with each
other in conventional text, and which also do not share significant collocates.
We would have in mind here such repulsion pairs as pineapple+scarf, or
happy+murder. We would try to factor out word pairs such as happy
and death, which are in principle semantically incompatible, but
which are juxtaposed fairly regularly if counter-intuitively to achieve a
stylistic effect.
As a starting point for our assessment of
semantic disparateness, and thence semantic repulsion, we selected the word pair
bus and butter. On the basis of our intuition, it seemed
reasonable to assume that these were not semantically related, and that this
would be indicated by the absence of shared collocates (and presence of common
repelled, or weakly collocating, items) between them.
Table 16. Bus and butter repulsion cross-tabulation.
The cross-tabulation shown in Table 16
confirms our expectation that most of the words in the shared collocate space
are weakly collocating or showing repulsion with either bus or
butter, unlike the examples of synonyms we have shown earlier, where most
words strongly collocated with both. However, it also emerged that,
whilst bus and butter are shown by their repulsion profiles to be
largely different in semantic terms, they do in fact share a few (8 out of a
total of 978) strong collocates and thus can be said to be partially related.
Table 17 shows both the lists of words
they each repel on account of their semantic differences, and the list of
collocates which they share. The lists are obvious when one sees them, but the
content of the third column: words constituting the collocates shared by the two
unrelated words, is not intuitively accessible even to a native speaker (or, to
give a practised few the benefit of the doubt, at least not without lengthy
reflection).
It will be apparent to the keen eye that
the particular scoring method used for this particular study favours
high-frequency words. For example, a word like white (corpus frequency:
120013) which only collocates with bus and butter 22 and 11 times
respectively, appears as a strong collocate of both. Future work (as explained
in the above section) will involve applying a variety of statistical measures to
address issues like these.
Table 17. Collocates showing bus and butter repulsion.
bus repels (butter attracts) | put, real, season, words, paper, oil, issues, little, products, fresh, rich, life, stuff, add, wine, cut, sea, cold, served, Zealand, spread, remaining
butter repels (bus attracts) | new, left, hour, small, public, went, Manchester, UK, coming, last, school, next, market, car, coach, main, special, private, taking, old, central, group, attack, back, city, system, full
both attract | rolls, white, extra, substitute, instead, orange, cooking, yellow
In this paper, we have introduced the notion of 'repulsion' in text, as a 'force' or tendency
opposing that of collocation or 'attraction'. With our background in the study of the
relationship between surface features of text and meaning, our focus has been on 'lexical
repulsion' and, briefly, on 'semantic repulsion', rather than on, say, 'grammatical repulsion',
since both of these types can be viewed, and we hope to some extent also explained, in terms of collocation.
We are interested in identifying repulsion as an active phenomenon, as opposed to one of collocational
indifference, and as a measurable one.
In these early stages, we have made initial sorties into different aspects of the topic, as reflected in the structure of this paper. We have presented an analysis and discussion of the lexical repulsion revealed by a range of synonym pairs based on research so far; and drawn conclusions about what the next stages of the research should be, with reference to such things as the refinement of statistical measures and thresholds, specific aspects of repulsion such as directionality and span, and importantly, also to semantic repulsion.
We have so far confirmed our intuition that the most interesting, and also the most manageable, aspect of lexical repulsion is that which obtains between sense-related pairs: interesting, because it is in principle surprising that two related words would repel each other's collocates, and manageable, because the output is inevitably a much reduced subset of the total lexicon. As we know, synonymy is only a partial phenomenon, and synonyms differ in meaning according to the particular functions they each fulfil, their frequency of occurrence in text, their range of senses, and the types of context in which they typically occur. In our studies so far, we have confirmed that synonyms actively repel certain of each other's collocates wherever they differ in these aspects.
Furthermore, we have been able to discover through our repulsion output that collocational differences in the behaviour of two synonyms, which have until now been thought to be merely arbitrary and conventional, are in fact systematic and explicable, primarily in terms of detailed semantics. This discovery is important for language teaching, research and NLP, because it will certainly lead to the provision of hitherto unavailable information about the lexicon which is finer-grained, objective and accessible.
* We should like to thank the Engineering and Physical Sciences Research Council for its support of the Repulsion project, under grant No. EP/D502551/1, and the reviewers of this paper for their helpful comments.
[1] The fact that, under different magnetic circumstances, electrically-charged
particles can also repel each other, conveniently allows us to extend the
metaphor to characterise the converse linguistic phenomenon which we argue
exists, the relationship of 'repulsion' into which word pairs may under
other circumstances enter.
[2] There is an element of sophistry in our selectional methodology, since one
cannot gaze at words in texts for decades without developing an awareness and
a predictive ability of the differences in meaning and use between so-called
synonyms in English.
ACRONYM (Automatic Collocational Retrieval of NYMs), http://rdues.bcu.ac.uk/acronym.shtml
Research and Development Unit for English Studies (RDUES), http://rdues.bcu.ac.uk
Aronoff, M. 1976. Word Formation in
Generative Grammar. Cambridge, Mass.: MIT Press.
Beeferman, D., A. Berger & J. Lafferty.
1997. "A Model of Lexical Attraction and Repulsion". Proceedings of
the 35th Annual Meeting of the ACL and 8th Conference of the EACL,
Madrid, Spain, 7-12 July 1997,
373-380. Morristown, N.J.: Association for Computational Linguistics.
Bonci, A. 2002. "Collocational
Restrictions in Italian as a Second Language". Tuttitalia 26:
314.
Firth, J.R. 1957 [1951]. "Modes of meaning".
Papers in Linguistics, 1934-1951, by J.R. Firth, 190-215. London: Oxford University Press.
Kim, D.W. 1998. "Finding the Reader in
Literary Computing". Computing in the Humanities Working Papers A.11, April 1998. Jointly publ.
with TEXT Technology 8.1, Wright State University.
http://projects.chass.utoronto.ca/chwp/kim/
Laviosa-Braithwaite, S. 1996.
"Comparable Corpora: Towards a Corpus Linguistic Methodology for the
Empirical Study of Translation".
Proceedings of the Maastricht
Session of the 2nd International Maastricht-Lodz
Duo Colloquium on 'Translation and Meaning',
ed. by M. Thelen & B. Lewandowska-Tomaszczyk, Part 3, 153-163.
Maastricht: Hogeschool Maastricht.
Pacey, M., A.J. Collier & A.J.
Renouf. 1998. "Refining the Automatic Identification of Conceptual
Relations in Large-scale Corpora". Proceedings of the Sixth Workshop
on Very Large Corpora, at COLING-ACL, Montreal, 15-16 August 1998, ed. by E. Charniak,
76-84. University of Montreal & Morgan Kaufmann Publishers.
Palmer, F.R. 1981. Semantics.
Cambridge: Cambridge University Press.
Renouf, A.J. 1993. "Making Sense
of Text: Automated Approaches to Meaning Extraction". Proceedings of the
17th International Online Information Meeting, London, 7-9 December 1993
(= Online Information, 93), ed. by D.I. Raitt & B. Jeapes, 77-86. Oxford: Learned Information.
Renouf, A.J. 1996. "The ACRONYM
Project: Discovering the Textual Thesaurus". Papers from English
Language Research on Computerized Corpora (ICAME 16), ed. by I. Lancashire,
C. Meyer & C. Percy, 171-187. Amsterdam & New York: Rodopi.
Resnik, P. 1997. "Selectional Preference
and Sense Disambiguation". Presented at the ACL SIGLEX Workshop on
Tagging Text with Lexical Semantics: Why, What, and How?, Washington, D.C., April
4-5, 1997. http://www.aclweb.org/anthology/W97-0209.pdf
Yip, M. 1998.
"Identity avoidance in phonology and morphology".
Morphology and its Relation to Phonology and Syntax,
ed. by S. LaPointe, D. Brentari & P. Farrell, 216-246.
Stanford, Calif.: CSLI Publications.