A phraseological comparison of international news agency reports published online: Lexical bundles in the English-language output of ANSA, Adnkronos, Reuters and UPI

Federico Gaspari
University of Bologna at Forlì (Italy)

Abstract

This paper presents a study of the lexical bundles (LBs) used in the English-language reports published online by four international news agencies: ANSA and Adnkronos (both based in Italy), Reuters and United Press International (UPI) – headquartered in the UK and the USA, respectively. Given that they provide insights into the phraseological make-up of news texts, LBs are considered revealing indicators of the complex strategies at work in the news-making process: the LBs found in the ANSA and Adnkronos online news reports (which result from an elaborate process of linguistic mediation combining translation, non-native writing, cross-linguistic summarisation, editing and adaptation) are compared with those used by Reuters and UPI (representing the native/original benchmark), in search of commonalities as well as divergences, and peculiar usage patterns displayed by individual news agencies.

Only one 4-word LB occurs in all four sources (“the end of the”), and the analysis looks more closely at those that are over- and under-represented across the corpora, considering their discursive functions to account for the observed discrepancies. The results bring to the fore the distinctive phraseological features of the mediated reports published in English by the two Italian news agencies: three LBs (“the head of the”, “as well as the” and “is one of the”) are present both in the ANSA and Adnkronos data, but they do not feature in the Reuters and UPI corpora; conversely, no LB used in both native/original corpora is simultaneously absent from the ANSA and Adnkronos data.

In addition, the overall usage of 4-word LBs is much higher in the ANSA and Adnkronos corpora compared to the Reuters and UPI data, suggesting that the mediated news texts are much more formulaic than their native/original counterparts. Another interesting finding of this study is that ANSA and Adnkronos taken individually have the highest number of LBs that are not attested in the data of any of the other three news agencies, which points to a more idiosyncratic use of phraseology in the mediated news reports.

1. Introduction and overview

This paper presents a phraseological study of news agency reports in English published on the Internet by four international news agencies, comparing their use of lexical bundles (LBs). The investigation concerns two news agencies based in Italy (ANSA and Adnkronos) and two with headquarters in English-speaking countries, i.e. Reuters and United Press International (or UPI). The English-language news reports published online by ANSA and Adnkronos result from a combination of editorial processes: they may be based to some extent on content already available in English, partly translated from original Italian sources, occasionally by journalists, or written from scratch by non-native speakers of English, sometimes as part of cross-linguistic summaries, and subsequently edited or revised. The addition of explanations or background knowledge for the benefit of international readers is not unusual when the news has a strong Italian focus. This complex scenario gives rise to heavily mediated texts, and is quite common on the Internet, where collaborative authoring as well as the combination and cross-linguistic repurposing of multiple sources are becoming increasingly common, as in the case of Wikipedia.

On the other hand, the news published online by Reuters and UPI is generated in a more traditional and monolingual environment, which is less influenced by linguistic contact and multiple hybrid modes of content production. This study intends to investigate whether the different settings in which the news reports are created leave phraseological traces, comparing the LBs found in the mediated news texts by ANSA and Adnkronos with the native/original ones by Reuters and UPI.

Following this introduction, Section 2 briefly presents related work, emphasising the relevance of phraseology, and of LBs in particular, for variationist studies. Section 3 introduces the research hypothesis and the objectives of the study, while Section 4 presents the sources used to build the corpora, starting with a short description of the role played by international news agencies; next, the procedure employed to collect the news reports semi-automatically from the Web is explained, and the main features of the corpora thus built are described.

Section 5 is devoted to the comparative analysis of LBs used by the four news agencies, and opens with a discussion of how the notion of LB was operationalised for this investigation, in the light of previous phraseological studies. In Sections 5.2–5.6 the results of the analysis are discussed in detail, focusing first on the total set of LBs in the pairs of mediated vs. native/original corpora, then zooming in on those without (parts of) proper names, to conduct a more focused phraseological comparison. Next the LBs shared by all four news agencies are examined, with an exploration of special cases, i.e. borderline examples (phraseological sequences that were just below the minimum frequency cut-off point to be considered in the study) and LBs including the constituent “end”, which show an interesting behaviour. This is supplemented by an investigation of LBs found only in one of the four corpora, to look at idiosyncratic and distinctive phraseological features.

Section 5.7 completes the analysis with a methodological reflection on the research design, discussing the collocational properties of the LBs that are exclusive to the UPI corpus, to show the potential of this more detailed level of the qualitative investigation. Finally, Section 6 draws some conclusions, summarising the reasons why the main findings support the hypothesis that guided the study. Attention is also paid to the limitations of the investigation and to some outstanding methodological issues that require further research. By way of conclusion, plans for future work extending this study are outlined.

2. LBs and phraseology in variationist studies

LBs are phraseological units that have been defined as “recurrent expressions, regardless of their idiomaticity, and regardless of their structural status. [They] are simply sequences of word forms that commonly go together in natural discourse” (Biber et al. 1999: 990). LBs have been the focus of several studies on the phraseological make-up of genres, registers and varieties of English, and have been investigated extensively in variationist and contrastive research. The survey presented in this section provides a brief overview of related work, emphasising the relevance of phraseology and LBs for variationist studies focusing on written English.

Stubbs & Barth (2003) show that the frequency of individual words and LBs of varying length can help to discriminate between text types, comparing three corpora consisting of fiction, belles lettres and academic writing. Similarly, Stubbs (2007) uses data extracted from the British National Corpus (BNC) to look at the distribution of high-frequency multi-word units in a variety of text types, describing their structure, lexical components and functions.

LBs have also been shown to discriminate not only between texts representing different text types and genres, but also between the language use of professional and student authors within the same field. Scott & Tribble (2006: 131ff) present a case study focusing on literary criticism texts produced by expert and novice authors, while Chen & Baker (2010) report on LBs found in published vs. student academic writing. Concerning LBs used in texts from different disciplines, Hyland (2008a, 2008b) explores the forms, structures and functions of LBs in research articles, doctoral theses and Master’s dissertations across four disciplines, while Cortes (2004) brings together these two strands of research by looking at the structure and function of LBs in published and student texts in history and biology.

Bernardini et al. (2010) take an approach based on a monolingual comparable corpus, looking at the LBs found in the institutional academic English used on the websites of Italian universities (representing a mediated variety, resulting e.g. from translations and non-native writing) alongside native original institutional texts published online by UK and Irish universities. Their phraseological analysis classifies LBs in terms of structural and functional properties, and reveals interesting similarities as well as differences between the native/original and mediated varieties.

In addition, a number of studies have used LBs to contrast spoken and written registers of English, especially in the academic domain (e.g. Biber & Conrad 1999; Biber et al. 2004; Biber 2006; Biber & Barbieri 2007). The importance of LBs is also well established in research focusing on learners of English and in studies of academic English with a pedagogical inclination (e.g. Cortes 2006; Juknevičienė 2009; Nekrasova 2009; Groom 2009). In sum, this brief overview of related work shows the relevance of considering LBs in variationist studies of English phraseology, and provides the background for the present study.

3. Research hypothesis and objectives of this study

This paper aims to compare the use of LBs in the online news reports in English of four international news agencies, especially when they are taken in pairs: the Italy-based ANSA and Adnkronos, on the one hand, and the British Reuters and the American UPI, on the other. The hypothesis underlying this research is that there are quantitative as well as qualitative differences in the LBs used by these two sets of news sources.

There is growing interest in the investigation of the editing and revisions that are routinely applied to translated and non-native texts with varying degrees of mediation, which leave distinctive traces in a range of textual features (see e.g. Murphy 2008 for the differences between edited and non-edited EU texts written in L2 English). However, the theoretical import of various mediation practices is as yet little understood by the linguistics and translation studies communities. This study therefore intends to make a contribution in this area by exploring the potential of LBs as indicators of distinctive phraseological patterns emerging as a result of the complex mediation processes entailed by the production of news agency reports published online.

The reason for this is that quantitative and qualitative phraseological differences reflected by the LBs found in the ANSA and Adnkronos news texts compared to the native and original reports by Reuters and UPI are likely to derive (at least in part) from the news-making and journalistic practices adopted in the respective contexts.

4. Data sources and corpus description

4.1 International news agencies publishing reports in English

News agencies play a substantive role in shaping what is reported by print, broadcast and web-based media all over the world (Van Dijk 1988; Bell 1991: 44ff.; Vuorinen 1997; Read 1999; Clausen 2004; Horvit 2006; Richardson 2007: 106ff.; Shrivastava 2007; van Doorslaer 2009). Stories carried by international news agencies are reproduced by several media outlets with varying degrees of adaptation, especially when reports are initially published on the Internet in major languages (Boyd-Barrett & Rantanen 1998; Holland 2006; Bielsa 2007; Bielsa & Bassnett 2009); English in particular exerts a strong influence on the (translation of) news that spreads globally (Sidiropoulou 1995; Hursti 2001; Bassnett 2005; Hajmohammadi 2005; Kuo & Nakamura 2005; Orengo 2005; Schäffner 2005; Conway & Bassnett 2006; Lee 2006; Kang 2007; Caimotto 2010).

The websites of the two leading international news agencies based in Italy, ANSA and Adnkronos, have been offering content in English for a number of years, and their online news coverage in English is currently updated and extended on a daily basis. Visitors can access the news in English via links prominently located on the home pages of the respective websites, from which they are directed to sections of the websites offering content in English. Only a fraction of this news in English is exclusively or predominantly devoted to domestic socio-political issues, and some of the reports may be partially translated from original articles in Italian, which are also published on the Web.

4.2 Data sources for the monolingual comparable corpus

Generalist international news agencies publish reports covering a very wide range of topics (politics, sports, culture, business, science, etc.) concerning stories that originate anywhere in the world, and the four news sources considered in this study are no exception. Both ANSA and Adnkronos publish a number of news stories in English on their websites on a daily basis, which have a narrower geopolitical focus compared to Reuters and UPI, partly centred on Italy, even though they also provide international coverage. In fact, most of ANSA’s and Adnkronos’ news stories available in English which seemingly or ostensibly focus on Italy have a wider international dimension or relevance, e.g. at the European level or in terms of bilateral relationships with other countries such as the United States, Russia, commercial partners in the Middle East, etc.

Since the English-language online output of all four news agencies covers stories of international relevance on a large variety of topics that are destined to a global Internet audience, the overall contents and potential readership of ANSA, Adnkronos, Reuters and UPI can be assumed to be broadly similar in principle. As a matter of fact, the websites of international news agencies play similar roles worldwide as sources of information that are consulted and quoted from by other news outlets (national and local newspapers, radios and broadcasters) as well as by a range of other organisations and individuals (government agencies, business analysts, commentators, etc.). As such, the websites of international news agencies represent ideal sources of monolingual topic-comparable corpora (Gaspari & Bernardini 2010).

4.3 Corpus construction

The corpus on which this study is based was built using BootCaT (Baroni & Bernardini 2004), a suite of integrated Perl scripts for the semi-automatic creation of corpora from the web. [1] The texts in English were downloaded between early April and mid-May 2010 from the websites of the four international news agencies, although older news reports are included in the respective corpora. [2] To make the download procedure as precise but also as comprehensive as possible, suitable “catch-all” Internet addresses were identified for each of the four websites. [3] These specific sub-domains were searched providing 17 function words as seeds to BootCaT, according to a number of parameters that were applied consistently for the compilation of all four corpora: the tuple length was set to 5, i.e. the searches were performed by BootCaT combining in random sets 5 of the 17 function words at a time; the maximum number of tuples to be randomly generated (corresponding to as many Internet searches to obtain downloadable texts) was set to 1,000; and, finally, there was a limit of 100 web pages to be downloaded for each search based on a 5-keyword tuple. [4]
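The tuple-generation step described above can be illustrated with a minimal Python sketch. This is not BootCaT's own code (BootCaT is a suite of Perl scripts): the seed list is a hypothetical placeholder for the 17 function words actually used, the function name make_tuples is invented for illustration, and keeping only distinct tuples is an assumption rather than documented BootCaT behaviour.

```python
import math
import random

# Hypothetical placeholder seeds: the 17 function words actually used
# in the study are not reproduced here.
SEEDS = ["the", "of", "and", "to", "in", "that", "for", "with", "on",
         "by", "as", "at", "from", "is", "was", "it", "this"]

def make_tuples(seeds, tuple_length=5, max_tuples=1000, rng=None):
    """Return distinct random combinations of `tuple_length` seeds."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    # Never ask for more tuples than there are possible combinations.
    target = min(max_tuples, math.comb(len(seeds), tuple_length))
    seen = set()
    tuples = []
    while len(tuples) < target:
        combo = tuple(sorted(rng.sample(seeds, tuple_length)))
        if combo not in seen:
            seen.add(combo)
            tuples.append(combo)
    return tuples

tuples = make_tuples(SEEDS)
print(len(tuples), tuples[0])
```

With 17 seeds and a tuple length of 5 there are 6,188 possible combinations, so a limit of 1,000 tuples samples only part of that space.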

BootCaT considerably speeds up the corpus construction and cleaning process since it incorporates routines that detect and discard duplicate pages, also removing non-textual components before the data are included in the corpus (e.g. HTML code, boilerplate, navigation elements, etc.). [5] All the candidate web pages with relevant texts provided by BootCaT were accepted for inclusion in the corpus. However, the downloaded texts were also checked manually to filter out boilerplate missed by BootCaT’s automatic cleaning procedures (e.g. menus, news alerts, etc.), so as to obtain a clean corpus. In addition, the author used further semi-automatic procedures to remove remaining repeated articles within the four corpora, because BootCaT eliminates duplicate web pages retrieved multiple times from the same URL after different searches, but it does not detect web pages downloaded from different URLs which contain the same text. Finally, the data automatically downloaded by BootCaT (especially from the Reuters and UPI websites) occasionally included verbatim transcripts of speeches and interviews as well as texts explicitly advertising a variety of products – all this material was manually removed because the study was intended to focus exclusively on news reports.
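The further semi-automatic procedures used to remove repeated articles are not specified in detail. One possible approach, sketched below under that caveat, is to compare a hash of the normalised text of each page, so that identical reports downloaded from different URLs are detected. The function names and the normalisation (lowercasing and collapsing whitespace) are assumptions made for illustration.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

def drop_repeated_articles(pages):
    """pages: iterable of (url, text) pairs.

    Keep the first page for each distinct text, whatever its URL.
    """
    seen = set()
    kept = []
    for url, text in pages:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            kept.append((url, text))
    return kept
```

A sketch of this kind only catches texts that are identical after normalisation; reports that were lightly re-edited between two URLs would still need manual checking.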

4.4 Corpus size and components

The four corpora compiled using this procedure were subsequently examined with AntConc, a corpus processing tool that supports the analysis of LBs (Anthony 2006). [6] Table 1 describes the overall corpus, providing details such as the size and structural features of the four components.

                               Mediated corpora                Native/original corpora
                               (Italian sources in English)    (UK, US)
                               ANSA       ADN                  REUTERS    UPI
Tokens                         357,047    522,295              356,830    305,141
Types                          21,295     23,990               26,187     21,488
T/T ratio                      5.96       4.59                 7.33       7.04
No. of texts                   643        1,247                807        1,006
Average no. of words per text  555        419                  442        303

Table 1. Corpus size and components

The ANSA and Reuters corpora have similar numbers of tokens, whereas the UPI corpus is the smallest in the set, and the Adnkronos one is by far the largest of the four. The Reuters and UPI corpora have a noticeably higher type/token ratio than the ANSA and Adnkronos corpora. The four components of the corpus also vary in just about every other respect, particularly with regard to the number of texts contained in each of them and the average length of each news story. This is a result of the consistent corpus-building procedure adopted with BootCaT (cf. Section 4.3), whereby the choice was made to normalise the counts for the analyses at a later stage, keeping exactly the same parameters for the semi-automatic corpus-building procedure for all the four news sources. This inevitably resulted in corpora of unequal size and with different internal properties for the four news agencies. However, we felt that there was no sound theoretical or practical reason for tweaking the BootCaT parameters in order to obtain four corpora of similar size and with matching structural features, which might have potentially biased the study.
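The derived rows of Table 1 can be recomputed from the token, type and text counts. The short Python sketch below assumes that the T/T ratio is expressed as types per 100 tokens, which is an inference from the published figures rather than a stated definition; the function names are illustrative.

```python
# (tokens, types, texts) per corpus, copied from Table 1
TABLE_1 = {
    "ANSA":    (357_047, 21_295, 643),
    "ADN":     (522_295, 23_990, 1_247),
    "REUTERS": (356_830, 26_187, 807),
    "UPI":     (305_141, 21_488, 1_006),
}

def type_token_ratio(types, tokens):
    """Types per 100 tokens (assumed definition of the T/T ratio)."""
    return 100 * types / tokens

def mean_words_per_text(tokens, texts):
    """Average number of tokens per news report."""
    return tokens / texts

for name, (tokens, types, texts) in TABLE_1.items():
    ttr = type_token_ratio(types, tokens)
    avg = mean_words_per_text(tokens, texts)
    print(f"{name}: T/T ratio {ttr:.2f}, words per text {avg:.0f}")
```

Under this assumption the recomputed ratios agree with the published ones to within 0.01, and the averages match after rounding to whole words.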

5. Comparative analysis of LBs

5.1 Operationalisation of LBs

Before carrying out the analysis of the LBs found in the four corpora it was necessary to establish precisely the nature of the phraseological units to be investigated. A typical requirement is that LBs should be self-contained within a clause (without any punctuation marks within them), even though no structural completeness is required. As far as their length is concerned, Biber et al. (1999: 990) define LBs in terms of the minimum number of three constituents, saying that they should consist of “sequence[s] of three or more words”. Even though the length of LBs thus identified could be potentially unlimited, it is in fact quite common to restrict analyses to 4-member units; for example, Hyland (2008a: 8) argues that 4-word LBs “are far more common than 5-word strings and offer a clearer range of structures and functions than 3-word bundles”. Apart from their length, the other crucial aspect for the identification of LBs with a purely frequency-driven approach concerns their frequency in a given corpus, and Biber (2006: 134) recognises that “[t]he actual cut-off used to identify lexical bundles is somewhat arbitrary”, a point which is also reinforced by Hyland (2008a: 8).

To undertake this study it was necessary to decide how to operationalise the concept of LB, and a brief survey of the literature on phraseological studies showed a variety of approaches combining different criteria in various manners, with, however, some consensus on standard practice. Table 2 shows a selection of the ways in which LBs have been operationalised in the literature in terms of length (i.e. number of constituents) and minimum frequency of occurrence.

Study                        Bundle length   Occurrences pMw
Biber et al. (1999)          4-word          10
                             5/6-word        5
Biber & Conrad (1999)        4-word          20
Cortes (2002, 2004, 2008)    4-word          20
Biber (2006)                 4-word          40
Goźdź-Roszkowski (2006)      4-word          50
Hyland (2008a, 2008b)        4-word          20
Juknevičienė (2009)          4-word          40
Bernardini et al. (2010)     4-word          40

Table 2. Criteria to operationalise LBs in phraseological studies

The survey shows that it is standard practice to focus on 4-word units (Biber et al. 1999 also consider 5- and 6-word sequences). In addition, the minimum threshold of frequency for 4-word LBs to be taken into account varies between 10 and 50 occurrences per million words (pMw). Given that the news texts on which our study is based cover a potentially very wide range of topics and domains, we decided on a cut-off point of 40 occurrences pMw for a 4-word sequence to qualify as an actual LB and therefore to be included in our analysis. In this respect we followed Biber (2006: 134), who claims to have taken “a conservative approach, setting a relatively high frequency cut-off of 40 times per million words”. This threshold is also often used in the literature, as shown in Table 2, with only one study, namely Goźdź-Roszkowski (2006), setting a higher minimum limit at 50 occurrences pMw.

Some other requirements were applied in our study to consistently analyse LBs across the four corpora. Contracted forms (e.g. “I don’t think”) were discarded, as they raise thorny issues regarding how to count the components of the LBs, whether taking orthographic words as single units or not. Similarly, potential LBs containing apostrophes, possessives/genitives and abbreviations were filtered out. Another requirement was that LBs should not contain any punctuation, and the only exception was made for the sequence “President George W. Bush”, since it concerned an unambiguous abbreviated proper name. Lexical sequences containing non-alphabetic characters and numbers were also omitted (e.g. “in the late 80s” and “in the 20th century”).

However, in examining the LBs and presenting their analysis in Section 5.2 a couple of obvious abbreviations were consistently expanded: “US” and “U.S.” are given as “United States”, and “UN” was converted into “United Nations”. We decided against harmonising spelling, keeping e.g. the alternations “color/colour” and “centre/center”. Similarly, we did not normalise capitalisation: as a result, “the United States Government” is identified by the cluster function of AntConc as a different LB from “the United States government”, because the searches were carried out activating the case-sensitive option. Sentence-initial LBs are therefore counted separately, i.e. “In an interview with” is treated differently from “in an interview with”.
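The selection criteria described in this section can be approximated in a few lines of Python. The sketch below is not the AntConc procedure used in the study: it is an illustrative reimplementation under stated assumptions, namely that hyphens separate words (as the forms “the centre left opposition” and “general Ban Ki moon” suggest), that any other punctuation mark ends a candidate sequence, and that sequences containing apostrophes, digits or other non-alphabetic characters are discarded. The manual exception made for “President George W. Bush” and the expansion of “US” and “UN” are not handled.

```python
import re
from collections import Counter

# Punctuation other than apostrophes and hyphens ends a candidate sequence.
SEGMENT_BREAK = re.compile(r"[^\w\s'’-]")

def four_word_sequences(text, n=4):
    """Count case-sensitive n-word sequences that contain no punctuation."""
    counts = Counter()
    for segment in SEGMENT_BREAK.split(text):
        # Assumption: a hyphen separates two orthographic words.
        tokens = segment.replace("-", " ").split()
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            # Drop sequences with contractions, possessives or digits.
            if all(tok.isalpha() for tok in window):
                counts[" ".join(window)] += 1
    return counts

sample = ("He said in a statement that, for the first time, the "
          "centre-left opposition didn't vote. In the late 80s it did.")
for sequence, freq in four_word_sequences(sample).items():
    print(freq, sequence)
```

Because the comparison is case-sensitive, “In an interview with” and “in an interview with” are counted as two different sequences, as in the study.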

For ease of presentation, the figures and results in the data analysis are normalised to pMw. Table 3 shows the normalised minimum frequency cut-off points used for the four corpora and the corresponding absolute cut-off points.

                         ANSA      ADN       REUTERS   UPI
Tokens                   357,047   522,295   356,830   305,141
Absolute cut-off         15        21        15        13
Normalised cut-off pMw   42.01     40.20     42.03     42.60

Table 3. Minimum frequency cut-off points used to identify the LBs in the four corpora
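The relationship between the two kinds of cut-off in Table 3 can be checked with a short worked example in Python. The sketch assumes that the absolute cut-off is the smallest whole number of occurrences corresponding to at least 40 pMw in a corpus of the given size; this reading is consistent with the figures in the table, but the function names are illustrative and the rounding rule is an inference.

```python
import math

TOKENS = {"ANSA": 357_047, "ADN": 522_295, "REUTERS": 356_830, "UPI": 305_141}

def absolute_cutoff(tokens, per_million=40):
    """Smallest raw frequency that reaches `per_million` occurrences pMw."""
    return math.ceil(per_million * tokens / 1_000_000)

def per_million_words(count, tokens):
    """Normalise a raw frequency to occurrences per million words."""
    return count * 1_000_000 / tokens

for name, tokens in TOKENS.items():
    cutoff = absolute_cutoff(tokens)
    print(f"{name}: absolute {cutoff}, normalised {per_million_words(cutoff, tokens):.2f} pMw")
```

For ANSA, for instance, 40 pMw corresponds to 14.28 occurrences in 357,047 tokens, so the first whole number that meets the threshold is 15, i.e. about 42.01 pMw.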

5.2 Results and discussion

Table 4 shows all the LBs in the English-language output of the two Italian news agencies that meet the criteria set out in Section 5.1, listed in descending order of frequency: ANSA has 53 of them in total, whereas Adnkronos has 57. [7]

ANSA (53 LBs)

pMw   LB
168   Foreign Minister Franco Frattini
165   Italian Premier Silvio Berlusconi
162   at the end of
159   People of Freedom PdL
145   of Freedom PdL party
131   the end of the
114   Interior Minister Roberto Maroni
112   for the first time
112   Italian Foreign Minister Franco
98    the centre left opposition
81    the head of the
81    the statute of limitations
75    one of the most
72    Justice Minister Angelino Alfano
70    as well as the
70    House Speaker Gianfranco Fini
70    Minister Franco Frattini said
67    at the age of
64    in the wake of
64    League leader Umberto Bossi
61    in the sale of
58    sale of film rights
58    the sale of film
56    Northern League leader Umberto
56    People of Freedom party
53    in the United States
53    Italy of Values IdV
53    The head of the
50    as soon as possible
50    four and a half
50    his People of Freedom
47    a number of issues
47    group the Democratic Party
47    is one of the
47    Italian President Giorgio Napolitano
47    of Premier Silvio Berlusconi
47    opposition group the Democratic
47    tax fraud in the
47    the victim of a
44    European Court of Human
44    fraud in the sale
44    in the Rome province
44    Minister Ignazio La Russa
44    was one of the
42    a member of the
42    at the centre of
42    Court of Human Rights
42    lawyer David Mills to
42    on a number of
42    on the night of
42    opposition Italy of Values
42    Premier Silvio Berlusconi and
42    will be able to

ADN (57 LBs)

pMw   LB
193   prime minister Silvio Berlusconi
139   said in a statement
112   in an interview with
99    Italian prime minister Silvio
97    an interview with Adnkronos
95    interview with Adnkronos International
95    the United States and
93    with Adnkronos International (AKI)
84    the end of the
82    North West Frontier Province
80    for the first time
78    in a bid to
78    in the Gaza Strip
76    in the Middle East
70    a member of the
70    in the United States
68    prime minister Benazir Bhutto
65    the United Nations Security Council
61    is one of the
61    the head of the
59    prime minister Vojislav Kostunica
59    the leader of the
59    the northern city of
57    by the end of
57    have been killed in
57    In an interview with
57    in the Islamic Maghreb
57    People of Freedom party
55    and the United States
55    as well as the
55    secretary general Ban Ki
55    with the support of
53    Adnkronos International AKI that
53    general Ban Ki moon
51    former prime minister Benazir
51    in the West Bank
51    one of the most
49    between the two countries
49    in the city of
49    were killed in the
47    lower house of parliament
44    in the town of
44    people were killed and
44    the lower house of
44    the Pakistani daily Dawn
42    at the end of
42    in the northern Italian
42    the North West Frontier
42    the president of the
40    as a result of
40    is believed to have
40    northern Italian city of
40    on the basis of
40    people were killed in
40    the northern Italian city
40    the support of the

Table 4. LBs in ANSA and Adnkronos

The LBs found in the ANSA corpus show a strong presence of proper names of senior Italian politicians, often along with their role, e.g. “Foreign Minister Franco Frattini”, “Italian Premier Silvio Berlusconi” and “Interior Minister Roberto Maroni” – the only case of an LB involving the name of a person who is not Italian being “lawyer David Mills to” (42 occurrences), which refers to a British corporate lawyer who was accused of money-laundering and alleged tax fraud in connection with Mr Berlusconi. In addition, the names of political parties with the corresponding acronyms are often found in the ANSA LBs (e.g. “People of Freedom PdL” and “Italy of Values IdV”). There are also LBs that belong to the general language, such as “at the end of”, “the end of the”, “for the first time”, “the head of the” and its variant “The head of the”, which will be discussed in more detail in Section 5.3.

The Adnkronos corpus also features LBs including names of high-profile political figures, though not primarily from Italy (in fact, the LB “prime minister Silvio Berlusconi” contains the only full Italian name in the list, along with others such as “prime minister Benazir Bhutto” and “prime minister Vojislav Kostunica”), as well as others indicating international geographical locations (“North West Frontier Province”, “in the Gaza Strip”, “in the Middle East” and “in the United States”). In the Adnkronos data there are also LBs indicating international organisations and their top political representatives (“the United Nations Security Council”, “secretary general Ban Ki” and “general Ban Ki moon”) that are absent from the ANSA corpus. Unlike ANSA, the Adnkronos corpus frequently has LBs mentioning itself as the source of the news report (“an interview with Adnkronos”, “interview with Adnkronos International” and “with Adnkronos International (AKI)”, all with more than 90 occurrences pMw). Finally, other general LBs found in the Adnkronos corpus include “said in a statement”, “in an interview with”, “the end of the”, “for the first time” and “in a bid to”, to which we will return in Section 5.3.

This cursory analysis of LBs including proper names of people, political parties, geographical areas, institutions and organisations shows that the ANSA news stories have a much narrower focus on Italy compared to the reports published online by Adnkronos, which cover broader geopolitical areas. Moving on to the native/original news sources, Table 5 lists the LBs identified according to the same criteria in the reports of Reuters and UPI; interestingly, 20 are found in the Reuters texts, and 38 in the UPI data.

REUTERS (20 LBs)

pMw   LB
263   in the United States
238   said in a statement
106   of the United States
92    the end of the
92    told a news conference
89    at the end of
75    said in an interview
61    said in a telephone
61    the United States and
58    in a telephone interview
53    and the United States
50    as part of a
50    one of the most
47    the United States Government
47    to the United States
44    by the end of
44    for the first time
42    a member of the
42    at a news conference
42    to be able to

UPI (38 LBs)

pMw   LB
275   of the United States
167   The New York Times
147   in the United States
134   President George W. Bush
127   said in a statement
127   to the United States
117   United States President George
108   United States President Barack
104   New York Times reported
101   the United States and
85    Los Angeles Times reported
85    the Los Angeles Times
78    and the United States
78    in the United States
68    in the wake of
68    the United States military
68    United States Supreme Court
65    as a result of
65    the United States government
58    that the United States
58    the United States is
55    in the Middle East
55    the rest of the
55    the United States economy
55    the war on terrorism
52    by the end of
52    for the United States
52    said the United States
49    United Nations Security Council
45    in and out of
45    President elect Barack Obama
45    the end of the
45    the United States Senate
45    The Washington Post reported
42    The Miami Herald reported
42    the United States to
42    the United States will
42    Vice President Dick Cheney

Table 5. LBs in Reuters and UPI

The two lists of LBs in Table 5 show that both Reuters and UPI have a strong US focus, as testified by the LBs containing geographical names (in particular “in the United States” and “of the United States” are in the top three positions in both corpora, but interestingly in inverted order). The Reuters data does not present any LBs including people’s names, while the UPI corpus contains some, partially combined with their political role (“President George W. Bush”, “United States President George”, “United States President Barack” and “Vice President Dick Cheney”).

Another difference between the two native/original English-language news sources is that UPI frequently uses LBs including the names of other reputable news providers (e.g. “New York Times reported”, “Los Angeles Times reported” and “The Washington Post reported” – note that the names of these newspapers are always followed by the verb “reported”), whereas the Reuters LBs do not directly mention any such news sources. On the other hand, the Reuters data often include LBs explicitly describing the circumstances in which the facts included in their stories were obtained: “told a news conference”, “said in an interview”, “said in a telephone” and “in a telephone interview”, which are not found in the UPI corpus. The only LB of this kind attested in the UPI data (in the 5th position in the frequency ranking), which is also found in the Reuters corpus, is “said in a statement” (this LB warrants further investigation and will be discussed in more detail in Section 5.3).

It is also interesting to compare the LBs identified across the four corpora, even if at a fairly general level (a more detailed analysis of a subset of LBs is presented in Section 5.3). Figure 1 shows how many different LBs (or LB types) are used by ANSA, Adnkronos, Reuters and UPI, [8] while Figure 2 indicates the total number of LB tokens (given by adding up how many times each LB is used in the corpus, normalised pMw). [9]

Figure 1. Number of LB types used by the four international news agencies

Figure 2. LB tokens used by the four international news agencies (pMw)

Both ANSA and Adnkronos use a larger stock of LBs than Reuters and UPI (53 and 57 vs. 20 and 38, respectively), which provides a striking comparison especially because the ANSA corpus happens to be of virtually the same size as the Reuters one, whereas other direct pair-wise comparisons would be skewed by the different corpus sizes (the Adnkronos corpus is much bigger, while the UPI one considerably smaller). ANSA and Adnkronos use a similar quantity of LBs (53 and 57), in spite of the different corpus sizes, but interestingly they also present a very similar overall number of LBs in normalised terms (3,719 and 3,622 pMw), which is higher than UPI (3,024 pMw) and much higher than Reuters (1,605).

This suggests that the language used by the two mediated news sources is more formulaic, in that there is a larger repertoire of LBs which tend to be used more frequently, as is clearly visible comparing the (normalised) number of LBs used by ANSA and Adnkronos with those employed by Reuters, while UPI is closer to the Italian news agencies; in other words, the phraseology of both ANSA and Adnkronos is more dense with LBs than is the case for both Reuters and UPI. Finally, it should also be noted that as far as the native/original news sources are concerned, the UPI data include nearly twice as many LBs as Reuters (which is remarkable, since they were extracted from corpora with 305,141 and 356,830 tokens, respectively). The rest of the paper presents a more fine-grained investigation of the behaviour of subsets of LBs selected according to different criteria, comparing their frequency across the four corpora, and combining a quantitative and a qualitative analysis.

5.3 LBs without proper names

As discussed in Section 5.2, 4-word LBs containing proper names of people, countries, geographical locations and organisations (e.g. political parties and institutions) feature quite frequently in all four news corpora. This makes it difficult to analyse the whole data set and to interpret the results, because clearly LBs including proper names have a special status, and they simply reflect the geographical and geopolitical scope of the news texts included in the corpora; they cannot be considered indicators of the phraseological similarities and differences between the four corpora under investigation, but only signals of the main topical foci of the news stories. Hence, in the rest of the discussion we restrict the analysis to LBs without proper names in the four corpora, which are listed in Table 6, consisting of a subset of Tables 4 and 5 combined. [10]

ANSA (27 LBs)

at the end of | 162
the end of the | 131
for the first time | 112
the centre left opposition | 98
the head of the * | 81
the statute of limitations | 81
one of the most | 75
as well as the | 70
at the age of | 67
in the wake of | 64
in the sale of | 61
sale of film rights | 58
the sale of film | 58
The head of the * | 53
as soon as possible | 50
four and a half | 50
a number of issues | 47
is one of the | 47
tax fraud in the | 47
the victim of a | 47
fraud in the sale | 47
was one of the | 44
a member of the | 44
at the centre of | 42
on a number of | 42
on the night of | 42
will be able to | 42

ADN (31 LBs)

said in a statement | 139
in an interview with ° | 112
the end of the | 84
for the first time | 80
in a bid to | 78
a member of the | 70
is one of the | 61
the head of the | 61
the leader of the | 59
the northern city of | 59
by the end of | 57
have been killed in | 57
In an interview with ° | 57
as well as the | 55
with the support of | 55
one of the most | 51
between the two countries | 49
in the city of | 49
were killed in the | 49
in the town of | 44
people were killed and | 44
at the end of | 42
in the northern Italian | 42
the president of the | 42
as a result of | 40
is believed to have | 40
northern Italian city of | 40
on the basis of | 40
people were killed in | 40
the northern Italian city | 40
the support of the | 40

REUTERS (14 LBs)

said in a statement | 238
the end of the | 92
told a news conference | 92
at the end of | 89
said in an interview | 75
said in a telephone | 61
in a telephone interview | 58
as part of a | 50
one of the most | 50
by the end of | 44
for the first time | 44
a member of the | 42
at a news conference | 42
to be able to | 42

UPI (8 LBs)

said in a statement | 127
in the wake of | 68
as a result of | 65
the rest of the | 55
the war on terrorism | 55
by the end of | 52
in and out of | 45
the end of the | 45

Table 6. LBs without proper names

Figure 3 shows how many LBs without proper names are used in total in the four corpora, with the numbers normalised pMw.

Figure 3. LB tokens without proper names used by the four international news agencies (pMw)

Comparing the data in Figures 2 and 3 one can see that the LBs found in the ANSA and Adnkronos corpora decrease by about half once those containing proper names are excluded. With regard to the native/original international news agencies, the reduction is less substantial for Reuters (from 1,605 to 1,019 pMw), while the UPI data show a sharp fall (from 3,024 to just 512 pMw, an 83% drop). LBs without proper names are used twice as frequently by Reuters as by UPI in normalised terms, which contrasts sharply with the overall use of LBs (including those with proper names), where the opposite pattern emerges with 3,024 LBs in UPI vs. 1,605 in Reuters.

On the other hand, the overall normalised frequency for both sets of LBs remains consistent across the two mediated corpora: 1,804 LBs without proper names for ANSA and 1,776 for Adnkronos, while the total amount including all LBs is 3,719 vs. 3,622, respectively. As a result, filtering out from the analysis the LBs with proper names confirms even more convincingly the finding that the language used by the mediated news agencies is much more formulaic than is the case for the native/original news providers: on average there are 1,790 LBs without proper names pMw in the ANSA and Adnkronos corpora combined, while the corresponding average figure for Reuters and UPI taken together is 765.5 pMw.
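As a rough check on the arithmetic above, the proportional reductions and the two averages can be recomputed with a minimal Python sketch. The figures are the normalised totals quoted in the text; the variable and function names are illustrative assumptions and are not part of the original analysis.

```python
# Normalised LB totals (pMw) quoted in the text:
# (all LBs, LBs without proper names).
totals = {
    "ANSA": (3719, 1804),
    "Adnkronos": (3622, 1776),
    "Reuters": (1605, 1019),
    "UPI": (3024, 512),
}


def percent_drop(before, after):
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Reduction observed once LBs with proper names are filtered out.
drops = {agency: percent_drop(*pair) for agency, pair in totals.items()}

# Average normalised frequency of LBs without proper names per group.
mediated_mean = mean(totals[a][1] for a in ("ANSA", "Adnkronos"))
native_mean = mean(totals[a][1] for a in ("Reuters", "UPI"))

for agency, drop in drops.items():
    print(f"{agency}: {drop:.1f}% reduction")
print(f"mediated mean: {mediated_mean} pMw, native/original mean: {native_mean} pMw")
```

Run on these figures, the sketch should give reductions of roughly one half for the two mediated corpora, about one third for Reuters and over four fifths for UPI.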

5.4 Common LBs across the four corpora

We now turn our attention to the commonalities and differences in the use of specific LBs across the four corpora, continuing to consider only those that do not contain (parts of) proper names, as was the case for the results presented in Section 5.3. There are 10 LBs that are found in at least two of the four corpora, and they are listed in Table 7, which shows a comparison of the number of occurrences for each news agency normalised pMw associated with the respective frequency ranking within each corpus. [11]

LB | ANSA | ADN | REUTERS | UPI
at the end of | 162 – 1st | 42 – 22nd | 89 – 4th |
the end of the | 131 – 2nd | 84 – 3rd | 92 – 2nd | 45 – 8th
for the first time | 112 – 3rd | 80 – 4th | 44 – 11th |
said in a statement | | 139 – 1st | 238 – 1st | 127 – 1st
the head of the [12] | 81 – 5th | 61 – 8th | |
one of the most | 75 – 7th | 51 – 16th | 50 – 9th |
as well as the | 70 – 8th | 55 – 14th | |
is one of the | 47 – 18th | 61 – 7th | |
a member of the | 44 – 23rd | 70 – 6th | 42 – 12th |
by the end of | | 57 – 11th | 44 – 10th | 52 – 6th

Table 7. LBs without proper names found in at least two corpora

Table 7 shows that out of the 10 LBs that are attested in two or more corpora, the mediated news sources share 8 of them: “said in a statement” is absent from the ANSA data, but it is by far the most frequent LB in all the other three corpora. In addition, “by the end of” is also not included in the LBs used by ANSA, but it features in the lists for Adnkronos (57 times pMw, 11th position), Reuters (44 occurrences pMw, 10th according to frequency) and UPI (52 cases pMw, 6th most frequent). It is striking that 3 LBs are present both in the ANSA and Adnkronos data, but they are absent from both native/original corpora (“the head of the”, “as well as the” and “is one of the”). Conversely, no LBs are exclusively present in both native/original corpora while being absent from both mediated ones: “the end of the” is found across all corpora, “said in a statement” and “by the end of” are shared by Reuters and UPI, but they are also present in the Adnkronos data. As a result, while there are 3 LBs that are distinctive of the mediated news data, no 4-word phraseological sequence is exclusive to the native/original corpora, which is an interesting finding suggesting that the mediated news texts do display peculiar phraseological patterns.
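The comparisons drawn from Table 7 amount to simple set operations, which the following minimal Python sketch makes explicit. The frequencies are those given in Table 7; the dictionary layout and the helper `bundles_where` are our own illustrative choices.

```python
# Normalised frequencies (pMw) from Table 7. An agency appears under an LB
# only if the sequence qualifies as an LB in that agency's corpus.
table7 = {
    "at the end of": {"ANSA": 162, "ADN": 42, "REUTERS": 89},
    "the end of the": {"ANSA": 131, "ADN": 84, "REUTERS": 92, "UPI": 45},
    "for the first time": {"ANSA": 112, "ADN": 80, "REUTERS": 44},
    "said in a statement": {"ADN": 139, "REUTERS": 238, "UPI": 127},
    "the head of the": {"ANSA": 81, "ADN": 61},
    "one of the most": {"ANSA": 75, "ADN": 51, "REUTERS": 50},
    "as well as the": {"ANSA": 70, "ADN": 55},
    "is one of the": {"ANSA": 47, "ADN": 61},
    "a member of the": {"ANSA": 44, "ADN": 70, "REUTERS": 42},
    "by the end of": {"ADN": 57, "REUTERS": 44, "UPI": 52},
}

MEDIATED = {"ANSA", "ADN"}
NATIVE = {"REUTERS", "UPI"}


def bundles_where(predicate):
    """LBs whose set of attesting agencies satisfies `predicate`."""
    return sorted(lb for lb, freqs in table7.items() if predicate(set(freqs)))


shared_by_mediated = bundles_where(lambda agencies: MEDIATED <= agencies)
mediated_only = bundles_where(lambda agencies: agencies == MEDIATED)
native_only = bundles_where(lambda agencies: agencies == NATIVE)
in_all_four = bundles_where(lambda agencies: agencies == MEDIATED | NATIVE)

print(len(shared_by_mediated), mediated_only, native_only, in_all_four)
```

The sketch only restates the contents of Table 7: 8 LBs are shared by the two mediated sources, 3 of them are exclusive to those sources, none is exclusive to the native/original pair, and a single LB is attested in all four corpora.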

5.5 LBs including “end” and borderline cases

The focus now shifts to the LBs which include the lexical item “end” as one of their constituents, because they display an interesting behaviour (cf. Starcke 2008: 209ff): “the end of the”, the second most frequent 4-word LB in the BNC after the typically spoken “I don’t know”, is the only common LB across all four news corpora (2nd in the ranking for ANSA and Reuters, 3rd for Adnkronos and 8th for UPI); “at the end of”, the third most frequent 4-word LB in the BNC, is shared by all corpora except UPI, being the most frequent one in ANSA, the 4th most frequent in Reuters and 22nd in Adnkronos (cf. Forchini & Murphy 2008); [13] finally, “by the end of”, ranked 13th in frequency among the 4-word LBs in the BNC, is absent from the ANSA data, but is represented in the other three corpora (11th in Adnkronos, 10th in Reuters and 6th in UPI). [14]

These 4-word LBs have the sub-string “the end of” as a common sequence, which is the most frequent 3-word LB across all four corpora (incidentally, this is the third most common 3-word LB in the BNC, after “I don’t” and “one of the”). Table 8 shows the normalised number of occurrences pMw across all four news corpora for the three 4-word LBs that contain the string “the end of” (highlighted in bold), alongside their overall frequency in the whole of the BNC for comparison. [15]

LB | ANSA | ADN | REUTERS | UPI | BNC
the end of the | 131 | 84 | 92 | 45 | 105
by the end of | 22 | 57 | 44 | 52 | 34
at the end of | 162(1) | 42 | 89 | 36 | 93

Table 8. 4-word LBs containing the sub-string “the end of”

Table 9 presents a similar comparison for four other 4-word LBs which are properly attested in three of the four news corpora considered in this study, but which just failed to qualify as LBs in the remaining corpus, where they occurred 39 times pMw, thus being borderline cases (indicated by the superscript “bl”) – it should be noted that in all these cases but one the borderline count is for the UPI corpus, the smallest of the four, so this situation might have been brought about by the different sizes of the corpora. The overall frequency pMw of these LBs in the BNC is also given for comparison purposes. Interestingly, “said in a statement” is relatively infrequent in the whole of the BNC (0.55 times pMw), so this LB is typical of the text type of news agency reports, although ANSA does not tend to use it very often, in sharp contrast to the other international news agencies. Conversely, the other three LBs included in Table 9 are well attested in the BNC: “for the first time” is the 6th most frequent 4-word LB, “one of the most” is in 12th position and “a member of the” in 26th place.

LB | ANSA | ADN | REUTERS | UPI | BNC
said in a statement | 39(bl) | 139(1) | 238(1) | 127(1) | 0.55
for the first time | 112 | 80 | 44 | 39(bl) | 54
one of the most | 75 | 51 | 50 | 39(bl) | 41
a member of the | 42 | 70 | 42 | 39(bl) | 26

Table 9. 4-word LBs with borderline cases

The results presented in Tables 8 and 9 suggest that adopting a more relaxed requirement in terms of minimum frequency for phraseological sequences to qualify as LBs (e.g. lowering the threshold to at least 30 occurrences pMw) would have provided many more phraseological units shared across all four corpora for the analysis. This insight calls for further work in this area, widening the scope of the LBs considered in the investigation by lowering the cut-off point. In addition, our observations also show that it would be interesting to consider LBs of different lengths, apart from those with four constituents (cf. the discussion of “the end of”), to investigate the texture of phraseology at various levels of granularity, thus gaining a more comprehensive picture of its features.
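The effect of the cut-off point can be illustrated with a minimal Python sketch restricted to the seven sequences reported in Tables 8 and 9. It says nothing about other sequences in the corpora that might also qualify at a lower threshold, and the function name `shared_by_all` is an illustrative assumption.

```python
# Normalised frequencies (pMw) from Tables 8 and 9, in the order
# ANSA, ADN, REUTERS, UPI. Borderline values are entered as plain numbers.
freqs = {
    "the end of the": (131, 84, 92, 45),
    "by the end of": (22, 57, 44, 52),
    "at the end of": (162, 42, 89, 36),
    "said in a statement": (39, 139, 238, 127),
    "for the first time": (112, 80, 44, 39),
    "one of the most": (75, 51, 50, 39),
    "a member of the": (42, 70, 42, 39),
}


def shared_by_all(threshold):
    """Sequences reaching `threshold` occurrences pMw in all four corpora."""
    return [lb for lb, values in freqs.items()
            if all(value >= threshold for value in values)]


print(shared_by_all(40))  # cut-off adopted in this study
print(shared_by_all(30))  # hypothetical, more relaxed cut-off
```

With the cut-off at 40 occurrences pMw only “the end of the” is shared by all four corpora, whereas at 30 occurrences pMw six of these seven sequences would be; “by the end of” would still be excluded because of its low frequency in the ANSA data.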

5.6 LBs found only in one of the four corpora

The remaining part of the analysis concentrates on the 4-word LBs that are found only in one of the four corpora, but are not attested in the others, because they remain below the cut-off point of 40 occurrences pMw. This final section of the investigation supplements the previous analysis, in that it focuses on the peculiar use of LBs displayed by individual news agencies, rather than on the search for (pair-wise) commonalities and differences that set apart the mediated news texts from the native/original data. As a result, this section intends to uncover specific uses of LBs that might reflect the individual house style of the news agencies under consideration. Table 10 shows the LBs without proper names that are attested only in one of the four corpora, but not in the other three (hence, the LBs listed in Table 10 are a sub-set of those included in Table 6).

ANSA (17 LBs)

the centre left opposition | 98
the statute of limitations | 81
at the age of | 67
in the sale of | 61
sale of film rights | 58
the sale of film | 58
as soon as possible | 50
four and a half | 50
a number of issues | 47
tax fraud in the | 47
the victim of a | 47
fraud in the sale | 47
was one of the | 44
at the centre of | 42
on a number of | 42
on the night of | 42
will be able to | 42

ADN (20 LBs)

in an interview with ° | 112
in a bid to | 78
the leader of the | 59
the northern city of | 59
have been killed in | 57
In an interview with ° | 57
with the support of | 55
between the two countries | 49
in the city of | 49
were killed in the | 49
in the town of | 44
people were killed and | 44
in the northern Italian | 42
the president of the | 42
is believed to have | 40
northern Italian city of | 40
on the basis of | 40
people were killed in | 40
the northern Italian city | 40
the support of the | 40

REUTERS (7 LBs)

told a news conference | 92
said in an interview | 75
said in a telephone | 61
in a telephone interview | 58
as part of a | 50
at a news conference | 42
to be able to | 42

UPI (3 LBs)

the rest of the | 55
the war on terrorism | 55
in and out of | 45

Table 10. LBs without proper names found only in one of the four corpora

It is clear that the two mediated corpora present a much larger inventory of idiosyncratic LBs than the two native/original data sets. It should be borne in mind that the four corpora have different sizes, and therefore no firm generalisations can be drawn from this, but the higher number of distinctive LBs found only in the ANSA and Adnkronos texts (17 and 20, respectively) contrasts quite strikingly with the more modest figures for Reuters and UPI (i.e. 7 and 3). In particular, since the ANSA and Reuters corpora have a very similar size, a direct comparison between these two corpora is possible. Looking at the LBs specific to the ANSA corpus, the two most frequent items deserve some comment: first of all, “the centre left opposition” is a clear reference to the political situation in Italy, with the government consisting of a conservative (right-wing) coalition; secondly, “the statute of limitations” (whose Italian equivalent is “termini di prescrizione”) refers to the period of time after which a crime can no longer be prosecuted, which was a crucial issue in the trial against Mr Mills, who was initially sentenced to serve “four and a half” (another peculiar LB in the ANSA corpus) years in prison.

The frequency of these corpus-specific LBs can therefore be accounted for on the basis of their relevance to the political or judicial circumstances of topical interest in Italy, and it was already pointed out in Section 5.2 that the ANSA news stories seem to be more sharply focused on Italy compared to the Adnkronos reports, which tend to have a broader, more international, scope. However, the remaining LBs which are specific only to ANSA are less clearly tied to specific issues, but recurring themes include sales and financial transactions as well as tax fraud (e.g. “in the sale of”, “sale of film rights”, “the sale of film”, “tax fraud in the”, “fraud in the sale”) – it should also be noted that some of these LBs appear to be concatenated, i.e. they share strings of constituent lexical items.

Concerning the specific LBs found exclusively in the Adnkronos corpus, in three of them the adjective “Italian” is included, which might explain why they are particularly frequent in the reports by this news agency based in Italy, and a clear presence of concatenation is also noticeable: “northern Italian city of”, “in the northern Italian” and “the northern Italian city”. By contrast, none of the 17 LBs found only in the ANSA corpus includes this adjective of nationality. Other LBs occurring only in the Adnkronos data involve more generic spatial references: “the northern city of”, “between the two countries”, “in the city of” and “in the town of”. In addition, 4 of the 20 LBs in this set contain the lemma kill, which never occurs in the LBs used only by any of the other three news sources: “have been killed in”, “were killed in the”, “people were killed and” and “people were killed in”; this phraseological pattern suggests that the Adnkronos news reports often deal with crimes involving violent deaths – this is also the case for the other three news agencies, but they must use less formulaic and less idiosyncratic phraseology in this regard. Finally, by far the most frequent LB in the Adnkronos corpus that is not found elsewhere provides details regarding the circumstances in which the news was obtained: “in an interview with” and its sentence-initial typographical variant “In an interview with”.

Somewhat similarly, 5 out of the 7 LBs attested only in the Reuters corpus inform readers of how and where the news being reported was obtained: “told a news conference”, “said in an interview”, “said in a telephone”, “in a telephone interview” and “at a news conference” – the other two, i.e. “as part of a” and “to be able to”, are much more general. Finally, the 3 LBs occurring only in the UPI corpus (which is the smallest of the four data collections) are “the rest of the”, “the war on terrorism” and “in and out of”. The small number of occurrences of these LBs allows us to venture into a more fine-grained analysis of their collocational properties. The flip side of this, however, is that this more qualitative analysis based on very small numbers cannot lead to clear generalisations, but it is discussed nonetheless in Section 5.7 for its methodological interest, to illustrate how the overall phraseological analysis focusing on LBs could be enriched with collocational insights.

5.7 A methodological aside: collocational properties of the 3 LBs exclusive to the UPI corpus

This section explores the collocational properties of the LBs identified exclusively in the UPI corpus. First of all, “the rest of the” occurs 17 times in absolute (i.e. not normalised) terms in this 305,141-word corpus: 5 times the right-hand collocate of this LB is “country”, 4 times we find “world”, and twice “Middle East” (the remaining one-off collocates in the UPI data are “communist”, “Earth”, “former”, “happy”, “pro-Syrian” and “Republicans”). Given this range of collocates, the LB “the rest of the” appears to have a strong semantic preference for vocabulary indicating areas or geographical designations. On the other hand, the most common left-hand collocates of this LB are function words: “in” (3 cases), “and”, “for” and “from” (all with 2 examples each in the corpus).
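Collocate counts of this kind can be obtained with a few lines of code. The following is a minimal Python sketch, assuming a whitespace-tokenised text and single-word collocates only (so a multi-word collocate such as “Middle East” would be counted under its first word). The function name `collocates` and the toy sentence are invented for illustration and are not taken from the UPI corpus.

```python
from collections import Counter


def collocates(tokens, bundle):
    """Count the words immediately left and right of each occurrence of `bundle`.

    `tokens` and `bundle` are lists of word forms; matching is case-sensitive.
    """
    n = len(bundle)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == bundle:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + n < len(tokens):
                right[tokens[i + n]] += 1
    return left, right


# Invented toy text, used only to show the mechanics.
text = ("rain spread to the rest of the country while "
        "aid came from the rest of the world and "
        "prices rose in the rest of the country")

left, right = collocates(text.split(), "the rest of the".split())
print(right.most_common())
print(left.most_common())
```

In the toy text the right-hand collocates are “country” (twice) and “world” (once), while each of the three left-hand collocates is a preposition occurring once.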

Secondly, “the war on terrorism” has a very similar absolute frequency in the UPI corpus to “the rest of the”, with 16 occurrences. Interestingly, the alternative “the war on terror” is attested only 7 times in the same corpus (for other studies on the use of these two phrases and their implications see Jackson 2005: 7; Mason & Platt 2006; Barrett 2007; Bayley 2007; Fairclough 2007: 122ff; Milizia & Spinzi 2008; Qian 2010). This alternative phraseological pattern can also be investigated in the other corpora: the 356,830-word Reuters corpus has only 1 occurrence of “the war on terror”, and none of “the war on terrorism”; the 357,047-word ANSA corpus has no examples of either phrase, whereas the 522,295-word Adnkronos corpus (by far the biggest of the four) has 11 instances of “the war on terror” (including its typographical variant “the War on Terror”) and only 1 case of “the war on terrorism” – but this sequence is not included in the LBs for the Adnkronos corpus because the number of cases is well below the minimum frequency cut-off point of 40 occurrences normalised pMw.

Finally, “in and out of” is an interesting LB, which qualifies as such only in the UPI corpus. Considering all the 14 occurrences in absolute (i.e. not normalised) terms, there are 3 left-hand collocates with “flights” (in which cases “in and out of” is followed by the names of regions or international airports), and 2 each with “been” and “women”, plus 7 one-off miscellaneous other words. The right-hand collocates of this LB are even less revealing, in that the most frequent next word is the definite article “the” (3 times), with a mixed bag of one-off collocates in the other 11 cases. By way of comparison, “in and out of” is found only 3 times in the Adnkronos and Reuters corpora, and 4 times in the ANSA data, where however it is considerably below the minimum frequency cut-off point of 40 occurrences pMw needed to be considered a proper LB within the context of this study.
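The relationship between the absolute counts reported in this section and the normalised cut-off can be made explicit with a minimal Python sketch. The corpus sizes and raw counts are those given in the text; the function names are illustrative assumptions.

```python
import math

# Corpus sizes in tokens, as reported in the text.
CORPUS_SIZES = {
    "ANSA": 357_047,
    "Adnkronos": 522_295,
    "Reuters": 356_830,
    "UPI": 305_141,
}
CUTOFF_PMW = 40  # minimum frequency per million words for an LB


def per_million(raw_count, corpus_size):
    """Normalise a raw count to occurrences per million words (pMw)."""
    return raw_count / corpus_size * 1_000_000


def min_raw_count(corpus_size, cutoff=CUTOFF_PMW):
    """Smallest raw count whose normalised frequency reaches the cut-off."""
    return math.ceil(cutoff * corpus_size / 1_000_000)


# (sequence, corpus, raw count) triples taken from this section.
examples = [
    ("the rest of the", "UPI", 17),
    ("in and out of", "UPI", 14),
    ("in and out of", "ANSA", 4),
    ("the war on terror", "Adnkronos", 11),
]

for sequence, corpus, raw in examples:
    pmw = per_million(raw, CORPUS_SIZES[corpus])
    status = "LB" if pmw >= CUTOFF_PMW else "below cut-off"
    print(f"{sequence!r} in {corpus}: {raw} raw = {pmw:.1f} pMw ({status})")

for corpus, size in CORPUS_SIZES.items():
    print(f"{corpus}: at least {min_raw_count(size)} raw occurrences needed")
```

Because the corpora differ in size, the same cut-off of 40 occurrences pMw corresponds to different absolute counts: a sequence needs far fewer raw occurrences to qualify in the UPI corpus than in the Adnkronos one.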

6. Conclusions and future work

6.1 Summary of the main findings

Summarising the main findings of this study, we can say that the mediated news reports produced by the two Italy-based news agencies ANSA and Adnkronos are more formulaic than the texts published by Reuters and UPI, in that they present a larger stock of LBs, which are used more heavily, especially when considering 4-word LBs that do not include proper names. Only “the end of the” occurs in all four corpora, whereas the overlapping LBs “at the end of” and “by the end of” are found in three out of the four data sets; four other LBs occur in three out of four corpora (“for the first time”, “said in a statement”, “one of the most” and “a member of the”). In these cases, however, the occurrences are just below the cut-off point of 40 repetitions pMw in the remaining corpus.

Out of the 10 LBs found simultaneously in two or more of the four corpora, 8 are common to the mediated news sources, and 3 of these do not feature in either of the native/original corpora. By contrast, no LBs occur in both Reuters and UPI while being absent from both ANSA and Adnkronos. These findings provide evidence for the hypothesised phraseological distinctiveness of the mediated news agency reports. In addition, approximately half of the LBs used by ANSA and Adnkronos contain (parts of) proper names of people, places, organisations or bodies, mostly referring to politics. Finally, compared to Reuters and UPI, each of the two mediated corpora contains a higher number of idiosyncratic LBs, which are not used by the other three international news agencies. This finding further confirms our hypothesis that the LBs found in the mediated news stories are both quantitatively and qualitatively different from those used in the native/original news agency texts.

6.2 Limitations of the study and outstanding methodological issues

Following a well-established practice in phraseological studies focusing on LBs, we decided to consider 4-word sequences and to set the cut-off point at 40 occurrences pMw. The analysis has pointed out cases of partial overlap across LBs, with for example the sub-string “the end of” repeated in all four corpora as part of three high-frequency 4-word LBs. A similar issue has to do with concatenation, as for example the Adnkronos corpus presents the LBs “northern Italian city of”, “in the northern Italian” and “the northern Italian city”, which are clearly connected. Whilst to some extent these phenomena are inherent in phraseological investigations based on LBs, they do raise questions regarding the most appropriate size of LBs and which cut-off point should be adopted. For this study we followed widely used criteria that have often been reported in the literature, but they may not be ideal in all cases, depending in particular on the language varieties in question, the text type and genre of the data and the size of the corpora being compared.
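The extraction itself was carried out with AntConc; purely to illustrate how the two parameters discussed here (sequence length and cut-off point) interact, the following minimal Python sketch counts n-word sequences and keeps those at or above a normalised threshold. It assumes whitespace-tokenised, case-sensitive input and ignores text boundaries. The function name `lexical_bundles` and the toy sentence are invented, and the sketch is not a description of the tool's internals.

```python
from collections import Counter


def lexical_bundles(tokens, n=4, cutoff_pmw=40.0):
    """Return {n-word sequence: frequency pMw} for sequences at or above the cut-off.

    Matching is case-sensitive, so sentence-initial variants are counted
    separately from their lower-case counterparts.
    """
    if not tokens:
        return {}
    counts = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    scale = 1_000_000 / len(tokens)  # converts raw counts to pMw
    return {
        sequence: count * scale
        for sequence, count in counts.items()
        if count * scale >= cutoff_pmw
    }


# Invented toy text; in a text this short every sequence exceeds 40 pMw,
# so an artificially high cut-off is used to show the filtering step.
toy = "at the end of the day and at the end of the week".split()
bundles = lexical_bundles(toy, n=4, cutoff_pmw=100_000)
print(sorted(bundles))
```

On the toy text only the two overlapping sequences “at the end of” and “the end of the” survive the filter, which also illustrates on a small scale the partial overlap between LBs mentioned above.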

A related issue consists in the relatively small and variable size of the four corpora used in our study. Whilst two of them (ANSA and Reuters) are very similar, with approximately 357,000 words, the other two vary considerably, with UPI being the smallest with just over 305,000 tokens, and the Adnkronos one by far the largest with over 522,000 words. In this respect, Cortes (2008: 46) warns that it is “highly advisable to work with at least one million words to identify lexical bundles in a corpus and to draw reliable comparisons when using more than one corpus”. In addition, the corpora contain different quantities of texts, ranging from 643 for ANSA to nearly twice as many for Adnkronos (1,247), with Reuters and UPI between these two extremes. The average text length also varies across the four news sources (the minimum being 303 words per text in UPI and the maximum 555 in ANSA, with Adnkronos and Reuters somewhere in the middle).

While some of these issues are inevitable in this kind of investigation comparing the phraseology of mediated vs. native/original news agency reports, those having to do with the size of the corpora and the number of texts contained in them are rather controversial methodologically. As a consequence of our corpus building procedure and overall research design, we decided to present the data for the comparative analyses of the LBs in normalised terms (taking the occurrences pMw as the main unit of analysis). The only alternative would have been to tweak the parameters used in collecting the news texts semi-automatically from the web using BootCaT for the four news corpora (cf. Section 4.3). This would have resulted in corpora with similar internal features (e.g. by sampling a pre-determined number of texts of the same length), and accordingly having the same overall size. However, this in itself would have given rise to another undesirable bias, and such a direct and arbitrary intervention of the researcher on the data collection process would have introduced another set of (probably equally, if not more, serious) problems. In this regard, Oakey (2009) discusses the importance of these methodological choices in comparative corpus-based studies of fixed collocational patterns, investigating the tension between isolexical and isotextual approaches, i.e. comparing corpora with the same number of tokens vs. the same number of texts. These remain interesting outstanding methodological issues when it comes to the comparison of multiple corpora to investigate phraseology and LBs, in particular in variationist studies such as the present one.

6.3 Future work

To address some of these limitations, as part of our future work we intend to replicate the study with larger corpora of a similar size for all the news sources involved. We have discussed in Section 5.5 the issue of the length of the LBs to be investigated, and it is clearly desirable to widen the analysis to sequences of 3 words as well as 5/6-word LBs. If, on the one hand, this is likely to result in more phenomena of partial overlap and concatenation, on the other hand a more inclusive approach is bound to enhance the thoroughness of the phraseological analysis; other methodological problems arise, however, such as how to change the minimum frequency cut-off point as a result of considering shorter or longer lexical sequences as LBs to be extracted from the corpora. In addition, Section 5.7 has demonstrated on a small scale the feasibility and interest of examining collocational patterns connected with LBs, and nothing prevents colligational features from being considered too, possibly along with elements of semantic preference and semantic prosody, to provide a more comprehensive qualitative phraseological investigation.

Finally, we intend to look more systematically into the editorial processes followed at both ANSA and Adnkronos to generate their English-language output, e.g. by means of site visits and structured interviews with the managers in charge of these services, as well as with the professional authors and editors involved. This would reveal, for example, the background and language profiles of the journalists writing the news reports, the extent of revision and adaptation taking place, etc. A closer understanding of the editorial context would help us to uncover with more confidence the forms of mediation that are typical of these working scenarios, and possibly to provide more insightful explanations for some of the phraseological patterns that can be observed in the data. This would be particularly useful in cases where the mediated news sources display distinctive features that cannot be ascribed to the explicit house style of the individual news agencies or to the idiosyncratic preferences of individual writers.

Acknowledgements

This paper was originally presented at the workshop on “News, (new) media, and corpora: from methodology to theory”, which was held in Giessen (Germany) on May 26th 2010 as part of the 31st ICAME conference on “Corpus Linguistics and Variation in English”. Thanks are due to Roberta Facchinetti, who organised the workshop, and to the participants for their feedback. The author is also indebted to Silvia Bernardini for advice and interesting discussions on some of the aspects of this work, and to Eros Zanchetta and Magnus Huber for very helpful comments on an earlier draft of the paper. Any inaccuracies are the sole responsibility of the author.

Notes

[1] The front-end to BootCaT was developed by Eros Zanchetta of the University of Bologna at Forlì, Italy. This free and open-source software can be downloaded from http://bootcat.sslmit.unibo.it, and is accompanied by a step-by-step online tutorial that can be accessed from the same webpage.

[2] The URLs of their four home pages are as follows: http://www.ansa.it, http://www.adnkronos.com, http://www.reuters.com and http://www.upi.com.

[3] This means that an attempt was made to find a common sub-domain for each of the four websites that would be shared by all (or many of) the webpages containing news reports published in English for subsequent download, without biasing their contents towards specific domains or topics. In addition, for the two Italian news agencies the sub-domains had to identify webpages with content exclusively in English. By way of example, the “catch-all” URL used for ANSA was http://www.ansa.it/english/index.html, while the one for UPI was http://www.upi.com/Top_News.

[4] The 17 function words used as seeds in BootCaT were chosen arbitrarily, so as to be typically very common in any English text: a, an, and, by, for, from, in, not, of, on, or, out, that, the, this, to, with.

[5] However, manual inspection of the corpus contents revealed that boilerplate stripping had not always been completely accurate, especially for some of the UPI webpages: depending on the layout of the pages containing the texts to be downloaded, occasionally passages of the actual reports were erroneously removed during boilerplate filtering, while other times some boilerplate was downloaded along with the actual text. While it was not possible to reinstate passages that had been wrongly removed by BootCaT, any remaining boilerplate or non-textual elements were manually deleted to obtain a clean corpus.

[6] AntConc is a free tool and can be downloaded from http://www.laurenceanthony.net/software/antconc/. LBs are called “clusters” in the software, and version 3.2.1w was used for the analyses presented in this study.

[7] The figures given in the count columns next to each LB indicate the normalised frequency in each corpus, factored according to the respective size, with the minimum being 40 occurrences pMw. As a result, the frequency of the LBs found in the two corpora (and indeed across all four of them, see Table 5) can be compared.
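The normalisation described in this note can be illustrated with a minimal sketch. The function names and the corpus size used in the example are hypothetical and serve only to show the arithmetic; the threshold of 40 occurrences pMw is the one adopted in this study.

```python
MIN_PMW = 40  # minimum normalised frequency for a sequence to count as an LB


def per_million_words(raw_count: int, corpus_size: int) -> float:
    """Scale a raw count to occurrences per million words (pMw)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    return raw_count * 1_000_000 / corpus_size


def qualifies_as_lb(raw_count: int, corpus_size: int,
                    threshold: float = MIN_PMW) -> bool:
    """True if the normalised frequency meets the minimum threshold."""
    return per_million_words(raw_count, corpus_size) >= threshold


# Hypothetical figures: a 500,000-word corpus (not one of the real corpora).
print(per_million_words(50, 500_000))   # 100.0
print(qualifies_as_lb(10, 500_000))     # False (20 pMw is below 40)
```

Because every count is scaled to the same base, sequences from corpora of different sizes can be placed side by side.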

[8] The quantities given in Figure 1 refer to absolute (i.e. not normalised) numbers, in that they correspond to the stock of LBs found in the four corpora. As a result, they cannot be directly compared across the four news sources, due to the different sizes of the corpora concerned.

[9] Since all the numbers included in Figure 2 are normalised pMw, the results can be directly compared across the corpora.

[10] In the ANSA corpus, “the head of the” (lower-case initial) occurs 81 times, and it is the 5th most frequent LB, while its typographical variant “The head of the” (upper-case initial) is found 53 times and is half-way down the frequency ranking (the two relevant lines are indicated with an asterisk in Table 6). The combination of these two LBs, which differ only typographically, would be the 2nd most frequent sequence in the ANSA corpus. Similarly, the Adnkronos corpus contains both “in an interview with” (lower-case initial, 112 cases) and “In an interview with” (sentence-initial, 57 instances), which combined would be by far the most frequent LB in the Adnkronos corpus (the two lines in question are marked with a circle in Table 6).
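As an illustration of how such typographical variants could be combined, the following sketch folds the case of the first character only and sums the frequencies. This is not the procedure used in the study, where the variants were kept separate; the function name is hypothetical, and the figures are those quoted in this note.

```python
from collections import Counter


def merge_initial_case_variants(counts: dict) -> Counter:
    """Sum frequencies of sequences that differ only in the case of
    their first character (e.g. sentence-initial vs. mid-sentence)."""
    merged = Counter()
    for bundle, freq in counts.items():
        key = bundle[:1].lower() + bundle[1:]  # fold the first character only
        merged[key] += freq
    return merged


ansa = {"the head of the": 81, "The head of the": 53}
adnkronos = {"in an interview with": 112, "In an interview with": 57}

print(merge_initial_case_variants(ansa)["the head of the"])            # 134
print(merge_initial_case_variants(adnkronos)["in an interview with"])  # 169
```

Only the initial character is folded, so that capitalised proper nouns occurring later in a sequence are left untouched.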

[11] In Table 6, LBs that have the same number of normalised occurrences within a single corpus (e.g. both “a number of issues” and “is one of the” occur 47 times pMw in the ANSA corpus) are ordered alphabetically, so LBs with identical frequencies are nonetheless assigned different positions in the ranking. This is the reason why in the Reuters corpus “by the end of” is 10th in the ranking and “for the first time” is in 11th position, even though both of them occur 44 times pMw.
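The ordering principle described in this note (descending frequency, with alphabetical order used to separate ties) can be sketched as follows. The function name is hypothetical, and the example contains only the two Reuters sequences mentioned above, so the ranks it yields are relative to this small sample rather than to the full Table 6.

```python
def rank_bundles(freqs: dict) -> list:
    """Rank sequences by descending frequency; ties are broken
    alphabetically, so tied sequences still receive distinct ranks."""
    ordered = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, bundle, freq)
            for rank, (bundle, freq) in enumerate(ordered, start=1)]


reuters_sample = {"for the first time": 44, "by the end of": 44}

for rank, bundle, freq in rank_bundles(reuters_sample):
    print(rank, bundle, freq)
# 1 by the end of 44
# 2 for the first time 44
```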

[12] With regard to the instances of “the head of the” and “The head of the” in the ANSA corpus, see footnote number 10. Table 7 includes only the more frequent spelling variant with the lower-case initial.

[13] The sequence “at the end of” is present in the UPI corpus, but it does not meet the minimum frequency requirement of at least 40 occurrences pMw which is necessary to be included in the analysis of LBs in the present study.

[14] Data about the frequency of LBs in the BNC were obtained from the “Phrases in English” database (using the “BNC N-Grams” query interface), a system developed by William H. Fletcher of the US Naval Academy in collaboration with Michael Stubbs of the University of Trier (Germany), which can be accessed from http://phrasesinenglish.org/explore.html.

[15] In Tables 8 and 9 the superscript number within parentheses in a cell indicates the frequency ranking within the corpus concerned: in this case, “at the end of” is the most frequent 4-word LB in the ANSA corpus. In addition, the underlined numbers indicate that the frequency of occurrences normalised pMw is below the threshold of 40 that was set as a minimum frequency requirement for phraseological sequences to qualify as LBs in this study (see Section 5.1). Whilst 22 occurrences pMw of “by the end of” in the ANSA corpus is well below the threshold, 36 cases pMw for “at the end of” in the UPI corpus is much closer to the requirement set to qualify as an LB in the context of the present analysis, hence this sequence represents a borderline case.

Sources

Home page of the ANSA website with the link to the English-language section: www.ansa.it. English-language section of the ANSA website: http://www.ansa.it/english/index.html.

Home page of the Adnkronos website with the link to the English-language section: www.adnkronos.com. English-language section of the Adnkronos website: http://www1.adnkronos.com/IGN/Aki/English/.

References

Anthony, L. 2006. “Developing a freeware, multiplatform corpus analysis toolkit for the technical writing classroom”. IEEE Transactions on Professional Communication 49(3): 275–286.

Baroni, M. & S. Bernardini. 2004. “BootCaT: Bootstrapping corpora and terms from the web”. Proceedings of LREC 2004, ed. by M.T. Lino, M.F. Xavier, F. Ferreira, R. Costa, R. Silva et al., 1313–1316. Lisbon: ELDA. http://www.lrec-conf.org/proceedings/lrec2004/pdf/509.pdf

Barrett, D. 2007. “‘War on terror’ – An intentional choice of words? A corpus analysis of war on and war against”. Proceedings of the Corpus Linguistics Conference CL2007, University of Birmingham (UK), 27–30 July 2007, ed. by M. Davies, P. Rayson, S. Hunston & P. Danielsson. Birmingham: University of Birmingham. http://ucrel.lancs.ac.uk/publications/cl2007/paper/20_Paper.pdf

Bassnett, S. 2005. “Bringing the news back home: Strategies of acculturation and foreignisation”. Language and Intercultural Communication 5(2): 120–130.

Bayley, P. 2007. “Terror in political discourse: from the Cold War to the unipolar world”. Discourse and Contemporary Social Change, ed. by N. Fairclough, G. Cortese & P. Ardizzone, 49–72. Bern: Peter Lang.

Bell, A. 1991. The Language of News Media. Oxford: Blackwell.

Bernardini, S., A. Ferraresi & F. Gaspari. 2010. “Institutional academic English in the European context: A web-as-corpus approach to comparing native and non-native language”. Professional English in the European context: The EHEA challenge, ed. by Á. Linde López & R. Crespo Jiménez, 27–53. Bern: Peter Lang.

Biber, D. 2006. University Language: A Corpus-Based Study of Spoken and Written Registers. Amsterdam: John Benjamins.

Biber, D. & F. Barbieri. 2007. “Lexical bundles in university spoken and written registers”. English for Specific Purposes 26: 263–286.

Biber, D. & S. Conrad. 1999. “Lexical bundles in conversation and academic prose”. Out of Corpora: Studies in Honor of Stig Johansson, ed. by H. Hasselgård & S. Oksefjell, 181–189. Amsterdam: Rodopi.

Biber, D., S. Conrad & V. Cortes. 2004. “If you look at…: Lexical bundles in university teaching and textbooks”. Applied Linguistics 25(3): 371–405.

Biber, D., S. Johansson, G. Leech, S. Conrad & E. Finegan. 1999. Longman Grammar of Spoken and Written English. London: Longman.

Bielsa, E. 2007. “Translation in global news agencies”. Target 19(1): 135–155.

Bielsa, E. & S. Bassnett. 2009. Translation in Global News. London: Routledge.

Boyd-Barrett, O. & T. Rantanen, eds. 1998. The Globalization of News. London: Sage.

Caimotto, C. 2010. “Translating foreign articles with local implications: A case study”. Political Discourse, Media and Translation, ed. by C. Schäffner & S. Bassnett, 76–93. Newcastle upon Tyne: Cambridge Scholars Publishing.

Chen, Y.-H. & P. Baker. 2010. “Lexical bundles in L1 and L2 academic writing”. Language Learning & Technology 14(2): 30–49.

Clausen, L. 2004. “Localizing the global: ‘Domestication’ processes in international news production”. Media, Culture & Society 26(1): 25–44.

Conway, K. & S. Bassnett, eds. 2006. Translation in Global News: Proceedings of the Conference Held at the University of Warwick, 23 June 2006. Coventry: The Centre for Translation and Comparative Cultural Studies. http://www.ufs.ac.za/docs/librariesprovider20/linguistics-and-language-practice-documents/all-documents/feinauer-translation-in-global-news-proceedings-931-eng.pdf?Status=Master&sfvrsn=0

Cortes, V. 2002. “Lexical bundles in freshman composition”. Using Corpora to Explore Linguistic Variation, ed. by R. Reppen, S. Fitzmaurice & D. Biber, 131–145. Amsterdam: John Benjamins.

Cortes, V. 2004. “Lexical bundles in published and student disciplinary writing: Examples from history and biology”. English for Specific Purposes 23: 397–423.

Cortes, V. 2006. “Teaching lexical bundles in the disciplines: An example from a writing intensive history class”. Linguistics and Education 17: 391–406.

Cortes, V. 2008. “A comparative analysis of lexical bundles in academic history writing in English and Spanish”. Corpora 3(1): 43–57.

Fairclough, N. 2007. Language and Globalization. London & New York: Routledge.

Forchini, P. & A. Murphy. 2008. “N-grams in comparable specialized corpora: Perspectives on phraseology, translation, and pedagogy”. International Journal of Corpus Linguistics 13(3): 351–367.

Gaspari, F. & S. Bernardini. 2010. “Comparing non-native and translated language: Monolingual comparable corpora with a twist”. Using Corpora in Contrastive and Translation Studies, ed. by R. Xiao, 215–234. Newcastle upon Tyne: Cambridge Scholars Publishing.

Goźdź-Roszkowski, S. 2006. “Frequent phraseology in contractual instruments: A corpus-based study”. New Trends in Specialized Discourse Analysis, ed. by M. Gotti & D.S. Giannoni, 147–161. Bern: Peter Lang.

Groom, N. 2009. “Effects of second language immersion on second language collocational development”. Researching Collocations in Another Language: Multiple Interpretations, ed. by A.W. Barfield & H. Gyllstad, 21–33. Basingstoke: Palgrave Macmillan.

Hajmohammadi, A. 2005. “Translation evaluation in a news agency”. Perspectives: Studies in Translatology 13(3): 215–224.

Holland, R. 2006. “Language(s) in the global news: Translation, audience design and discourse (mis)representation”. Target 18(2): 229–259.

Horvit, B. 2006. “International news agencies and the war debate of 2003”. The International Communication Gazette 68(5–6): 427–447.

Hursti, K. 2001. “An insider’s view on transformation and transfer in international news communication: An English–Finnish perspective”. Helsinki English Studies 1. http://blogs.helsinki.fi/hes-eng/volumes/volume-1-special-issue-on-translation-studies/an-insiders-view-on-transformation-and-transfer-in-international-news-communication-an-english-finnish-perspective-kristian-hursti/

Hyland, K. 2008a. “As can be seen: Lexical bundles and disciplinary variation”. English for Specific Purposes 27: 4–21.

Hyland, K. 2008b. “Academic clusters: Text patterning in published and postgraduate writing”. International Journal of Applied Linguistics 18(1): 41–62.

Jackson, R. 2005. Writing the War on Terrorism: Language, Politics and Counter-Terrorism. Manchester: Manchester University Press.

Juknevičienė, R. 2009. “Lexical bundles in learner language: Lithuanian learners vs. native speakers”. Kalbotyra 61(3): 61–71.

Kang, J.-H. 2007. “Recontextualization of news discourse: A case study of translation of news discourse in North Korea”. The Translator 13(2): 219–242.

Kuo, S.-H. & M. Nakamura. 2005. “Translation or transformation? A case study of language and ideology in the Taiwanese press”. Discourse & Society 16(3): 393–417.

Lee, C.-S. 2006. “Differences in news translation between broadcasting and newspapers: A case study of Korean-English translation”. Meta 51(2): 317–327.

Mason, O. & R. Platt. 2006. “Embracing a new creed: Lexical patterning and the encoding of ideology”. College Literature 33(2): 154–170.

Milizia, D. & C. Spinzi. 2008. “The ‘terroridiom’ principle between spoken and written discourse”. International Journal of Corpus Linguistics 13(3): 322–350.

Murphy, A.C. 2008. Editing Specialized Texts in English: A Corpus-Assisted Analysis. Milano: LED.

Nekrasova, T.M. 2009. “English L1 and L2 speakers’ knowledge of lexical bundles”. Language Learning 59(3): 647–686.

Oakey, D. 2009. “Fixed collocational patterns in isolexical and isotextual versions of a corpus”. Contemporary Corpus Linguistics, ed. by P. Baker, 140–158. London: Continuum.

Orengo, A. 2005. “Localising news: Translation and the ‘global-national’ dichotomy”. Language and Intercultural Communication 5(2): 168–187.

Qian, Y. 2010. Discursive Constructions Around Terrorism in the People’s Daily (China) and The Sun (UK) Before and After 9/11: A Corpus-Based Contrastive Critical Discourse Analysis. Bern: Peter Lang.

Read, D. 1999. The Power of News: The History of Reuters. Second edition. Oxford: Oxford University Press.

Richardson, J.E. 2007. Analysing Newspapers: An Approach from Critical Discourse Analysis. Basingstoke: Palgrave Macmillan.

Schäffner, C. 2005. “Bringing a German voice to English-speaking readers: Spiegel International”. Language and Intercultural Communication 5(2): 154–167.

Scott, M. & C. Tribble. 2006. Textual Patterns. Amsterdam: John Benjamins.

Shrivastava, K.M. 2007. News Agencies from Pigeon to Internet. Elgin, IL: New Dawn Press.

Sidiropoulou, M. 1995. “Causal shifts in news reporting: English vs Greek press”. Perspectives: Studies in Translatology 3(1): 83–98.

Starcke, B. 2008. “I don’t know – differences in patterns of collocation and semantic prosody in phrases of different lengths”. Language, People, Numbers: Corpus Linguistics and Society, ed. by A. Gerbig & O. Mason, 199–216. Amsterdam: Rodopi.

Stubbs, M. 2007. “An example of frequent English phraseology: Distributions, structures and functions”. Corpus Linguistics Twenty-Five Years On, ed. by R. Facchinetti, 89–105. Amsterdam: Rodopi.

Stubbs, M. & I. Barth. 2003. “Using recurrent phrases as text-type discriminators: A quantitative method and some findings”. Functions of Language 10(1): 61–104.

Van Dijk, T.A. 1988. News Analysis: Case Studies of International and National News in the Press. Hillsdale, NJ: Lawrence Erlbaum.

van Doorslaer, L. 2009. “How language and (non-)translation impact on media newsrooms: The case of newspapers in Belgium”. Perspectives: Studies in Translatology 17(2): 83–92.

Vuorinen, E. 1997. “News translation as gatekeeping”. Translation as Intercultural Communication: Selected Papers from the EST Congress – Prague 1995, ed. by M. Snell-Hornby, Z. Jettmarová & K. Kaindl, 161–171. Amsterdam: John Benjamins.