Basic structure

ICE-GB contains 500 texts of approximately 2,000 words each. Many of these texts are composite, that is, they consist of two or more different samples of the same type which have been combined to make up a 2,000-word 'text'. In the category of business letters, for instance, a total of 198 individual letters have been included. We refer to these individual samples as 'subtexts'.

The table below provides a summary of the composition of the ICE-GB corpus.

Table 1. ICE-GB Summary statistics.

                                         spoken    written      total
Number of words                         637,562    423,702  1,061,264
Number of 2,000-word texts                  300        200        500
Number of individual samples                447        554      1,001
Average number of words per text          2,125      2,118      2,122
Average number of words per sample        1,426        764      1,060
Number of syntactic trees                59,460     23,934     83,394
Average number of trees per text            198        119        166
Average number of trees per sample          133         43         83
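
The averages in Table 1 follow from the totals in the same table. The following is a minimal Python sketch of that arithmetic. It assumes the averages are truncated to whole numbers, which is inferred from the figures (for example, 23,934 / 200 = 119.67 is listed as 119) rather than stated in the text; the names `TOTALS`, `combine`, and `averages` are illustrative only.

```python
# Counts taken from Table 1 for the spoken and written parts of ICE-GB.
TOTALS = {
    "spoken": {"words": 637_562, "texts": 300, "samples": 447, "trees": 59_460},
    "written": {"words": 423_702, "texts": 200, "samples": 554, "trees": 23_934},
}


def combine(parts):
    """Sum each count over all parts to obtain the 'total' column."""
    keys = next(iter(parts.values())).keys()
    return {key: sum(part[key] for part in parts.values()) for key in keys}


def averages(counts):
    """Per-text and per-sample averages, truncated to whole numbers (assumed)."""
    return {
        "words_per_text": counts["words"] // counts["texts"],
        "words_per_sample": counts["words"] // counts["samples"],
        "trees_per_text": counts["trees"] // counts["texts"],
        "trees_per_sample": counts["trees"] // counts["samples"],
    }


# Build the three columns of Table 1 and print the derived rows.
table = dict(TOTALS, total=combine(TOTALS))
for part, counts in table.items():
    print(part, averages(counts))
```

Under the truncation assumption, the computed values agree with every average listed in Table 1.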

ICE-GB was designed primarily as a resource for syntactic studies. Every text unit ('sentence') in ICE-GB has been syntactically parsed at function and category level, and each unit is presented in the form of a syntactic tree. ICE-GB contains 83,394 parse trees, including 59,460 in the spoken part of the corpus. The trees in the corpus represent an invaluable resource for studies of the syntax of contemporary British English.

Table 2. ICE Corpus Design.

Spoken Texts (300)

  Dialogues (180)
    Private (100)
      face-to-face conversations (90)
      phonecalls (10)
    Public (80)
      classroom lessons (20)
      broadcast discussions (20)
      broadcast interviews (10)
      parliamentary debates (10)
      legal cross-examinations (10)
      business transactions (10)
  Monologues (100)
    Unscripted (70)
      spontaneous commentaries (20)
      unscripted speeches (30)
      demonstrations (10)
      legal presentations (10)
    Scripted (30)
      broadcast talks (20)
      non-broadcast speeches (10)
  Mixed (20)
    broadcast news (20)

Written Texts (200)

  Non-printed (50)
    Non-professional writing (20)
      untimed student essays (10)
      student examination scripts (10)
    Correspondence (30)
      social letters (15)
      business letters (15)
  Printed (150)
    Academic writing (40)
      humanities (10)
      social sciences (10)
      natural sciences (10)
      technology (10)
    Non-academic writing (40)
      humanities (10)
      social sciences (10)
      natural sciences (10)
      technology (10)
    Reportage (20)
      press news reports (20)
    Instructional writing (20)
      administrative/regulatory (10)
      skills/hobbies (10)
    Persuasive writing (10)
      press editorials (10)
    Creative writing (20)
      novels/stories (20)
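
The design in Table 2 is a hierarchy in which each category total is the sum of its subcategories. The following is a minimal Python sketch that encodes the hierarchy as a nested dictionary and recomputes the totals. The names `DESIGN` and `count_texts` are illustrative and are not part of any ICE-GB software.

```python
# Table 2 as a nested mapping; each leaf is the number of 2,000-word texts.
FIELDS = {"humanities": 10, "social sciences": 10,
          "natural sciences": 10, "technology": 10}

DESIGN = {
    "Spoken": {
        "Dialogues": {
            "Private": {"face-to-face conversations": 90, "phonecalls": 10},
            "Public": {
                "classroom lessons": 20, "broadcast discussions": 20,
                "broadcast interviews": 10, "parliamentary debates": 10,
                "legal cross-examinations": 10, "business transactions": 10,
            },
        },
        "Monologues": {
            "Unscripted": {
                "spontaneous commentaries": 20, "unscripted speeches": 30,
                "demonstrations": 10, "legal presentations": 10,
            },
            "Scripted": {"broadcast talks": 20, "non-broadcast speeches": 10},
        },
        "Mixed": {"broadcast news": 20},
    },
    "Written": {
        "Non-printed": {
            "Non-professional writing": {
                "untimed student essays": 10, "student examination scripts": 10,
            },
            "Correspondence": {"social letters": 15, "business letters": 15},
        },
        "Printed": {
            "Academic writing": dict(FIELDS),
            "Non-academic writing": dict(FIELDS),
            "Reportage": {"press news reports": 20},
            "Instructional writing": {
                "administrative/regulatory": 10, "skills/hobbies": 10,
            },
            "Persuasive writing": {"press editorials": 10},
            "Creative writing": {"novels/stories": 20},
        },
    },
}


def count_texts(node):
    """Recursively sum the leaf counts below a category."""
    if isinstance(node, int):
        return node
    return sum(count_texts(child) for child in node.values())


for mode, categories in DESIGN.items():
    print(mode, count_texts(categories))
```

Calling `count_texts` on any sub-dictionary, for example `DESIGN["Spoken"]["Dialogues"]`, gives the bracketed total shown for that category in Table 2.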

The texts in ICE-GB date from 1990 to 1993 inclusive. This means that the printed texts were originally published, and the spoken texts originally recorded, during this period. The corpus does not include reprints, second or later editions, or transcripts of repeat broadcasts. For handwritten material, such as letters and essays, these dates refer to the date of composition.

All authors and speakers are British. This means that they were born in Great Britain, that is, England, Scotland, or Wales. In a small number of cases, we have relaxed this criterion to include those who were born elsewhere, but moved to Britain at an early age.