Basic structure

The Diachronic Corpus of Present Day Spoken English is sampled from ICE-GB and from the London-Lund Corpus. ICE-GB texts are 2,000 words each while the LLC contains 5,000-word texts. To achieve a balanced sample, it was necessary to take proportionately fewer texts from the LLC.

The texts were sampled into the following categories:

Table 1. DCPSE text categories and statistics.

| Text category | ICE-GB texts | LLC texts | Target words | Actual words |
|---|---|---|---|---|
| A. Face-to-face conversations, formal | 20 | 8 | 80,000 | 90,775 |
| B. Face-to-face conversations, informal | 90 | 36 | 360,000 | 403,844 |
| C. Telephone conversations (mostly informal) | 10 | 4 | 40,000 | 47,242 |
| D. Broadcast discussions (disparates/equals) | 20 | 8 | 80,000 | 89,157 |
| E. Broadcast interviews (disparates/equals) | 10 | 4 | 40,000 | 43,046 |
| F. Spontaneous commentary | 23 | 9 | 91,000 | 95,381 |
| G. Parliamentary language | 5 | 2 | 20,000 | 21,083 |
| H. Legal cross-examination | 2 | 1 | 9,000 | 9,658 |
| I. Assorted spontaneous (unscripted) speech | 5 | 2 | 20,000 | 21,675 |
| J. Prepared speech (mostly monologue) | 15 | 6 | 60,000 | 63,575 |
| Total | 200 | 80 | 800,000 | 885,436 |
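The target figures in Table 1 appear to follow directly from the nominal text sizes: each ICE-GB text contributes 2,000 words and each LLC text 5,000 words. The following is a minimal sketch that checks this arithmetic against the table, reading the ICE and LLC columns as text counts. The names `TABLE_1`, `target_words` and `check_table` are illustrative and not part of DCPSE or its software.

```python
# Nominal text sizes stated in the text above (words per text).
ICE_GB_TEXT_WORDS = 2_000
LLC_TEXT_WORDS = 5_000

# (category, ICE-GB texts, LLC texts, target words, actual words), from Table 1.
TABLE_1 = [
    ("A", 20, 8, 80_000, 90_775),
    ("B", 90, 36, 360_000, 403_844),
    ("C", 10, 4, 40_000, 47_242),
    ("D", 20, 8, 80_000, 89_157),
    ("E", 10, 4, 40_000, 43_046),
    ("F", 23, 9, 91_000, 95_381),
    ("G", 5, 2, 20_000, 21_083),
    ("H", 2, 1, 9_000, 9_658),
    ("I", 5, 2, 20_000, 21_675),
    ("J", 15, 6, 60_000, 63_575),
]


def target_words(ice_texts, llc_texts):
    """Target word count implied by the nominal text sizes."""
    return ice_texts * ICE_GB_TEXT_WORDS + llc_texts * LLC_TEXT_WORDS


def check_table(rows):
    """Return the categories whose stated target differs from the computed one."""
    return [
        category
        for category, ice, llc, target, _actual in rows
        if target_words(ice, llc) != target
    ]


# Column totals: (ICE-GB texts, LLC texts, target words, actual words).
totals = tuple(sum(column) for column in zip(*(row[1:] for row in TABLE_1)))

print(check_table(TABLE_1))
print(totals)
```

Under these assumptions every row is consistent with its stated target, and in each category the ICE-GB and LLC halves contribute roughly equal numbers of target words, which is what taking proportionately fewer LLC texts is meant to achieve.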

Figure 1. Word counts per text category in DCPSE.

Only c. 130,000 words are found in corpus texts with a single speaker; the remainder are in conversations or multi-speaker presentations.

DCPSE text codes consist of the prefix "DI-" (for ICE-GB) or "DL-" (for LLC), the category letter A-J (above), and an index number. Thus,

DI-B07 is the seventh text among the informal face-to-face conversations sourced from ICE-GB (1990s).

The sociolinguistic variable Source corpus stores the source corpus (ICE-GB or LLC) for every text.
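The text-code convention described above can be decoded mechanically. The sketch below is an illustration only: `parse_text_code` and `TextCode` are hypothetical names rather than part of any DCPSE tool, and the pattern assumes the index is written as plain decimal digits, as in "DI-B07".

```python
import re
from typing import NamedTuple

# Prefix letter -> source corpus, as described in the text.
SOURCES = {"I": "ICE-GB", "L": "LLC"}

# Category letter -> text category, from Table 1.
CATEGORIES = {
    "A": "Face-to-face conversations, formal",
    "B": "Face-to-face conversations, informal",
    "C": "Telephone conversations (mostly informal)",
    "D": "Broadcast discussions (disparates/equals)",
    "E": "Broadcast interviews (disparates/equals)",
    "F": "Spontaneous commentary",
    "G": "Parliamentary language",
    "H": "Legal cross-examination",
    "I": "Assorted spontaneous (unscripted) speech",
    "J": "Prepared speech (mostly monologue)",
}

# Assumed shape: "D", source letter, hyphen, category letter, decimal index.
_TEXT_CODE = re.compile(r"D([IL])-([A-J])(\d+)")


class TextCode(NamedTuple):
    source: str    # "ICE-GB" or "LLC"
    category: str  # category description from Table 1
    index: int     # position of the text within its category and source


def parse_text_code(code: str) -> TextCode:
    """Split a DCPSE text code such as "DI-B07" into its components."""
    match = _TEXT_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a recognised DCPSE text code: {code!r}")
    source_letter, category_letter, digits = match.groups()
    return TextCode(SOURCES[source_letter], CATEGORIES[category_letter], int(digits))


print(parse_text_code("DI-B07"))
```

For "DI-B07" this yields the source ICE-GB, the category "Face-to-face conversations, informal" and the index 7, matching the worked example above.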