Basic structure

The corpus is an indispensable and unique collection of continuous free dialect speech. It helps us ascertain the real frequencies of certain features that have traditionally been considered typical of dialects. It further enables us to consider the spread and contexts of these features. Furthermore, synchronic study of dialectal variation might suggest distinctions that are not distinguishable from the somewhat fragmentary historical data.

The recordings were made in the 1970s and the 1980s by Finnish postgraduates, advised in this work by Professor Harold Orton and other Leeds scholars. The fieldwork done in the 1970s was supervised by Tauno F. Mustanoja at the University of Helsinki. The aim was to create a computer corpus consisting of enough continuous speech to provide material for the study of dialectal morphosyntax and thus to supplement the Leeds Survey of English Dialects, which focuses mainly on phonological and lexical data.


1 - Cambridgeshire & Isle of Ely 2 - Devon 3 - Somerset	4 - Suffolk 5 - Essex 6 - Lancashire

Map 1. Map of regions visited by the fieldworkers.

Table 1. Geographical regions covered in the corpus. The number in the brackets denotes the number of secondary informants (primary informant = sole speaker in a session or speaker of more than 500 words; secondary speaker = speaker with less than 500 words in a session with more than one informant).

Region (rural)	No of villages	No of informants
CAMBRIDGESHIRE	26	38 (+ 6)
DEVON	9	33 (+ 8)
ISLE OF ELY	20	52 (+ 6)
SOMERSET	15	24 (+ 5)
SUFFOLK	16	47 (+ 4)
Region (urban)
ESSEX	3	6 (+ 1)
LANCASHIRE	3	6 (+ 1)
Total	92	206 (+ 31)

Recorded material

The fieldworkers found their informants, both men and women, with the help of locals. The fieldworkers went from village to village on foot, by bicycle or by bus, carrying equipment weighing an average of 15 kilograms. The whole package included the recorder itself, a bag of reels, a microphone, batteries, power outlet adapters, correction tape, cables, fuse, etc. (Figure 1).

Figure 1. Recording equipment. (Click to enlarge.)

Figure 2. Fieldworker's early transcription notes. (Click to enlarge.)

Some 210 hours of these recordings are now stored in digital form. The majority of the recordings have been transcribed by the fieldworkers themselves. Most of the recorded material has been donated to the English Department of the Helsinki University.

Working the corpus material into a computer corpus was an idea introduced by Professor Ossi Ihalainen in the 1980s. The project was integrated into VARIENG in 2000, under the coordination of Kirsti Peitsara. The texts have been transcribed orthographically, with only minor changes to show dialectal pronunciation. The nearly one-million-word corpus today includes speech from 237 informants recorded in 92 localities in six counties.

Sociolinguistic coverage

At the time of the interview in the 1970s and 1980s most of the male and female rural informants were retired and in their mid 70s or older, having thus received their education during the first decade of the 20th century. The Essex and Lancashire urban corpora, collected in the late 1980s, differ from the above in giving three-generation two-sex samples, with informants ranging from 19 to 70 years.

Most of the subcorpora have a high proportion of males as primary informants; the Devon corpus has no females as primary speakers, and female speech covers circa one fifth in the Cambridgeshire, Isle of Ely and Suffolk corpora. This reflects the traditional idea of males as best preservers of genuine dialect, which was inherited from the SED model. An exception is made by the Essex and Lancashire subcorpora, in which the proportion of female speech is considerably greater, owing to the sociolinguistic theories being introduced to dialectology.

Table 2. Subcorpora word counts. Total WC indicates the number of words in the entire corpus, including interviewers' speech, preface text and prosodic expressions. Stripped WC is the number of words spoken by the informant only.

Region (rural)	Total WC/WordCruncher	Stripped WC
CAMBRIDGESHIRE	239206	190462
DEVON	87022	73350
ISLE OF ELY	140037	136944
SOMERSET	168211	132471
SUFFOLK	253270	218627
Region (urban)
ESSEX	58394	45930
LANCASHIRE	62501	48365
Total	1008641	846149

Figure 3. Relative proportions of the seven regions by word count.

Helsinki Corpus of British English Dialects

Basic structure

Recorded material

Sociolinguistic coverage