Background

The Diachronic Corpus of Present-Day Spoken English draws its material from two corpora of Modern British English, both founded at the Survey of English Usage (SEU) at University College London: the London-Lund Corpus (LLC), compiled in the 1960s, and the British Component of the International Corpus of English (ICE-GB), compiled in the 1990s. The project's aim was to construct a fully parsed and searchable diachronic corpus of spontaneous spoken English, containing carefully selected and directly comparable texts from the LLC and ICE-GB corpora.

The LLC material

The London-Lund Corpus was recorded over several decades: the earliest tape is dated 1953, and the last, S-06-09, was recorded in 1987. The LLC texts included in DCPSE span the years 1958 to 1977. In DCPSE the Date variable stores the year of recording.

The LLC was transcribed on paper at the Survey of English Usage and then, famously, typed up on paper cards or 'slips', which were archived at the Survey in card index cabinets. Without computers, 'indexing' consisted of manually underlining constituents on slips, and 'retrieval' consisted of opening card indexes. It was only in the 1980s that the LLC was made accessible via a computer.

Many of the recordings in the LLC were made without the knowledge of all of the participants, a practice that would not be considered ethical today and that was not followed for ICE-GB. DCPSE contains an Awareness variable that records whether or not the speaker was aware of the recording.

The DCPSE project took these LLC texts and re-annotated them in a way that was as consistent with ICE-GB as possible. This meant importing ICE-GB transcription conventions, phonetic and prosodic information and segmentation, and recovering sociolinguistic information from dusty files – as well as carrying out the part-of-speech tagging and parsing of the text.

The ICE-GB material

ICE-GB is composed of spoken and written texts, distributed over thirty-two text categories. The material dates from the early 1990s. The corpus contains textual markup and wordclass tags and, unusually, has been fully grammatically annotated (tagged and parsed): every sentence or utterance in the corpus has been assigned a tree structure.