Basic structure
The corpus consists of 500 files of approximately 2,000 words each. Basic coding and CLAWS C7 grammatical annotation were applied. For the purposes of CQPweb, words with enclitics are split (e.g. 'do n't'); in wordlists derived using WordSmith, these words are not split.
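The difference between the split (CQPweb) and unsplit (WordSmith) forms can be illustrated with a minimal Python sketch. This is not the CLAWS tokeniser: the enclitic list, the function names and the whitespace-based tokenisation are all assumptions made for illustration, and only straight apostrophes are handled.

```python
import re

# Hypothetical, simplified enclitic list; the real CLAWS rules are more extensive.
ENCLITIC = re.compile(r"^(\w+?)(n't|'ll|'re|'ve|'s|'d|'m)$", re.IGNORECASE)


def split_enclitics(token):
    """Return [host, enclitic] if the token ends in a known enclitic, else [token]."""
    match = ENCLITIC.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


def tokenize(text, split=True):
    """Whitespace tokenisation, optionally splitting enclitics from their hosts."""
    tokens = text.split()
    if not split:
        return tokens
    result = []
    for token in tokens:
        result.extend(split_enclitics(token))
    return result
```

With these assumptions, `tokenize("I don't know")` yields the split form `['I', 'do', "n't", 'know']`, while `tokenize("I don't know", split=False)` keeps `don't` as a single token. Frequency counts for words such as 'do' can therefore differ between the two tools.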
The texts cover four main genres of published writing (press, general prose, learned writing, fiction), further subdivided into 15 sub-genres (see e.g. the Brown corpus text categories). Every effort was made to ensure that the contributors used British English.
Sampling principles
The sampling frame of the original Brown corpus had stressed the importance of taking random text extracts from a random point in each individual text (using random number tables), so as not to create a corpus of “beginnings” or “ends”. The later FLOB corpus did not use such strict random sampling, but instead the creators tried to create as close a “match” to the texts in LOB as possible. With the BE06, matched samples were
Baker, P. (2009) 'The BE06 Corpus of British English and recent language change.' International Journal of Corpus Linguistics 14(3): 317.
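The Brown-style principle of starting an extract at a random point can be sketched in Python as follows. This is only an illustration of the idea: the function name and parameters are hypothetical, a pseudo-random generator stands in for random number tables, and a real sampling procedure would probably also respect sentence or paragraph boundaries.

```python
import random


def random_extract(words, size=2000, rng=None):
    """Return a contiguous extract of `size` words starting at a random offset.

    Texts no longer than `size` are returned in full.
    """
    rng = rng or random.Random()
    if len(words) <= size:
        return list(words)
    # Any start offset that leaves room for a full-length extract is allowed.
    start = rng.randint(0, len(words) - size)
    return list(words[start:start + size])
```

Because every valid start offset is equally likely, extracts drawn this way are not biased towards the opening or closing passages of a text.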
For balance, individual text samples were extracted from various parts of longer texts. Texts of 2,000 words were included in full. Many of the texts in genres A-C and E were shorter than 2,000 words and had to be combined together to make up a 2,000-word sample.
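Combining short texts into samples of roughly 2,000 words could be done in several ways; the source does not say how texts were selected for combination or whether combined samples were trimmed. The Python sketch below assumes a simple greedy grouping in the order given, with a hypothetical function name and no trimming.

```python
def combine_short_texts(texts, target=2000):
    """Greedily group short texts (lists of words) into samples of >= target words.

    Returns (samples, leftover): each sample is a list of texts whose combined
    length reaches the target; leftover holds texts that did not fill a sample.
    """
    samples = []
    current = []
    count = 0
    for words in texts:
        current.append(words)
        count += len(words)
        if count >= target:
            samples.append(current)
            current = []
            count = 0
    return samples, current
```

For example, texts of 800, 700, 600 and 500 words would give one combined sample of 2,100 words from the first three, with the 500-word text left over until more material is available.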
A further problem involved deciding what constituted a ‘British’ text. A number
of possible definitions of a British text or a British author were initially considered.
For example, any text published in the UK could be considered a British
text, although this did not transpire to be a helpful definition (as many texts may
have first been published in other countries, or were written by authors who were
not British). At the start of the corpus building project it was hypothesised that
academic journals containing the word British would be a good potential source
of data, although this did not prove to be the case as many contributors to such
journals were American or Australian. Definitions based on the author were considered
instead, although here a distinction could be made between an author who
was born in the UK, an author who currently lived in the UK, and an author who
(had) mainly lived in the UK.
Baker, P. (2009) 'The BE06 Corpus of British English and recent language change.' International Journal of Corpus Linguistics 14(3): 318.