Basic structure

The corpus consists of 500 files of approximately 2,000 words each. Basic coding and CLAWS C7 grammatical annotation were used. For the purposes of CQPweb, words with enclitics are split, e.g. 'do n't'. For wordlists derived using WordSmith, these words are not split.
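To illustrate the difference between the split and unsplit forms, here is a minimal Python sketch of enclitic splitting. It is not the CLAWS or CQPweb tokeniser: the function name `split_enclitics` and the list of enclitics are assumptions made for illustration, and only the straight apostrophe is handled.

```python
import re

# Assumed, illustrative list of enclitics; the real CLAWS rules are more detailed.
ENCLITIC = re.compile(r"^(\w+?)(n't|'s|'re|'ve|'ll|'d|'m)$", re.IGNORECASE)

def split_enclitics(tokens):
    """Split each token into host word and enclitic where the pattern matches."""
    out = []
    for tok in tokens:
        m = ENCLITIC.match(tok)
        if m:
            # e.g. "don't" becomes "do" + "n't"
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(tok)
    return out

print(split_enclitics(["I", "don't", "know"]))
```

A wordlist built from the unsplit tokens would count "don't" as one type, whereas the split version contributes to the counts for "do" and "n't" separately, so frequencies from the two sources are not directly comparable for these words.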

The texts cover four main genres of published writing (press, general prose, learned writing, fiction), further subdivided into 15 sub-genres (see e.g. the Brown corpus text categories). Every effort was made to ensure that the contributors used British English.

Sampling principles

The sampling frame of the original Brown corpus had stressed the importance of taking random text extracts from a random point in each individual text (using random number tables), so as not to create a corpus of “beginnings” or “ends”. The later FLOB corpus did not use such strict random sampling, but instead the creators tried to create as close a “match” to the texts in LOB as possible. With the BE06, matched samples were
  Baker, P. (2009) 'The BE06 Corpus of British English and recent language change.' International Journal of Corpus Linguistics. 14:3 317.
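The Brown-style principle of starting an extract at a random point, rather than always at the beginning or end of a text, can be sketched in Python as follows. This is only an illustration: the original sampling relied on random number tables, and the function name `random_extract` and its parameters are hypothetical.

```python
import random

def random_extract(words, size=2000, rng=None):
    """Return `size` consecutive words starting at a random position.

    Texts no longer than `size` are returned whole.
    """
    rng = rng or random.Random()
    if len(words) <= size:
        return list(words)
    # Any start position that leaves room for a full extract is equally likely.
    start = rng.randrange(len(words) - size + 1)
    return words[start:start + size]

# Usage with a dummy 5,000-word text and a seeded generator for repeatability.
text = [f"w{i}" for i in range(5000)]
extract = random_extract(text, size=2000, rng=random.Random(1))
print(len(extract), extract[0])
```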

 

For balance, individual text samples were extracted from various parts of longer texts. Texts of 2,000 words were included in full. Many of the texts in genres A-C and E were shorter than 2,000 words and had to be combined to make up a 2,000-word sample.
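Combining short texts into one sample can be sketched as below. The rule used here, adding whole texts until the running total reaches the target and never cutting a text short, is an assumption for illustration; the source does not state how the combined samples were trimmed. The name `combine_short_texts` is hypothetical.

```python
def combine_short_texts(texts, target=2000):
    """Concatenate whole texts (lists of words) until `target` words are reached.

    Returns the combined sample and the number of texts used.
    Assumption: texts are never truncated, so the sample may exceed `target`.
    """
    sample, used = [], 0
    for words in texts:
        if len(sample) >= target:
            break
        sample.extend(words)
        used += 1
    return sample, used

# Usage: four short dummy texts of 800, 700, 600 and 500 words.
texts = [["a"] * 800, ["b"] * 700, ["c"] * 600, ["d"] * 500]
sample, used = combine_short_texts(texts)
print(len(sample), used)
```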

 

A further problem involved deciding what constituted a ‘British’ text. A number of possible definitions of a British text or a British author were initially considered. For example, any text published in the UK could be considered a British text, although this did not turn out to be a helpful definition (as many texts may have first been published in other countries, or were written by authors who were not British). At the start of the corpus-building project it was hypothesised that academic journals containing the word ‘British’ would be a good potential source of data, although this did not prove to be the case, as many contributors to such journals were American or Australian. Definitions based on the author were considered instead, although here a distinction could be made between an author who was born in the UK, an author who currently lived in the UK, and an author who (had) mainly lived in the UK.

  Baker, P. (2009) 'The BE06 Corpus of British English and recent language change.' International Journal of Corpus Linguistics. 14:3 318.