British National Corpus (BNC)

  • The British National Corpus (BNC) is a snapshot of British English in the early 1990s. It is:
    • a sample corpus: composed of text samples generally no longer than 45,000 words.
    • a synchronic corpus: it includes imaginative texts from 1960 onwards and informative texts from 1975 onwards.
    • a general corpus: not specifically restricted to any particular subject field, register or genre.
    • a monolingual British English corpus: it comprises text samples which are substantially the product of speakers of British English.
    • a mixed corpus: it contains examples of both spoken and written language.

Project leader: BNC Consortium
Time of compilation: 1991-1994
Language: LModE
Period: late 20th century
Released: 1994
Project home page:


Reference Guide for the British National Corpus (XML Edition).


The BNC project was carried out and is managed by the BNC Consortium, an industrial/academic consortium led by Oxford University Press. The other members are the major dictionary publishers Addison-Wesley Longman and Larousse Kingfisher Chambers; the academic research centres Oxford University Computing Services (OUCS) and the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University; and the British Library's Research and Innovation Centre. The project was funded by the commercial partners, the Science and Engineering Research Council (now EPSRC) and the UK government's Department of Trade and Industry under the Joint Framework for Information Technology (JFIT) programme. Additional support was provided by the British Library and the British Academy.


The corpus is available for free download from the Oxford Text Archive (OTA). Usage is subject to the conditions of the BNC User Licence.


BNC XML Edition

The full BNC contains about 100 million words: 90% written, 10% orthographically transcribed spoken text. It is annotated with word-class information (part-of-speech, simplified word class) and lemmatized. The texts also contain detailed metatextual information. It is delivered in XML format.
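As a rough illustration of what this annotation looks like in practice, the sketch below reads the word-level attributes from a small XML fragment using Python's standard library. The fragment is hand-written for this example and is not taken from the corpus. The element and attribute names (`w` for words, `c` for punctuation, `c5` for the part-of-speech tag, `hw` for the lemma, `pos` for the simplified word class) are given here as assumptions about the XML Edition markup; the Reference Guide remains the authoritative description.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the word-level markup; NOT real corpus data.
# Attribute names (c5, hw, pos) are assumed from the XML Edition documentation.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def read_tokens(xml_text):
    """Return (form, lemma, part-of-speech tag, simplified class) per <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter("w"):
        # Word forms carry trailing whitespace in this fragment, so strip it.
        form = (w.text or "").strip()
        tokens.append((form, w.get("hw"), w.get("c5"), w.get("pos")))
    return tokens


tokens = read_tokens(SAMPLE)
for form, lemma, c5, pos in tokens:
    print(f"{form}\t{lemma}\t{c5}\t{pos}")
```

Punctuation elements are skipped here because only `w` elements are iterated; a real corpus file would also carry a header with the metatextual information, which this sketch does not attempt to read.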

Full reference information about the BNC is provided in the Reference Guide for the British National Corpus (XML Edition). Information about the BNC project and the original creation of the corpus can be found on the corpus creation page. The corpus can be downloaded from the OTA.

BNC Baby

BNC Baby is a subset of the BNC. It consists of four one-million-word samples, each compiled as an example of a particular genre: fiction, newspapers, academic writing and spoken conversation. The texts have the same annotation as the full corpus (part of speech, metadata, etc.). The Reference Guide to BNC Baby [.pdf file] offers further information about this sample, such as a description of its design and of the way in which it is encoded.

The BNC Baby is in XML format. The corpus can be downloaded from the OTA.

BNC Sampler

The BNC Sampler is a subset of the full BNC. It comprises two samples, one of written and one of spoken material, of one million words each, compiled to mirror the composition of the full BNC as far as possible. The word-class annotation of the BNC Sampler texts has been carefully checked and manually corrected. The Sampler was first compiled at Lancaster University during the creation of the BNC. More information about the Sampler can be found in the users reference guide for the BNC Sampler: XML Edition [.pdf file].

The BNC Sampler is in XML format. The corpus can be downloaded from the OTA.

Reference line and copyright

Copyright held by the BNC Consortium.