Background

The project began after a workshop on corpus linguistics given by Prof. Dan McIntyre and Dr. Brian Walker at University College Roosevelt. We noted the small size of the Fiction section of the BNC and wanted to create a corpus of fiction that was larger and more representative. Over time, the project narrowed to 19th-century fiction only, because of copyright restrictions on complete texts.

Creating a representative sample of 19th-century British prose fiction (in novel form) was not without challenges. Other corpora that aim to represent prose fiction from a particular time period, such as the Brown family of corpora, have used random sampling techniques whereby titles are sampled from a list of publications from a particular year or decade. This method was not practical for us. First, we were unable to obtain a definitive list of British novels published during the 19th century from which to sample. Second, even if we had such a list, it (and therefore any sample derived from it) would likely contain rare and hard-to-find novels, so any random sample would in practice have been restricted to those novels that were accessible. A further, more significant restriction was access to machine-readable digital versions of novels. Since we did not have the resources to create our own machine-readable versions of texts, we relied completely on various online sources that provide such versions of novels. This means that the population of 19th-century novels from which we could sample was actually the population available online in electronic form. However, producing a list of all 19th-century novels available online would itself be a rather difficult task.

Instead, we adopted a different approach that involved hand-picking texts for inclusion in our corpus using some guiding criteria designed to make the corpus representative and balanced. Our aim was to create a balanced corpus representative of 19th-century fiction that contained complete novels with publication dates spread across the whole time period. To achieve this representativeness, balance and spread, we aimed to include roughly one text per year across the 100 years of the 19th century (therefore creating a corpus comprising 100 texts in total), to have each text written by a different author, and to have a 50/50 male/female split of authors across the corpus. It was also important to us that the corpus include, alongside well-known authors and texts from the literary canon, authors who are not very well known now but were widely read during the 19th century itself.
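The guiding criteria above can be expressed as a simple check on a candidate selection. The following Python sketch is illustrative only and is not the procedure we used: the `Text` record, the `check_design` function, the "F"/"M" gender coding and the 1800 to 1899 year range are all assumptions made for the example, and the sample records are placeholders rather than texts from the corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """Hypothetical metadata record for one candidate novel."""
    title: str
    author: str
    author_gender: str  # "F" or "M" (assumed coding)
    year: int           # year of publication


def check_design(texts, first_year=1800, last_year=1899):
    """Report how far a selection departs from the design criteria.

    The default year range is an assumption about how the 100 years
    of the century are counted.
    """
    years = Counter(t.year for t in texts)
    authors = Counter(t.author for t in texts)
    # Count each author once so a repeated author does not skew the split.
    gender_by_author = {t.author: t.author_gender for t in texts}
    return {
        "total_texts": len(texts),
        "years_without_text": [
            y for y in range(first_year, last_year + 1) if years[y] == 0
        ],
        "years_with_several_texts": sorted(
            y for y, n in years.items() if n > 1
        ),
        "repeated_authors": sorted(a for a, n in authors.items() if n > 1),
        "gender_counts": dict(Counter(gender_by_author.values())),
    }


# Placeholder records, checked against a three-year window for brevity.
sample = [
    Text("Title A", "Author A", "F", 1800),
    Text("Title B", "Author B", "M", 1801),
    Text("Title C", "Author A", "F", 1801),
]
report = check_design(sample, first_year=1800, last_year=1802)
for criterion, value in report.items():
    print(criterion, value)
```

Because the aim was only roughly one text per year, the function reports gaps and duplicates rather than rejecting a selection outright, leaving the final judgement to the compilers.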

The corpus was constructed in a cyclical fashion, which means we created different versions of the corpus before producing the final version that is available for download on this website. The different versions of the corpus evolved as we re-thought important factors such as representativeness, balance and spread in light of feedback from colleagues at the University of Huddersfield and at PALA and IALS conferences.

The final content of the corpus is the result of the interaction between our decisions about the structure of the corpus (100 texts by 100 different authors spread across 100 years) and the availability of trustworthy machine-readable versions of texts.