The published version of the HUM19UK corpus contains machine-readable versions of novels that have been cleaned and annotated. By "cleaned" we mean that, where necessary, we removed from the text any illustrations, captions, reviews, transcriber notes, and introductions and epilogues by anyone other than the text’s author.

The annotation we added is minimal, but we hope it will be of some use. All chapter headings and numbers, as well as any other divisions in the text, such as books or parts, have been placed in tags. Where multiple volumes of one text appear in the corpus, volumes are tagged as divisions. Where the electronic version of the text we used contained page numbers, we also placed these in tags so that they will be ignored by corpus software. We also added a small header to each text, which includes: novel title; author’s name; author’s gender; year of first publication; and source of the machine-readable version of the text.

Additionally, we did not remove sections of the text that, although not part of the story told in the novel, may be relevant to its interpretation, such as prefaces by the author, epigraphs and contents pages (where present in the transcription). Instead, we enclosed these in angle brackets (i.e. < >) so that they will be ignored by most corpus tools but can be extracted if required for analysis.
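
If your corpus tool does not ignore angle-bracketed material automatically, the bracketed sections can be separated from the running text with a short script. The following is only a sketch in Python: the function name split_bracketed and the sample string are hypothetical, and it assumes that bracketed material never contains a nested < or > character. Note that it treats everything between angle brackets alike, so tags and header fields written in angle brackets would be extracted too.

```python
import re

# Assumption: a bracketed span never contains a nested "<" or ">".
# The character class also matches newlines, so multi-line spans are covered.
BRACKETED = re.compile(r"<([^<>]*)>")


def split_bracketed(text):
    """Return (body, extracted).

    body is the text with every <...> span removed; extracted is the list
    of span contents, in order, with surrounding whitespace stripped.
    """
    extracted = [span.strip() for span in BRACKETED.findall(text)]
    body = BRACKETED.sub("", text)
    return body, extracted


# Hypothetical sample, not taken from the corpus.
sample = "<Preface by the author.>\nIt was a dark night.\n<CONTENTS>\nThe end."
body, extracted = split_bracketed(sample)
print(extracted)
```

The extracted list can then be analysed separately, while body is the text as most corpus tools would see it.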

The file name of each corpus text is its year of publication. This should allow you to easily group texts by decade, should you want to use only one decade, or a selection of decades, of our reference corpus.
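
As an illustration, grouping by decade could be scripted along the following lines. This is a minimal sketch in Python, not part of the corpus distribution: the function name group_by_decade and the example file names are hypothetical, and it assumes that each file name begins with a four-digit year, possibly followed by a suffix or extension.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: every corpus file name starts with a four-digit year.
YEAR = re.compile(r"^(\d{4})")


def group_by_decade(filenames):
    """Map each decade (e.g. 1840) to the list of file names from that decade."""
    decades = defaultdict(list)
    for name in filenames:
        match = YEAR.match(Path(name).name)
        if match is None:
            continue  # skip anything that does not start with a year
        decade = int(match.group(1)) // 10 * 10
        decades[decade].append(name)
    return dict(decades)


# Hypothetical file names, for illustration only.
example = ["1813.txt", "1818_a.txt", "1847.txt", "notes.txt"]
groups = group_by_decade(example)
print(groups)
```

To run this over a local copy of the corpus, the list of names could be built from the directory in which the files are stored, for example with Path.iterdir().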

The tags for the author’s gender should enable you to easily cluster texts according to gender.

Chapter tags allow you to extract all first chapters, last chapters, introductions, or any other combination of chapters across some or all texts.