HUM19UK Corpus

HUM19UK is the Huddersfield, Utrecht, Middelburg corpus of 19th-century British fiction. It was created between 2016 and 2019 as a collaborative project between the University of Huddersfield (UK), Utrecht University (the Netherlands) and University College Roosevelt in Middelburg (the Netherlands). The corpus contains 100 complete novels written by 100 authors (50 male, 50 female) over 100 years, with roughly 10 novels per decade. It totals 13 million words.

Time of compilation: 2016–2019
Size: 13 million words
Language: English
Number of texts/samples: 100 complete novels
Period: 1800–1899
Released: 2019
Project home page: https://www.linguisticsathuddersfield.com/hum19uk-corpus

Reference line and copyright

The corpus is free for anyone to use. All texts in the corpus are out of copyright and were extracted from the Project Gutenberg, Celebration of Women Writers, Victorian Women Writers Project, Chawton House and Public Library UK websites.

Manual

https://www.linguisticsathuddersfield.com/hum19uk-corpus

Compilers

Dr. Brian Walker, Fransina Stradling, Prof. Dan McIntyre, Elliott Land, Dr. Hazel Price (University of Huddersfield, UK)

Prof. Michael Burke (University College Roosevelt, the Netherlands/Utrecht University, the Netherlands)

Availability

Open access. Freely available for download at https://www.linguisticsathuddersfield.com/hum19uk-corpus

Technical information

The published version of the HUM19UK corpus contains machine-readable versions (.txt format) of novels that have been cleaned and annotated. The file name of each corpus text is its year of publication.
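
Because each file is named after its year of publication, the texts can be grouped by decade without any separate metadata file. The following is a minimal sketch rather than part of the official corpus tooling. It assumes the corpus has been unpacked into a local folder (here hypothetically called HUM19UK) and that each file name begins with a four-digit year; the function name group_by_decade is illustrative, and the annotation inside the files is left untouched.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: every corpus file name starts with a four-digit publication year.
YEAR_PATTERN = re.compile(r"^(\d{4})")


def group_by_decade(corpus_dir):
    """Map each decade (e.g. 1810) to the sorted list of .txt paths from that decade."""
    decades = defaultdict(list)
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        match = YEAR_PATTERN.match(path.stem)
        if match is None:
            continue  # skip files whose name does not start with a year
        year = int(match.group(1))
        decades[year // 10 * 10].append(path)
    return dict(decades)


if __name__ == "__main__":
    # "HUM19UK" is a hypothetical local folder name; adjust to your download location.
    for decade, paths in sorted(group_by_decade("HUM19UK").items()):
        print(decade, len(paths))
```

Run as a script, this prints one line per decade with the number of files found, which gives a quick check against the stated design of roughly 10 novels per decade.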