Text Selection for EMEMT

The primary rationale behind the selection of individual texts has been the perceived significance of the text or its author to the history of medicine, with major authors being represented by several extracts. A second aim has been to provide sufficient and even coverage over the whole 200-year period. Availability has also influenced the selection, preference having been given to texts and editions readily accessible in major libraries. Individual texts are represented by extracts of roughly 10,000 words, shorter texts being included in toto. Extracts of this length make it possible to use EMEMT for studies of both lexical and discourse features.

The distribution of texts

In selecting texts for the corpus, the project team has striven for consistent coverage of each category across the timeline. For some categories, such as general treatises, this proved impossible because little original material was published during the early parts of the 16th century. Likewise, recipe collections are noticeably sparse in the first quarter of the 17th century. Although recipes were published during that time, all of them were repeat editions of earlier titles.

Text categories
Category chart designed by Ville Marttila (2007) for the EMEMT Poster presented at ICAME 28 at Stratford-on-Avon.

Editing and proofreading

All texts in the EMEMT corpus were keyed in by hand from facsimiles or originals, annotated with proprietary markup, and then proofread at least twice. The second round of proofreading was conducted by comparing the corpus edition with the original artefact at various research libraries, most importantly the British Library, the Wellcome Trust Library, and the Henry E. Huntington Library. All project members participated in the proofreading. Great care was taken to ensure that the corpus text matches a unique, identifiable copy of a book; in some cases, a keyed-in copy had to be altered slightly because minor differences were found between the corpus edition and the copy available at a research library.