Conversion of the CEEC-400 into XML
A Manual to Accompany the XML Edition
2020
Lassi Saario
Research Unit for Variation, Contacts and Change in English (VARIENG)
Faculty of Arts
University of Helsinki
- Starting point
- Family tree of CEEC-400
- Why the mess?
- Pre-processing
- File division
- Ordering of letters
- Character encoding
- Parameter coding
- Text-level coding
- Custom codes
- Reversion of normalisation
- XML conversion
- Structure of the corpus
- Parameter coding
- Text-level coding
- Textual structure
- Special characters
- Page numbers
- Headings
- Emendations
- Comments
- Type changes
- Foreign language
- Superscripts
- Post-processing
- Endpoint
- References
CEEC-400 is a cover term for a family of corpora: the original Corpus of Early English Correspondence (CEEC), the CEEC Extension (CEECE), the CEEC Supplement (CEECSU) and their various versions. Together they cover a time span of almost 400 years from 1402 to 1800. The corpora are based on published editions of letters, which were sampled and digitised by a team of compilers at the University of Helsinki.
The original CEEC-400 was written in a custom version of the ancient COCOA (Word COunt and COncordance on Atlas) format, based on that of the Helsinki Corpus of English Texts. Different collections were often encoded by different people, mostly research assistants, interpreting instructions written by someone else. As there was no validator to check their output, the coding was prone to errors and inconsistencies, the number of which multiplied with each version of the corpus.
In the autumn of 2018, we were granted funding by the Faculty of Arts to convert CEEC-400 into XML so that it could be imported into more modern platforms (such as CQPweb). We soon found out that the shortcomings of the original coding would have to be detected and corrected before the conversion could take place. This document describes the correction and conversion project, carried out by Lassi Saario under the supervision of Tanja Säily and Samuli Kaislaniemi.
The documentation is arranged in chronological order. Chapters 1–2, on the background of the project and the pre-processing, are intended mostly for internal use within VARIENG, whereas the chapter on the XML conversion is likely to be of more interest to the general public.
Starting point
Family tree of CEEC-400
The CEEC corpora have a large family tree, the branches of which had grown quite far apart from each other by the time the conversion was about to start. What follows is a brief inventory of the corpus versions that existed at that point.
First, there were the three separate basic corpora, CEEC, CEECE and CEECSU, stored on our network drive (the ‘P drive’). Each letter collection was stored in its own text file, preceded by some metadata about the collection and followed by the letters, which in these versions lacked letter IDs or ‘L-lines’ such as <L ALLEN_001>, where ALLEN_001 is the identifier of the first letter in the Allen collection.
Second, there were two published subsets of CEEC: the CEEC Sampler (CEECS) and the Parsed CEEC (PCEEC), the latter of which was provided in three formats: plain text, tagged and parsed. The PCEEC was also stored on the P drive and had L-lines for all the letters in it, so that the letters could be unambiguously referenced in the separate metadata file.
Third, there were the normalised versions of CEEC, CEECE and CEECSU, known by the cover term SCEEC (short for Standardised-spelling Corpora of Early English Correspondence). We will call them CEEC-norm, CEECE-norm and CEECSU-norm. Unlike their non-normalised counterparts, the normalised collections did include L-lines; however, they only included a subset of the non-normalised collections. They were stored on the P drive in both tagged and untagged formats: the tagged versions included the original variants inside XML-like tags, whereas the untagged ones had no such tags.
Fourth, there was the brand new POS-Tagged CEECE (TCEECE) that we had just finished prior to this project (see the TCEECE manual). TCEECE was based on CEECE-norm that had been further normalised, converted into XML and annotated by a part-of-speech tagger.
Finally, the basic corpora CEEC, CEECE and CEECSU as well as PCEEC had also been imported into CEECer, a web-based search engine with associated metadata about the letters and their writers. As opposed to the P drive versions, all the letters in CEECer had L-lines, but the header sections that preceded each collection on the P drive had been lost in the import process.
All in all, there were various versions of the CEEC corpora that were more or less related, but not in a straightforward way. Some changes had been made to some versions that had not been synced with the parallel versions. The same letter might appear in a slightly different form and even bear a different identifier in different corpus versions. Not even the collections could be consistently individuated across the different versions. There were also many deviations from the expected COCOA coding here and there. This mess had to be sorted out before the XML conversion could begin.
Why the mess?
The history of the L-line and the letter ID plays a crucial role in understanding the genesis of unsynced corpus versions. The original P drive versions of CEEC, CEECE and CEECSU did not contain any L-lines at all. The L-line was first introduced in the PCEEC, which for copyright reasons did not include all of the letters in the CEEC. When CEEC, CEECE, CEECSU and PCEEC were imported into CEECer, L-lines were added to all letters, which resulted in some ‘half’ IDs (e.g. BROWNE_043.5) and other peculiarities. The normalised versions were based on the P drive versions, however, so the L-lines were added again to the normalised versions, which resulted in more errors.
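To make the shape of an L-line concrete, here is a minimal Python sketch that reads one and reports whether it carries a ‘half’ ID. The pattern is an assumption inferred only from the examples in this manual (such as <L ALLEN_001> and BROWNE_043.5); the function names are hypothetical and are not part of any project tool.

```python
import re

# Assumed shape: <L KEY_NUMBER>, where NUMBER may carry a fractional part.
L_LINE = re.compile(r"^<L ([A-Z0-9]+)_(\d+(?:\.\d+)?)>$")


def parse_l_line(line):
    """Return (collection_key, letter_number) or None if this is not an L-line."""
    match = L_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def is_half_id(letter_number):
    """A 'half' ID has a fractional part, e.g. 043.5."""
    return "." in letter_number
```

The letter number is kept as a string so that leading zeros survive a round trip.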
Another source of errors is the person ID that appears in the ‘Q-line’ (such as <Q A 1497? T RFOX>, where RFOX is the person ID of Richard Fox, the author of the letter). When the corpora were imported into CEECer, it was discovered that different collections (particularly those in different subcorpora) might employ the same person ID for different persons, as the person ID had been derived directly from the person’s first and last name. These IDs were corrected in the CEECer metadata but not in the P drive corpora.
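The kind of check that surfaces such clashes can be sketched in Python as follows. It assumes that the person ID is the last token of the Q-line, which holds for the examples in this manual but is not a documented rule. A shared ID only marks a candidate for manual review, since the same person may legitimately appear in several collections. The function names are hypothetical.

```python
import re
from collections import defaultdict

Q_LINE = re.compile(r"^<Q\s+(.+?)\s*>$")


def person_id(q_line):
    """Return the last token of a Q-line (assumed to be the person ID), or None."""
    match = Q_LINE.match(q_line.strip())
    if match is None:
        return None
    return match.group(1).split()[-1]


def shared_person_ids(q_lines_by_collection):
    """Map each person ID used in more than one collection to those collections."""
    seen = defaultdict(set)
    for collection, q_lines in q_lines_by_collection.items():
        for q_line in q_lines:
            pid = person_id(q_line)
            if pid is not None:
                seen[pid].add(collection)
    return {pid: sorted(cols) for pid, cols in seen.items() if len(cols) > 1}
```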
Pre-processing
We decided to merge the P drive and CEECer versions of CEEC, CEECE and CEECSU and the tagged versions of CEEC-norm, CEECE-norm and CEECSU-norm into one master corpus that would be kept in a Git repository where the version history could be tracked and controlled automatically. We wanted to rid the corpus of all known errata and ensure that the letter identifiers be consistent across those parallel corpora that remained. It would be the new master corpus that would then be converted into XML. What follows is a reconstruction of the pre-conversion process.
In the version history of our GitLab project, the steps were actually taken in a different order than what is presented here. Now that the process is complete, it is easy to see that the actual order was far from optimal. That is why we present here an alternative version history where the changes are made in a more logical order that is easier to follow and understand. The end product is, nevertheless, exactly the same master corpus as in the actual Git repository.
File division
We began by aligning the basic collections with their normalised versions and determining for each collection the file that would be included in the new master corpus. Whenever a collection had both a non-normalised and a (tagged) normalised version, we preferred the normalised one, as the non-normalised version could be reverted from it automatically, and the normalised one also contained L-lines which the non-normalised one did not. For each row in the table below, if the ‘final file’ column is empty, it means the final file is the normalised one—unless that is empty too, in which case the final file is the non-normalised one. The collections without a normalised file are mostly from the 15th century, which represents Late Middle English and has so far been deemed too challenging to normalise.
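The reading rule for the table can be stated compactly in code. This is only a sketch of the rule as worded above: an explicit final file wins, otherwise the normalised file, otherwise the non-normalised one. The function name is hypothetical.

```python
def final_file(non_normalised, normalised="", final=""):
    """Resolve which file enters the master corpus for one row of the table."""
    if final:
        return final
    if normalised:
        return normalised
    return non_normalised
```

For instance, the Cely row (no normalised file) resolves to FCELY, and the Hatton 2 row resolves to the renamed FHATTON2.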
On the non-normalised side, the biggest collections had once been split into smaller files because of a restriction on the file size imposed by the WordCruncher application. The respective normalised collections had been merged into one file (with one exception). Even those non-normalised collections that did not have normalised counterparts were now merged into one file, so that there would be only one file for each collection.
Corpus | Collection | Non-normalised file | Normalised file | Final file | Notes | CEECer
CEEC | Allen | FALLEN | FALLEN | | | x Allen
CEEC | Arundel | FARUNDEL | FARUNDEL | | | x Arundel
CEEC | Bacon | F1BACON, F2BACON, F3BACON | FBACON | | | x Bacon
CEEC | Barrington | FBARRING | FBARRING | | | x Barrington
CEEC | Basire | FBASIRE | FBASIRE | | | Basire
CEEC | Baxter & Eliot | FBAXTER | FBAXTER | | | x Baxter
CEEC | Bentham | FBENTHAM | FBENTHAM | | | x Bentham
CEEC | Brereton | FBRERETO | FBRERETO | | | x Brereton
CEEC | Browne | FBROWNE | FBROWNE | | | x Browne
CEEC | Bryskett | FBRYSKET | FBRYSKET | | | x Bryskett
CEEC | Cecil | FCECIL | FCECIL | | | x Cecil
CEEC | Cely | FCELY | | | | x Cely
CEEC | Chamberlain | FCHAMBER | FCHAMBER | | | x Chamberlain
CEEC | Charles | FCHARLES | FCHARLES | | | Charles
CEEC | Clerk | FCLERK | FCLERK | | | x Clerk
CEEC | Clifford | FCLIFFOR | FCLIFFO | | | x Clifford
CEEC | Conway | FCONWAY | FCONWAY | | | x Conway
CEEC | Corie | FCORIE | FCORIE | | | x Corie
CEEC | Cornwallis | FCORNWAL | FCORNWAL | | | Cornwallis
CEEC | Cosin | FCOSIN | FCOSIN | | | Cosin
CEEC | Cromwell | FCROMWEL | FCROMWEL | | | x Cromwell
CEEC | Derby | FDERBY | FDERBY | | | x Derby
CEEC | Duppa | FDUPPA | FDUPPA | | | x Duppa
CEEC | Edmondes | FEDMONDE | FEDMONDE | | | x Edmondes
CEEC | Elyot | FELYOT | FELYOT | | | x Elyot
CEEC | Essex | FESSEX | FESSEX | | | x Essex
CEEC | Ferrar | FFERRAR | FFERRAR | | | x Ferrar
CEEC | Ffarington | FFFARING | FFFARING | | | x Ffarington
CEEC | Fitzherbert | FFITZHER | FFITZHER | | | x Fitzherbert
CEEC | Fleming | FFLEMING | FFLEMING | | | x Fleming
CEEC | Fox | FFOX | FFOX | | | x Fox
CEEC | Gardiner | FGARDIN | FGARDIN | | | x Gardiner
CEEC | Gawdy | FGAWDY | FGAWDY | | | x Gawdy
CEEC | Gawdy L | FGAWDYL | FGAWDYL | | | x Gawdy Lettice
CEEC | Giffard | FGIFFARD | FGIFFARD | | | x Giffard
CEEC | Haddock | FHADDOCK | FHADDOCK | | | x Haddock
CEEC | Hamilton | FHAMILTO | FHAMILTO | | | Hamilton
CEEC | Harington | FHARING | FHARING | | | x Harington
CEEC | Harley | FHARLEY | FHARLEY | | | Harley
CEEC | Hart | FHART | FHART | | | x Hart
CEEC | Harvey | FHARVEY | FHARVEY | | | x Harvey
CEEC | Hastings | FHASTING | FHASTING | | | x Hastings
CEEC | Hatton | FHATTON | FHATTON | | | x Hatton
CEEC | Henry VIII | FHENRY8 | FHENRY8 | | | x Henry VIII
CEEC | Henslowe | FHENSLOW | FHENSLOW | | | Henslowe, x Henslowe
CEEC | Holles | FHOLLES | FHOLLES | | | x Holles
CEEC | Hoskyns | FHOSKYNS | FHOSKYNS | | | x Hoskyns
CEEC | Hutton | FHUTTON | FHUTTON | | | Hutton
CEEC | Johnson | F1JOHNSO, F2JOHNSO, F3JOHNSO | FJOHNSO | | | x Johnson
CEEC | Jones | FJONES | FJONES | | | Jones
CEEC | Jonson | FJONSON | FJONSON | | | x Jonson
CEEC | Knyvett | FKNYVETT | FKNYVETT | | | x Knyvett
CEEC | Leycester | FLEYCEST | FLEYCEST | | | Leycester
CEEC | Lisle | FLISLE | FLISLE | | | x Lisle
CEEC | Lowther | FLOWTHER | FLOWTHER | | | x Lowther
CEEC | Marchall | FMARCHAL | | | | Marchall
CEEC | Marescoe | FMARESCO | FMARESCO | | | x Marescoe
CEEC | Marvell | FMARVELL | FMARVELL | | | x Marvell
CEEC | Minette | FMINETTE | FMINETTE | | | x Minette
CEEC | More | FMORE | FMORE | | | x More
CEEC | Original 1 | FORIGIN1 | FORIGIN1 | | | Original 1
CEEC | Original 2 | FORIGIN2 | FORIGIN2 | | | Original 2
CEEC | Original 3 | FORIGIN3 | FORIGIN3 | | | Original 3
CEEC | Osborne | FOSBORNE | FOSBORNE | | | x Osborne
CEEC | Oxinden | F1OXINDE, F2OXINDE | | | These were deleted because they were already included in FOXINDEN | x Oxinden
CEEC | " | FOXINDEN | FOXINDEN | | | "
CEEC | Paget | FPAGET | FPAGET | | | x Paget
CEEC | Parkhurst | FPARKHUR | FPARKHUR | | | x Parkhurst
CEEC | Paston | F1PASTON, F2PASTON, F3PASTON, F4PASTON | | FPASTON | The non-normalised collections were merged into one | x Paston
CEEC | Paston K | FPASTONK | FPASTONK | | | x Paston Katherine
CEEC | Pepys | FPEPYS | FPEPYS | | | x Pepys
CEEC | Petty | FPETTY | FPETTY | | | x Petty
CEEC | Plumpton | FPLUMPTO | | | | Plumpton
CEEC | Pory | FPORY | FPORY | | | x Pory
CEEC | Prideaux | FPRIDEAU | FPRIDEAU | | | x Prideaux
CEEC | Rerum | FRERUM | | | | Rerum
CEEC | Royal 1 | FROYAL1 | FROYAL1 | | | Royal 1
CEEC | Royal 2 | FROYAL2 | FROYAL2 | | | Royal 2, x Royal 2
CEEC | Royal 3 | FROYAL3 | FROYAL3 | | | x Royal 3
CEEC | Rutland | FRUTLAND | | | | x Rutland
CEEC | Shillingford | FSHILLIN | | | | Shillingford
CEEC | Signet | FSIGNET | | | | x Signet
CEEC | Smyth | FSMYTH | FSMYTH | | | x Smyth
CEEC | Stapylton | FSTAPYLT | FSTAPYLT | | | x Stapylton
CEEC | Stiffkey | FSTIFFKE | FSTIFFKE | | | x Stiffkey
CEEC | Stockwell | FSTOCKWE | FSTOCKWE | | | x Stockwell
CEEC | Stonor | FSTONOR | | | | Stonor
CEEC | Stuart | FSTUART | FSTUART | | | x Stuart
CEEC | Tixall | FTIXALL | FTIXALL | | | Tixall
CEEC | Verstegan | FVERSTEG | FVERSTEG | | | x Verstegan
CEEC | Wentworth | FWENTWOR | FWENTWOR | | | x Wentworth
CEEC | WeSa | FWESA | FWESA | | | WeSa
CEEC | Wharton | FWHARTON | FWHARTON | | | Wharton
CEEC | Willoughby | FWILLOUG | FWILLOUG | | | x Willoughby
CEEC | Wilmot | FWILMOT | FWILMOT | | | x Wilmot
CEEC | Wood | FWOOD | FWOOD | | | x Wood
CEEC | Wyatt | FWYATT | FWYATT | | | x Wyatt
CEEC | | FBOHOLD | | | This was deleted because it was an old version of FROYAL2 |
CEECE | Addison | FADDISON | FADDISON | | | z Addison
CEECE | Austen | FAUSTEN | FAUSTEN | | | z Austen
CEECE | Banks | FBANKS | FBANKS | | | z Banks
CEECE | Bentham J | FBENTHAJ | FBENTHAJ | | | z Bentham Jeremy
CEECE | Blomefield | FBLOMEFI | FBLOMEFI | | | z Blomefield
CEECE | Bolton | FBOLTON | FBOLTON | | | z Bolton
CEECE | Bowrey | FBOWREY | FBOWREY | | | z Bowrey
CEECE | Burney | FBURNEY | FBURNEY | | | z Burney
CEECE | Burney F | FBURNEYF | FBURNEYF | | | z Burney F
CEECE | Bute | FBUTE | FBUTE | | | z Bute
CEECE | Carter | FCARTER | FCARTER | | | z Carter
CEECE | Champion | FCHAMPIO | FCHAMPIO | | | z Champion
CEECE | Clavering | FCLAVERI | FCLAVERI | | | z Clavering
CEECE | Clift | FCLIFT | FCLIFT | | | z Clift
CEECE | Cowper S | FCOWPERS | FCOWPERS | | | z Cowper S
CEECE | Cowper W | FCOWPERW | FCOWPERW | | | z Cowper W
CEECE | Crisp | FCRISP | FCRISP | | | z Crisp
CEECE | Culley | FCULLEY | FCULLEY | | | z Culley
CEECE | Darwin | FDARWIN | FDARWIN | | | z Darwin
CEECE | Defoe | FDEFOE | FDEFOE | | | z Defoe
CEECE | Dodsley | FDODSLEY | FDODSLEY | | | z Dodsley
CEECE | Draper | FDRAPER | FDRAPER | | | z Draper
CEECE | Dukes | FDUKES | FDUKES | | | z Dukes
CEECE | Evelyn | FEVELYN | FEVELYN | | | z Evelyn
CEECE | Evelyn 2 | FEVELYN2 | FEVELYN2 | | | z Evelyn 2
CEECE | Fleming 2 | F2FLEMIN, F3FLEMIN | F2FLEMIN, F3FLEMIN | FFLEMIN2 | The normalised collections were merged into one | z Fleming 2
CEECE | Fleming X | FFLEMINX | FFLEMINX | | | z Fleming Extra
CEECE | Foundling | FFOUNDLI | FFOUNDLI | | | z Foundling
CEECE | Garrick | FGARRICK | FGARRICK | | | z Garrick
CEECE | Gay | FGAY | FGAY | | | z Gay
CEECE | George 3 | FGEORGE3 | FGEORGE3 | | | z George 3
CEECE | George 3a | FGEORG3A | FGEORG3A | | | z George 3A
CEECE | George 4 | FGEORGE4 | FGEORGE4 | | | z George 4
CEECE | Gibbon | FGIBBON | FGIBBON | | | z Gibbon
CEECE | Giffard 2 | FGIFFAR2 | FGIFFAR2 | | | z Giffard 2
CEECE | Gower | FGOWER | FGOWER | | | z Gower
CEECE | Gray | FGRAY | FGRAY | | | z Gray
CEECE | Haddock 2 | FHADDOC2 | FHADDOC2 | | | z Haddock 2
CEECE | Hatton 2 | fhatton2 | fhatton2 | FHATTON2 | Renamed | z Hatton 2
CEECE | Henry | FHENRY | FHENRY | | | z Henry
CEECE | Hurd | FHURD | FHURD | | | z Hurd
CEECE | Johnson S | FJOHNSOS | FJOHNSOS | | | z Johnson
CEECE | Jones W | FJONESW | FJONESW | | | z Jones W
CEECE | Lennox | FLENNOX | FLENNOX | | | z Lennox
CEECE | Liddell | FLIDDELL | FLIDDELL | | | z Liddell
CEECE | Melbourne | FMELBOUR | FMELBOUR | | | z Melbourne
CEECE | Montagu | FMONTAGU | FMONTAGU | | | z Montagu
CEECE | Newdigate | FNEWDIGA | FNEWDIGA | | | z Newdigate
CEECE | North | FNORTH | FNORTH | | | z North
CEECE | Original 4 | FORIGIN4 | FORIGIN4 | | | z Original 4
CEECE | Pauper | FPAUPER | FPAUPER | | | z Pauper
CEECE | Pepys 2 | FPEPYS2 | FPEPYS2 | | | z Pepys 2
CEECE | Pepys 3 | FPEPYS3 | FPEPYS3 | | | z Pepys 3
CEECE | Perrot | FPERROT | FPERROT | | | z Perrot Jane
CEECE | Petty 2 | FPETTY2 | FPETTY2 | | | z Petty 2
CEECE | Pierce | FPIERCE | FPIERCE | | | z Pierce
CEECE | Pinney | FPINNEY | FPINNEY | | | z Pinney
CEECE | Piozzi | FPIOZZI | FPIOZZI | | | z Piozzi
CEECE | Pitt | FPITT | FPITT | | | z Pitt
CEECE | Pitt 2 | FPITT2 | FPITT2 | | | z Pitt 2
CEECE | Pope | FPOPE | FPOPE | | | z Pope
CEECE | Porter | FPORTER | FPORTER | | | z Porter
CEECE | Prideaux 2 | FPRIDEA2 | FPRIDEA2 | | | z Prideaux 2
CEECE | Purefoy | FPUREFOY | FPUREFOY | | | z Purefoy
CEECE | Royal 4 | FROYAL4 | FROYAL4 | | | z Royal 4
CEECE | Sancho | FSANCHO | FSANCHO | | | z Sancho
CEECE | Secker | FSECKER | FSECKER | | | z Secker
CEECE | Stubs | FSTUBS | FSTUBS | | | z Stubs
CEECE | Swift | FSWIFT | | | The normalised collection is kept separately because of differences in Sample 2 | z Swift
CEECE | " | | FSWIFT | FSWIFT_norm | " | "
CEECE | Tixall 2 | FTIXALL2 | FTIXALL2 | | | z Tixall 2
CEECE | Twining | FTWINING | FTWINING | | | z Twining
CEECE | Wanley | FWANLEY | FWANLEY | | | z Wanley
CEECE | Warton | FWARTON | FWARTON | | | z Warton
CEECE | Wedgwood | FWEDGWOO | FWEDGWOO | | | z Wedgwood
CEECE | Wentworth 2 | FWENTWO2 | FWENTWO2 | | | z Wentworth 2
CEECE | Wollstonecraft | FWOLLSTO | FWOLLSTO | | | z Wollstonecraft
CEECE | Young | FYOUNG | FYOUNG | | | z Young
CEECSU | Arundel 2 | FARUNDE2 | FARUNDE2 | | | y Arundel 2
CEECSU | Bacon D | FBACOND | FBACOND | | | y Bacon Dorothy
CEECSU | Bacon X | FBACONX | FBACONX | | | y Bacon Extra
CEECSU | Betts | FBETTS | FBETTS | | | y Betts
CEECSU | Cary | FCARY | FCARY | | | y Cary
CEECSU | Factory | FFACTOR1, FFACTOR2, FFACTOR3 | | | These were deleted because they were already included in FFACTORY | y Factory
CEECSU | " | FFACTORY | FFACTORY | | | "
CEECSU | Gardiner 2 | FGARDIN2 | FGARDIN2 | | | y Gardiner 2
CEECSU | Gawdy 2 | FGAWDY2 | FGAWDY2 | | | y Gawdy 2
CEECSU | Grene | FGRENE | FGRENE | | | y Grene
CEECSU | Knyvett 2 | FKNYVET2 | FKNYVET2 | | | y Knyvett 2
CEECSU | Lisle H | FLISLEH | | | | y Lisle H
CEECSU | Oxinden X | FOXINDEX | FOXINDEX | | | y Oxinden Extra
CEECSU | Paston X | FPASTONX | | | | y Paston Extra
CEECSU | Plumpton 2 | FPLUMPT2 | | | | y Plumpton 2
CEECSU | Ralegh | FRALEGH | FRALEGH | | | y Ralegh
CEECSU | Ralegh 2 | FRALEGH2 | FRALEGH2 | | | y Ralegh 2
CEECSU | Symcotts | FSYMCOTT | | | | y Symcotts
CEECSU | Thynne | FTHYNNE | | | | y Thynne
CEECSU | Zouche | FZOUCHE | | | | y Zouche
Ordering of letters
We spotted two collections where the ordering of letters did not correspond to PCEEC and CEECer. In the second of these, not only the letter order but also the letter IDs conflicted with those versions.
Corpus | Collection/file | Samples/letters | Notes
CEEC | FCLIFFO | Samples 1 and 2 | Sample 2 (with letter IDs from 1 to 75) was moved to before sample 1 (with letter IDs from 76 to 105) so that the letter order would match that of PCEEC and CEECer
" | FHADDOCK | Letters 11 and 12 | The letter IDs and ordering of the last two letters were interchanged to match those of PCEEC and CEECer
Character encoding
The encoding was unified to UTF-8 throughout the corpus.
The ‘old line’ refers to the line number as it was in the corpus after the changes to file division and letter order (see the two previous sections), and the ‘new line’ refers to the line number as it is in the final version of the corrected corpus.
Corpus |
Collection |
Old line(s) |
New line(s) |
Old text |
New text |
CEEC |
FBACON |
17 |
17 |
�.WORD.� OR �WORD� |
|.WORD.| OR |WORD| |
|
FHENRY8 |
4 |
4 |
Z�RICH |
ZÜRICH |
|
FHENSLOW |
4 |
4 |
KER�NEN |
KERÄNEN |
|
FMARCHAL |
3 |
3 |
KER�NEN |
KERÄNEN |
|
FROYAL2 |
10 |
10 |
K�NIGIN VON B�HMEN |
KÖNIGIN VON BÖHMEN |
|
" |
11 |
11 |
KURF�RSTEN |
KURFÜRSTEN |
|
" |
13 |
13 |
T�BINGEN |
TÜBINGEN |
|
FWILMOT |
6 |
6 |
M�LLER |
MÜLLER |
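A minimal Python sketch of this kind of unification follows. It assumes that files which were not already valid UTF-8 were in a single-byte Western European encoding. The choice of cp1252 as the fallback is an assumption, as the legacy encoding of the original files is not stated here. Cases such as the first row of the table, where a damaged character was replaced by a different one, would still need a manual decision.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return the content re-encoded as UTF-8.

    Bytes that already decode as UTF-8 are passed through unchanged;
    anything else is decoded with the assumed legacy encoding first.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback)
    return text.encode("utf-8")
```

Trying UTF-8 first matters: decoding a file that is already UTF-8 with a single-byte codec would silently turn each accented letter into two wrong characters.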
Parameter coding
The standard COCOA parameters are documented in e.g. the CEECS manual and also below. Deviations from the standard had to be corrected so that the coding could be automatically converted into XML. At this stage, we also made the necessary corrections to the IDs of letters, persons and collections.
Corpus |
Collection |
Old line(s) |
New line(s) |
Old text |
New text |
Notes |
CEEC |
FBARRING |
1394 |
1394 |
<SIR THOMAS BARRINGTON> |
<X SIR THOMAS BARRINGTON> |
|
|
" |
1482 |
1482 |
<SIR FRANCIS HARRIS> |
<X SIR FRANCIS HARRIS> |
|
|
FCHARLES |
11 |
11 |
<SAMPLE 1> |
<S SAMPLE 1> |
|
|
" |
162 |
162 |
<SAMPLE 2> |
<S SAMPLE 2> |
|
|
FHASTING |
1 |
1 |
<B FHASTINGS> |
<B FHASTING> |
|
|
FHENSLOW |
1 |
1 |
<B FHENSLOWE> |
<B FHENSLOW> |
|
|
" |
15 |
15 |
<L HENSLOW_001> |
<L HENSLO1_001> |
The IDs were changed to match the ones in PCEEC, CEECer and the untagged file on the P drive |
|
" |
49 |
49 |
<L HENSLOW_002> |
<L HENSLO1_002> |
|
" |
67 |
67 |
<L HENSLOW_003> |
<L HENSLO1_003> |
|
FJOHNSO |
1 |
1 |
<B F1JOHNSO> |
<B FJOHNSO> |
|
|
" |
7112–5 |
|
<B F2JOHNSO> |
|
Deleted |
|
" |
13223–6 |
|
<B F3JOHNSO> |
|
Deleted |
|
FMORE |
2695 |
2695 |
<L MORE_03> |
<L MORE_033> |
The erroneous ID remains in the published PCEEC |
|
FORIGIN2 |
909 |
909 |
<L ORIGIN2_020> |
<L ORIGIN2_019.5> |
|
|
" |
962 |
962 |
<L ORIGIN2_021> |
<L ORIGIN2_020> |
|
|
" |
1027 |
1027 |
<L ORIGIN2_022> |
<L ORIGIN2_021> |
|
|
" |
etc. |
etc. |
etc. |
etc. |
|
|
FOXINDEN |
12705 |
12705 |
<Q A 1662 FN HOXINDEN>. |
<Q A 1662 FN HOXINDEN> |
|
CEECE |
FEVELYN |
2028 |
2028 |
<SAMPLE 2> |
<S SAMPLE 2> |
|
|
" |
4525, 4583 |
4525, 4583 |
JJACKSON |
J2JACKSON |
|
|
FFLEMIN2 |
9356 |
9354 |
JBANKES |
JBANCKES |
|
|
FFOUNDLI |
5441, 5466 |
5441, 5466 |
FRUSSELL |
FRUSSELL2 |
|
|
FGEORGE4 |
1 |
1 |
<D FGEORGE4> |
<B FGEORGE4> |
|
|
FPEPYS3 |
|
1–10 |
|
<B FPEPYS3>
[^SAMPLE 1 = PARTICULAR FRIENDS. THE CORRESPONDENCE OF SAMUEL
PEPYS AND JOHN EVELYN. EDITED BY GUY DE LA BÉDOYÈRE. WOODBRIDGE:
THE BOYDELL PRESS. 1997.
SAMPLE 2 = THE LETTERS OF SAMUEL PEPYS AND HIS FAMILY CIRCLE.
EDITED BY HELEN TRUESDELL HEATH. OXFORD 1955.^]
<S SAMPLE 1> |
|
|
" |
169–70 |
180 |
|
<S SAMPLE 2> |
|
|
FPIOZZI |
236 |
236 |
EMONTAGU |
E2MONTAGU |
|
|
FPOPE |
961, 994, 1067, 1150, 1217, 1362, 1387, 1438, 1633, 1883 |
EHARLEY |
E2HARLEY |
|
|
FSWIFT |
181, 306, 566 |
189, 319, 586 |
RHARLEY |
R2HARLEY |
|
|
" |
1998, 2121, 2146, 3349 |
2042, 2168, 2194, 3419 |
HHOWARD |
HEHOWARD |
|
|
FSWIFT_norm |
194, 324, 591 |
RHARLEY |
R2HARLEY |
|
|
" |
2047, 2173, 2199, 3424 |
HHOWARD |
HEHOWARD |
|
|
" |
5141 |
5141 |
<Q A 1735? TC EGERMAIN> |
<Q A 1735 TC EGERMAIN> |
|
|
FWENTWO2 |
1409 |
1409 |
<ISABELLA WENTWORTH> |
<X ISABELLA WENTWORTH> |
|
|
" |
2634 |
2634 |
<WILLIAM BERKELEY> |
<X WILLIAM BERKELEY> |
|
|
" |
4824, 4842, 4893, 4910, 4925, 4949, 5196, 5392, 5988, 6015, 6632, 6709, 6737, 6759, 6816, 6850, 6879 |
WWENTWORTH |
W2WENTWORTH |
|
|
FYOUNG |
1943 |
1943 |
MHARLEY |
M2HARLEY |
|
CEECSU |
FFACTORY |
1 |
1 |
<B FFACTOR1> |
<B FFACTORY> |
|
|
" |
9, 560, 865, 3527, 6140, 13019, 14980, 15011 |
9, 560, 865, 3527, 6140, 13017, 14976, 15007 |
WADAMS |
WMADAMS |
|
|
" |
6663–4 |
|
<B FFACTOR2> |
|
Deleted |
|
" |
11768 |
11766 |
EWILMOT |
EDWILMOT |
|
|
" |
13242–3 |
|
<B FFACTOR3> |
|
Deleted |
|
FLISLEH |
2–3 |
3–5 |
|
[^THE LISLE LETTERS, VOLS I–V. ED. BY MURIEL ST. CLARE. CHICAGO:
UNIVERSITY OF CHICAGO PRESS. 1981.^]
|
|
|
FRALEGH2 |
880–881 |
881 |
|
<X WALTER RALEGH> |
|
In addition to the changes above, L-lines were added to those collections that lacked them.
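Several of the corrections above (a missing X or S code, a D in place of a B, stray text after the closing bracket) are of a kind that a simple line check can flag. The Python sketch below is illustrative only. The set of accepted codes is limited to those that occur in this manual's examples and is an assumption, not the full standard, and the function name is hypothetical.

```python
import re

# Codes seen in this manual's examples; the real standard may define more.
KNOWN_CODES = {"B", "S", "L", "Q", "X", "P"}
PARAM_LINE = re.compile(r"^<([^<>]*)>(.*)$")


def check_parameter_line(line):
    """Return a description of the problem, or None if nothing is flagged."""
    match = PARAM_LINE.match(line.strip())
    if match is None or not match.group(1)[:1].isupper():
        return None  # running text or an XML-like tag such as <normalised ...>
    code, _, value = match.group(1).partition(" ")
    if code not in KNOWN_CODES:
        return f"unknown parameter code {code!r}"
    if not value.strip():
        return f"parameter {code!r} has no value"
    if match.group(2).strip():
        return f"unexpected text after parameter: {match.group(2).strip()!r}"
    return None
```

A check like this cannot catch wrong values, such as a mistyped letter ID or a person ID shared by two people. Those need comparison against the metadata.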
Text-level coding
The standard COCOA text-level codes are documented in e.g. the CEECS manual and also below. Deviations from the standard had to be corrected so that the coding could be automatically converted into XML. There were also some code instances that were not incorrect as such but still had to be changed to meet the stricter requirements of XML.
Corpus |
Collection |
Old line(s) |
New line(s) |
Old text |
New text |
Notes |
CEEC |
FARUNDEL |
2185 |
2185 |
will send you. l |
will send you. I |
Lower case L to upper case I |
|
FBACON |
3485 |
3485 |
[{of wh} |
[ {of wh} |
|
|
" |
6313–5 |
6313–5 |
[} [\PRIVY COUNCIL TO SIR CHRISTOPHER HEYDON, SIR WILLIAM BUTTS,
<P I,221>
NATHANIEL BACON AND RALPH SHELTON, COMMISSIONERS IN A CASE OF
|
[} [\PRIVY COUNCIL TO SIR CHRISTOPHER HEYDON, SIR WILLIAM BUTTS, ...\]
<P I,221>
[\...NATHANIEL BACON AND RALPH SHELTON, COMMISSIONERS IN A CASE OF
|
|
|
" |
9841 |
9841 |
wor[\ship\] ] |
[wor[\ship\] ] |
|
|
" |
10488 |
10488 |
[of N[\orthumberland\] |
[of N[\orthumberland\] ] |
The error is in the edition |
|
" |
11293–5 |
11293–5 |
[his {the}
<P II,235>
office <normalised orig="beinge" auto="true">being</normalised> {the}]
|
[his {the} ]
<P II,235>
[office <normalised orig="beinge" auto="true">being</normalised> {the} ]
|
|
|
" |
11781–3 |
11781–3 |
inquired what
<P II,258>
just cause
|
inquired what]
<P II,258>
[just cause
|
|
|
" |
11956 |
11956 |
[{office} |
[ {office} |
|
|
" |
14633–5 |
14633–5 |
[The
<P III,124>
<normalised orig="countenaunce" auto="true">countenance</normalised>
|
[The]
<P III,124>
[<normalised orig="countenaunce" auto="true">countenance</normalised>
|
|
|
FCLIFFO |
3619 |
3619 |
[\ENDORSED,] |
[\ENDORSED\] |
|
|
FCORNWAL |
2565 |
2565 |
childeren w=th my self |
childeren w=th= my self |
|
|
FFLEMING |
2201 |
2201 |
for y=e good |
for y=e= good |
|
|
FHARLEY |
916 |
916 |
(^Octo: 18. 1639^.) |
(^Octo: 18. 1639.^) |
|
|
FHENSLOW |
679 |
679 |
M=ri[{s= ...{]t |
M=ri=[{=s= ...{]t |
|
|
" |
1603 |
1603 |
for yo=u to |
for yo=u= to |
|
|
FJOHNSO |
786 |
786 |
of my w{ill to{] |
of my w[{ill to{] |
|
|
" |
3747 |
3747 |
(^li mer s[{t.^) ; and{] |
(^li mer s[{t.{]^) [{; and{] |
|
|
" |
4841 |
4841 |
(^d Fl.) |
(^d Fl.^) |
|
|
" |
6329 |
6329 |
[\274. SABINE JOHNSON TO JOHN JOHNSON}] |
[\274. SABINE JOHNSON TO JOHN JOHNSON\] |
|
|
" |
9278 |
9274 |
(^lb^ ) |
(^lb^) |
|
|
FLEYCEST |
4043 |
4043 |
[{is\] |
[{is{] |
|
|
" |
6006–8 |
6006–8 |
[\I dare make
<P 342>
none of my servants
|
[\I dare make CROSSED OUT\]
<P 342>
[\none of my servants
|
|
|
FOSBORNE |
1461 |
1461 |
For M=rs Painter |
For M=rs= Painter |
|
|
FOXINDEN |
4131 |
4131 |
[} CLXXI THOMAS BARROW |
[} [\CLXXI THOMAS BARROW |
|
|
FPASTON |
923 |
944 |
[\?\]ch[{...{] |
[\?\] ch[{...{] |
|
|
" |
18099 |
18448 |
[{ [\582. FROM FRIAR JOHN BRACKLEY |
[} [\582. FROM FRIAR JOHN BRACKLEY |
|
|
FPEPYS |
4100 |
4100 |
[my sister |
[\my sister |
|
|
FWILMOT |
657 |
657 |
the w=ch I have |
the w=ch= I have |
|
CEECE |
FADDISON |
1486 |
1486 |
3O=th= July |
30=th= July |
Capital O to zero |
|
FBANKS |
2110 |
2110 |
[torn] |
[\TORN\] |
|
|
FBOWREY |
1044 |
1044 |
Colkers \CAULKERS\] |
Colkers [\CAULKERS\] |
|
|
FBURNEYF |
677–9 |
677–9 |
[\2 1/2 ILLEGIBLE
<P III,188>
LINES\]
|
[\2 1/2 ILLEGIBLE...\]
<P III,188>
[\...LINES\]
|
|
|
FDODSLEY |
453–5 |
453–5 |
the foll[{y
<P 110>
of Noblemen
|
the foll[{y{]
<P 110>
[{of Noblemen
|
|
|
" |
2782–4 |
2782–4 |
altercation about it,
<P 281>
except what might
|
altercation about it,\]
<P 281>
[\CROSSED OUT except what might
|
|
|
" |
3117 |
3117 |
so[{und w=c{]h= |
so[{und w=c={]=h= |
|
|
" |
3433 |
3433 |
(mouldering^) |
(^mouldering^) |
|
|
" |
3949 |
3949 |
[{of W=k{]m= |
[{of W=k={]=m= |
|
|
FDRAPER |
1419 |
1419 |
(27^th October^) |
27(^th October^) |
|
|
FFLEMIN2 |
4010 |
4009 |
(notwithstanding those Provocations} |
(notwithstanding those Provocations) |
|
|
" |
4035 |
4034 |
19=th came safe |
19=th= came safe |
|
|
FLIDDELL |
309 |
309 |
Capt[ain\] |
Capt[\ain\] |
|
|
" |
1007 |
1007 |
sist[\er] |
sist[\er\] |
|
|
" |
1598 |
1598 |
l(eaves\] |
l[\eaves\] |
|
|
" |
2785 |
2785 |
June l0th |
June 10th |
Lower case L to one |
|
FPAUPER |
263 |
263 |
(BERMONDSEY, LONDON} |
(BERMONDSEY, LONDON) |
|
|
FPRIDEA2 |
903 |
903 |
E[\arl\l] |
E[\arl\] |
|
|
FPUREFOY |
3802 |
3802 |
[^SIGN OMIITTED^] |
[^SIGN OMITTED^] |
|
|
FSANCHO |
1759 |
1759 |
[October 17, 1779.\] |
[\October 17, 1779.\] |
|
|
" |
2075 |
2075 |
M\inorit\]y |
M[\inorit\]y |
|
|
FSWIFT |
2666, 2709, 2823, 3031, 3304, 3405, 3448, 3535, 3580, 3714, 3819, 3910, 4232, 4348, 4380, 4581 |
2722, 2767, 2884, 3096, 3372, 3476, 3521, 3611, 3658, 3795, 3904, 3998, 4326, 4445, 4479, 4685 |
|
[^FROM ELIZABETH BERKELEY^] |
Added to the end of the line |
|
FYOUNG |
1424–6 |
1424–6 |
[\STRICKEN
<P 132>
PHRASE\]
|
[\STRICKEN...\]
<P 132>
[\...PHRASE\]
|
|
CEECSU |
FBACOND |
299–301 |
299–301 |
[\ELEVEN
<P 91>
HOURS\]
|
[\ELEVEN...\]
<P 91>
[\...HOURS\]
|
|
|
FFACTORY |
1501, 1685 |
1501, 1685 |
wacadash, |
(\wacadash\) , |
|
|
" |
1869–70 |
1869–70 |
c'nto
per c'nto,
|
(\c'nto
per c'nto\) ,
|
|
|
" |
2153 |
2153 |
Angin Sama's |
(\Angin Sama's\) |
|
|
" |
7111 |
7109 |
catabera |
(\catabera\) |
|
|
" |
7387–9 |
7385–7 |
[\4CM
<P 379>
MISSING\]
|
[\4CM...\]
<P 379>
[\...MISSING\]
|
|
|
" |
7574 |
7572 |
contors, |
(\contors\) , |
|
|
" |
8392 |
8390 |
cataberas |
(\cataberas\) |
|
|
" |
10735 |
10733 |
ditto |
(\ditto\) |
|
|
" |
10930 |
10928 |
pancado, |
(\pancado\) , |
|
|
" |
11109 |
11107 |
vizt |
(\vizt\) |
|
|
" |
14056 |
14052 |
<normalised orig="prowe" auto="true">prow</normalised>, |
(\prowe\) , |
|
|
" |
16817 |
16813 |
(\ocome\) |
ocome |
|
|
" |
17260, 17348 |
17256, 17344 |
Umbera's |
(\Umbera's\) |
|
|
" |
18147 |
18143 |
barsos |
(\barsos\) |
|
|
" |
18453 |
18449 |
deposseta |
(\deposseta\) |
|
|
FSYMCOTT |
324 |
331 |
(\fieri\). |
(\fieri\) . |
|
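Many of the text-level corrections concern delimiters that were left open, closed with the wrong symbol, or improperly nested. A stack-based check of the kind sketched below would flag such cases. This is a minimal Python illustration, not the tool used in the project. The delimiter pairs are the ones that appear in the table above, and superscript codes such as =th= are not covered.

```python
import re

# Opening delimiter -> closing delimiter, as they appear in the corrections.
PAIRS = {"[\\": "\\]", "[^": "^]", "(^": "^)", "[{": "{]", "(\\": "\\)", "[}": "}]"}
CLOSERS = {closer: opener for opener, closer in PAIRS.items()}
TOKEN = re.compile("|".join(re.escape(t) for t in list(PAIRS) + list(CLOSERS)))


def unbalanced(text):
    """Return a list of problems; an empty list means the codes nest properly."""
    problems, stack = [], []
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token in PAIRS:
            stack.append(token)
        elif stack and stack[-1] == CLOSERS[token]:
            stack.pop()
        else:
            problems.append(f"unexpected {token} at offset {match.start()}")
    problems.extend(f"unclosed {token}" for token in stack)
    return problems
```

Running the check on each stretch of text between page-number lines, rather than on whole letters, would also catch codes that span a page break, which the corrections above split into two.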
Custom codes
Unlike other collections in CEEC-400, the Bacon and Willoughby collections in the CEEC involved custom codes that had not been converted into COCOA text-level coding:
Custom code | Meaning
[you] | Inserted words (only in Bacon)
[so [[done]] till the afternone] | Words inserted within an insertion (only in Bacon)
{hard for} | Deleted words
The custom codes were converted into COCOA codes. The conversion was carried out partly automatically and partly manually. We omit the complete conversion tables and only give a few examples so the reader will get the idea:
Custom code | COCOA
river[ward] | riverward [\ward INSERTED\]
{[hym]} | [\hym INSERTED THEN DELETED\]
land{es} | land [\FINAL es DELETED\]
ha{ve}d | had [\have OVERWRITTEN\]
[Whereas it pleased your Lordship to direct your letters to Mr Sprat for {the puttinge} [[omyttinge]] James Tavernor {of} [[to be of]] the jury at Wighton, which was executed accordinglie, I have sithens {exam} inquired what] | Whereas it pleased your Lordship to direct your letters to Mr Sprat for [\the puttinge DELETED\] omyttinge [\omyttinge INSERTED\] James Tavernor [\of DELETED\] to be of [\to be of INSERTED\] the jury at Wighton, which was executed accordinglie, I have sithens [\exam DELETED\] inquired what [\Whereas it pleased ... inquired what INSERTED\]
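The automatic part of such a conversion can be illustrated with a small Python sketch. It is an assumption about how the simplest cases might be handled, not a description of the procedure actually used. It rewrites only free-standing deletions and double-bracket insertions. Word-internal codes (river[ward], land{es}, ha{ve}d), combined codes ({[hym]}) and the outer single-bracket insertion are left untouched, as these are the cases that call for manual judgement.

```python
import re

# Only codes delimited by whitespace (or the string boundary) are converted.
DELETED = re.compile(r"(?<!\S)\{([^{}\[\]]+)\}(?!\S)")
INSERTED = re.compile(r"(?<!\S)\[\[([^{}\[\]]+)\]\](?!\S)")


def convert_simple(text):
    """Convert free-standing {deleted} and [[inserted]] codes into COCOA comments."""
    # [[words]] -> words [\words INSERTED\]
    text = INSERTED.sub(
        lambda m: m.group(1) + " [\\" + m.group(1) + " INSERTED\\]", text
    )
    # {words} -> [\words DELETED\]
    text = DELETED.sub(lambda m: "[\\" + m.group(1) + " DELETED\\]", text)
    return text
```

Applied to the fragment `for {the puttinge} [[omyttinge]] James Tavernor`, the function reproduces the corresponding part of the last example in the table.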
Reversion of normalisation
The pre-conversion process resulted in a new hybrid corpus of normalised and non-normalised collections. The normalised collections still had to be reverted to their non-normalised versions before the corpus could be converted into XML. The reversion was performed by ‘VardStripper’, a Java application written by Lassi Saario. In what follows, the ‘original CEEC-400’ will refer to the corpus as it was at this point, after the reversion and before the conversion.
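The idea of the reversion can be shown with a short Python sketch. It is not the VardStripper code, which is a Java application and is not reproduced in this manual. It only illustrates the principle on the tag shape seen in the correction tables, where the original spelling is stored in the orig attribute.

```python
import re

# Tag shape as seen in the examples:
# <normalised orig="beinge" auto="true">being</normalised>
NORMALISED = re.compile(
    r'<normalised\s+orig="([^"]*)"[^>]*>.*?</normalised>', re.DOTALL
)


def revert(text):
    """Replace each normalised tag with the original spelling it records."""
    return NORMALISED.sub(lambda m: m.group(1), text)
```

The sketch assumes that the orig attribute holds the original spelling verbatim. If it contained escaped characters, they would have to be unescaped as well.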
XML conversion
The XML schema of CEEC-400 has long roots. Before we got funding for the XML conversion of the entire CEEC-400, we had already converted a version of the CEECE-norm as part of the POS tagging project that resulted in the TCEECE. The XML schema of CEEC-400 was based on that of the TCEECE, which had in turn been based on that of the Helsinki Corpus, which had again been based on the TEI standard.
The conversion from COCOA into XML was performed by our own ‘XmlConverter’, a Java application written by Lassi Saario. The resulting XML files were validated against the Document Type Definition by XmlStarlet (version 1.6.1), a command line XML toolkit developed by Mikhail Grushinskiy.
Structure of the corpus
The original corpus was divided into text files, one for each letter collection. We decided to preserve this division. Each COCOA-encoded collection file was converted into an XML-encoded file of the same name.
In the original corpus, each collection began with a header, which was followed by the individual letters. Each letter likewise began with a header, which was followed by the contents of the letter. The XML version follows the same overall structure, illustrated below.
Each XML document begins with the same two lines. The first line specifies the XML version and the character encoding. The second line defines the document type by a reference to an external DTD file. The entities are given an internal declaration as well, since omitting it would cause errors on some browsers which do not support external DTDs.
Each document has teiCollection as its root element. It is made up of a teiHeader element, containing header information about the collection, and a series of TEI elements, representing the individual letters. Each TEI element is likewise made up of a teiHeader element which includes header information about the letter, and a text element which includes the actual contents of the letter.
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE teiCollection SYSTEM "../CEEC.dtd" [
<!ENTITY ETH "Ð">
<!ENTITY eth "ð">
<!ENTITY YOGH "Ȝ">
<!ENTITY yogh "ȝ">
<!ENTITY THORN "Þ">
<!ENTITY thorn "þ">
<!ENTITY pound "£">
]>
<teiCollection xml:id="FFOX">
<teiHeader>...</teiHeader>
<TEI xml:id="FOX_001">
<teiHeader>...</teiHeader>
<text type="letter" xml:lang="eng">...</text>
</TEI>
<TEI xml:id="FOX_002">
<teiHeader>...</teiHeader>
<text type="letter" xml:lang="eng">...</text>
</TEI>
...
</teiCollection>
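As a rough illustration of how this structure can be traversed, the following Python sketch reads a collection and lists the identifiers of its letters. It is not part of the project's tooling: the function name letter_ids and the abbreviated SAMPLE document are assumptions made for the example, and the DOCTYPE declaration is left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Abbreviated, hypothetical sample that mirrors the structure shown above.
SAMPLE = """<teiCollection xml:id="FFOX">
  <teiHeader/>
  <TEI xml:id="FOX_001">
    <teiHeader/>
    <text type="letter" xml:lang="eng"><p>first letter</p></text>
  </TEI>
  <TEI xml:id="FOX_002">
    <teiHeader/>
    <text type="letter" xml:lang="eng"><p>second letter</p></text>
  </TEI>
</teiCollection>"""


def letter_ids(xml_string):
    """Return the collection identifier and the identifiers of its letters."""
    root = ET.fromstring(xml_string)
    letters = [tei.get(XML_ID) for tei in root.findall("TEI")]
    return root.get(XML_ID), letters


print(letter_ids(SAMPLE))  # ('FFOX', ['FOX_001', 'FOX_002'])
```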
Parameter coding
In the original CEEC-400, each file begins with the identifier of the collection (same as the file name), followed by source information:
<B FFOX>
[^LETTERS OF RICHARD FOX 1486-1527. EDITED BY P. S. AND H. M.
ALLEN. OXFORD: CLARENDON PRESS. 1929.^]
In the XML conversion, the identifier is put in the xml:id attribute of the teiCollection opening tag, and the source information is included in a titleStmt element in the teiHeader :
<teiCollection xml:id="FFOX">
<teiHeader>
<fileDesc>
<titleStmt>LETTERS OF RICHARD FOX 1486-1527. EDITED BY P. S. AND H. M. ALLEN. OXFORD: CLARENDON PRESS. 1929.</titleStmt>
</fileDesc>
</teiHeader>
A letter header in the original CEEC-400 consists of an L-line, a Q-line, an X-line and a P-line:
<L FOX_001>
<Q A 1497? T RFOX>
<X RICHARD FOX>
<P 17>
- The L-line gives the letter identifier.
- The Q-line specifies the authenticity of the letter, the year of writing, the relationship between the writer of the letter and the addressee, and the identifier of the writer, respectively.
- The X-line contains the name of the writer in full. (In some collections there is an A-line instead of an X-line, but the content is the same nevertheless.)
- The P-line includes the number of the page on which the letter begins in the source edition. Similar lines appear amidst the body whenever the page changes.
The contents of the lines are included in the XML header as follows:
<TEI xml:id="FOX_001">
<!-- from the L-line -->
<teiHeader>
<fileDesc>
<titleStmt>
<title key="A 1497? T RFOX"></title>
<!-- from the Q-line -->
<author key="RICHARD FOX"></author>
<!-- from the X- (or A-) line -->
</titleStmt>
</fileDesc>
</teiHeader>
The P-line is included in the XML body along with the other P-lines. See the section on page numbers.
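The reading of a letter header can be sketched in Python as follows. This is only an illustration and not the code of XmlConverter: the function name parse_letter_header and the dictionary keys are hypothetical, and the sketch assumes that a Q-line always consists of exactly four whitespace-separated fields, which may not hold for every letter in the corpus.

```python
import re

# One header line: a one-letter code, a space, and the value, in angle brackets.
HEADER_LINE = re.compile(r"^<([LQXAP]) (.*)>$")


def parse_letter_header(lines):
    """Collect the fields of a COCOA letter header into a dictionary."""
    fields = {}
    for line in lines:
        match = HEADER_LINE.match(line.strip())
        if not match:
            continue
        code, value = match.groups()
        if code == "L":
            fields["id"] = value
        elif code == "Q":
            fields["key"] = value  # kept whole, as in the title element
            parts = value.split()
            if len(parts) == 4:  # assumption: four fields in a fixed order
                names = ("authenticity", "year", "relationship", "writer")
                fields.update(zip(names, parts))
        elif code in ("X", "A"):  # some collections use an A-line instead
            fields["author"] = value
        elif code == "P":
            fields["first_page"] = value
    return fields


header = parse_letter_header(
    ["<L FOX_001>", "<Q A 1497? T RFOX>", "<X RICHARD FOX>", "<P 17>"]
)
print(header["id"], header["year"], header["author"])  # FOX_001 1497? RICHARD FOX
```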
The S-lines like <S SAMPLE 1> that sometimes occur between letters mark samples taken from different source editions in the original corpus. They have been converted into XML comments like <!-- SAMPLE 1 --> .
Text-level coding
Textual structure
A letter body in the original CEEC-400 is divided into lines, the maximum length of which is limited to 65 characters. Some of them are P-lines that annotate page breaks; for the rest there is no fixed format. Paragraphs and sentences flow rather freely from one line to another along with code brackets for headings, emendations etc. See the example below.
[} [\705 FROM MARY HOWE TO MR RANKING IN COOPERSALE (THEYDON
GARNON), 13 JANUARY 1731\] }]
Jenaw 13 day 1731
Mr ranking this is to let you know that the doxtor have done
what he can for me but my iees are never the better but rather
wors i ame to be discharge=d= next wandsday i
hope you will be so kind as to send me word how i must come home
by next wandsday morning so with humble sarvis to you and your
good wife
sir I hope you will exquese me in wrighting of a letter but i
did not know no other way So i rest your humble sarvant
mary how patient in
[\CONTINUED CROSSWISE IN LEFT-HAND MARGIN\] peter ward
Unfortunately, the line and page divisions that are so explicit in the original CEEC-400 are irrelevant for the purposes of linguistic research, especially as they do not reflect those of the original manuscript. Much more relevant is the paragraph division, which is also much more implicit. Lines that start paragraphs are usually indented with three spaces, but not always: sometimes the only clue that a line starts a paragraph is that the previous line is shorter than usual. Matters are further complicated by the fact that P-lines sometimes appear in the middle of a paragraph and sometimes between paragraphs. Code brackets often appear in the middle of a paragraph, but sometimes they continue across paragraph breaks, and sometimes they even seem to form paragraphs of their own.
To recognise such delicate divisions may be easy for the human eye, but it is far from easy for a computer. We wanted to try it anyway. The rule of thumb that we gave to our converter is that a line starts a new paragraph if it is indented or if the previous line is shorter than 40 characters. Lines that were recognised as forming a paragraph were then merged and, provided that the paragraph in question was indeed a proper paragraph (as opposed to a single P-line or a bunch of code), put inside a p element. Code bracket sequences that continued across paragraph or page breaks were split at the breaking point in order to keep the element tree well-formed.
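The rule of thumb can be sketched in Python as follows. This is a simplified illustration, not the converter's actual code: the function name split_paragraphs is hypothetical, any leading whitespace is treated as an indent, and the handling of P-lines and of code brackets that cross paragraph breaks is left out.

```python
def split_paragraphs(lines, short=40):
    """Group the lines of a letter body into paragraphs.

    A line starts a new paragraph if it is indented or if the previous
    line is shorter than `short` characters.
    """
    paragraphs = []
    previous = None
    for line in lines:
        indented = line.startswith(" ")
        after_short = previous is not None and len(previous.rstrip()) < short
        if previous is None or indented or after_short:
            paragraphs.append([line.strip()])
        else:
            paragraphs[-1].append(line.strip())
        previous = line
    # Merge the lines of each paragraph into a single string.
    return [" ".join(paragraph) for paragraph in paragraphs]


body = [
    "   Mr ranking this is to let you know that the doxtor have done",
    "what he can for me but my iees are never the better but rather",
    "good wife",
    "sir I hope you will exquese me in wrighting of a letter but i",
    "did not know no other way So i rest your humble sarvant",
]
for paragraph in split_paragraphs(body):
    print("<p>" + paragraph + "</p>")
```

Here the short line ‘good wife’ causes the following line to open a second paragraph even though that line is not indented.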
Special characters
Grave accent symbols that annotated accents (not only grave but acute ones and circumflexes as well) and tildes that annotated abbreviations in the original CEEC-400 remain in the XML edition. Certain special characters have been converted into XML entities according to the following table.
Source edition | Original corpus | XML corpus | Description
& | & | &amp; | ampersand
Ð | +D | &ETH; | upper case eth
ð | +d | &eth; | lower case eth
Ȝ | +G | &YOGH; | upper case yogh
ȝ | +g | &yogh; | lower case yogh
Þ | +T | &THORN; | upper case thorn
þ | +t | &thorn; | lower case thorn
£ | +L | &pound; | pound sign
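The substitution can be sketched in Python as follows. The entity names are those declared in the DOCTYPE of each XML document; the function name convert_special_characters is hypothetical, and the sketch assumes that a plus sign followed by one of the listed letters never occurs in the corpus in any other function.

```python
# COCOA codes and the corresponding XML entity references.
ENTITIES = {
    "+D": "&ETH;",
    "+d": "&eth;",
    "+G": "&YOGH;",
    "+g": "&yogh;",
    "+T": "&THORN;",
    "+t": "&thorn;",
    "+L": "&pound;",
}


def convert_special_characters(text):
    """Replace COCOA special-character codes with XML entity references."""
    # Ampersands are escaped first so that the ampersands of the entity
    # references inserted below are not escaped a second time.
    text = text.replace("&", "&amp;")
    for code, entity in ENTITIES.items():
        text = text.replace(code, entity)
    return text


print(convert_special_characters("+tat & +ge +L5"))
# &thorn;at &amp; &yogh;e &pound;5
```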
Page numbers
Page changes were annotated as P-lines, e.g. <P 45> , in the original CEEC-400. They are converted into empty pb elements, the n attribute of which contains the page number, e.g. <pb n="45" /> . Note that pb elements may appear inside as well as outside p elements.
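A minimal Python sketch of this substitution is given below. The function name convert_page_breaks is hypothetical, the pb element is written in self-closing form so that the output is well-formed XML, and the pattern accepts any content between the P and the closing bracket, on the assumption that not all page numbers are plain Arabic numerals.

```python
import re

# A P-line: "<P", a space, the page number, and the closing bracket.
P_LINE = re.compile(r"<P ([^>]+)>")


def convert_page_breaks(text):
    """Replace COCOA P-lines with empty pb elements."""
    return P_LINE.sub(
        lambda match: '<pb n="{}" />'.format(match.group(1).strip()), text
    )


print(convert_page_breaks("<P 45>"))  # <pb n="45" />
```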
Headings
Headings, annotated with the code [}...}] in the original CEEC-400, are annotated with the code <head>...</head> in the XML edition.
Note that most headings have been added by either editors or compilers, i.e. they are double-coded, as in
[} [\98. TO FANNY BURNEY\] }]
where the inner brackets stand for the editor’s or compiler’s remark. The double-coding is preserved in the XML conversion so that the given example is converted into
<head> <note resp="editor" value="98. TO FANNY BURNEY" /> </head>
See the section on comments for more information.
Emendations
Emendations are annotated with the code [{...{] in the original CEEC-400. These have been converted into supplied elements in the XML version. When the emendation consists of complete words, we simply put the content of the brackets in between the XML tags. In e.g. the following passage,
I turned so sick that I [{could{] hardly speak
the [{could{] is converted into
<supplied>could</supplied>
The case is a bit trickier when the emendation contains partial words, as in w[{ife and son{] . These kinds of emendations we have converted so that in between the XML tags there is the final amended expression, while the original code is included in an orig attribute:
<supplied range="1,10" orig="w[{ife and son{]">wife and son</supplied>
The ‘range’ attribute specifies the extent of the emendation. The characters of the expression between the XML tags are indexed from 0, so that the first character has the index 0, the second has the index 1, and so on. The first number in the range value is the index of the first character in the range, and the second number is the index of the first character after the range. Note that whitespace characters are skipped in the indexing. In the example above, the range 1,10 thus covers the nine characters of ‘ife’, ‘and’ and ‘son’ (indices 1 to 9), while the initial ‘w’ has the index 0. When there are several ranges in the same expression, they are delimited by a semicolon:
<supplied range="7,13;15,16" orig="felysch[{yp of Ho{]ll[{a{]ndars">felyschyp of Hollandars</supplied>
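The indexing can be made concrete with a short Python sketch that derives both the amended expression and the range value from the original code. The function name expand_partial is hypothetical and the sketch is not taken from the converter; the delimiters are parameters, so that the same function can be tried on other bracket codes of the corpus.

```python
def expand_partial(orig, open_code="[{", close_code="{]"):
    """Return the amended expression and its range value.

    Indices count the non-whitespace characters of the amended expression,
    starting from 0; each range ends at the first index outside it.
    """
    text = []
    ranges = []
    index = 0     # non-whitespace characters emitted so far
    start = None  # index at which the currently open bracket began
    i = 0
    while i < len(orig):
        if start is None and orig.startswith(open_code, i):
            start = index
            i += len(open_code)
        elif start is not None and orig.startswith(close_code, i):
            ranges.append("{},{}".format(start, index))
            start = None
            i += len(close_code)
        else:
            text.append(orig[i])
            if not orig[i].isspace():
                index += 1
            i += 1
    return "".join(text), ";".join(ranges)


print(expand_partial("w[{ife and son{]"))
# ('wife and son', '1,10')
print(expand_partial("felysch[{yp of Ho{]ll[{a{]ndars"))
# ('felyschyp of Hollandars', '7,13;15,16')
```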
Comments
The original version of CEEC-400 contains two types of comments added to the body text. Comments by compilers of the corpus are annotated with the code [^...^] , e.g. [^LIST OF NAMES OMITTED^] . Comments by editors of source editions are annotated with the code [\...\] , e.g. [\TORN\] .
In the TEI XML edition of the Helsinki Corpus, both codes are converted into a note element. The author of the comment is specified by a resp attribute which points to his/her name in the document header. For our purposes, however, it is sufficient to separate the editors’ comments from the compilers’ and not specify the individual commentator. The attribute is simply given the value compiler for compilers’ comments and editor for editors’ comments.
Comments, whether they are written by editors or compilers, are actually used for two different purposes. One is a ‘proper’ comment, such as the two previous examples. The other is more like an emendation, as in
an order [\was made\] at his Lordship's instance
The difference between the two kinds is that a proper comment is a comment about the surrounding text, whereas an emendation-like comment is more like a part of the text. They are rather easily distinguished by the fact that a proper comment usually involves several consecutive upper case letters while an emendation-like comment does not. This holds true even when a proper comment contains text that is parallel to the preceding text, as in
it will [\be DELETED\] come safe hither
When the brackets contain a proper comment, we put their contents in an attribute inside an XML tag:
<note resp="compiler" value="LIST OF NAMES OMITTED" />
<note resp="editor" value="be DELETED" />
When the brackets are used to annotate emendations, we follow the same principle as with the [{...{] code explained above. Emendations of complete words are encoded as
<note resp="editor">was made</note>
whereas emendations that contain partial words are encoded as
<note resp="editor" range="3,7" orig="Jan[\uary\]">January</note>
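The distinction between the two kinds of comment can be sketched in Python as follows. This is an illustration under stated assumptions rather than the converter's actual logic: the function names are hypothetical, ‘several consecutive upper case letters’ is taken to mean a run of two or more, and comments attached to partial words (which need the range and orig attributes) are not handled.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# [\...\] is an editor's comment, [^...^] a compiler's comment.
COMMENT = re.compile(r"\[([\\^])(.*?)\1\]")
# Assumption: a run of two or more capitals marks a proper comment.
UPPER_RUN = re.compile(r"[A-Z]{2,}")


def convert_comment(match):
    """Convert one bracketed comment into a note element."""
    resp = "editor" if match.group(1) == "\\" else "compiler"
    content = match.group(2)
    if UPPER_RUN.search(content):
        # Proper comment: the content goes into the value attribute.
        return '<note resp="{}" value={} />'.format(resp, quoteattr(content))
    # Emendation-like comment: the content stays in the running text.
    return '<note resp="{}">{}</note>'.format(resp, escape(content))


def convert_comments(text):
    """Convert all bracketed comments in a passage of body text."""
    return COMMENT.sub(convert_comment, text)


print(convert_comments("it will [\\be DELETED\\] come safe hither"))
# it will <note resp="editor" value="be DELETED" /> come safe hither
print(convert_comments("an order [\\was made\\] at his Lordship's instance"))
# an order <note resp="editor">was made</note> at his Lordship's instance
```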
Type changes
Changes of typeface in the printed source editions were annotated as (^...^) in the original corpus. In the XML edition, they are annotated as <hi rend="type">...</hi> . (This usually corresponds to an underlined passage in the original letter.)
When the change of typeface concerns partial words as in Theo(^log^) , the original coding is preserved in the orig attribute of the hi element as in
<hi rend="type" range="4,7" orig="Theo(^log^)">Theolog</hi>
Foreign language
Passages in a foreign language were annotated with the code (\...\) in the original CEEC-400. In the XML edition, they are annotated with the code <foreign>...</foreign> .
Superscripts
Superscripts in the original corpus were put in between two equals signs, such as
=vi=
w=ch=
p=r=ferm=t=
In the XML corpus, they are annotated as
<hi rend="sup">vi</hi>
<hi rend="sup" range="1,3" orig="w=ch=">wch</hi>
<hi rend="sup" range="1,2;6,7" orig="p=r=ferm=t=">prfermt</hi>
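Since the same sign opens and closes a superscript, the ranges can be found by splitting a word at the equals signs: every second segment is then superscripted. The Python sketch below illustrates this for a single whitespace-free word. The function name convert_superscript is hypothetical, and the sketch assumes that superscripts are not nested inside one another.

```python
def convert_superscript(token):
    """Convert a word containing =...= superscript codes into a hi element."""
    parts = token.split("=")
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced superscript code: " + token)
    text = "".join(parts)
    ranges = []
    index = 0
    for position, part in enumerate(parts):
        if position % 2 == 1:  # segments at odd positions are superscripted
            ranges.append("{},{}".format(index, index + len(part)))
        index += len(part)
    if ranges == ["0,{}".format(len(text))]:
        # The whole word is superscripted: no range or orig is needed.
        return '<hi rend="sup">{}</hi>'.format(text)
    return '<hi rend="sup" range="{}" orig="{}">{}</hi>'.format(
        ";".join(ranges), token, text
    )


for word in ("=vi=", "w=ch=", "p=r=ferm=t="):
    print(convert_superscript(word))
```

Applied to the three examples above, the sketch reproduces the three hi elements shown.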
Post-processing
There were two abbreviations in the Dodsley collection where the superscript code had been used to encode a superscript inside another superscript. These instances had to be hard coded into the converter in order for them to be converted correctly, or else the converter would have taken them to consist of two consecutive superscripts:
COCOA | XML
=Jun=r=.= | <hi rend="sup" range="0,5" orig="=Jun=r=.="><hi rend="sup" range="3,4" orig="Jun=r=.">Junr.</hi></hi>
=Jun=r== | <hi rend="sup" range="0,4" orig="=Jun=r=="><hi rend="sup" range="3,4" orig="Jun=r=">Junr</hi></hi>
Endpoint
At the time of writing, we are adapting the new XML corpus for our own CQPweb server in collaboration with Lancaster University. Once the remaining (15th-century) collections have been normalised, we will consider the possibility of converting the normalised CEEC-400 into an even more comprehensive XML version that would provide access to both the original and normalised variants.
References
For more information on the CEEC corpora, see the front page.