Conversion of the CEEC-400 into XML

A Manual to Accompany the XML Edition

2020

Lassi Saario
Research Unit for Variation, Contacts and Change in English (VARIENG)
Faculty of Arts
University of Helsinki

CEEC-400 is a cover term for a family of corpora: the original Corpus of Early English Correspondence (CEEC), the CEEC Extension (CEECE), the CEEC Supplement (CEECSU) and their various versions. Together they cover a time span of almost 400 years from 1402 to 1800. The corpora are based on published editions of letters, which were sampled and digitised by a team of compilers at the University of Helsinki.

The original CEEC-400 was written in a custom version of the ancient COCOA (Word COunt and COncordance on Atlas) format, based on that of the Helsinki Corpus of English Texts. Different collections were often encoded by different people, mostly research assistants, interpreting instructions written by someone else. As there was no validator to check their output, the codes were prone to errors and inconsistencies, the number of which multiplied with each version of the corpus.

In the autumn of 2018, we were granted funding by the Faculty of Arts to convert CEEC-400 into XML so that it could be imported into more modern platforms (such as CQPweb). We soon found out that the shortcomings of the original coding would have to be detected and corrected before the conversion could take place. This manual documents the correction and conversion project, carried out by Lassi Saario under the supervision of Tanja Säily and Samuli Kaislaniemi.

The documentation is arranged in chronological order. The first two chapters, on the background of the project and on pre-processing, are mostly intended for internal use within VARIENG, whereas the chapter on the XML conversion is likely to be of more interest to the general public.

Starting point

Family tree of CEEC-400

The CEEC corpora have a large family tree, the branches of which had grown quite far apart from each other by the time the conversion was about to start. Here’s a brief inventory of all the different corpus versions there were at that point.

First, there were the three separate basic corpora, CEEC, CEECE and CEECSU, stored on our network drive (the ‘P drive’). Each letter collection was stored in its own text file, preceded by some metadata about the collection and followed by the letters, which in these versions were lacking letter IDs or ‘L-lines’ such as <L ALLEN_001>, where ALLEN_001 is the identifier of the first letter in the Allen collection.

Second, there were two published subsets of CEEC: the CEEC Sampler (CEECS) and the Parsed CEEC (PCEEC), the latter of which was provided in three formats: plain text, tagged and parsed. The PCEEC was also stored on the P drive and had L-lines for all the letters in it, so that the letters could be unambiguously referenced in the separate metadata file.

Third, there were the normalised versions of CEEC, CEECE and CEECSU, known by the cover term SCEEC (short for Standardised-spelling Corpora of Early English Correspondence). We will call them CEEC-norm, CEECE-norm and CEECSU-norm. Unlike their non-normalised counterparts, the normalised collections did include L-lines; however, they only included a subset of the non-normalised collections. They were stored on the P drive in both tagged and untagged formats: the tagged versions included the original variants inside XML-like tags, whereas the untagged ones had no such tags.

Fourth, there was the brand new POS-Tagged CEECE (TCEECE) that we had just finished prior to this project (see the TCEECE manual). TCEECE was based on CEECE-norm that had been further normalised, converted into XML and annotated by a part-of-speech tagger.

Finally, the basic corpora CEEC, CEECE and CEECSU as well as PCEEC had also been imported into CEECer, a web-based search engine with associated metadata about the letters and their writers. As opposed to the P drive versions, all the letters in CEECer had L-lines, but the header sections that preceded each collection on the P drive had been lost in the importing process.

All in all, there were various versions of the CEEC corpora that were more or less related, but not in a straightforward way. Some changes had been made to some versions that had not been synced with the parallel versions. The same letter might appear in a slightly different form and even bear a different identifier in different corpus versions. Not even the collections could be consistently individuated across the different versions. There were also many deviations from the expected COCOA coding here and there. This mess had to be sorted out before the XML conversion could begin.

Why the mess?

The history of the L-line and the letter ID plays a crucial role in understanding the genesis of unsynced corpus versions. The original P drive versions of CEEC, CEECE and CEECSU did not involve any L-lines at all. The L-line was first introduced in the PCEEC, which for copyright reasons did not include all of the letters in the CEEC. When CEEC, CEECE, CEECSU and PCEEC were imported into CEECer, L-lines were added to all letters, which resulted in some ‘half’ IDs (e.g. BROWNE_043.5) and other peculiarities. The normalised versions were based on the P drive versions, however, so the L-lines were added again to the normalised versions, which resulted in more errors.

Another source of errors is the person ID that appears in the ‘Q-line’ (such as <Q A 1497? T RFOX>, where RFOX is the person ID of Richard Fox, the author of the letter). When the corpora were imported into CEECer, it was discovered that different collections (particularly those in different subcorpora) might employ the same person ID for different persons, as the person ID had been derived directly from the person’s first and last name. These IDs were corrected in the CEECer metadata but not in the P drive corpora.

Pre-processing

We decided to merge the P drive and CEECer versions of CEEC, CEECE and CEECSU and the tagged versions of CEEC-norm, CEECE-norm and CEECSU-norm into one master corpus that would be kept in a Git repository where the version history could be tracked and controlled automatically. We wanted to rid the corpus of all known errata and ensure that the letter identifiers be consistent across those parallel corpora that remained. It would be the new master corpus that would then be converted into XML. What follows is a reconstruction of the pre-conversion process.

In the version history of our GitLab project, the steps were actually taken in a different order than what is presented here. Now that the process is complete, it is easy to see that the actual order was far from optimal. That is why we present here an alternative version history where the changes are made in a more logical order that is easier to follow and understand. The end product is, nevertheless, exactly the same master corpus as in the actual Git repository.

File division

We began by aligning the basic collections with their normalised versions and determining for each collection the file that would be included in the new master corpus. Whenever a collection had both a non-normalised and a (tagged) normalised version, we preferred the normalised one, as the non-normalised version could be reverted from it automatically, and the normalised one also contained L-lines which the non-normalised one did not. For each row in the table below, if the ‘final file’ column is empty, it means the final file is the normalised one—unless that is empty too, in which case the final file is the non-normalised one. The collections without a normalised file are mostly from the 15th century, which represents Late Middle English and has so far been deemed too challenging to normalise.

On the non-normalised side, the biggest collections had once been split into smaller files because of a restriction on the file size imposed by the WordCruncher application. The respective normalised collections had been merged into one file (with one exception). Even those non-normalised collections that did not have normalised counterparts were now merged into one file, so that there would be only one file for each collection.

Corpus	Collection	Non-normalised file	Normalised file	Final file	Notes	CEECer
CEEC	Allen	FALLEN	FALLEN			x Allen
	Arundel	FARUNDEL	FARUNDEL			x Arundel
	Bacon	F1BACON	FBACON			x Bacon
		F2BACON
		F3BACON
	Barrington	FBARRING	FBARRING			x Barrington
	Basire	FBASIRE	FBASIRE			Basire
	Baxter & Eliot	FBAXTER	FBAXTER			x Baxter
	Bentham	FBENTHAM	FBENTHAM			x Bentham
	Brereton	FBRERETO	FBRERETO			x Brereton
	Browne	FBROWNE	FBROWNE			x Browne
	Bryskett	FBRYSKET	FBRYSKET			x Bryskett
	Cecil	FCECIL	FCECIL			x Cecil
	Cely	FCELY				x Cely
	Chamberlain	FCHAMBER	FCHAMBER			x Chamberlain
	Charles	FCHARLES	FCHARLES			Charles
	Clerk	FCLERK	FCLERK			x Clerk
	Clifford	FCLIFFOR	FCLIFFO			x Clifford
	Conway	FCONWAY	FCONWAY			x Conway
	Corie	FCORIE	FCORIE			x Corie
	Cornwallis	FCORNWAL	FCORNWAL			Cornwallis
	Cosin	FCOSIN	FCOSIN			Cosin
	Cromwell	FCROMWEL	FCROMWEL			x Cromwell
	Derby	FDERBY	FDERBY			x Derby
	Duppa	FDUPPA	FDUPPA			x Duppa
	Edmondes	FEDMONDE	FEDMONDE			x Edmondes
	Elyot	FELYOT	FELYOT			x Elyot
	Essex	FESSEX	FESSEX			x Essex
	Ferrar	FFERRAR	FFERRAR			x Ferrar
	Ffarington	FFFARING	FFFARING			x Ffarington
	Fitzherbert	FFITZHER	FFITZHER			x Fitzherbert
	Fleming	FFLEMING	FFLEMING			x Fleming
	Fox	FFOX	FFOX			x Fox
	Gardiner	FGARDIN	FGARDIN			x Gardiner
	Gawdy	FGAWDY	FGAWDY			x Gawdy
	Gawdy L	FGAWDYL	FGAWDYL			x Gawdy Lettice
	Giffard	FGIFFARD	FGIFFARD			x Giffard
	Haddock	FHADDOCK	FHADDOCK			x Haddock
	Hamilton	FHAMILTO	FHAMILTO			Hamilton
	Harington	FHARING	FHARING			x Harington
	Harley	FHARLEY	FHARLEY			Harley
	Hart	FHART	FHART			x Hart
	Harvey	FHARVEY	FHARVEY			x Harvey
	Hastings	FHASTING	FHASTING			x Hastings
	Hatton	FHATTON	FHATTON			x Hatton
	Henry VIII	FHENRY8	FHENRY8			x Henry VIII
	Henslowe	FHENSLOW	FHENSLOW			Henslowe
	Henslowe	FHENSLOW	FHENSLOW			x Henslowe
	Holles	FHOLLES	FHOLLES			x Holles
	Hoskyns	FHOSKYNS	FHOSKYNS			x Hoskyns
	Hutton	FHUTTON	FHUTTON			Hutton
	Johnson	F1JOHNSO	FJOHNSO			x Johnson
		F2JOHNSO
		F3JOHNSO
	Jones	FJONES	FJONES			Jones
	Jonson	FJONSON	FJONSON			x Jonson
	Knyvett	FKNYVETT	FKNYVETT			x Knyvett
	Leycester	FLEYCEST	FLEYCEST			Leycester
	Lisle	FLISLE	FLISLE			x Lisle
	Lowther	FLOWTHER	FLOWTHER			x Lowther
	Marchall	FMARCHAL				Marchall
	Marescoe	FMARESCO	FMARESCO			x Marescoe
	Marvell	FMARVELL	FMARVELL			x Marvell
	Minette	FMINETTE	FMINETTE			x Minette
	More	FMORE	FMORE			x More
	Original 1	FORIGIN1	FORIGIN1			Original 1
	Original 2	FORIGIN2	FORIGIN2			Original 2
	Original 3	FORIGIN3	FORIGIN3			Original 3
	Osborne	FOSBORNE	FOSBORNE			x Osborne
	Oxinden	F1OXINDE			These were deleted because they were already included in FOXINDEN	x Oxinden
		F2OXINDE
		FOXINDEN	FOXINDEN
	Paget	FPAGET	FPAGET			x Paget
	Parkhurst	FPARKHUR	FPARKHUR			x Parkhurst
	Paston	F1PASTON		FPASTON	The non-normalised collections were merged into one	x Paston
		F2PASTON
		F3PASTON
		F4PASTON
	Paston K	FPASTONK	FPASTONK			x Paston Katherine
	Pepys	FPEPYS	FPEPYS			x Pepys
	Petty	FPETTY	FPETTY			x Petty
	Plumpton	FPLUMPTO				Plumpton
	Pory	FPORY	FPORY			x Pory
	Prideaux	FPRIDEAU	FPRIDEAU			x Prideaux
	Rerum	FRERUM				Rerum
	Royal 1	FROYAL1	FROYAL1			Royal 1
	Royal 2	FROYAL2	FROYAL2			Royal 2
	Royal 2	FROYAL2	FROYAL2			x Royal 2
	Royal 3	FROYAL3	FROYAL3			x Royal 3
	Rutland	FRUTLAND				x Rutland
	Shillingford	FSHILLIN				Shillingford
	Signet	FSIGNET				x Signet
	Smyth	FSMYTH	FSMYTH			x Smyth
	Stapylton	FSTAPYLT	FSTAPYLT			x Stapylton
	Stiffkey	FSTIFFKE	FSTIFFKE			x Stiffkey
	Stockwell	FSTOCKWE	FSTOCKWE			x Stockwell
	Stonor	FSTONOR				Stonor
	Stuart	FSTUART	FSTUART			x Stuart
	Tixall	FTIXALL	FTIXALL			Tixall
	Verstegan	FVERSTEG	FVERSTEG			x Verstegan
	Wentworth	FWENTWOR	FWENTWOR			x Wentworth
	WeSa	FWESA	FWESA			WeSa
	Wharton	FWHARTON	FWHARTON			Wharton
	Willoughby	FWILLOUG	FWILLOUG			x Willoughby
	Wilmot	FWILMOT	FWILMOT			x Wilmot
	Wood	FWOOD	FWOOD			x Wood
	Wyatt	FWYATT	FWYATT			x Wyatt
		FBOHOLD			This was deleted because it was an old version of FROYAL2
CEECE	Addison	FADDISON	FADDISON			z Addison
	Austen	FAUSTEN	FAUSTEN			z Austen
	Banks	FBANKS	FBANKS			z Banks
	Bentham J	FBENTHAJ	FBENTHAJ			z Bentham Jeremy
	Blomefield	FBLOMEFI	FBLOMEFI			z Blomefield
	Bolton	FBOLTON	FBOLTON			z Bolton
	Bowrey	FBOWREY	FBOWREY			z Bowrey
	Burney	FBURNEY	FBURNEY			z Burney
	Burney F	FBURNEYF	FBURNEYF			z Burney F
	Bute	FBUTE	FBUTE			z Bute
	Carter	FCARTER	FCARTER			z Carter
	Champion	FCHAMPIO	FCHAMPIO			z Champion
	Clavering	FCLAVERI	FCLAVERI			z Clavering
	Clift	FCLIFT	FCLIFT			z Clift
	Cowper S	FCOWPERS	FCOWPERS			z Cowper S
	Cowper W	FCOWPERW	FCOWPERW			z Cowper W
	Crisp	FCRISP	FCRISP			z Crisp
	Culley	FCULLEY	FCULLEY			z Culley
	Darwin	FDARWIN	FDARWIN			z Darwin
	Defoe	FDEFOE	FDEFOE			z Defoe
	Dodsley	FDODSLEY	FDODSLEY			z Dodsley
	Draper	FDRAPER	FDRAPER			z Draper
	Dukes	FDUKES	FDUKES			z Dukes
	Evelyn	FEVELYN	FEVELYN			z Evelyn
	Evelyn 2	FEVELYN2	FEVELYN2			z Evelyn 2
	Fleming 2	F2FLEMIN	F2FLEMIN	FFLEMIN2	The normalised collections were merged into one	z Fleming 2
	Fleming 2	F3FLEMIN	F3FLEMIN	FFLEMIN2	The normalised collections were merged into one	z Fleming 2
	Fleming X	FFLEMINX	FFLEMINX			z Fleming Extra
	Foundling	FFOUNDLI	FFOUNDLI			z Foundling
	Garrick	FGARRICK	FGARRICK			z Garrick
	Gay	FGAY	FGAY			z Gay
	George 3	FGEORGE3	FGEORGE3			z George 3
	George 3a	FGEORG3A	FGEORG3A			z George 3A
	George 4	FGEORGE4	FGEORGE4			z George 4
	Gibbon	FGIBBON	FGIBBON			z Gibbon
	Giffard 2	FGIFFAR2	FGIFFAR2			z Giffard 2
	Gower	FGOWER	FGOWER			z Gower
	Gray	FGRAY	FGRAY			z Gray
	Haddock 2	FHADDOC2	FHADDOC2			z Haddock 2
	Hatton 2	fhatton2	fhatton2	FHATTON2	Renamed	z Hatton 2
	Henry	FHENRY	FHENRY			z Henry
	Hurd	FHURD	FHURD			z Hurd
	Johnson S	FJOHNSOS	FJOHNSOS			z Johnson
	Jones W	FJONESW	FJONESW			z Jones W
	Lennox	FLENNOX	FLENNOX			z Lennox
	Liddell	FLIDDELL	FLIDDELL			z Liddell
	Melbourne	FMELBOUR	FMELBOUR			z Melbourne
	Montagu	FMONTAGU	FMONTAGU			z Montagu
	Newdigate	FNEWDIGA	FNEWDIGA			z Newdigate
	North	FNORTH	FNORTH			z North
	Original 4	FORIGIN4	FORIGIN4			z Original 4
	Pauper	FPAUPER	FPAUPER			z Pauper
	Pepys 2	FPEPYS2	FPEPYS2			z Pepys 2
	Pepys 3	FPEPYS3	FPEPYS3			z Pepys 3
	Perrot	FPERROT	FPERROT			z Perrot Jane
	Petty 2	FPETTY2	FPETTY2			z Petty 2
	Pierce	FPIERCE	FPIERCE			z Pierce
	Pinney	FPINNEY	FPINNEY			z Pinney
	Piozzi	FPIOZZI	FPIOZZI			z Piozzi
	Pitt	FPITT	FPITT			z Pitt
	Pitt 2	FPITT2	FPITT2			z Pitt 2
	Pope	FPOPE	FPOPE			z Pope
	Porter	FPORTER	FPORTER			z Porter
	Prideaux 2	FPRIDEA2	FPRIDEA2			z Prideaux 2
	Purefoy	FPUREFOY	FPUREFOY			z Purefoy
	Royal 4	FROYAL4	FROYAL4			z Royal 4
	Sancho	FSANCHO	FSANCHO			z Sancho
	Secker	FSECKER	FSECKER			z Secker
	Stubs	FSTUBS	FSTUBS			z Stubs
	Swift	FSWIFT			The normalised collection is kept separately because of differences in Sample 2	z Swift
	Swift		FSWIFT	FSWIFT_norm		z Swift
	Tixall 2	FTIXALL2	FTIXALL2			z Tixall 2
	Twining	FTWINING	FTWINING			z Twining
	Wanley	FWANLEY	FWANLEY			z Wanley
	Warton	FWARTON	FWARTON			z Warton
	Wedgwood	FWEDGWOO	FWEDGWOO			z Wedgwood
	Wentworth 2	FWENTWO2	FWENTWO2			z Wentworth 2
	Wollstonecraft	FWOLLSTO	FWOLLSTO			z Wollstonecraft
	Young	FYOUNG	FYOUNG			z Young
CEECSU	Arundel 2	FARUNDE2	FARUNDE2			y Arundel 2
	Bacon D	FBACOND	FBACOND			y Bacon Dorothy
	Bacon X	FBACONX	FBACONX			y Bacon Extra
	Betts	FBETTS	FBETTS			y Betts
	Cary	FCARY	FCARY			y Cary
	Factory	FFACTOR1			These were deleted because they were already included in FFACTORY	y Factory
		FFACTOR2
		FFACTOR3
		FFACTORY	FFACTORY
	Gardiner 2	FGARDIN2	FGARDIN2			y Gardiner 2
	Gawdy 2	FGAWDY2	FGAWDY2			y Gawdy 2
	Grene	FGRENE	FGRENE			y Grene
	Knyvett 2	FKNYVET2	FKNYVET2			y Knyvett 2
	Lisle H	FLISLEH				y Lisle H
	Oxinden X	FOXINDEX	FOXINDEX			y Oxinden Extra
	Paston X	FPASTONX				y Paston Extra
	Plumpton 2	FPLUMPT2				y Plumpton 2
	Ralegh	FRALEGH	FRALEGH			y Ralegh
	Ralegh 2	FRALEGH2	FRALEGH2			y Ralegh 2
	Symcotts	FSYMCOTT				y Symcotts
	Thynne	FTHYNNE				y Thynne
	Zouche	FZOUCHE				y Zouche
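
The selection rule described before the table can be written out as a short function. This is only an illustrative sketch: the function name and argument names are hypothetical, and the three arguments simply stand for the 'Non-normalised file', 'Normalised file' and 'Final file' cells of one table row.

```python
def final_file(non_normalised, normalised="", final=""):
    """Return the file name that ends up in the master corpus for one row.

    An explicit 'final file' wins; otherwise the normalised file is used;
    if that is empty too, the non-normalised file is used.
    """
    if final:
        return final
    if normalised:
        return normalised
    return non_normalised


# Rows taken from the table above.
print(final_file("FALLEN", "FALLEN"))                # normalised file is kept
print(final_file("FCELY"))                           # no normalised version
print(final_file("fhatton2", "fhatton2", "FHATTON2"))  # explicitly renamed
```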

Ordering of letters

We spotted two collections where the ordering of letters did not correspond to that of PCEEC and CEECer. In one of them, not only the letter order but also the letter IDs were in conflict.

Corpus	Collection/file	Samples/letters	Notes
CEEC	FCLIFFO	Samples 1 and 2	Sample 2 (with letter IDs from 1 to 75) was moved to before sample 1 (with letter IDs from 76 to 105) so that the letter order would match that of PCEEC and CEECer
"	FHADDOCK	Letters 11 and 12	The letter IDs and ordering of the last two letters were interchanged to match those of PCEEC and CEECer

Character encoding

The encoding was unified to UTF-8 throughout the corpus.

The ‘old line’ refers to the line number as it was in the corpus after the changes to file division and letter order (see the two previous sections), and the ‘new line’ refers to the line number as it is in the final version of the corrected corpus.

Corpus	Collection	Old line(s)	New line(s)	Old text	New text
CEEC	FBACON	17	17	`�.WORD.� OR �WORD�`	`|.WORD.| OR |WORD|`
	FHENRY8	4	4	`Z�RICH`	`ZÜRICH`
	FHENSLOW	4	4	`KER�NEN`	`KERÄNEN`
	FMARCHAL	3	3	`KER�NEN`	`KERÄNEN`
	FROYAL2	10	10	`K�NIGIN VON B�HMEN`	`KÖNIGIN VON BÖHMEN`
	"	11	11	`KURF�RSTEN`	`KURFÜRSTEN`
	"	13	13	`T�BINGEN`	`TÜBINGEN`
	FWILMOT	6	6	`M�LLER`	`MÜLLER`
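
Lines like those listed above could be located by checking which lines of a file fail to decode as UTF-8. The following is a minimal sketch under our own assumptions, not a description of how the corrections were actually made: the function name is hypothetical, and the original encoding of the affected files is not recorded here, so Latin-1 is used purely to build a test input.

```python
def find_non_utf8_lines(data: bytes):
    """Return the 1-based numbers of the lines that are not valid UTF-8."""
    bad = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical input: line 1 is Latin-1 encoded, line 2 is proper UTF-8.
sample = "ZÜRICH".encode("latin-1") + b"\n" + "KERÄNEN".encode("utf-8") + b"\n"
print(find_non_utf8_lines(sample))
```

The check only finds the offending lines; deciding which character was intended still requires a human reader.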

Parameter coding

The standard COCOA parameters are documented in e.g. the CEECS manual and also below. Deviations from the standard had to be corrected so that the coding could be automatically converted into XML. At this stage, we also made the necessary corrections to the IDs of letters, persons and collections.

Corpus	Collection	Old line(s)	New line(s)	Old text	New text	Notes
CEEC	FBARRING	1394	1394	`<SIR THOMAS BARRINGTON>`	`<X SIR THOMAS BARRINGTON>`
	"	1482	1482	`<SIR FRANCIS HARRIS>`	`<X SIR FRANCIS HARRIS>`
	FCHARLES	11	11	`<SAMPLE 1>`	`<S SAMPLE 1>`
	"	162	162	`<SAMPLE 2>`	`<S SAMPLE 2>`
	FHASTING	1	1	`<B FHASTINGS>`	`<B FHASTING>`
	FHENSLOW	1	1	`<B FHENSLOWE>`	`<B FHENSLOW>`
	"	15	15	`<L HENSLOW_001>`	`<L HENSLO1_001>`	The IDs were changed to match the ones in PCEEC, CEECer and the untagged file on the P drive
	"	49	49	`<L HENSLOW_002>`	`<L HENSLO1_002>`
	"	67	67	`<L HENSLOW_003>`	`<L HENSLO1_003>`
	FJOHNSO	1	1	`<B F1JOHNSO>`	`<B FJOHNSO>`
	"	7112–5		`<B F2JOHNSO>`		Deleted
	"	13223–6		`<B F3JOHNSO>`		Deleted
	FMORE	2695	2695	`<L MORE_03>`	`<L MORE_033>`	The erroneous ID remains in the published PCEEC
	FORIGIN2	909	909	`<L ORIGIN2_020>`	`<L ORIGIN2_019.5>`
	"	962	962	`<L ORIGIN2_021>`	`<L ORIGIN2_020>`
	"	1027	1027	`<L ORIGIN2_022>`	`<L ORIGIN2_021>`
	"	etc.	etc.	etc.	etc.
	FOXINDEN	12705	12705	`<Q A 1662 FN HOXINDEN>.`	`<Q A 1662 FN HOXINDEN>`
CEECE	FEVELYN	2028	2028	`<SAMPLE 2>`	`<S SAMPLE 2>`
	"	4525, 4583	4525, 4583	`JJACKSON`	`J2JACKSON`
	FFLEMIN2	9356	9354	`JBANKES`	`JBANCKES`
	FFOUNDLI	5441, 5466	5441, 5466	`FRUSSELL`	`FRUSSELL2`
	FGEORGE4	1	1	`<D FGEORGE4>`	`<B FGEORGE4>`
	FPEPYS3		1–10		`<B FPEPYS3> [^SAMPLE 1 = PARTICULAR FRIENDS. THE CORRESPONDENCE OF SAMUEL PEPYS AND JOHN EVELYN. EDITED BY GUY DE LA BÉDOYÈRE. WOODBRIDGE: THE BOYDELL PRESS. 1997. SAMPLE 2 = THE LETTERS OF SAMUEL PEPYS AND HIS FAMILY CIRCLE. EDITED BY HELEN TRUESDELL HEATH. OXFORD 1955.^] <S SAMPLE 1>`
	"	169–70	180		`<S SAMPLE 2>`
	FPIOZZI	236	236	`EMONTAGU`	`E2MONTAGU`
	FPOPE	961, 994, 1067, 1150, 1217, 1362, 1387, 1438, 1633, 1883		`EHARLEY`	`E2HARLEY`
	FSWIFT	181, 306, 566	189, 319, 586	`RHARLEY`	`R2HARLEY`
	"	1998, 2121, 2146, 3349	2042, 2168, 2194, 3419	`HHOWARD`	`HEHOWARD`
	FSWIFT_norm	194, 324, 591		`RHARLEY`	`R2HARLEY`
	"	2047, 2173, 2199, 3424		`HHOWARD`	`HEHOWARD`
	"	5141	5141	`<Q A 1735? TC EGERMAIN>`	`<Q A 1735 TC EGERMAIN>`
	FWENTWO2	1409	1409	`<ISABELLA WENTWORTH>`	`<X ISABELLA WENTWORTH>`
	"	2634	2634	`<WILLIAM BERKELEY>`	`<X WILLIAM BERKELEY>`
	"	4824, 4842, 4893, 4910, 4925, 4949, 5196, 5392, 5988, 6015, 6632, 6709, 6737, 6759, 6816, 6850, 6879		`WWENTWORTH`	`W2WENTWORTH`
	FYOUNG	1943	1943	`MHARLEY`	`M2HARLEY`
CEECSU	FFACTORY	1	1	`<B FFACTOR1>`	`<B FFACTORY>`
	"	9, 560, 865, 3527, 6140, 13019, 14980, 15011	9, 560, 865, 3527, 6140, 13017, 14976, 15007	`WADAMS`	`WMADAMS`
	"	6663–4		`<B FFACTOR2>`		Deleted
	"	11768	11766	`EWILMOT`	`EDWILMOT`
	"	13242–3		`<B FFACTOR3>`		Deleted
	FLISLEH	2–3	3–5		`[^THE LISLE LETTERS, VOLS I–V. ED. BY MURIEL ST. CLARE. CHICAGO: UNIVERSITY OF CHICAGO PRESS. 1981.^]`
	FRALEGH2	880–881	881		`<X WALTER RALEGH>`

In addition to the changes above, L-lines were added to those collections that lacked them.
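
Many of the deviations listed above could be caught by a simple check of the kind the original corpus lacked. The sketch below is an assumption-laden illustration rather than the tool we used: it assumes non-normalised input in which every parameter line stands on its own line, and it accepts only the parameter codes mentioned in this manual (B, L, Q, X, A, P and S). The names are hypothetical.

```python
import re

# A well-formed parameter line: '<', one known code letter, a space, content, '>'.
PARAMETER_LINE = re.compile(r"<([BLQXAPS]) (\S[^<>]*)>")


def find_bad_parameter_lines(lines):
    """Return (line number, text) pairs for lines that open with '<'
    but do not form a well-formed parameter line."""
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("<") and not PARAMETER_LINE.fullmatch(stripped):
            bad.append((number, stripped))
    return bad


# Examples taken from the table above.
lines = [
    "<B FGEORGE4>",
    "<D FGEORGE4>",              # wrong code letter
    "<SAMPLE 2>",                # missing code letter
    "<Q A 1662 FN HOXINDEN>.",   # stray full stop
    "<X SIR THOMAS BARRINGTON>",
]
for number, text in find_bad_parameter_lines(lines):
    print(number, text)
```

A check like this cannot detect wrong identifiers (such as a person ID shared by two people); those had to be compared against the CEECer metadata.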

Text-level coding

The standard COCOA text-level codes are documented in e.g. the CEECS manual and also below. Deviations from the standard had to be corrected so that the coding could be automatically converted into XML. There were also some code instances that were not incorrect as such but still had to be changed to meet the stricter requirements of XML.

Corpus	Collection	Old line(s)	New line(s)	Old text	New text	Notes
CEEC	FARUNDEL	2185	2185	`will send you. l`	`will send you. I`	Lower case L to upper case I
	FBACON	3485	3485	`[{of wh}`	`[ {of wh}`
	"	6313–5	6313–5	`[} [\PRIVY COUNCIL TO SIR CHRISTOPHER HEYDON, SIR WILLIAM BUTTS, <P I,221> NATHANIEL BACON AND RALPH SHELTON, COMMISSIONERS IN A CASE OF`	`[} [\PRIVY COUNCIL TO SIR CHRISTOPHER HEYDON, SIR WILLIAM BUTTS, ...\] <P I,221> [\...NATHANIEL BACON AND RALPH SHELTON, COMMISSIONERS IN A CASE OF`
	"	9841	9841	`wor[\ship\] ]`	`[wor[\ship\] ]`
	"	10488	10488	`[of N[\orthumberland\]`	`[of N[\orthumberland\] ]`	The error is in the edition
	"	11293–5	11293–5	`[his {the} <P II,235> office <normalised orig="beinge" auto="true">being</normalised> {the}]`	`[his {the} ] <P II,235> [office <normalised orig="beinge" auto="true">being</normalised> {the} ]`
	"	11781–3	11781–3	`inquired what <P II,258> just cause`	`inquired what] <P II,258> [just cause`
	"	11956	11956	`[{office}`	`[ {office}`
	"	14633–5	14633–5	`[The <P III,124> <normalised orig="countenaunce" auto="true">countenance</normalised>`	`[The] <P III,124> [<normalised orig="countenaunce" auto="true">countenance</normalised>`
	FCLIFFO	3619	3619	`[\ENDORSED,]`	`[\ENDORSED\]`
	FCORNWAL	2565	2565	`childeren w=th my self`	`childeren w=th= my self`
	FFLEMING	2201	2201	`for y=e good`	`for y=e= good`
	FHARLEY	916	916	`(^Octo: 18. 1639^.)`	`(^Octo: 18. 1639.^)`
	FHENSLOW	679	679	`M=ri[{s= ...{]t`	`M=ri=[{=s= ...{]t`
	"	1603	1603	`for yo=u to`	`for yo=u= to`
	FJOHNSO	786	786	`of my w{ill to{]`	`of my w[{ill to{]`
	"	3747	3747	`(^li mer s[{t.^) ; and{]`	`(^li mer s[{t.{]^) [{; and{]`
	"	4841	4841	`(^d Fl.)`	`(^d Fl.^)`
	"	6329	6329	`[\274. SABINE JOHNSON TO JOHN JOHNSON}]`	`[\274. SABINE JOHNSON TO JOHN JOHNSON\]`
	"	9278	9274	`(^lb^ )`	`(^lb^)`
	FLEYCEST	4043	4043	`[{is\]`	`[{is{]`
	"	6006–8	6006–8	`[\I dare make <P 342> none of my servants`	`[\I dare make CROSSED OUT\] <P 342> [\none of my servants`
	FOSBORNE	1461	1461	`For M=rs Painter`	`For M=rs= Painter`
	FOXINDEN	4131	4131	`[} CLXXI THOMAS BARROW`	`[} [\CLXXI THOMAS BARROW`
	FPASTON	923	944	`[\?\]ch[{...{]`	`[\?\] ch[{...{]`
	"	18099	18448	`[{ [\582. FROM FRIAR JOHN BRACKLEY`	`[} [\582. FROM FRIAR JOHN BRACKLEY`
	FPEPYS	4100	4100	`[my sister`	`[\my sister`
	FWILMOT	657	657	`the w=ch I have`	`the w=ch= I have`
CEECE	FADDISON	1486	1486	`3O=th= July`	`30=th= July`	Capital O to zero
	FBANKS	2110	2110	`[torn]`	`[\TORN\]`
	FBOWREY	1044	1044	`Colkers \CAULKERS\]`	`Colkers [\CAULKERS\]`
	FBURNEYF	677–9	677–9	`[\2 1/2 ILLEGIBLE <P III,188> LINES\]`	`[\2 1/2 ILLEGIBLE...\] <P III,188> [\...LINES\]`
	FDODSLEY	453–5	453–5	`the foll[{y <P 110> of Noblemen`	`the foll[{y{] <P 110> [{of Noblemen`
	"	2782–4	2782–4	`altercation about it, <P 281> except what might`	`altercation about it,\] <P 281> [\CROSSED OUT except what might`
	"	3117	3117	`so[{und w=c{]h=`	`so[{und w=c={]=h=`
	"	3433	3433	`(mouldering^)`	`(^mouldering^)`
	"	3949	3949	`[{of W=k{]m=`	`[{of W=k={]=m=`
	FDRAPER	1419	1419	`(27^th October^)`	`27(^th October^)`
	FFLEMIN2	4010	4009	`(notwithstanding those Provocations}`	`(notwithstanding those Provocations)`
	"	4035	4034	`19=th came safe`	`19=th= came safe`
	FLIDDELL	309	309	`Capt[ain\]`	`Capt[\ain\]`
	"	1007	1007	`sist[\er]`	`sist[\er\]`
	"	1598	1598	`l(eaves\]`	`l[\eaves\]`
	"	2785	2785	`June l0th`	`June 10th`	Lower case L to one
	FPAUPER	263	263	`(BERMONDSEY, LONDON}`	`(BERMONDSEY, LONDON)`
	FPRIDEA2	903	903	`E[\arl\l]`	`E[\arl\]`
	FPUREFOY	3802	3802	`[^SIGN OMIITTED^]`	`[^SIGN OMITTED^]`
	FSANCHO	1759	1759	`[October 17, 1779.\]`	`[\October 17, 1779.\]`
	"	2075	2075	`M\inorit\]y`	`M[\inorit\]y`
	FSWIFT	2666, 2709, 2823, 3031, 3304, 3405, 3448, 3535, 3580, 3714, 3819, 3910, 4232, 4348, 4380, 4581	2722, 2767, 2884, 3096, 3372, 3476, 3521, 3611, 3658, 3795, 3904, 3998, 4326, 4445, 4479, 4685		`[^FROM ELIZABETH BERKELEY^]`	Added to the end of the line
	FYOUNG	1424–6	1424–6	`[\STRICKEN <P 132> PHRASE\]`	`[\STRICKEN...\] <P 132> [\...PHRASE\]`
CEECSU	FBACOND	299–301	299–301	`[\ELEVEN <P 91> HOURS\]`	`[\ELEVEN...\] <P 91> [\...HOURS\]`
	FFACTORY	1501, 1685	1501, 1685	`wacadash,`	`(\wacadash\) ,`
	"	1869–70	1869–70	`c'nto per c'nto,`	`(\c'nto per c'nto\) ,`
	"	2153	2153	`Angin Sama's`	`(\Angin Sama's\)`
	"	7111	7109	`catabera`	`(\catabera\)`
	"	7387–9	7385–7	`[\4CM <P 379> MISSING\]`	`[\4CM...\] <P 379> [\...MISSING\]`
	"	7574	7572	`contors,`	`(\contors\) ,`
	"	8392	8390	`cataberas`	`(\cataberas\)`
	"	10735	10733	`ditto`	`(\ditto\)`
	"	10930	10928	`pancado,`	`(\pancado\) ,`
	"	11109	11107	`vizt`	`(\vizt\)`
	"	14056	14052	`<normalised orig="prowe" auto="true">prow</normalised>,`	`(\prowe\) ,`
	"	16817	16813	`(\ocome\)`	`ocome`
	"	17260, 17348	17256, 17344	`Umbera's`	`(\Umbera's\)`
	"	18147	18143	`barsos`	`(\barsos\)`
	"	18453	18449	`deposseta`	`(\deposseta\)`
	FSYMCOTT	324	331	`(\fieri\).`	`(\fieri\) .`
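
A large share of the errors above are unbalanced or mismatched code brackets. As a rough illustration of how such cases can be detected, here is a minimal stack-based sketch. It is not the procedure we followed: the names are hypothetical, only the two-character bracket pairs seen in the table are covered, and superscript `=` marks and the custom single brackets of Bacon and Willoughby are ignored.

```python
# Opening code brackets and the closing brackets they require.
OPENERS = {
    "[\\": "\\]",
    "[{": "{]",
    "(^": "^)",
    "[^": "^]",
    "[}": "}]",
    "(\\": "\\)",
}
CLOSERS = set(OPENERS.values())


def check_code_brackets(text):
    """Return a sorted list of (position, bracket) pairs that are unmatched."""
    stack = []      # openers waiting for their closer
    problems = []
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair in OPENERS:
            stack.append((pair, i))
            i += 2
        elif pair in CLOSERS:
            if stack and OPENERS[stack[-1][0]] == pair:
                stack.pop()
            else:
                problems.append((i, pair))
            i += 2
        else:
            i += 1
    # Anything left on the stack was never closed.
    problems.extend((position, opener) for opener, position in stack)
    return sorted(problems)


print(check_code_brackets("(^d Fl.)"))   # opener without a closer
print(check_code_brackets("[{is\\]"))    # closer of the wrong kind
print(check_code_brackets("[{is{]"))     # corrected form
```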

Custom codes

Unlike other collections in CEEC-400, the Bacon and Willoughby collections in the CEEC involved custom codes that had not been converted into COCOA text-level coding:

Custom code	Meaning
`[you]`	Inserted words (only in Bacon)
`[so [[done]] till the afternone]`	Words inserted within an insertion (only in Bacon)
`{hard for}`	Deleted words

The custom codes were converted into COCOA codes. The conversion was carried out partly automatically and partly manually. We omit the complete conversion tables and only give a few examples so the reader will get the idea:

Custom code	COCOA
`river[ward]`	`riverward [\ward INSERTED\]`
`{[hym]}`	`[\hym INSERTED THEN DELETED\]`
`land{es}`	`land [\FINAL es DELETED\]`
`ha{ve}d`	`had [\have OVERWRITTEN\]`
`[Whereas it pleased your Lordship to direct your letters to Mr Sprat for {the puttinge} [[omyttinge]] James Tavernor {of} [[to be of]] the jury at Wighton, which was executed accordinglie, I have sithens {exam} inquired what]`	`Whereas it pleased your Lordship to direct your letters to Mr Sprat for [\the puttinge DELETED\] omyttinge [\omyttinge INSERTED\] James Tavernor [\of DELETED\] to be of [\to be of INSERTED\] the jury at Wighton, which was executed accordinglie, I have sithens [\exam DELETED\] inquired what [\Whereas it pleased ... inquired what INSERTED\]`
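
The automatic part of the conversion can be illustrated with the two simplest patterns from the examples above: free-standing deleted words and doubly bracketed insertions. This is a sketch under our own assumptions, with hypothetical names. Word-internal codes such as `land{es}`, combined codes such as `{[hym]}` and the outer single-bracket insertions are deliberately left untouched, as they are the kind of cases that needed manual attention.

```python
import re

# {words} standing on its own (not attached to a word) marks a deletion.
DELETED = re.compile(r"(?<!\S)\{([^{}\[\]]+)\}(?!\S)")
# [[words]] marks words inserted within an insertion.
INSERTED = re.compile(r"\[\[([^\[\]]+)\]\]")


def convert_simple_custom_codes(text):
    """Convert free-standing {..} and [[..]] codes into COCOA comments."""
    text = DELETED.sub(lambda m: "[\\" + m.group(1) + " DELETED\\]", text)
    text = INSERTED.sub(
        lambda m: m.group(1) + " [\\" + m.group(1) + " INSERTED\\]", text
    )
    return text


print(convert_simple_custom_codes(
    "for {the puttinge} [[omyttinge]] James Tavernor {of} [[to be of]] the jury"
))
```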

Reversion of normalisation

The pre-conversion process resulted in a new hybrid corpus of normalised and non-normalised collections. The normalised collections still had to be reverted to their non-normalised versions before the corpus could be converted into XML. The reversion was performed by ‘VardStripper’, a Java application written by Lassi Saario. In what follows, the ‘original CEEC-400’ will refer to the corpus as it was at this point, after the reversion and before the conversion.
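
The idea of the reversion can be shown in a few lines. The following is not VardStripper itself (which is a Java application) but a minimal Python sketch of the same principle, assuming that every normalised word is wrapped in a tag of the form seen in the correction tables above, with the original variant in the `orig` attribute. The function name is hypothetical.

```python
import re

# <normalised orig="beinge" auto="true">being</normalised>  ->  beinge
NORMALISED_TAG = re.compile(
    r'<normalised\s+orig="([^"]*)"[^>]*>.*?</normalised>', re.DOTALL
)


def revert_normalisation(text):
    """Replace each normalised word with the original variant it records."""
    return NORMALISED_TAG.sub(lambda m: m.group(1), text)


print(revert_normalisation(
    'office <normalised orig="beinge" auto="true">being</normalised> {the}'
))
```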

XML conversion

The XML schema of CEEC-400 has long roots. Before we got funding for the XML conversion of the entire CEEC-400, we had already converted a version of the CEECE-norm as part of the POS tagging project that resulted in the TCEECE. The XML schema of CEEC-400 was based on that of the TCEECE, which had in turn been based on that of the Helsinki Corpus, which was itself based on the TEI standard.

The conversion from COCOA into XML was performed by our own ‘XmlConverter’, a Java application written by Lassi Saario. The resulting XML files were validated against the Document Type Definition by XmlStarlet (version 1.6.1), a command line XML toolkit developed by Mikhail Grushinskiy.

Structure of the corpus

The original corpus was divided into text files, one for each letter collection. We decided to preserve this division. Each COCOA-encoded collection file was converted into an XML-encoded file of the same name.

In the original corpus, each collection consisted of a header followed by the individual letters, and each letter likewise consisted of a header followed by its contents. The XML version follows the same overall structure, illustrated below.

Each XML document begins with the same two lines. The first line specifies the XML version and the character encoding. The second line defines the document type by a reference to an external DTD file. The entities are given an internal declaration as well, since omitting it would cause errors in some browsers which do not support external DTDs.

Each document has teiCollection as its root element. It is made up of a teiHeader element, containing header information about the collection, and a series of TEI elements, representing the individual letters. Each TEI element is likewise made up of a teiHeader element which includes header information about the letter, and a text element which includes the actual contents of the letter.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE teiCollection SYSTEM "../CEEC.dtd" [
	<!ENTITY ETH "&#208;">
	<!ENTITY eth "&#240;">
	<!ENTITY YOGH "&#540;">
	<!ENTITY yogh "&#541;">
	<!ENTITY THORN "&#222;">
	<!ENTITY thorn "&#254;">
	<!ENTITY pound "&#163;">
]>
<teiCollection xml:id="FFOX">
	<teiHeader>...</teiHeader>
	<TEI xml:id="FOX_001">
		<teiHeader>...</teiHeader>
		<text type="letter" xml:lang="eng">...</text>
	</TEI>
	<TEI xml:id="FOX_002">
		<teiHeader>...</teiHeader>
		<text type="letter" xml:lang="eng">...</text>
	</TEI>
	...
</teiCollection>
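
For readers who want to process the files programmatically, the structure above can be traversed with any standard XML library. The sketch below uses Python's built-in `xml.etree.ElementTree` on a shortened, made-up sample; the function name and the sample letter texts are hypothetical. Note that `xml:id` lives in the predefined XML namespace, so it has to be addressed by its full name.

```python
import xml.etree.ElementTree as ET

# Full name of the xml:id attribute as ElementTree sees it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# A shortened stand-in for a collection file (letter texts are invented).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE teiCollection SYSTEM "../CEEC.dtd" [
	<!ENTITY pound "&#163;">
]>
<teiCollection xml:id="FFOX">
	<teiHeader></teiHeader>
	<TEI xml:id="FOX_001">
		<teiHeader></teiHeader>
		<text type="letter" xml:lang="eng">first letter</text>
	</TEI>
	<TEI xml:id="FOX_002">
		<teiHeader></teiHeader>
		<text type="letter" xml:lang="eng">second letter</text>
	</TEI>
</teiCollection>"""


def list_letters(xml_bytes):
    """Return the collection ID and a list of (letter ID, text) pairs."""
    root = ET.fromstring(xml_bytes)
    letters = [(tei.get(XML_ID), tei.findtext("text")) for tei in root.findall("TEI")]
    return root.get(XML_ID), letters


collection_id, letters = list_letters(SAMPLE)
print(collection_id, letters)
```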

Parameter coding

In the original CEEC-400, each file begins with the identifier of the collection (the same as the file name), followed by source information:

<B FFOX>

[^LETTERS OF RICHARD FOX 1486-1527. EDITED BY P. S. AND H. M.
ALLEN. OXFORD: CLARENDON PRESS. 1929.^]

In the XML conversion, the identifier is put in the xml:id attribute of the teiCollection opening tag, and the source information is included in a titleStmt element in the teiHeader:

<teiCollection xml:id="FFOX">
	<teiHeader>
		<fileDesc>
			<titleStmt>LETTERS OF RICHARD FOX 1486-1527. EDITED BY P. S. AND H. M. ALLEN. OXFORD: CLARENDON PRESS. 1929.</titleStmt>
		</fileDesc>
	</teiHeader>

A letter header in the original CEEC-400 consists of an L-line, a Q-line, an X-line and a P-line:

<L FOX_001>
<Q A 1497? T RFOX>
<X RICHARD FOX>
<P 17>

The L-line gives the letter identifier.
The Q-line specifies the authenticity of the letter, the year of writing, the relationship between the writer of the letter and the addressee, and the identifier of the writer, respectively.
The X-line contains the name of the writer in full. (In some collections there is an A-line instead of an X-line, but the content is the same nevertheless.)
The P-line includes the number of the page on which the letter begins in the source edition. Similar lines appear amidst the body whenever the page changes.
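
The four header lines described above are regular enough to be read with a small parser. The following sketch is illustrative only and is not taken from XmlConverter: the function and key names are hypothetical, and it assumes that the Q-line always consists of exactly four whitespace-separated fields, as in the example.

```python
import re

# '<', a code letter, a space, the content, '>'
HEADER_LINE = re.compile(r"<([LQXAP]) ([^<>]+)>")


def parse_letter_header(lines):
    """Collect the contents of the L-, Q-, X-/A- and P-lines into a dict."""
    header = {}
    for line in lines:
        match = HEADER_LINE.fullmatch(line.strip())
        if not match:
            continue
        code, content = match.groups()
        if code == "L":
            header["letter_id"] = content
        elif code == "Q":
            # Assumed field order: authenticity, year, relationship, writer ID.
            authenticity, year, relationship, writer_id = content.split()
            header.update(
                authenticity=authenticity,
                year=year,
                relationship=relationship,
                writer_id=writer_id,
            )
        elif code in ("X", "A"):
            header["writer_name"] = content
        elif code == "P":
            header["page"] = content
    return header


print(parse_letter_header(
    ["<L FOX_001>", "<Q A 1497? T RFOX>", "<X RICHARD FOX>", "<P 17>"]
))
```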

The contents of the lines are included in the XML header as follows:

<TEI xml:id="FOX_001">
<!-- from the L-line -->
	<teiHeader>
		<fileDesc>
			<titleStmt>
				<title key="A 1497? T RFOX"></title>
				<!-- from the Q-line -->
				<author key="RICHARD FOX"></author>
				<!-- from the X- (or A-) line -->
			</titleStmt>
		</fileDesc>
	</teiHeader>
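
The mapping from header lines to XML can be sketched as follows. Again, this is a hypothetical Python illustration of the mapping shown above, not the code of XmlConverter; the function name is our own, and the XML comments of the example are left out.

```python
import xml.etree.ElementTree as ET

# Full name of the xml:id attribute; it is serialised as 'xml:id'.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def build_letter_header(letter_id, q_line, writer_name):
    """Build a TEI element whose header carries the L-, Q- and X-line contents."""
    tei = ET.Element("TEI", {XML_ID: letter_id})
    tei_header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(tei_header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title", {"key": q_line})       # from the Q-line
    ET.SubElement(title_stmt, "author", {"key": writer_name})  # from the X- or A-line
    return tei


tei = build_letter_header("FOX_001", "A 1497? T RFOX", "RICHARD FOX")
print(ET.tostring(tei, encoding="unicode"))
```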

The P-line is included in the XML body along with the other P-lines. See the section on page numbers.

The S-lines like <S SAMPLE 1> that sometimes occur between letters mark samples taken from different source editions in the original corpus. They have been converted into XML comments.

Text-level coding

Textual structure

A letter body in the original CEEC-400 is divided into lines of at most 65 characters. Some of them are P-lines that annotate page breaks; the rest have no fixed format. Paragraphs and sentences flow freely from one line to the next, together with code brackets for headings, emendations etc. See the example below.

[} [\705 FROM MARY HOWE TO MR RANKING IN COOPERSALE (THEYDON 
GARNON), 13 JANUARY 1731\] }]
Jenaw 13 day 1731
Mr ranking this is to let you know that the doxtor have done 
what he can for me but my iees are never the better but rather
wors i ame to be discharge=d= next wandsday i 
hope you will be so kind as to send me word how i must come home
by next wandsday morning so with humble sarvis to you and your 
good wife
   sir I hope you will exquese me in wrighting of a letter but i
did not know no other way So i rest your humble sarvant
   mary how patient in
[\CONTINUED CROSSWISE IN LEFT-HAND MARGIN\] peter ward

Unfortunately, the line and page divisions that are so explicit in the original CEEC-400 are irrelevant for the purposes of linguistic research, especially as they do not reflect those of the original manuscript. Much more relevant is the paragraph division, which is also much more implicit. Lines that start paragraphs are usually indented with three spaces, but not always: sometimes the only clue that a line starts a paragraph is that the previous line is shorter than usual. Matters are further complicated by the fact that P-lines sometimes appear in the middle of a paragraph and sometimes between paragraphs. Code brackets often appear in the middle of a paragraph, but sometimes they continue across paragraph breaks, and sometimes they even seem to form paragraphs of their own.

Recognising such delicate divisions may be easy for the human eye, but it is far from easy for a computer. We wanted to try it anyway. The rule of thumb that we gave to our converter is that a line starts a new paragraph if it is indented or if the previous line is shorter than 40 characters. Lines that were recognised as forming a paragraph were then merged and, provided that the result was indeed a proper paragraph (as opposed to a single P-line or a bunch of code), put inside a p element. Code bracket sequences that continued across paragraph or page breaks were split at the breaking point in order to keep the element tree well-formed.
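The rule of thumb can be sketched in Python as follows. This is a simplified illustration under our own assumptions rather than the converter itself: the function names are hypothetical, and the sketch ignores P-lines and code brackets, which the real converter has to treat separately.

```python
def starts_paragraph(line, previous_line, short_limit=40):
    """A line starts a paragraph if it is indented or follows a short line."""
    if previous_line is None:
        return True  # the first line always opens a paragraph
    indented = line[:1].isspace()
    previous_is_short = len(previous_line.rstrip()) < short_limit
    return indented or previous_is_short

def group_paragraphs(lines):
    """Merge consecutive lines into paragraphs according to the rule of thumb."""
    paragraphs = []
    previous = None
    for line in lines:
        if starts_paragraph(line, previous):
            paragraphs.append([])
        paragraphs[-1].append(line.strip())
        previous = line
    # Join the lines of each paragraph with single spaces, skipping empty lines.
    return [" ".join(part for part in paragraph if part)
            for paragraph in paragraphs]

lines = [
    "Jenaw 13 day 1731",
    "Mr ranking this is to let you know that the doxtor have done",
    "what he can for me but my iees are never the better but rather",
    "   sir I hope you will exquese me in wrighting of a letter but i",
    "did not know no other way So i rest your humble sarvant",
]
for paragraph in group_paragraphs(lines):
    print("<p>" + paragraph + "</p>")
```

With these five lines the date line, the passage beginning "Mr ranking" and the indented passage beginning "sir I hope" come out as three separate paragraphs.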

Special characters

Grave accent symbols that annotated accents (not only grave but acute ones and circumflexes as well) and tildes that annotated abbreviations in the original CEEC-400 remain in the XML edition. Certain special characters have been converted into XML entities according to the following table.

Source edition	Original corpus	XML corpus	Description
&	`&`	`&amp;`	ampersand
Ð	`+D`	`&ETH;`	upper case eth
ð	`+d`	`&eth;`	lower case eth
Ȝ	`+G`	`&YOGH;`	upper case yogh
ȝ	`+g`	`&yogh;`	lower case yogh
Þ	`+T`	`&THORN;`	upper case thorn
þ	`+t`	`&thorn;`	lower case thorn
£	`+L`	`&pound;`	pound sign
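For checking purposes, the plus-sign codes in the "Original corpus" column can be decoded back into the characters of the "Source edition" column. The following Python sketch does only that; it is not part of the converter, the names are hypothetical, and it assumes that a plus sign followed by one of the listed letters is always a character code.

```python
import re

# Codes of the original corpus mapped to the characters of the source edition.
COCOA_TO_CHAR = {
    "+D": "\u00D0",  # upper case eth
    "+d": "\u00F0",  # lower case eth
    "+G": "\u021C",  # upper case yogh
    "+g": "\u021D",  # lower case yogh
    "+T": "\u00DE",  # upper case thorn
    "+t": "\u00FE",  # lower case thorn
    "+L": "\u00A3",  # pound sign
}
CODE = re.compile(r"\+[DdGgTtL]")

def decode_special_characters(text):
    """Replace every listed plus-sign code with its Unicode character."""
    return CODE.sub(lambda match: COCOA_TO_CHAR[match.group(0)], text)

print(decode_special_characters("+te +gere"))
```

Codes that are not in the table are left untouched.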

Page numbers

Page changes were annotated as P-lines, e.g. <P 45>, in the original CEEC-400. They are converted into empty pb elements, the n attribute of which contains the page number, e.g. <pb n="45" />. Note that pb elements may appear inside as well as outside p elements.
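As a minimal sketch, the substitution could be written in Python as below. The function name is hypothetical, and the sketch emits the pb element in self-closing form so that the output stays well-formed.

```python
import re
from xml.sax.saxutils import quoteattr

P_LINE = re.compile(r"<P ([^<>]+)>")

def convert_page_breaks(text):
    """Turn every P-line code into an empty pb element with an n attribute."""
    return P_LINE.sub(
        lambda match: f"<pb n={quoteattr(match.group(1).strip())} />", text)

print(convert_page_breaks("good wife\n<P 46>\nsir I hope"))
```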

Headings

Headings, annotated with the code [}...}] in the original CEEC-400, are annotated with the code <head>...</head> in the XML edition.

Note that most headings have been added by either editors or compilers and are therefore double-coded, as in

[} [\98. TO FANNY BURNEY\] }]

where the inner brackets stand for the editor’s or compiler’s remark. The double-coding is preserved in the XML conversion so that the given example is converted into

<head> <note resp="editor" value="98. TO FANNY BURNEY" /> </head>

See the section on comments for more information.
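The double-coded example can be reproduced with the following Python sketch, which converts the inner editor's comment first and the surrounding heading code second. It is a simplification under our own assumptions: the function name is hypothetical, only the editor's code [\...\] is handled, and every comment inside a heading is treated as a proper comment.

```python
import re
from xml.sax.saxutils import quoteattr

EDITOR_COMMENT = re.compile(r"\[\\(.*?)\\\]", re.S)
HEADING = re.compile(r"\[\}(.*?)\}\]", re.S)

def convert_heading(text):
    """Convert the inner comment code first, then the heading code around it."""
    def note(match):
        return f'<note resp="editor" value={quoteattr(match.group(1))} />'
    inner_done = EDITOR_COMMENT.sub(note, text)
    return HEADING.sub(r"<head>\1</head>", inner_done)

print(convert_heading("[} [\\98. TO FANNY BURNEY\\] }]"))
```

Working from the inside out keeps the two substitutions independent of each other.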

Emendations

Emendations are annotated with the code [{...{] in the original CEEC-400. These have been converted into supplied elements in the XML version. When the emendation consists of complete words, we simply put the content of the brackets between the XML tags. For example, in the passage

I turned so sick that I [{could{] hardly speak

the [{could{] is converted into

<supplied>could</supplied>

The case is trickier when the emendation contains partial words, as in w[{ife and son{]. We have converted such emendations so that the final emended expression appears between the XML tags, while the original code is preserved in an orig attribute:

<supplied range="1,10" orig="w[{ife and son{]">wife and son</supplied>

The range attribute specifies the extent of the emendation. The expression between the XML tags is indexed so that the first character has the index 0, the second has the index 1 etc. The first number in the range value is the index of the first character in the range, and the second number is the index of the first character not in the range. Note that whitespace characters are not counted. When there are several ranges in the same expression, they are delimited by a semicolon:

<supplied range="7,13;15,16" orig="felysch[{yp of Ho{]ll[{a{]ndars">felyschyp of Hollandars</supplied>
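The range convention described above can be made concrete with a short Python sketch. It is an illustration under our own assumptions, not the converter's code: the function names are hypothetical, and the codes are assumed to be balanced and not nested.

```python
from xml.sax.saxutils import escape, quoteattr

def expand_code(orig, open_code="[{", close_code="{]"):
    """Return the plain text and the (start, end) ranges of the coded parts.

    Indices count non-whitespace characters only; end is exclusive.
    """
    plain, ranges = [], []
    index = 0       # non-whitespace characters emitted so far
    start = None    # start index of the currently open code, if any
    i = 0
    while i < len(orig):
        if start is None and orig.startswith(open_code, i):
            start = index
            i += len(open_code)
        elif start is not None and orig.startswith(close_code, i):
            ranges.append((start, index))
            start = None
            i += len(close_code)
        else:
            plain.append(orig[i])
            if not orig[i].isspace():
                index += 1
            i += 1
    if start is not None:
        raise ValueError("unclosed code in " + repr(orig))
    return "".join(plain), ranges

def supplied_to_xml(orig):
    """Build a supplied element, adding range and orig only for partial words."""
    plain, ranges = expand_code(orig)
    length = sum(1 for ch in plain if not ch.isspace())
    if ranges == [(0, length)]:
        return f"<supplied>{escape(plain)}</supplied>"
    value = ";".join(f"{a},{b}" for a, b in ranges)
    return (f'<supplied range="{value}" orig={quoteattr(orig)}>'
            f"{escape(plain)}</supplied>")

print(supplied_to_xml("w[{ife and son{]"))
print(supplied_to_xml("felysch[{yp of Ho{]ll[{a{]ndars"))
```

Because the opening and closing codes are parameters, the same function also reproduces the ranges of the other partial-word codes in this chapter, e.g. expand_code("Jan[\\uary\\]", "[\\", "\\]") yields the range 3,7.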

Comments

The original version of CEEC-400 contains two types of comments added to the body text. Comments by compilers of the corpus are annotated with the code [^...^], e.g. [^LIST OF NAMES OMITTED^]. Comments by editors of source editions are annotated with the code [\...\], e.g. [\TORN\].

In the TEI XML edition of the Helsinki Corpus, both codes are converted into a note element. The author of the comment is specified by a resp attribute which points to his/her name in the document header. For our purposes, however, it is sufficient to separate the editors’ comments from the compilers’ and not specify the individual commentator. The attribute is simply given the value compiler for compilers’ comments and editor for editors’ comments.

Comments, whether they are written by editors or compilers, are actually used for two different purposes. One is a ‘proper’ comment, such as the two previous examples. The other is more like an emendation, as in

an order [\was made\] at his Lordship's instance

The difference between the two kinds is that a proper comment is a comment about the surrounding text, whereas an emendation-like comment is more like a part of the text. They are rather easily distinguished by the fact that a proper comment usually involves several consecutive upper case letters while an emendation-like comment does not. This holds true even when a proper comment contains text that is parallel to the preceding text, as in

it will [\be DELETED\] come safe hither

When the brackets contain a proper comment, we put their contents in an attribute inside an XML tag:

<note resp="compiler" value="LIST OF NAMES OMITTED" />
<note resp="editor" value="be DELETED" />

When the brackets are used to annotate emendations, we follow the same principle as with the [{...{] code explained above. Emendations of complete words are encoded as

<note resp="editor">was made</note>

whereas emendations that contain partial words are encoded as

<note resp="editor" range="3,7" orig="Jan[\uary\]">January</note>
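The distinction between proper and emendation-like comments can be sketched in Python as follows. The function names are hypothetical, and the threshold of three consecutive upper case letters is our assumption for "several"; the sketch covers complete-word comments only and leaves the partial-word case with its range and orig attributes aside.

```python
import re
from xml.sax.saxutils import escape, quoteattr

def is_proper_comment(content, min_run=3):
    """Heuristic: a proper comment contains a run of upper case letters."""
    return re.search(r"[A-Z]{%d,}" % min_run, content) is not None

def comment_to_xml(content, resp):
    """Encode a comment as a note element; resp is 'compiler' or 'editor'."""
    if is_proper_comment(content):
        # Proper comment: the content goes into the value attribute.
        return f"<note resp={quoteattr(resp)} value={quoteattr(content)} />"
    # Emendation-like comment: the content stays in the running text.
    return f"<note resp={quoteattr(resp)}>{escape(content)}</note>"

print(comment_to_xml("LIST OF NAMES OMITTED", "compiler"))
print(comment_to_xml("be DELETED", "editor"))
print(comment_to_xml("was made", "editor"))
```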

Type changes

Changes of typeface in the printed source editions were annotated as (^...^) in the original corpus. In the XML edition, they are annotated as <hi rend="type">...</hi>. (This usually corresponds to an underlined passage in the original letter.)

When the change of typeface concerns partial words as in Theo(^log^), the original coding is preserved in the orig attribute of the hi element as in

<hi rend="type" range="4,7" orig="Theo(^log^)">Theolog</hi>

Foreign language

Passages in a foreign language were annotated with the code (\...\) in the original CEEC-400. In the XML edition, they are annotated with the code <foreign>...</foreign>.

Superscripts

Superscripts in the original corpus were put between two equals signs, as in

=vi=
w=ch=
p=r=ferm=t=

In the XML corpus, they are annotated as

<hi rend="sup">vi</hi>
<hi rend="sup" range="1,3" orig="w=ch=">wch</hi>
<hi rend="sup" range="1,2;6,7" orig="p=r=ferm=t=">prfermt</hi>
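Since the opening and closing codes are identical here, one way to reproduce the three examples is to split the word at the equals signs: every second piece is then a superscript. The Python sketch below takes this approach. It is not the converter's code; the function name is hypothetical, and the input is assumed to be a single word without whitespace or nested superscripts.

```python
from xml.sax.saxutils import escape, quoteattr

def superscript_to_xml(token):
    """Convert one word with =...= codes into a hi element."""
    parts = token.split("=")
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced superscript code in " + repr(token))
    plain = "".join(parts)
    ranges = []
    position = 0
    for i, part in enumerate(parts):
        if i % 2 == 1:  # pieces at odd positions lie between two equals signs
            ranges.append(f"{position},{position + len(part)}")
        position += len(part)
    if ranges == [f"0,{len(plain)}"]:
        # The whole word is superscripted: no range or orig needed.
        return f'<hi rend="sup">{escape(plain)}</hi>'
    value = ";".join(ranges)
    return (f'<hi rend="sup" range="{value}" orig={quoteattr(token)}>'
            f"{escape(plain)}</hi>")

for word in ["=vi=", "w=ch=", "p=r=ferm=t="]:
    print(superscript_to_xml(word))
```

A splitter of this kind can only see consecutive superscripts, so it cannot express a superscript inside another superscript.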

Post-processing

There were two abbreviations in the Dodsley collection where the superscript code had been used to encode a superscript inside another superscript. These instances had to be hard-coded into the converter in order to be converted correctly; otherwise the converter would have treated them as two consecutive superscripts:

COCOA	XML
`=Jun=r=.=`	`<hi rend="sup" range="0,5" orig="=Jun=r=.="><hi rend="sup" range="3,4" orig="Jun=r=.">Junr.</hi></hi>`
`=Jun=r==`	`<hi rend="sup" range="0,4" orig="=Jun=r=="><hi rend="sup" range="3,4" orig="Jun=r=">Junr</hi></hi>`

Endpoint

At the time of writing, we are adapting the new XML corpus for our own CQPweb server in collaboration with Lancaster University. Once the remaining (15th-century) collections have been normalised, we will consider the possibility of converting the normalised CEEC-400 into an even more comprehensive XML version that would provide access to both the original and normalised variants.

References

For more information on the CEEC corpora, see the front page.