Conversion of the CEEC-400 into XML

A Manual to Accompany the XML Edition

2020

Lassi Saario
Research Unit for Variation, Contacts and Change in English (VARIENG)
Faculty of Arts
University of Helsinki

  1. Starting point
    1. Family tree of CEEC-400
    2. Why the mess?
  2. Pre-processing
    1. File division
    2. Ordering of letters
    3. Character encoding
    4. Parameter coding
    5. Text-level coding
    6. Custom codes
    7. Reversion of normalisation
  3. XML conversion
    1. Structure of the corpus
    2. Parameter coding
    3. Text-level coding
      1. Textual structure
      2. Special characters
      3. Page numbers
      4. Headings
      5. Emendations
      6. Comments
      7. Type changes
      8. Foreign language
      9. Superscripts
  4. Post-processing
  5. Endpoint
  6. References

CEEC-400 is a cover term for a family of corpora: the original Corpus of Early English Correspondence (CEEC), the CEEC Extension (CEECE), the CEEC Supplement (CEECSU) and their various versions. Together they cover a time span of almost 400 years from 1402 to 1800. The corpora are based on published editions of letters, which were sampled and digitised by a team of compilers at the University of Helsinki.

The original CEEC-400 was written in a custom version of the ancient COCOA (Word COunt and COncordance on Atlas) format, based on that of the Helsinki Corpus of English Texts. Different collections were often encoded by different people, mostly research assistants, interpreting instructions written by someone else. As there was no validator to check their output, the coding was prone to errors and inconsistencies, the number of which multiplied with each version of the corpus.

In the autumn of 2018, we were granted funding by the Faculty of Arts to convert CEEC-400 into XML so that it could be imported into more modern platforms (such as CQPweb). We soon found out that the shortcomings of the original coding would have to be detected and corrected before the conversion could take place. This document describes the correction and conversion project, carried out by Lassi Saario under the supervision of Tanja Säily and Samuli Kaislaniemi.

The documentation is arranged in chronological order. Chapters 1–2, on the background of the project and the pre-processing, are mostly intended for internal use within VARIENG, whereas the chapter on the XML conversion is likely to be of more interest to the general public.

Starting point

Family tree of CEEC-400

The CEEC corpora have a large family tree, the branches of which had grown quite far apart from each other by the time the conversion was about to start. Here is a brief inventory of the corpus versions that existed at that point.

First, there were the three separate basic corpora, CEEC, CEECE and CEECSU, stored on our network drive (the ‘P drive’). Each letter collection was stored in its own text file, which began with some metadata about the collection, followed by the letters. In these versions the letters lacked letter IDs or ‘L-lines’ such as <L ALLEN_001>, where ALLEN_001 is the identifier of the first letter in the Allen collection.

Second, there were two published subsets of CEEC: the CEEC Sampler (CEECS) and the Parsed CEEC (PCEEC), the latter of which was provided in three formats: plain text, tagged and parsed. The PCEEC was also stored on the P drive and had L-lines for all the letters in it, so that the letters could be unambiguously referenced in the separate metadata file.

Third, there were the normalised versions of CEEC, CEECE and CEECSU, known by the cover term SCEEC (short for Standardised-spelling Corpora of Early English Correspondence). We will call them CEEC-norm, CEECE-norm and CEECSU-norm. Unlike their non-normalised counterparts, the normalised collections did include L-lines; however, they only included a subset of the non-normalised collections. They were stored on the P drive in both tagged and untagged formats: the tagged versions included the original variants inside XML-like tags, whereas the untagged ones had no such tags.

Fourth, there was the brand new POS-Tagged CEECE (TCEECE) that we had just finished prior to this project (see the TCEECE manual). TCEECE was based on a version of CEECE-norm that had been further normalised, converted into XML and annotated by a part-of-speech tagger.

Finally, the basic corpora CEEC, CEECE and CEECSU as well as PCEEC had also been imported into CEECer, a web-based search engine with associated metadata about the letters and their writers. As opposed to the P drive versions, all the letters in CEECer had L-lines, but the header sections that preceded each collection on the P drive had been lost in the import process.

All in all, there were various versions of the CEEC corpora that were more or less related, but not in a straightforward way. Some changes had been made to some versions that had not been synced with the parallel versions. The same letter might appear in a slightly different form and even bear a different identifier in different corpus versions. Not even the collections could be consistently individuated across the different versions. There were also many deviations from the expected COCOA coding here and there. This mess had to be sorted out before the XML conversion could begin.

Why the mess?

The history of the L-line and the letter ID plays a crucial role in understanding the genesis of unsynced corpus versions. The original P drive versions of CEEC, CEECE and CEECSU did not contain any L-lines at all. The L-line was first introduced in the PCEEC, which for copyright reasons did not include all of the letters in the CEEC. When CEEC, CEECE, CEECSU and PCEEC were imported into CEECer, L-lines were added to all letters, which resulted in some ‘half’ IDs (e.g. BROWNE_043.5) and other peculiarities. The normalised versions were based on the P drive versions, however, so the L-lines were added again to the normalised versions, which resulted in more errors.

Another source of errors is the person ID that appears in the ‘Q-line’ (such as <Q A 1497? T RFOX>, where RFOX is the person ID of Richard Fox, the author of the letter). When the corpora were imported into CEECer, it was discovered that different collections (particularly those in different subcorpora) might employ the same person ID for different persons, as the person ID had been derived directly from the person’s first and last name. These IDs were corrected in the CEECer metadata but not in the P drive corpora.

Pre-processing

We decided to merge the P drive and CEECer versions of CEEC, CEECE and CEECSU and the tagged versions of CEEC-norm, CEECE-norm and CEECSU-norm into one master corpus that would be kept in a Git repository where the version history could be tracked and controlled automatically. We wanted to rid the corpus of all known errata and ensure that the letter identifiers be consistent across those parallel corpora that remained. It would be the new master corpus that would then be converted into XML. What follows is a reconstruction of the pre-conversion process.

In the version history of our GitLab project, the steps were actually taken in a different order than what is presented here. Now that the process is complete, it is easy to see that the actual order was far from optimal. That is why we present here an alternative version history where the changes are made in a more logical order that is easier to follow and understand. The end product is, nevertheless, exactly the same master corpus as in the actual Git repository.

File division

We began by aligning the basic collections with their normalised versions and determining for each collection the file that would be included in the new master corpus. Whenever a collection had both a non-normalised and a (tagged) normalised version, we preferred the normalised one, as the non-normalised version could be restored from it automatically, and the normalised one also contained L-lines, which the non-normalised one did not. For each row in the table below, an empty ‘final file’ column means that the final file is the normalised one; if the ‘normalised file’ column is empty too, the final file is the non-normalised one. The collections without a normalised file are mostly from the 15th century, which represents Late Middle English and has so far been deemed too challenging to normalise.
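The rule for reading the table can be stated compactly in code. The following Python sketch is purely illustrative: the function name final_file and its arguments are hypothetical, and empty table cells are assumed to be represented as None.

```python
from typing import Optional


def final_file(non_normalised: Optional[str],
               normalised: Optional[str],
               final: Optional[str]) -> Optional[str]:
    """Resolve which file enters the master corpus for one table row."""
    # An explicit entry in the 'final file' column always wins.
    if final:
        return final
    # Otherwise prefer the normalised file and fall back
    # to the non-normalised one.
    return normalised or non_normalised


# Rows taken from the table: Allen, Cely and Hatton 2.
print(final_file("FALLEN", "FALLEN", None))
print(final_file("FCELY", None, None))
print(final_file("fhatton2", "fhatton2", "FHATTON2"))
```

For the three rows shown, the function returns FALLEN (the normalised file), FCELY (the non-normalised file) and FHATTON2 (the explicitly renamed file).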

On the non-normalised side, the biggest collections had once been split into smaller files because of a restriction on the file size imposed by the WordCruncher application. The respective normalised collections had been merged into one file (with one exception). Even those non-normalised collections that did not have normalised counterparts were now merged into one file, so that there would be only one file for each collection.

Corpus Collection Non-normalised file Normalised file Final file Notes CEECer
CEEC Allen FALLEN FALLEN x Allen
Arundel FARUNDEL FARUNDEL x Arundel
Bacon F1BACON FBACON x Bacon
F2BACON
F3BACON
Barrington FBARRING FBARRING x Barrington
Basire FBASIRE FBASIRE Basire
Baxter & Eliot FBAXTER FBAXTER x Baxter
Bentham FBENTHAM FBENTHAM x Bentham
Brereton FBRERETO FBRERETO x Brereton
Browne FBROWNE FBROWNE x Browne
Bryskett FBRYSKET FBRYSKET x Bryskett
Cecil FCECIL FCECIL x Cecil
Cely FCELY x Cely
Chamberlain FCHAMBER FCHAMBER x Chamberlain
Charles FCHARLES FCHARLES Charles
Clerk FCLERK FCLERK x Clerk
Clifford FCLIFFOR FCLIFFO x Clifford
Conway FCONWAY FCONWAY x Conway
Corie FCORIE FCORIE x Corie
Cornwallis FCORNWAL FCORNWAL Cornwallis
Cosin FCOSIN FCOSIN Cosin
Cromwell FCROMWEL FCROMWEL x Cromwell
Derby FDERBY FDERBY x Derby
Duppa FDUPPA FDUPPA x Duppa
Edmondes FEDMONDE FEDMONDE x Edmondes
Elyot FELYOT FELYOT x Elyot
Essex FESSEX FESSEX x Essex
Ferrar FFERRAR FFERRAR x Ferrar
Ffarington FFFARING FFFARING x Ffarington
Fitzherbert FFITZHER FFITZHER x Fitzherbert
Fleming FFLEMING FFLEMING x Fleming
Fox FFOX FFOX x Fox
Gardiner FGARDIN FGARDIN x Gardiner
Gawdy FGAWDY FGAWDY x Gawdy
Gawdy L FGAWDYL FGAWDYL x Gawdy Lettice
Giffard FGIFFARD FGIFFARD x Giffard
Haddock FHADDOCK FHADDOCK x Haddock
Hamilton FHAMILTO FHAMILTO Hamilton
Harington FHARING FHARING x Harington
Harley FHARLEY FHARLEY Harley
Hart FHART FHART x Hart
Harvey FHARVEY FHARVEY x Harvey
Hastings FHASTING FHASTING x Hastings
Hatton FHATTON FHATTON x Hatton
Henry VIII FHENRY8 FHENRY8 x Henry VIII
Henslowe FHENSLOW FHENSLOW Henslowe
x Henslowe
Holles FHOLLES FHOLLES x Holles
Hoskyns FHOSKYNS FHOSKYNS x Hoskyns
Hutton FHUTTON FHUTTON Hutton
Johnson F1JOHNSO FJOHNSO x Johnson
F2JOHNSO
F3JOHNSO
Jones FJONES FJONES Jones
Jonson FJONSON FJONSON x Jonson
Knyvett FKNYVETT FKNYVETT x Knyvett
Leycester FLEYCEST FLEYCEST Leycester
Lisle FLISLE FLISLE x Lisle
Lowther FLOWTHER FLOWTHER x Lowther
Marchall FMARCHAL Marchall
Marescoe FMARESCO FMARESCO x Marescoe
Marvell FMARVELL FMARVELL x Marvell
Minette FMINETTE FMINETTE x Minette
More FMORE FMORE x More
Original 1 FORIGIN1 FORIGIN1 Original 1
Original 2 FORIGIN2 FORIGIN2 Original 2
Original 3 FORIGIN3 FORIGIN3 Original 3
Osborne FOSBORNE FOSBORNE x Osborne
Oxinden F1OXINDE These were deleted because they were already included in FOXINDEN x Oxinden
F2OXINDE
FOXINDEN FOXINDEN
Paget FPAGET FPAGET x Paget
Parkhurst FPARKHUR FPARKHUR x Parkhurst
Paston F1PASTON FPASTON The non-normalised collections were merged into one x Paston
F2PASTON
F3PASTON
F4PASTON
Paston K FPASTONK FPASTONK x Paston Katherine
Pepys FPEPYS FPEPYS x Pepys
Petty FPETTY FPETTY x Petty
Plumpton FPLUMPTO Plumpton
Pory FPORY FPORY x Pory
Prideaux FPRIDEAU FPRIDEAU x Prideaux
Rerum FRERUM Rerum
Royal 1 FROYAL1 FROYAL1 Royal 1
Royal 2 FROYAL2 FROYAL2 Royal 2
x Royal 2
Royal 3 FROYAL3 FROYAL3 x Royal 3
Rutland FRUTLAND x Rutland
Shillingford FSHILLIN Shillingford
Signet FSIGNET x Signet
Smyth FSMYTH FSMYTH x Smyth
Stapylton FSTAPYLT FSTAPYLT x Stapylton
Stiffkey FSTIFFKE FSTIFFKE x Stiffkey
Stockwell FSTOCKWE FSTOCKWE x Stockwell
Stonor FSTONOR Stonor
Stuart FSTUART FSTUART x Stuart
Tixall FTIXALL FTIXALL Tixall
Verstegan FVERSTEG FVERSTEG x Verstegan
Wentworth FWENTWOR FWENTWOR x Wentworth
WeSa FWESA FWESA WeSa
Wharton FWHARTON FWHARTON Wharton
Willoughby FWILLOUG FWILLOUG x Willoughby
Wilmot FWILMOT FWILMOT x Wilmot
Wood FWOOD FWOOD x Wood
Wyatt FWYATT FWYATT x Wyatt
FBOHOLD This was deleted because it was an old version of FROYAL2
CEECE Addison FADDISON FADDISON z Addison
Austen FAUSTEN FAUSTEN z Austen
Banks FBANKS FBANKS z Banks
Bentham J FBENTHAJ FBENTHAJ z Bentham Jeremy
Blomefield FBLOMEFI FBLOMEFI z Blomefield
Bolton FBOLTON FBOLTON z Bolton
Bowrey FBOWREY FBOWREY z Bowrey
Burney FBURNEY FBURNEY z Burney
Burney F FBURNEYF FBURNEYF z Burney F
Bute FBUTE FBUTE z Bute
Carter FCARTER FCARTER z Carter
Champion FCHAMPIO FCHAMPIO z Champion
Clavering FCLAVERI FCLAVERI z Clavering
Clift FCLIFT FCLIFT z Clift
Cowper S FCOWPERS FCOWPERS z Cowper S
Cowper W FCOWPERW FCOWPERW z Cowper W
Crisp FCRISP FCRISP z Crisp
Culley FCULLEY FCULLEY z Culley
Darwin FDARWIN FDARWIN z Darwin
Defoe FDEFOE FDEFOE z Defoe
Dodsley FDODSLEY FDODSLEY z Dodsley
Draper FDRAPER FDRAPER z Draper
Dukes FDUKES FDUKES z Dukes
Evelyn FEVELYN FEVELYN z Evelyn
Evelyn 2 FEVELYN2 FEVELYN2 z Evelyn 2
Fleming 2 F2FLEMIN F2FLEMIN FFLEMIN2 The normalised collections were merged into one z Fleming 2
F3FLEMIN F3FLEMIN
Fleming X FFLEMINX FFLEMINX z Fleming Extra
Foundling FFOUNDLI FFOUNDLI z Foundling
Garrick FGARRICK FGARRICK z Garrick
Gay FGAY FGAY z Gay
George 3 FGEORGE3 FGEORGE3 z George 3
George 3a FGEORG3A FGEORG3A z George 3A
George 4 FGEORGE4 FGEORGE4 z George 4
Gibbon FGIBBON FGIBBON z Gibbon
Giffard 2 FGIFFAR2 FGIFFAR2 z Giffard 2
Gower FGOWER FGOWER z Gower
Gray FGRAY FGRAY z Gray
Haddock 2 FHADDOC2 FHADDOC2 z Haddock 2
Hatton 2 fhatton2 fhatton2 FHATTON2 Renamed z Hatton 2
Henry FHENRY FHENRY z Henry
Hurd FHURD FHURD z Hurd
Johnson S FJOHNSOS FJOHNSOS z Johnson
Jones W FJONESW FJONESW z Jones W
Lennox FLENNOX FLENNOX z Lennox
Liddell FLIDDELL FLIDDELL z Liddell
Melbourne FMELBOUR FMELBOUR z Melbourne
Montagu FMONTAGU FMONTAGU z Montagu
Newdigate FNEWDIGA FNEWDIGA z Newdigate
North FNORTH FNORTH z North
Original 4 FORIGIN4 FORIGIN4 z Original 4
Pauper FPAUPER FPAUPER z Pauper
Pepys 2 FPEPYS2 FPEPYS2 z Pepys 2
Pepys 3 FPEPYS3 FPEPYS3 z Pepys 3
Perrot FPERROT FPERROT z Perrot Jane
Petty 2 FPETTY2 FPETTY2 z Petty 2
Pierce FPIERCE FPIERCE z Pierce
Pinney FPINNEY FPINNEY z Pinney
Piozzi FPIOZZI FPIOZZI z Piozzi
Pitt FPITT FPITT z Pitt
Pitt 2 FPITT2 FPITT2 z Pitt 2
Pope FPOPE FPOPE z Pope
Porter FPORTER FPORTER z Porter
Prideaux 2 FPRIDEA2 FPRIDEA2 z Prideaux 2
Purefoy FPUREFOY FPUREFOY z Purefoy
Royal 4 FROYAL4 FROYAL4 z Royal 4
Sancho FSANCHO FSANCHO z Sancho
Secker FSECKER FSECKER z Secker
Stubs FSTUBS FSTUBS z Stubs
Swift FSWIFT The normalised collection is kept separately because of differences in Sample 2 z Swift
FSWIFT FSWIFT_norm
Tixall 2 FTIXALL2 FTIXALL2 z Tixall 2
Twining FTWINING FTWINING z Twining
Wanley FWANLEY FWANLEY z Wanley
Warton FWARTON FWARTON z Warton
Wedgwood FWEDGWOO FWEDGWOO z Wedgwood
Wentworth 2 FWENTWO2 FWENTWO2 z Wentworth 2
Wollstonecraft FWOLLSTO FWOLLSTO z Wollstonecraft
Young FYOUNG FYOUNG z Young
CEECSU Arundel 2 FARUNDE2 FARUNDE2 y Arundel 2
Bacon D FBACOND FBACOND y Bacon Dorothy
Bacon X FBACONX FBACONX y Bacon Extra
Betts FBETTS FBETTS y Betts
Cary FCARY FCARY y Cary
Factory FFACTOR1 These were deleted because they were already included in FFACTORY y Factory
FFACTOR2
FFACTOR3
FFACTORY FFACTORY
Gardiner 2 FGARDIN2 FGARDIN2 y Gardiner 2
Gawdy 2 FGAWDY2 FGAWDY2 y Gawdy 2
Grene FGRENE FGRENE y Grene
Knyvett 2 FKNYVET2 FKNYVET2 y Knyvett 2
Lisle H FLISLEH y Lisle H
Oxinden X FOXINDEX FOXINDEX y Oxinden Extra
Paston X FPASTONX y Paston Extra
Plumpton 2 FPLUMPT2 y Plumpton 2
Ralegh FRALEGH FRALEGH y Ralegh
Ralegh 2 FRALEGH2 FRALEGH2 y Ralegh 2
Symcotts FSYMCOTT y Symcotts
Thynne FTHYNNE y Thynne
Zouche FZOUCHE y Zouche

Ordering of letters

We spotted two collections where the ordering of letters did not correspond to that of PCEEC and CEECer. In one of them, the letter IDs conflicted as well as the letter order.

Corpus | Collection/file | Samples/letters | Notes
CEEC | FCLIFFO | Samples 1 and 2 | Sample 2 (with letter IDs from 1 to 75) was moved to before sample 1 (with letter IDs from 76 to 105) so that the letter order would match that of PCEEC and CEECer
" | FHADDOCK | Letters 11 and 12 | The letter IDs and ordering of the last two letters were interchanged to match those of PCEEC and CEECer

Character encoding

The encoding was unified to UTF-8 throughout the corpus.

The ‘old line’ refers to the line number as it was in the corpus after the changes to file division and letter order (see the two previous sections), and the ‘new line’ refers to the line number as it is in the final version of the corrected corpus.

Corpus Collection Old line(s) New line(s) Old text New text
CEEC FBACON 17 17 �.WORD.� OR �WORD� |.WORD.| OR |WORD|
FHENRY8 4 4 Z�RICH ZÜRICH
FHENSLOW 4 4 KER�NEN KERÄNEN
FMARCHAL 3 3 KER�NEN KERÄNEN
FROYAL2 10 10 K�NIGIN VON B�HMEN KÖNIGIN VON BÖHMEN
" 11 11 KURF�RSTEN KURFÜRSTEN
" 13 13 T�BINGEN TÜBINGEN
FWILMOT 6 6 M�LLER MÜLLER
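The effect of reading a file under the wrong encoding, and of re-encoding it as UTF-8, can be illustrated with a short Python sketch. The legacy encoding of the original files is not recorded in this manual; Latin-1 is assumed here purely for illustration, and the function name to_utf8 is hypothetical.

```python
def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    The default source encoding is an assumption made for this example.
    """
    return raw.decode(source_encoding).encode("utf-8")


# A name from the FHENSLOW header, stored in the assumed legacy encoding.
legacy = "KERÄNEN".encode("latin-1")

# Reading the legacy bytes as if they were UTF-8 yields the
# replacement character seen in the 'old text' column.
print(legacy.decode("utf-8", errors="replace"))

# Re-encoding from the correct source encoding restores the letter.
print(to_utf8(legacy).decode("utf-8"))
```

The first print statement shows the garbled form with U+FFFD in place of Ä, and the second shows the restored name.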

Parameter coding

The standard COCOA parameters are documented in e.g. the CEECS manual and also below. Deviations from the standard had to be corrected so that the coding could be automatically converted into XML. At this stage, we also made the necessary corrections to the IDs of letters, persons and collections.

Corpus Collection Old line(s) New line(s) Old text New text Notes
CEEC FBARRING 1394 1394 <SIR THOMAS BARRINGTON> <X SIR THOMAS BARRINGTON>
" 1482 1482 <SIR FRANCIS HARRIS> <X SIR FRANCIS HARRIS>
FCHARLES 11 11 <SAMPLE 1> <S SAMPLE 1>
" 162 162 <SAMPLE 2> <S SAMPLE 2>
FHASTING 1 1 <B FHASTINGS> <B FHASTING>
FHENSLOW 1 1 <B FHENSLOWE> <B FHENSLOW>
" 15 15 <L HENSLOW_001> <L HENSLO1_001> The IDs were changed to match the ones in PCEEC, CEECer and the untagged file on the P drive
" 49 49 <L HENSLOW_002> <L HENSLO1_002>
" 67 67 <L HENSLOW_003> <L HENSLO1_003>
FJOHNSO 1 1 <B F1JOHNSO> <B FJOHNSO>
" 7112–5 <B F2JOHNSO> Deleted
" 13223–6 <B F3JOHNSO> Deleted
FMORE 2695 2695 <L MORE_03> <L MORE_033> The erroneous ID remains in the published PCEEC
FORIGIN2 909 909 <L ORIGIN2_020> <L ORIGIN2_019.5>
" 962 962 <L ORIGIN2_021> <L ORIGIN2_020>
" 1027 1027 <L ORIGIN2_022> <L ORIGIN2_021>
" etc. etc. etc. etc.
FOXINDEN 12705 12705 <Q A 1662 FN HOXINDEN>. <Q A 1662 FN HOXINDEN>
CEECE FEVELYN 2028 2028 <SAMPLE 2> <S SAMPLE 2>
" 4525, 4583 4525, 4583 JJACKSON J2JACKSON
FFLEMIN2 9356 9354 JBANKES JBANCKES
FFOUNDLI 5441, 5466 5441, 5466 FRUSSELL FRUSSELL2
FGEORGE4 1 1 <D FGEORGE4> <B FGEORGE4>
FPEPYS3 1–10 <B FPEPYS3>

[^SAMPLE 1 = PARTICULAR FRIENDS. THE CORRESPONDENCE OF SAMUEL
PEPYS AND JOHN EVELYN. EDITED BY GUY DE LA BÉDOYÈRE. WOODBRIDGE:
THE BOYDELL PRESS. 1997.

SAMPLE 2 = THE LETTERS OF SAMUEL PEPYS AND HIS FAMILY CIRCLE.
EDITED BY HELEN TRUESDELL HEATH. OXFORD 1955.^]

<S SAMPLE 1>
" 169–70 180 <S SAMPLE 2>
FPIOZZI 236 236 EMONTAGU E2MONTAGU
FPOPE 961, 994, 1067, 1150, 1217, 1362, 1387, 1438, 1633, 1883 EHARLEY E2HARLEY
FSWIFT 181, 306, 566 189, 319, 586 RHARLEY R2HARLEY
" 1998, 2121, 2146, 3349 2042, 2168, 2194, 3419 HHOWARD HEHOWARD
FSWIFT_norm 194, 324, 591 RHARLEY R2HARLEY
" 2047, 2173, 2199, 3424 HHOWARD HEHOWARD
" 5141 5141 <Q A 1735? TC EGERMAIN> <Q A 1735 TC EGERMAIN>
FWENTWO2 1409 1409 <ISABELLA WENTWORTH> <X ISABELLA WENTWORTH>
" 2634 2634 <WILLIAM BERKELEY> <X WILLIAM BERKELEY>
" 4824, 4842, 4893, 4910, 4925, 4949, 5196, 5392, 5988, 6015, 6632, 6709, 6737, 6759, 6816, 6850, 6879 WWENTWORTH W2WENTWORTH
FYOUNG 1943 1943 MHARLEY M2HARLEY
CEECSU FFACTORY 1 1 <B FFACTOR1> <B FFACTORY>
" 9, 560, 865, 3527, 6140, 13019, 14980, 15011 9, 560, 865, 3527, 6140, 13017, 14976, 15007 WADAMS WMADAMS
" 6663–4 <B FFACTOR2> Deleted
" 11768 11766 EWILMOT EDWILMOT
" 13242–3 <B FFACTOR3> Deleted
FLISLEH 2–3 3–5 [^THE LISLE LETTERS, VOLS I–V. ED. BY MURIEL ST. CLARE. CHICAGO:
UNIVERSITY OF CHICAGO PRESS. 1981.^]
FRALEGH2 880–881 881 <X WALTER RALEGH>

In addition to the changes above, L-lines were added to those collections that lacked them.

Text-level coding

The standard COCOA text-level codes are documented in e.g. the CEECS manual and also below. Deviations from the standard had to be corrected so that the coding could be automatically converted into XML. There were also some code instances that were not incorrect as such but still had to be changed to meet the stricter requirements of XML.

Corpus Collection Old line(s) New line(s) Old text New text Notes
CEEC FARUNDEL 2185 2185 will send you. l will send you. I Lower case L to upper case I
FBACON 3485 3485 [{of wh} [ {of wh}
" 6313–5 6313–5 [} [\PRIVY COUNCIL TO SIR CHRISTOPHER HEYDON, SIR WILLIAM BUTTS,
<P I,221>
NATHANIEL BACON AND RALPH SHELTON, COMMISSIONERS IN A CASE OF
[} [\PRIVY COUNCIL TO SIR CHRISTOPHER HEYDON, SIR WILLIAM BUTTS, ...\]
<P I,221>
[\...NATHANIEL BACON AND RALPH SHELTON, COMMISSIONERS IN A CASE OF
" 9841 9841 wor[\ship\] ] [wor[\ship\] ]
" 10488 10488 [of N[\orthumberland\] [of N[\orthumberland\] ] The error is in the edition
" 11293–5 11293–5 [his {the}
<P II,235>
office <normalised orig="beinge" auto="true">being</normalised> {the}]
[his {the} ]
<P II,235>
[office <normalised orig="beinge" auto="true">being</normalised> {the} ]
" 11781–3 11781–3 inquired what
<P II,258>
just cause
inquired what]
<P II,258>
[just cause
" 11956 11956 [{office} [ {office}
" 14633–5 14633–5 [The
<P III,124>
<normalised orig="countenaunce" auto="true">countenance</normalised>
[The]
<P III,124>
[<normalised orig="countenaunce" auto="true">countenance</normalised>
FCLIFFO 3619 3619 [\ENDORSED,] [\ENDORSED\]
FCORNWAL 2565 2565 childeren w=th my self childeren w=th= my self
FFLEMING 2201 2201 for y=e good for y=e= good
FHARLEY 916 916 (^Octo: 18. 1639^.) (^Octo: 18. 1639.^)
FHENSLOW 679 679 M=ri[{s= ...{]t M=ri=[{=s= ...{]t
" 1603 1603 for yo=u to for yo=u= to
FJOHNSO 786 786 of my w{ill to{] of my w[{ill to{]
" 3747 3747 (^li mer s[{t.^) ; and{] (^li mer s[{t.{]^) [{; and{]
" 4841 4841 (^d Fl.) (^d Fl.^)
" 6329 6329 [\274. SABINE JOHNSON TO JOHN JOHNSON}] [\274. SABINE JOHNSON TO JOHN JOHNSON\]
" 9278 9274 (^lb^ ) (^lb^)
FLEYCEST 4043 4043 [{is\] [{is{]
" 6006–8 6006–8 [\I dare make
<P 342>
none of my servants
[\I dare make CROSSED OUT\]
<P 342>
[\none of my servants
FOSBORNE 1461 1461 For M=rs Painter For M=rs= Painter
FOXINDEN 4131 4131 [} CLXXI THOMAS BARROW [} [\CLXXI THOMAS BARROW
FPASTON 923 944 [\?\]ch[{...{] [\?\] ch[{...{]
" 18099 18448 [{ [\582. FROM FRIAR JOHN BRACKLEY [} [\582. FROM FRIAR JOHN BRACKLEY
FPEPYS 4100 4100 [my sister [\my sister
FWILMOT 657 657 the w=ch I have the w=ch= I have
CEECE FADDISON 1486 1486 3O=th= July 30=th= July Capital O to zero
FBANKS 2110 2110 [torn] [\TORN\]
FBOWREY 1044 1044 Colkers \CAULKERS\] Colkers [\CAULKERS\]
FBURNEYF 677–9 677–9 [\2 1/2 ILLEGIBLE
<P III,188>
LINES\]
[\2 1/2 ILLEGIBLE...\]
<P III,188>
[\...LINES\]
FDODSLEY 453–5 453–5 the foll[{y
<P 110>
of Noblemen
the foll[{y{]
<P 110>
[{of Noblemen
" 2782–4 2782–4 altercation about it,
<P 281>
except what might
altercation about it,\]
<P 281>
[\CROSSED OUT except what might
" 3117 3117 so[{und w=c{]h= so[{und w=c={]=h=
" 3433 3433 (mouldering^) (^mouldering^)
" 3949 3949 [{of W=k{]m= [{of W=k={]=m=
FDRAPER 1419 1419 (27^th October^) 27(^th October^)
FFLEMIN2 4010 4009 (notwithstanding those Provocations} (notwithstanding those Provocations)
" 4035 4034 19=th came safe 19=th= came safe
FLIDDELL 309 309 Capt[ain\] Capt[\ain\]
" 1007 1007 sist[\er] sist[\er\]
" 1598 1598 l(eaves\] l[\eaves\]
" 2785 2785 June l0th June 10th Lower case L to one
FPAUPER 263 263 (BERMONDSEY, LONDON} (BERMONDSEY, LONDON)
FPRIDEA2 903 903 E[\arl\l] E[\arl\]
FPUREFOY 3802 3802 [^SIGN OMIITTED^] [^SIGN OMITTED^]
FSANCHO 1759 1759 [October 17, 1779.\] [\October 17, 1779.\]
" 2075 2075 M\inorit\]y M[\inorit\]y
FSWIFT 2666, 2709, 2823, 3031, 3304, 3405, 3448, 3535, 3580, 3714, 3819, 3910, 4232, 4348, 4380, 4581 2722, 2767, 2884, 3096, 3372, 3476, 3521, 3611, 3658, 3795, 3904, 3998, 4326, 4445, 4479, 4685 [^FROM ELIZABETH
BERKELEY^]
Added to the end of the line
FYOUNG 1424–6 1424–6 [\STRICKEN
<P 132>
PHRASE\]
[\STRICKEN...\]
<P 132>
[\...PHRASE\]
CEECSU FBACOND 299–301 299–301 [\ELEVEN
<P 91>
HOURS\]
[\ELEVEN...\]
<P 91>
[\...HOURS\]
FFACTORY 1501, 1685 1501, 1685 wacadash, (\wacadash\) ,
" 1869–70 1869–70 c'nto
per c'nto,
(\c'nto
per c'nto\) ,
" 2153 2153 Angin Sama's (\Angin Sama's\)
" 7111 7109 catabera (\catabera\)
" 7387–9 7385–7 [\4CM
<P 379>
MISSING\]
[\4CM...\]
<P 379>
[\...MISSING\]
" 7574 7572 contors, (\contors\) ,
" 8392 8390 cataberas (\cataberas\)
" 10735 10733 ditto (\ditto\)
" 10930 10928 pancado, (\pancado\) ,
" 11109 11107 vizt (\vizt\)
" 14056 14052 <normalised orig="prowe" auto="true">prow</normalised>, (\prowe\) ,
" 16817 16813 (\ocome\) ocome
" 17260, 17348 17256, 17344 Umbera's (\Umbera's\)
" 18147 18143 barsos (\barsos\)
" 18453 18449 deposseta (\deposseta\)
FSYMCOTT 324 331 (\fieri\). (\fieri\) .

Custom codes

Unlike other collections in CEEC-400, the Bacon and Willoughby collections in the CEEC contained custom codes that had not been converted into COCOA text-level coding:

Custom code | Meaning
[you] | Inserted words (only in Bacon)
[so [[done]] till the afternone] | Words inserted within an insertion (only in Bacon)
{hard for} | Deleted words

The custom codes were converted into COCOA codes. The conversion was carried out partly automatically and partly manually. We omit the complete conversion tables and only give a few examples so the reader will get the idea:

Custom code | COCOA
river[ward] | riverward [\ward INSERTED\]
{[hym]} | [\hym INSERTED THEN DELETED\]
land{es} | land [\FINAL es DELETED\]
ha{ve}d | had [\have OVERWRITTEN\]
[Whereas it pleased your Lordship to direct your letters to Mr Sprat for {the puttinge} [[omyttinge]] James Tavernor {of} [[to be of]] the jury at Wighton, which was executed accordinglie, I have sithens {exam} inquired what] | Whereas it pleased your Lordship to direct your letters to Mr Sprat for [\the puttinge DELETED\] omyttinge [\omyttinge INSERTED\] James Tavernor [\of DELETED\] to be of [\to be of INSERTED\] the jury at Wighton, which was executed accordinglie, I have sithens [\exam DELETED\] inquired what [\Whereas it pleased ... inquired what INSERTED\]
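The automatic part of the conversion can be pictured with a small Python sketch. It is only an illustration and not the procedure actually used: it covers just the two simplest patterns from the examples above (a standalone deletion in curly brackets and an insertion in double square brackets), and the function name convert_simple_custom_codes is hypothetical. Word-internal cases such as land{es} or ha{ve}d, and combined cases such as {[hym]}, are deliberately left untouched, since they require the kind of judgement that was exercised manually.

```python
import re

# Standalone deletion: {words} with whitespace (or a line edge) on both sides.
DELETED = re.compile(r"(?<!\S)\{([^{}\[\]]+)\}(?!\S)")
# Insertion within an insertion: [[words]].
NESTED_INSERTED = re.compile(r"\[\[([^\[\]]+)\]\]")


def convert_simple_custom_codes(text: str) -> str:
    """Rewrite the two simplest custom codes as COCOA comments."""
    # {words} becomes [\words DELETED\]
    text = DELETED.sub(
        lambda m: "[\\" + m.group(1) + " DELETED\\]", text)
    # [[words]] becomes words [\words INSERTED\]
    text = NESTED_INSERTED.sub(
        lambda m: m.group(1) + " [\\" + m.group(1) + " INSERTED\\]", text)
    return text


fragment = ("for {the puttinge} [[omyttinge]] James Tavernor "
            "{of} [[to be of]] the jury")
print(convert_simple_custom_codes(fragment))
```

Applied to the fragment from the last example row, the sketch reproduces the corresponding part of the COCOA column.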

Reversion of normalisation

The pre-conversion process resulted in a new hybrid corpus of normalised and non-normalised collections. The normalised collections still had to be reverted to their non-normalised forms before the corpus could be converted into XML. The reversion was performed by ‘VardStripper’, a Java application written by Lassi Saario. In what follows, the ‘original CEEC-400’ will refer to the corpus as it was at this point, after the reversion and before the conversion.
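The idea behind the reversion can be sketched in a few lines of Python. This is not the VardStripper code, which is a Java application and is not reproduced here; the sketch simply assumes that every normalised word is wrapped in a normalised tag of the form seen in the correction tables above, with the original spelling in the orig attribute. The function name strip_normalisation is hypothetical.

```python
import re

# A normalised word with its original spelling in the 'orig' attribute.
NORMALISED = re.compile(
    r'<normalised\s+orig="([^"]*)"[^>]*>.*?</normalised>', re.DOTALL)


def strip_normalisation(text: str) -> str:
    """Replace each normalised element with its original spelling."""
    # A function is used as the replacement so that backslashes
    # in the original spelling are copied literally.
    return NORMALISED.sub(lambda m: m.group(1), text)


line = ('office <normalised orig="beinge" auto="true">being</normalised>'
        ' {the}')
print(strip_normalisation(line))  # office beinge {the}
```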

XML conversion

The XML schema of CEEC-400 has long roots. Before we got funding for the XML conversion of the entire CEEC-400, we had already converted a version of the CEECE-norm as part of the POS tagging project that resulted in the TCEECE. The XML schema of CEEC-400 was based on that of the TCEECE, which had been based on that of the Helsinki Corpus, which had in turn been based on the TEI standard.

The conversion from COCOA into XML was performed by our own ‘XmlConverter’, a Java application written by Lassi Saario. The resulting XML files were validated against the Document Type Definition (DTD) with XMLStarlet (version 1.6.1), a command-line XML toolkit developed by Mikhail Grushinskiy.
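A validation run of this kind could be scripted roughly as follows. The sketch assumes that XMLStarlet is installed under the executable name xmlstarlet (on some systems it is called xml) and uses its val command with the --dtd and --err options; the exact options used in the project are not recorded here, and the file names and function names are hypothetical.

```python
import subprocess
from typing import List


def build_validation_command(xml_path: str, dtd_path: str,
                             executable: str = "xmlstarlet") -> List[str]:
    """Build an XMLStarlet command that validates one file against a DTD."""
    return [executable, "val", "--dtd", dtd_path, "--err", xml_path]


def validate(xml_path: str, dtd_path: str) -> bool:
    """Run the validation; an exit status of 0 is taken to mean 'valid'."""
    result = subprocess.run(build_validation_command(xml_path, dtd_path))
    return result.returncode == 0


# Only the command is printed here, so the example runs
# even where XMLStarlet is not installed.
print(" ".join(build_validation_command("FFOX.xml", "CEEC.dtd")))
```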

Structure of the corpus

The original corpus was divided into text files, one for each letter collection. We decided to preserve this division. Each COCOA-encoded collection file was converted into an XML-encoded file of the same name.

In the original corpus, each collection file consisted of a header followed by the individual letters. Each letter likewise consisted of a header followed by its contents. The XML version follows the same overall structure, illustrated below.

Each XML document begins with the same prolog. The first line specifies the XML version and the character encoding. The document type declaration that follows refers to an external DTD file. The entities are also declared internally, since omitting the internal declarations would cause errors in browsers that do not support external DTDs.

Each document has teiCollection as its root element. It is made up of a teiHeader element, containing header information about the collection, and a series of TEI elements, representing the individual letters. Each TEI element is likewise made up of a teiHeader element which includes header information about the letter, and a text element which includes the actual contents of the letter.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE teiCollection SYSTEM "../CEEC.dtd" [
	<!ENTITY ETH "&#208;">
	<!ENTITY eth "&#240;">
	<!ENTITY YOGH "&#540;">
	<!ENTITY yogh "&#541;">
	<!ENTITY THORN "&#222;">
	<!ENTITY thorn "&#254;">
	<!ENTITY pound "&#163;">
]>
<teiCollection xml:id="FFOX">
	<teiHeader>...</teiHeader>
	<TEI xml:id="FOX_001">
		<teiHeader>...</teiHeader>
		<text type="letter" xml:lang="eng">...</text>
	</TEI>
	<TEI xml:id="FOX_002">
		<teiHeader>...</teiHeader>
		<text type="letter" xml:lang="eng">...</text>
	</TEI>
	...
</teiCollection>
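As a hedged illustration of how this structure can be traversed, the following Python sketch reads a reduced skeleton of a collection file with the standard library and lists the identifiers. The document type declaration and the entity declarations are left out of the sample for brevity, so a real corpus file may need a parser set up to deal with them; the function name letter_ids is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = b"""<teiCollection xml:id="FFOX">
  <teiHeader/>
  <TEI xml:id="FOX_001">
    <teiHeader/>
    <text type="letter" xml:lang="eng">...</text>
  </TEI>
  <TEI xml:id="FOX_002">
    <teiHeader/>
    <text type="letter" xml:lang="eng">...</text>
  </TEI>
</teiCollection>"""


def letter_ids(xml_bytes: bytes):
    """Return the collection identifier and the identifiers of its letters."""
    root = ET.fromstring(xml_bytes)
    letters = [tei.get(XML_ID) for tei in root.findall("TEI")]
    return root.get(XML_ID), letters


collection, letters = letter_ids(SAMPLE)
print(collection, letters)
```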

Parameter coding

In the original CEEC-400, each file begins with the identifier of the collection (same as the file name), followed by source information:

<B FFOX>

[^LETTERS OF RICHARD FOX 1486-1527. EDITED BY P. S. AND H. M.
ALLEN. OXFORD: CLARENDON PRESS. 1929.^]

In the XML conversion, the identifier is put in the xml:id attribute of the teiCollection opening tag, and the source information is included in a titleStmt element in the teiHeader:

<teiCollection xml:id="FFOX">
	<teiHeader>
		<fileDesc>
			<titleStmt>LETTERS OF RICHARD FOX 1486-1527. EDITED BY P. S. AND H. M. ALLEN. OXFORD: CLARENDON PRESS. 1929.</titleStmt>
		</fileDesc>
	</teiHeader>
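The mapping from the collection header to the XML header can be sketched in Python as follows. This is an illustration only, not the XmlConverter code (which is written in Java); the function name convert_collection_header is hypothetical, and the sketch assumes that the header contains exactly one B-line and one bracketed source description.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def convert_collection_header(cocoa: str) -> str:
    """Turn a COCOA collection header into the opening of the XML file."""
    identifier = re.search(r"<B (\S+)>", cocoa).group(1)
    source = re.search(r"\[\^(.*?)\^\]", cocoa, re.DOTALL).group(1)
    # The source description may run over several lines; join them.
    source = " ".join(source.split())
    return "\n".join([
        f"<teiCollection xml:id={quoteattr(identifier)}>",
        "\t<teiHeader>",
        "\t\t<fileDesc>",
        f"\t\t\t<titleStmt>{escape(source)}</titleStmt>",
        "\t\t</fileDesc>",
        "\t</teiHeader>",
    ])


header = """<B FFOX>

[^LETTERS OF RICHARD FOX 1486-1527. EDITED BY P. S. AND H. M.
ALLEN. OXFORD: CLARENDON PRESS. 1929.^]
"""
print(convert_collection_header(header))
```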

A letter header in the original CEEC-400 consists of an L-line, a Q-line, an X-line and a P-line:

<L FOX_001>
<Q A 1497? T RFOX>
<X RICHARD FOX>
<P 17>
  • The L-line gives the letter identifier.
  • The Q-line specifies the authenticity of the letter, the year of writing, the relationship between the writer of the letter and the addressee, and the identifier of the writer, respectively.
  • The X-line contains the name of the writer in full. (In some collections there is an A-line instead of an X-line, but the content is the same nevertheless.)
  • The P-line includes the number of the page on which the letter begins in the source edition. Similar lines appear amidst the body whenever the page changes.
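The four fields of the Q-line can be picked apart with a few lines of Python. The sketch below is illustrative only; the names QLine and parse_q_line are hypothetical, and it assumes that the fields are separated by whitespace and never contain spaces themselves. The year is kept as a string because it may carry a question mark, as in the example above.

```python
from typing import NamedTuple


class QLine(NamedTuple):
    authenticity: str
    year: str
    relationship: str
    writer_id: str


def parse_q_line(line: str) -> QLine:
    """Split a Q-line such as <Q A 1497? T RFOX> into its four fields."""
    line = line.strip()
    if not (line.startswith("<") and line.endswith(">")):
        raise ValueError(f"not a parameter line: {line!r}")
    fields = line[1:-1].split()
    if len(fields) != 5 or fields[0] != "Q":
        raise ValueError(f"not a well-formed Q-line: {line!r}")
    return QLine(*fields[1:])


q = parse_q_line("<Q A 1497? T RFOX>")
print(q.year, q.writer_id)  # 1497? RFOX
```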

The contents of the lines are included in the XML header as follows:

<TEI xml:id="FOX_001">
<!-- from the L-line -->
	<teiHeader>
		<fileDesc>
			<titleStmt>
				<title key="A 1497? T RFOX"></title>
				<!-- from the Q-line -->
				<author key="RICHARD FOX"></author>
				<!-- from the X- (or A-) line -->
			</titleStmt>
		</fileDesc>
	</teiHeader>
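The same mapping can be expressed as a short Python sketch. As before, this is an illustration under stated assumptions rather than the converter itself: the function name convert_letter_header is hypothetical, an A-line is treated exactly like an X-line, and the P-line is ignored because it belongs to the body.

```python
import re
from xml.sax.saxutils import quoteattr

# L-, Q-, X- and A-lines; P-lines are deliberately not matched.
HEADER_LINE = re.compile(r"^<([LQXA]) (.+)>$")


def convert_letter_header(lines):
    """Build the opening TEI tag and teiHeader from a COCOA letter header."""
    values = {}
    for line in lines:
        match = HEADER_LINE.match(line.strip())
        if match:
            code, content = match.groups()
            values["X" if code == "A" else code] = content
    return "\n".join([
        f"<TEI xml:id={quoteattr(values['L'])}>",
        "\t<teiHeader>",
        "\t\t<fileDesc>",
        "\t\t\t<titleStmt>",
        f"\t\t\t\t<title key={quoteattr(values['Q'])}></title>",
        f"\t\t\t\t<author key={quoteattr(values['X'])}></author>",
        "\t\t\t</titleStmt>",
        "\t\t</fileDesc>",
        "\t</teiHeader>",
    ])


cocoa_header = ["<L FOX_001>", "<Q A 1497? T RFOX>",
                "<X RICHARD FOX>", "<P 17>"]
print(convert_letter_header(cocoa_header))
```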

The P-line is included in the XML body along with the other P-lines. See the section on page numbers.

The S-lines like <S SAMPLE 1> that sometimes occur between letters mark samples taken from different source editions in the original corpus. They have been converted into XML comments like <!-- SAMPLE 1 -->.

Text-level coding

Textual structure

A letter body in the original CEEC-400 is divided into lines, the maximum length of which is limited to 65 characters. Some of them are P-lines that annotate page breaks; for the rest there is no fixed format. Paragraphs and sentences flow rather freely from one line to another along with code brackets for headings, emendations etc. See the example below.

[} [\705 FROM MARY HOWE TO MR RANKING IN COOPERSALE (THEYDON 
GARNON), 13 JANUARY 1731\] }]
Jenaw 13 day 1731
Mr ranking this is to let you know that the doxtor have done 
what he can for me but my iees are never the better but rather
wors i ame to be discharge=d= next wandsday i 
hope you will be so kind as to send me word how i must come home
by next wandsday morning so with humble sarvis to you and your 
good wife
   sir I hope you will exquese me in wrighting of a letter but i
did not know no other way So i rest your humble sarvant
   mary how patient in
[\CONTINUED CROSSWISE IN LEFT-HAND MARGIN\] peter ward

Unfortunately, the line and page divisions that are so explicit in the original CEEC-400 are irrelevant for the purposes of linguistic research, especially as they do not reflect those of the original manuscript. The paragraph division is much more relevant, but also much more implicit. Lines that start paragraphs are usually indented with three spaces, but not always: sometimes the only clue that a line starts a paragraph is that the previous line is shorter than usual. Matters are further complicated by the fact that P-lines sometimes appear in the middle of a paragraph and sometimes between paragraphs. Code brackets often appear in the middle of a paragraph, but sometimes they continue across paragraph breaks, and sometimes they even seem to form paragraphs of their own.

Recognising such delicate divisions may be easy for the human eye, but it is far from easy for a computer. We wanted to try it anyway. The rule of thumb that we gave to our converter is that a line starts a new paragraph if it is indented or if the previous line is shorter than 40 characters. Lines recognised as forming a paragraph were then merged and, provided that the paragraph in question was indeed a proper paragraph (as opposed to a single P-line or a stretch of code), put inside a p element. Code bracket sequences that continued across paragraph or page breaks were split at the breaking point in order to keep the element tree well-formed.
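The rule of thumb can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual converter: the function name is hypothetical, the length of the previous line is measured without trailing whitespace, and the handling of P-lines and code brackets is left out entirely.

```python
def split_paragraphs(lines, short=40):
    """Group lines into paragraphs using the rule of thumb described above.

    A line starts a new paragraph if it is indented or if the previous
    line is shorter than `short` characters.
    """
    paragraphs = []
    previous = None
    for line in lines:
        starts_new = (
            previous is None                     # very first line
            or line.startswith((" ", "\t"))      # indented line
            or len(previous.rstrip()) < short    # previous line was short
        )
        if starts_new:
            paragraphs.append([])
        paragraphs[-1].append(line.strip())
        previous = line
    # Merge the lines of each paragraph into one string.
    return [" ".join(paragraph) for paragraph in paragraphs]
```

Each returned string would then be wrapped in a p element if it turned out to be a proper paragraph.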

Special characters

Grave accent symbols that annotated accents (not only grave but acute ones and circumflexes as well) and tildes that annotated abbreviations in the original CEEC-400 remain in the XML edition. Certain special characters have been converted into XML entities according to the following table.

Source edition   Original corpus   XML corpus   Description
&                &                 &amp;        ampersand
Ð                +D                &ETH;        upper case eth
ð                +d                &eth;        lower case eth
Ȝ                +G                &YOGH;       upper case yogh
ȝ                +g                &yogh;       lower case yogh
Þ                +T                &THORN;      upper case thorn
þ                +t                &thorn;      lower case thorn
£                +L                &pound;      pound sign
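The table can be applied in a single pass, as in the following Python sketch. The names ENTITIES and convert_special are hypothetical. A single regular expression is used so that the ampersands of freshly produced entities are not escaped a second time.

```python
import re

# Original-corpus code -> XML entity, taken from the table above.
ENTITIES = {
    "&": "&amp;",
    "+D": "&ETH;",
    "+d": "&eth;",
    "+G": "&YOGH;",
    "+g": "&yogh;",
    "+T": "&THORN;",
    "+t": "&thorn;",
    "+L": "&pound;",
}

# Match any of the codes; other uses of "+" are left untouched.
SPECIAL = re.compile(r"\+[DdGgTtL]|&")


def convert_special(text):
    """Replace every special-character code with its XML entity."""
    return SPECIAL.sub(lambda match: ENTITIES[match.group(0)], text)
```

Note that only &amp; is predefined in XML; the other entity names have to be declared somewhere for a parser to resolve them, and how that is done is outside the scope of this sketch.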

Page numbers

Page changes were annotated as P-lines, e.g. <P 45>, in the original CEEC-400. They are converted into empty pb elements, the n attribute of which contains the page number, e.g. <pb n="45" />. Note that pb elements may appear inside as well as outside p elements.
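As a minimal Python sketch, a P-line can be recognised and converted as follows. The function name is hypothetical, and the element is written as a self-closing tag, which is an assumption made here so that the output is well-formed XML.

```python
import re
from xml.sax.saxutils import quoteattr

# A P-line consists of nothing but the code and the page number.
P_LINE = re.compile(r"^<P (.+)>$")


def convert_page_line(line):
    """Return a pb element for a P-line, or None for any other line."""
    match = P_LINE.match(line.strip())
    if match is None:
        return None
    return f"<pb n={quoteattr(match.group(1))} />"
```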

Headings

Headings, annotated with the code [}...}] in the original CEEC-400, are annotated with the code <head>...</head> in the XML edition.

Note that most headings have been added by either editors or compilers, i.e. they are double-coded, as in

[} [\98. TO FANNY BURNEY\] }]

where the inner brackets stand for the editor’s or compiler’s remark. The double-coding is preserved in the XML conversion so that the given example is converted into

<head> <note resp="editor" value="98. TO FANNY BURNEY" /> </head>

See the section on comments for more information.

Emendations

Emendations are annotated with the code [{...{] in the original CEEC-400. These have been converted into supplied elements in the XML version. When the emendation consists of complete words, we simply put the content of the brackets in between the XML tags. For example, in the following passage,

I turned so sick that I [{could{] hardly speak

the [{could{] is converted into

<supplied>could</supplied>

The case is a bit trickier when the emendation contains partial words, as in w[{ife and son{]. We have converted emendations of this kind so that the final emended expression appears in between the XML tags, while the original code is included in an orig attribute:

<supplied range="1,10" orig="w[{ife and son{]">wife and son</supplied>

The ‘range’ attribute specifies the extent of the emendation. The expression in between the XML tags is indexed so that the first character has the index 0, the second has the index 1 etc. The first number in the range value is the index of the first character in the range, and the second number is the index of the first character not in the range. Note that whitespace characters are not counted here. When there are several ranges in the same expression, they are delimited by a semicolon:

<supplied range="7,13;15,16" orig="felysch[{yp of Ho{]ll[{a{]ndars">felyschyp of Hollandars</supplied>
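The computation of the range value can be sketched in Python as follows. The function names are hypothetical and the sketch is not the actual converter. It walks through the original expression, strips the code brackets, and counts only non-whitespace characters when recording where each range starts and ends.

```python
from xml.sax.saxutils import escape, quoteattr


def encode_partial(orig, open_code="[{", close_code="{]"):
    """Return (plain text, range value) for an expression with partial codes."""
    text = []
    ranges = []
    index = 0     # number of non-whitespace characters emitted so far
    start = None  # index at which the currently open range began
    i = 0
    while i < len(orig):
        if start is None and orig.startswith(open_code, i):
            start = index
            i += len(open_code)
        elif start is not None and orig.startswith(close_code, i):
            # The end is the index of the first character not in the range.
            ranges.append(f"{start},{index}")
            start = None
            i += len(close_code)
        else:
            character = orig[i]
            text.append(character)
            if not character.isspace():
                index += 1
            i += 1
    return "".join(text), ";".join(ranges)


def supplied_element(orig):
    """Build a supplied element for an emendation with partial words."""
    text, ranges = encode_partial(orig)
    return (
        f"<supplied range={quoteattr(ranges)} orig={quoteattr(orig)}>"
        f"{escape(text)}</supplied>"
    )
```

Because the opening and closing codes are parameters, the same function also covers codes whose two delimiters are identical, e.g. encode_partial("w=ch=", "=", "=").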

Comments

The original version of CEEC-400 contains two types of comments added to the body text. Comments by compilers of the corpus are annotated with the code [^...^], e.g. [^LIST OF NAMES OMITTED^]. Comments by editors of source editions are annotated with the code [\...\], e.g. [\TORN\].

In the TEI XML edition of the Helsinki Corpus, both codes are converted into a note element. The author of the comment is specified by a resp attribute which points to his/her name in the document header. For our purposes, however, it is sufficient to separate the editors’ comments from the compilers’ and not specify the individual commentator. The attribute is simply given the value compiler for compilers’ comments and editor for editors’ comments.

Comments, whether they are written by editors or compilers, are actually used for two different purposes. One is a ‘proper’ comment, such as the two previous examples. The other is more like an emendation, as in

an order [\was made\] at his Lordship's instance

The difference between the two kinds is that a proper comment is a comment about the surrounding text, whereas an emendation-like comment is more like a part of the text. They are rather easily distinguished by the fact that a proper comment usually involves several consecutive upper case letters while an emendation-like comment does not. This holds true even when a proper comment contains text that is parallel to the preceding text, as in

it will [\be DELETED\] come safe hither

When the brackets contain a proper comment, we put their contents in an attribute inside an XML tag:

<note resp="compiler" value="LIST OF NAMES OMITTED" />
<note resp="editor" value="be DELETED" />

When the brackets are used to annotate emendations, we follow the same principle as with the [{...{] code explained above. Emendations of complete words are encoded as

<note resp="editor">was made</note>

whereas emendations that contain partial words are encoded as

<note resp="editor" range="3,7" orig="Jan[\uary\]">January</note>
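The distinction between the two kinds of comments can be sketched in Python as follows. The names are hypothetical, and the threshold of two consecutive upper case letters is an assumption standing in for "several". The sketch only covers comments that consist of complete words; comments containing partial words, such as Jan[\uary\], would need the range treatment described for emendations.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# [^...^] is a compiler's comment, [\...\] an editor's comment.
COMMENT = re.compile(r"\[([\^\\])(.*?)\1\]")
RESP = {"^": "compiler", "\\": "editor"}

# Assumption: two or more consecutive capitals signal a proper comment.
UPPER_RUN = re.compile(r"[A-Z]{2,}")


def convert_comment(content, resp):
    """Convert the content of one comment into a note element."""
    if UPPER_RUN.search(content):
        # Proper comment: the content goes into the value attribute.
        return f"<note resp={quoteattr(resp)} value={quoteattr(content)} />"
    # Emendation-like comment: the content stays in the running text.
    return f"<note resp={quoteattr(resp)}>{escape(content)}</note>"


def convert_comments(text):
    """Convert every whole-word comment found in a stretch of text."""
    return COMMENT.sub(
        lambda match: convert_comment(match.group(2), RESP[match.group(1)]),
        text,
    )
```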

Type changes

Changes of typeface in the printed source editions were annotated as (^...^) in the original corpus. In the XML edition, they are annotated as <hi rend="type">...</hi>. (This usually corresponds to an underlined passage in the original letter.)

When the change of typeface concerns partial words, as in Theo(^log^), the original coding is preserved in the orig attribute of the hi element, as in

<hi rend="type" range="4,7" orig="Theo(^log^)">Theolog</hi>

Foreign language

Passages in a foreign language were annotated with the code (\...\) in the original CEEC-400. In the XML edition, they are annotated with the code <foreign>...</foreign>.
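For codes that simply wrap complete words, the conversion amounts to replacing one pair of delimiters with another. The following Python sketch shows this for headings, type changes and foreign-language passages. The names are hypothetical, and the sketch assumes that the codes are not nested within themselves and do not cut through words; comments inside headings are left for a separate step.

```python
import re

# (pattern for the original code, replacement template for the XML edition)
SIMPLE_CODES = [
    (re.compile(r"\[\}(.*?)\}\]", re.S), r"<head>\1</head>"),
    (re.compile(r"\(\^(.*?)\^\)", re.S), r'<hi rend="type">\1</hi>'),
    (re.compile(r"\(\\(.*?)\\\)", re.S), r"<foreign>\1</foreign>"),
]


def convert_simple_codes(text):
    """Replace whole-word heading, type-change and foreign-language codes."""
    for pattern, template in SIMPLE_CODES:
        text = pattern.sub(template, text)
    return text
```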

Superscripts

Superscripts in the original corpus were enclosed between two equals signs, as in
=vi=
w=ch=
p=r=ferm=t=

In the XML corpus, they are annotated as

<hi rend="sup">vi</hi>
<hi rend="sup" range="1,3" orig="w=ch=">wch</hi>
<hi rend="sup" range="1,2;6,7" orig="p=r=ferm=t=">prfermt</hi>
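A user of the XML corpus may want to go in the opposite direction and recover the marked characters from the range attribute. The following Python sketch shows one way of doing that; the function name is hypothetical. It relies on the convention that whitespace is not counted and that the second number of each pair is the index of the first character not in the range.

```python
def decode_ranges(text, range_value):
    """Return the substrings of `text` covered by a range attribute value."""
    spans = [
        tuple(int(number) for number in part.split(","))
        for part in range_value.split(";")
    ]
    # Position in the string of each non-whitespace character, in order.
    positions = [
        position for position, character in enumerate(text)
        if not character.isspace()
    ]
    pieces = []
    for start, end in spans:
        first = positions[start]
        last = positions[end - 1]  # last character that is in the range
        pieces.append(text[first:last + 1])
    return pieces
```

For instance, decode_ranges("prfermt", "1,2;6,7") picks out the two superscript letters of the last example above.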

Post-processing

There were two abbreviations in the Dodsley collection where the superscript code had been used to encode a superscript inside another superscript. These instances had to be hard-coded into the converter in order for them to be converted correctly; otherwise the converter would have taken each of them to consist of two consecutive superscripts:

COCOA       XML
=Jun=r=.=   <hi rend="sup" range="0,5" orig="=Jun=r=.="><hi rend="sup" range="3,4" orig="Jun=r=.">Junr.</hi></hi>
=Jun=r==    <hi rend="sup" range="0,4" orig="=Jun=r=="><hi rend="sup" range="3,4" orig="Jun=r=">Junr</hi></hi>

Endpoint

At the time of writing, we are adapting the new XML corpus for our own CQPweb server in collaboration with Lancaster University. Once the remaining (15th-century) collections have been normalised, we will consider the possibility of converting the normalised CEEC-400 into an even more comprehensive XML version that would provide access to both the original and normalised variants.

References

For more information on the CEEC corpora, see the front page.