Key to the normalisation of TCEECE

Normalisation was carried out in two cycles. The first cycle was performed on the standardised (‘VARDed’) version of the CEECE. The second cycle was performed on the output of the first, so its searches and replacements operated on text that had already been through the first cycle.

First cycle

Here, the number of replacements is the number of instances that were normalised in the CEECE. The files also include instances from the CEECSU; these have been subtracted from the figures given here.

Code name  File                                            Number of replacements
pre1       abbreviations-ceece+ceecsu_MH.xlsx              5,710
pre2       ceece+ceecsu_v-for-u-replacements.xls           4
pre3       ceece+ceecsu_verbs-d+t-replacements2_MH.xls     878
pre4       ceece+ceecsu_ye_article-only-replacements2.xls  1,229

Second cycle

Each ‘code name’ corresponds to one search in the corpus. The ‘regex’ is the regular expression that the search was performed with. The number of replacements states how many of the hits were normalised; the remaining ones were discarded as irrelevant.

Code name  Number of hits  Number of replacements  Regex

Group: Abbreviations
abbr1      1,826           1,313                   (?i)\b(Jan|Janr|Janry|Jany|Feb|Febr|Febry|Feby|Febuary|Feburary|Mar|Ap|Apr|Aprl|Apprill|Jun|Jul|Jully|Aug|Augoust|Augst|Augt|Augu|Sep|Sepbr|Sepr|Sept|Septb|Septbr|Septem|Septr|7br|Oct|Octbr|Octeber|Octo|Octob|Octobr|Octr|Nov|Novebr|Novemb|Novemr|Novm|Novmber|Novr|9br|Dec|Decbr|Decem|Decemb|Decemr|Decr)\b[\.:]?
abbr2      729             557                     (?i)\b(Hond|Honble|Sr|Capt|shod|cd|wld|wd|sh|cod|yt)\b[\.:]?
abbr3      155             99                      (?i)\b(aaffectionate|afectionate|afeitionate|aff|affe|affec|affecate|affect|affectionet|affectionett|affectionte|affectonate|affecttionate)\b[\.:]?
Total      2,710           1,969

Group: Redundant punctuation
punc1      550             363                     (?i)\b(January|February|March|April|May|June|July|August|September|October|November|December)\b[\.:]
punc2      2,429           1,686                   (?i)\b(Sir|Lord|Captain|Princess|Duchess|Brother|Cousin|dear|affectionate|yours|your|which|that|would|could|should|the)\b[\.:]
punc3      98              98                      (?i)\d+(st|nd|rd|th)[\.:] \d+
Total      3,077           2,147

Group: Indefinite pronouns and adverbs
indef1     1,952           1,861                   (?i)\b((every|some|any|no)[ \-](body|thing)|(every|some|any)[ \-]one)\b
indef2     5               5                       (?i)\b(evry)[ \-]?(body|thing|one)\b
indef3     17              8                       (?i)\b(every|some|any|no)[ \-]?(bodey|bodie|boddey|think|thin)\b
indef4     6               6                       (?i)\b(every|evry|some|any|no)[ \-]?th\.
indef5     62              17                      (?i)\b(every|evry|some|any|no)[ \-]?(body|bodey|bodie|boddey|thing|think|thin|one)s\b
indef6     116             115                     (?i)\b((every|some|any|no)[ \-]where|some[ \-]times)\b
Total      2,158           2,012

Group: Reflexive pronouns
refl1      1,242           1,242                   (?i)\b((my|your|him|her|it|one)[ \-]self|(our|your|them)[ \-]selves)\b
refl2      30              26                      (?i)\b(thy|you|his|herr|its|ones|one's|won's|their)[ \-]?(self|selves)\b
refl3      13              12                      (?i)\b(my|your|him|her|it|one|our|them)[ \-]?(selfe|selffe?|selfes)\b
Total      1,285           1,280

Group: Miscellaneous
misc1      220             220                     \b(mon|tues|wednes|thurs|fri|satur|sun)days?\b
misc2      41              35                      \b(january|february|april|june|july|august|september|october|november|december)\b
misc3      1,512           387                     (?i)\b(dos|don|cant|wont)\b
misc4      800             793                     (?i)\btho\b[\.:]?
misc5      114             44                      \bi\b
misc6      324             316                     \bLie\b[\.:]?
Total      3,011           1,795
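
As a rough illustration of how a search yields hits, the sketch below runs three of the regexes from the table (indef1, punc3 and misc5) with Python's re module. The sample sentence and the helper name find_hits are invented for this example and are not taken from the corpus or from the original workflow; the tool actually used for the searches is not stated here. The sketch only finds hits. Deciding which hits to normalise and which to discard as irrelevant remained a manual step.

```python
import re

# Regexes copied verbatim from the table above.
INDEF1 = re.compile(r"(?i)\b((every|some|any|no)[ \-](body|thing)|(every|some|any)[ \-]one)\b")
PUNC3 = re.compile(r"(?i)\d+(st|nd|rd|th)[\.:] \d+")
MISC5 = re.compile(r"\bi\b")  # no (?i): this search is case-sensitive


def find_hits(pattern, text):
    """Return the matched strings for one search, in order of occurrence.

    Hypothetical helper for this sketch only.
    """
    return [m.group(0) for m in pattern.finditer(text)]


# Invented sample sentence, not a quotation from the CEECE.
sample = ("I saw no body there, and i hear Every thing was sent "
          "on the 12th. 1765 by some one.")

indef_hits = find_hits(INDEF1, sample)
punc_hits = find_hits(PUNC3, sample)
misc5_hits = find_hits(MISC5, sample)

print(indef_hits)   # ['no body', 'Every thing', 'some one']
print(punc_hits)    # ['12th. 1765']
print(misc5_hits)   # ['i']
```

Note that misc5 matches only the lower-case i and leaves the capitalised pronoun at the start of the sentence alone, because that regex carries no (?i) flag.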

The first and second cycles together amount to 17,024 replacements in total: 7,821 in the first cycle and 9,203 in the second.
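
The overall figure can be checked by adding up the replacement counts listed in this key. The short sketch below does so; the variable names are chosen for this example only, and the numbers are those given in the two tables.

```python
# Replacement counts from the first cycle, keyed by code name.
first_cycle = {"pre1": 5710, "pre2": 4, "pre3": 878, "pre4": 1229}

# Replacement counts from the second cycle, listed per group
# in the order of the code names within each group.
second_cycle = {
    "Abbreviations": [1313, 557, 99],
    "Redundant punctuation": [363, 1686, 98],
    "Indefinite pronouns and adverbs": [1861, 5, 8, 6, 17, 115],
    "Reflexive pronouns": [1242, 26, 12],
    "Miscellaneous": [220, 35, 387, 793, 44, 316],
}

first_total = sum(first_cycle.values())
group_totals = {group: sum(counts) for group, counts in second_cycle.items()}
second_total = sum(group_totals.values())
grand_total = first_total + second_total

print(first_total)   # 7821
print(group_totals)  # per-group totals, matching the 'Total' rows
print(second_total)  # 9203
print(grand_total)   # 17024
```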