Normalisation was carried out in two cycles. The first cycle was performed on the standardised (‘VARDed’) version of the CEECE. The second cycle was performed on the output of the first: its searches and replacements were made on a corpus that had already undergone the first cycle.
In the first cycle, the number of replacements counts only the instances that were normalised in the CEECE. The files also contain instances from the CEECSU; these have been subtracted from the figures given here.
Code name | File | Number of replacements |
---|---|---|
pre1 | abbreviations-ceece+ceecsu_MH.xlsx | 5,710 |
pre2 | ceece+ceecsu_v-for-u-replacements.xls | 4 |
pre3 | ceece+ceecsu_verbs-d+t-replacements2_MH.xls | 878 |
pre4 | ceece+ceecsu_ye_article-only-replacements2.xls | 1,229 |
Each ‘code name’ corresponds to one search in the corpus. The ‘regex’ is the regular expression that the search was performed with. The number of replacements states how many of the hits were normalised; the remaining ones were discarded as irrelevant.
Group | Code name | Regex | Number of hits | Number of replacements |
---|---|---|---|---|
Abbreviations | abbr1 | `(?i)\b(Jan\|Janr\|Janry\|Jany\|Feb\|Febr\|Febry\|Feby\|Febuary\|Feburary\|Mar\|Ap\|Apr\|Aprl\|Apprill\|Jun\|Jul\|Jully\|Aug\|Augoust\|Augst\|Augt\|Augu\|Sep\|Sepbr\|Sepr\|Sept\|Septb\|Septbr\|Septem\|Septr\|7br\|Oct\|Octbr\|Octeber\|Octo\|Octob\|Octobr\|Octr\|Nov\|Novebr\|Novemb\|Novemr\|Novm\|Novmber\|Novr\|9br\|Dec\|Decbr\|Decem\|Decemb\|Decemr\|Decr)\b[\.:]?` | 1,826 | 1,313 |
Abbreviations | abbr2 | `(?i)\b(Hond\|Honble\|Sr\|Capt\|shod\|cd\|wld\|wd\|sh\|cod\|yt)\b[\.:]?` | 729 | 557 |
Abbreviations | abbr3 | `(?i)\b(aaffectionate\|afectionate\|afeitionate\|aff\|affe\|affec\|affecate\|affect\|affectionet\|affectionett\|affectionte\|affectonate\|affecttionate)\b[\.:]?` | 155 | 99 |
Abbreviations | Total | | 2,710 | 1,969 |
Redundant punctuation | punc1 | `(?i)\b(January\|February\|March\|April\|May\|June\|July\|August\|September\|October\|November\|December)\b[\.:]` | 550 | 363 |
Redundant punctuation | punc2 | `(?i)\b(Sir\|Lord\|Captain\|Princess\|Duchess\|Brother\|Cousin\|dear\|affectionate\|yours\|your\|which\|that\|would\|could\|should\|the)\b[\.:]` | 2,429 | 1,686 |
Redundant punctuation | punc3 | `(?i)\d+(st\|nd\|rd\|th)[\.:] \d+` | 98 | 98 |
Redundant punctuation | Total | | 3,077 | 2,147 |
Indefinite pronouns and adverbs | indef1 | `(?i)\b((every\|some\|any\|no)[ \-](body\|thing)\|(every\|some\|any)[ \-]one)\b` | 1,952 | 1,861 |
Indefinite pronouns and adverbs | indef2 | `(?i)\b(evry)[ \-]?(body\|thing\|one)\b` | 5 | 5 |
Indefinite pronouns and adverbs | indef3 | `(?i)\b(every\|some\|any\|no)[ \-]?(bodey\|bodie\|boddey\|think\|thin)\b` | 17 | 8 |
Indefinite pronouns and adverbs | indef4 | `(?i)\b(every\|evry\|some\|any\|no)[ \-]?th\.` | 6 | 6 |
Indefinite pronouns and adverbs | indef5 | `(?i)\b(every\|evry\|some\|any\|no)[ \-]?(body\|bodey\|bodie\|boddey\|thing\|think\|thin\|one)s\b` | 62 | 17 |
Indefinite pronouns and adverbs | indef6 | `(?i)\b((every\|some\|any\|no)[ \-]where\|some[ \-]times)\b` | 116 | 115 |
Indefinite pronouns and adverbs | Total | | 2,158 | 2,012 |
Reflexive pronouns | refl1 | `(?i)\b((my\|your\|him\|her\|it\|one)[ \-]self\|(our\|your\|them)[ \-]selves)\b` | 1,242 | 1,242 |
Reflexive pronouns | refl2 | `(?i)\b(thy\|you\|his\|herr\|its\|ones\|one's\|won's\|their)[ \-]?(self\|selves)\b` | 30 | 26 |
Reflexive pronouns | refl3 | `(?i)\b(my\|your\|him\|her\|it\|one\|our\|them)[ \-]?(selfe\|selffe?\|selfes)\b` | 13 | 12 |
Reflexive pronouns | Total | | 1,285 | 1,280 |
Miscellaneous | misc1 | `\b(mon\|tues\|wednes\|thurs\|fri\|satur\|sun)days?\b` | 220 | 220 |
Miscellaneous | misc2 | `\b(january\|february\|april\|june\|july\|august\|september\|october\|november\|december)\b` | 41 | 35 |
Miscellaneous | misc3 | `(?i)\b(dos\|don\|cant\|wont)\b` | 1,512 | 387 |
Miscellaneous | misc4 | `(?i)\btho\b[\.:]?` | 800 | 793 |
Miscellaneous | misc5 | `\bi\b` | 114 | 44 |
Miscellaneous | misc6 | `\bLie\b[\.:]?` | 324 | 316 |
Miscellaneous | Total | | 3,011 | 1,795 |
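
The difference between hits and replacements in the table above can be illustrated with a minimal Python sketch. It uses the indef1 regex as listed; everything else is an assumption made for illustration: the sample sentence is invented, the function `normalise` and its `discard` parameter are hypothetical, and the normalised form (joining the two parts into one word) is a guess at what the replacement looked like. The tool and procedure actually used for the CEECE are not documented here.

```python
import re

# indef1, as given in the table.
INDEF1 = re.compile(
    r"(?i)\b((every|some|any|no)[ \-](body|thing)|(every|some|any)[ \-]one)\b"
)


def normalise(text, discard=frozenset()):
    """Join the two parts of each hit unless its index is in `discard`.

    `discard` stands in for the manual decision that a hit is irrelevant
    (hypothetical; the real review procedure is not specified).
    Returns (new_text, number_of_hits, number_of_replacements).
    """
    hits = 0
    replacements = 0

    def join(match):
        nonlocal hits, replacements
        index = hits
        hits += 1
        if index in discard:
            return match.group(0)  # hit discarded: leave the text unchanged
        replacements += 1
        # Assumed normalisation: drop the space or hyphen between the parts.
        return re.sub(r"[ \-]", "", match.group(0))

    return INDEF1.sub(join, text), hits, replacements


# Invented sample; hit 1 ("any one of them") is not the pronoun "anyone".
sample = "Every body is well; I have not seen any one of them, nor heard any thing."
new_text, hits, replacements = normalise(sample, discard={1})
print(new_text)
print(hits, replacements)
```

Here the search yields three hits, of which two are normalised and one is left as it stands, which mirrors how the hit and replacement counts can differ for a single code name.
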
Together, the first and second cycles amount to 17,024 replacements.
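
The overall figure can be cross-checked against the per-file and per-group counts reported in the two tables. The short Python sketch below simply re-adds those published numbers; the variable names are illustrative only.

```python
# Replacements per code name in the first cycle (from the first table).
first_cycle = {"pre1": 5710, "pre2": 4, "pre3": 878, "pre4": 1229}

# Replacement totals per group in the second cycle (from the second table).
second_cycle = {
    "Abbreviations": 1969,
    "Redundant punctuation": 2147,
    "Indefinite pronouns and adverbs": 2012,
    "Reflexive pronouns": 1280,
    "Miscellaneous": 1795,
}

first_total = sum(first_cycle.values())
second_total = sum(second_cycle.values())
grand_total = first_total + second_total

print(first_total, second_total, grand_total)  # 7821 9203 17024
```
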