Key to the normalisation of TCEECE

Normalisation was carried out in two cycles. The first cycle was performed on the standardised (‘VARDed’) version of the CEECE. The second cycle was performed on the output of the first, so its searches and replacements operated on text that had already been normalised in the first cycle.

First cycle

The number of replacements given here counts only the instances that were normalised in the CEECE. The files also contain instances from the CEECSU; these have been subtracted from the figures below.

Code name	File	Number of replacements
pre1	abbreviations-ceece+ceecsu_MH.xlsx	5,710
pre2	ceece+ceecsu_v-for-u-replacements.xls	4
pre3	ceece+ceecsu_verbs-d+t-replacements2_MH.xls	878
pre4	ceece+ceecsu_ye_article-only-replacements2.xls	1,229

Second cycle

Each ‘code name’ corresponds to one search in the corpus. The ‘regex’ is the regular expression used for that search. The number of hits is the number of matches the search returned, and the number of replacements states how many of those hits were normalised; the remaining hits were discarded as irrelevant.

Group	Code name	Regex	Number of hits	Number of replacements
Abbreviations	abbr1	(?i)\b(Jan|Janr|Janry|Jany|Feb|Febr|Febry|Feby|Febuary|Feburary|Mar|Ap|Apr|Aprl|Apprill|Jun|Jul|Jully|Aug|Augoust|Augst|Augt|Augu|Sep|Sepbr|Sepr|Sept|Septb|Septbr|Septem|Septr|7br|Oct|Octbr|Octeber|Octo|Octob|Octobr|Octr|Nov|Novebr|Novemb|Novemr|Novm|Novmber|Novr|9br|Dec|Decbr|Decem|Decemb|Decemr|Decr)\b[\.:]?	1,826	1,313
	abbr2	(?i)\b(Hond|Honble|Sr|Capt|shod|cd|wld|wd|sh|cod|yt)\b[\.:]?	729	557
	abbr3	(?i)\b(aaffectionate|afectionate|afeitionate|aff|affe|affec|affecate|affect|affectionet|affectionett|affectionte|affectonate|affecttionate)\b[\.:]?	155	99
	Total		2,710	1,969
Redundant punctuation	punc1	(?i)\b(January|February|March|April|May|June|July|August|September|October|November|December)\b[\.:]	550	363
	punc2	(?i)\b(Sir|Lord|Captain|Princess|Duchess|Brother|Cousin|dear|affectionate|yours|your|which|that|would|could|should|the)\b[\.:]	2,429	1,686
	punc3	(?i)\d+(st|nd|rd|th)[\.:] \d+	98	98
	Total		3,077	2,147
Indefinite pronouns and adverbs	indef1	(?i)\b((every|some|any|no)[ \-](body|thing)|(every|some|any)[ \-]one)\b	1,952	1,861
	indef2	(?i)\b(evry)[ \-]?(body|thing|one)\b	5	5
	indef3	(?i)\b(every|some|any|no)[ \-]?(bodey|bodie|boddey|think|thin)\b	17	8
	indef4	(?i)\b(every|evry|some|any|no)[ \-]?th\.	6	6
	indef5	(?i)\b(every|evry|some|any|no)[ \-]?(body|bodey|bodie|boddey|thing|think|thin|one)s\b	62	17
	indef6	(?i)\b((every|some|any|no)[ \-]where|some[ \-]times)\b	116	115
	Total		2,158	2,012
Reflexive pronouns	refl1	(?i)\b((my|your|him|her|it|one)[ \-]self|(our|your|them)[ \-]selves)\b	1,242	1,242
	refl2	(?i)\b(thy|you|his|herr|its|ones|one's|won's|their)[ \-]?(self|selves)\b	30	26
	refl3	(?i)\b(my|your|him|her|it|one|our|them)[ \-]?(selfe|selffe?|selfes)\b	13	12
	Total		1,285	1,280
Miscellaneous	misc1	\b(mon|tues|wednes|thurs|fri|satur|sun)days?\b	220	220
	misc2	\b(january|february|april|june|july|august|september|october|november|december)\b	41	35
	misc3	(?i)\b(dos|don|cant|wont)\b	1,512	387
	misc4	(?i)\btho\b[\.:]?	800	793
	misc5	\bi\b	114	44
	misc6	\bLie\b[\.:]?	324	316
	Total		3,011	1,795
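
The relationship between hits and replacements can be illustrated with a minimal Python sketch. It uses the indef1 regex from the table, with the alternation bars written as plain `|`. Everything else is an assumption made for illustration: the helper names (`normalise`, `join_parts`, `not_followed_by_of`), the sample sentence, the rewrite that joins the two parts of a hit, and the toy relevance rule. This key does not state how hits were judged relevant or which tool performed the replacements.

```python
import re

# Regex from the indef1 row, alternation bars written as plain "|".
INDEF1 = re.compile(
    r"(?i)\b((every|some|any|no)[ \-](body|thing)|(every|some|any)[ \-]one)\b"
)

def normalise(text, pattern, rewrite, is_relevant):
    """Run one search; return (new_text, hits, replacements)."""
    hits = 0
    replacements = 0

    def _sub(match):
        nonlocal hits, replacements
        hits += 1
        if not is_relevant(match):
            return match.group(0)  # discarded hit: text is left untouched
        replacements += 1
        return rewrite(match)

    return pattern.sub(_sub, text), hits, replacements

def join_parts(match):
    # Hypothetical rewrite: drop the space or hyphen inside the hit.
    return re.sub(r"[ \-]", "", match.group(0))

def not_followed_by_of(match):
    # Toy relevance rule (an assumption): skip hits such as "any one of them".
    return not match.string[match.end():].startswith(" of ")

sample = "Every body here sends some-thing, but any one of them may write."
new_text, hits, replacements = normalise(
    sample, INDEF1, join_parts, not_followed_by_of
)
print(new_text)            # Everybody here sends something, but any one of them may write.
print(hits, replacements)  # 3 2
```

In this sketch the search returns three hits, of which two are normalised and one is discarded, which mirrors the difference between the ‘Number of hits’ and ‘Number of replacements’ columns.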

The first and second cycles together amount to 17,024 replacements in total: 7,821 in the first cycle and 9,203 in the second.
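
The total can be re-derived from the replacement counts in the two tables. The following Python sketch does only that arithmetic; the dictionary names are chosen here for illustration, and the figures are copied from the tables in this key.

```python
# Replacement counts copied from the first-cycle table.
FIRST_CYCLE = {"pre1": 5710, "pre2": 4, "pre3": 878, "pre4": 1229}

# Replacement counts copied from the second-cycle table, one entry per search.
SECOND_CYCLE = {
    "abbr1": 1313, "abbr2": 557, "abbr3": 99,
    "punc1": 363, "punc2": 1686, "punc3": 98,
    "indef1": 1861, "indef2": 5, "indef3": 8,
    "indef4": 6, "indef5": 17, "indef6": 115,
    "refl1": 1242, "refl2": 26, "refl3": 12,
    "misc1": 220, "misc2": 35, "misc3": 387,
    "misc4": 793, "misc5": 44, "misc6": 316,
}

first_total = sum(FIRST_CYCLE.values())
second_total = sum(SECOND_CYCLE.values())
grand_total = first_total + second_total

print(f"{first_total:,} + {second_total:,} = {grand_total:,}")
# 7,821 + 9,203 = 17,024
```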