Annotation

(Source: F-LOB manual, original version)

The purpose of any coding system is to produce an ASCII text that maintains as much of the information of the original text as possible. Instead of using the rather complicated coding system of the LOB corpus, we used a simplified version of the SGML-based mark-up codes that were drawn up for the coding of the International Corpus of English (ICE) subcorpora (see Nelson 1996 for details). For example, all types of typeface change (underlining, bold, italics, etc.) were subsumed under one general typeface-change code. The mark-up codes are enclosed in angle brackets. They consist of an opening tag (e.g. <quote_>) and a closing tag (e.g. <quote/>). If a feature extends over only one word, a vertical stroke is used (e.g. <quote|>). A list of all the mark-up codes used in the F-LOB corpus is included in the manual.
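The three tag shapes described above can be recognized with a single regular expression. The following Python sketch is only an illustration and not part of the F-LOB tooling. It assumes that tag names consist of letters and digits and that the single-word code precedes the word it marks. The function names and the sample sentence are invented for the example.

```python
import re

# Tag shapes as described in the text: <name_> opens a feature,
# <name/> closes it, and <name|> marks a feature covering one word.
# Assumption: tag names are made up of ASCII letters and digits only.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([_/|])>")
KINDS = {"_": "open", "/": "close", "|": "single"}


def find_tags(text):
    """Return a (name, kind) pair for every mark-up code found in text."""
    return [(m.group(1), KINDS[m.group(2)]) for m in TAG_RE.finditer(text)]


def strip_tags(text):
    """Remove all mark-up codes, leaving only the running text."""
    return TAG_RE.sub("", text)


# Invented sample line; the placement of <tf|> before the word is an assumption.
sample = "He said <quote_>not at all<quote/> and left <tf|>quickly."
print(find_tags(sample))
print(strip_tags(sample))
```

Stripping the codes in this way gives a plain-text view of a sample, while the list of (name, kind) pairs can be used to check that every opening tag has a matching closing tag.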

In addition to those mark-up tags that help to represent the microstructure of the original (e.g. those indicating a typeface change or the beginning of a new paragraph), the ICE mark-up includes codes that help to interpret rather than represent the original text (e.g. the marking of non-English text or transliterations of Greek or Hebrew text).

In order to ensure that the corpus text would be as ‘readable’ as possible, the use of mark-up symbols was kept to a minimum. In particular, we tried to avoid the use of double codes. If a non-English word in the original text was set in italics, it was coded only as non-English (<foreign_>word<foreign/>) and not as (<tf_><foreign_>word<foreign/><tf/>).
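The rule against double codes can be illustrated with a small Python sketch. It is not a description of how the corpus was actually coded; it only shows the effect of the rule on a string. Applying the rule to any inner code, rather than only to <foreign_>, is an assumption of the example, as is the function name.

```python
import re

# Matches a typeface-change span whose entire content is one other coded
# span, e.g. <tf_><foreign_>word<foreign/><tf/>.
# Assumption: the inner span contains no further mark-up codes.
DOUBLE_CODE_RE = re.compile(r"<tf_>(<([A-Za-z]+)_>[^<]*<\2/>)<tf/>")


def drop_redundant_typeface(text):
    """Keep only the inner code when <tf_>...<tf/> wraps exactly one coded span."""
    return DOUBLE_CODE_RE.sub(r"\1", text)


double = "<tf_><foreign_>word<foreign/><tf/>"
print(drop_redundant_typeface(double))  # <foreign_>word<foreign/>
```

A typeface-change span that wraps ordinary running text is left untouched, since only the combination of two codes around the same material counts as a double code here.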

The main stages in the production of the POS-tagged versions of the Brown corpora were:

Conversion of corpus markup
Tokenization
Initial tag assignment
Tag selection (disambiguation)
Idiomtagging
Template Tagger (1)
Template Tagger (2)
Post-editing

These stages of tagging and post-editing are described in more detail in the manual for the POS-tagged Brown corpora.

The table below summarizes the stages in the development of the four core Brown corpora.

Table 2. The evolution of the Brown family of corpora. (Source: Manual of the POS-tagged Brown corpora).

	Brown	LOB	Frown	F-LOB
Period sampled	1961	1961	1992	1991
Text samples collected in	1963-64	1970-78	1992-96	1991-96
Text samples collected by	Francis, Kucera and associates	Johansson, Leech, Atwell, Garside and associates	Mair and associates	Mair and associates
Original tagset	"the Brown-tagset"	CLAWS 1	C8	C8
Original tagger	TAGGIT (Greene and Rubin 1971)	CLAWS1 (Marshall 1983)	CLAWS4 (Leech et al. 1994) and Template Tagger (Fligelstone et al. 1997)	CLAWS4 and Template Tagger
C8 version produced by*	automatic re-tagging	automatic mapping of the CLAWS 1-tags onto C8	automatic tagging and manual post-editing	automatic tagging and manual post-editing
Post-editing of C8 version	none	earlier, pre-mapping post-edited version available	completed (Freiburg, 2006)	completed (Freiburg, 2003)
*) All automatically C8-tagged versions of corpora were produced by Nicholas Smith at Lancaster University.

References

Nelson, Gerald. 1996. "Markup Systems." In Greenbaum, Sidney, ed. Comparing English Worldwide - The International Corpus of English. Oxford: Clarendon. pp. 36-53.
