Annotation
(Source: F-LOB manual, original version)
The purpose of any coding system is to produce an ASCII text that preserves as much of the information in the original text as possible. Instead of using the rather complicated coding system of the LOB corpus, we used a simplified version of the SGML-based mark-up codes that were drawn up for the coding of the International Corpus of English (ICE) subcorpora (see Nelson 1996 for details). For example, all types of typeface change (underlining, bold, italics, etc.) were subsumed under one general typeface-change code. The mark-up codes are enclosed in angle brackets. A feature that extends over more than one word is delimited by an opening tag (e.g. <quote_>) and a closing tag (e.g. <quote/>). If a feature extends over only one word, a single tag with a vertical stroke is used (e.g. <quote|>). A list of all the mark-up codes used in the Frown corpus is included in the manual.
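The three tag shapes described above can be recognised with a single regular expression. The following Python sketch is only an illustration and is not part of the corpus tools: the pattern is inferred from the three examples given in the text, the function names `find_tags` and `strip_tags` are hypothetical, and the sample sentence is invented. It further assumes that tag names consist of letters and digits and that a single-word tag precedes the word it marks; the full list of codes in the manual should be checked before relying on either assumption.

```python
import re

# Assumed tag shape, inferred from the examples <quote_>, <quote/> and <quote|>:
# a name followed by "_" (opening), "/" (closing) or "|" (single word).
TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([_/|])>")
KINDS = {"_": "open", "/": "close", "|": "single"}


def find_tags(text):
    """Return a (name, kind) pair for each mark-up code in text, in order."""
    return [(m.group(1), KINDS[m.group(2)]) for m in TAG.finditer(text)]


def strip_tags(text):
    """Remove all mark-up codes and return the running text."""
    return TAG.sub("", text)


# Invented sample line; the placement of <tf|> before the word is an assumption.
sample = "He said <quote_>come in<quote/> and <tf|>smiled."
print(find_tags(sample))
print(strip_tags(sample))
```

Stripping the codes in this way gives a plain running text for word counts or concordancing, while the tag list keeps the information about where quotations and typeface changes occurred.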
In addition to the mark-up tags that help to represent the microstructure of the original (e.g. those indicating a typeface change or the beginning of a new paragraph), the ICE mark-up includes codes that help to interpret rather than represent the original text (e.g. the marking of non-English text or of transliterations of Greek or Hebrew text).
In order to ensure that the corpus text would be as ‘readable’ as possible, the use of mark-up symbols was kept to a minimum. In particular, we tried to avoid the use of double codes. If a non-English word in the original text was set in italics, it was coded only as non-English (<foreign_>word<foreign/>) and not additionally as a typeface change (<tf_><foreign_>word<foreign/><tf/>).
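The avoidance of double codes can be checked mechanically. The sketch below is a hypothetical helper, not a tool used in compiling the corpus: it reports every place where one opening tag is followed directly by another, as in the rejected coding <tf_><foreign_>word<foreign/><tf/>. The tag pattern is again an assumption based on the examples in this section, and the name `double_coded` is invented for illustration.

```python
import re

# Assumed tag shape: <name_> opens, <name/> closes, <name|> marks a single word.
TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([_/|])>")


def double_coded(text):
    """Return (outer, inner) pairs where an opening tag directly follows another."""
    pairs = []
    prev = None  # the previous match, kept only if it was an opening tag
    for m in TAG.finditer(text):
        is_open = m.group(2) == "_"
        # Two opening tags with no text between them count as a double code.
        if is_open and prev is not None and prev.end() == m.start():
            pairs.append((prev.group(1), m.group(1)))
        prev = m if is_open else None
    return pairs


print(double_coded("<tf_><foreign_>word<foreign/><tf/>"))
print(double_coded("<foreign_>word<foreign/>"))
```

The first call reports the pair of tags that make up the double code; the second returns an empty list, since the word carries only the non-English code.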
The main stages in the production of the POS-tagged versions of the Brown corpora were:
- Conversion of corpus markup
- Tokenization
- Initial tag assignment
- Tag selection (disambiguation)
- Idiomtagging
- Template Tagger (1)
- Template Tagger (2)
- Post-editing
These stages of tagging and post-editing are described in more detail in the manual for the POS-tagged Brown corpora.
The table below summarizes the stages in the development of the four core Brown corpora.
Table 2. The evolution of the Brown family of corpora. (Source: Manual of the POS-tagged Brown corpora).
| | Brown | LOB | Frown | F-LOB |
|---|---|---|---|---|
| Period sampled | 1961 | 1961 | 1992 | 1991 |
| Text samples collected in | 1963-64 | 1970-78 | 1992-96 | 1991-96 |
| Text samples collected by | Francis, Kucera and associates | Johansson, Leech, Atwell, Garside and associates | Mair and associates | Mair and associates |
| Original tagset | "the Brown-tagset" | CLAWS 1 | C8 | C8 |
| Original tagger | TAGGIT (Greene and Rubin 1971) | CLAWS1 (Marshall 1983) | CLAWS4 (Leech et al. 1994) and Template Tagger (Fligelstone et al. 1997) | CLAWS4 and Template Tagger |
| C8 version produced by* | automatic re-tagging | automatic mapping of the CLAWS 1-tags onto C8 | automatic tagging and manual post-editing | automatic tagging and manual post-editing |
| Post-editing of C8 version | none | earlier, pre-mapping post-edited version available | completed (Freiburg, 2006) | completed (Freiburg, 2003) |

*) All automatically C8-tagged versions of corpora were produced by Nicholas Smith at Lancaster University.
References
Nelson, Gerald. 1996. "Markup Systems." In Greenbaum, Sidney, ed. Comparing English Worldwide - The International Corpus of English. Oxford: Clarendon. pp. 36-53.