Annotation

(Source: F-LOB manual, original version)

The purpose of any coding system is to produce an ASCII text that retains as much of the information in the original text as possible. Instead of using the rather complicated coding system of the LOB corpus, we used a simplified version of the SGML-based mark-up codes that were drawn up for the coding of the International Corpus of English (ICE) subcorpora (see Nelson 1996 for details). For example, all types of typeface change (underlining, bold, italics, etc.) were subsumed under one general typeface-change code. The mark-up codes are enclosed in angle brackets. A feature is delimited by an opening tag (e.g. <quote_>) and a closing tag (e.g. <quote/>). If a feature extends over only one word, a tag with a vertical stroke is used instead (e.g. <quote|>). A list of all the mark-up codes used in the Frown corpus is included in the manual.
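As an illustration only, the three tag shapes described above can be recognised with a single regular expression. The following Python sketch is not part of the corpus tools. The function names split_markup and strip_markup are hypothetical, and the pattern assumes that tag names consist of letters and digits only, which may not hold for every code listed in the manual.

```python
import re

# Assumed tag shapes, taken from the examples in the text:
#   <name_> opens a feature, <name/> closes it, <name|> marks a single word.
# Assumption: tag names contain only letters and digits.
TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([_/|])>")

KIND = {"_": "open", "/": "close", "|": "single"}


def split_markup(text):
    """Split text into ("text", string) and (kind, tag_name) items, in order."""
    items = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            # Plain text between the previous tag and this one.
            items.append(("text", text[pos:match.start()]))
        items.append((KIND[match.group(2)], match.group(1)))
        pos = match.end()
    if pos < len(text):
        # Plain text after the last tag.
        items.append(("text", text[pos:]))
    return items


def strip_markup(text):
    """Remove all mark-up codes, leaving only the running text."""
    return TAG.sub("", text)


sample = "He said <quote_>come in<quote/> and left."
print(split_markup(sample))
print(strip_markup(sample))
```

Keeping the tag name and the tag kind separate makes it possible to check that every opening tag has a matching closing tag before the codes are stripped.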

In addition to those mark-up tags that help to represent the microstructure of the original (i.e. those indicating a typeface change or the beginning of a new paragraph), the ICE mark-up includes codes that help to interpret rather than represent the original text (i.e. the marking of non-English text or transliterations of Greek or Hebrew text).

To ensure that the corpus text would be as ‘readable’ as possible, the use of mark-up symbols was kept to a minimum. In particular, we tried to avoid the use of double codes. If a non-English word in the original text was set in italics, it was coded only as non-English (<foreign_>word<foreign/>) and not with an additional typeface-change code (<tf_><foreign_>word<foreign/><tf/>).
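The rule against double codes can be expressed as a simple precedence between the two codes. The Python sketch below is an illustration under stated assumptions, not the procedure used by the compilers: the function mark_up and its parameters are hypothetical, and only the foreign-over-typeface precedence is taken from the text.

```python
def mark_up(word, foreign=False, typeface_change=False):
    """Wrap a word in at most one code (hypothetical helper).

    A word that is both non-English and set in a different typeface
    receives only the <foreign_>...<foreign/> code, as described above.
    """
    if foreign:
        return f"<foreign_>{word}<foreign/>"
    if typeface_change:
        return f"<tf_>{word}<tf/>"
    return word


# An italicised non-English word is coded as non-English only.
print(mark_up("Zeitgeist", foreign=True, typeface_change=True))
```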

The main stages in the production of the POS-tagged versions of the Brown corpora were:

  1. Conversion of corpus markup
  2. Tokenization
  3. Initial tag assignment
  4. Tag selection (disambiguation)
  5. Idiomtagging
  6. Template Tagger (1)
  7. Template Tagger (2)
  8. Post-editing

These stages of tagging and post-editing are described in more detail in the manual for the POS-tagged Brown corpora.

The table below summarizes the stages in the development of the four core Brown corpora.

Table 2. The evolution of the Brown family of corpora. (Source: Manual of the POS-tagged Brown corpora).

 

| | Brown | LOB | Frown | F-LOB |
| --- | --- | --- | --- | --- |
| Period sampled | 1961 | 1961 | 1992 | 1991 |
| Text samples collected in | 1963-64 | 1970-78 | 1992-96 | 1991-96 |
| Text samples collected by | Francis, Kucera and associates | Johansson, Leech, Atwell, Garside and associates | Mair and associates | Mair and associates |
| Original tagset | "the Brown-tagset" | CLAWS 1 | C8 | C8 |
| Original tagger | TAGGIT (Greene and Rubin 1971) | CLAWS1 (Marshall 1983) | CLAWS4 (Leech et al. 1994) and Template Tagger (Fligelstone et al. 1997) | CLAWS4 and Template Tagger |
| C8 version produced by* | automatic re-tagging | automatic mapping of the CLAWS 1-tags onto C8 | automatic tagging and manual post-editing | automatic tagging and manual post-editing |
| Post-editing of C8 version | none | earlier, pre-mapping post-edited version available | completed (Freiburg, 2006) | completed (Freiburg, 2003) |

*) All automatically C8-tagged versions of corpora were produced by Nicholas Smith at Lancaster University.

References

Nelson, Gerald. 1996. "Markup Systems." In Greenbaum, Sidney, ed. Comparing English Worldwide - The International Corpus of English. Oxford: Clarendon. pp. 36-53.