Versions of the corpus

Six versions of the Corpus are available. All contain the same basic text, but they differ in typography and format.

(1) Form A. This is the original form of the Corpus, as it was prepared in 1963-64. The limitations of computer printing facilities at that time required that it use an elaborate coding procedure, which is described in the manual.

(2) Form B. This is the "stripped" version, from which all punctuation symbols and codes except hyphens, apostrophes, and symbols for formulas and ellipses have been omitted. It is especially useful for those who are interested in individual words, and was used in the preparation of the frequency tables in Kucera and Francis, Computational Analysis of Present-Day American English (Providence: Brown University Press, 1967).
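As a rough illustration of the kind of stripping described for Form B, the following Python sketch removes punctuation other than hyphens and apostrophes from ordinary text and counts the remaining words. It is only a sketch under stated assumptions: it works on plain modern text rather than the coded Form A text, it does not handle the special symbols for formulas and ellipses, and the function names `strip_text` and `word_frequencies` are hypothetical, not part of any Corpus software.

```python
import re
from collections import Counter

# Assumption: anything that is not a word character, whitespace,
# an apostrophe, or a hyphen counts as punctuation to be dropped.
_PUNCTUATION = re.compile(r"[^\w\s'-]")


def strip_text(text: str) -> list[str]:
    """Return the words of `text` with punctuation removed.

    Hyphens and apostrophes inside words are kept; tokens with no
    letter or digit (for example a free-standing hyphen) are dropped.
    """
    cleaned = _PUNCTUATION.sub(" ", text)
    return [tok for tok in cleaned.split() if any(ch.isalnum() for ch in tok)]


def word_frequencies(text: str) -> Counter:
    """Count how often each stripped word form occurs in `text`."""
    return Counter(strip_text(text))
```

Calling `word_frequencies` on a passage yields a simple frequency table keyed by word form, which is the sort of individual-word study this version is suited to.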

(3) Form C. This is the "tagged" version, which uses a partially stripped text in which only proper name capitalisation and those punctuation marks which are of grammatical significance have been retained. Each individual word (token) in this version has been given one grammatical tag from a list of 81; each tag specifies a particular word-class.

(4) Bergen Form I. This version and the next (Bergen Form II) were prepared at the Norwegian Computational Center for Humanistic Research (NAVF's EDB-senter for humanistisk forskning) at the University of Bergen under the direction of Dr. Jostein Hauge. Both contain upper- and lower-case letters, regular punctuation marks, and a minimum of special codes. In Bergen Form I, typographic information is preserved and the line division of the original version is kept, except that a word at the end of a line is never divided.

(5) Bergen Form II. In this version, typographical information is somewhat reduced and a new, longer line is used.

(6) Brown MARC Form. This version was prepared at Stanford University. It is designed to be compatible with two commonly used research techniques that are appropriate for large textual corpora:

    (a) searching for and retrieving full-sentence citations, using single words or word + context as retrieval criteria;
    (b) generating KWIC (keyword-in-context) concordances, which can be organised according to varying arrangements of a keyword plus its preceding or following verbal context.

The format therefore uses variable-length records, with each sentence stored as a single record.
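The two techniques named for the Brown MARC Form can be sketched in Python over a list of sentence records. This is a minimal illustration, not the actual MARC record layout or any Stanford software: the records are assumed to be plain sentence strings, and `tokenize`, `find_sentences`, and `kwic` are hypothetical names chosen for the example.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Split a sentence record into word tokens (assumed token pattern)."""
    return re.findall(r"[A-Za-z0-9'-]+", sentence)


def find_sentences(records: list[str], word: str) -> list[str]:
    """Retrieve every full sentence that contains `word` as a token."""
    target = word.lower()
    return [s for s in records if target in (t.lower() for t in tokenize(s))]


def kwic(records: list[str], keyword: str, width: int = 3, sort_by: str = "right"):
    """Build keyword-in-context lines as (left, keyword, right) tuples.

    `width` is the number of context words kept on each side.
    `sort_by="right"` orders lines by the following context;
    `sort_by="left"` orders them by the preceding context, nearest word first.
    """
    target = keyword.lower()
    lines = []
    for sentence in records:
        tokens = tokenize(sentence)
        for i, tok in enumerate(tokens):
            if tok.lower() == target:
                left = tokens[max(0, i - width):i]
                right = tokens[i + 1:i + 1 + width]
                lines.append((left, tok, right))
    if sort_by == "left":
        lines.sort(key=lambda ln: [w.lower() for w in reversed(ln[0])])
    else:
        lines.sort(key=lambda ln: [w.lower() for w in ln[2]])
    return lines
```

Because each record is one sentence, retrieval returns whole citations directly, and the context window of a concordance line never crosses a sentence boundary.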
