Basic structure

(Source: the Brown Corpus manual)

The Corpus is divided into 500 samples of 2000+ words each. Each sample begins at the beginning of a sentence but not necessarily of a paragraph or other larger division, and each ends at the first sentence ending after 2000 words. The samples represent a wide range of styles and varieties of prose. Verse was not included on the ground that it presents special linguistic problems different from those of prose. (Short verse passages quoted in prose samples are kept, however.) Drama was excluded as being the imaginative recreation of spoken discourse, rather than true written discourse. Fiction was included, but no samples were admitted which consisted of more than 50% dialogue. Samples were chosen for their representative quality rather than for any subjectively determined excellence.

The use of the word standard in the title of the Corpus does not in any way mean that it is put forward as "standard English"; it merely expresses the hope that this corpus will be used for comparative studies where it is important to use the same body of data. Since the preparation and input of data is a major bottleneck in computer work, the intent was to make available a carefully chosen and prepared body of material of considerable size in standardised format. The corpus may further prove to be standard in setting the pattern for the preparation and presentation of further bodies of data in English or in other languages.
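The sample boundary rule stated above (each sample ends at the first sentence ending after 2000 words) can be illustrated with a minimal Python sketch. This is not the procedure the corpus compilers used: the function name cut_sample, the whitespace-based word count, and the punctuation-based sentence detection are all assumptions made for illustration, as is the choice to let a sentence that ends exactly on word 2000 close the sample.

```python
import re

# Assumption: a sentence ends at a token finishing in . ! or ?,
# optionally followed by closing quotes or brackets. Real text
# (abbreviations, initials) would need a better sentence detector.
SENTENCE_END = re.compile(r"""[.!?]["')\]]*$""")


def cut_sample(text: str, min_words: int = 2000) -> str:
    """Return the prefix of `text` that ends at the first sentence
    ending reached once at least `min_words` words have been read.

    Words are whitespace-separated tokens (an assumption). If no
    sentence ending is found after `min_words`, the whole text is
    returned unchanged.
    """
    count = 0
    for match in re.finditer(r"\S+", text):
        count += 1
        if count >= min_words and SENTENCE_END.search(match.group()):
            return text[: match.end()]
    return text


if __name__ == "__main__":
    demo = "One two three. Four five six seven. Eight nine."
    # With a 4-word minimum, the cut falls at the end of the second sentence.
    print(cut_sample(demo, min_words=4))
```

Because the cut waits for a sentence ending, every sample produced this way has at least the minimum number of words and usually a few more, which matches the "2000+ words" description.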

The selection procedure was in two phases: an initial subjective classification and decision as to how many samples of each category would be used, followed by a random selection of the actual samples within each category. In most categories the holdings of the Brown University Library and the Providence Athenaeum were treated as the universe from which the random selections were made. But for certain categories it was necessary to go beyond these two collections. For the daily press, for example, the list of American newspapers of which the New York Public Library keeps microfilm files was used (with the addition of the Providence Journal). Certain categories of chiefly ephemeral material necessitated rather arbitrary decisions; some periodical materials in the categories Skills and Hobbies and Popular Lore were chosen from the contents of one of the largest second-hand magazine stores in New York City.
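The two-phase procedure described above amounts to stratified random sampling: the number of samples per category is fixed first, and the items are then drawn at random within each category. The following Python sketch shows the idea only. The function select_samples, the shape of the universe mapping, and the placeholder titles are hypothetical and do not reproduce the compilers' actual catalogues or random-number method.

```python
import random


def select_samples(universe, quotas, seed=None):
    """Draw `quotas[category]` items at random from each category.

    universe: dict mapping a category code to a list of candidate
              texts (hypothetical stand-in for a library catalogue).
    quotas:   dict mapping a category code to the number of samples
              decided on in the first, subjective phase.
    """
    rng = random.Random(seed)
    chosen = {}
    for category, quota in quotas.items():
        candidates = universe[category]
        if quota > len(candidates):
            raise ValueError(f"category {category}: quota exceeds universe")
        # Phase two: random selection without replacement.
        chosen[category] = rng.sample(candidates, quota)
    return chosen


if __name__ == "__main__":
    # Placeholder titles, not real catalogue entries.
    universe = {
        "M": [f"science-fiction title {i}" for i in range(40)],
        "R": [f"humor title {i}" for i in range(60)],
    }
    quotas = {"M": 6, "R": 9}  # counts for categories M and R in Table 1
    picked = select_samples(universe, quotas, seed=0)
    print({code: len(items) for code, items in picked.items()})
```

Passing a seed makes a draw repeatable, which is convenient for checking a selection but is a design choice of this sketch rather than something the manual describes.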

Table 1. Text categories in the Brown Corpus.

Genre group                  Category                Content of category                  No. of texts
---------------------------  ----------------------  -----------------------------------  ------------
I. Informative prose (374)   Press (88)           A  Reportage                                      44
(see Figure 2)                                    B  Editorial                                      27
                                                  C  Review                                         17
                             General Prose (206)  D  Religion                                       17
                                                  E  Skills, trades and hobbies                     36
                                                  F  Popular lore                                   48
                                                  G  Belles lettres, biographies, essays            75
                                                  H  Miscellaneous                                  30
                             Learned (80)         J  Science                                        80
II. Imaginative prose (126)  Fiction (126)        K  General fiction                                29
(see Figure 3)                                    L  Mystery and detective fiction                  24
                                                  M  Science fiction                                 6
                                                  N  Adventure and Western                          29
                                                  P  Romance and love story                         29
                                                  R  Humor                                           9
Total                                                                                              500
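The subtotals given in parentheses in Table 1 can be checked against the per-category counts. The short Python sketch below copies the rows from the table and sums them; the names TABLE_1 and group_totals are illustrative only.

```python
from collections import defaultdict

# Rows copied from Table 1: (category code, group, content, no. of texts).
TABLE_1 = [
    ("A", "Press", "Reportage", 44),
    ("B", "Press", "Editorial", 27),
    ("C", "Press", "Review", 17),
    ("D", "General Prose", "Religion", 17),
    ("E", "General Prose", "Skills, trades and hobbies", 36),
    ("F", "General Prose", "Popular lore", 48),
    ("G", "General Prose", "Belles lettres, biographies, essays", 75),
    ("H", "General Prose", "Miscellaneous", 30),
    ("J", "Learned", "Science", 80),
    ("K", "Fiction", "General fiction", 29),
    ("L", "Fiction", "Mystery and detective fiction", 24),
    ("M", "Fiction", "Science fiction", 6),
    ("N", "Fiction", "Adventure and Western", 29),
    ("P", "Fiction", "Romance and love story", 29),
    ("R", "Fiction", "Humor", 9),
]

INFORMATIVE_GROUPS = {"Press", "General Prose", "Learned"}


def group_totals(rows):
    """Sum the number of texts per group."""
    totals = defaultdict(int)
    for _code, group, _content, texts in rows:
        totals[group] += texts
    return dict(totals)


totals = group_totals(TABLE_1)
informative = sum(n for g, n in totals.items() if g in INFORMATIVE_GROUPS)
imaginative = totals["Fiction"]

print(totals)  # Press 88, General Prose 206, Learned 80, Fiction 126
print(informative, imaginative, informative + imaginative)  # 374 126 500
```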

Figure 1. Genres represented in the Brown Corpus.

Figure 2. Informative prose in the Brown Corpus.

Figure 3. Imaginative prose in the Brown Corpus.