Annotation

The annotation guidelines for the PPCME2 are described in an online manual (Beatrice Santorini, March 2005), which is based on an earlier version written by Ann Taylor in connection with the second edition of the Penn-Helsinki Parsed Corpus of Middle English. The current version supersedes the earlier one and is intended to apply to the following corpora:

the Penn-Helsinki Parsed Corpus of Middle English, 2nd edition (PPCME2)
the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME)
the Parsed Corpus of Early English Correspondence (PCEEC)

There are slight annotation differences among the PPCME2, the PPCEME, and the PCEEC. We hope to minimize these differences further in subsequent editions of the corpora.

Philosophy and goals

(Source: the Annotation manual for the PPCME2, PPCEME, and PCEEC, General introduction)

Our primary goal has been to create an annotation system that facilitates automated searches, not to give a correct linguistic analysis of each sentence. For instance, if a construction can be found unambiguously through a combination of properties of a bracketed sentence, our annotation may not contain all of the structure that a full phrase structure diagram of the sentence would have.

We have tried to plan our system so that at each stage of the annotation, information is added in a monotonic way. In particular, we want any future revisions of the bracketed structures always to add information, never to change it. This goal requires us to avoid subjective judgments since they are extremely error-prone. So, for example, we do not distinguish adjectival from verbal passive participles, nor do we attempt to implement the argument-adjunct distinction.

As many categories as possible should have clear meanings so that unclear cases can be relegated to a small number of categories of residual cases. The price of making most categories homogeneous is that these residual categories will not be. In future revisions of the corpus, it may be possible to divide some of these residual categories into homogeneous subcategories.

As much as possible, we have avoided making decisions that would be controversial, whether with regard to text interpretation or to linguistic theory. In doubtful cases, we either avoid specifying structure, or we use default rules to decide the case for search purposes. An example of the first strategy concerns VPs. These are normally not indicated in the corpus, since VP boundaries are normally indeterminate. This is clearly the case in Middle English, which allows scrambling and where the internal structure of the VP is variable and changing. But even in modern English, there are many cases in which it is not clear whether some phrase attaches as a daughter of VP or higher up in the tree. An example of the second strategy concerns PP attachment. Whenever it is unclear where a PP attaches, we attach it by default as high as possible.

File formats

Each text in the corpus comes in three different formats, each with a characteristic filename extension:

text (.txt)
part-of-speech (POS) tagged (.pos)
parsed (.psd)

Text files (.txt)

Text files have the extension .txt. Besides the text, they contain Helsinki text-level codes, converted into HTML-type codes. The original page layout is not retained. Rather, the text is divided into tokens, which generally correspond to a main clause together with any subordinate clauses that it contains. Each token is associated with a token ID, enclosed in parentheses, which contains the name of the file, a page reference to the printed text (possibly including a volume reference), and a running token number that locates the token within the computer file. Tokens may also consist entirely of text-level codes. Such tokens do not have IDs, but they are counted by the token counter, which can lead to gaps in the running token numbers. Punctuation in text files is separated from the words to simplify searches.

<P_2>

<heading>

I . (CMMALORY,2.3)

Merlin (CMMALORY,2.4)

</heading>

HIT befel in the dayes of Uther Pendragon , when he was kynge of all
Englond and so regned , that there was a myghty duke in Cornewaill that
helde warre ageynst hym long tyme , (CMMALORY,2.6)

and the duke was called the duke of Tyntagil . (CMMALORY,2.7)

And so by meanes kynge Uther send for this duk chargyng hym to brynge
his wyf with hym , (CMMALORY,2.8)

for she was called a fair lady and a passynge wyse , (CMMALORY,2.9)

and her name was called Igrayne . (CMMALORY,2.10)

So whan the duke and his wyf were comyn unto the kynge , by the meanes
of grete lordes they were accorded bothe . (CMMALORY,2.11)
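As a rough illustration of the token layout described above, the following Python sketch splits a text file into tokens and pulls the ID apart. It is not part of the corpus tools. The function names (split_token, read_tokens) are hypothetical, and the ID pattern is an assumption inferred from the sample: tokens are separated by blank lines, and an ID has the shape (FILENAME,PAGE.NUMBER) at the end of a token. Volume references are not treated specially.

```python
import re

# Assumed ID shape, inferred from the sample: (FILENAME,PAGE.NUMBER) at the
# end of a token. Everything between the first comma and the last dot is
# kept as the page reference, without further interpretation.
TOKEN_ID = re.compile(
    r"\((?P<file>[^,()\s]+),(?P<page>[^()\s]*)\.(?P<num>\d+)\)\s*$"
)

def split_token(token):
    """Return (text, id_info) for one token.

    id_info is None for tokens that carry no ID, such as tokens that
    consist entirely of text-level codes.
    """
    token = " ".join(token.split())  # rejoin lines wrapped in the file
    match = TOKEN_ID.search(token)
    if match is None:
        return token, None
    info = {
        "file": match.group("file"),
        "page": match.group("page"),
        "num": int(match.group("num")),
    }
    return token[:match.start()].rstrip(), info

def read_tokens(raw):
    """Split file contents on blank lines (an assumption) into tokens."""
    blocks = re.split(r"\n\s*\n", raw.strip())
    return [split_token(block) for block in blocks if block.strip()]
```

Applied to the sample, a code-only token such as <heading> comes back with no ID, while I . (CMMALORY,2.3) yields the text "I ." together with file CMMALORY, page 2, and token number 3. Because code-only tokens are counted but carry no ID, consecutive ID numbers returned by this sketch need not be contiguous.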

Part-of-speech (POS) tagged files (.pos)

Part-of-speech (POS) tagged texts have the extension .pos. They contain the material in the text files with a POS tag added to each word. Editorial material is given the tag CODE. Text elements are separated from their POS tags by an underscore. The text is divided into tokens in the same way as in the text files. Also, as in the text files, tokens consisting entirely of CODE material do not receive a token ID, but are counted by the token counter.

<P_2>_CODE

<heading>_CODE

I_NUM ._. CMMALORY,2.3_ID

Merlin_NPR CMMALORY,2.4_ID

</heading>_CODE

HIT_PRO befel_VBD in_P the_D dayes_NS of_P Uther_NPR Pendragon_NPR ,_,
when_P he_PRO was_BED kynge_N of_P all_Q Englond_NPR and_CONJ so_ADV
regned_VBD ,_, that_C there_EX was_BED a_D myghty_ADJ duke_N in_P
Cornewaill_NPR that_C helde_VBD warre_N ageynst_P hym_PRO long_ADJ
tyme_N ,_. CMMALORY,2.6_ID

and_CONJ the_D duke_N was_BED called_VAN the_D duke_N of_P Tyntagil_NPR
._. CMMALORY,2.7_ID

And_CONJ so_ADV by_P meanes_NS kynge_NPR Uther_NPR send_VBD for_P
this_D duk_N chargyng_VAG hym_PRO to_TO brynge_VB his_PRO$ wyf_N with_P
hym_PRO ,_. CMMALORY,2.8_ID

for_CONJ she_PRO was_BED called_VAN a_D fair_ADJ lady_N and_CONJ a_D
passynge_ADV wyse_ADJ ,_. CMMALORY,2.9_ID

and_CONJ her_PRO$ name_N was_BED called_VAN Igrayne_NPR ._.
CMMALORY,2.10_ID

So_ADV whan_P the_D duke_N and_CONJ his_PRO$ wyf_N were_BED comyn_VBN
unto_P the_D kynge_N ,_, by_P the_D meanes_NS of_P grete_ADJ lordes_NS
they_PRO were_BED accorded_VAN bothe_Q ._. CMMALORY,2.11_ID
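The word_TAG layout can be read with a few lines of Python. The sketch below is illustrative only and makes two assumptions that the description above does not state outright: items are separated by whitespace, and a tag never contains an underscore, so each item is split at its last underscore. The function name parse_pos_token is hypothetical.

```python
def parse_pos_token(token):
    """Split one .pos token into ([(word, tag), ...], token_id).

    token_id is None when the token has no ID item, as with tokens
    made up entirely of CODE material.
    """
    pairs = []
    token_id = None
    for item in token.split():
        # Split at the LAST underscore, since the text element itself
        # may contain one (for example <P_2>_CODE).
        word, sep, tag = item.rpartition("_")
        if not sep:
            raise ValueError(f"no underscore in item: {item!r}")
        if tag == "ID":
            token_id = word
        else:
            pairs.append((word, tag))
    return pairs, token_id
```

For the token with ID CMMALORY,2.7, this returns ten (word, tag) pairs, beginning with ("and", "CONJ") and ending with (".", "."), plus the ID string. Splitting at the last underscore rather than the first is what keeps <P_2>_CODE intact as the text element <P_2> with the tag CODE.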

Parsed files (.psd)

Parsed files have the extension .psd. They contain a labelled bracketing of the text, with the first set of labelled parentheses around a word repeating the information from the POS-tagged files. The division into tokens in the parsed files is the same as in the text and POS files. Each token is enclosed with its ID in a set of unlabelled parentheses.
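A labelled bracketing of this kind maps naturally onto nested lists. The following Python sketch is one possible way to read it, not the method used by the corpus's own search tools. It assumes that words and labels never contain parentheses or whitespace. The names parse_brackets and leaves are hypothetical.

```python
import re

def parse_brackets(text):
    """Parse bracketed text into nested lists of strings.

    Each parenthesized group becomes a list of its items, so
    (NP (NPR Merlin)) becomes ["NP", ["NPR", "Merlin"]]. The return
    value is the list of top-level groups, one per token.
    """
    stack = [[]]
    for piece in re.findall(r"\(|\)|[^\s()]+", text):
        if piece == "(":
            stack.append([])
        elif piece == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(piece)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]

def leaves(node):
    """Yield (label, word) for every innermost two-string group.

    Groups holding an empty category or trace, such as
    (NP-SBJ *con*), are yielded as well.
    """
    if len(node) == 2 and all(isinstance(item, str) for item in node):
        yield (node[0], node[1])
    else:
        for child in node:
            if isinstance(child, list):
                yield from leaves(child)
```

For the token ( (NP (NPR Merlin)) (ID CMMALORY,2.4)), parse_brackets returns one top-level group whose two members are the NP tree and the ID, and leaves yields ("NPR", "Merlin") followed by ("ID", "CMMALORY,2.4"). The unlabelled outer parentheses show up simply as a list whose first item is itself a list rather than a label.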

( (CODE <P_2>))

( (CODE <heading>))

( (NUMP (NUM I) 
        (E_S .)) 
  (ID CMMALORY,2.3))

( (NP (NPR Merlin)) 
  (ID CMMALORY,2.4))

( (CODE </heading>))

( (IP-MAT (NP-SBJ-1 (PRO HIT))
          (VBD befel)
          (PP (P in)
              (NP (D the) (NS dayes)
                  (PP (P of)
                      (NP (NPR Uther) (NPR Pendragon)))))
          (, ,)
          (PP (P when)
              (CP-ADV (C 0)
                      (IP-SUB (IP-SUB (NP-SBJ (PRO he))
                                      (BED was)
                                      (NP-OB1 (N kynge)
                                              (PP (P of)
                                                  (NP (Q all) (NPR Englond)))))
                              (CONJP (CONJ and)
                                     (IP-SUB (NP-SBJ *con*) 
                                             (ADVP (ADV so))
                                             (VBD regned))))))
          (, ,)
          (CP-THT-1 (C that)
                    (IP-SUB (NP-SBJ-2 (EX there))
                            (BED was)
                            (NP-2 (D a) (ADJ myghty) (N duke)
                                  (CP-REL *ICH*-3))
                            (PP (P in)
                                (NP (NPR Cornewaill)))
                            (CP-REL-3 (WNP-4 0)
                                      (C that)
                                      (IP-SUB (NP-SBJ *T*-4)
                                              (VBD helde)
                                              (NP-OB1 (N warre))
                                              (PP (P ageynst)
                                                  (NP (PRO hym)))
                                              (NP-MSR (ADJ long) (N tyme))))))
          (E_S ,)) 
  (ID CMMALORY,2.6))

( (IP-MAT (CONJ and)
          (NP-SBJ-1 (D the) (N duke))
          (BED was)
          (VAN called)
          (IP-SMC (NP-SBJ *-1)
                  (NP-OB1 (D the) (N duke)
                          (PP (P of)
                              (NP (NPR Tyntagil)))))
          (E_S .)) 
  (ID CMMALORY,2.7))

( (IP-MAT (CONJ And)
          (ADVP (ADV so))
          (PP (P by)
              (NP (NS meanes)))
          (NP-SBJ (NPR kynge) (NPR Uther))
          (VBD send)
          (PP (P for)
              (NP (D this) (N duk)))
          (IP-PPL (VAG chargyng)
                  (NP-OB1 (PRO hym))
                  (IP-INF (TO to)
                          (VB brynge)
                          (NP-OB1 (PRO$ his) (N wyf))
                          (PP (P with)
                              (NP (PRO hym)))))
          (E_S ,)) 
  (ID CMMALORY,2.8))