Transcription & annotation

(Source: MICASE manual)

The construction of MICASE was based on guidelines established by the Text Encoding Initiative (TEI), and the files were originally marked up in SGML. We have subsequently updated the corpus DTD (Document Type Definition) and converted all the files to XML.

At present, only the orthographically transcribed version of the corpus is available. Future releases of MICASE may have various kinds of linguistic annotation added: parts of speech, lemmas, and discourse-pragmatic categories. The part-of-speech tagger used on MICASE for in-house use is the CLAWS tagger developed at Lancaster University, UK.

The MICASE orthographic transcription conventions and mark-up system are intended to keep transcripts readable while including enough detail to ensure adequate comprehension from the text of the transcript alone. To this end, we use standard orthography for most words, except in select situations where standard conventions may cause confusion, and for a limited number of lexicalized abbreviations and grammatical constructions (e.g., cuz, gonna, hafta, sorta, and several others). We do not use standard punctuation, but instead mark pauses of varying lengths with commas, periods, and ellipses. We also use question marks to identify phrases that function pragmatically as questions.
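Because punctuation in the transcripts marks pauses and pragmatic questions rather than sentence structure, a script that processes the text should treat these marks as annotations. The following is a minimal Python sketch of that idea, not part of the MICASE tools. It assumes that ellipses are written as three periods; the label names, the function names, and the example utterance are invented for illustration.

```python
import re

# Marker inventory taken from the paragraph above. The labels on the right
# are illustrative names, not MICASE terminology.
MARKERS = {
    "...": "ellipsis",
    ",": "comma",
    ".": "period",
    "?": "question",
}

# Assumption: an ellipsis is typed as three periods. It is matched first so
# that it is not counted as three separate period markers.
MARKER_RE = re.compile(r"\.\.\.|[,.?]")


def count_markers(utterance):
    """Count each pause or question marker in one transcribed utterance."""
    counts = {label: 0 for label in MARKERS.values()}
    for match in MARKER_RE.finditer(utterance):
        counts[MARKERS[match.group()]] += 1
    return counts


def strip_markers(utterance):
    """Return the words of the utterance with all markers removed."""
    return MARKER_RE.sub(" ", utterance).split()


# Invented example utterance, not taken from the corpus.
example = "so, we're gonna start with... the first one. okay?"
print(count_markers(example))
print(strip_markers(example))
```

Keeping the counts separate from the word list lets pause behavior and vocabulary be studied independently. The sketch does not assign a duration to any marker, since the actual pause lengths are defined in the full conventions.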

All backchannel cues and hesitation or filler words were transcribed using a set number of normalized orthographic representations that disregard minor phonetic variations. These, like overlaps and interruptions, are transcribed in a way that illustrates their sequential occurrence, but still indicates which speaker holds the floor.

We originally used a customized set of SGML tags adapted from the TEI conventions, which have since been converted to XML. Additionally, all the speaker demographic information and recording information is tagged in the header. Our transcripts were first created using Author Editor, an SGML text editing program. After the XML conversion, we used XMLSpy.
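Since speaker and recording information is tagged in the header of each XML file, it can be read with any standard XML parser. The Python sketch below shows one way this might look, using only the standard library. The element and attribute names (transcript, header, speaker, u, id, role, who) and the sample content are hypothetical placeholders chosen for illustration; the actual names are defined by the corpus DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: these element and attribute names are placeholders,
# not the real MICASE tag set, which is defined by the corpus DTD.
SAMPLE = """\
<transcript>
  <header>
    <speaker id="S1" role="instructor" />
    <speaker id="S2" role="student" />
  </header>
  <body>
    <u who="S1">okay, let's get started.</u>
    <u who="S2">um, which chapter?</u>
  </body>
</transcript>
"""


def speakers_from_header(xml_text):
    """Map each speaker id to the attributes recorded in the header."""
    root = ET.fromstring(xml_text)
    return {sp.get("id"): dict(sp.attrib)
            for sp in root.findall("./header/speaker")}


def utterances(xml_text):
    """Return (speaker id, text) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(u.get("who"), u.text) for u in root.iter("u")]


speakers = speakers_from_header(SAMPLE)
for who, text in utterances(SAMPLE):
    # Join each utterance to its speaker's header entry.
    print(speakers[who]["role"], "|", text)
```

Keeping demographic data in the header and referring to speakers by identifier in the body means each speaker is described once, however many turns that speaker takes.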

A complete description of the spelling, transcription, and mark-up conventions is provided on the MICASE website.