Corpus of Video-Mediated English as a Lingua Franca Conversations (ViMELF)

ViMELF, the Corpus of Video-Mediated English as a Lingua Franca Conversations, contains 20 Skype conversations between 40 speakers from Germany (20 speakers), Spain (5), Italy (5), Finland (5), and Bulgaria (5), totaling 744.5 minutes (ca. 12.5 hours), with an average conversation length of 37.23 minutes. The corpus comprises 113,670 words in the plain text version and 152,472 items in the annotated version. The transcripts are available as .docx and .txt files; the videos are in MPEG4 format. Several versions are available: the fully annotated pragmatic version as text and XML, a lexical version, and a POS-tagged version. Sociolinguistic background information on the participants is also provided.

Project leader: Prof. Stefan Diemer, Trier University of Applied Sciences
Time of compilation: 2012–2018
Size: 113,670 words (plain text); 152,472 items (annotated version)
Language: English
Number of texts/samples: 20
Period: 2012–2015
Released: 2018
Contact email: case@umwelt-campus.de
Project home page: http://umwelt-campus.de/case

ViMELF in numbers:

20 Conversations
Conversation length: 744.5 min total, ca. 12.5 hours of conversations
Average conversation length: 37.23 min.
Words/Tokens: 113,670 (plain text), 152,472 (annotated version)
Participants: 40 (20 SB, 5 FL, 5 HE, 5 ST, 5 SF)
Medium: Video both sides: 11, video one side: 3, audio: 6
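
The figures above can be cross-checked with a few lines of Python. This is a minimal sketch that uses only the numbers stated in this entry; the variable names are illustrative and not part of the corpus distribution.

```python
# Figures as stated in this entry.
total_minutes = 744.5
conversations = 20
participants_by_site = {"SB": 20, "FL": 5, "HE": 5, "ST": 5, "SF": 5}
medium = {"video both sides": 11, "video one side": 3, "audio": 6}

# Derived values.
average_minutes = total_minutes / conversations
total_hours = total_minutes / 60
total_participants = sum(participants_by_site.values())
total_by_medium = sum(medium.values())

print(f"Average length: {average_minutes:.3f} min")
print(f"Total length: {total_hours:.2f} h")
print(f"Participants: {total_participants}")
print(f"Conversations by medium: {total_by_medium}")
```

The participant counts sum to 40 and the medium counts sum to 20, matching the number of speakers and conversations given above.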

Reference line and copyright

ViMELF. 2018. Corpus of Video-Mediated English as a Lingua Franca Conversations. Birkenfeld: Trier University of Applied Sciences. Version 1.0. The CASE project [umwelt-campus.de/case].

ViMELF – Corpus of Video-Mediated English as a Lingua Franca Conversations. © The CASE Project, Trier University of Applied Sciences, compilers: Stefan Diemer, Marie-Louise Brunner, Caroline Collet, Selina Schmidt.

Manual

Corpus description: https://www.umwelt-campus.de/ucb/index.php?id=12246

General information on data: https://www.umwelt-campus.de/ucb/index.php?id=11349

Transcription conventions: https://umwelt-campus.de/case-conventions

Compilers

Project coordination: Stefan Diemer & Marie-Louise Brunner

Transcription and proofreading: Janine Dieterle, Julian Laudwein, Sina Burghardt

Availability

The corpus can be downloaded free of charge for non-commercial research purposes after a free subscription. More information and the subscription are available via the project home page.

Technical information

Four versions of the corpus are available:

CASE transcription (as docx, rtf and txt): the basic version produced by manual transcription. CASE transcription conventions include spoken language features beyond the words, such as prosodic, paralinguistic and non-verbal features.
XML version (xml): a version of the annotated CASE transcription that encapsulates the original information in machine-readable form, produced with XTranscript (Gee 2018).
Lexical version (lex): a version from which all annotation has been removed, produced with XTranscript (Gee 2018).
Part-of-speech tagged version (pos): a POS-tagged version of the lexical version, produced with the CLAWS POS tagger (C7 tagset).

In addition, video/audio files are provided in MPEG4 format.
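
For readers working with the POS-tagged version, the following is a minimal sketch of tallying tags in Python. It assumes the common CLAWS "horizontal" layout, in which each token is written as word_TAG and tokens are separated by whitespace; the actual layout of the ViMELF pos files may differ. The function name and the sample line are invented for illustration and are not taken from the corpus.

```python
from collections import Counter

def count_tags(text):
    """Count POS tags in text whose tokens look like word_TAG.

    Assumption: CLAWS-style horizontal output. Tokens without an
    underscore (or with an empty word or tag) are skipped.
    """
    tags = Counter()
    for token in text.split():
        # Split on the last underscore so words containing "_" still work.
        word, sep, tag = token.rpartition("_")
        if sep and word and tag:
            tags[tag] += 1
    return tags

# Invented sample line with C7-style tags (not corpus data).
sample = "you_PPY know_VV0 the_AT weather_NN1 is_VBZ nice_JJ"
tag_counts = count_tags(sample)

for tag, n in sorted(tag_counts.items()):
    print(tag, n)
```

To apply this to a real file, the text read from that file can be passed to `count_tags` in place of `sample`, once the file's layout has been checked against the assumption above.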

Associated projects

Corpus of Academic Spoken English (CASE)

CoRD Entry submitted on May 30, 2018 by Stefan Diemer.