Corpus Resource Database submissions guidelines

VARIENG welcomes you to submit information on your own corpus to the database. This service is provided completely free of charge for the benefit of the corpus linguistic community, students in the field and the general public.

What kind of corpora may be submitted?

We list English-language corpora representing any region or period. The corpora may be of any size or register, commercial or open access; we also accept corpora that are still in progress or that are unavailable for copyright or other reasons.

How to submit corpus information?

Provide as much information as you can. If you use the submission form, you can prepare long stretches of text beforehand and paste them into the relevant fields.

Please note that the electronic form cannot be saved on your computer. If you are unable to complete your submission in one go, please make a note of that in the last field of the electronic form. You can submit the rest of the information at another time, without having to re-enter the information you have already submitted.

If your corpus is described somewhere online, or in an article or other type of publication, we can put together a basic description for you. Simply let us know where to look and whom to contact when the page is ready for preview. We will still need you to approve the page, and we welcome any additional information and requests for revision from you. Note that we cannot reproduce previously published content without the proper publication rights, although basic information can be reproduced. We will put together basic tables and charts, but we will not verify the data or do any statistical work, and we cannot compile a bibliography for your corpus.

Issues of publication rights

Do not forget to fill in your own contact information; we cannot publish information on any corpus without valid contact information. If your corpus was compiled by a team, please make sure that all of the compilers agree with the information you intend to submit. We endeavour to publish the text you submit 'as is', but retain the right to do slight copy editing. If we feel major revisions are necessary, we will be in touch with you. Non-native speakers of English should consider having their submissions language-checked. You are free to submit additions, revisions, and deletions at any time.

Please note that only the compilers and/or copyright holders of a corpus may submit primary information to CoRD. If you are neither of the above, but feel you have information which should be included, you are encouraged to fill in our feedback form instead. We are always happy to receive such information and will take any suggestions for additions or alterations into consideration.

What to include in your submission?

For an idea of how entries are constructed and what kind of information may be included, please take a look at some existing corpus descriptions as well as the Corpus Finder. All entries should provide the same basic information, mostly displayed on the front page of the description and partly used for sorting and filtering corpora in the Corpus Finder. It may be helpful to the user of CoRD if entries follow roughly the same structure (introduction, basic structure, background, bibliography – each with subheadings as needed), but descriptions may vary according to the type of corpus and availability of information. The corpus submission form, described below, covers all the basic information needed and should be used as a guideline for email submissions as well.

Submitter data

We need your contact information so that we can ask you to approve the finished entry before publishing it. The submitter(s)/editor(s) of the entry will be credited on the front page of the entry, but your email address will not be shown.

  • Title
  • Name
  • Institution
  • Email
  • Editors of this submission (if other than the submitter)

Corpus information: the basics

This information can be elaborated on in the next section.

  • Name of corpus
  • Abbreviation
  • Start year: the date of the earliest text(s) in the corpus
  • End year: the date of the latest text(s) in the corpus
  • Size in words
  • Number of texts/samples
  • Type of corpus: please indicate whether the corpus is multigenre or genre-specific, i.e. whether it covers several genres or is limited to one genre such as, for example, medical texts.
  • Register of texts in the corpus
    • written
    • spoken
    • both
  • Annotation: please mention all that apply, including ‘none’ if a plain (unannotated) version is available
    • POS-tagged: part-of-speech annotation
    • parsed: syntactic annotation
    • other: annotation of, e.g., discursive features, text structure, phonetic features, orthography, etc.
    • none
  • Format
    • CD/DVD: The corpus is distributed on a disc.
    • download: The corpus can be downloaded from the internet.
    • online: The corpus is accessible online without downloading.
    • on-site: The corpus can only be accessed locally.
    • not available
  • Availability
    • open access: The corpus can be freely used by anyone.
    • free subscription: The corpus is free to use but requires a subscription.
    • license required: A paid subscription is required.
    • commercial
    • in preparation
    • not available: The corpus is not available to outside users, e.g. for copyright reasons.
  • Compilation started in: the year when work on the corpus began
  • Compilation finished in: if not yet finished, please indicate that the corpus is a work in progress and update this information later.
  • Release year
  • Funding information
  • Corpus web site address
  • Reference line: how the corpus should be referenced when used in publications.
  • Copyright information: e.g. who owns the copyright to corpus texts, any limitations to their use, etc.
  • Corpus manual: a bibliographical reference and/or web address to the manual
  • Project leader
  • Compilers: Possible subheadings: compilers, collaborating researchers, other team members, student/research assistants; if possible, please include links to compilers’ home pages.
  • Technical information: a brief description of software / file format / other relevant technical details
  • Associated projects: if possible, please include links to web sites
  • Brief corpus description: a short introduction to be displayed at the top of the front page
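
For submitters who prepare their information in advance, the basic fields above can be sketched as a small record with a few sanity checks. This is only an illustration and not an official CoRD schema: the names `CorpusBasics` and `check_basics` are hypothetical, only a subset of the fields is shown, and treating format and availability as single values is an assumption. The option lists are copied from the bullets above.

```python
from dataclasses import dataclass, field
from typing import List

# Option lists copied from the guidelines above (not an official schema).
REGISTERS = {"written", "spoken", "both"}
ANNOTATIONS = {"POS-tagged", "parsed", "other", "none"}
FORMATS = {"CD/DVD", "download", "online", "on-site", "not available"}
AVAILABILITY = {
    "open access", "free subscription", "license required",
    "commercial", "in preparation", "not available",
}


@dataclass
class CorpusBasics:
    """Hypothetical record holding a subset of the basic fields."""
    name: str
    abbreviation: str
    start_year: int          # date of the earliest text(s)
    end_year: int            # date of the latest text(s)
    size_in_words: int
    register: str
    corpus_format: str       # assumption: a single value
    availability: str        # assumption: a single value
    annotation: List[str] = field(default_factory=list)  # all that apply


def check_basics(entry: CorpusBasics) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if entry.start_year > entry.end_year:
        problems.append("start year is later than end year")
    if entry.size_in_words <= 0:
        problems.append("size in words must be positive")
    if entry.register not in REGISTERS:
        problems.append(f"unknown register: {entry.register!r}")
    if entry.corpus_format not in FORMATS:
        problems.append(f"unknown format: {entry.corpus_format!r}")
    if entry.availability not in AVAILABILITY:
        problems.append(f"unknown availability: {entry.availability!r}")
    for tag in entry.annotation:
        if tag not in ANNOTATIONS:
            problems.append(f"unknown annotation: {tag!r}")
    return problems


# Invented placeholder values, for illustration only.
example = CorpusBasics(
    name="Example Corpus", abbreviation="EXC",
    start_year=1500, end_year=1700, size_in_words=100_000,
    register="written", corpus_format="download",
    availability="open access", annotation=["POS-tagged", "none"],
)
print(check_basics(example))
```

A check of this kind only catches obvious slips, such as reversed years or a mistyped option; it does not replace the approval step described earlier.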

Structure of corpus

A more detailed look at the corpus.

  • Basic structure: a more detailed description of the corpus, including information on, e.g., subdivisions of texts, sample size, time-span, word counts; tables & figures are also welcome as attachments.
  • Parameters & coding
  • Annotation
  • Genres, text types
  • Sociolinguistic coverage
  • Technical information

Background, Publications, Other

One of the objectives of CoRD is to record the history of corpus compilation. To that end, we encourage you, as a corpus compiler, to make use of the 'background information' section by writing a description of the compilation process, the choices you made and perhaps the challenges you faced. Photographs and press clippings can also be included.

In addition to serving historiographical purposes, the descriptions of compilation history will help the corpus linguistic community by shedding more light on the principles followed in selecting samples for corpora, the methods compilers have used to ensure the reliability of primary data, and the thinking behind the choice of descriptors and metadata.

  • Background and history: Any information you would like to include about the compilers, the motives behind the corpus, its stages of development, compilation guidelines etc. Images are also welcome!
  • Reference works: bibliographical references / web addresses for further information on the corpus
  • Bibliography: publications in which the corpus has been used as a source; please list first in descending order by year of publication and then alphabetically by author.
  • Other relevant information, errata, files appended, questions, comments, feedback
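
The ordering requested for the bibliography (descending by year of publication, then alphabetically by author) can be produced with a two-part sort key. The sketch below is only one way to do it: the `Reference` type and `sort_bibliography` function are hypothetical names, and the entries are invented placeholders rather than real publications.

```python
from typing import List, NamedTuple


class Reference(NamedTuple):
    """Hypothetical minimal bibliography entry."""
    author: str
    year: int
    title: str


def sort_bibliography(refs: List[Reference]) -> List[Reference]:
    # Negating the year gives descending order by year; within a year,
    # authors are compared case-insensitively in ascending order.
    return sorted(refs, key=lambda r: (-r.year, r.author.casefold()))


# Invented placeholder entries, for illustration only.
entries = [
    Reference("Author B", 2008, "Placeholder title 1"),
    Reference("Author C", 2012, "Placeholder title 2"),
    Reference("Author A", 2008, "Placeholder title 3"),
]

for ref in sort_bibliography(entries):
    print(ref.year, ref.author, ref.title)
```

Sorting on the author string is a simplification; names with particles or diacritics may need a more careful collation.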