A reduced redundancy USENET corpus (WestLabUSENET)

This corpus is a collection of public USENET postings collected between October 2005 and January 2011. It covers 47,860 English-language, non-binary-file newsgroups (see the list of newsgroups included with the corpus for details). Despite our best efforts to clean this corpus, it contains a very small percentage of non-English words and non-words. No automatic spelling correction was performed, and no text was transformed. The corpus is untagged, raw text. It may be necessary to process the corpus further to put it in a format that suits your needs.

It contains USENET discussions, including all manner of interactions. It may also contain advertisements, spam, vulgar language and other related content. This corpus was collected to build computational models of language, but it has been used by many people for many purposes since its release.

This work would not have been possible without the hardware and software provided by the TaPoR project. This research is also supported by NSERC.

Project leader: Cyrus Shaoul
Time of compilation: 2005 - 2011
Released: 2013
Size: 6,089,697,986 words, 22,799,995 texts
Project home page:
http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

Manual

Available on the web site.

Compilers

Cyrus Shaoul

Availability

This corpus is freely available under a Creative Commons license. It is distributed as a single compressed text file.
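
Because the corpus ships as one large compressed file of raw text, it is usually more practical to stream it than to decompress it fully first. The following is a minimal sketch in Python, not an official tool of the project. It assumes gzip compression and UTF-8 text, neither of which is stated above, and the function name count_words and the simple letters-and-apostrophes tokenizer are illustrative choices only. Check the downloaded file and the manual on the web site for the actual compression format, encoding and document layout.

```python
import gzip
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters and apostrophes.
# The corpus is raw, untagged text, so any tokenizer is a user choice.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def count_words(path, encoding="utf-8"):
    """Stream a gzip-compressed text file and count lowercased tokens.

    Assumes gzip compression and the given text encoding; both are
    assumptions, not documented properties of the corpus file.
    """
    counts = Counter()
    # Text mode ("rt") decompresses and decodes line by line, so the
    # whole file never has to fit in memory.
    with gzip.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        for line in handle:
            counts.update(token.lower() for token in TOKEN_RE.findall(line))
    return counts
```

Calling count_words on the downloaded file (under the assumptions above) returns a Counter, so its most_common method can be used to list the most frequent word forms.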

Reference line and copyright

Shaoul, C. & Westbury, C. (2013). A reduced redundancy USENET corpus (2005-2011). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)

Bibliography

Shaoul, C. & Westbury, C. (2013). A reduced redundancy USENET corpus (2005-2011). Edmonton, AB: University of Alberta (downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html)

Shaoul, C. & Westbury, C. (2010). Exploring lexical co-occurrence space using HiDEx. Behavior Research Methods, 42:2, 393-413.

Shaoul, C. & Westbury, C. (2006). Word Frequency Effects in High-Dimensional Co-Occurrence Models: A New Approach. Behavior Research Methods, 38:2, 190-195.