Pakistan Written English Corpus

The Pakistan Written English Corpus (PWEC) contains 7,586,110 tokens in 4,158 files. It covers four genres: 1. Non-Fiction, 2. Newspapers and Magazines, 3. Dissertations and Research Articles, and 4. Legal and Official Language. All texts in these four genres were published between 2020 and 2023.

The PWEC has no Fiction subcorpus. Within this short time frame (2020–2023), not enough short stories, novels, essays, etc. were written to compile a 1.8-million-word subcorpus equivalent to the other subcorpora of the PWEC. The Newspapers and Magazines subcorpus does contain many files that discuss the different registers of that genre; even so, the idea of a Fiction subcorpus was abandoned.

The PWEC contains 41 different categories, including Acts, Amendments, Bills, Dissertations, Research Articles, Art & Culture, Business Articles, Editorials, Perspectives, Sports Articles, Opinions, Letters, Magazines, Notifications, Press Releases, Reports of different kinds, Court Appeals, Features, Court Petitions, Circulars, Rules, Constitution, Revisions, Notices, Orders, Court Judgments, Baseline Studies, Conference Reports, Judicial Estacode, Annual Reports, Biddings, Quarterly Reviews, Blogs, Literature, Non-fiction writings (on History, Politics, Economy, Architecture, Mental & Physical Health, Religion, Irrigation, International Relations, World Englishes, Human Rights, Feminism, etc.), Ordinances, Case Law, Diagnostic Reviews, Proposals, Standard Procedure and Formula Price Adjustment, Rule of Procedures, and Yearly Books. For further information, see the Basic structure of PWEC.

The four genres are almost equal in token count, at about 1.8 million words each. Because file sizes differ between genres, the number of files differs as well: Newspapers and Magazines has the most files (3,177), whereas Non-Fiction has the fewest (35). The average token count per file is 596 in the Newspapers and Magazines subcorpus and 54,225 in the Non-Fiction subcorpus.
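The file counts and per-file averages reported above can be cross-checked with simple arithmetic. The following is a minimal sketch in Python, not part of the corpus documentation: the variable and function names are illustrative, and because the published averages are rounded, the products only approximate the true subcorpus sizes.

```python
# Figures taken from the corpus description above.
TOTAL_TOKENS = 7_586_110
TOTAL_FILES = 4_158
NUMBER_OF_GENRES = 4

# Only two subcorpora have published file counts and averages.
subcorpora = {
    "Newspapers and Magazines": {"files": 3177, "avg_tokens_per_file": 596},
    "Non-Fiction": {"files": 35, "avg_tokens_per_file": 54225},
}


def implied_tokens(files: int, avg_tokens_per_file: int) -> int:
    """Approximate subcorpus size from a file count and a rounded average."""
    return files * avg_tokens_per_file


implied = {
    name: implied_tokens(info["files"], info["avg_tokens_per_file"])
    for name, info in subcorpora.items()
}

# Average file size over the whole corpus, and an equal four-way split.
overall_average = TOTAL_TOKENS / TOTAL_FILES
even_share = TOTAL_TOKENS / NUMBER_OF_GENRES

for name, tokens in implied.items():
    print(f"{name}: about {tokens:,} tokens")
print(f"Overall average per file: {overall_average:.0f} tokens")
print(f"Equal share per genre: {even_share:,.1f} tokens")
```

Both products come to roughly 1.89 million tokens (1,893,492 and 1,897,875), close to an equal quarter of the full corpus, which is consistent with the statement that the four genres are almost the same size.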

Project leader: Mr. Usman Khan, Hazara University Mansehra, Pakistan
Time of compilation: 2023
Size: 7,586,110 words
Language: Pakistani English
Number of texts/samples: 4,158
Period: 2020–2023
Funding: None

Reference line and copyright

Khan, U. (2023) Pakistan Written English Corpus.

Compilers

Mr. Usman Khan

Availability

The corpus can be used by anyone interested in exploring the Pakistani variety of English. Mr. Usman Khan holds the copyright.

Technical information

Software used: Notepad++, AntConc (version 4.2), Free OCR, and online tools such as Tinywow.com. The files are in three formats: PDF, Word (.docx), and plain text. The corpus has no POS-tagging, parsing, or other annotation.
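Since the plain-text files carry no annotation, basic counts can be reproduced with a short script. The sketch below is an illustration under stated assumptions, not the compiler's procedure: it treats every run of non-whitespace characters as one token (AntConc's token definition may differ, so totals need not match the published figures), it assumes UTF-8 files with a .txt extension, and the demo folder and its two files are made up.

```python
import re
import tempfile
from pathlib import Path

# Assumption: a token is any maximal run of non-whitespace characters.
TOKEN_PATTERN = re.compile(r"\S+")


def count_tokens(text: str) -> int:
    """Count whitespace-delimited tokens in a string."""
    return len(TOKEN_PATTERN.findall(text))


def summarize_folder(folder: Path) -> dict:
    """Return file count, token total, and average tokens per .txt file."""
    counts = {
        path.name: count_tokens(path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(folder.glob("*.txt"))
    }
    total = sum(counts.values())
    average = total / len(counts) if counts else 0.0
    return {"files": len(counts), "tokens": total, "average": average}


# Demo on a throwaway folder holding two invented sample files.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "editorial_001.txt").write_text(
        "The budget was presented on Friday.", encoding="utf-8"
    )
    (demo / "letter_001.txt").write_text(
        "Sir, I write to draw attention\nto load shedding.", encoding="utf-8"
    )
    demo_summary = summarize_folder(demo)

print(demo_summary)
```

To apply the same function to a real subcorpus, pass the path of a folder of plain-text files to summarize_folder. The PDF and Word files would first need to be converted to plain text.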