Basic structure of PWEC

The tables below summarise the structure of PWEC. More details are available, but these figures should be enough to give an overall picture of the corpus.

Table 1. The whole corpus.
Genre | Period | Files | Tokens | Average token count per file | Weightage
Newspapers and Magazines | Jan 2020–Feb 2023 | 3177 | 1895329 | 596.578 | 24.984%
Dissertations and Articles | Jan 2020–Feb 2023 | 232 | 1896262 | 8173.543 | 24.996%
Non-Fiction | Jan 2020–Feb 2023 | 35 | 1897889 | 54225.400 | 25.017%
Legal and Official Language | Jan 2020–Feb 2023 | 714 | 1896630 | 2656.344 | 25.001%
Total | | 4158 | 7586110 | |
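
The two derived columns in Table 1 can be recomputed from the file and token counts. The sketch below assumes that "Average token count per file" is tokens divided by files and that "Weightage" is a genre's share of all corpus tokens, expressed as a percentage; both readings agree with the printed figures to within 0.001. The names `TABLE_1` and `derived_columns` are illustrative and not part of PWEC.

```python
# Figures copied from Table 1: genre -> (files, tokens).
TABLE_1 = {
    "Newspapers and Magazines": (3177, 1895329),
    "Dissertations and Articles": (232, 1896262),
    "Non-Fiction": (35, 1897889),
    "Legal and Official Language": (714, 1896630),
}


def derived_columns(table):
    """Return {genre: (average tokens per file, weightage in percent)}.

    Assumption: weightage is the genre's share of the total token count.
    """
    total_tokens = sum(tokens for _, tokens in table.values())
    return {
        genre: (tokens / files, 100 * tokens / total_tokens)
        for genre, (files, tokens) in table.items()
    }


for genre, (average, weightage) in derived_columns(TABLE_1).items():
    print(f"{genre}: {average:.3f} tokens per file, {weightage:.3f}% of corpus")
```

Because every genre holds close to 1.9 million tokens, the four weightages all sit near 25%, while the average file length varies widely with the number of files.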

Table 2. Genre of Newspapers and Magazines.
Publication | Period | Sections included | Files | Tokens | Average token count per file | Weightage
The Dawn | Jan 1, 2023, to Feb 28, 2023 | Business, Editorials, Letters, Opinions, Fiction, Non-Fiction, Sports | 558 | 297252 | 532.709 | 15.683%
The News (daily) | Jan 1, 2023, to Feb 28, 2023 | Art & Culture, Business, Editorials, Literature, Opinions, Sports | 698 | 355060 | 508.681 | 18.733%
The Nation (daily) | Jan 1, 2023, to Feb 28, 2023 | Blogs, Business, Editorials, Literature, Opinions, Sports | 675 | 342347 | 507.180 | 18.062%
The Frontier Post (daily) | Jan 1, 2023, to Feb 28, 2023 | Business, Editorials, Opinions, Sports | 306 | 142613 | 466.055 | 7.524%
The Friday Times (weekly) | Jan 1, 2023, to Feb 28, 2023 | Opinions, Editorials, Editorial Pick, Feature, People’s Voice | 268 | 190770 | 711.828 | 10.065%
The Daily Times (daily) | Jan 1, 2023, to Feb 28, 2023 | Art & Culture, Business, Editorials, Perspectives, Sports, Op-Ed | 460 | 311631 | 677.458 | 16.442%
Magazines | Feb 1, 2020, to Feb 20, 2023 | Fiction, Non-Fiction, Food, Lifestyle, Culture, Fashion, Technology, History, Political Issues, Social Issues | 212 | 255656 | 1205.924 | 13.488%
Total | | | 3177 | 1895329 | |
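
The per-publication rows of Table 2 should add up to the Newspapers and Magazines row of Table 1. A minimal consistency check, with the counts copied from Table 2 and a hypothetical helper named `check_totals`, might look like this:

```python
# Figures copied from Table 2: publication -> (files, tokens).
TABLE_2 = {
    "The Dawn": (558, 297252),
    "The News": (698, 355060),
    "The Nation": (675, 342347),
    "The Frontier Post": (306, 142613),
    "The Friday Times": (268, 190770),
    "The Daily Times": (460, 311631),
    "Magazines": (212, 255656),
}

# Totals reported for this genre in Table 1 and in the Total row of Table 2.
EXPECTED_FILES = 3177
EXPECTED_TOKENS = 1895329


def check_totals(rows, expected_files, expected_tokens):
    """Return True if the row counts sum to the reported totals."""
    files = sum(f for f, _ in rows.values())
    tokens = sum(t for _, t in rows.values())
    return files == expected_files and tokens == expected_tokens


print(check_totals(TABLE_2, EXPECTED_FILES, EXPECTED_TOKENS))
```

The same helper can be pointed at the rows of any other subcorpus table to compare them with the corresponding totals.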

Table 3. Genre of Dissertations and Research Articles.
Name | Period | Files | Tokens | Average token count per file | Weightage
Dissertations | Jan 2020–Feb 28, 2023 | 19 | 988236 | 52012 | 52.11%
Research Articles | Jan 2020–Feb 28, 2023 | 213 | 908025 | 4263 | 47.88%
Total | | 232 | 1896261 | |

Table 4. Genre of Non-Fiction.
File name | Period | Tokens | Weightage (%)
Non_Fic_1 | 2022 | 38670 | 2.037
Non_Fic_2 | 2021 | 8606 | 0.453
Non_Fic_3 | 2021 | 43055 | 2.268
Non_Fic_4 | 2020 | 60143 | 3.168
Non_Fic_5 | 2020 | 47478 | 2.501
Non_Fic_6 | 2021 | 60527 | 3.189
Non_Fic_7 | 2022 | 60885 | 3.208
Non_Fic_9 | 2020 | 23233 | 1.224
Non_Fic_10 | 2021 | 58979 | 3.107
Non_Fic_11 | 2020 | 59884 | 3.155
Non_Fic_12 | 2020 | 57920 | 3.051
Non_Fic_15 | 2020 | 60502 | 3.187
Non_Fic_16 | 2021 | 60720 | 3.199
Non_Fic_17 | 2021 | 38379 | 2.022
Non_Fic_18 | 2020 | 47390 | 2.286
Non_Fic_23 | 2022 | 60451 | 3.185
Non_Fic_24 | 2021 | 60847 | 3.206
Non_Fic_25 | 2020 | 59596 | 3.140
Non_Fic_26 | 2021 | 60601 | 3.193
Non_Fic_28 | 2020 | 60528 | 3.189
Non_Fic_29 | 2022 | 60340 | 3.179
Non_Fic_30 | 2021 | 57544 | 3.032
Non_Fic_31 | 2020 | 60871 | 3.207
Non_Fic_32 | 2023 | 60658 | 3.196
Non_Fic_33 | 2020 | 60690 | 3.197
Non_Fic_34 | 2022 | 60745 | 3.200
Non_Fic_35 | 2022 | 60612 | 3.193
Non_Fic_36 | 2023 | 60359 | 3.180
Non_Fic_37 | 2021 | 58492 | 3.081
Non_Fic_38 | 2020 | 59099 | 3.113
Non_Fic_39 | 2021 | 60528 | 3.189
Non_Fic_40 | 2021 | 59993 | 3.161
Non_Fic_41 | 2023 | 59854 | 3.153
Non_Fic_42 | 2023 | 31338 | 1.651
Non_Fic_43 | 2022 | 60690 | 3.197
Total | | 1897889 |

Table 5. Genre of Legal and Official Language.
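Name of institution | Period | Files | Tokens | Categories | Average token count per file | Weightage (%)
PARLIAMENT (NA, SNT) | 2020–2023 | 36 | 234649 | Bills, Acts, Ordinances, Reports, Rules of Procedures, Daily Proceedings | 6518 | 12.371
Govt. of Baluchistan | 2020–2023 | 6 | 29780 | Acts | 4963 | 1.570
FBR | 2020–2023 | 65 | 50447 | Proposal, Bills Reports, Notification, Press Releases | 776 | 2.659
FSC | 2020–2023 | 47 | 163487 | Reports, Rules, Appeals, Notices, Notifications, Revisions, Petitions, Circulars, Orders | 3478 | 8.619
HEC | 2020–2023 | 12 | 64899 | Press Releases, Reports, Tender | 5408 | 3.421
IHC | 2020–2023 | 33 | 98363 | Judgments, Judicial policy, Order-Sheets | 2980 | 5.186
IPS | 2020–2023 | 5 | 100256 | Studies, Reports | 20051 | 5.286
SHC | 2020–2023 | 15 | 73659 | Judgements | 4910 | 3.883
PHC | 2020–2023 | 21 | 397040 | Case law, Estacode Judgements | 18906 | 20.933
Govt. of KPK | 2020–2023 | 19 | 58417 | Acts | 3074 | 3.080
LHC | 2020–2023 | 22 | 56330 | Orders, Judgements, Reports | 2560 | 2.970
MOFA | 2020–2023 | 45 | 34177 | Press Releases | 759 | 1.801
MOIB | 2020–2023 | 165 | 63356 | Press Releases | 383 | 3.340
MOITT | 2020–2023 | 8 | 3131 | Press Releases | 391 | 0.165
MOLJ | 2020–2023 | 18 | 93127 | Acts, Reports | 5173 | 4.910
MOPND | 2020–2023 | 17 | 9901 | Press Releases | 582 | 0.522
NAB | 2020–2023 | 135 | 172900 | Reports, Bidding, Press Releases, Review | 1280 | 9.116
Govt. of Punjab | 2020–2023 | 10 | 52942 | Acts | 5294 | 2.791
SCP | 2020–2023 | 2 | 71889 | Case Law, Annual Report | 35944 | 3.790
PEC | 2020–2023 | 1 | 1971 | Price Adjustment | 1971 | 0.103
SECP | 2020–2023 | 30 | 40827 | Orders, Review, Press Releases, Reports | 1360 | 2.152
FJA | 2020–2023 | 1 | 13058 | Annual Report | 13058 | 0.688
MOF | 2020–2023 | 1 | 12024 | Press Releases | 12024 | 0.633
Total | | 714 | 1896630 | | |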

Compilation principles

The corpus was compiled according to the following parameters:

  1. Writers were educated in Pakistan at least up to high school.
  2. Most writers still live in Pakistan.
  3. As far as possible, news items cover national rather than international affairs.
  4. Different sections of newspapers are included, such as editorials, business, sports, opinions, and letters.
  5. Different sections of magazines are included, such as culinary, food, fashion, lifestyle, and entertainment.
  6. Diversity was ensured by including newspapers from different regions of Pakistan, and dissertations and research articles from different universities and from social and natural science journals of Pakistan.
  7. Female writers were included as much as possible.

Genre

PWEC covers four genres:

  1. Newspapers and Magazines
  2. Non-Fiction
  3. Legal and Official Language
  4. Dissertations and Research Articles

Sociolinguistic coverage

Gender: female writers were included as much as possible, especially in the newspapers and magazines subcorpus of PWEC.

Diversity: newspapers were drawn from different regions and provinces of Pakistan. Dissertations and research articles from both the social and natural sciences were selected from different universities and from different social and natural science journals in Pakistan. The Legal and Official Language subcorpus covers more than 40 government institutions, organizations, and departments, and more than 40 different categories of documents. The Non-Fiction subcorpus spans fields such as business, politics, and the economy.