Data

More detailed information is provided in the User Manual available via the corpus online interface.

Components

The corpus materials are based on the Assessment of Performance Unit (APU) surveys of language performance, carried out by the National Foundation for Educational Research (NFER). The APU writing surveys aimed to assess pupils’ performance in different communicative situations, such as editing, describing, reporting, etc. (Gorman et al. 1991: 29). The current version of the corpus focuses on the Year 6 age-group and on two text types (communicative functions) with a long-standing tradition in UK schools, namely narration-cum-description and argumentation-cum-persuasion.

The corpus consists of two major components: writings by children (“School Scripts”) and writings for children (“Basal Readers”), in line with work by Biber and associates (Reppen 1994, Biber et al. 2002). The former helps to identify the range of lexical and grammatical features that are fully or partially mastered at Year 6 level; the latter indicates which linguistic features this age-group tends to be exposed to or presented with as linguistic models.

The “School Scripts” component consists of 522 scripts and ca. 93,000 words. The data have been stratified by year of survey, type of text (communicative function), and pupil’s sex. For the year 1979 there are 246 files, ca. 40,800 words, with 123 files for each communicative function, distributed as ca. 12,700 words for the argumentative/persuasive task and ca. 28,100 words for the narrative/descriptive type of text. For the year 1988 there are 276 files, ca. 52,000 words, with 138 files for each communicative function, distributed as ca. 16,000 words in the argumentative/persuasive task, and ca. 36,000 words in the narrative/descriptive type of text.

For the “Basal Readers” component, we first collected 13 readers used in the APU Surveys, ca. 15,500 words (from 1979, 1982, 1988). In a second stage, we added 8 supplementary materials, ca. 63,800 words. The component thus totals 21 files and ca. 79,300 words.
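The stratification figures above can be cross-checked with a few lines of Python. This is only a sketch: the counts are the approximate ("ca.") values quoted in the text, and the dictionary layout and function names are illustrative assumptions, not part of the corpus distribution.

```python
# Approximate ("ca.") figures as quoted in the text; the layout is illustrative.
SCHOOL_SCRIPTS = {
    (1979, "argumentative/persuasive"): {"files": 123, "words": 12_700},
    (1979, "narrative/descriptive"): {"files": 123, "words": 28_100},
    (1988, "argumentative/persuasive"): {"files": 138, "words": 16_000},
    (1988, "narrative/descriptive"): {"files": 138, "words": 36_000},
}

BASAL_READERS = {
    "APU survey readers": {"files": 13, "words": 15_500},
    "supplementary materials": {"files": 8, "words": 63_800},
}


def totals(strata):
    """Return (file count, word count) summed over all strata."""
    files = sum(s["files"] for s in strata.values())
    words = sum(s["words"] for s in strata.values())
    return files, words


def totals_for_year(strata, year):
    """Return (file count, word count) for one survey year."""
    selected = {key: s for key, s in strata.items() if key[0] == year}
    return totals(selected)


if __name__ == "__main__":
    print("School Scripts:", totals(SCHOOL_SCRIPTS))
    print("  1979:", totals_for_year(SCHOOL_SCRIPTS, 1979))
    print("  1988:", totals_for_year(SCHOOL_SCRIPTS, 1988))
    print("Basal Readers:", totals(BASAL_READERS))
```

The summed word count for the School Scripts comes to 92,800, which the text rounds to ca. 93,000.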

Parameters

For each script of the “School Scripts” component: Pupil ID, Pupil’s sex, Pupil’s date of birth, Script filename, Script title, School level, Survey year, Skill, Task, Task function, Attainment band, Extent.
For each file of the “Basal Readers” component: Filename, Short title, Function, Publication year, Bibliographic reference, Author’s name, Author’s sex, Contents, Extent.

Metadata

The corpus metadata are stored in full detail in an MS Access relational database. A selection of the metadata has been encoded in XML TEI headers, following the four major TEI header elements: file description, encoding description, profile description, and revision history. The parameters displayed in the online interface include the following:

Pupil’s details: ID number, Sex, Date of birth
Script reference: Domain, Survey date, Filename, Script title
Script description: Level, Skill, Function, Task, Attainment band, Length
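To make the four-part TEI header structure mentioned above more concrete, the following Python sketch builds a bare skeleton with the standard TEI element names (fileDesc, encodingDesc, profileDesc, revisionDesc). It is an assumption-laden illustration: the text does not document how the corpus headers actually distribute the metadata fields, so the function name, the placement of each value, and the sample values are all hypothetical.

```python
import xml.etree.ElementTree as ET


def build_header_skeleton(title, filename, function, note):
    """Build a minimal teiHeader with its four major parts.

    Where each field is placed is an illustrative guess, not the
    corpus's documented encoding.
    """
    header = ET.Element("teiHeader")

    # 1. File description: bibliographic identification of the file.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = filename
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = note

    # 2. Encoding description: how the text was transcribed or encoded.
    encoding = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding, "p").text = "Encoding practice (placeholder)."

    # 3. Profile description: non-bibliographic aspects such as text type.
    profile = ET.SubElement(header, "profileDesc")
    text_class = ET.SubElement(profile, "textClass")
    keywords = ET.SubElement(text_class, "keywords")
    ET.SubElement(keywords, "term").text = function

    # 4. Revision history: a log of changes to the file.
    revision = ET.SubElement(header, "revisionDesc")
    ET.SubElement(revision, "change").text = "Header created (placeholder)."

    return header


if __name__ == "__main__":
    # All values below are invented for the example.
    header = build_header_skeleton(
        title="Sample script",
        filename="sample_script.xml",
        function="narrative/descriptive",
        note="Hypothetical source note.",
    )
    print(ET.tostring(header, encoding="unicode"))
```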

Multidimensional Analysis

The Multi-Dimensional Analysis framework was first introduced by Biber (1988), who applied it to synchronic register variation in English on the basis of adults’ writings. Later studies have shown its applicability to historical register variation, to university students’ writings, and to other languages (e.g. Biber 1995, Biber 2006, Biber & Finegan 1997). Given the importance of this approach for the study of genre/register variation, we have carried out a number of analyses with the data compiled in the APU Corpus. The software used is the Multidimensional Analysis Tagger (MAT, Nini 2015), which automatically tags the input files and produces a number of output files with corpus statistics, scores for the dimensions of variation, and z-score values for linguistic features, as well as graphs plotting the dimension values by text type and by genre.
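As a rough illustration of what z-scores and dimension scores mean in this kind of analysis, the sketch below standardises feature frequencies against reference means and standard deviations, then combines them into one dimension score by adding the z-scores of positively loading features and subtracting those of negatively loading ones. This follows the general logic of Multi-Dimensional Analysis rather than MAT's actual code; the feature names, reference values, and groupings are invented for the example.

```python
def z_score(value, mean, sd):
    """Standardise a feature frequency against a reference mean and SD."""
    return (value - mean) / sd


def dimension_score(z_scores, positive, negative):
    """Sum z-scores of positive-loading features minus negative-loading ones."""
    plus = sum(z_scores[name] for name in positive)
    minus = sum(z_scores[name] for name in negative)
    return plus - minus


if __name__ == "__main__":
    # Hypothetical normalised frequencies for one text.
    frequencies = {"feature_a": 12.0, "feature_b": 3.0, "feature_c": 9.0}

    # Hypothetical reference values as (mean, standard deviation).
    reference = {
        "feature_a": (10.0, 4.0),
        "feature_b": (5.0, 2.0),
        "feature_c": (6.0, 3.0),
    }

    z = {
        name: z_score(freq, *reference[name])
        for name, freq in frequencies.items()
    }

    # Hypothetical dimension: a and b load positively, c negatively.
    score = dimension_score(
        z, positive=["feature_a", "feature_b"], negative=["feature_c"]
    )
    print(z)
    print(score)
```

A positive z-score means the feature is more frequent in the text than in the reference data; the dimension score summarises how strongly the text leans towards one pole of the dimension.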