Studies in Variation, Contacts and Change in English

Volume 22 – Data Visualization in Corpus Linguistics: Critical Reflections and Future Directions

Edited by Lukas Sönning & Ole Schützler

Abstracts

Schützler, Ole
Frequencies in corpus linguistics: Issues of scaling and visualisation
https://urn.fi/URN:NBN:fi:varieng:series-22-1

This chapter starts from the premise that the ways in which corpus linguists conceptualise and visualise frequencies (and frequency differences) have not been sufficiently problematised and need to be addressed in a more principled way. A key aspect of this is the non-linearity of frequency differences and its consequences for data displays, with a focus on normalised text frequencies, frequency differences, and proportions (or percentages) in diachronic research. These issues are relevant because, ultimately, they may strongly affect how we interpret results in empirical, corpus-based research. The chapter proceeds from general and theoretical considerations – rooted, for instance, in principles of numerical processing – to their application in data scaling and visualisation. Researchers are supported in making decisions on three counts: (i) how to transform absolute frequency values, (ii) which plot types to use, and (iii) how to label and annotate graphs in an accessible and transparent way. Different ways of scaling and displaying frequencies will be introduced and compared, in order to provide recommendations for more informative, easily (and flexibly) interpretable displays. For illustration, the chapter draws on datasets from the published literature and offers modified visualisations of their findings. To aid researchers in applying suggested solutions, annotated R scripts are made available online.
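
As a rough illustration of the non-linearity point, the following Python sketch converts raw counts to normalised frequencies per million words and then compares them on a log2 scale, where equal ratios appear as equal distances. It is not taken from the chapter (whose own scripts are in R); the counts, corpus sizes, and function names are invented for illustration.

```python
import math

def per_million(count, corpus_size):
    """Normalised text frequency: occurrences per million words (pmw)."""
    return count * 1_000_000 / corpus_size

def log2_ratio(freq_a, freq_b):
    """Frequency change on a log2 scale: +1 means a doubling, -1 a halving."""
    return math.log2(freq_b / freq_a)

# Hypothetical counts for a frequent and a rare item in two periods
# (illustrative values only, not data from the chapter).
early = per_million(count=50, corpus_size=2_000_000)
late = per_million(count=200, corpus_size=2_000_000)
rare_early = per_million(count=2, corpus_size=2_000_000)
rare_late = per_million(count=8, corpus_size=2_000_000)

# On a linear scale the two changes look very different in size ...
print("absolute change (pmw):", late - early, rare_late - rare_early)
# ... but on a log scale both are the same fourfold increase.
print("log2 change:", log2_ratio(early, late), log2_ratio(rare_early, rare_late))
```

A plot with a linear frequency axis would show the first change as far larger than the second, whereas a log-scaled axis would give both the same visual extent.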

Sönning, Lukas
Drawing on principles of perception: The line plot
https://urn.fi/URN:NBN:fi:varieng:series-22-2

This paper draws attention to an underused display type for corpus data visualization: the line plot. While this graph type is commonly associated with time series data, its true potential arguably unfolds in the application to multifactorial data sets involving discrete (categorical) variables. Data layouts of this kind are typical of corpus-based work and the preferred vehicle for their visualization is currently the bar chart. It is sometimes argued that line plots should only be used when the horizontal axis represents a continuous trait. However, once we allow for the levels of binary and categorical variables to be connected by lines, we recognize that this form offers several distinct advantages over bar charts. This is especially true for visualization tasks involving multiple predictor variables. The paper starts out by providing some theoretical background for the comparative evaluation of graph types, with a focus on quantitative comparisons and perceptual processing. Drawing on empirical insights into visual perception, evidence-based recommendations for the design of line plots are given. These include the choice of line types and plotting symbols, the use of direct labeling, and the arrangement of variables in the display. Following this, key advantages of line plots are illustrated. These include pictorial minimalism, the availability of extended encoding strategies and the scaffolding provided by perceptual grouping laws. The paper closes by emphasizing limitations of this display type. These concern the depiction of non-continuous x-variables and the asymmetric perception of interactions among predictor variables. While ample attention should be paid to these issues, we argue for a (more) routine use of line plots in corpus data visualization.

Moisl, Hermann
Visualizing the shape of high-dimensional data: Fundamental ideas
https://urn.fi/URN:NBN:fi:varieng:series-22-3

The first step in the analysis of data, linguistic or otherwise, is often graphical visualization with the aim of identifying latent structure, awareness of which can be used in hypothesis formulation. Where the data dimensionality is three or fewer, the standard visualization method is to plot the values for each data object relative to one, two, or three axes, where each axis represents a variable. As dimensionality grows beyond three, however, visualization becomes increasingly difficult and, for dimensionalities in the hundreds or thousands, even widely used methods like parallel coordinate plots become intractable. General solutions for visualization of high-dimensional data are indirect; the present discussion describes two such solutions, both of which are based on the fundamental ideas that data have a shape in n-dimensional space, where n is the dimensionality of the data, and that this shape can be projected into two- or three-dimensional space for graphical display with, in general, tolerable loss of information. The first of these is Principal Component Analysis, which is very widely used but ignores any nonlinearity in the data manifold, and the second is the Self-Organizing Map, which preserves the topology of the manifold and thereby any nonlinearity in it. The discussion is in two main parts. The first part introduces mathematical concepts relevant to data representation, and the second describes the selected visualization methods using these concepts.
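
The projection idea can be sketched in a few lines of Python. The example below is a minimal illustration of Principal Component Analysis via the singular value decomposition, assuming NumPy is available; it is not the chapter's own code, and the data matrix (six hypothetical texts described by five variables) is randomly generated purely for demonstration.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project the rows of `data` onto the first principal components."""
    # Centre each variable (column) on its mean.
    centered = data - data.mean(axis=0)
    # SVD of the centred matrix; the rows of `vt` are the principal axes.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of each data object in the reduced space.
    scores = centered @ vt[:n_components].T
    # Share of total variance captured by each retained component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]

# Hypothetical data: 6 objects (e.g. texts) x 5 variables, random values.
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 5))

scores, explained = pca_project(data)
print(scores.shape)       # one 2-D point per data object
print(explained.sum())    # proportion of variance kept in the projection
```

The explained-variance share gives a rough sense of how much information the two-dimensional display loses; because the projection is linear, any curvature in the data manifold is flattened rather than preserved.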

Grafmiller, Jason
Visualizing grammatical similarities in comparative variationist analysis
https://urn.fi/URN:NBN:fi:varieng:series-22-4

Variationist research seeks to identify and examine variation among factors that influence linguistic choices, i.e. alternate ways of saying ‘the same’ thing (Labov 1972: 188), and comparative variationists in particular focus on how the influence of those factors differs across varieties. The aim of this paper is to illustrate methods for visualizing effects of such factors and their cross-varietal patterns based on techniques already common in variationist analysis, e.g. logistic regression and random forests. Recent years have seen a rise in new methods for visually interpreting statistical models; however, these methods have not been taken up much (yet) in linguistics research. As a way of encouraging wider use of visualization techniques in variationist studies, I present a few of these newer methods using a case study of the English genitive alternation across five genres of written American English. I show how different methods can be used to examine effects of linguistic factors – and their interactions – from the perspective of entire datasets as well as individual observations.

Tyrkkö, Jukka
Network graphs to the rescue, or how to visualise distributions and networks in corpora and language
https://urn.fi/URN:NBN:fi:varieng:series-22-5

Whether we are talking about the structural properties of corpora or the dispersion of linguistic phenomena within corpora or the language system, corpus-based analyses almost invariably deal with complex and relational data. However, due in part to the design of online and standalone corpus tools, corpora are often treated exclusively from the so-called bag-of-words perspective. As corpora have increased in size, it has become increasingly difficult to understand their structures and metadata, and associations between linguistic features are almost impossible to grasp from tabular data and test statistics alone. In recent years, data visualisation methods developed in the natural sciences have become a part of the digital humanist’s toolkit for gaining insights into complex data, for understanding their structure, for identifying outliers and noteworthy categories, and for communicating findings in a way that readers and audiences will remember. In this paper, I will focus on network visualisations, which are highly suited for both exploring and presenting complex linked data. The main tool discussed is Cytoscape, an open-access network visualisation tool widely used in bioinformatics and supported by a large user base. I will present a series of case studies of how network visualisations can assist in both exploratory analysis and descriptive visualisation of corpora and linguistic data. First, I will demonstrate their utility for exploring the structures of corpora and their metadata. Second, I will show how visualisation methods can clarify collocate relationships and how such visualisations can be designed to represent association strengths in a way that does not mislead the reader. Third, I will use network graphing to explore the distribution of multilingual elements across millions of tweets, combining linguistic data and metadata to produce an overview that could not be represented otherwise.

Kretzschmar, William & Steven Coats
Fractal visualization of corpus data
https://urn.fi/URN:NBN:fi:varieng:series-22-6

The relationship between word frequency and rank order, when considering the lexical types of a given text and their frequencies, was first noted and described by George Zipf; it was later interpreted by Mandelbrot in terms of fractal dimensionality. In this paper, we discuss some properties of rank-frequency profiles and demonstrate the use of the ZipfExplorer tool, an online app for the visualization of shared lexis in two texts, to compare the lexical types in well-known novels. We demonstrate that the alpha parameter of a power law function as well as several other measures can be used to quantify the shared lexical diversity of two texts. In addition, visual examination of the A-curves of rank-frequency profiles can help to interpret similarities and differences between texts and corpora.
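
To make the notion of a rank-frequency profile concrete, the following Python sketch builds one from a toy text and estimates a power-law exponent by ordinary least squares on the log-log values. This is only one simple way to obtain an alpha value and is an assumption of this illustration; it is not claimed to be the estimation procedure used in the paper or in ZipfExplorer, and the function names and example sentence are invented.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Return type frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)

def estimate_alpha(frequencies):
    """Estimate alpha in frequency ~ C / rank**alpha.

    Uses the least-squares slope of log(frequency) on log(rank),
    with the sign flipped so that alpha is positive for a falling curve.
    Requires at least two ranks.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

# Toy text (illustrative only; real profiles need far more data).
tokens = "the cat saw the dog and the dog saw the bird".split()
freqs = rank_frequency(tokens)
print(freqs)
print(estimate_alpha(freqs))
```

Plotting the sorted frequencies against rank on linear axes would give the steeply falling curve described in the abstract, while the same values on log-log axes fall roughly on a line whose slope corresponds to the estimated exponent.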

University of Helsinki