Big and Rich Data in English Corpus Linguistics: Methods and Explorations

Edited by Turo Hiltunen, Joe McVeigh & Tanja Säily

Abstracts

Winters, Jane
Tackling complexity in humanities big data: From parliamentary proceedings to the archived web
http://www.helsinki.fi/varieng/series/volumes/19/winters/

One of the key characteristics of big data for the humanities is its complexity, whether we are dealing with the text of digitised nineteenth-century newspapers or the vast quantities of born-digital data generated by social media platforms such as Twitter. It has been produced over different periods of time, using different and often undocumented methods; at best it may be only partially structured; and on occasion we may not even know precisely what it looks like. How can humanities researchers develop theoretical and methodological frameworks for dealing with such complex material? What tools and skills do we need to help us to work effectively with big data? How can we analyse data at scale, while retaining an understanding of the people and stories which are woven into its fabric? What, ultimately, can the humanities in general, and history in particular, bring to big data research? This article addresses some of these questions, focusing on two very different types of data: parliamentary proceedings in the UK, the Netherlands and Canada (from c.1800 to the present day); and the archive of UK web space from 1996 to 2013. Both offer fascinating insights into language, politics, culture and society, but both also present challenges, some of which we are only just beginning to identify, let alone to solve. It is vital that scholars work together to tackle these questions, or contemporary decisions about how we describe, publish and preserve big data may hamper the humanities researchers of the future.

Flanagan, Joseph
Reproducible research: Strategies, tools, and workflows
http://www.helsinki.fi/varieng/series/volumes/19/flanagan/

This paper presents various strategies designed to make analyses involving big, medium, or small data reproducible and shows how they can be implemented in RStudio, an IDE (integrated development environment) for the statistical programming language R. While these tools and practices have become increasingly common in certain fields and disciplines, they are not yet widely known – let alone practiced – within the humanities. I will conclude the paper with some thoughts about the obstacles preventing the strategies I outline from becoming standard practice within the humanities and will offer some suggestions for how we might overcome those obstacles.

Brunner, Marie-Louise, Stefan Diemer & Selina Schmidt
“... okay so good luck with that ((laughing))?” – Managing rich data in a corpus of Skype conversations
http://www.helsinki.fi/varieng/series/volumes/19/brunner_diemer_schmidt/

Spoken computer-mediated communication (CMC) presents a complex challenge for corpus creation. While big-data approaches work well with written data, rich conversation corpora pose major problems at the recording, transcribing, annotation and querying stages (Diemer, Brunner & Schmidt 2016). Many features of spoken and especially audio-visual corpora are not covered by current transcription standards (Nelson 2008). They may be a matter of debate (e.g. gestures and gaze, Adolphs & Carter 2013) and raise organizational issues. This article presents examples of rich data from CASE, the Corpus of Academic Spoken English (forthcoming), compiled at Saarland University, Germany. CASE consists of Skype conversations between speakers of English as a Lingua Franca (ELF). CASE data allows research on a wide range of linguistic features of informal spoken academic CMC discourse. Its multimodal nature illustrates both benefits and challenges of rich data, particularly during transcription and annotation. The article presents the organisation and transcription scheme developed for CASE. CASE is designed with multiple layers, including a discourse-oriented basic layer, as well as XML, orthographic, and part-of-speech-tagged layers. The paper discusses the challenges and limitations of transcription and annotation in view of the audiovisual corpus data, especially with regard to paralinguistic (e.g. laughter) and non-verbal (e.g. gestures) discourse features. We illustrate the advantages of the proposed organisation and transcription scheme with regard to the multimodal data set in the context of several quantitative and qualitative case studies.

Coats, Steven
Gender and lexical type frequencies in Finland Twitter English
http://www.helsinki.fi/varieng/series/volumes/19/coats/

English is playing an expanding role as a language of informal online communication in many communities where it has hitherto not been widely used as a language of local communication. This is particularly evident on global social media platforms such as Twitter. Some research has found small but significant differences by gender for the use of grammatical and lexical features in spoken and written language, including online varieties such as chat, instant messaging, and Twitter (Baron 2004, Squires 2012, Bamman, Eisenstein & Schnoebelen 2014). In this study, the frequency distributions of selected standard and non-standard lexical types and their correlation with gender are considered in a corpus of English-language Twitter messages originating from Finland.

In a first step, a corpus of geo-located English-language Twitter user messages from Finland was created by accessing the Twitter Streaming API and using an automated language detection tool to remove non-English user messages. After disambiguating author gender by automated methods, the frequencies and distributional profiles of selected lexical types were examined and compared with those derived from a corpus of English-language Twitter messages worldwide subject to the same processing procedures.

The analysis supports some previous findings pertaining to gendered language use, but also suggests that the manifestation of gender in lexical type frequencies in Finland Twitter English reflects sociolinguistic considerations, particularly for those lexical features most strongly associated with the discourse of the Twitter platform itself. The analysis sheds light on the dynamics of a geographically specified online English variety and considers how sociolinguistic factors interact with technological considerations to contribute to the differentiation of online Englishes.

McVeigh, Joe
Congratulations, You WON!!! Exploring trends in Big Data marketing communication
http://www.helsinki.fi/varieng/series/volumes/19/mcveigh/

Big Data collection and analysis are expanding rapidly in every field of research. Now that many forms of communication have gone electronic, the opportunity to research them is greater than it has ever been. With this data there is also the possibility to correlate linguistic data with metadata to uncover interesting patterns. Big Data research in linguistics is therefore not limited to word counts, but depends on the discipline and goals of the study. Linguistic analyses of corpora can have benefits outside the field of linguistics, such as in marketing, where there is a substantial economic value placed on language. Linguistic descriptions of marketing data therefore have a commercial appeal since they can be applied directly to the creation of future texts. This paper examines a corpus of 2,021 email marketing subject lines (9,881 tokens), which were together sent over 84 million times. The subject lines are coupled in the analysis with each email’s average open rate, a standard success metric widely used by email marketers in evaluating subject lines. The analysis shows how email marketing subject lines are similar to other types of CMC and other types of marketing. The subject lines exhibit interesting linguistic features, especially in the use of non-standard variations and exclamation points. These variations, as well as the parts of speech of the subject lines, are investigated to show which of them correlate with the success of each subject line. This research also addresses the gap in CMC research on email marketing and email subject lines, both of which have been almost entirely ignored in linguistic research.
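As a rough illustration of what relating a surface feature to open rates could look like, the following Python sketch counts exclamation points per subject line and computes a Pearson correlation with the open rate. It is not the analysis used in the paper: the function names, the choice of Pearson's r, and the input format (pairs of subject line and open rate) are all assumptions made for the example.

```python
from math import sqrt


def exclamation_count(subject):
    """Number of exclamation points in one subject line."""
    return subject.count("!")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def feature_open_rate_correlation(records, feature=exclamation_count):
    """records: iterable of (subject_line, open_rate) pairs (hypothetical format)."""
    records = list(records)
    feature_values = [feature(subject) for subject, _ in records]
    open_rates = [rate for _, rate in records]
    return pearson(feature_values, open_rates)
```

Any other countable feature (for example, the number of fully capitalised words) can be passed in through the `feature` argument. A correlation of this kind describes association only and says nothing about why a subject line succeeds.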

Daugs, Robert
On the development of modals and semi-modals in American English in the 19th and 20th centuries
http://www.helsinki.fi/varieng/series/volumes/19/daugs/

The purpose of this study is to shed new light on the diachrony of modal expressions in AmE and relativize earlier results concerning particular patterns of modal development that have long since been accepted among linguists. First, I will provide data from COHA on a relatively uncharted research field, i.e. modal/semi-modal variation and change in 19th-century AmE. Secondly, while my data confirm a general decline in the frequency of the modal verbs in AmE over the 20th century, a closer look at their long-term individual developments suggests that particularly the subdivision of the modals into frequent and infrequent ones and the ‘bottom weighting’ of the frequency loss observed in Leech (2003, 2011, 2013) and Leech et al. (2009) need revision. Thirdly, the opposing frequency shifts of will and be going to will receive some attention, as their respective developments point to a possible overall change in referring to future time in English.

Silvennoinen, Olli O.
Not only apples but also oranges: Contrastive negation and register
http://www.helsinki.fi/varieng/series/volumes/19/silvennoinen/

This paper investigates the register variation of contrastive negation in English, a family of constructions that has so far not been explored in corpus-linguistic studies. Contrastive negation refers to expressions in which one element is negated and another one is presented as its alternative (e.g., not once but twice; I come to bury Caesar, not to praise him). The study combines the methods of corpus linguistics and interactional linguistics to investigate expressions that are highly resistant to automatised queries, comparing conversation and newspaper discourse on the one hand (“apples and oranges”), and various sub-registers of newspaper discourse on the other (“apples and apples”). The results show that the expression of contrastive negation is highly differentiated by register: conversation is dominated by asyndetic clause combinations while in writing, various constructions are attested more evenly. Sub-registers of writing also display variation: argumentative texts have a particularly high number of negative-contrastive constructions while in sports reports their prevalence is much lower. The study shows that both apples-and-apples and apples-and-oranges comparisons shed light on construction choice: data needs to be not only big enough but also rich and thick enough for this to be possible in the analysis of highly polysemous items.

Lijffijt, Jefrey & Terttu Nevalainen
A simple model for recognizing core genres in the BNC
http://www.helsinki.fi/varieng/series/volumes/19/lijffijt_nevalainen/

Human communicative practices are organized in terms of genres, and people are highly skilled at recognizing genre differences. In text corpora, genres are typically defined on the basis of text-external features, such as medium, function and format. We show that the core genres of face-to-face conversation, prose fiction, broadsheet newspapers, and academic prose can also be reliably recognized based on a small set of text-internal (linguistic) surface features. Using a 40-million-word subset of the British National Corpus, we study select text-internal surface features that capture language complexity. It is shown that externally-defined genres differ substantially from each other, and that, using pairs of surface features, such as counts of nouns and pronouns, or of average word lengths and type/token ratios, it is possible to recognize those highly productive genres with a high degree (> 90%) of accuracy. Furthermore, our model can be used to get a quick overview of the structure of a corpus, which is very useful when exploring big and diverse corpora. It is also possible to detect errors in the genre annotation of the BNC and develop software for detecting genre differences. By applying it to the Lancaster–Oslo/Bergen Corpus of British English, we also demonstrate that the model generalizes well across corpora of different sizes. Not unexpectedly, native speakers are still found to outperform the model, especially when very short text samples are analysed.
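To make one of the feature pairs mentioned above concrete, here is a minimal Python sketch that computes average word length and type/token ratio for a text sample. The tokenizer is a deliberately crude assumption (lowercased alphabetic strings), and the function name is invented for illustration; the paper's own preprocessing and classification model are not reproduced here.

```python
import re


def surface_features(text):
    """Return (average word length, type/token ratio) for a text sample."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes, lowercased.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    avg_word_length = sum(len(token) for token in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return avg_word_length, type_token_ratio
```

Because the type/token ratio falls as a sample grows longer, values are only comparable across samples of the same length in tokens.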

Schneider, Gerold, Mennatallah El-Assady & Hans Martin Lehmann
Tools and methods for processing and visualizing large corpora
http://www.helsinki.fi/varieng/series/volumes/19/schneider_el-assady_lehmann/

We present several approaches and methods which we develop or use to create workflows from data to evidence. First, we look for specific items in large corpora, explore the overuse of particular items, and use off-the-shelf visualization such as GoogleViz. Second, we present the advanced visualization tools and pipelines which the Visualization Group at the University of Konstanz is developing. After an overview, we apply statistical visualizations, Lexical Episode Plots and Interactive Hierarchical Modeling to the vast historical linguistics data offered by the Corpus of Historical American English (COHA), which ranges from 1800 to 2000. We investigate on the one hand the increase of noun compounds and visually illustrate correlations in the data over time. On the other hand we compute and visualize trends and topics in society from 1800 to 2000. We apply an incremental topic modeling algorithm to the extracted compound nouns to detect thematic changes throughout the investigated time period of 200 years. In this paper, we utilize various tailored analysis and visualization approaches to gain insight into the data from different perspectives.

Säily, Tanja & Jukka Suomela
types2: Exploring word-frequency differences in corpora
http://www.helsinki.fi/varieng/series/volumes/19/saily_suomela/

We demonstrate the use of the types2 tool to explore, visualize, and assess the significance of variation in word frequencies. Based on accumulation curves and the statistical technique of permutation testing, this freely available tool is especially well suited to the study of types and hapax legomena, which are common measures of morphological productivity and lexical diversity. We have developed a new version of the tool that provides improved linking between the visualizations, metadata, and corpus texts, which facilitates the analysis of rich data.

The new version of our tool is demonstrated using two data sets extracted from the Corpora of Early English Correspondence (CEEC) and the British National Corpus (BNC), both of which are rich in sociolinguistic metadata. We show how to use our software to analyse such data sets, and how the new version of our tool can turn the results into interactive web pages with visualizations that are linked to the underlying data and metadata. Our paper illustrates how the linked data facilitates exploring and interpreting the results.
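The general idea of permutation testing on type counts can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not the types2 implementation: texts are assumed to be pre-tokenized lists of words, whole texts are shuffled as units, and the function names are hypothetical. The sketch asks how often a random selection of texts, at least as large in tokens as the group of interest, contains no more types than that group.

```python
import random


def count_types(texts):
    """Number of distinct word types across a list of tokenized texts."""
    return len({token for text in texts for token in text})


def permutation_p_low(texts, in_group, n_perm=1000, seed=0):
    """Return (observed type count, estimated proportion of random samples
    with no more types than the group).

    texts: list of token lists; in_group: parallel list of booleans.
    """
    group = [text for text, flag in zip(texts, in_group) if flag]
    target_tokens = sum(len(text) for text in group)
    observed = count_types(group)
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(n_perm):
        order = texts[:]
        rng.shuffle(order)
        # Accumulate whole texts until the sample reaches the group's token count.
        sample, size = [], 0
        for text in order:
            if size >= target_tokens:
                break
            sample.append(text)
            size += len(text)
        if count_types(sample) <= observed:
            hits += 1
    return observed, hits / n_perm
```

A small proportion suggests that the group uses fewer types than would be expected from its size alone. Shuffling whole texts rather than individual tokens keeps words that cluster within a text together, which avoids treating every token as an independent observation.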

Volume 19
