This week I started a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen. But the course leader suggests that once we get past the first few weeks we can pick and choose to study the elements which are useful to us, so I’m hoping it will be manageable alongside the new teaching activities I’ve got coming up. In the meantime, I’ll share with you my thoughts as each week progresses.

I signed up to the course way back at the beginning of summer, so that I would get a grounding in CL ready to undertake my project on Fake News and Facts in English ballads. It immediately became apparent, however, that my little project is more Corpus Discourse Analysis than true CL – more on this another time!

The simplest definition of a corpus is a lot of words stored on a computer. But corpus linguistics is also a methodology for approaching language. The large amounts of data in a corpus can tell us about tendencies – what is normal or typical, and what is rare or exceptional – neither of which can be established by looking at single texts. Computers are also quicker and more accurate than humans at dealing with such large amounts of data.

When putting a corpus together or choosing one to use, the selection depends on your research questions; the corpus must be broadly representative of an appropriate type of language, and it must be in machine-readable form, such as a text file. A corpus can serve as a standard reference for what is typical, and it is often annotated with further linguistic information.

Next, we had a brief look at annotation (or tagging). A computer can’t tell what is a heading or where a new paragraph starts: if you want to search for something just within titles, the computer can’t tell the difference unless you tell it, and therefore can’t do what you want. Much of this information is already there in the control characters that make a document appear the way you want it to. A computer also can’t tell grammatical information, such as where a word begins and ends, or what part of speech each word is. These sorts of annotation allow you to tailor your searches – to look at what words follow a particular word, say, or to search just the headings – and they improve the quality of your results. In practice, this annotation is often itself done by computer.
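For anyone curious what automatic annotation looks like in practice, here is a minimal sketch using Python’s NLTK library, whose off-the-shelf tokeniser and tagger find word boundaries and label parts of speech (the sentence is invented, and the relevant NLTK data packages need downloading first):

```python
import nltk

# One-off downloads of the tokeniser and tagger models:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

text = "The prince rode out of Oxford at dawn."
tokens = nltk.word_tokenize(text)  # find the word boundaries
tagged = nltk.pos_tag(tokens)      # label each token with a part-of-speech tag
print(tagged)
# e.g. [('The', 'DT'), ('prince', 'NN'), ('rode', 'VBD'), ...]
```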

There are two main types of corpora:

  • Specialist – with a limited focus in terms of time, place, language etc., which produces a smaller amount of data
  • General – often containing a much larger number of words, and more broadly representative.

Other corpora include:

  • Multilingual – comparing different languages or different varieties of the same language
  • Parallel – comparing the same texts in translation
  • Learner – the language used by language learners
  • Historical or Diachronic – language used in the past
  • Monitor – which are continually being added to.

From here we moved on to look at some technical terms, including frequency data, which quickly shows how often words appear per million words in the corpus. Concordances show the context of the hits; again, this can be done quickly, so you can sort the contexts to reveal emerging patterns. Collocation is the tendency of certain words to co-occur, with meaning built up from those co-occurrences.
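To make the first and last of these terms concrete, here is a minimal sketch in Python of frequency per million words and a simple collocate count within a window – my own illustration, not anything from the course, and the corpus file name and search words are invented:

```python
import re
from collections import Counter

def per_million(tokens: list[str], word: str) -> float:
    # Relative frequency of `word`, normalised per million tokens.
    return Counter(tokens)[word] / len(tokens) * 1_000_000

# Hypothetical plain-text corpus file; crude lowercase tokenisation.
tokens = re.findall(r"[a-z]+", open("newsbooks.txt").read().lower())
print(f"'battell' occurs {per_million(tokens, 'battell'):.1f} times per million words")

# Collocates: words co-occurring within 4 tokens either side of the node word.
node, span = "prince", 4
collocates = Counter()
for i, tok in enumerate(tokens):
    if tok == node:
        collocates.update(tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span])
print(collocates.most_common(10))
```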

Then we thought a bit about the limitations of corpus data analysis. A corpus can’t tell us whether something is possible in language (i.e. it can’t tell us about appropriate usage). We can only deduce things from corpus data; the data aren’t facts in and of themselves, and although they give us evidence, they can’t give us explanations. Finally, corpora rarely present the images that accompanied a text, or the body language and behaviour of speakers, so they present language out of context.

Then the course moved on to having a go at using #LancsBox, which we had to download and open – bizarrely, this was by far the hardest bit of the course so far, because it has to be run from the Applications folder (my Applications folder is dusty and otherwise unused, hidden somewhere in the attic of my machine and only located using the search function in the file manager). #LancsBox comes with various ready-prepared corpora that you can download, so I decided to have a go with the 1641–1660 English newsbooks. As I was working on my laptop rather than my desktop, it wasn’t the quickest import, but once it was done we had a go at using the KWIC function to create a concordance. You search for a word, or ‘node’. Having found all the instances of the word ‘Rupert’, in my case, I could sort them by the context on the left or the right, or I could filter them by clicking on one side or the other and typing in a word (although if you want to filter by the word immediately preceding or following the node, you need to use the advanced filter function). But I had really jumped the gun, as the next activity was a walk-through using one of the corpora to search for some given words! Still, it was fun to play around with it.
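Out of curiosity, here is roughly what a KWIC search with sorting and filtering does under the hood – a toy sketch in Python, emphatically not how #LancsBox itself is implemented, and the corpus file name is invented:

```python
import re

def kwic(tokens, node, window=5):
    # Collect (left context, node, right context) for every hit.
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            hits.append((tokens[max(0, i - window):i], tok, tokens[i + 1:i + 1 + window]))
    return hits

tokens = re.findall(r"\w+", open("newsbooks.txt").read())  # hypothetical file
lines = kwic(tokens, "Rupert")

# Sort the hits by the right-hand context...
lines.sort(key=lambda h: [w.lower() for w in h[2]])
# ...or filter to hits with 'prince' somewhere in the left-hand context.
filtered = [h for h in lines if "prince" in (w.lower() for w in h[0])]
for left, node_tok, right in filtered[:10]:
    print(f"{' '.join(left):>40}  [{node_tok}]  {' '.join(right)}")
```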

There were several readings for the week – the first was interesting and talked, for example, about the distinction between types of corpora (monitor, balanced, snapshot). Most of the corpora in which I would be interested would, I think, be classified as opportunistic corpora, because something else comes into play when working with historical corpora – survival. So in some respects, historical corpora are self-selecting, because we can only create a corpus out of what survives. #LancsBox comes with the 1641–1660 newsbooks, but (I think – not quite my field, so I’m not sure) they are a relatively ‘good’ source because Thomason made the decision to collect everything that he could of the printed ephemeral material relating to the civil war. Without collectors like him, other material is much more dependent on survival rates. That isn’t to say that I don’t think CL is useful, just that there are extra (or maybe different) caveats about what it can tell us, so we need to be very aware of these when we interpret our data.

As I’ve never done any linguistics, the second chapter was much harder going and I didn’t really understand a lot of it!  Then things got even more complicated for me with a chapter on statistics for CL.

The final task before the optional activities was to think about the design aspects of our proposed work. Although my Fake News and Facts project uses concordances and collocations, I wanted to think about something on a rather bigger scale, so I imagined a larger corpus of ballad texts and news pamphlets that could be searched for any features that were particularly common, or common to both, in the period prior to the civil war. The question is whether the data would be skewed by the inclusion of all ballads rather than exclusively topical ones. The restriction to topical ballads (and news pamphlets) would in itself be subjective, whereas the inclusion of all ballads might show up other common features that we would not otherwise spot, so as far as I’m concerned the jury is still out on that question!

The corpus would be quite large, but as it is an opportunistic corpus based on only those texts which have survived, it would not be anything like as large as some that have been mentioned on the course. Annotation might be helpful, in terms of divisions between headings and body text, as this might highlight the particularly sensational language used to grab attention, or the news lexicon that signals topicality, novelty and truth claims.
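To give a flavour of what such annotation might look like, here is a tiny sketch: a ballad marked up with heading/body tags and searched so that only the headings are examined. Both the markup scheme and the text are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Invented markup: a heading/body division of a ballad text.
ballad = """<ballad>
  <heading>A True and Wonderfull Relation of a Monstrous Fish</heading>
  <body>Good Christian people, pray draw neare...</body>
</ballad>"""

root = ET.fromstring(ballad)
# Search the headings only, e.g. for truth claims used to grab attention.
for heading in root.iter("heading"):
    if "True" in heading.text:
        print(heading.text)
```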

Last week I wrote about the first of my two small research projects, so this week I want to introduce the second: Fake News and Facts in Topical Ballads. This will be a digital humanities project which will use corpus data analysis to look at the links between ballad and pamphlet news.

[Image credit: Thomas Charles Wageman, public domain]

Shakespeare’s ballad-seller Autolycus is famous for peddling tall tales to credulous commoners hungry for news of monstrous fish and miraculous births.[1] My project therefore aims to check the accuracy of the information in popular songs, challenging the assumption that ballads were full of fake news. It will show that, despite recent scholarship which has questioned our belief in the existence of the ‘news ballad’, the genre really did exist prior to the invention of regular news periodicals, and that by supplying information to its customers in an entertaining way, it helped to shape social responses to the news. By using state-of-the-art corpus data analysis of ballads and pamphlets, rather than viewing the ballad in isolation from – or in competition with – other news-forms, I hope to demonstrate that there was more than one way to tell the news, and that no method was intrinsically more important or accurate than another.

Scholarly interest in ballads has surged since the publication of Christopher Marsh’s Music and Society in Early Modern England.[2] There has been a recent special issue of Renaissance Studies on street singers in Renaissance Europe (33:1), for example, and a plethora of articles on English balladry alone, but the role of song in spreading news remains contentious. Angela McShane argues that ‘there was no such thing as a “news ballad”’ and that ballads, being songs, served a different purpose.[3] Nevertheless, I don’t believe that their entertainment value necessarily undermines their newsworthiness. I intend to carry out the first systematic study of the relationship between English ballad and pamphlet news prior to the development of a regular periodical press. This will enhance our understanding of early modern news networks by offering insights into the intermediality and interdependency of different cheap print genres.

The first step is to build a database of ballads identified from the Stationers’ Registers Online and British Broadside Ballads of the Sixteenth Century.[5] I will access topical ballad texts using digital archives such as the English Broadside Ballad Archive, Early English Books Online and Broadside Ballads Online.[6] Next, I will try to find news pamphlets relating to the same events. And this is where the corpus data analysis comes in: specialist corpus linguistics software such as AntConc will highlight any textual overlap between the ballad and pamphlet texts much more quickly and accurately than even the closest of close readings could. This will show whether ballads have any significant relationship with news pamphlets. If the software finds substantial similarities between the texts, I will attempt to explain how and why this might have occurred, for example by looking for evidence that the texts were officially commissioned.
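For the curious, the sort of overlap detection I have in mind can be sketched in a few lines of Python – this illustrates the general idea (shared word sequences), not AntConc’s actual algorithm, and the file names are invented:

```python
import re

def ngrams(text: str, n: int = 5) -> set:
    # All n-word sequences in a text, crudely tokenised and lowercased.
    tokens = re.findall(r"[a-z]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

ballad = open("ballad.txt").read()      # hypothetical transcriptions
pamphlet = open("pamphlet.txt").read()

shared = ngrams(ballad) & ngrams(pamphlet)
print(f"{len(shared)} five-word sequences appear in both texts")
for gram in sorted(shared)[:10]:
    print(" ".join(gram))
```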

But there is still no substitute for the human eye, and the software can’t do all the analysis. Only by carefully reading the texts will I be able to see whether the need for a narrative story arc in ballads helped to shape the way the news was presented in songs.

Now all I have to do is decide which project I want to make a start on first.


[1] William Shakespeare, The Winter’s Tale.

[2] Christopher Marsh, Music and Society in Early Modern England (Cambridge: CUP, 2010).

[3] Angela McShane, ‘The Gazet in Metre’ in Joop Koopmans (ed.), News and Politics in Early Modern Europe (Leuven: Peeters, 2005), p. 140.

[5] <https://stationersregister.online/> [accessed 15 April 2019]; Carole Rose Livingston, British Broadside Ballads of the Sixteenth Century (New York: Garland, 1991).

[6] <https://ebba.english.ucsb.edu/>; <https://eebo.chadwyck.com/home>; <http://ballads.bodleian.ox.ac.uk/>

[all accessed 15 April 2019]
