September 2019

I’m still working on a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen, given my new teaching commitments and other roles to juggle. In the meantime, I’ll share with you my thoughts as each week progresses.

This week we looked in more detail at collocation. Collocation is defined by the course as ‘the systematic co-occurrence of words in use, words which tend to occur frequently with one another and, as a consequence, start to determine or influence one another’s meanings’. When looking for collocates, a span of plus or minus 5 words is useful, and to be considered a collocate a word has to occur within this span several times, with a baseline of 10 hits being suggested. Of course, this relates to enormous corpora, and again highlights for me the particular difficulties of using a smaller, opportunistic corpus – it can overstate the importance of associations.
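
To make the span and threshold idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not how #LancsBox works internally: the function name `collocates`, the toy sentence and the lowered threshold are all invented for the example.

```python
from collections import Counter

def collocates(tokens, node, span=5, min_hits=10):
    """Count the words found within `span` tokens either side of `node`,
    keeping only those that reach the `min_hits` threshold."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words to the left and right of this occurrence of the node
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return {word: hits for word, hits in counts.items() if hits >= min_hits}

# Invented toy text; a real corpus would use the suggested baseline of 10 hits
tokens = "the king rode north and the king rode south".split()
print(collocates(tokens, "king", span=2, min_hits=2))
```

Even in this tiny example the function word ‘the’ passes the threshold, which hints at why raw frequency needs further handling.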

But frequency alone isn’t necessarily a good guide to the importance of a collocation.  Although frequency lists can produce important data, they are usually topped by functional words such as ‘the’ and ‘of’, so they often need manipulation to produce useful results.  One way of carrying out this manipulation is to look at mutual information.  This ranks the closeness of the association between two words, and thereby often removes the functional words from the results because although they appear in close proximity to one word, they also appear in close proximity to many others.
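
As a rough illustration of why mutual information pushes functional words down the list, the sketch below compares how often two words actually co-occur with how often they would be expected to if they were spread through the corpus independently. This is a simplified version of the measure with no correction for span size, and every frequency in it is made up for the example.

```python
import math

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Simplified MI score: log2 of observed over expected co-occurrence.
    Assumes no adjustment for the size of the collocation span."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)

# Invented figures for a corpus of one million words and a node seen 100 times
mi_the = mutual_information(pair_freq=30, node_freq=100,
                            collocate_freq=60_000, corpus_size=1_000_000)
mi_rare = mutual_information(pair_freq=20, node_freq=100,
                             collocate_freq=50, corpus_size=1_000_000)
print(round(mi_the, 2), round(mi_rare, 2))
```

With these invented numbers ‘the’ co-occurs with the node more often in absolute terms, yet scores far lower, because it turns up next to almost everything.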

Next we looked at colligation, where words are associated with a particular grammatical class rather than particular meanings.   This was compared to semantic preference, where a word form is closely associated with a set of words which have some kind of semantic similarity.

Then we had to think about keywords: these are created by comparing two frequency lists to one another. Mathematical tools such as chi-square and log-likelihood tests are able to identify keywords which are important based on their relative frequency in two corpora. They identify the words which are unusually frequent in one of the corpora, suggesting that they are the words which characterise it. But you still need to set thresholds for when something becomes a keyword, perhaps only taking the top hits, or ensuring that they occur a certain number of times or are spread through a certain number of texts in the corpus. The course argued that one of the most useful aspects of this sort of study was that you get a feel for the ‘aboutness’ of a text – the key nouns and verbs which tell us what is being talked about. It’s also important that these tests are replicable.
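
For anyone who wants to see the arithmetic, here is a hedged sketch of the log-likelihood calculation in the form commonly used for keyword comparison between two corpora. The function name and the frequencies are my own inventions, and the cut-off of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) is just one conventional threshold among several.

```python
import math

def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood score for one word's frequency in corpus A versus corpus B."""
    total = freq_a + freq_b
    # Expected frequencies if the word were equally common in both corpora
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    if freq_a:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score

# Invented counts: a word seen 50 times in one 1,000-word corpus and 10 in another
ll = log_likelihood(50, 10, 1000, 1000)
print(ll > 3.84)  # would this word pass a p < 0.05 threshold?
```

A word with the same relative frequency in both corpora scores zero, so only the genuinely lopsided words rise to the top.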

Part 4 explained how CL could be used to plot the changing importance of words over time by looking at which words have decreasing or increasing frequency. (The words whose frequency remains relatively constant over time are known as lock words.) Increasing or decreasing keywords can be checked against concordances to see whether there are societal changes which might explain them. We can start to identify such societal changes by looking at the words used around the keywords. These might then be seen not just as a statistical keyword but a socially salient keyword which identifies a dominant discourse in society. This interested me a lot, because the idea of socially salient keywords would be a relatively easy but nevertheless really interesting subject to investigate with the early modern ballad corpus…! There are some words which experience suggests appear a lot in ballads at certain points in time, but it would be interesting to see if these scientific techniques might be able to identify trends that the human eye cannot.

Part 5 looked at why CL needs to be integrated with other methods – corpus techniques can help bring quantitative and qualitative approaches together, but they are only one tool.

After the quiz, we went over some of these ideas again in more practical ways, with the instructions on how to look at connectivity in the #LancsBox software. Connectivity starts with a node word chosen by the user, where we can look at the first order collocates, and then look at their collocates and how they cross-connect.  This eventually gives us a collocation network – a graph which helps to show the associations and cross-associations between words. This can be drawn up with the GraphColl function, which I then played about with for a bit.  At the moment I’m quite slow, but it’s certainly interesting and, like most things, practice will, I hope, make perfect.
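
The idea of a collocation network can be sketched in a few lines of Python. This is only a toy under my own assumptions: it uses raw co-occurrence counts rather than the statistical measures a tool like GraphColl offers, and the function names, the sample sentence and the thresholds are all invented.

```python
from collections import Counter

def window_collocates(tokens, node, span, min_hits):
    """Words occurring at least `min_hits` times within `span` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span])
    return {word for word, hits in counts.items() if hits >= min_hits}

def collocation_network(tokens, node, span=5, min_hits=2, depth=2):
    """Map each word to its collocates, starting from `node` and
    following the links outwards for `depth` steps."""
    graph = {}
    frontier = {node}
    for _ in range(depth):
        next_frontier = set()
        for word in frontier:
            if word in graph:
                continue  # already expanded
            graph[word] = window_collocates(tokens, word, span, min_hits)
            next_frontier |= graph[word]
        frontier = next_frontier
    return graph

tokens = "the king rode north and the king rode south".split()
network = collocation_network(tokens, "king", span=2, min_hits=2)
for word, linked in network.items():
    print(word, "->", sorted(linked))
```

The cross-connections show up wherever a first-order collocate lists another first-order collocate among its own.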

This week I started a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen. But the course leader suggests that once we get past the first few weeks we can pick and choose to study the elements which are useful to us, so I’m hoping it will be manageable alongside the new teaching activities that I’ve got ahead. In the meantime, I’ll share with you my thoughts as each week progresses.

I signed up to the course way back at the beginning of summer, so that I would get a grounding in CL ready to undertake my project on Fake News and Facts in English ballads. It has become immediately apparent, however, that my little project is more Corpus Discourse Analysis than true CL – more on this another time!

The simplest definition of a corpus is a lot of words stored on a computer. But corpus linguistics is also a methodology for approaching language. Large amounts of data in a corpus can tell us about tendencies and what is normal or typical, as well as about rare or exceptional cases – you can’t tell either of these things from looking at single texts. Computers are quicker and more accurate than humans in dealing with such large amounts of data.

When putting a corpus together or choosing what to use, what you choose to look at depends on your research questions; it must be broadly representative of an appropriate type of language; and it must be in machine readable form such as a text file.  It might be considered a standard reference of what’s typical and is often annotated with further linguistic information.

Next, we had a brief look at annotation (or tagging). The computer can’t tell what is a heading or where new paragraphs start. You might want to search for something just within titles, but a computer can’t tell the difference unless you tell it, and therefore can’t do what you want it to. A lot of this information is there in the control characters that make a document appear the way you want it to. Nor can the computer tell grammatical data such as where a word might begin or end, or what part of speech each word is. These sorts of annotation allow you to tailor your request, to look at what words follow a particular word, or just to search the headings. They improve the quality of your searches. Actually, this annotation is often done by computer.

There are two main types of corpora:

  • Specialist – with a limited focus of time, place, language etc which creates a smaller amount of data
  • General – often with a much larger number of words and more representative.

Other corpora include:

  • Multilingual – comparing different languages or different varieties of the same language
  • Parallel – comparing the same texts in translation
  • Learner – the language used by language learners
  • Historical or Diachronic – language used in the past
  • Monitor – which are continually being added to.

From here we moved on to look at some technical terms, including frequency data, which quickly shows how often words appear per million words in the corpus. Concordances show the context of the hits, and again, they can be produced quickly, so you can sort the context according to emerging patterns. Collocation is the tendency for words to co-occur, with meaning being built from those co-occurrences.
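
Frequency per million words is simple enough to sketch in Python. The function name and the four-word ‘corpus’ below are invented purely to show the scaling; real corpora would of course be far larger.

```python
from collections import Counter

def per_million(tokens):
    """Scale each word's raw count to a rate per million words."""
    total = len(tokens)
    return {word: count * 1_000_000 / total
            for word, count in Counter(tokens).items()}

# Invented toy corpus of four words
rates = per_million(["ballad", "news", "ballad", "king"])
print(rates["ballad"])
```

Normalising like this is what lets frequencies from corpora of different sizes be compared.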

Then we thought a bit about the limitations of corpus data analysis. It can’t tell us whether something is possible in language (i.e., it can’t tell us about appropriate usage). We can only deduce things from corpora: they aren’t facts in and of themselves, and although a corpus gives us evidence it can’t give us explanations. Finally, corpora rarely present the related images alongside the text, or the body language, behaviours etc of speakers, so they present language out of context.

Then the course moved on to having a go at using #LancsBox, which we had to download and open – bizarrely, this was by far the hardest bit of the course so far, because it has to be run from the Applications folder (my Applications folder is dusty and otherwise unused, hidden somewhere in the attic of my machine and only located using the search function on the file manager). #LancsBox comes with various ready-prepared corpora that you can download, so I decided to have a go with the 1641-1660 English newsbooks. As I wasn’t working on my desktop, but my laptop, it wasn’t the quickest import, but once it was done we had a go at using the KWIC function to create a concordance. You search for a word or ‘node’. Having found all the instances of the word ‘Rupert’, in my case, I could sort them by the context on the left or the right, or I could filter them by clicking on one side or the other and typing in a word (although if you want to filter by the word immediately preceding or following the node, you need to use the advanced filter function). But I had really jumped the gun, as the next activity was a walk-through in using one of the corpora to search for some given words! Still it was fun to play around with it.
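
For a sense of what a KWIC concordance does behind the scenes, here is a small Python sketch. It is not the #LancsBox implementation, just an illustration: the function names are hypothetical and the sentence about Rupert is invented rather than taken from the newsbooks.

```python
def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def sort_by_right(lines):
    """Sort concordance lines alphabetically by their right-hand context."""
    return sorted(lines, key=lambda line: line[2].lower())

# Invented sample sentence
tokens = "Prince Rupert marched north while Rupert and his horse rested".split()
for left, node, right in sort_by_right(kwic(tokens, "Rupert", width=2)):
    print(f"{left:>12} | {node} | {right}")
```

Sorting on the left or right context is what brings recurring patterns around the node word together on the screen.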

There were several readings for the week – the first was interesting and talked, for example, about the distinction between types of corpora (monitor, balanced, snapshot). Most of the corpora in which I would be interested would, I think, be classified as opportunistic corpora, because something else comes into play when working with historical corpora – survival. So in some respects, historical corpora are self-selecting, because we can only create a corpus out of what survives. #LancsBox comes with the 1640-60 newspapers, but (I think – not quite my field so I’m not sure) they are a relatively ‘good’ source because Thomason made the decision to collect everything that he could of the printed ephemeral material relating to the civil war. Without collectors like him, other material is much more dependent on survival rates. Which isn’t to say that I don’t think CL is useful, just that there are extra (or maybe different) caveats about what it can tell us, so we need to be very aware of these when we interpret our data.

As I’ve never done any linguistics, the second chapter was much harder going and I didn’t really understand a lot of it!  Then things got even more complicated for me with a chapter on statistics for CL.

The final task before the optional activities was to think about the design aspects of our proposed work.  Although my Fake News and Facts project uses concordances and collation, I wanted to think about something on a rather bigger scale, so I imagined a larger corpus of ballad texts and news pamphlets to search for any features that were particularly common or common to both in the period prior to the civil war.  The question is whether the data would be skewed by the inclusion of all ballads rather than exclusively topical ones.   The restriction to topical ballads (and news pamphlets) would in itself be subjective, whereas the inclusion of all ballads might show up other common features that we would not otherwise spot, so as far as I’m concerned the jury is still out on that question! 

The corpus would be quite large, but as it is an opportunistic corpus based on only those texts which have survived, it would not be anything like as large as some that have been mentioned on the course. Annotation might be helpful, in terms of divisions between headings and body text, as this might highlight particularly sensational language which is used to grab attention, or news lexicon which highlights topicality, novelty and truth claims. 

Last week I wrote about the first of my two small research projects, so this week I want to introduce the second: Fake News and Facts in Topical Ballads. This will be a digital humanities project which will use corpus data analysis to look at the links between ballad and pamphlet news.

Thomas Charles Wageman [Public domain]

Shakespeare’s ballad-seller Autolycus is famous for peddling tall tales to credulous commoners hungry for news of monstrous fish and miraculous births.[1]  So my project aims to check the accuracy of information in popular songs to challenge the assumption that ballads were full of fake news.  It will show that, despite recent scholarship which has challenged our belief in the existence of the ‘news ballad’, the genre really did exist prior to the invention of regular news periodicals.  By supplying information to its customers in an entertaining way, it helped to shape social responses to the news.  By using state-of-the-art corpus data analysis of ballads and pamphlets rather than viewing the ballad in isolation from – or in competition with – other news-forms, I hope to demonstrate that there was more than one way to tell the news, and one method was not intrinsically more important or accurate than another.  

Scholarly interest in ballads has surged since the publication of Christopher Marsh’s Music and Society in Early Modern England. There has been a recent special issue of Renaissance Studies on street singers in Renaissance Europe (33:1), for example, and a plethora of articles on English balladry alone, but the role of song in spreading news remains contentious.[2]  Angela McShane argues that ‘there was no such thing as a “news ballad”’ and that ballads, being songs, served a different purpose.[3]  Nevertheless, I don’t believe that their entertainment value need necessarily undermine their newsworthiness.  I intend to carry out the first systematic study of the relationship between English ballad and pamphlet news prior to the development of a regular periodical press. This will enhance our understanding of early modern news networks by offering insights into the intermediality and interdependency of different cheap print genres.

The first step is a database of ballads identified from the Stationers’ Registers Online and British Broadside Ballads of the Sixteenth Century.[5]  I will access topical ballad texts using digital archives such as the English Broadside Ballad Archive, Early English Books Online and Broadside Ballads Online.[6]  Next I will try to find news pamphlets relating to the same events. And this is where the corpus data analysis comes in: specialist corpus linguistics software such as AntConc will highlight any textual overlap between the ballad and pamphlet texts much more quickly and accurately than even the closest of close readings could. This will demonstrate whether ballads have any significant relationship with news pamphlets.  If the software finds substantial similarities between the texts, I will attempt to explain how and why this might have occurred, for example, by looking for evidence that the texts were officially commissioned.  
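
One simple way of picturing ‘textual overlap’ is to look for runs of words that two texts share. The Python sketch below does this with three-word sequences. It is my own illustration and makes no claim about how AntConc works; the function names and both sample sentences are invented, and early modern spelling variation would need normalising before anything like this could be trusted on real texts.

```python
import re

def ngrams(text, n=3):
    """Set of all n-word sequences in a text, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(text_a, text_b, n=3):
    """Word sequences of length n that appear in both texts."""
    return ngrams(text_a, n) & ngrams(text_b, n)

# Invented examples standing in for a ballad and a pamphlet
ballad = "A monstrous fish was taken at the shore"
pamphlet = "Report of how a monstrous fish was taken near Chester"
for phrase in sorted(shared_ngrams(ballad, pamphlet)):
    print(" ".join(phrase))
```

Shared sequences like these would only be a starting point for the close reading, not a substitute for it.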

But there is still no substitute for the human eye and the software can’t do all the analysis. Only by carefully reading the texts will I be able to see whether the need for a narrative story arc in ballads helped to shape the way the news was presented in songs.

Now all I have to do is decide which project I want to make a start on first.

[1] William Shakespeare, The Winter’s Tale.

[2] Christopher Marsh, Music and Society in Early Modern England (Cambridge: CUP, 2010).

[3] Angela McShane, ‘The Gazet in Metre’ in Joop Koopmans (ed.), News and Politics in Early Modern Europe (Leuven: Peeters, 2005), p.140.

[5] Stationers’ Registers Online [accessed 15 April 2019]; Carole Rose Livingston, British Broadside Ballads of the Sixteenth Century (New York: Garland, 1991).

[6] English Broadside Ballad Archive; Early English Books Online; Broadside Ballads Online [all accessed 15 April 2019].


The new academic year is approaching fast and things are changing. While I wait to hear what work I’ve got and where, I’ve been getting on with my own research. Several of my projects are almost at an end, so I need to work out which of my projects to dive into next. There are two biggies (the Pilgrimage of Grace book and the martyrs project) and two smaller ones. Realistically, I need to go for one of the smaller ones, both of which should produce a journal article.

The first is a project on the printed epitaphs which seem to have become fashionable from the 1560s onwards: Singing Epitaphs in Sixteenth-Century England

Memorialising the dead in a post-reformation age required imaginative solutions because purgatory and traditional Catholic practices such as masses for the dead were officially no more. For the first time, epitaphs produced in praise of prominent members of post-reformation English society were printed on a single side of paper and made to look as if they were songs. I suspect that by combining the enduring popularity of broadside ballads with the new fashion for singing metrical psalms, these epitaph ballads created a new way for Protestants to come to terms with death. The ballad trade was like a magpie, happy to steal melodies from anywhere and barely aware of differences between ‘high’ and ‘low’ culture. Psalm tunes would have been particularly fitting melodies for epitaph ballads because they were in vogue, they were devotional and because they gave further meaning to the text. My research will unite the histories of music, print culture, and doctrinal change, by examining the performance practice of epitaph ballads. Identifying tunes for these epitaphs will help bring them to life once again, showing that this crossover genre created a new way for Protestants to process grief.

I want to explore how epitaph ballads were voiced as part of the civic life of sixteenth-century England.  The project will investigate potential similarities between epitaph ballads and the accession day songs in praise of Elizabeth I which have already been studied by Katherine Butler.[1]  ‘Singing Epitaphs’ likewise combines history and musicology, but it also builds on research into the development of psalm singing.[2]  Moreover, ‘Singing Epitaphs’ will couple psalms with recent research on ballads which suggests that melody created meaning.[3] Identifying potential tunes would increase our knowledge of how these epitaphs were heard and understood, based on parallels between old and new texts.

I have already identified a group of printed broadside epitaphs from the Tudor period. The next step will be to compile a list of relevant references in contextual materials such as George Puttenham’s  Art of Poesy, Henry Machyn’s Diary and Holinshed’s Chronicles, in order to suggest when and how the epitaph ballads were performed. Then I will need to identify possible tunes for the epitaph ballads by identifying common metre, rhythm and stress patterns, as well as shared themes and textual similarities.  Further work will investigate whether individual psalm tunes were appropriate because the melody had a suitable emotional effect or was particularly fashionable at the time the epitaph was written.  

[1] Katherine Butler, ‘Creating Harmonious Subjects? Ballads, Psalms and Godly Songs for Queen Elizabeth I’s Accession Day’, Journal of the Royal Musical Association, 140:2 (2015), pp. 273-312.

[2] For example Beth Quitslund’s The Reformation in Rhyme (Aldershot: 2008) and Timothy Duguid’s Metrical Psalmody in Print and Practice (Farnham: 2014).

[3] Chris Marsh’s Music and Society in Early Modern England (Cambridge: 2010).  

A couple of years ago I was sitting in the British Library calling up various documents that might be ballad-related, when I came across John Balshaw’s Jig. What really captured my interest was the fact that Balshaw apparently wrote the piece in Brindle, Lancashire, in 1660. Now Brindle is a little place near Chorley, and about 13 miles away from me by car – fewer as the crow flies. Balshaw’s Jig was a short dramatic piece in verse, probably danced as it was sung to a series of popular tunes of the day, and I spent some time in the months following the find transcribing the text of the jig. But although I found it really interesting, life got in the way and the file was sidelined on my computer for some time, while I carried on with my teaching.

St James’s Parish Church, Brindle CC BY-SA 2.0

Then in January, I began teaching on the Civil War course at Lancaster, and during the summer, I started thinking about the jig again. I couldn’t really remember the plot, and I hadn’t noticed anything particularly significant about the lyrics, but I decided to dig it out and look at it with fresh eyes. And it turned out to be quite a sight.

I started by reading the script through again, looking at the plot in more detail and writing a synopsis as I went along. The jig involves 6 characters in a prologue and 4 scenes, and is based on a fairly standard ‘thwarted lovers’ plot: the girl and boy swear their eternal love, but the girl’s wicked uncle has taken her lands and property and wants to marry his daughter to the boy instead, until fate intervenes and the girl’s fortunes are restored. But there is a twist: the wicked uncle and his daughter are parliamentarians, while the girl and her lover (and his father) are royalists. Fate, in this particular case, takes the form of King Charles II, whose return to London up-ends the balance of power.

Once I’d written the synopsis, I started looking for the music. Each of the four scenes and the prologue is set to a different tune. A couple of the tunes had already been identified by the British Library cataloguer, but I’ve also suggested tunes which might fit the other scenes and provided scores for all of them.

I then went back to the beginning of the document and wrote some introductory paragraphs about jigs, John Balshaw and the manuscript he left behind. I tried hard to find any reference to the man himself, but I couldn’t. More to the point, I couldn’t work out why the BL catalogue claimed that he died in 1679 – there is nothing on the manuscript to suggest this, nor do the Lancashire parish clerk records contain any indication. I even went so far as to contact the BL archivists to ask if they knew where the information came from, but they didn’t. So Balshaw remains something of an enigma. In the next section, I provided some context on the civil war, interregnum and their effects on Lancashire. Finally, I expanded my synopsis to provide a commentary on the drama.

All in all, I’m quite pleased with the piece, and I’ve sent it off for consideration by a folk journal. What I’d really like to do, though, is to direct a performance in Brindle! It seems right to take it back where it was born.