I’m still working on a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen, given my new teaching commitments and other roles to juggle. In the meantime, I’ll share with you my thoughts as each week progresses.

The first activity this week was to look at the language used around a term such as ‘refugee’ in a newspaper article. I looked at an article from a recent edition of the Guardian on President Trump’s changes to the US refugee programme. It referred to the people affected as refugees and displaced people, and talked about them in comparison to ‘people seeking asylum at the southern border’ before pointing out that the asylum programme and the refugee programme were separate, which implies that the two categories are not the same (I know they aren’t legally – my point is that the text doesn’t make this explicit and you are left to work it out). It gives examples of groups with whom we might be expected to sympathise (those fleeing religious persecution, for example), along with a case study of, and quotes from, a resettled refugee who had contributed to American society. The language around this case study is positive: ‘love’, ‘safely’ and ‘allowed’. There is an interesting contrast, though, towards the end of the article, where the positive language of contribution is replaced by language such as ‘loss’, ‘sad’, ‘shut down’, ‘difficult’ and ‘complicated’ when describing what life is going to be like in the future.

The week then drew on an ESRC-funded research project on how British newspapers talk about refugees and asylum seekers. The focus was methodological, looking at how CL might contribute to critical discourse analysis.

Tony McEnery, the course tutor, described how the project team put together the corpus on which the study was based. First they assembled a pilot corpus of texts about refugees and asylum seekers and looked at which words were key in that corpus. This helped them to compose a query string which could be used to search huge corpora of newspaper articles for relevant material. Then they split the data into two corpora – one of tabloid and one of broadsheet journalism. They looked carefully at the number of articles and words in each of the two corpora, explaining the difference by pointing out that tabloid articles are usually shorter than broadsheet ones. Moving on to think about how CL contributes to critical discourse analysis, he introduced the idea of the topos (plural: topoi) – a broad theme in data, although according to the course website, ‘In critical discourse analysis it usually has a slightly different sense of “warrant for an argument” rather than theme’.
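To make the idea of the query string mentioned above concrete: at its simplest it is just a pattern of alternative search terms. Here is a toy sketch in Python of how such a search might work – the terms are my own illustrative guesses, not the project’s actual (much longer and carefully piloted) query:

```python
import re

# Hypothetical query string - the project's real query was far longer;
# these alternatives are my own illustrative guesses.
query = re.compile(
    r"\b(refugees?|asylum[- ]seekers?|illegal (?:immigrants?|aliens?))\b",
    re.IGNORECASE,
)

article = "Thousands of refugees and asylum seekers arrived last year."
print(query.findall(article))  # ['refugees', 'asylum seekers']
```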

The two corpora were tested to see which keywords were associated with the query terms. As well as looking at the overall picture of which keywords were most commonly associated with the query terms in the broadsheets and the tabloids, this could be done on a more focussed basis by looking at, for example, the words used to describe the arrival of refugees. So the keywords help to shape the topoi, but they also showed that the discourse was created mainly by the tabloids: almost all the keyword categories were dominated by the tabloids, except ‘plight’, where the language was shared by both but used more by the broadsheets.
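For anyone curious about how ‘keyness’ is usually measured: the standard approach compares a word’s frequency in one corpus against a reference corpus using a statistic such as Dunning’s log-likelihood. A minimal sketch, with invented counts:

```python
import math

def log_likelihood(freq_a: int, size_a: int, freq_b: int, size_b: int) -> float:
    """Dunning's log-likelihood keyness score for a word occurring
    freq_a times in a corpus of size_a tokens vs freq_b in size_b."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# Invented counts: 'flood' in a 10m-word tabloid corpus vs a 25m-word
# broadsheet corpus; the larger the score, the stronger the keyness.
print(round(log_likelihood(1500, 10_000_000, 900, 25_000_000), 1))
```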

The next video looked at whether the words which turned up most frequently in the articles were collocates. There were some collocates relating to the number of refugees and to the theme of plight, and these appeared across both tabloid and broadsheet newspapers. But the team then looked at words clustered to the right of ‘illegal’ – cases where it was acting as a modifying adjective – and found that the theme of illegality was more emblematic of the tabloids, with some especially strong tabloid clustering around ‘immigration and’ – the conjunction ‘and’ forcing discourses together. In comparing the two corpora, the use of particular words and clusters had to be normalised per million words because the broadsheet corpus was much larger.
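The normalisation itself is simple arithmetic – raw count divided by corpus size, scaled to a million words – but it matters, as this little sketch with invented figures shows:

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size

# Invented figures: the same raw count means very different things
# in corpora of different sizes.
print(per_million(240, 10_000_000))  # 24.0 per million (smaller corpus)
print(per_million(240, 25_000_000))  # 9.6 per million (larger corpus)
```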

Step 4 looked at a particular cluster, ‘pose as’ – both how it was used on its own and how it was used in proximity to refugees or asylum seekers. The tabloids used the phrase far more often than the broadsheets (normalised per million words), and especially so in relation to refugees and asylum seekers. The course also noted that in the tabloids the phrase was reported as fact rather than opinion and was closely associated with a negative stance, with no space given to an opposing side. Another interesting cluster was thrown up by ‘X pose as’ plus a statement of status such as ‘asylum seeker’, which was used to show faults in the asylum system.

The final video looked at direct and indirect presentation. Direct presentation is when a quality is attributed to something directly through modification. For example, in the phrase ‘illegal immigrants suffocated’, modifying ‘immigrants’ with the adjective ‘illegal’ attributes that quality to them directly. Indirect presentation works through general or oblique references which imply the same thing – words such as ‘trafficked’ or phrases such as ‘sneaking in’.

After the quiz, we moved on to the hands-on section of the course in #LancsBox. At least I would have done, had I been able to make it work, because here again I was hit with the same problem as in week 1: #LancsBox has to be run from a folder which will allow you to make changes… and that’s not as simple as it sounds. I’ve changed the machine on which I’m working, so I had to download the software again. After over an hour of searching around the internet, the course forums, my own notes and my computer folders, as well as several failed attempts to extract and run the files, I finally got it working on my Desktop.

This week I started a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen. But the course leader suggests that once we get past the first few weeks we can pick and choose the elements which are useful to us, so I’m hoping it will be manageable alongside the new teaching activities I have ahead of me. In the meantime, I’ll share with you my thoughts as each week progresses.

I signed up to the course way back at the beginning of summer, so that I would get a grounding in CL ready to undertake my project on Fake News and Facts in English ballads. It has become immediately apparent, however, that my little project is more Corpus Discourse Analysis than true CL – more on this another time!

The simplest definition of a corpus is a lot of words stored on a computer. But corpus linguistics is also a methodology for approaching language. The large amounts of data in a corpus can tell us about tendencies – what is normal or typical, and what is rare or exceptional – neither of which you can tell from looking at single texts. Computers are quicker and more accurate than humans at dealing with such large amounts of data.

When putting a corpus together or choosing which one to use, what you look at depends on your research questions; it must be broadly representative of an appropriate type of language; and it must be in machine-readable form, such as a text file. It might be considered a standard reference of what’s typical, and it is often annotated with further linguistic information.

Next, we had a brief look at annotation (or tagging). A computer can’t tell what is a heading or where a new paragraph starts: if you want to search for something just within titles, it can’t do what you want unless you tell it where the titles are. A lot of this information is there in the control characters that make a document appear the way you want it to, but a plain text file carries no grammatical information, such as where a word begins or ends, or what part of speech each word is. These sorts of annotation allow you to tailor your searches – to look at what words follow a particular word, or to search just the headings – and they improve the quality of your results. In practice, this annotation is often done automatically by computer.
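As an illustration of automatic annotation (not the course’s own tool), here is a minimal sketch using Python’s NLTK library to attach a part-of-speech tag to each word; note that the resource names can vary between NLTK versions:

```python
import nltk

# One-off model downloads (assumption: NLTK is installed; these resource
# names are correct for many NLTK versions but may differ in newer ones).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "Illegal immigrants posed as asylum seekers."
tokens = nltk.word_tokenize(sentence)  # split the text into word tokens
tagged = nltk.pos_tag(tokens)          # attach a part-of-speech tag to each token
print(tagged)
# e.g. [('Illegal', 'JJ'), ('immigrants', 'NNS'), ('posed', 'VBD'), ...]
```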

There are two main types of corpora:

  • Specialist – with a limited focus of time, place, language etc which creates a smaller amount of data
  • General – often with a much larger number of words and more representative.

Other corpora include:

  • Multilingual – comparing different languages or different varieties of the same language
  • Parallel – comparing the same texts in translation
  • Learner – the language used by language learners
  • Historical or Diachronic – language used in the past
  • Monitor – which are continually being added to.

From here we moved on to look at some technical terms, including frequency data, which quickly shows how often words appear per million words in the corpus. Concordances show the context of the hits, and again, this can be done quickly, so you can sort the context according to emerging patterns. Collocation is the tendency of certain words to co-occur, with meaning built up from those co-occurrences.
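To make the concordance idea concrete, here is a toy sketch of what a KWIC (‘keyword in context’) tool does under the hood – nothing like the real #LancsBox implementation, just the basic principle:

```python
import re

def kwic(text: str, node: str, width: int = 40) -> list[str]:
    """A minimal keyword-in-context concordancer: find every occurrence
    of `node` and show it with its left and right context."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(node) + r"\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} | {m.group(0)} | {right}")
    return lines

sample = "Prince Rupert rode out. Rupert was defeated at Marston Moor."
for line in kwic(sample, "Rupert", width=25):
    print(line)
```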

Then we thought a bit about the limitations of corpus data analysis. A corpus can’t tell us whether something is possible in language (i.e., the absence of a form doesn’t tell us about appropriate usage). Corpus findings aren’t facts in and of themselves – we can only deduce things from them – and although a corpus gives us evidence, it can’t give us explanations. Finally, corpora rarely present the related images alongside the text, or the body language, behaviours etc. of speakers, so they present language out of context.

Then the course moved on to having a go at using #LancsBox, which we had to download and open – bizarrely, this was by far the hardest bit of the course so far, because it has to be run from the Applications folder (my Applications folder is dusty and otherwise unused, hidden somewhere in the attic of my machine and only located using the search function on the file manager). #LancsBox comes with various ready-prepared corpora that you can download, so I decided to have a go with the 1641–1660 English newsbooks. As I wasn’t working on my desktop but my laptop, it wasn’t the quickest import, but once it was done we had a go at using the KWIC function to create a concordance. You search for a word or ‘node’. Having found all the instances of the word ‘Rupert’, in my case, I could sort them by the context on the left or the right, or I could filter them by clicking on one side or the other and typing in a word (although if you want to filter by the word immediately preceding or following the node, you need to use the advanced filter function). But I had really jumped the gun, as the next activity was a walk-through of using one of the corpora to search for some given words! Still, it was fun to play around with it.

There were several readings for the week – the first was interesting and talked, for example, about the distinction between types of corpora (monitor, balanced, snapshot). Most of the corpora in which I would be interested would, I think, be classified as opportunistic corpora, because something else comes into play when working with historical corpora – survival. So in some respects, historical corpora are self-selecting, because we can only create a corpus out of what survives. #LancsBox comes with the 1640-60 newspapers, but (I think – not quite my field so I’m not sure) they are a relatively ‘good’ source because Thomason made the decision to collect everything that he could of the printed ephemeral material relating to the civil war. Without collectors like him, other material is much more dependent on survival rates. Which isn’t to say that I don’t think CL is useful, just that there are extra (or maybe different) caveats about what it can tell us, so we need to be very aware of these when we interpret our data.

As I’ve never done any linguistics, the second chapter was much harder going and I didn’t really understand a lot of it!  Then things got even more complicated for me with a chapter on statistics for CL.

The final task before the optional activities was to think about the design aspects of our proposed work. Although my Fake News and Facts project uses concordances and collocation, I wanted to think about something on a rather bigger scale, so I imagined a larger corpus of ballad texts and news pamphlets that could be searched for any features that were particularly common in each, or common to both, in the period prior to the civil war. The question is whether the data would be skewed by the inclusion of all ballads rather than exclusively topical ones. The restriction to topical ballads (and news pamphlets) would in itself be subjective, whereas the inclusion of all ballads might show up other common features that we would not otherwise spot, so as far as I’m concerned the jury is still out on that question!

The corpus would be quite large, but as it is an opportunistic corpus based on only those texts which have survived, it would not be anything like as large as some that have been mentioned on the course. Annotation might be helpful, in terms of divisions between headings and body text, as this might highlight particularly sensational language which is used to grab attention, or news lexicon which highlights topicality, novelty and truth claims. 

Last week I wrote about the first of my two small research projects, so this week I want to introduce the second: Fake News and Facts in Topical Ballads. This will be a digital humanities project which will use corpus data analysis to look at the links between ballad and pamphlet news.

[Image credit: Thomas Charles Wageman, public domain]

Shakespeare’s ballad-seller Autolycus is famous for peddling tall tales to credulous commoners hungry for news of monstrous fish and miraculous births.[1] My project therefore aims to check the accuracy of the information in popular songs, in order to challenge the assumption that ballads were full of fake news. It will show that, despite recent scholarship which has challenged our belief in the existence of the ‘news ballad’, the genre really did exist prior to the invention of regular news periodicals, and that by supplying information to its customers in an entertaining way, it helped to shape social responses to the news. By using state-of-the-art corpus data analysis of ballads and pamphlets, rather than viewing the ballad in isolation from – or in competition with – other news-forms, I hope to demonstrate that there was more than one way to tell the news, and that one method was not intrinsically more important or accurate than another.

Scholarly interest in ballads has surged since the publication of Christopher Marsh’s Music and Society in Early Modern England. There has been a recent special issue of Renaissance Studies on street singers in Renaissance Europe (33:1), for example, and a plethora of articles on English balladry alone, but the role of song in spreading news remains contentious.[2]  Angela McShane argues that ‘there was no such thing as a “news ballad”’ and that ballads, being songs, served a different purpose.[3]  Nevertheless, I don’t believe that their entertainment value need necessarily undermine their newsworthiness.  I intend to carry out the first systematic study of the relationship between English ballad and pamphlet news prior to the development of a regular periodical press. This will enhance our understanding of early modern news networks by offering insights into the intermediality and interdependency of different cheap print genres.

The first step is to build a database of ballads identified from the Stationers’ Registers Online and British Broadside Ballads of the Sixteenth Century.[5] I will access topical ballad texts using digital archives such as the English Broadside Ballad Archive, Early English Books Online and Broadside Ballads Online.[6] Next I will try to find news pamphlets relating to the same events. And this is where the corpus data analysis comes in: specialist corpus linguistics software such as AntConc will highlight any textual overlap between the ballad and pamphlet texts much more quickly and accurately than even the closest of close readings could. This will demonstrate whether ballads have any significant relationship with news pamphlets. If the software finds substantial similarities between the texts, I will attempt to explain how and why this might have occurred, for example by looking for evidence that the texts were officially commissioned.
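As an illustration of the underlying idea (AntConc does this out of the box, and far more robustly), here is a toy sketch of finding shared word sequences between two texts – the example texts are invented:

```python
import re

def ngrams(text: str, n: int = 4) -> set[tuple]:
    """Lower-case the text, strip punctuation, and return its set of
    n-word sequences (4-grams by default)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

ballad = "Strange newes from the north of a monstrous fish of late taken"
pamphlet = "A true report of a monstrous fish of late taken neere the coast"

# Any 4-gram appearing in both texts is a candidate textual overlap.
for gram in ngrams(ballad) & ngrams(pamphlet):
    print(" ".join(gram))  # e.g. 'a monstrous fish of', 'fish of late taken'
```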

But there is still no substitute for the human eye and the software can’t do all the analysis. Only by carefully reading the texts will I be able to see whether the need for a narrative story arc in ballads helped to shape the way the news was presented in songs.

Now all I have to do is decide which project I want to make a start on first.


[1] William Shakespeare, The Winter’s Tale.

[2] Christopher Marsh, Music and Society in Early Modern England (Cambridge: CUP, 2010).

[3] Angela McShane, ‘The Gazet in Metre’ in Joop Koopmans (ed.), News and Politics in Early Modern Europe (Leuven: Peeters, 2005), p.140.

[5] <https://stationersregister.online/> [accessed 15 April 2019]; Carole Rose Livingston, British Broadside Ballads of the Sixteenth Century (New York: Garland, 1991).

[6] <https://ebba.english.ucsb.edu/>; <https://eebo.chadwyck.com/home>; <http://ballads.bodleian.ox.ac.uk/>

[all accessed 15 April 2019]