I’m still working on a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen, given my new teaching commitments and other roles to juggle. In the meantime, I’ll share with you my thoughts as each week progresses.

The first activity this week was to look at the language used around a term such as ‘refugee’ in a newspaper article.  I looked at an article from a recent edition of the Guardian on President Trump’s changes to the US refugee programme. It referred to them as refugees, displaced people, and talked about them in comparison to ‘people seeking asylum at the southern border’ before pointing out that the asylum program and the refugee program were separate, which implies that the two categories are not the same (I know they aren’t legally – my point is that it doesn’t make this explicit in the text and you are left to work it out). It gives examples of groups with whom we might be expected to be sympathetic (those fleeing religious persecution, for example) and a case study and quotes from a resettled refugee who had contributed to American society. The language around here is positive: ‘love’, ‘safely’ and ‘allowed’. There is an interesting contrast though, towards the end of the article, where the positive language of contribution is replaced by language such as ‘loss’, ‘sad’, ‘shut down’, ‘difficult’ and ‘complicated’, when describing what life is going to be like in the future.

The week then used an ESRC-funded research project on how British newspapers talk about refugees and asylum seekers. The focus was methodological, looking at how Corpus Linguistics (CL) might contribute to critical discourse analysis.

Tony McEnery, the course tutor, described how the project team put together the corpus on which the study was based.  First they put together a pilot corpus of texts about refugees and asylum seekers and looked at what words became key in that corpus.  This helped them to compose a query string which could be used to search huge corpora of newspaper articles for relevant material.  Then they split the data into two corpora – one of tabloid and one of broadsheet journalism. They looked carefully at the number of articles and words in each of the two corpora, explaining the difference by pointing out that tabloid articles are usually shorter than broadsheet ones. Moving on to think about how CL contributes to critical discourse analysis, he introduced the idea of the topos (plural = topoi) – a broad theme in data, although according to the course website, ‘In critical discourse analysis it usually has a slightly different sense of ‘warrant for an argument’ rather than theme’.

The two corpora were tested to see which keywords were associated with the query terms.  As well as looking at the overall picture of what keywords were most commonly associated with the query terms in the broadsheets and the tabloids, this could be done on a more focussed basis by looking at, for example, the words used to describe the arrival of refugees.  So the keywords help to shape the topoi, but also, the discourse was created mainly by the tabloids – almost all the keywords were dominated by the tabloids, except in the category of ‘plight’, where the language used was shared by both but the broadsheet newspapers had more. 

The next video looked at whether the words which turned up most frequently in the articles were collocates. There were some collocates relating to the number of refugees and the theme of plight, and these were found across both tabloid and broadsheet newspapers. The team then looked at the words clustered to the right of ‘illegal’, where it might be acting as a modifying adjective. The theme of illegality was more emblematic of the tabloids, with some especially strong tabloid clustering with ‘immigration and’ – the conjunction ‘and’ was forcing discourses together. In comparing the two corpora, the use of particular words and clusters had to be normalised per million words because the broadsheet corpus was much larger.
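To make the normalisation step concrete for myself, here is a minimal Python sketch. The function name and all of the counts are my own invented illustrations, not figures from the project; it simply shows why raw counts from corpora of different sizes can't be compared directly.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Return a raw frequency normalised to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical counts for one cluster in two corpora of different sizes.
tabloid_rate = per_million(raw_count=300, corpus_size=20_000_000)
broadsheet_rate = per_million(raw_count=600, corpus_size=80_000_000)

print(f"tabloid: {tabloid_rate:.1f} per million words")
print(f"broadsheet: {broadsheet_rate:.1f} per million words")
```

In this made-up case the broadsheets have twice as many raw hits, but once the corpus sizes are taken into account the tabloid rate (15 per million) is double the broadsheet rate (7.5 per million).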

Step 4 looked at a particular cluster, ‘pose as’ – both how it was used on its own and how it was used in proximity to refugees or asylum seekers. The tabloids used the phrase far more often than the broadsheets (normalised per million words) and especially so in relation to refugees and asylum seekers. The course also noted that in the tabloids, the phrase was reported as fact not opinion and was closely associated with a negative stance, with no space given to an opposing side. Another interesting cluster was thrown up by ‘X pose as’ plus a statement of status such as ‘asylum seeker’, which was used to show faults in the asylum system.

The final video looked into direct or indirect presentation. Direct presentation is when something is directly attributed to something else through modification. For example, in the phrase ‘illegal immigrants suffocated’, the modification of the immigrants with the adjective ‘illegal’ attributes that quality to them directly. Indirect presentation uses more general references which imply the same – words such as ‘trafficked’ or phrases such as ‘sneaking in’.

After the quiz, we moved on to the hands-on section of the course in #LancsBox. At least I would have done, had I been able to make it work. Because here again I was hit with the same problem from week 1: #LancsBox has to be run from a folder which will allow you to make changes… and that’s not as simple as it sounds. I’ve changed the machine on which I’m working, so I had to download the software again. After over an hour of searching around the internet, the course forums, my own notes, my computer folders, as well as several failed attempts to extract and run the files, I finally got it working on my Desktop.

This week we looked in more detail at collocation. Collocation is defined by the course as ‘the systematic co-occurrence of words in use, words which tend to occur frequently with one another and, as a consequence, start to determine or influence one another’s meanings’. When looking for collocates, a span of plus or minus 5 words is useful, and to be considered a collocate the words have to occur within this span several times, with a baseline of 10 hits being suggested. Of course, this relates to enormous corpora, and again highlights for me the particular difficulties of using a smaller, opportunistic corpus – it can overstate the importance of associations.
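The span-and-threshold idea can be sketched in a few lines of Python. This is only my own rough illustration, assuming the text has already been split into a list of word tokens; the function name `collocates` and its parameters are hypothetical and this is not how #LancsBox itself is implemented.

```python
from collections import Counter


def collocates(tokens, node, span=5, min_freq=10):
    """Count words within `span` tokens either side of each hit of `node`.

    Only words that reach `min_freq` co-occurrences are returned.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - span)
        # Words to the left of the node plus words to the right of it.
        window = tokens[start:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return {word: n for word, n in counts.items() if n >= min_freq}


# A tiny invented token list, repeated so that the threshold can be met.
sample = ["the", "refugees", "arrived"] * 12
print(collocates(sample, "refugees", span=1, min_freq=10))
```

With a small corpus very few words will reach a threshold of 10, which is exactly the difficulty with opportunistic corpora: lowering `min_freq` lets more words through, but makes it easier to overstate a chance association.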

But frequency alone isn’t necessarily a good guide to the importance of a collocation.  Although frequency lists can produce important data, they are usually topped by functional words such as ‘the’ and ‘of’, so they often need manipulation to produce useful results.  One way of carrying out this manipulation is to look at mutual information.  This ranks the closeness of the association between two words, and thereby often removes the functional words from the results because although they appear in close proximity to one word, they also appear in close proximity to many others.
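As far as I understand it, a mutual information score compares how often two words actually occur together with how often they would be expected to occur together by chance. The sketch below uses the simplest textbook form, the base-2 logarithm of observed over expected; it is my own illustration with invented counts, and real tools may calculate the expected frequency differently (for example, by adjusting for the size of the span).

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Simple MI score: log2 of observed over expected co-occurrence."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


# Invented counts in a hypothetical corpus of one million words.
# A function word co-occurs with the node often, but it is frequent everywhere.
common = mutual_information(pair_freq=50, node_freq=100,
                            collocate_freq=60_000, corpus_size=1_000_000)
# A rarer word co-occurs less often, but nearly always near the node.
rare = mutual_information(pair_freq=20, node_freq=100,
                          collocate_freq=40, corpus_size=1_000_000)

print(f"function word: {common:.2f}, rarer word: {rare:.2f}")
```

The function word scores much lower even though it has more raw co-occurrences, because a word that turns up next to everything is expected to turn up next to the node as well.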

Next we looked at colligation, where words are associated with a particular grammatical class rather than particular meanings.   This was compared to semantic preference, where a word form is closely associated with a set of words which have some kind of semantic similarity.

Then we had to think about keywords: these are created by comparing two frequency lists to one another. Mathematical tools such as chi-square and log-likelihood tests are able to identify keywords which are important based on their relative frequency in two corpora. They identify the words which are unusually frequent in one of the corpora, suggesting that they are the words which characterise it. But you still need to set thresholds for when something becomes a keyword, perhaps only taking the top hits, ensuring that they occur a certain number of times or are spread through a certain number of texts in the corpus. The course argued that one of the most useful aspects of this sort of study was that you get a feel for the ‘aboutness’ of a text – the key nouns and verbs which tell us what is being talked about. It’s also important that these tests are replicable.
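To get my head round the log-likelihood test, I tried writing it out as a short Python function. This is a sketch of the commonly used two-corpus formula (observed counts compared with the counts expected if the word were spread evenly across both corpora); the function name and the example counts are my own inventions, and I am not claiming this is exactly what the course software does.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of one word across two corpora."""
    total = freq_a + freq_b
    # Expected counts if the word were equally common in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Invented counts: same relative frequency, then a clear imbalance.
print(log_likelihood(10, 1_000, 20, 2_000))
print(log_likelihood(50, 1_000, 10, 1_000))
```

A score above about 3.84 corresponds to p < 0.05 (one degree of freedom), but with thousands of words being tested at once a much stricter cut-off is sensible, which is where the thresholds mentioned above come in.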

Part 4 explained how CL could be used to plot the changing importance of words over time by looking at which words have decreasing or increasing frequency. (The words whose frequency remains relatively constant over time are known as lock words.) Increasing or decreasing keywords can be checked against concordances to see whether there are societal changes which might explain them. We can start to identify such societal changes by looking at the words used around the keywords. These might then be seen not just as a statistical keyword but a socially salient keyword which identifies a dominant discourse in society. This interested me a lot, because the idea of socially salient keywords would be a relatively easy but nevertheless really interesting subject to investigate with the early modern ballad corpus…! There are some words which experience suggests appear a lot in ballads at certain points in time, but it would be interesting to see if these scientific techniques might be able to identify trends that the human eye cannot.

Part 5 looked at why CL needed to be integrated with other methods – they can help bring quantitative and qualitative methods together, but they are only one tool. 

After the quiz, we went over some of these ideas again in more practical ways, with the instructions on how to look at connectivity in the #LancsBox software. Connectivity starts with a node word chosen by the user, where we can look at the first order collocates, and then look at their collocates and how they cross-connect.  This eventually gives us a collocation network – a graph which helps to show the associations and cross-associations between words. This can be drawn up with the GraphColl function, which I then played about with for a bit.  At the moment I’m quite slow, but it’s certainly interesting and, like most things, practice will, I hope, make perfect.

This week I started a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen. But the course leader suggests that once we get past the first few weeks we can pick and choose to study the elements which are useful to us, so I’m hoping it will be manageable alongside the new teaching activities that I’ve got ahead. In the meantime, I’ll share with you my thoughts as each week progresses.

I signed up to the course way back at the beginning of summer, so that I would get a grounding in CL ready to undertake my project on Fake News and Facts in English ballads. It has become immediately apparent, however, that my little project is more Corpus Discourse Analysis than true CL – more on this another time!

The simplest definition of a corpus is a lot of words stored on a computer.  But it is also a methodology for approaching language.  Large amounts of data in a corpus can tell us about tendencies and what is normal or typical, or rare or exceptional cases – you can’t tell either of these things from looking at single texts.  Computers are quicker and more accurate than humans in dealing with such large amounts of data.

When putting a corpus together or choosing what to use, what you choose to look at depends on your research questions; it must be broadly representative of an appropriate type of language; and it must be in machine readable form such as a text file.  It might be considered a standard reference of what’s typical and is often annotated with further linguistic information.

Next, we had a brief look at annotation (or tagging). The computer can’t tell what is a heading or where new paragraphs start. You might want to search for something just within titles, but a computer can’t tell the difference unless you tell it, and therefore can’t do what you want it to. A lot of this information is there in the control characters that make a document appear the way you want it to. It can’t tell grammatical data such as where a word might begin or end. It can’t tell what part of speech each word is. These sorts of annotation allow you to tailor your request, to look at what words follow a particular word, or just search the headings. They improve the quality of your searches. Actually, this annotation is often done by computer.

There are two main types of corpora:

  • Specialist – with a limited focus of time, place, language etc which creates a smaller amount of data
  • General – often with a much larger number of words and more representative.

Other corpora include:

  • Multilingual – comparing different languages or different varieties of the same language
  • Parallel – comparing the same texts in translation
  • Learner – the language used by language learners
  • Historical or Diachronic – language used in the past
  • Monitor – which are continually being added to.

From here we moved on to look at some technical terms, including frequency data, which quickly shows how often words appear per million words in the corpus. Concordances show the context of the hits, and again, this can be done quickly, so you can sort the context according to emerging patterns. Collocation is the tendency for words to co-occur and for meaning to be built from those co-occurrences.

Then we thought a bit about the limitations of corpus data analysis. It can’t tell us whether something is possible in language (i.e., it can’t tell us about appropriate usage). We can only deduce things from corpora; they aren’t facts in and of themselves, and although a corpus gives us evidence it can’t give us explanations. Finally, the corpora rarely present the related images alongside the text, or the body language, behaviours etc of speakers, so they present language out of context.

Then the course moved on to having a go at using #LancsBox, which we had to download and open – bizarrely, this was by far the hardest bit of the course so far, because it has to be run from the Applications folder (my Applications folder is dusty and otherwise unused, hidden somewhere in the attic of my machine and only located using the search function on the file manager). #LancsBox comes with various ready-prepared corpora that you can download, so I decided to have a go with the 1641-1660 English newsbooks. As I wasn’t working on my desktop, but my laptop, it wasn’t the quickest import, but once it was done we had a go at using the KWIC function to create a concordance. You search for a word or ‘node’. Having found all the instances of the word ‘Rupert’, in my case, I could sort them by the context on the left or the right, or I could filter them by clicking on one side or the other and typing in a word (although if you want to filter by the word immediately preceding or following the node, you need to use the advanced filter function). But I had really jumped the gun, as the next activity was a walk-through in using one of the corpora to search for some given words! Still it was fun to play around with it.
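For anyone curious about what a KWIC (key word in context) search is doing underneath, here is a rough Python sketch of the idea. It is my own toy version with hypothetical function names and an invented sentence, not the #LancsBox code: it finds each hit of the node, keeps a few words either side, and lets you sort by the right-hand context.

```python
def kwic(tokens, node, width=4):
    """Return (left context, hit, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def sort_by_right(lines):
    """Sort concordance lines alphabetically by their right-hand context."""
    return sorted(lines, key=lambda line: line[2].lower())


# An invented sentence, not taken from the newsbooks corpus.
text = "Prince Rupert marched north while Rupert rested".split()
for left, hit, right in sort_by_right(kwic(text, "rupert", width=2)):
    print(f"{left:>15} | {hit} | {right}")
```

Sorting by the left-hand context would work the same way using `line[0]`; lining the hits up in a column is what makes repeated patterns around the node easy to spot.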

There were several readings for the week – the first was interesting and talked, for example, about the distinction between types of corpora (monitor, balanced, snapshot). Most of the corpora in which I would be interested would, I think, be classified as opportunistic corpora, because something else comes into play when working with historical corpora – survival. So in some respects, historical corpora are self-selecting, because we can only create a corpus out of what survives. #LancsBox comes with the 1640-60 newspapers, but (I think – not quite my field so I’m not sure) they are a relatively ‘good’ source because Thomason made the decision to collect everything that he could of the printed ephemeral material relating to the civil war. Without collectors like him, other material is much more dependent on survival rates. Which isn’t to say that I don’t think CL is useful, just that there are extra (or maybe different) caveats about what it can tell us, so we need to be very aware of these when we interpret our data.

As I’ve never done any linguistics, the second chapter was much harder going and I didn’t really understand a lot of it!  Then things got even more complicated for me with a chapter on statistics for CL.

The final task before the optional activities was to think about the design aspects of our proposed work.  Although my Fake News and Facts project uses concordances and collation, I wanted to think about something on a rather bigger scale, so I imagined a larger corpus of ballad texts and news pamphlets to search for any features that were particularly common or common to both in the period prior to the civil war.  The question is whether the data would be skewed by the inclusion of all ballads rather than exclusively topical ones.   The restriction to topical ballads (and news pamphlets) would in itself be subjective, whereas the inclusion of all ballads might show up other common features that we would not otherwise spot, so as far as I’m concerned the jury is still out on that question! 

The corpus would be quite large, but as it is an opportunistic corpus based on only those texts which have survived, it would not be anything like as large as some that have been mentioned on the course. Annotation might be helpful, in terms of divisions between headings and body text, as this might highlight particularly sensational language which is used to grab attention, or news lexicon which highlights topicality, novelty and truth claims. 

An interesting week. I’ve spent most of it smoothing out the wrinkles in my epitaph ballad article. I think it’s nearly ready to go, which is quite pleasing. The process of refinement is interesting and one that I really quite enjoy, as it brings out the pedant in me. I’ve spent most of the week trying to marry together the three elements of the article – the research, the historiography and the background information. I think, now, that I’ve been fairly successful. I have a supervision meeting later in the week so the first job for Monday (when I’ve been to visit a possible new hall for the Historical Association in Bolton) is to send it off to my supervisors to see what they have to say, then I have to decide where to send it.

I’ve also been rewriting the paper on ‘Knowingness and the Mid-Sixteenth Century Ballad’, mainly about the flyting on Thomas Cromwell.   I hope to be able to do away with the script by Tuesday evening, when I give the paper at the Postgraduate History Seminar Series at the University of Manchester.  There will be a repeat performance in Lancaster on Wednesday for the North West Early Modern Seminar Series.  At the beginning of last week, I wasn’t entirely looking forward to it, but having thought it out again I’m much happier about it.  I was trying to cram too much information in, but having taken a lot of examples out and replaced them with ideas, it seems to work much better.  I’m rather looking forward to the chance to discuss my work with everyone on both days. I plan to go out on something of an academic limb, so I hope that there aren’t any people clinging to the tree trunk with chainsaws!  I still have a handout to finish to go with it, so that will have to be a job for Monday too.  Oh…  Monday is tomorrow.  Hmm.  Busy day then.

On Wednesday I went into Manchester.   I spent a nice day working in the John Rylands Library and then went to the Print and Materiality in the Early Modern World seminar, where I heard Angela McShane give her paper on ‘The Seventeenth Century Political Ballad as Subject and Object’.  We had an interesting conversation afterwards, too, which was great.

Then today I started again on the secondary reading that’s been backing up for weeks. I’m reading M. L. Bush on the Government Policy of Protector Somerset, but I’m finding it slow and heavy going, if I’m honest. There’s not going to be much time this week to catch up.


This is the first time since I started back at work that I’ve really felt like I’m back at work.  I’ve begun work on my fifth chapter, ballads and the common weal.  But it’s been a funny sort of week.  I spent Monday with my head stuck in my source material, trying to find the links, sorting them into groups and writing a time line.  On Tuesday I went into the library in Manchester to read a book about John Payne Collier.  He’s turned out to be something of a pain in the neck, if I’m honest.  Not only did he have a habit of leaving out the provenance of the ballads he published in the mid-nineteenth century, he also had an irritating compulsion to forge things.  Even the transcriptions that aren’t of his own invention are, apparently, full of errors.  So at the moment, I am faced with a choice:  ignore everything he ever went near, or go back to the  original sources themselves if I can find them or get at them.  Not a particularly easy decision to make.  What’s more, the man was all over Victorian literary scholarship and those who were caught unawares innocently passed on his errors, so I have to be very careful indeed.

On Wednesday afternoon I went in to the university to pick up an inter-library loan.  I stayed for the history department’s public event, a conversation between Prof. Michael Wood and Tristram Hunt, MP.  It  was very interesting, but I’m not really sure it could be billed as Prof. Wood’s inaugural lecture, as it wasn’t my idea of a lecture.  Very enjoyable, though, and I’m very, very glad I went.

Yesterday and today I have spent working on my chapter. I’ve got about 1200 words down on paper, although some of that is just notes of ideas, but I’m still quite pleased. At least I have got a few ideas to work on this week, which I hadn’t last weekend. I’m in a familiar, if rather uncomfortable, position where I have got several things rattling round in my brain that I’d like to work on, but it’s Friday afternoon and now I’m on childcare duty so everything else has to wait until Monday.

I’ve also offered to present a paper at the North West Early Modern Seminar Series at Lancaster University in November, so I have to fit writing that into the next few weeks as well.


Histfest programme

I was very pleased to attend Lancaster University’s postgraduate history conference yesterday, where I spoke about my work on knowingness in Tudor ballads and the links between sacred and secular music. I think they had a bit of a shock when I started singing ‘Down in Yon Forest’ to demonstrate the simplicity of melody and ‘call and response form’, both of which help to make it a memorable tune. The rest of the musical examples I had recorded my husband singing, because I didn’t feel confident that I would have time to learn them before the seminar, but I think having the musical examples really helped because it brought home how the melody can make links between the songs. There were some very interesting questions and the paper seemed to go down well. I was also very interested in the papers presented by my fellow panelists, James Mawdesley and Sarah Ann Robins, both early modernists too. I would have liked to attend Geoffrey Humble’s paper during the morning, but I accidentally ended up in the wrong room!

I had a really interesting supervision meeting this week where we shared our ideas about early modern attitudes to death and looked at the epitaph ballad that I’ve been studying.  I’ve put that to one side for a bit though, in an attempt to get a chapter finished before my next panel meeting in a month’s time.  So today I’ve gone back to working on the ballad contrafacta, in particular pulling together my table of ballads with more than one set of words to the same tune.  I spent several days on it before we went on holiday and I’ve spent another 4 hours on it today.  It’s still not finished, but I needed a break, so I decided I’d catch up on my blog before I tried to do any more on the table.

On Friday I went to the Pathways postgraduate careers event at the university, but I’m no clearer about what I’m going to do when I finish my PhD.

The big news of the day is that I’ve had my first conference paper accepted for Histfest at Lancaster University.  This will be my first conference paper and as far as I’m concerned it has several advantages as a first conference: it’s just up the road, so I’m nearby and I’m not going to get lost on the way there; it’s a postgrad conference so it’s a good first step; and it’s got a reputation for being very friendly.


I’ve also submitted my first article to a journal.  Now for the waiting game: it will take about three months for the peer review process, which I suppose will take me through to mid-August.  I might as well just forget about it for a while!



I’ve spent some time this week working on my writing, trying to improve the style and clarity.  I’ve been looking at the moralisations of ballads that appear in the Stationers’ Registers for my period, so I thought I’d give serious consideration to how I wrote about them in the light of last week’s lesson on how to write a sentence.  I sent a couple of paragraphs off to my supervisor for inspection and I’m happy to report some improvement.  I think I’ve probably become a bit sloppy because of my tendency to splurge ideas on paper without thinking about where they are going or how I am setting them down.  I also suspect that the bar has suddenly been raised and I’m no longer getting away with things that didn’t matter in the past.  That’s fine.  I know (even though he hasn’t told me) that my supervisor’s making me work harder because he knows I can do better, and that’s a good thing.  I’ve printed out the last set of corrections that he sent and I’m keeping them by me on my desk, to remind me how it should be done!  I’ve written about a thousand words this week, which is great because I know that they are better quality ones.  I hope that in the long run, they’ll need a little bit less messing about with later!


I took advantage of the beautiful weather on Tuesday to work in my garden office.  It was warm and sunny, so I ran a lead out the back door for my laptop and sat at the patio table to work.  It turned out to be a very good day for thinking.  I wrote about 6 pages of ideas in one of my research books.  The questions I came up with have kept me going for the rest of the week.  That helped to improve my writing, because I knew what I wanted to talk about before I started to say it.

English: Queen Mary, University of London’s Charterhouse Square site, home to student accommodation and departments of Barts and The London School of Medicine and Dentistry. (Photo credit: Wikipedia)

The last thing I did before knocking off on Friday afternoon was to book my place as a delegate for the Psalm Culture conference at Queen Mary University, London, in July.  I’m looking forward to going, but I have to say that the idea of spending three days in the capital all by myself is a bit daunting.  I am so used to going everywhere as part of a package that the idea of being a professional person in my own right for several days without interruption is somewhat scary.   I’ve booked everything – trains, hotel and conference – so that I can’t back out of it!


As I posted on Twitter, I have hit upon a paradox in my work.

The more I read, the more I want to write.   The more I write, the more I need to read. 

This one’s a difficult one.  Here’s where I am.

Yesterday I read through my musicological ballad analysis chapter and started to read Beth Quitslund’s ‘The Reformation in Rhyme’.  I found references to several other books I should look at  about the metrical psalters.  This is par for the course.  This happens every time I read anything.  Each book I look at generates about another 4 or 5 that I feel the need to look at too.  I’m used to this, but it gets a little bit frustrating.

This morning I sat down to write some notes to remind myself what I need to do in the next few weeks.  I  looked through the notes from my last panel meeting, and from the supervisory meetings I had just before I was taken ill.  More things to read.  Chapters in books, unpublished theses, articles, entire monographs…  More and more things to read.  Then I looked round at my bookshelves, groaning under the weight of unread books from the library.

Most of what I ‘need’ to do is reading,  but I need to write something for my next meeting.  I have 23 thousand words of a working document on the analysis of the ballad tunes and their lyrics (it will be substantially less when I move the ballad lyrics to the appendices), but it’s not finished.  I need to do more work on dating the ballads and analysing their lyrics.  Then I need to relate it to the general trends in Renaissance music of the period, and so we come back to secondary reading.  Everything I do leads to more reading.  But I want to write! What’s more, I need to write.  So I suppose at some point I have to draw the line under reading, at least for a while, to do the writing that that reading has generated and carry on with the primary research.

I had a very supportive ‘back to work’ meeting on Wednesday. We talked about my plans to ease myself back into work gently with some reading! Also, I have an article about Jacobean corruption and Saint John Roberts almost ready for submission to a journal. I just have to get an exact reference for the document on which it is based, sort out exactly how to present the website references and check it through once more. As soon as I get the document archive reference, I will be sending it straight off and not holding my breath. There are also a couple of conferences I want to prepare something for. One is Histfest at Lancaster University, which is just up the road from me. So I’ve got plenty to keep me going. It was lovely being back at work, and great to know I’ve got my panel supporting me.