I’m still working on a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen, given my new teaching commitments and other roles to juggle. In the meantime, I’ll share with you my thoughts as each week progresses.

The first activity this week was to look at the language used around a term such as ‘refugee’ in a newspaper article.  I looked at an article from a recent edition of the Guardian on President Trump’s changes to the US refugee programme. It referred to refugees and displaced people, and talked about them in comparison to ‘people seeking asylum at the southern border’ before pointing out that the asylum programme and the refugee programme were separate, which implies that the two categories are not the same (I know they aren’t legally – my point is that the text doesn’t make this explicit and you are left to work it out). It gives examples of groups with whom we might be expected to sympathise (those fleeing religious persecution, for example), plus a case study and quotes from a resettled refugee who had contributed to American society. The language here is positive: ‘love’, ‘safely’ and ‘allowed’. There is an interesting contrast, though, towards the end of the article, where the positive language of contribution is replaced by language such as ‘loss’, ‘sad’, ‘shut down’, ‘difficult’ and ‘complicated’ when describing what life is going to be like in the future.

The week then drew on an ESRC-funded research project on how British newspapers talk about refugees and asylum seekers.  The focus was methodological, looking at how CL might contribute to critical discourse analysis.

Tony McEnery, the course tutor, described how the project team put together the corpus on which the study was based.  First they assembled a pilot corpus of texts about refugees and asylum seekers and looked at which words became key in that corpus.  This helped them to compose a query string which could be used to search huge corpora of newspaper articles for relevant material.  Then they split the data into two corpora – one of tabloid and one of broadsheet journalism. They looked carefully at the number of articles and words in each of the two corpora, explaining the difference by pointing out that tabloid articles are usually shorter than broadsheet ones. Moving on to think about how CL contributes to critical discourse analysis, he introduced the idea of the topos (plural = topoi) – a broad theme in data, although according to the course website, ‘In critical discourse analysis it usually has a slightly different sense of ‘warrant for an argument’ rather than theme’.
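To make the query-building step concrete for myself, here is a rough Python sketch of what searching a large newspaper corpus for relevant articles might look like. The project’s actual query string and search software aren’t given in the course, so the pattern below is purely illustrative, built from the obvious terms mentioned in the videos.

```python
import re

# Purely illustrative query pattern – the project's real query string,
# derived from the pilot-corpus keywords, is not reproduced in the course.
QUERY = re.compile(
    r"\b(refugees?|asylum[- ]seekers?|displaced\s+(?:people|persons))\b",
    re.IGNORECASE,
)

def is_relevant(article_text: str) -> bool:
    """Keep an article if it mentions any of the query terms."""
    return QUERY.search(article_text) is not None

# Example: filter a list of article texts down to the relevant ones.
articles = [
    "Ministers debate the refugee resettlement programme.",
    "Local football team wins derby.",
]
print([a for a in articles if is_relevant(a)])
```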

The two corpora were tested to see which keywords were associated with the query terms.  As well as looking at the overall picture of which keywords were most commonly associated with the query terms in the broadsheets and the tabloids, this could be done on a more focussed basis by looking at, for example, the words used to describe the arrival of refugees.  So the keywords help to shape the topoi; but the discourse was created mainly by the tabloids – almost all the keyword categories were dominated by the tabloids, except ‘plight’, where the language used was shared by both but used more by the broadsheets.

The next video looked at whether the words which turned up most frequently in the articles were collocates.  There were some collocates relating to the number of refugees and to the theme of plight, and these appeared across both tabloid and broadsheet newspapers.  The team then looked at the words clustered to the right of ‘illegal’, where it acts as a modifying adjective.  The theme of illegality was more emblematic of the tabloids, with some especially strong tabloid clustering around ‘immigration and’ – the conjunction ‘and’ was forcing discourses together.  In comparing the two corpora, the use of particular words and clusters had to be normalised per million words because the broadsheet corpus was much larger.
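Normalising per million words is simple arithmetic, but it is worth spelling out because it is what makes the tabloid/broadsheet comparison fair. A minimal sketch (the figures in the example are invented, not the project’s actual counts):

```python
def per_million(raw_count: int, corpus_size_in_words: int) -> float:
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count / corpus_size_in_words * 1_000_000

# Invented figures: the same raw count looks very different once
# corpus size is taken into account.
print(per_million(250, 5_000_000))   # larger (broadsheet-style) corpus -> 50.0 per million
print(per_million(250, 1_000_000))   # smaller (tabloid-style) corpus  -> 250.0 per million
```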

Step 4 looked at a particular cluster, ‘pose as’ – both how it was used on its own and how it was used in proximity to refugees or asylum seekers.  The tabloids used the phrase far more often than the broadsheets (normalised per million words) and especially so in relation to refugees and asylum seekers.  The course also noted that in the tabloids, the phrase was reported as fact rather than opinion and was closely associated with a negative stance, with no space given to an opposing view.  Another interesting cluster was thrown up by ‘X pose as’ plus a statement of status such as ‘asylum seeker’, which was used to point out faults in the asylum system.

The final video looked at direct and indirect presentation. Direct presentation is when a quality is attributed to something directly through modification. For example, in the phrase ‘illegal immigrants suffocated’, modifying the immigrants with the adjective ‘illegal’ attributes that quality to them directly.  Indirect presentation works through general or oblique references which imply the same quality – words such as ‘trafficked’ or phrases such as ‘sneaking in’.

After the quiz, we moved on to the hands-on section of the course in #LancsBox. At least I would have done, had I been able to make it work. Because here again I was hit with the same problem from week 1 – #LancsBox has to be run from a folder which will allow you to make changes… and that’s not as simple as it sounds. I’ve changed the machine on which I’m working, so I had to download the software again. After over an hour of searching around the internet, the course forums, my own notes and my computer folders, as well as several failed attempts to extract and run the files, I finally got it working on my Desktop.

This week we looked in more detail at collocation.  Collocation is defined by the course as ‘the systematic co-occurrence of words in use, words which tend to occur frequently with one another and, as a consequence, start to determine or influence one another’s meanings’.  When looking for collocates, a span of plus or minus 5 words is useful, and to be considered a collocate the words have to occur within this span several times, with a baseline of 10 hits being suggested. Of course, this relates to enormous corpora, and again highlights for me the particular difficulties of using a smaller, opportunistic corpus – it can overstate the importance of associations.
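To see how mechanical this is, here is a minimal Python sketch of collocate extraction, assuming the corpus is already tokenised into a flat list of lower-cased words. The span of ±5 and the threshold of 10 hits come straight from the course’s suggestions; everything else (the function and variable names) is just my own illustration.

```python
from collections import Counter

def collocates(tokens, node, span=5, min_freq=10):
    """Count words occurring within +/- `span` tokens of `node`, keeping
    only those that co-occur at least `min_freq` times."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span): i] + tokens[i + 1: i + 1 + span]
            counts.update(window)
    return {word: count for word, count in counts.items() if count >= min_freq}

# Usage (hypothetical): `tokens` is the whole corpus as one list of words.
# collocates(tokens, "refugee")  ->  {"asylum": 37, "camp": 15, ...}
```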

But frequency alone isn’t necessarily a good guide to the importance of a collocation.  Although frequency lists can produce important data, they are usually topped by functional words such as ‘the’ and ‘of’, so they often need manipulation to produce useful results.  One way of carrying out this manipulation is to look at mutual information.  This ranks the closeness of the association between two words, and thereby often removes the functional words from the results because although they appear in close proximity to one word, they also appear in close proximity to many others.
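Mutual information is usually presented as a formula rather than a recipe, so here is a rough sketch of one common form of the score – observed co-occurrence compared with what chance alone would predict, on a log scale. Real tools differ in the details (span adjustments and so on), and the numbers below are invented just to show why a function word like ‘the’ scores low.

```python
import math

def mutual_information(cooccurrences, freq_node, freq_collocate, corpus_size):
    """One common form of the collocation MI score: log2 of the observed
    co-occurrence count over the count expected by chance."""
    expected = freq_node * freq_collocate / corpus_size
    return math.log2(cooccurrences / expected)

# Invented figures: a rare but tightly bound pair scores far higher than a
# pairing with a function word that co-occurs with nearly everything.
print(mutual_information(20, 100, 150, 1_000_000))      # strong association (~10.4)
print(mutual_information(60, 100, 60_000, 1_000_000))   # weak association (~3.3)
```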

Next we looked at colligation, where words are associated with a particular grammatical class rather than particular meanings.   This was compared to semantic preference, where a word form is closely associated with a set of words which have some kind of semantic similarity.

Then we had to think about keywords: these are created by comparing two frequency lists to one another.  Mathematical tools such as chi-square and log-likelihood tests are able to identify keywords which are important based on their relative frequency in two corpora.  They identify the words which are unusually frequent in one of the corpora, suggesting that they are the words which characterise it.  But you still need to set thresholds for when something becomes a keyword, perhaps only taking the top hits, ensuring that they occur a certain number of times or are spread through a certain number of texts in the corpus.  The course argued that one of the most useful aspects of this sort of study was that you get a feel for the ‘aboutness’ of a text – the key nouns and verbs which tell us what is being talked about.  It’s also important that these tests are replicable.
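For my own reference, here is a sketch of the log-likelihood calculation in the form usually used for keyword extraction (a Dunning-style statistic comparing a word’s frequency in two corpora). The course doesn’t give the formula itself, and real tools add corrections and significance cut-offs, so treat this as the bare idea; the example figures are invented.

```python
import math

def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood score for a word occurring freq_a times in a corpus of
    size_a words and freq_b times in a corpus of size_b words. Higher scores
    mean the frequency difference is less likely to be down to chance."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Invented figures: a word used 300 times in a 1m-word corpus but only 50
# times in a 5m-word corpus is a strong keyword candidate for the first.
print(log_likelihood(300, 50, 1_000_000, 5_000_000))
```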

Part 4 explained how CL could be used to plot the changing importance of words over time by looking at which words have decreasing or increasing frequency.  (The words whose frequency remains relatively constant over time are known as lock words.)  Increasing or decreasing keywords can be checked against concordances to see whether there are societal changes which might explain them. We can start to identify such societal changes by looking at the words used around the keywords.  A keyword might then be seen not just as a statistical keyword but as a socially salient keyword which identifies a dominant discourse in society.  This interested me a lot, because the idea of socially salient keywords would be a relatively easy but nevertheless really interesting subject to investigate with the early modern ballad corpus…!  There are some words which experience suggests appear a lot in ballads at certain points in time, but it would be interesting to see if these scientific techniques might be able to identify trends that the human eye cannot.
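Thinking ahead to the ballad corpus, the mechanics of this would be straightforward: split the corpus into time periods, count a word in each, and normalise. A rough sketch, assuming the texts are already tokenised and grouped by period (the period labels and the example word are hypothetical):

```python
from collections import Counter

def frequency_by_period(texts_by_period, word):
    """Per-million frequency of `word` in each time period, so that rising,
    falling and roughly constant ('lock') words can be spotted.
    `texts_by_period` maps a period label to a list of tokenised texts."""
    results = {}
    for period, texts in texts_by_period.items():
        tokens = [tok for text in texts for tok in text]
        total = max(len(tokens), 1)
        results[period] = Counter(tokens)[word] / total * 1_000_000
    return results

# Hypothetical usage with a ballad corpus split by decade:
# frequency_by_period({"1620s": [...], "1630s": [...]}, "plague")
```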

Part 5 looked at why CL needs to be integrated with other methods – it can help bring quantitative and qualitative approaches together, but it is only one tool.

After the quiz, we went over some of these ideas again in more practical ways, with instructions on how to look at connectivity in the #LancsBox software. Connectivity starts with a node word chosen by the user: we look at its first-order collocates, then at their collocates and how they cross-connect.  This eventually gives us a collocation network – a graph which helps to show the associations and cross-associations between words. This can be drawn up with the GraphColl function, which I then played about with for a bit.  At the moment I’m quite slow, but it’s certainly interesting and, like most things, practice will, I hope, make perfect.
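This isn’t how GraphColl itself works under the hood – I don’t know its internals – but to fix the idea in my head, here is a sketch of building a small collocation network in Python with the networkx library, reusing the windowed counting from above. All the parameter values are just illustrative.

```python
import networkx as nx
from collections import Counter

def collocation_network(tokens, node_word, span=5, min_freq=5, depth=2):
    """Build a graph of the node word's collocates, the collocates of those
    collocates, and so on to `depth` levels (a sketch of the idea, not the
    GraphColl implementation)."""
    graph = nx.Graph()
    frontier = {node_word}
    for _ in range(depth):
        next_frontier = set()
        for word in frontier:
            counts = Counter()
            for i, tok in enumerate(tokens):
                if tok == word:
                    counts.update(tokens[max(0, i - span): i] + tokens[i + 1: i + 1 + span])
            for collocate, freq in counts.items():
                if freq >= min_freq and collocate != word:
                    graph.add_edge(word, collocate, weight=freq)
                    next_frontier.add(collocate)
        frontier = next_frontier
    return graph

# graph = collocation_network(tokens, "refugee")
# nx.draw(graph)  # a quick visual of the associations and cross-associations
```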