I’m still working on a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen, given my new teaching commitments and other roles to juggle. In the meantime, I’ll share with you my thoughts as each week progresses.

This week we looked in more detail at collocation. Collocation is defined by the course as ‘the systematic co-occurrence of words in use, words which tend to occur frequently with one another and, as a consequence, start to determine or influence one another’s meanings’. When looking for collocates, a span of plus or minus 5 words is useful, and to be considered a collocate the words have to occur within this span several times, with a baseline of 10 hits being suggested. Of course, this relates to enormous corpora, and again highlights for me the particular difficulties of using a smaller, opportunistic corpus – it can overstate the importance of associations.
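
To make that concrete for myself, here is a rough Python sketch of the windowed counting involved – very much a toy, assuming the corpus has already been reduced to a plain list of tokens, with the span and baseline taken from the course’s suggested figures:

```python
from collections import Counter

def find_collocates(tokens, node, span=5, min_hits=10):
    """Count words co-occurring with `node` within +/- `span` tokens,
    keeping only those that clear the suggested baseline of hits."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Gather the window of words either side of the node.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + span + 1]
            counts.update(window)
    return {word: n for word, n in counts.items() if n >= min_hits}
```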

But frequency alone isn’t necessarily a good guide to the importance of a collocation. Although frequency lists can produce important data, they are usually topped by function words such as ‘the’ and ‘of’, so they often need manipulation to produce useful results. One way of carrying out this manipulation is to look at mutual information. This scores the strength of the association between two words by comparing how often they actually co-occur with how often chance alone would predict, and thereby often removes the function words from the results: although they appear in close proximity to one word, they also appear in close proximity to many others, so their association with any particular word is weak.
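
A minimal sketch of that calculation, reusing the toy window-counting above – this is the standard pointwise mutual information formula; real tools refine the expected-frequency estimate (for instance, adjusting for window size):

```python
import math
from collections import Counter

def mutual_information(tokens, node, collocate, span=5):
    """Pointwise MI: log2(observed co-occurrence / expected by chance).
    Function words score low because they co-occur with everything."""
    freq = Counter(tokens)
    observed = 0
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + span + 1]
            observed += window.count(collocate)
    if observed == 0:
        return float("-inf")
    # Expected co-occurrences if the two words were independent.
    expected = freq[node] * freq[collocate] / len(tokens)
    return math.log2(observed / expected)
```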

Next we looked at colligation, where words are associated with a particular grammatical class rather than particular meanings.   This was compared to semantic preference, where a word form is closely associated with a set of words which have some kind of semantic similarity.

Then we had to think about keywords: these are found by comparing two frequency lists to one another. Statistical tests such as chi-square and log-likelihood are able to identify keywords which are important based on their relative frequency in two corpora. They identify the words which are unusually frequent in one of the corpora, suggesting that these are the words which characterise it. But you still need to set thresholds for when something becomes a keyword, perhaps only taking the top hits, or ensuring that they occur a certain number of times or are spread through a certain number of texts in the corpus. The course argued that one of the most useful aspects of this sort of study is that you get a feel for the ‘aboutness’ of a text – the key nouns and verbs which tell us what is being talked about. It’s also important that these tests are replicable.
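
Since the course mentioned log-likelihood, I tried writing out the standard calculation (Dunning’s log-likelihood) for a single word across two corpora – a sketch assuming you already have the word’s raw frequency and the total token count for each corpus:

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Dunning's log-likelihood for one word in two corpora.
    Higher scores mean the word's relative frequency differs
    between the corpora more than chance would predict."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# e.g. a word with 120 hits in a 1m-token corpus vs 40 in another:
# log_likelihood(120, 1_000_000, 40, 1_000_000) gives roughly 41.9,
# well above the conventional 3.84 threshold (p < 0.05).
```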

Part 4 explained how CL could be used to plot the changing importance of words over time by looking at which words have decreasing or increasing frequency.  (Words whose frequency remains relatively constant over time are known as lock words.)  Increasing or decreasing keywords can be checked against concordances to see whether there are societal changes which might explain them. We can start to identify such societal changes by looking at the words used around the keywords.  These might then be seen not just as statistical keywords but as socially salient keywords which identify a dominant discourse in society.  This interested me a lot, because the idea of socially salient keywords would be a relatively easy but nevertheless really interesting subject to investigate with the early modern ballad corpus…!  There are some words which experience suggests appear a lot in ballads at certain points in time, but it would be interesting to see whether these scientific techniques might be able to identify trends that the human eye cannot.
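
If I ever try this on the ballad corpus, the starting point would presumably be nothing more exotic than normalised frequencies per time slice – a sketch, assuming the corpus could be divided into token lists by year:

```python
from collections import Counter

def frequency_trend(tokens_by_year, word):
    """Relative frequency of `word` per year (per million tokens).
    A roughly flat line suggests a lock word; a steep rise or fall
    is a candidate for checking against the concordances."""
    trend = {}
    for year in sorted(tokens_by_year):
        tokens = tokens_by_year[year]
        trend[year] = Counter(tokens)[word] * 1_000_000 / max(len(tokens), 1)
    return trend
```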

Part 5 looked at why CL needs to be integrated with other methods – it can help bring quantitative and qualitative approaches together, but it is only one tool.

After the quiz, we went over some of these ideas again in more practical ways, with instructions on how to look at connectivity in the #LancsBox software. Connectivity starts with a node word chosen by the user: we can look at its first-order collocates, and then look at their collocates and how they cross-connect.  This eventually gives us a collocation network – a graph which helps to show the associations and cross-associations between words. This can be drawn up with the GraphColl function, which I then played about with for a bit.  At the moment I’m quite slow, but it’s certainly interesting and, like most things, practice will, I hope, make perfect.
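
Out of curiosity, I also tried to reconstruct the idea behind GraphColl – not LancsBox’s actual algorithm, just my guess at the shape of it, using the networkx library: start from the node word, find its collocates, then repeat for each of those, letting cross-connections show up as shared edges:

```python
import networkx as nx
from collections import Counter

def collocation_network(tokens, node, span=5, min_hits=5, depth=2):
    """Grow a collocation network outwards from `node`, GraphColl-style."""
    graph = nx.Graph()
    visited = set()
    frontier = {node}
    for _ in range(depth):
        next_frontier = set()
        for word in frontier:
            if word in visited:
                continue
            visited.add(word)
            counts = Counter()
            for i, tok in enumerate(tokens):
                if tok == word:
                    counts.update(tokens[max(0, i - span):i])
                    counts.update(tokens[i + 1:i + span + 1])
            for coll, n in counts.items():
                # Edges record collocates that clear the threshold;
                # an edge into an existing node is a cross-connection.
                if n >= min_hits and coll != word:
                    graph.add_edge(word, coll, weight=n)
                    next_frontier.add(coll)
        frontier = next_frontier
    return graph
```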