This week I started a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen. But the course leader suggests that once we get past the first few weeks we can pick and choose to study the elements which are useful to us, so I’m hoping it will be manageable alongside the new teaching activities that I’ve got ahead. In the meantime, I’ll share with you my thoughts as each week progresses.

I signed up to the course way back at the beginning of summer, so that I would get a grounding in CL ready to undertake my project on Fake News and Facts in English ballads. It has become immediately apparent, however, that my little project is more Corpus Discourse Analysis than true CL – more on this another time!

The simplest definition of a corpus is a lot of words stored on a computer. But it is also a methodology for approaching language. The large amounts of data in a corpus can tell us about tendencies – what is normal or typical, and what is rare or exceptional – neither of which you can tell from looking at single texts. Computers are quicker and more accurate than humans at dealing with such large amounts of data.

When putting a corpus together or choosing an existing one, what you look at depends on your research questions; the corpus must be broadly representative of an appropriate type of language; and it must be in machine-readable form, such as a text file. It might be considered a standard reference for what’s typical, and it is often annotated with further linguistic information.

Next, we had a brief look at annotation (or tagging). The computer can’t tell what is a heading or where new paragraphs start. You might want to search for something just within titles, but the computer can’t tell the difference unless you tell it, and therefore can’t do what you want it to. A lot of this information is there in the control characters that make a document appear the way you want it to. Nor can the computer tell grammatical information, such as where a word begins and ends, or what part of speech each word is. These sorts of annotation allow you to tailor your searches – to look at what words follow a particular word, or to search just the headings – and they improve the quality of your results. In practice, this annotation is often done by computer.
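To make the part-of-speech side of this concrete, here is a minimal sketch using Python’s NLTK library – my own illustration, not a tool used on the course (model names can vary between NLTK versions):

```python
import nltk

# One-off model downloads (names can vary between NLTK versions)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "Prince Rupert rode towards the city."
tokens = nltk.word_tokenize(sentence)  # decides where each word begins and ends
tagged = nltk.pos_tag(tokens)          # guesses a part-of-speech tag for each token
print(tagged)
# e.g. [('Prince', 'NNP'), ('Rupert', 'NNP'), ('rode', 'VBD'), ...]
```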

There are two main types of corpora:

  • Specialist – with a limited focus in terms of time, place, language etc., which means a smaller amount of data
  • General – often with a much larger number of words and more representative of the language as a whole.

Other corpora include:

  • Multilingual – comparing different languages or different varieties of the same language
  • Parallel – comparing the same texts in translation
  • Learner – the language used by language learners
  • Historical or Diachronic – language used in the past
  • Monitor – which are continually being added to.

From here we moved on to look at some technical terms, including frequency data, which quickly shows how often words appear per million words in the corpus. Concordances show the context of each hit, and again this can be done quickly, so you can sort the contexts to reveal emerging patterns. Collocation is the tendency of words to co-occur, with meaning built up from those co-occurrences.
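To show what those raw numbers involve, here is a toy sketch in Python of per-million frequency and simple collocation counting – my own illustration, with a hypothetical file name, not how any real corpus tool implements these:

```python
from collections import Counter

def per_million(tokens, word):
    """Occurrences of `word` per million tokens."""
    return tokens.count(word) / len(tokens) * 1_000_000

def collocates(tokens, node, window=4):
    """Count words co-occurring within `window` tokens of `node`."""
    hits = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            hits.update(span)
    return hits

# 'newsbook.txt' is a stand-in for any plain-text corpus file
tokens = open("newsbook.txt").read().lower().split()
print(per_million(tokens, "rupert"))
print(collocates(tokens, "rupert").most_common(10))
```

Real corpus tools go further, weighting these raw co-occurrence counts with association measures such as mutual information or log-likelihood, but the underlying idea is the same.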

Then we thought a bit about the limitations of corpus data analysis. A corpus can’t tell us whether something is possible in language (ie, it can’t tell us about appropriate usage). We can only deduce things from corpus data; they aren’t facts in and of themselves, and although a corpus gives us evidence, it can’t give us explanations. Finally, corpora rarely present the related images alongside the text, or the body language, behaviours etc of speakers, so they present language out of context.

Then the course moved on to having a go at using #LancsBox, which we had to download and open – bizarrely, this was by far the hardest bit of the course so far, because it has to be run from the Applications folder (my Applications folder is dusty and otherwise unused, hidden somewhere in the attic of my machine and only located using the search function on the file manager). #LancsBox comes with various ready-prepared corpora that you can download, so I decided to have a go with the 1641-1660 English newsbooks. As I was working on my laptop rather than my desktop, it wasn’t the quickest import, but once it was done we had a go at using the KWIC function to create a concordance. You search for a word or ‘node’. Having found all the instances of the word ‘Rupert’, in my case, I could sort them by the context on the left or the right, or I could filter them by clicking on one side or the other and typing in a word (although if you want to filter by the word immediately preceding or following the node, you need to use the advanced filter function). But I had really jumped the gun, as the next activity was a walk-through in using one of the corpora to search for some given words! Still, it was fun to play around with it.
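For the curious, the core of a KWIC display can be imitated in a few lines of Python – this is just an illustration of the idea, not how #LancsBox works internally, and ‘newsbook.txt’ is again a stand-in file:

```python
def kwic(tokens, node, context=5):
    """Return (left, node, right) windows for every hit of `node`."""
    return [(tokens[max(0, i - context):i], tok, tokens[i + 1:i + 1 + context])
            for i, tok in enumerate(tokens) if tok.lower() == node.lower()]

tokens = open("newsbook.txt").read().split()
hits = kwic(tokens, "Rupert")

# Sorting on the right-hand context makes repeated patterns line up
for left, node, right in sorted(hits, key=lambda h: [w.lower() for w in h[2]]):
    print(f"{' '.join(left):>40}  {node}  {' '.join(right)}")
```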

There were several readings for the week – the first was interesting and talked, for example, about the distinction between types of corpora (monitor, balanced, snapshot). Most of the corpora in which I would be interested would, I think, be classified as opportunistic corpora, because something else comes into play when working with historical corpora – survival. So in some respects, historical corpora are self-selecting, because we can only create a corpus out of what survives. #LancsBox comes with the 1640-60 newspapers, but (I think – not quite my field so I’m not sure) they are a relatively ‘good’ source because Thomason made the decision to collect everything that he could of the printed ephemeral material relating to the civil war. Without collectors like him, other material is much more dependent on survival rates. Which isn’t to say that I don’t think CL is useful, just that there are extra (or maybe different) caveats about what it can tell us, so we need to be very aware of these when we interpret our data.

As I’ve never done any linguistics, the second chapter was much harder going and I didn’t really understand a lot of it!  Then things got even more complicated for me with a chapter on statistics for CL.

The final task before the optional activities was to think about the design aspects of our proposed work. Although my Fake News and Facts project uses concordances and collocation, I wanted to think about something on a rather bigger scale, so I imagined a larger corpus of ballad texts and news pamphlets, which could be searched for any features that were particularly common, or common to both, in the period prior to the civil war. The question is whether the data would be skewed by the inclusion of all ballads rather than exclusively topical ones. Restricting the corpus to topical ballads (and news pamphlets) would in itself be subjective, whereas including all ballads might show up other common features that we would not otherwise spot, so as far as I’m concerned the jury is still out on that question!

The corpus would be quite large, but as it is an opportunistic corpus based on only those texts which have survived, it would not be anything like as large as some that have been mentioned on the course. Annotation might be helpful, in terms of divisions between headings and body text, as this might highlight particularly sensational language which is used to grab attention, or news lexicon which highlights topicality, novelty and truth claims.
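As a sketch of how such annotation might be exploited, assuming headings were marked up with XML-style tags (the markup and the example text here are my own invention, purely for illustration):

```python
import re

text = """<head>Bloody Newes from the North</head>
<body>A true relation of the late fight neere Yorke ...</body>"""

# Keep only the text inside <head> tags, ignoring the body
headings = re.findall(r"<head>(.*?)</head>", text, flags=re.DOTALL)
print(headings)  # ['Bloody Newes from the North']
```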