I’m still working on a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen, given my new teaching commitments and other roles to juggle. In the meantime, I’ll share with you my thoughts as each week progresses.

This week, the course moved on to look at sociolinguistics. The first task was to think about what you usually notice about someone speaking a language, and whether it’s easy to recognise a particular language. Probably the first thing I notice is whether I recognise words and phrases, followed quickly by accent, which would tell me whether I’m listening to a native English speaker or not. Further down the line, I would notice the language they used (whether, for example, it was advanced or simple), the structure of their sentences, their tone of voice, any pauses, the speed at which they spoke, and the musicality of their voice. It’s only easy to recognise languages with which I am familiar – probably French, German, Italian and Spanish. Beyond that, I might be able to place a language in the right continent, but I wouldn’t be able to narrow it down.

The lectures began to move away from the pure definitions of language to the variations in the way language is used.

There are two main approaches to sociolinguistic variation:

  1. The traditional Labovian variationist approach tries to identify sociolinguistic variables – cases where two different ways of saying the same thing compete in similar contexts – and establishes how the choice between them is affected by context.
  2. Biber’s approach to language variation looks at the way speakers and writers choose from their inventory of lexical and grammatical features. These features do not need to have similar meanings or be in direct competition.

Vaclav Brezina then talked a bit about ‘hedges’ – phrases such as ‘you know’, ‘like’ and ‘well’ which mark out the informality of spoken English but are rarely used in the written language.

The next video asked us to look at a short transcript of a dialogue and work out the gender of the first speaker.  It was fascinating, because from the explicit information included in the text there was no way of knowing, but my gut feeling went in one particular direction. Apparently, this is based on patterns of pronouns used by male and female speakers!  So it seems to me that in fact, it’s less about gut feeling than it is about experience.

Another example came in the choice between using a possessive apostrophe and using ‘of’. It is an example of Labovian variation, but it made me chuckle, because it’s one of the things my supervisor used to point out that I did a lot: “La plume de ma tante…. we can cope with a possessive apostrophe, you know….” As a result, it’s something I pick up on a lot in student essays (even though I wouldn’t say it had been completely eradicated from my own writing!). Anyway, the question was which constructs (either internal, regarding the situation of the language, or external, regarding the speaker) favour ‘of’ over ‘apostrophe s’. Researchers used the British National Corpus 1994 and the British National Corpus 2014, generating a large dataset. Looking at the internal constructs, ‘of’ is favoured when both the thing being possessed and the possessor are inanimate, while ‘apostrophe s’ is favoured when both are animate.

Corpus linguistics brings large, representative corpora and ways to process them efficiently, while sociolinguistics attends to the contexts and how they create variations.  Between them they can tell us much more than either one alone.

The practical sessions moved on to using two new tools.  The first was BNClab, an online tool which allows you to search the spoken sections of BNC 1994 and BNC 2014.  Although this is undoubtedly really interesting, I’m not studying recent language use, so I haven’t really spent much time on this.

The second tool was another corpus processing tool, CQPweb. This looks much more my sort of thing, as it has some Historical English corpora, such as the Paston letters and EEBO, installed. However, as the EEBO corpus is behind an institutional login, I haven’t quite worked out how to get into it yet! Other than that, it seems rather easier to use than #LancsBox, if only in the way that you can do a ‘new query’ – something I found really quite difficult in #LancsBox as I couldn’t work out how to clear the results of a completed search. The second video tutorial introduced us to different types of Standard Query, including ways to search just for the number of hits; or for the exact combination of upper- and lower-case letters; or to restrict the search to particular parts of the corpus. There are shortcuts which allow you to search within the same restrictions repeatedly without having to keep resetting them.

One of the activities was to have a look at the use of colour words in the spoken British National Corpus 2014 to see how males and females talk about colours.  I searched for ‘green’ and got the following results:

  • Green> Male: 554 matches in 203 different texts (in 4,348,982 words [471,572 Utterance units]; frequency: 127.39 instances per million words).
  • Green> Female: 809 matches in 322 different texts (in 7,072,249 words [725,226 Utterance units]; frequency: 114.39 instances per million words).
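
As a quick sanity check on how those "instances per million words" figures come about, here is a minimal Python sketch of the arithmetic. The function name `per_million` is my own (it is not part of CQPweb), and I am assuming the tool simply divides the number of matches by the size of the sub-corpus and scales the result up to a million words; the raw counts are the ones reported above.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Relative frequency: matches per million words of the (sub)corpus."""
    return hits / corpus_words * 1_000_000


# Raw counts for 'green' as reported above for the spoken BNC 2014
male = per_million(554, 4_348_982)
female = per_million(809, 7_072_249)

print(f"Male:   {male:.2f} per million words")
print(f"Female: {female:.2f} per million words")
```

Normalising like this is what makes the two groups comparable. The female speakers produced more matches in absolute terms, but they also contributed far more words, so their relative frequency comes out slightly lower than the males'.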

Judging just by the front page of randomised hits, females tended to use the word ‘green’ in more colour-related conversation, whereas with males it was sometimes part of a bigger term that had no alternative – ‘green card’, or the rather more perplexing “a never green tree”, which I wondered might have been a mistranscription of “an evergreen tree”. I didn’t have time to look at the context to check what was going on there!

I then went on to have a quick look at less mainstream words for colours – the more poetic ones, if you like, that relate to the shades or different aspects of colour. No male speakers used the term ‘turquoise’, but for females there were 12 matches in 10 different texts (in 11,422,617 words [1,251 texts]; frequency: 1.05 instances per million words). And neither males nor females used ‘Azure’ at all.

I am really enjoying this course, and I’d love to be able to spend more time on it, but I’m beginning to struggle to keep up – it’s week 2 of term at Lancaster and, again, I’m teaching new-to-me courses. This means that I’m constantly playing catch-up from week to week, reading up on what I’m teaching, and the corpus linguistics course has to take a back seat.