October 2019


I’m still working on a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen, given my new teaching commitments and other roles to juggle. In the meantime, I’ll share with you my thoughts as each week progresses.

It’s getting difficult to keep up with the CL course now, as we’re well into term time and I’ve got a lot going on. This week we started to look at the use of CL in language learning, where corpora are used to identify the aspects of language which will be most useful – the frequently occurring words are the ones people are most likely to come across.  In the early days (early 20th century), much of this was based on written rather than spoken texts, and often on ‘canonical texts’ – such as the Bible or 19th century novels – so they were hardly up to date themselves.  This reflected the fact that they were created with the aim of teaching people to read, not to speak, a language. In the 1950s there was a move away from teaching vocabulary towards teaching grammar: rules which would help people apply knowledge in all sorts of situations.

In 1963, George used a corpus of about 100,000 verbs and sorted them according to about 168 categories depending on their form and use.  He discovered that 10 verb forms account for about 61% of all verb use.  They were not the ones that were being taught!  Since the 1980s, researchers have tended to study language use in particular situations, such as academic writing, arguing that by understanding how academic writing works we can better teach it to others.  The other area being studied is whether language textbooks teach things in the right order, introducing students to frequently used language earlier.  By introducing learners to the key, frequently used vocabulary, we teach them a large amount of language use with only a small amount of work.  This does raise several questions:

  • Whether everyone actually wants to learn language for use in that context – I, for example, have never really been interested in Spanish as a conversational language, because I need to be able to read it.
  • Different ways of learning suit different people.
  • Is it colonialist to force non-native speakers to copy native speakers, devaluing their language use?
  • Some frequently used words and concepts might be quite difficult to learn.

I have downloaded this week’s CQPweb videos – they are rather long and as I have already learned how to do the basics, I figured I could store the more advanced videos away in case I ever needed them!

What I did do this week, which I haven’t done before, was to watch some of the ‘in conversation’ videos, with Sally Bushell, Ian Gregory and Steve Pumfrey.

This week, the course moved on to look at sociolinguistics. The first task was to think about what you usually notice about someone speaking a language, and whether it’s easy to recognise a particular language.  Probably the first thing I notice is whether I recognise words and phrases, followed quickly by accent, which would tell me whether I’m listening to an English-speaker or not.  Further down the line, I would notice the language they used – whether, for example, it was advanced or simple; the structure of their sentences, their tone of voice, any pauses or the speed at which they spoke, and the musicality of their voice.  It’s only easy to recognise languages with which I am familiar – probably French, German, Italian and Spanish.  Beyond that, I might be able to place a language in the right continent, but I wouldn’t be able to narrow it down.

The lectures began to move away from the pure definitions of language to the variations in the way language is used.

There are 2 main approaches to sociolinguistic variation:

  1. The traditional Labovian variationist approach, which tries to identify sociolinguistic variables – the competition between two different ways of saying the same thing which can happen in similar contexts – and establishes how those choices are affected by their contexts.
  2. Biber’s approach to language variation, which looks at the way speakers and writers choose from their inventory of lexical and grammatical features.  These do not need to have similar meanings which could be in direct competition.

Vaclav Brezina then talked a bit about ‘hedges’ – phrases such as ‘you know’, ‘like’ and ‘well’ which mark out the informality of spoken English but are rarely used in the written language.

The next video asked us to look at a short transcript of a dialogue and work out the gender of the first speaker.  It was fascinating, because from the explicit information included in the text there was no way of knowing, but my gut feeling went in one particular direction. Apparently, this is based on patterns of pronouns used by male and female speakers!  So it seems to me that in fact, it’s less about gut feeling than it is about experience.

Another example came in the use of a possessive apostrophe, or not. It is an example of Labovian variation, but it made me chuckle, because it’s one of the things my supervisor used to point out that I did a lot: “La plume de ma tante…. we can cope with a possessive apostrophe, you know….”  As a result, it’s something I pick up on a lot in student essays (even though I wouldn’t say it had been completely eradicated from my own writing!).  Anyway, the question was which constructs (either internal regarding the situation of the language or external regarding the speaker) favour ‘of’ over ‘apostrophe s’. Researchers used the British National Corpus 1994 and the British National Corpus 2014, generating a large dataset.  The internal constructs mean that ‘of’ is favoured when both the thing being possessed and the possessor are inanimate, while ‘apostrophe s’ is favoured when both are animate.

Corpus linguistics brings large, representative corpora and ways to process them efficiently, while sociolinguistics attends to the contexts and how they create variations.  Between them they can tell us much more than either one alone.

The practical sessions moved on to using two new tools.  The first was BNClab, an online tool which allows you to search the spoken sections of BNC 1994 and BNC 2014.  Although this is undoubtedly really interesting, I’m not studying recent language use, so I haven’t really spent much time on this.

The second tool was another corpus processing tool, CQPweb.  This looks much more my sort of thing, as it has some Historical English corpora, such as the Paston letters and EEBO, installed. However, as the EEBO corpus is behind an institutional login, I haven’t quite worked out how to get into it yet!  Other than that, it seems rather easier to use than #LancsBox, if only in the way that you can do a ‘new query’ – something I found really quite difficult in #LancsBox as I couldn’t work out how to clear the results of a completed search.  The second video tutorial introduced us to different types of Standard Query, including ways to search just for the number of hits; or for the exact combination of upper- and lower-case letters; or to restrict the search to particular parts of the corpus.  There are short cuts which allow you to search within the same restrictions repeatedly without having to keep resetting them.

One of the activities was to have a look at the use of colour words in the spoken British National Corpus 2014 to see how males and females talk about colours.  I searched for ‘green’ and got the following results:

  • Green> Male: 554 matches in 203 different texts (in 4,348,982 words [471,572 Utterance units]; frequency: 127.39 instances per million words).
  • Green> Female: 809 matches in 322 different texts (in 7,072,249 words [725,226 Utterance units]; frequency: 114.39 instances per million words).
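For anyone who wants to check how those 'instances per million words' figures come about, here is a minimal sketch in Python. The function name `per_million` is my own invention rather than anything from CQPweb; the numbers are simply the ones reported in the results above.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Relative frequency: hits per million words of the (sub)corpus searched."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits / corpus_words * 1_000_000


# Figures copied from the 'green' query results above.
male = per_million(554, 4_348_982)
female = per_million(809, 7_072_249)

print(f"male:   {male:.2f} per million words")
print(f"female: {female:.2f} per million words")
```

The point of normalising is visible straight away: female speakers produce more hits in absolute terms, but the female subcorpus is much bigger, so the male rate per million words is actually the higher of the two.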

Judging just by the front page of randomised hits, females tended to be using the word green in more colour-related conversation, whereas sometimes with males it was part of a bigger term that had no alternative – ‘green card’, or the rather more perplexing “a never green tree” – which I did wonder if it might have been a mistranscription of “an evergreen tree”. I don’t have time to look at the context to check what was going on there!

I then went on to have a quick look at less mainstream words for colours – the more poetic ones, if you like, that relate to the shades or different aspects of colour. No male speakers used the term ‘turquoise’, but for females there were 12 matches in 10 different texts (in 11,422,617 words [1,251 texts]; frequency: 1.05 instances per million words). And neither males nor females used ‘Azure’ at all.

I am really enjoying this course, and I’d love to be able to spend more time on it, but I’m beginning to struggle to keep up – it’s week 2 of term at Lancaster and again, I’m teaching new-to-me courses. This means that I’m constantly playing catch up from week to week, reading up on what I’m teaching, and the corpus linguistics course has to take a back seat.

This week started by building a corpus of our own texts.  I estimated that my academic writings (published and as yet unpublished) probably consisted of about 160,000 words.  It turns out to be just over 176,000, so I didn’t think my guess was way out!  I decided to use only my written papers, even though my articles are often generated from papers presented at conferences, because I wanted to make sure that they were all using a similar register and aimed at a similar audience.

A corpus is a sample of the language you would like to represent, ideally covering as much of the language used outside the corpus as possible – it needs to be balanced and unbiased.  We used three corpora created by Lancaster University, looking at how they were designed.

The first step is to decide what the corpus is designed to represent.

The British National Corpus 2014 updates the British National Corpus from 1994.  It was designed to include places where British English was spoken and places where British English appears.  It was intended to be a ‘national’ corpus, including all genres and registers. The word ‘Corpus’ shows that it was systematically sampled using robust, rigorous procedures.  2014 was the mid-point of a sampling process which took place between 2010 and 2017.  The final corpus consists of 90% written English and 10% spoken English, selected using stratified random sampling.

The next step is to look at the development of a corpus.

The Trinity Lancaster corpus was a joint project to collect the spoken English of learner English speakers (L2 speakers, i.e. those using it as a second language).  Basic data about the language speakers was recorded, and the recordings were transcribed and annotated in XML with information about the speaker and with part-of-speech tags.

Finally, we have to think about annotation.

The Guangwai Lancaster corpus of L2 Chinese is fully error tagged using a standardised format, in order to ascertain the specific features of L2 Chinese and the mistakes which are commonly made.

When building your own corpus, the course leader Vaclav Brezina had 5 tips:

  1. Start with corpus design – think carefully about the type of language the corpus represents, and how you might achieve a representative sample.
  2. Keep notes – keep a record of what you included, excluded and why.
  3. Save texts as separate files so that you can look at the distribution of linguistic features in different types of texts or parts of the corpus.
  4. Check the accuracy of the data to make sure there are no mistakes or extraneous matter such as HTML code.
  5. Select a suitable tool to analyse the corpus.

The practical session helped us collect our own mini corpus of newspaper texts by selecting a topic and an online newspaper and searching it using a search engine.  Then the texts are copied and pasted individually into a text editor and saved as a text file with the extension .txt with UTF-8 encoding.
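We did this by hand in a text editor, but the same steps could be scripted. The following is only a sketch in Python under my own assumptions: the file names, the example texts and the helper functions `save_corpus` and `files_with_html` are all hypothetical, and the HTML check is a crude pattern match rather than a proper clean-up.

```python
import re
import tempfile
from pathlib import Path

TAG_PATTERN = re.compile(r"<[^>]+>")  # crude check for leftover HTML tags


def save_corpus(texts: dict[str, str], folder: Path) -> list[Path]:
    """Write each text to its own UTF-8 .txt file and return the paths."""
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in texts.items():
        path = folder / f"{name}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


def files_with_html(paths: list[Path]) -> list[str]:
    """Return the names of files that still seem to contain HTML tags."""
    return [p.name for p in paths if TAG_PATTERN.search(p.read_text(encoding="utf-8"))]


# Hypothetical pasted articles; the second still has markup in it.
articles = {
    "article_01": "The council met on Tuesday to discuss the café’s licence.",
    "article_02": "<p>Plans for the new road were approved.</p>",
}

corpus_dir = Path(tempfile.mkdtemp()) / "mini_corpus"
saved = save_corpus(articles, corpus_dir)
flagged = files_with_html(saved)
print(flagged)
```

Keeping one file per text follows tip 3 above, and the flagged list is a quick way of doing the accuracy check in tip 4 before loading anything into an analysis tool.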

The first activity this week was to look at the language used around a term such as ‘refugee’ in a newspaper article.  I looked at an article from a recent edition of the Guardian on President Trump’s changes to the US refugee programme. It referred to them as refugees, displaced people, and talked about them in comparison to ‘people seeking asylum at the southern border’ before pointing out that the asylum program and the refugee program were separate, which implies that the two categories are not the same (I know they aren’t legally – my point is that it doesn’t make this explicit in the text and you are left to work it out). It gives examples of groups with whom we might be expected to be sympathetic (those fleeing religious persecution, for example) and a case study and quotes from a resettled refugee who had contributed to American society. The language around here is positive: ‘love’, ‘safely’ and ‘allowed’. There is an interesting contrast though, towards the end of the article, where the positive language of contribution is replaced by language such as ‘loss’, ‘sad’, ‘shut down’, ‘difficult’ and ‘complicated’, when describing what life is going to be like in the future.

The week then used an ESRC-funded research project on how British newspapers talk about refugees and asylum seekers.  The focus was methodological, looking at how Corpus Linguistics (CL) might contribute to critical discourse analysis.

Tony McEnery, the course tutor, described how the project team put together the corpus on which the study was based.  First they put together a pilot corpus of texts about refugees and asylum seekers and looked at what words became key in that corpus.  This helped them to compose a query string which could be used to search huge corpora of newspaper articles for relevant material.  Then they split the data into two corpora – one of tabloid and one of broadsheet journalism. They looked carefully at the number of articles and words in each of the two corpora, explaining the difference by pointing out that tabloid articles are usually shorter than broadsheet ones. Moving on to think about how CL contributes to critical discourse analysis, he introduced the idea of the topos (plural = topoi) – a broad theme in data, although according to the course website, ‘In critical discourse analysis it usually has a slightly different sense of ‘warrant for an argument’ rather than theme’.

The two corpora were tested to see which keywords were associated with the query terms.  As well as looking at the overall picture of what keywords were most commonly associated with the query terms in the broadsheets and the tabloids, this could be done on a more focussed basis by looking at, for example, the words used to describe the arrival of refugees.  So the keywords help to shape the topoi, but also, the discourse was created mainly by the tabloids – almost all the keywords were dominated by the tabloids, except in the category of ‘plight’, where the language used was shared by both but the broadsheet newspapers had more. 
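A keyword in this sense is a word that turns up unusually often in one corpus compared with another. My notes don't record which statistic the project used, so the sketch below assumes log-likelihood, which is one common choice in corpus linguistics, purely for illustration. The function name and the counts are invented.

```python
import math


def log_likelihood(freq_a: int, size_a: int, freq_b: int, size_b: int) -> float:
    """Log-likelihood keyness of a word in corpus A compared with corpus B."""
    total = freq_a + freq_b
    # Expected counts if the word were spread evenly across both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Invented counts: a word seen 100 times in one million-word corpus and 50 times in another.
score = log_likelihood(100, 1_000_000, 50, 1_000_000)
# The same word used equally often in both corpora is not 'key' at all.
same = log_likelihood(80, 1_000_000, 80, 1_000_000)
print(round(score, 2), round(same, 2))
```

The bigger the score, the less likely it is that the difference between the two corpora is down to chance, which is what lets a word be called 'key' for the tabloids or the broadsheets.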

The next video looked at whether the words which turned up most frequently in the articles were collocates.  There were some collocates relating to the number of refugees and the theme of plight, and these were found across both tabloid and broadsheet newspapers.  The team then looked at words clustered to the right of ‘illegal’, where it might be acting as a modifying adjective.  The theme of illegality was more emblematic of the tabloids, with some especially strong tabloid clustering with ‘immigration and’ – the conjunction ‘and’ was forcing discourses together.  In comparing the two corpora, the use of particular words and clusters had to be normalised per million words because the broadsheet corpus was much larger.
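To get a feel for what 'words clustered to the right' means in practice, here is a small Python sketch that counts whichever word immediately follows a chosen node word. It is my own simplification, not the project's method: the sentences are invented, the tokenisation is very rough, and real tools would also apply a statistical measure rather than just counting.

```python
import re
from collections import Counter


def right_neighbours(text: str, node: str) -> Counter:
    """Count the word that immediately follows each occurrence of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(
        following
        for current, following in zip(tokens, tokens[1:])
        if current == node
    )


# Invented sentences, not taken from the project's corpora.
sample = (
    "Illegal immigrants were found in a lorry. "
    "Ministers promised action on illegal immigration and asylum. "
    "The illegal immigrants were later released."
)

counts = right_neighbours(sample, "illegal")
print(counts.most_common())
```

Counts like these would then need normalising per million words before the tabloid and broadsheet corpora could be compared fairly.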

Step 4 looked at a particular cluster, ‘pose as’ – both how it was used on its own and how it was used in proximity to refugees or asylum seekers.  The tabloids used the phrase far more often than the broadsheets (normalised per million words) and especially so in relation to refugees and asylum seekers.  The course also noted that in the tabloids, the phrase was reported as fact not opinion and was closely associated with a negative stance, with no space given to an opposing side.  Another interesting cluster was thrown up by ‘X pose as’ plus a statement of status such as ‘asylum seeker’, which was used to show faults in the asylum system.

The final video looked into direct or indirect presentation. Direct presentation is when a quality is directly attributed to something through modification. For example, in the phrase ‘illegal immigrants suffocated’, modifying ‘immigrants’ with the adjective ‘illegal’ attributes that quality to them directly.  In indirect presentation, more general or oblique references imply the same thing – words such as ‘trafficked’ or phrases such as ‘sneaking in’.

After the quiz, we moved on to the hands-on section of the course in #LancsBox. At least I would have done, had I been able to make it work. Because here again I was hit with the same problem as in week 1: #LancsBox has to be run from a folder which will allow you to make changes… and that’s not as simple as it sounds. I’ve changed the machine on which I’m working, so I had to download the software again. After over an hour of searching around the internet, the course forums, my own notes, my computer folders, as well as several failed attempts to extract and run the files, I finally got it working on my Desktop.
