I’m still working on a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen, given my new teaching commitments and other roles to juggle. In the meantime, I’ll share with you my thoughts as each week progresses.

This week started by building a corpus of our own texts. I estimated that my academic writings (published and as yet unpublished) probably consisted of about 160,000 words. It turns out to be just over 176,000, so my guess wasn't far out! I decided to use only my written papers, rather than the conference presentations from which many of my articles originate, because I wanted to make sure they all used a similar register and were aimed at a similar audience.
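
For anyone wondering how to get a word count like that, here is a minimal sketch in Python, assuming the papers are saved as plain-text files in a folder (the papers/ path is just a placeholder):

```python
from pathlib import Path

# Rough word count across a folder of plain-text files.
# "papers" is a placeholder path; whitespace splitting is a crude
# tokenisation, so treat the result as approximate.
total = 0
for path in Path("papers").glob("*.txt"):
    total += len(path.read_text(encoding="utf-8").split())

print(f"Total words: {total:,}")
```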

A corpus is a sample of the language you would like to represent, ideally generalising as far as possible to the language used outside the corpus – it needs to be balanced and unbiased. We used three corpora created by Lancaster University, looking at how they were designed.

The first step is to decide what the corpus is designed to represent.

The British National Corpus 2014 updates the British National Corpus of 1994. It was designed to sample the contexts in which British English is spoken and those in which it appears in writing. It was intended to be a ‘national’ corpus covering all genres and registers, and the word ‘Corpus’ signals that it was systematically sampled using robust, rigorous procedures. 2014 was roughly the mid-point of a sampling process which took place between 2010 and 2017. The final corpus contains 90% written English and 10% spoken English, collected by stratified random sampling.

The next step is to look at the development of a corpus.

The Trinity Lancaster Corpus was a joint project to collect the spoken English of L2 speakers – that is, people who speak English as a second language. Basic data about each speaker was recorded, and the recordings were transcribed and annotated in XML with information about the speaker as well as part-of-speech tags.
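
To give a flavour of what that kind of annotation looks like, here is a toy XML fragment and a few lines of Python to read it. The element and attribute names are my own invention for illustration, not the actual Trinity Lancaster scheme:

```python
import xml.etree.ElementTree as ET

# A toy utterance in the spirit of the corpus's XML annotation.
# Element names, attributes and the tagset are invented for
# illustration; this is not the real Trinity Lancaster markup.
sample = """
<utterance speaker="S1" l1="Spanish" level="B2">
  <w pos="PRON">I</w>
  <w pos="VERB">like</w>
  <w pos="NOUN">films</w>
</utterance>
"""

root = ET.fromstring(sample)
print("Speaker:", root.get("speaker"), "| L1:", root.get("l1"))
for w in root.iter("w"):
    print(f"{w.text}\t{w.get('pos')}")
```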

Finally, we have to think about annotation.

The Guangwai Lancaster Corpus of L2 Chinese is fully error-tagged using a standardised format, in order to ascertain the specific features of L2 Chinese and the mistakes which are commonly made.
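
I haven't seen the Guangwai Lancaster tagging scheme itself, but error tagging in general looks something like the toy fragment below (using English for readability; the markup and error typology are invented):

```python
import re

# A toy error-tagged sentence. The <err> markup and the error
# typology are invented for illustration; the Guangwai Lancaster
# Corpus uses its own standardised scheme.
sample = 'I <err type="verb-form" corr="went">goed</err> home yesterday.'

for m in re.finditer(r'<err type="([^"]*)" corr="([^"]*)">([^<]*)</err>', sample):
    err_type, correction, original = m.groups()
    print(f"{original!r} -> {correction!r} ({err_type})")
```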

For building your own corpus, the course leader Vaclav Brezina offered 5 tips:

  1. Start with corpus design – think carefully about the type of language the corpus represents, and how you might achieve a representative sample.
  2. Keep notes – record what you included, what you excluded, and why.
  3. Save texts as separate files so that you can look at the distribution of linguistic features in different types of texts or parts of the corpus.
  4. Check the accuracy of the data to make sure there are no mistakes or extraneous matter such as leftover HTML code (a quick way to flag such residue is sketched after this list).
  5. Select a suitable tool to analyse the corpus.
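
On tip 4, here is a minimal sketch of the kind of sanity check I have in mind – it flags files that may still contain HTML tags, entities or encoding damage. The corpus/ folder name is a placeholder and the patterns are rough, so it won't catch everything:

```python
import re
from pathlib import Path

# Flag files that may still contain HTML residue or encoding damage.
# "corpus" is a placeholder folder name; the checks are deliberately
# rough and will not catch every problem.
for path in Path("corpus").glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="replace")
    if re.search(r"<[^>]+>|&\w+;", text):
        print(f"{path.name}: possible leftover HTML")
    if "\ufffd" in text:
        print(f"{path.name}: possible encoding problem")
```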

The practical session helped us collect our own mini corpus of newspaper texts by selecting a topic and an online newspaper and searching it using a search engine. We then copied and pasted each text individually into a text editor and saved it as a plain-text file with the extension .txt and UTF-8 encoding.
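
If you would rather script that saving step than do it file by file in a text editor, something like this works – the article text and filenames below are placeholders for whatever you have collected:

```python
from pathlib import Path

# Save one article per file, UTF-8 encoded, as the course recommends.
# The dictionary stands in for text copied from the newspaper pages;
# the filenames are placeholders (see tip 2 about keeping notes).
articles = {
    "guardian_climate_01.txt": "Pasted text of the first article...",
    "guardian_climate_02.txt": "Pasted text of the second article...",
}

outdir = Path("mini_corpus")
outdir.mkdir(exist_ok=True)
for name, text in articles.items():
    (outdir / name).write_text(text, encoding="utf-8")
```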