This week I started a FutureLearn/Lancaster University course on Corpus Linguistics (CL). It runs for 8 weeks and is much more work than any of the previous FutureLearn courses that I have undertaken, so whether I’ll get to the end of it remains to be seen. But the course leader suggests that once we get past the first few weeks we can pick and choose to study the elements which are useful to us, so I’m hoping it will be manageable alongside the new teaching activities that I’ve got ahead. In the meantime, I’ll share with you my thoughts as each week progresses.

I signed up to the course way back at the beginning of summer, so that I would get a grounding in CL ready to undertake my project on Fake News and Facts in English ballads. It has become immediately apparent, however, that my little project is more Corpus Discourse Analysis than true CL – more on this another time!

The simplest definition of a corpus is a lot of words stored on a computer. But it is also a methodology for approaching language. Large amounts of data in a corpus can tell us about tendencies – what is normal or typical, and what is rare or exceptional – neither of which you can tell from looking at single texts. Computers are quicker and more accurate than humans in dealing with such large amounts of data.

When putting a corpus together or choosing one to use, what you look at depends on your research questions; it must be broadly representative of an appropriate type of language; and it must be in machine-readable form, such as a text file. It might be considered a standard reference of what's typical, and it is often annotated with further linguistic information.

Next, we had a brief look at annotation (or tagging). The computer can't tell what is a heading or where new paragraphs start – much of this information sits in the control characters that make a document appear the way you want it to. So if you want to search for something just within titles, the computer can't do it unless you tell it where the titles are. Nor can it tell grammatical information, such as where a word begins and ends, or what part of speech each word is. These sorts of annotation allow you to tailor your request – to look at what words follow a particular word, or to search only the headings – and they improve the quality of your searches. In practice, this annotation is often done by computer.
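To make this concrete, here is a toy sketch of my own (not from the course, and the tag scheme is invented rather than a real annotation standard) showing how structural and part-of-speech tags let a search be confined to headings:

```python
import re

# Toy annotated text: <head>/<p> tags mark the structure, and each word
# carries an invented part-of-speech tag, e.g. _NN for noun, _JJ for adjective.
annotated = """<head>Strange_JJ Newes_NN from_IN Yorke_NP</head>
<p>The_AT armie_NN marched_VBD north_RB ._.</p>
<head>A_AT True_JJ Relation_NN</head>
<p>Prince_NP Rupert_NP rode_VBD out_RP ._.</p>"""

# Without the <head> tags, the computer cannot tell headings from body
# text; with them, a search can be confined to headings alone.
headings = re.findall(r"<head>(.*?)</head>", annotated)

# The POS tags then let us pull out, say, only the nouns in headings.
nouns = [word.split("_")[0]
         for h in headings
         for word in h.split()
         if word.endswith("_NN")]
print(nouns)  # ['Newes', 'Relation']
```

Real corpora use standardised schemes (TEI-style markup, CLAWS-style POS tagsets and the like), but the principle is the same: the annotation turns invisible structure into something searchable.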

There are two main types of corpora:

  • Specialist – with a limited focus of time, place, language etc., which creates a smaller amount of data
  • General – often with a much larger number of words and more representative.

Other corpora include:

  • Multilingual – comparing different languages or different varieties of the same language
  • Parallel – comparing the same texts in translation
  • Learner – the language used by language learners
  • Historical or Diachronic – language used in the past
  • Monitor – which are continually being added to.

From here we moved on to look at some technical terms, including frequency data, which quickly shows how often words appear per million words in the corpus. Concordances show the context of the hits and, again, can be produced quickly, so you can sort the context according to emerging patterns. Collocation is the tendency of words to co-occur, with meaning built up from those co-occurrences.
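As a rough illustration (my own sketch, not the course's tools), frequency per million words and a simple KWIC concordance can be computed in a few lines:

```python
# Toy text standing in for a corpus; real corpora run to millions of words.
text = ("the prince rode to the city and the prince spoke "
        "to the people of the city")
tokens = text.split()

def per_million(word, tokens):
    """Relative frequency: hits per million words, so corpora of
    different sizes can be compared directly."""
    return tokens.count(word) / len(tokens) * 1_000_000

def kwic(node, tokens, width=2):
    """KWIC concordance: each hit of the node word with its left and
    right context, ready to be sorted for emerging patterns."""
    return [(tokens[max(i - width, 0):i], tokens[i + 1:i + 1 + width])
            for i, w in enumerate(tokens) if w == node]

print(per_million("prince", tokens))  # 125000.0 (2 hits in 16 tokens)
for left, right in kwic("prince", tokens):
    print(" ".join(left), "[prince]", " ".join(right))
```

The per-million normalisation is what lets you say a word is unusually frequent in one corpus compared with another, regardless of their raw sizes.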

Then we thought a bit about the limitations of corpus data analysis. A corpus can't tell us whether something is possible in language (i.e., it can't tell us about appropriate usage). We can only deduce things from corpus data – they aren't facts in and of themselves – and although a corpus gives us evidence, it can't give us explanations. Finally, corpora rarely present the related images alongside the text, or the body language, behaviours etc. of speakers, so they present language out of context.

Then the course moved on to having a go at using #LancsBox, which we had to download and open – bizarrely, this was by far the hardest bit of the course so far, because it has to be run from the Applications folder (my Applications folder is dusty and otherwise unused, hidden somewhere in the attic of my machine and only located using the search function in the file manager). #LancsBox comes with various ready-prepared corpora that you can download, so I decided to have a go with the 1641-1660 English newsbooks. As I wasn't working on my desktop but my laptop, it wasn't the quickest import, but once it was done we had a go at using the KWIC function to create a concordance. You search for a word or 'node'. Having found all the instances of the word 'Rupert', in my case, I could sort them by the context on the left or the right, or I could filter them by clicking on one side or the other and typing in a word (although if you want to filter by the word immediately preceding or following the node, you need to use the advanced filter function). But I had really jumped the gun, as the next activity was a walk-through using one of the corpora to search for some given words! Still, it was fun to play around with it.
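Under the hood, the sorting and filtering I was playing with amount to something like the following sketch – toy data and my own guesswork, not how #LancsBox actually implements it:

```python
# Each concordance line: (left context, node, right context). Toy data.
lines = [
    (["when", "prince"], "Rupert", ["marched", "from"]),
    (["said", "that"], "Rupert", ["was", "there"]),
    (["against", "prince"], "Rupert", ["and", "his"]),
]

# Sorting by the first word of the right context (position R1)
# groups recurring patterns after the node together.
by_right = sorted(lines, key=lambda kw: kw[2][0])

# Filtering to hits where 'prince' sits immediately before the node
# (position L1) is the kind of job the advanced filter does.
l1_prince = [kw for kw in lines if kw[0][-1] == "prince"]

for left, node, right in by_right:
    print(" ".join(left), f"[{node}]", " ".join(right))
```

Seen this way, a concordance is just a list of context tuples, and sorting or filtering on a particular position is what makes the patterns jump out.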

There were several readings for the week – the first was interesting and talked, for example, about the distinction between types of corpora (monitor, balanced, snapshot). Most of the corpora in which I would be interested would, I think, be classified as opportunistic corpora, because something else comes into play when working with historical corpora – survival. So in some respects, historical corpora are self-selecting, because we can only create a corpus out of what survives. #LancsBox comes with the 1640-60 newsbooks, but (I think – not quite my field so I'm not sure) they are a relatively 'good' source because Thomason made the decision to collect everything he could of the printed ephemeral material relating to the civil war. Without collectors like him, other material is much more dependent on survival rates. That isn't to say that I don't think CL is useful, just that there are extra (or maybe different) caveats about what it can tell us, so we need to be very aware of these when we interpret our data.

As I’ve never done any linguistics, the second chapter was much harder going and I didn’t really understand a lot of it!  Then things got even more complicated for me with a chapter on statistics for CL.

The final task before the optional activities was to think about the design aspects of our proposed work. Although my Fake News and Facts project uses concordances and collocation, I wanted to think about something on a rather bigger scale, so I imagined a larger corpus of ballad texts and news pamphlets to search for any features that were particularly common, or common to both, in the period prior to the civil war. The question is whether the data would be skewed by the inclusion of all ballads rather than exclusively topical ones. The restriction to topical ballads (and news pamphlets) would in itself be subjective, whereas the inclusion of all ballads might show up other common features that we would not otherwise spot, so as far as I'm concerned the jury is still out on that question!

The corpus would be quite large, but as it is an opportunistic corpus based on only those texts which have survived, it would not be anything like as large as some that have been mentioned on the course. Annotation might be helpful, in terms of divisions between headings and body text, as this might highlight particularly sensational language which is used to grab attention, or news lexicon which highlights topicality, novelty and truth claims. 

Before the summer vacation I completed the FutureLearn Blended Learning Essentials – Getting Started course, so when it finished I thought it probably made sense to sign up for the follow-up course on Embedding Practice.  As someone who teaches on a blended learning degree course delivered by Liverpool Hope University through Holy Cross College in Bury, these courses are providing useful Continuing Professional Development.  But in the spirit of using technology to assist learning, I have decided again to keep my reflective log here on the blog.

The course leaders define blended learning as

an appropriate mix of face-to-face and online learning activities, using traditional instruction, guided support and independent learning, underpinned by the use of digital technologies and designed using strong pedagogical principles, to support learner engagement, flexibility and success.

One of the first activities asked what learners think about blended learning.  Well, as you might expect, some answers were more nuanced than others:

  • It generates more enthusiasm because it's more fun
  • It generates competition between students which makes them want to do well
  • It makes it easy to collaborate, because students can work on one thing together but on their own computers
  • Some identified it as significantly better than pen, paper and textbook because you can do things that they can't do – demonstrations, for example.
  • It offers the ability to check what you need to do whenever you like (at undergraduate level it has been normal for years to provide this kind of information in a course handbook, although I accept that at school it was less so).
  • You can use interactive quizzes to test yourself. This helps you to find flaws and gives you an immediate response.
  • It allows you to complete things at your own speed. (I’m not sure I agree with that – you still have to keep alongside the rest of the class in order to progress with class activities).
  • Foremost, it is easily accessible – it makes education available when you want, where you want.  There are disadvantages to this though: if you’re trying to do all your studying on the bus, it might not be the optimal learning environment.

The video made the point that blended learning works well at the college because it is “carefully managed and well supported.  They’ve embedded it into their practice by engaging staff and listening to learners”.

The next activity asked me to suggest ways in which teachers could collect evidence of learning during a session teaching aromatherapy (yes, it's a bit far removed from academic history, but I knew that when I signed up, and I'm sure I'll have more to say later on the implications of moving the techniques from vocational FE to academic HE). First off, they can make sure that people appear to be paying attention to the teacher speaking. During the planning and practical activities they could walk round the classroom and watch what the students were doing, which helps the teacher to see whether they have understood the passive, listening activity. They could also collect in the recipes of three suggested mixtures. The course leaders also pointed out that the computer aromatherapy programme gives the students instant visual feedback on their choices, while promoting discussion and planning on the part of the students. Here again, the teacher can listen to what the students say to identify evidence of how well they are learning the concepts they need. In summary

The pie chart feedback encourages experimenting. The digital tool records their sequence of experiments. The collaborative task promotes discussion, which the tutor can reflect on. And the students record their own results, which provides evidence of what they have learned.

The sixth lesson was to look at an existing learning plan and see how technology could be used to enhance it. It showed that by adding blended learning activities to the vocational lesson, the students were engaged in more active learning. Now, I still have a bit of a problem with how to adapt some of these activities for the sorts of higher-level, critical-thinking skills that I am trying to encourage in my students. One bit of advice did have resonances, however, because it related to the ways in which students upload their own work to a website for comparison. It didn't sound dissimilar to the chat-room activity that I conduct fortnightly with my students. It suggested that they could use their own research

to produce a digital report, and then upload that to the class website. And then, the next class meeting, this means that instead of just discussing what they found, both learners and the tutor can make use of the website to compare their shared experiences, and then make an edited record of the best bits of their reports. It’s a great way to summarise what they found that’s relevant to the intended learning outcome.

I wonder if it's possible to do this with the student analyses of the secondary sources which they study each week? If I could persuade the students to upload their responses to a forum ahead of the session, we could then use that as a starting point for an online discussion which identified the most important factors, or where students had got a particular idea. I wonder if this would make better use of our online time than just reading through 12 sets of answers to the same questions without having time to engage in any debate… There is a problem, though: I usually find that my students can't really type, which means that they are unable to answer anything other than seen questions – they can't type fast enough to answer off-the-cuff questions.

From a management point of view, blended learning technology offers the ability to track the amount of work done by individual students and the progression of students under the supervision of different teachers.  The course pointed out that it is possible to track progress towards learning outcomes (either as a teacher or manager) through

  • Reading comments in a forum
  • Tracking learners’ VLE activity
  • Results from automated tests
  • Using voting tools in response to your questions in class

The data generated by educational technologies such as Moodle are known as ‘learning analytics’ or ‘activity data’.  There are, however, some significant ethical issues surrounding learning analytics, particularly with regard to clarity, privacy and validity.

I also had a go with an electronic lesson planner, or as it prefers to call itself, a learning designer. It was useful to be able to see instantly how much time you had used up, and the pie chart made it easy to see the proportions of different types of learning activity.