After Easter, I attended three sessions of training on Transkribus, the handwritten text recognition software that works a bit like Optical Character Recognition to create automatic transcriptions of manuscripts. This is the second in a series of three posts about the training.

The second session was a bit different. Before going into further information about how the software works, we had a presentation from Paty Murrieta-Flores and Rodrigo Vega-Sanchez. Their project, Digging into Early Colonial Mexico, developed computational techniques for dealing with geographical information sent back to Spain from its colonies in the period 1577-1585. The Geographic Reports of New Spain were made up of 12 volumes of questionnaire responses running to several thousand pages – over 4 million words.  The textual data was supplemented by maps which combined spatial information from the Spanish and native American traditions.  Because the geography of the period is not stable, they created the first gazetteer of 16th-century Mexico, which is now open source with more than 14.8k historical placenames.

They also identified concepts in the text and marked them up with tags through Natural Language Processing so that people could find them more easily, choosing concepts that were useful both for the project and for the wider historical community. They marked up an annotated dataset and trained a model to tag further pages automatically.  You can then do really interesting things such as Geographical Text Analysis, which identifies whether a concept is related to a particular place.
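
To give a rough sense of what this kind of NLP mark-up looks like in practice, here is a minimal sketch using spaCy’s off-the-shelf Spanish model as a stand-in for the project’s own custom-trained model. The example sentence and the choice of library are my own assumptions, not something the speakers showed.

```python
# A minimal sketch of concept tagging with NLP, using spaCy's small Spanish
# model as a stand-in for the project's own model (which was trained on their
# annotated dataset). The sentence is invented for illustration.
import spacy

# Requires: pip install spacy && python -m spacy download es_core_news_sm
nlp = spacy.load("es_core_news_sm")
doc = nlp("El pueblo de Tepoztlán está a cinco leguas de la ciudad de México.")

for ent in doc.ents:
    # Each recognised entity carries a label (e.g. LOC for places) and character
    # offsets, which is what allows tagged concepts to be searched and linked
    # to a gazetteer entry.
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```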

Unlocking the Colonial Archive took this a stage back, thinking about how we could unlock the massive amounts of information in colonial archives that have enormous potential but are time-consuming to access.  The idea was to leverage AI to access archival information, prioritising documents from the sixteenth and seventeenth centuries that are so difficult to read.  The project aimed to improve handwritten text recognition, to use natural language processing to annotate, identify and extract data, and to work on computer vision that might allow computers to identify elements in pictorial documents, including place names, objects etc.

By combining these approaches they have created ways of connecting data in different archives and countries, which now allows them to work with communities to put forward some of their knowledge about these documents and concepts.

The Fleets of New Spain project investigates the social networks, commerce routes, identity marks, slave trade and scientific knowledge exchange between the Americas and Spain.

Rodrigo Vega then talked about Handwritten Text Recognition.  Normally we have to analyse a small group of documents simply because of time constraints. This team aimed to take a big data approach, and needed to train models to read the four types of handwriting that were most frequent in their corpus. They used the Transkribus software on their digitised images, starting with layout analysis and cleaning.

First you need access to the images, then you need to deal with potentially complex or mixed layouts. Another challenge faced by the team was the mixture of languages in the documents, and given that there are fewer documents in indigenous languages, it is difficult to reach the number of pages and characters you need for good automatic transcription.  A person might be able to understand a transcription with a 15% error rate, but for any corpus data project that error rate is too high, and it needs to be below 10% for any natural language processing technique to work. They found that 15,000 words is not nearly enough to create a ground truth that works – it needs to be significantly higher, and they needed over 100,000 words to get the error rate down to 8.8%. They also found the model works much better with high-resolution images.
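
For context, the error rates quoted here are character error rates: the proportion of characters the model gets wrong compared with a ground-truth transcription. Below is a minimal sketch of how such a figure can be computed with plain edit distance. The sample strings are invented, and this only illustrates what the percentage means, not how Transkribus calculates it internally.

```python
# Minimal sketch of a character error rate (CER) calculation: the Levenshtein
# (edit) distance between the model output and the ground truth, divided by
# the length of the ground truth. Sample strings are invented.

def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(ground_truth: str, model_output: str) -> float:
    """Character error rate as a percentage of the ground-truth length."""
    return 100 * edit_distance(ground_truth, model_output) / len(ground_truth)

truth = "en la ciudad de mexico"
output = "en la cuidad de moxico"
print(f"CER: {cer(truth, output):.1f}%")  # about 13.6%: readable by a person, but too noisy for NLP
```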

We then went back to the training itself, where we were taught how to ensure that the software recognised the layout of a page correctly – this is essential because an inaccurate layout causes an inaccurate transcription, so cleaning up the layout needs to be the first step. Every image is divided into regions and lines so that the software can identify where the text is. The layout is made up of the following elements (there is a small example after this list showing how they appear in an exported page):

  • The Baseline is a polyline that runs along the bottom of the text. You can adjust it at lots of different points. It is important that it runs all the way along the bottom of the line, not including the ascenders or descenders.
  • The Text Regions are rectangular shapes enclosing sections of text, e.g. columns, running headers, footnotes. In the default layout analysis, the software looks for the baselines first and then applies them to the text regions.
  • Line Polygons encase all of the written text in a line, taking in all the ascenders and descenders, as well as indicators of abbreviation. You can modify the polygon to ensure that it includes everything in the correct line. You then have to tick the box to ‘keep original line polygons’, otherwise the software overwrites them with its own. You can also train a model to do this.
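
To make these elements a bit more concrete, here is a small sketch of where they live when a page is exported from Transkribus as PAGE XML. The coordinates and text below are invented and the exact schema version may differ, but the basic structure of a text region containing text lines, each with its own line polygon (Coords) and Baseline, is what the layout editing above is manipulating.

```python
# Minimal sketch: parsing a tiny, hand-written fragment of PAGE XML (the format
# Transkribus exports) to show where text regions, line polygons and baselines
# sit. The coordinates and text are invented for illustration.
import xml.etree.ElementTree as ET

PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="folio_001.jpg" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <Coords points="100,100 1900,100 1900,600 100,600"/>
      <TextLine id="r1_l1">
        <Coords points="110,120 1890,120 1890,220 110,220"/>
        <Baseline points="110,200 1890,200"/>
        <TextEquiv><Unicode>En la ciudad de Mexico</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""

ns = {"p": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}
root = ET.fromstring(PAGE_XML)
for region in root.iterfind(".//p:TextRegion", ns):
    print("Region", region.get("id"))
    for line in region.iterfind("p:TextLine", ns):
        polygon = line.find("p:Coords", ns)     # the line polygon enclosing ascenders/descenders
        baseline = line.find("p:Baseline", ns)  # the polyline along the bottom of the text
        print("  Line", line.get("id"))
        print("    polygon :", polygon.get("points"))
        print("    baseline:", baseline.get("points"))
```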

There are public layout models that you can use to recognise your layout. They recommend testing all three of Universal Lines, Mixed Line Orientations (all directions) and Horizontal Line Orientation (just horizontal and vertical). You might need to go into the settings to tell it that the baselines are mixed in order to get it to work.

The only problem with this training was that so far, after 7 hours, we have just had stuff explained to us. I find it difficult to take in information for this long because I haven’t had a chance to process, try out and learn the basics which would allow me to build on them with more advanced techniques. I get that, to some extent, all of this is pretty basic and we probably need to know a lot of it to make a decent start, but it would have been helpful, I think, to break it up with us having a go at things so that we’d have had a chance to consolidate before moving on to the next step. Still, it’s very interesting and there is likely to be a lot in the recording that I will go back to later.