These are some rough notes from what I thought was an interesting keynote from Melissa Terras, Director of the UCL Centre for Digital Humanities, at this year’s BL Labs Symposium.
Melissa started by asserting that reuse of digital cultural heritage data is still rare, and that preservation of such data is problematic: of the content digitised under the National Lottery's New Opportunities Fund programme around the turn of the millennium, roughly 60% is no longer available today.
However, a number of changes, referred to collectively under the unofficial label #openglam, have converged to give hope that the situation may be improving:
Melissa then described how UCL has been working with the British Library's archive of digitised 19th-century books. These books, numbering 65,000, were digitised by Microsoft and then handed back to the BL in 2012 under a CC0 licence.
The digitisation of these books, and the subsequent OCR output, produced about 224 GB of text data in ALTO XML format. This is too much data to make available over the network, and it is this fact which creates the need for better infrastructure services to allow researchers to work with the data.
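To make the ALTO XML point concrete, here is a minimal sketch of extracting plain text from one page of ALTO output. The XML fragment and the namespace version are illustrative assumptions; the actual BL files may use a different ALTO schema version or structure.

```python
# Minimal sketch: pull the word-level CONTENT attributes out of an
# ALTO XML page and reassemble them into plain text, line by line.
# The sample fragment and namespace are assumptions for illustration.
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="It"/><SP/><String CONTENT="was"/><SP/>
      <String CONTENT="a"/><SP/><String CONTENT="dark"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

def alto_to_text(xml_string):
    """Join the CONTENT attribute of every <String>, one line per <TextLine>."""
    root = ET.fromstring(xml_string)
    lines = []
    for line in root.iter(f"{{{ALTO_NS}}}TextLine"):
        words = [s.get("CONTENT", "") for s in line.iter(f"{{{ALTO_NS}}}String")]
        lines.append(" ".join(words))
    return "\n".join(lines)

print(alto_to_text(ALTO_SAMPLE))  # → It was a dark
```

Because ALTO stores OCR output word by word with layout coordinates, even a modest page yields far more XML than text, which helps explain how 65,000 books become 224 GB.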
The UCL Centre for Digital Humanities engages with science faculties as well as humanities faculties. Any member of UCL staff can access what is effectively ‘unlimited’ local compute power. What has become apparent is that this local infrastructure is typically optimised for science with the following characteristics:
whereas the requirements for a researcher wanting to work with the digitised books data are more like this:
UCL has therefore been designing computational platforms which allow users to filter the 65,000 books, find, for example, the 300 books about some subject, and then download just that subset to process on a laptop. This project has also been good for computer science students, who have been invited to design platforms to solve these kinds of problems.
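The filter-then-download idea can be sketched very simply. The record structure and field names ("identifier", "title", "year") below are hypothetical, not the actual UCL/BL platform schema:

```python
# Illustrative sketch: filter a book-level metadata catalogue by subject
# keyword so a researcher retrieves only a small subset of the corpus,
# rather than the full 224 GB. Field names are assumptions.

def filter_books(catalogue, keyword):
    """Return records whose title mentions the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [b for b in catalogue if keyword in b["title"].lower()]

catalogue = [
    {"identifier": "000000001", "title": "A Treatise on Railways", "year": 1846},
    {"identifier": "000000002", "title": "Poems of the Lake District", "year": 1872},
    {"identifier": "000000003", "title": "Railway Economy", "year": 1850},
]

matches = filter_books(catalogue, "railway")
print([b["identifier"] for b in matches])  # → ['000000001', '000000003']
```

The point is that only the metadata index needs to live server-side; the heavy ALTO XML for the matching books can then be fetched selectively.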
Melissa suggested that there was a small number of very common query ‘types’:
… all returned in a derived dataset, in context
Melissa proposes that these would give 90% of what researchers working with a collection like the BL's 19th-century digitised books would want. Furthermore, librarians are quite capable of applying these basic recipes as a service for researchers, and they can build on them to offer more sophisticated searches.
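One way such a basic query recipe might look in practice is a keyword-in-context (KWIC) search, which returns each hit of a term with its surrounding words as a derived dataset. This is an illustrative sketch of the general technique, not the actual query code used at UCL or the BL:

```python
# Illustrative KWIC recipe: find every occurrence of a term in a text
# and return it with a few words of context either side.

def kwic(text, term, window=3):
    """Return (left context, matched word, right context) for each hit."""
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower().strip(".,;:!?") == term.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, w, right))
    return hits

text = "The railway opened in 1846 and the railway company prospered."
for left, term, right in kwic(text, "railway"):
    print(f"{left} [{term}] {right}")
# → The [railway] opened in 1846
# → 1846 and the [railway] company prospered.
```

A recipe this simple is exactly the kind of thing a librarian could run as a service and then extend, e.g. with date filters or larger context windows.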
Melissa identified the following best practices: