These are some rough notes from what I thought was an interesting keynote from Melissa Terras, Director of the UCL Centre for Digital Humanities, at this year's BL Labs Symposium.
Melissa started by asserting that reuse of digital cultural heritage data is still rare, and that preservation of such data is problematic: of the content digitised in the National Lottery's New Opportunities Fund programme around the turn of the millennium, ~60% is no longer available now.
However, a number of changes, referred to collectively under the unofficial label #openglam, have converged to give hope that the situation may be improving:
- funders now frequently mandate that research data will be made available for long periods - up to 10 years.
- licensing is greatly simplified with the growing adoption of the Creative Commons
- technical frameworks, which address the challenge of making such data available for others to use, are becoming available
- projects are more willing to address these issues
Melissa then went on to describe how UCL has been working with the British Library's archive of digitised 19th century books. These books, numbering 65,000, were digitised by Microsoft and then handed back to the BL in 2012 under a CC0 license.
The data generated by the digitisation of these books, and the subsequent OCR output, comprises about 224GB of text data in ALTO XML format. This is too much data to make available over the network - and it is this fact which creates the need for better infrastructure services to allow researchers to work with the data.
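To make the ALTO XML format concrete, here is a minimal sketch of extracting plain text from one ALTO page. ALTO stores each recognised word as a `<String>` element with a `CONTENT` attribute; the sample document and the function name are illustrative, not taken from the BL dataset, and the namespace URI varies by ALTO version, so the sketch matches on the local element name.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written ALTO fragment standing in for one digitised page.
SAMPLE_ALTO = """<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="It"/>
            <String CONTENT="was"/>
            <String CONTENT="a"/>
            <String CONTENT="dark"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""

def alto_to_text(xml_string):
    """Join the CONTENT attributes of every <String> element on the page."""
    root = ET.fromstring(xml_string)
    words = [
        el.attrib["CONTENT"]
        for el in root.iter()
        # match the local name so any ALTO namespace version works
        if el.tag == "String" or el.tag.endswith("}String")
    ]
    return " ".join(words)

print(alto_to_text(SAMPLE_ALTO))  # It was a dark
```

Multiplied across 65,000 books, even this simple extraction step is why the raw 224GB corpus is easier to process near the storage than to ship over the network.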
The UCL Centre for Digital Humanities engages with science faculties as well as humanities faculties. Any member of UCL staff can access what is effectively 'unlimited' local compute power. What has become apparent is that this local infrastructure is typically optimised for science with the following characteristics:
- one large dataset
- one or two complex queries
- single output (the answer), often a visualisation
whereas the requirements for a researcher wanting to work with the digitised books data are more like this:
- to work with 65,000 individual datasets
- to make one simple query
- to generate multiple outputs - e.g. hundreds of pages, which the researcher will take away and process further
UCL has therefore been designing computational platforms which allow users to filter the 65,000 books and find, for example, 300 books about some subject and then to download this data to process on a laptop. This project has also been good for computer science students who have been invited to design platforms to solve these kinds of problems.
Melissa suggested that there was a small number of very common query 'types':
- searches for all variants of a word
- searches that return keywords in context traced over time
- NOT searches: matches for a word or phrase while excluding another word or phrase
- searches for a word when in close proximity to a second word
- searches based on image metadata
... all returned in a derived dataset, in context
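Two of these query types - keyword-in-context and proximity search - can be sketched over plain tokenised text as below. The function names and the simple regex tokenisation are my own illustration, not the BL platform's implementation:

```python
import re

def tokenize(text):
    """Crude word tokenisation; real OCR text would need more care."""
    return re.findall(r"[A-Za-z']+", text.lower())

def keyword_in_context(tokens, term, window=3):
    """Return each hit with `window` tokens of context on either side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == term:
            hits.append(" ".join(tokens[max(0, i - window):i + window + 1]))
    return hits

def near(tokens, a, b, distance=5):
    """True if `a` occurs within `distance` tokens of `b`."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

toks = tokenize("The steam engine changed the Victorian railway forever")
print(keyword_in_context(toks, "engine", window=2))  # ['the steam engine changed the']
print(near(toks, "steam", "railway", distance=5))    # True
```

Run per book and traced over publication year, recipes like these produce exactly the kind of derived, in-context dataset described above.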
Melissa proposed that these would cover 90% of what researchers working with a collection like the BL 19C digitised books would want. Furthermore, librarians are quite capable of applying these basic recipes as a service for researchers, and can build on them to offer more sophisticated searches.
Melissa identified the following best practices:
- support derived datasets - people want to take a subset of the data away to process further
- document decisions - researchers need to know about the dataset: how it was generated, its provenance, how their query works, etc.
- offer fixed/defined datasets (has the data changed since the query was run?)
- support normalisations (e.g. if you find more mentions of your query term in later books, it might be because there are more books in the collection from that year)