professional reflections

Digital library pipeline for a million books.

I was pleased to be invited by Brian Fuchs to a 'Million Books Workshop' at Imperial College, London last Friday. It was a fascinating day, in the company of what was, for me, an unusual group of 20-30 linguists, classical scholars and computer scientists. The morning session consisted of three presentations (following an introduction from Gregory Crane, which I missed thanks to the increasingly awful transport system between London and the South West) which brought us up to speed with some advances in OCR, computer-aided text analysis and translation, and classification. The presentations were intended to form an ordered progression:

  1. From Image to Text: OCR and Mass Digitisation (Dr. Thomas Breuel, DFKI and Technical University Kaiserslautern)
  2. From Text to Information: Machine Translation and Syntax Recognition (David Smith, Johns Hopkins University, & David Bamman, Perseus Project)
  3. From Information to Learning: Machine Learning and Classification Techniques (David Mimno, U Mass, Amherst)

Listening to these presentations, I quickly found myself well outside of my comfort zone, in terms of both the science and the domain (classical literature), so it was a challenging and exhilarating morning! It was difficult to take comprehensive notes as I had to really concentrate on the presentations at times in order to follow them - the pace was pretty smart, with jargon and 'in jokes' galore.

David Smith of Johns Hopkins University gave a fascinating and entertaining presentation which outlined some of the challenges, and advances, in language parsing and translation. He pointed out that although the structured view of the semantic web is a seductive one, even the newer online digital genres such as email and blogs mostly use unstructured or semi-structured text. However, parsing free text is very difficult, especially with the growing scale and diversity of texts available on the web. To illustrate this he employed a series of (sometimes amusing) translations from the Google translation service. The best available technology today uses supervised machine learning techniques to build a treebank. An alternative approach employs semi-supervised modelling techniques. Parallel texts in different languages are useful but, for some languages, only the Universal Declaration of Human Rights exists as a parallel text! As an aside, David pointed to the potential advantage of search engines searching across several languages: if you enter your query in English, for example, the search engine can also search resources in other languages and so automatically expands the search, utilising synonyms etc. 'for free'. This can then be more effective than monolingual searching. David offered a future based in pragmatism: translation support rather than fluent translation.
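To make the 'for free' query expansion a little more concrete, here is a minimal sketch of the idea as I understood it, not anything David showed. The two tiny dictionaries and the function name `expand_query` are my own illustrative assumptions; a real system would rely on large translation lexicons or models.

```python
# Toy bilingual dictionaries. The entries are illustrative assumptions,
# not real lexicon data.
EN_TO_FR = {"book": ["livre", "ouvrage"]}
FR_TO_EN = {
    "livre": ["book", "pound"],
    "ouvrage": ["book", "work", "volume"],
}


def expand_query(term, forward, backward):
    """Return the term plus everything reached by translating out and back."""
    expanded = {term}
    for foreign in forward.get(term, []):
        # The foreign-language term lets us search resources in that language.
        expanded.add(foreign)
        # Translating back picks up related terms in the original language.
        expanded.update(backward.get(foreign, []))
    return sorted(expanded)


print(expand_query("book", EN_TO_FR, FR_TO_EN))
# ['book', 'livre', 'ouvrage', 'pound', 'volume', 'work']
```

The round trip picks up useful near-synonyms ("volume", "work") but also an unrelated sense ("pound"), which is a small reminder of why free text is hard to handle.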

David Mimno presented on classification, sequences and topic modelling. In an interesting talk, it was the visualisation (as a topical transfer graph) of topic relations extrapolated from citations in a set of scholarly communications which really got the audience engaged - a series of questions ensued before David could move on. He illustrated his work with accessible examples: it turned out from one experiment that the single term most likely to identify email spam was, believe it or not, the word "red" showing in the markup, owing to the fact that "only spammers use red text"! Apparently, he had a system which could classify any of Shakespeare's plays as tragedy, comedy or history… with the exception of Romeo and Juliet, which comes out as a comedy for some reason…
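For anyone wondering how a single term like "red" can be singled out, here is a rough sketch of one common way to do it: score each term by a smoothed log-odds of appearing in spam versus non-spam messages. This is my assumption about the general technique, not David's actual method, and the six tiny 'messages' are invented purely for illustration.

```python
import math
from collections import Counter


def term_log_odds(spam_docs, ham_docs):
    """Score each term by a smoothed log-odds of appearing in spam vs. non-spam."""
    # Document frequency: in how many messages does each term appear?
    spam_df = Counter(t for doc in spam_docs for t in set(doc))
    ham_df = Counter(t for doc in ham_docs for t in set(doc))
    scores = {}
    for term in set(spam_df) | set(ham_df):
        # Add-one smoothing keeps unseen terms from producing a division by zero.
        p_spam = (spam_df[term] + 1) / (len(spam_docs) + 2)
        p_ham = (ham_df[term] + 1) / (len(ham_docs) + 2)
        scores[term] = math.log(p_spam / p_ham)
    return scores


# Invented toy data: each message is a list of terms found in its markup.
spam = [["font", "red", "offer"], ["red", "free", "font"], ["red", "offer"]]
ham = [["font", "meeting", "notes"], ["free", "lunch"], ["meeting", "offer"]]

scores = term_log_odds(spam, ham)
best = max(scores, key=scores.get)
print(best)  # red
```

In this toy data "red" appears in every spam message and in no legitimate one, so it gets the highest score.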

The takeaway for me was that some of the technology in these spaces is maturing. Thomas Breuel, for instance, made a compelling case for really effective OCR (Optical Character Recognition) in his description of the open-source OCRopus project, which he leads and which is sponsored by Google. Building on previous systems like the character recognition system Tesseract, OCRopus employs a modular design with components which offer the following workflow, focussing on the processing of scanned books:

layout analysis -> isolated character recognition -> statistical language modelling -> text

The project is heading towards a beta release this year, and the team plan to create a deployment 'bundle' in the form of an Amazon EC2 AMI. I didn't quite catch the details but I think they have found a way to monetise this through the Amazon referral program, which sounds interesting. In any case, the idea is that one could take the AMI, deploy it, run it for a few hours to process a particular scan, and then shut it down again - potentially a very cost-effective way of proceeding. Thomas made the point that, as OCR technology continues to improve, we are likely to want to process scans of books several times. He explained how the project was aiming towards a "full digital library pipeline", a system which could be deployed from a connected laptop: with the new affordability of powerful digital cameras, a researcher might photograph a book's pages themselves before feeding the resulting images into the OCRopus workflow (OCRopus can handle the distortion effect of non-flat pages very effectively). Another interesting aspect of this work is the distributed parallel training which underpins the statistical language modelling: a large model is achieved by combining many little models created by many people, through the web. If you are interested in this area, then you should also check out the hOCR format specification and tools.
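I didn't catch how OCRopus actually combines its models, so the following is only a sketch of the general idea under a simplifying assumption of my own: each contributor's 'little model' is a table of character-pair counts, and the large model is the sum of those tables. The function names are hypothetical and are not part of OCRopus.

```python
from collections import Counter


def bigram_counts(text):
    """Count adjacent character pairs in one contributor's text sample."""
    return Counter(zip(text, text[1:]))


def merge_models(models):
    """Combine many small count models into one by summing their counts."""
    merged = Counter()
    for model in models:
        merged.update(model)
    return merged


def next_char_probability(model, prev, nxt):
    """Estimate P(nxt | prev) from the pair counts in the model."""
    total = sum(count for (first, _), count in model.items() if first == prev)
    return model[(prev, nxt)] / total if total else 0.0


# Three small models, as if trained separately by three people.
small_models = [bigram_counts(s) for s in ("the cat", "the hat", "then that")]
large_model = merge_models(small_models)

print(next_char_probability(large_model, "h", "e"))  # 0.6
```

Because counts simply add up, the small models can be built independently and merged in any order, which is what makes this kind of training easy to distribute.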

I had been invited to this workshop because of my role and interest in the deployment of services at a community and network level. I joined a panel at the very end of the day where we were invited to consider what services and infrastructure might be required, in the UK, to support the digitisation and useful processing of a 'million books'. We didn't get very far with this because we had run out of time and, I suspect, energy by this point, but the question remains… I'll be picking this up with some colleagues in due course.

Fascinating day, and topped off with a quick pint standing outside a packed London pub in a light drizzle, which was actually a refreshing and pleasant way to conclude!



Introduction – Digital Library Development

Libraries are rightly called the storehouse of valuable knowledge. They were invented in the 5th century BC, holding both fiction and non-fiction books, and today there are millions of libraries all over the world. With rapid advancement in every field, more and more documents are becoming available in printed form, and libraries keep and preserve these materials so that historical items remain available.

Many libraries in India have not yet catalogued all of their holdings, and searching physical collections that span more than 100 years has become a difficult task. With the advent of new technologies, many providers now offer customized digitization services to libraries around the world.

Digital Library Advantage - Digital libraries need not keep large and expensive stores of bulky and decaying paper. Libraries can shrink from large warehouses to small rooms and catalogs can be electronic, electronically updatable, and computer generatable, making them easier, faster, and cheaper to search, produce, and update. Libraries will not need to buy multiple copies to allow for book scuffing, book destruction, or to place one book in several categories.

Nor will they need binderies to bind journals or magazines into volumes, or to rebind old books. Nor will they need reshelvers. Also, the library can more easily refer readers to other books with similar subjects, tastes, or interests. Libraries will not need to chemically treat their decaying books, microfilm them, or transcribe them to large-print, or audio. All transformations are easier with electronic books.


Designed by Paul Walk