Repository Architecture 83

At a JISC workshop last Thursday I was invited to present some ideas around an architecture to support and exploit repositories in the UK. I gave the presentation the title Repository Architecture #83 ;-)

My intention was to suggest some starting principles and then explore how they held up in the face of real-world issues. Here is the slide where I outlined these principles:

I also asked the question: "do we actually need a new architecture?" - suggesting that there is already a ubiquitous and successful architecture supporting much/most/(all?) of the functionality we want from repositories. Taking a _resource oriented_ approach also seems to offer all kinds of advantages. Applying this approach is certainly not a new idea - others have been here before. However, I suggest that the resource oriented approach and the service oriented approach can be most effective when used to complement each other.

I think that there is still a place for the institutional repository as the collection of systems which surround what I call the source repository. I define the 'source repository' as an (ideally) quite simple system which contains:

  • the resources themselves, individually addressed with HTTP URIs
  • simple, item-level metadata records
  • site-map(s) to aid remote search engines
  • public, HTTP interfaces
  • feeds to notify remote agents of the deposit of new resources in the repository (RSS and/or Atom)
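To make the list above a little more concrete, here is a minimal sketch of what such a source repository might look like as a data structure. It is only an illustration under my own assumptions: the names `Item` and `SourceRepository`, the `example.org` URIs and the choice of Atom over RSS are all hypothetical, and a real system would serve these documents over HTTP rather than return them as strings.

```python
from dataclasses import dataclass, field
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ATOM_NS = "http://www.w3.org/2005/Atom"


@dataclass
class Item:
    """One deposited resource plus its simple, item-level metadata record."""
    uri: str           # HTTP URI of the resource itself
    metadata_uri: str  # HTTP URI of the metadata record (also a resource)
    title: str
    deposited: str     # ISO 8601 UTC timestamp, e.g. "2008-03-01T10:00:00Z"


@dataclass
class SourceRepository:
    """Hypothetical source repository: resources, records, site-map, feed."""
    base_uri: str
    items: list = field(default_factory=list)

    def deposit(self, item):
        self.items.append(item)

    def sitemap(self):
        """List every resource and metadata record for remote search engines."""
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for item in self.items:
            for loc in (item.uri, item.metadata_uri):
                url = ET.SubElement(urlset, "url")
                ET.SubElement(url, "loc").text = loc
        return ET.tostring(urlset, encoding="unicode")

    def atom_feed(self):
        """Notify remote agents of deposits, newest first."""
        feed = ET.Element("feed", xmlns=ATOM_NS)
        ET.SubElement(feed, "title").text = "New deposits"
        ET.SubElement(feed, "id").text = self.base_uri + "/feed"
        # Timestamps in one consistent ISO 8601 format sort correctly as text.
        newest_first = sorted(self.items, key=lambda i: i.deposited, reverse=True)
        if newest_first:
            ET.SubElement(feed, "updated").text = newest_first[0].deposited
        for item in newest_first:
            entry = ET.SubElement(feed, "entry")
            ET.SubElement(entry, "id").text = item.uri
            ET.SubElement(entry, "title").text = item.title
            ET.SubElement(entry, "updated").text = item.deposited
            ET.SubElement(entry, "link", href=item.uri)
        return ET.tostring(feed, encoding="unicode")


repo = SourceRepository("https://repo.example.org")
repo.deposit(Item(
    uri="https://repo.example.org/papers/1",
    metadata_uri="https://repo.example.org/papers/1/metadata",
    title="An example paper",
    deposited="2008-03-01T10:00:00Z",
))
print(repo.sitemap())
print(repo.atom_feed())
```

The point of the sketch is how little is in it: nothing here knows about workflow, classification or preservation, which is where the surrounding systems would come in.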

An 'institutional' or 'subject' or 'learning object' repository contains one or more source repositories plus any systems needed to manage it in its particular context. These larger repositories might be very complex: the important point is that the logical component I call the source repository should be as simple as possible in its public-facing interface: basically a bunch of resources, with an address space.

So, a resource is given a Cool URI, and a (probably) simple metadata record is made available, also as a resource with a URI. I suggested that an ORE resource map could be used to relate the metadata record to the resource - from the point of view of the web or ORE, a metadata record is a resource just like, for example, a PDF of a scholarly paper. Elsewhere, more and richer metadata might be created through mechanisms ranging from automatic metadata creation to further human effort, which might be in the nature of traditional cataloguing by trained and motivated individuals, or 'crowd-sourced' tagging by untrained but still motivated people.

Complexity is introduced, where necessary, in services developed to manage and exploit resources held in source repositories. Crucially, such activity does not happen unless there is a clear incentive for it, and then it happens close to the point of incentive. As an example, if a particular domain has a strong need to classify papers then someone might go to the trouble of harvesting, aggregating and text-mining the text of these papers with a view to extracting terms to use for classification. Or something similar might be achieved by a team of professional cataloguers using an agreed vocabulary. However it is done, the new metadata thus created could be made available as a web resource where it could be used and combined with other resources as required.

I was asked to illustrate this with a few diagrams, which provoked a fair amount of discussion.
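The idea that a resource map can relate a paper to its metadata record, and later to richer metadata created elsewhere, might be sketched roughly as follows. This is a simplification under my own assumptions: it builds plain subject/predicate/object tuples using the `describes` and `aggregates` terms from the ORE vocabulary and is not a conformant ORE serialization. The function names, the `#aggregation` fragment convention and the `example.org` URIs are all hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"


def resource_map(rem_uri, *resource_uris):
    """Return triples for a resource map that aggregates the given resources.

    The aggregation is identified by appending '#aggregation' to the
    resource map URI; this is an illustrative convention only.
    """
    aggregation = rem_uri + "#aggregation"
    triples = [(rem_uri, ORE + "describes", aggregation)]
    for uri in resource_uris:
        triples.append((aggregation, ORE + "aggregates", uri))
    return triples


def aggregated(triples):
    """List the URIs of the aggregated resources, in the order given."""
    return [obj for (_, pred, obj) in triples if pred == ORE + "aggregates"]


paper = "https://repo.example.org/papers/1"
simple_record = "https://repo.example.org/papers/1/metadata"
# Richer metadata produced by a third-party service is just another resource.
classification = "https://classifier.example.org/records/papers-1"

rem = resource_map("https://repo.example.org/rem/1",
                   paper, simple_record, classification)
for triple in rem:
    print(triple)
```

Nothing distinguishes the PDF from either metadata record at this level: each is a resource with a URI, and whoever had the incentive to create the richer record can publish it wherever suits them.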

The point was made, strongly, that it is subject repositories which have the content, rather than institutional repositories. Regardless of whether this is, or will continue to be, true, I think the architectural principles hold up. The business drivers are, I guess, quite different!

I learned a lot from the workshop and had some of these ideas challenged quite robustly. I think they held up but the clarity of presentation could be improved - this is what I will be working on now.