Paul Walk's Web

RIOXX application profile - draft 1

Tuesday, October 23, 2012

Together with Sheridan Brown, I have been tasked with developing some guidelines and a metadata ‘application profile’ for institutional repositories (IRs) in the UK. We are calling this work RIOXX. This post focusses on the application profile more than the guidelines, and describes phase 1 of the project, which aims to deploy this application profile across IRs in the UK by the first quarter of 2013.


Scope and approach

Funder policy regarding Open Access (OA) is being actively developed and the OA landscape is shifting. The emphasis in this phase of RIOXX is to do something which is adequate and able to be quickly implemented. This work will provide an application profile and guidelines which are inherently an interim solution. Broadly speaking, the approach we are taking is as follows:

Develop the simplest possible application profile, based on Dublin Core (DC).

Pretty much all repositories support DC, as another application profile of DC, OAI-DC, is a mandated minimum metadata format for the ubiquitous protocol for harvesting metadata from repositories (OAI-PMH). If all goes well, the development work needed for repository systems should be minimised.
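To make the harvesting point concrete, here is a minimal sketch of how a harvester might ask a repository for OAI-DC records over OAI-PMH. The `verb` and `metadataPrefix` parameters and the `oai_dc` prefix come from the OAI-PMH protocol itself; the base URL and the helper name `list_records_url` are hypothetical and only there for illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting by set, if the repository offers sets.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# The base URL below is an assumption, not a real repository endpoint.
print(list_records_url("https://repository.example.ac.uk/oai"))
```

A repository that later exposes RIOXX records would presumably do so under a different `metadataPrefix`, but the name of that prefix is not something this post specifies.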

We have examined two related initiatives: the OpenAIRE guidelines (and the DRIVER guidelines which preceded these), and the EThOS Toolkit, which developed an application profile of DC for eTheses.

Consider a CERIF-XML expression of this application profile

The interest in CERIF as the de facto standard format for exchanging this kind of information between systems is growing steadily. We are liaising with the CERIF Support Project and ensuring that a transition towards a CERIF-based approach remains viable.

Develop a modelled, expressive application profile

In later phases of RIOXX, we hope to develop the application profile more fully. This will take into account such things as:

* greater use of controlled vocabularies
* a move away from DC and towards CERIF
* greater involvement of systems other than repositories - notably Current Research Information Systems (CRIS)
* modelling of ‘access-level semantics’ - i.e. describing how, where and under what license or conditions the resource might be accessed and used

Rationale for some decisions in phase 1

Keeping things very simple

Timescales are very, very tight. From a pragmatic, technical point of view we have restricted ourselves in this phase to developing an approach which allows the repository to emit RIOXX records based on information properties already catered for in the repository system (that is, the placeholders for Sponsor and ProjectID are already there, even if the actual data has not yet been entered). We have deferred a more complete and complex approach to a later phase because the capacity to deliver this kind of information from institutional systems is developing rapidly.

The ProjectID property

We found ourselves unable to simply adopt the OpenAIRE guidelines as these mandate a particular syntax for the ProjectID (designed for EC funded projects) which would preclude certain UK funders. In any case, we consider it to be a mistake to embed semantics into this property and believe it is best provided as a globally-unique, opaque identifier. To this end, we are actively looking at the possibility of funders minting DOIs for the ProjectID. In the meantime, we will be requiring that the ProjectID be whatever identifier is provided by the funder of the output being described in the record. We have chosen the term ProjectID rather than, for example, GrantID, as we have been advised that the former is the more widely used term in the UK.

The Sponsor property

For phase 1 we are mandating this property, but specifying only that a recognised form of identifier for the funder/sponsor be used. This will mean a free-text string for now. We are actively exploring possibilities for identifying and then mandating a particular authority list of funder names, such that this property becomes underpinned by a controlled vocabulary. However, this will not make it into phase 1. This property, while essential in the short term, might become more of a convenience than a necessity, as the ProjectID becomes more reliably ‘actionable’. In the medium-term, we would anticipate being able to reliably derive the sponsor/funder from the ProjectID. For this reason, we have not modelled the relationship between these two properties closely - except insofar as they exist in a particular record. This means that some records may contain more than one Sponsor and more than one ProjectID with no direct way to relate a given ProjectID to a given Sponsor. While it would be possible to model this relationship, we have chosen not to do so in this phase, because we anticipate that this will need to be modelled more thoroughly in future phases.
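The consequence of leaving the Sponsor/ProjectID relationship unmodelled can be sketched as follows. This is an illustration only: the `RioxxRecord` class, the funder names and the identifiers are all made up, and the flat-list representation is my assumption about how a consumer might hold a phase 1 record in memory.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Tuple


@dataclass
class RioxxRecord:
    """Hypothetical in-memory view of a phase 1 record."""
    title: str
    sponsors: List[str] = field(default_factory=list)      # free-text funder names
    project_ids: List[str] = field(default_factory=list)   # opaque, funder-supplied


def candidate_pairings(record: RioxxRecord) -> List[Tuple[str, str]]:
    """All the (sponsor, project ID) pairs a consumer cannot rule out.

    Because the record holds two independent lists, every combination is
    equally plausible; nothing says which funder issued which identifier.
    """
    return list(product(record.sponsors, record.project_ids))


record = RioxxRecord(
    title="An example output",
    sponsors=["Funder A", "Funder B"],      # invented names
    project_ids=["A/123", "B-456"],         # invented identifiers
)
for sponsor, project_id in candidate_pairings(record):
    print(sponsor, project_id)
```

With two sponsors and two project IDs there are four candidate pairings, of which presumably only two are true. Once the ProjectID becomes reliably actionable, the sponsor could be looked up from the identifier and the ambiguity would disappear.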

Deferring the ‘access-level-semantics’ question

In order to convey the precise nature of the open-access ‘state’ of a resource, RIOXX will need to develop a richer way of describing such concepts as ‘green’ or ‘gold’ open access, embargoes, licenses etc. The use-cases and operations which will depend on such information are not yet clear and, while the time has now come to model these, this should not be done in a hurry.

The following is a table of proposed elements and recommended formats. We propose to extend the Dublin Core elements with two new elements under the rioxxterms namespace.

| Element | Inclusion (M/R/O) | Format | Format (M/R/O) |
|---|---|---|---|
| dc:title | M | Free text. It is recommended to use the form: Title:Subtitle | R |
| dc:creator | M | Free text. Recommended practice is to either use the form Last Name, First Name(s) or a unique identifier from a recognised system. Each creator should be given a separate dc:creator element | R |
| dc:identifier | M | A globally unique identifier. It is strongly recommended to use a URI which can be de-referenced (i.e. is ‘actionable’) where this is appropriate | R |
| dc:source | M | Journal title, reference or ISSN | M |
| dc:language | M | Use ISO 639-3 language codes | M |
| rioxxterms.projectid | M | Use the identifier provided by the funder to indicate the project within which this output has been created | M |
| dc:coverage | O | The extent or scope of the content of the resource. Coverage will typically include spatial location (a place name or geographic co-ordinates), temporal period (a period label, date or date range) or jurisdiction (such as a named administrative entity). | |
| dc:rights | O | No agreed vocabulary or semantics exist for this in the context of Open Access papers, and it is common practice for this to be ignored by repositories currently. Some work is being funded to look at this area for the next phase of RIOXX. For now, this element has to be optional. | |
| dc:audience | O | Free text. | |
| dc:format | R | It is recommended to use the IANA registered list of Internet Media Types (MIME types) | M |
| dc:date | M | One date using ISO 8601. Published date is the default and recommended interpretation. | M |
| dc:type | O | This is currently free text and an optional element. However, RIOXX phase 1 will be recommending that a vocabulary be adopted or developed for this element. | O |
| dc:contributor | O | (as for dc:creator) | |
| rioxxterms.sponsor | M | Free text - Funder name using the funder’s preferred format | O |
| dc:publisher | R | Free text indicating the name of the publisher (commercial or non-commercial) | O |
| dc:description | R | Best practice is to use an English language abstract. | O |
| dc:subject | R | Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme. E.g. LOC, MESH. | O |
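As a rough illustration of how the table might be applied, the sketch below builds a record and checks it against the elements marked M for inclusion, plus two of the mandatory formats. Several things here are assumptions: the `rioxxterms` namespace URI is a placeholder (this draft does not define one), the `record` wrapper element and all function names are invented, and the format checks are deliberately loose (a three-letter shape test rather than a lookup in ISO 639-3, and a simplified subset of ISO 8601 dates).

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"    # Dublin Core elements namespace
RIOXX_NS = "http://example.org/rioxxterms/"   # placeholder, NOT an official URI

# Elements marked "M" in the Inclusion column of the table.
MANDATORY = [
    ("dc:title", DC_NS, "title"),
    ("dc:creator", DC_NS, "creator"),
    ("dc:identifier", DC_NS, "identifier"),
    ("dc:source", DC_NS, "source"),
    ("dc:language", DC_NS, "language"),
    ("dc:date", DC_NS, "date"),
    ("rioxxterms.projectid", RIOXX_NS, "projectid"),
    ("rioxxterms.sponsor", RIOXX_NS, "sponsor"),
]


def build_record(fields):
    """Build a flat record from (namespace, local name, value) triples."""
    record = ET.Element("record")  # hypothetical wrapper element
    for ns, name, value in fields:
        ET.SubElement(record, f"{{{ns}}}{name}").text = value
    return record


def missing_mandatory(record):
    """Return labels of mandatory elements that are absent or empty."""
    missing = []
    for label, ns, name in MANDATORY:
        values = [el.text for el in record.findall(f"{{{ns}}}{name}")]
        if not any(v and v.strip() for v in values):
            missing.append(label)
    return missing


def format_problems(record):
    """Loose checks on two formats the table marks as mandatory."""
    problems = []
    language = record.findtext(f"{{{DC_NS}}}language", default="")
    if not re.fullmatch(r"[a-z]{3}", language):
        problems.append("dc:language does not look like an ISO 639-3 code")
    date = record.findtext(f"{{{DC_NS}}}date", default="")
    # Accepts YYYY, YYYY-MM or YYYY-MM-DD only: a subset of ISO 8601.
    if not re.fullmatch(r"\d{4}(-\d{2}(-\d{2})?)?", date):
        problems.append("dc:date does not look like an ISO 8601 date")
    return problems


# All values below are invented for illustration.
sample = build_record([
    (DC_NS, "title", "An example title:With a subtitle"),
    (DC_NS, "creator", "Bloggs, Jo"),
    (DC_NS, "identifier", "https://repository.example.ac.uk/id/eprint/1"),
    (DC_NS, "source", "Journal of Examples"),
    (DC_NS, "language", "eng"),
    (DC_NS, "date", "2012-10-23"),
    (RIOXX_NS, "projectid", "A/123"),
    (RIOXX_NS, "sponsor", "Funder A"),
])
print(missing_mandatory(sample))
print(format_problems(sample))
```

Repeated elements (several `dc:creator` entries, or more than one sponsor) are allowed by this check; it only asks that at least one non-empty value exists for each mandatory element.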

I would appreciate any comments people might have about the technical aspects of this.


Designed by Paul Walk, built with  Hugo
Copyright © Paul Walk. This website and blog are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License