professional reflections

Google gives up on supporting OAI-PMH for Sitemaps

For some time now I have occasionally advised people involved in repository administration that they should consider registering the Base URL of their OAI-PMH interface (if they have one) with Google as a proxy for a Sitemap. Until recently, Google supported the use of OAI-PMH Base URLs in its Webmaster Tools, which site owners can use to create and register sitemaps in order to give hints about the structure of the website to Google's web-crawler.

A while ago, I noticed that there was no longer any reference to this particular support in any of the documentation and began to suspect that this was being deprecated. Today, Google announced via their official blog that:

…we've found that the information we gain from our support of OAI-PMH is disproportional to the amount of resources required to support it. Fewer than 200 sites are using OAI-PMH for Google Sitemaps at the moment.

In order to move forward with even better coverage of your websites, we have decided to support only the standard XML Sitemap format by May 2008. We are in the process of notifying sites using OAI-PMH to alert them of the change.

Fewer than 200 sites…..

There are a few ways of looking at this. Perhaps 'open access' repositories are less concerned with Google rankings than the typical website owner. Perhaps the penetration of OAI-PMH in the world is still below any level that Google could find particularly interesting - certainly they never went to great lengths to advertise this support while it lasted. Clearly, Google have come to the end of a 'trial period' for their support for this protocol in their main indexing service.

Can we conclude anything from this? Probably not - surely OAI-PMH can thrive without Google Sitemap support? It certainly plays a fairly significant part in my professional life at present! Or should we view this as a symptom of decline….?

The official Google announcement is here.

Comments

I really like it when folks get together and share thoughts.

Great blog, keep it up!


[…] have gleefully rejoiced in the possible demise of the standard. However the discussion on Paul Walk’s blog is more balanced, lucid and informative, and covers well the many areas where OAI-PMH is more […]


@Wobbler - it depends what 'interoperability' you are looking for. Unless we define how we want repositories to inter-operate, then we can't assess the best tools for the job.

OAI-PMH is about 'metadata harvesting' - and in theory provides a lightweight way of an automated agent finding out what is in a repository. However, in reality this has only seen take up within the small world of repositories. Google, Yahoo etc. simply crawl the web following links - they don't see the need to implement a new protocol to deal with a tiny (in web terms) amount of content.
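To make the 'metadata harvesting' idea concrete, here is a minimal sketch of how an automated agent might build a harvesting request. The `ListRecords` verb, the `metadataPrefix`, `from` and `set` arguments and the `oai_dc` format come from the OAI-PMH specification; the function name and the example.org Base URL are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository Base URL.

    base_url is whatever the repository advertises; the one used in the
    usage example below is a made-up placeholder.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    if set_spec:
        # Optional restriction to a single set within the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository Base URL, for illustration only.
url = list_records_url("http://repository.example.org/oai", from_date="2008-04-01")
print(url)
```

A harvester would fetch that URL, parse the XML response and follow any resumption token it contains. That extra protocol layer is what a crawler following plain links never has to implement.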

Given this, I would argue we should concentrate on how to integrate the material that is stored in repositories into the 'web' environment. Now, this means making the content crawlable and linkable. As the Semantic Web is now getting broader takeup (with the key players such as Yahoo and Google now taking an interest), I would guess that embedding microformats and looking at other semantic web technologies could be a fruitful approach.


[…] to the news - the few that appeared on the Google Webmaster blog, and above all those that appeared on Paul Walks' blog. What […]


I read that blog post earlier (thanks to your blog). I think it is an interesting post, but unless by "alternative" he means ditching the concept of institutional repositories altogether and going with central repositories (which I do not think is realistic), it does not actually offer any alternative approaches to handle this.

And assuming "the web", e.g. Google Scholar?, can do it just as well without protocols such as OAI-PMH, why have that protocol in the first place? I am going to assume that it was necessary to optimize interoperability between institutional repositories?


Wobbler, I guess the short answer to your question would be 'The Web'. The case has been made for this more strongly elsewhere - I suggest that you have a look at http://efoundations.typepad.com/efoundations/2008/02/repositories-th.html for one argument along these lines.


First I would like to say that since I have found your blog, I have been finding a lot more interesting blogs on scholarly communication/ OA/ repositories. So thanks.

Anyway, maybe I have been following the Stevan Harnad/ Alma Swan crowd too much, but I was under the impression that OAI-PMH, in the context of institutional repositories, was currently the best approach to achieve a high level of accessibility of eprints? And also that it is flexible enough to be extended with other metadata (such as comments)?

If not, can someone tell me about (better) alternatives for accessibility/ interoperability between institutional repositories? What other protocols are there like OAI-PMH (but better and more flexible)?


I'm not so sure that sitemaps are going to grow in importance. Their principal importance in the past was due to inadequate spidering. Nowadays, Google's spidering with additional HTTP requests (unfriendly URLs). The only good thing about being Google sitemaps-compliant is the ability to have the most important links exposed in results lists, but some sites get that without being particularly SEO-focused.

Also, harvesting over sitemap would be a difficult proposition for large collections, as the heft of the XML layer would be more pronounced than a single-object query via a web services protocol.

There's nothing wrong with OAI-PMH necessarily – there just isn't any "killer app" to really entice people. I guess this might be an instance where "If you can't beat 'em, join 'em" would be sound advice.


The lesson presumably isn't "Google don't do it therefore we shouldn't" but "the fact that only 200 sites signed up says something about the technology which we should listen to"?

I keep coming back to the same old cracked record mantra: if a technology doesn't have a wide and fairly rapid uptake, if it isn't easy, if it can't demonstrate benefits to ALL (not just the geek community) within a reasonable time span, then it simply isn't a technology worth pursuing. Knowing not much about OAI

The latest Ofcom report seems to back this up. http://tinyurl.com/4vrkcz


@pete cliff - but this is a good point, Z39.50 has never been widely adopted outside the library sector, and now looks like an increasing barrier to access rather than an enabler.

I think there is a risk that OAI-PMH is in a similar situation - a sector specific solution, that is seen by many as too complicated and is never going to get widespread acceptance by the wider community.

We've got to ask, what does OAI-PMH add - is it achieving its goal of being "a low-barrier mechanism for repository interoperability" (from http://www.openarchives.org/pmh/)? To understand this, we have to start dragging the definition of 'repository' apart, but if we (for example) say Flickr is a repository of pictures, you could argue that OAI-PMH does nothing for interoperability between an e-prints repository and Flickr. What OAI-PMH does is enable interoperability between two systems supporting OAI-PMH - and as support for this drops, the likelihood of OAI-PMH being successful also drops.

We need to look really hard at this - I'm not convinced that OAI-PMH is 'the future' for repositories, and if not, we need to look at where the future is, and start heading in that direction.


The webmaster tools can be pretty useful for developers checking everything is OK (see: http://blogs.openaccesscentral.com/blogs/orblog/entry/how_many_how_fast for an example).

Over time I suspect sitemaps are going to grow in importance. Ignoring repositories for a moment, most standard webmasters aren't really aware of them, and the benefits they can bring, for example being able to mark certain pages as having a higher priority than others.

No doubt once sitemaps become part of your average webmaster's daily toolkit, their importance will grow, and repository people will catch on. Most repository managers have a set of tick boxes they like to check, one of them being inclusion in Google and Google Scholar. If they know sitemaps will help them tick these boxes, I'm sure they'll be all for them.

And from a development perspective, sitemaps are very easy to create (100x easier than writing an OAI-PMH interface) so there is no reason really for repository platforms not to support them out-of-the-box.
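As a rough illustration of how little is involved, here is a minimal sketch that writes a sitemap in the standard XML Sitemap format, including the optional per-page priority mentioned above. The element names and namespace are those of the sitemaps.org 0.9 schema; the function name and the example.org item URLs are hypothetical, and a real repository platform would pull the entries from its own database.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for an iterable of (url, lastmod, priority) tuples.

    lastmod (a date string such as "2008-04-01") and priority (0.0 to 1.0)
    are optional; pass None to leave either one out.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod is not None:
            ET.SubElement(url, "lastmod").text = lastmod
        if priority is not None:
            ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical repository items, for illustration only.
sitemap_xml = build_sitemap([
    ("http://repository.example.org/item/1", "2008-04-01", 0.8),
    ("http://repository.example.org/item/2", None, None),
])
print(sitemap_xml)
```

The resulting file can be saved at the site root and registered through the webmaster tools. Compare that with an OAI-PMH interface, which has to handle several verbs, selective harvesting and resumption tokens.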


Stuart, I didn't know about the new DSpace support for sitemaps - good news. I agree that the withdrawal of Google support for the use of OAI-PMH URLs as sitemaps is not a big deal on the face of it.

Do you think repository administrators will generally care about sitemap support in their systems - enough to bother with the Google Webmaster tools for example?


I really don't see the problem here. Have we ever complained that other search engines don't make use of OAI-PMH? They (not just Google - other search engines too) are now supporting a better protocol (sitemaps), so we as repository people still have a way of letting search engines know 'what's new'. DSpace version 1.5 ships with support for sitemaps; once the others follow, this will be a non-issue.


Pete, I'm committed to developing infrastructure to support scholarly research - and OAI-PMH is, or will be, a significant component in that infrastructure.

However, I still very much doubt that OAI-PMH will play much of a part in the simple search activity which Google supports, especially with the loss of Sitemap support.


Y'know, since Google never adopted Z39.50 I've not found a book in a library in years! ;-) Just because everyone does it, we shouldn't be swayed into thinking that using Google is the best and only way to find things - especially for scholarly research.


Rachel, I wouldn't draw any strong conclusions from this event - frankly, Google did not exactly push this support very strongly. Perhaps we might suppose that repository managers have been less concerned with, or less aware of SEO issues than mainstream web-site managers? If so, I think this is interesting in itself.


I find it a bit alarming that the implication is that if Google does not need a protocol then it's not worth bothering with…

In my view OAI-PMH is a first step to enable interaction (harvesting) between repositories, and between repositories and aggregators, that is aggregators other than Google. In other words a significant enabler of an interoperable scholarly communication ecology. It gives an opportunity to develop services customised for the HE information environment. Albeit I grant you, not many of those services are there yet.

And I guess we should remember that the intention of OAI-PMH is to build services for the whole HE community, I don't suppose our usage (by which I mean JISC funded research and development) is typical.

Rachel


Chris, the short answer to your specific question is: "not much". This is because you have astutely covered all the areas where OAI-PMH is significant to me in your exclusions. OAI-PMH does play a fairly significant part in my professional life… in development projects and JISC-sponsored activity. :-)

I am not aware that I use any service currently which depends upon OAI-PMH.


Paul, other than developments and JISC-sponsored activity, I'd be interested to know how OAI-PMH "plays a fairly significant part in [your] professional life"? It seems to me that the protocol plays a fairly small role at the moment; almost no-one uses the PMH-enabled search sites. Google et al. index the repository content, and that's how it is discovered…

OAI-ORE may change things, but I've yet to understand how!








Designed by Paul Walk