Heather Morrison points to this excellent post by Glen Newton, wherein Glen proposes that Open Access should explicitly include machine readability:
Open Access must include access by machines:
* At minimum one must allow crawls of the site/content or (to reduce the impact of badly configured crawlers) create a compressed XML file containing all metadata and either content, or direct links to content and make it available for download (and if bandwidth is still an issue put it on a P2P network like BitTorrent).
* Preferable is to offer some kind of API (OTMI) or protocol (OAI-PMH) to get at content and metadata and citations.
* Better is to offer access to the XML of the articles in addition to the PDF and/or HTML; if the XML actually has some semantic content, then we are approaching the optimum.
The end goal is to support and encourage text mining and analysis of the full-text (preferably semantically rich XML), metadata and citations to allow literature-based exploration and discovery in support of the scientific research process.
Most importantly: hear, hear!
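To make the "protocol" option in Glen's list a little more concrete, here is a minimal sketch of what harvesting over OAI-PMH looks like from the machine's side. The verb `ListRecords`, the `metadataPrefix` and `resumptionToken` arguments, and the `oai_dc` format come from the OAI-PMH 2.0 specification; the repository URL, the function names and the sample response are my own illustrative assumptions, not any real repository's output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (function name is hypothetical)."""
    if resumption_token:
        # When resuming, resumptionToken is sent without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing token means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hand-written sample response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    base = "https://repository.example.org/oai"  # hypothetical endpoint
    print(list_records_url(base))
    ids, token = parse_page(SAMPLE)
    print(ids, token)
    print(list_records_url(base, resumption_token=token))
```

A real harvester would fetch each URL, parse the page, and repeat with the returned token until none comes back. Note that this only gets you the metadata; whether you are then permitted to do anything with the full text is exactly the question below.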
I do, however, have a nitpick to make. Heather makes no comment on Glen’s idea that this is an addition to the definition of OA, but in fact I think it’s already built into the accepted BBB definition. Peter Suber refers to the removal of price and permission barriers, to distinguish Open from “merely” free access, which removes only price barriers; I’ve quoted him on this before, so here he is again:
The best-known part of the BBB definition is that OA content must be free of charge for all users with an internet connection. However, the BBB definition doesn’t stop at free online access. It adds an extra dimension that isn’t as easy to describe, and consequently is often dropped or obscured. This extra dimension gives users permission for all legitimate scholarly uses. It removes what I’ve called permission barriers, as opposed to price barriers. The Budapest statement puts the extra dimension this way:
By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
The Bethesda and Berlin statements put it this way: For a work to be OA, the copyright holder must consent in advance to let users “copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship”.
All three tributaries of the mainstream BBB definition agree that OA removes both price and permission barriers. Free online access isn’t enough. “Fair use” (“fair dealing” in the UK) isn’t enough.
Having said all that, though, I’ll add that an explicit description of machine readability requirements would be an addition to the accepted definition of OA — and one that I would welcome. Peter Murray-Rust recently noted that, according to the “price and permission barriers” view of Open Access, PubMed isn’t OA — even PubMed Central isn’t OA.
I’ll go even further: can anyone point me to a single Open Access repository? I don’t know of even one such site that removes both price and permission barriers. Surely there must be some, but the Big Names (PubMed Central, arXiv, Cogprints, CiteSeer, RePEc, etc.; see ROAR) don’t seem to qualify, because the digital objects in these repositories carry their own individual copyrights rather than being covered by a blanket license provided by the repository.
Can this be true? Five years after the BBB definition came together, more than ten years since Stevan Harnad’s subversive proposal and on the first day of the NIH mandate — widely referred to as an OA mandate! — can it be that we really don’t have a single truly OA repository in all the world? And if it is true, would it help to make the official definition more explicitly machine-friendly?