Further to my complaints about the copyright thicket in which data are being lost, Charles W Bailey Jr points out that, in fact, it’s worse than that: a good deal of the potential functionality of existing Open Access archives is jammed up in the same thicket:
If… repositories could not be trusted, then libraries would have to attempt to archive the postprints in question themselves; however, since postprints are not by default under copyright terms that would allow this to happen (e.g., they are not under Creative Commons Licenses), libraries may be barred from doing so.
(Emphasis mine.) Charles is talking about the question of whether or not self-archiving of scholarly articles (the “green road” to Open Access) will cause libraries to cancel journal subscriptions. I touched on this issue in an earlier entry, and don’t want to revisit it here. What interests me here is the fact — which I initially had trouble grokking, as you’ll see if you read the comments on Charles’ entry, where he patiently explains it — that digital objects in Open Access repositories carry their own copyrights, rather than being covered by a blanket license provided by the repository. For instance, PubMed Central refers to Open Access (using the Bethesda Statement), and then says:
Note that this definition of open access goes beyond the simple free access that applies to all full-text content viewable directly in PubMed Central (PMC) from the National Institutes of Health (NIH).
A number of PMC journals make all or most of their contents available as open access publications. See the Open Access list for details.
So PMC is OAI-PMH-compliant, but contains digital objects that are not themselves Open Access. I suspect the same is also true of the majority of institutional and centralized repositories (though I only checked ePrintsUQ, arXiv.org and Cogprints, none of which make any mention of copyright at all).
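To make that concrete: OAI-PMH tells a harvester what is in a repository, but says almost nothing about what anyone may do with it. Here is a minimal sketch (in Python) of what a harvester actually gets back; the endpoint URL is my assumption rather than anything the repository documents here, and the expectation that dc:rights is usually empty or a bare copyright line is likewise my own observation, not a guarantee:

```python
# Minimal sketch of what an OAI-PMH harvester sees. The endpoint URL is an
# assumption -- substitute the base URL the repository actually publishes.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"  # assumed PMC endpoint
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# ListRecords with Dublin Core, the metadata format every OAI-PMH repository
# is required to support.
with urllib.request.urlopen(BASE + "?verb=ListRecords&metadataPrefix=oai_dc") as resp:
    tree = ET.parse(resp)

for record in tree.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    ident = record.findtext("oai:header/oai:identifier", namespaces=NS)
    rights = [r.text for r in record.findall(".//dc:rights", NS)]
    # dc:rights is often absent, or is just a "Copyright © ..." line --
    # nothing that tells a robot what it is actually allowed to do.
    print(ident, rights or "no rights statement")
```

The metadata comes back just fine; it is the permissions that are missing.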
To get an idea of what that actually means, take a careful look at this brief discussion by Peter Suber of the BBB definition of Open Access:
The best-known part of the BBB definition is that OA content must be free of charge for all users with an internet connection. However, the BBB definition doesn’t stop at free online access. It adds an extra dimension that isn’t as easy to describe, and consequently is often dropped or obscured. This extra dimension gives users permission for all legitimate scholarly uses. It removes what I’ve called permission barriers, as opposed to price barriers. The Budapest statement puts the extra dimension this way:
By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
The Bethesda and Berlin statements put it this way: For a work to be OA, the copyright holder must consent in advance to let users “copy, use, distribute, transmit and display the work publicly and to make and distribute derivative works, in any digital medium for any responsible purpose, subject to proper attribution of authorship”.
All three tributaries of the mainstream BBB definition agree that OA removes both price and permission barriers. Free online access isn’t enough. “Fair use” (“fair dealing” in the UK) isn’t enough.
Because each digital object carries its own copyright, e-print repositories do not remove permission barriers. Here’s Peter Suber again:
Permission barriers are more difficult to discuss than price barriers. First, there are many kinds of them, some arising from statute (copyright law), some from contracts (licenses), and some from hardware and software (DRM). They are not like prices, which differ only in magnitude. Second, their details are harder to discover and understand. Third, different users in different times, places, institutions, and situations can face very different permission barriers for the same work. Fourth, authors who deposit their articles in open-access archives bypass permission barriers even if they also publish the same articles in conventional journals protected by copyright, licenses, and DRM.
As far as I can tell, that fourth point is simply not true of any existing archives. If you want to do anything with an article in, say, PubMed Central, other than simply read it — if you want to copy it and distribute the copies, if you want to make a derivative work, if you want to pass it to text-mining or other software — you will have to determine, on an article-by-article basis, whether you are allowed to do that.
Take, for example, the following paper from the lab I work in, available free from PubMed Central:
Deletion of Mnt leads to disrupted cell cycle control and tumorigenesis.
Peter J. Hurlin, Zi-Qiang Zhou, Kazuhito Toyo-oka, Sara Ota, William L. Walker, Shinji Hirotsune, and Anthony Wynshaw-Boris
Right above the title on the linked page is a copyright notice: “Copyright © 2003 European Molecular Biology Organization”. The link provided goes to a PMC page which makes it very clear that an article’s presence in PMC tells you nothing about what rights the copyright holder(s) reserve or waive. Searching the EMBO site for “copyright” brings up nothing useful, but the EMBO Journal (which is actually part of Nature Publishing Group) has this to say:
Nature Publishing Group does not require authors of original research papers to assign copyright of their published contributions. Authors grant NPG an exclusive licence to publish, in return for which they can re-use their papers in their future printed work. NPG’s author licence page provides details of the policy and a sample form. Authors are encouraged to submit their version of the accepted, peer-reviewed manuscript to their funding body’s archive, for public release six months after publication. In addition, authors are encouraged to archive their version of the manuscript in their institution’s repositories (as well as on their personal web sites), also six months after the original publication.
Apart from the foul six-month embargo (Do you have any idea how many experiments I can do in six months? But I digress.), this seems reasonable, and it leaves permissions up to the authors. So “copyright EMBO” is misleading, and it’s likely that EMBO J authors, having deposited their articles, wish them to be fully Open Access. As it happens, in this case the corresponding author is my boss, so I can assure you that he knows about Open Access and is all in favour. The point, though, is that you have to dig around to find out that it’s up to Peter, and then you have to contact him to find out that he fully intends you to have the permissions you need. You are not going to be able to do that for more than a handful of papers; it certainly puts an effective brake on text-mining.
I think this brief example makes clear that, in practice, you cannot do anything much with repository content but read it (“fair use”, of course, still applies). You simply don’t have the time to uncover the necessary permissions for anything else. Which in turn means that there are no, or very few, actual Open Access repositories currently in existence.
I’ll say it again: e-print repositories do not provide Open Access. They provide free access to human eyes, one paper at a time; as the accepted definitions make clear, that’s not at all the same thing. Since self-archiving in such repositories is the current focus of many, if not most, efforts to provide 100% Open Access to the world’s scholarly literature, this is a big deal. There are two obvious solutions: (1) ignore the whole issue, or (2) start applying labels to digital objects.
In the short term and for individual researchers, solution 1 has considerable appeal. There’s even precedent: a recent study pointed out that patents do not slow research down much, mostly because researchers ignore them. The majority of e-prints are probably in a repository because their authors want Open Access; the likelihood of running afoul of copyright and actually being called to account for it seems pretty low. I think, however, that this head-in-the-sand approach is a very bad idea. What authors want is not always what counts, as when the copyright is actually owned by a publisher. I’ve been trying to think of the kinds of things you might do with a body of OA literature — build a text-mining robot that offers novel ways to look for deep connections between ideas and among data, make a local database of papers on your research specialty, and so on — but in fact, much of the point of Open Access is to make possible things I cannot think of. Look what the Web has made possible, and ask yourself: how much of that could I have predicted in 1991? It seems to me that anything which makes use of a substantial number of papers, or relies on being able to mine an entire corpus, runs the risk of being shut down or co-opted just when it starts to get interesting and useful. Suppose, for instance, that I write that text-mining robot: while I am using it to feed ideas into my own benchwork, I’m OK, but the minute I give that robot to someone (or, as is my preference, everyone) else, I run the risk of being sued for copyright violations.
This is the same risk that researchers are already running when using patented technology without a license; you are fine until you come up with something good, but then if the patent owner notices what you’ve done, you can be in trouble. “Trouble” means three things: legal sanctions, loss of the opportunity to profit from your invention, and removal of your invention from the commons. The first seems pretty unlikely from an individual perspective — what company is going to risk the PR nightmare of trying to recover fines from a researcher? — but substantially more worrisome for universities and other institutions. The “loss of profit” is of no interest to me; if I wanted to be rich I wouldn’t be a scientist. What really concerns me is the potential for patent/copyright owners to exert anti-commons, profit-taking control over research outcomes, and it’s this risk that makes the Ostrich Option unacceptable to me.
In the longer term, for community-minded researchers and especially for institutions (which are typically more wary of litigation than individual researchers, and, since Bayh-Dole, increasingly focused on profiting from research outcomes), solution 2 is a reasonable fix. In principle, OA repositories could include labels (that is, metadata) specifying which uses are explicitly permitted or prohibited, so search engine users and text-mining robots could search only that portion of the database that allows whatever uses they need. In fact, the Bethesda and Berlin definitions of OA both include the requirement for every OA article to carry an explicit label regarding permissions. Project RoMEO was intended to deal with precisely this issue, and produced (in addition to the valuable SHERPA/RoMEO database of publisher permissions for forward-looking authors) six surveys of the field and an XML-based implementation of the resulting rights management concepts, incorporating Creative Commons licenses. Unfortunately, there seems to have been zero uptake of the concepts or the technical implementation. As far as I can tell, there are no search interfaces that provide this kind of rights-based functionality, and every repository contains a mix of well-labelled, partially labelled and unlabelled objects. In addition, the body of scholarly work in relevant repositories is already so large that adding the necessary rights metadata is an enormous task, one which grows larger and more forbidding by the day (I might call this the “backlog problem”).
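To show what I mean by “search only that portion of the database”: here is a minimal sketch, assuming (hypothetically) that each harvested record carried a machine-readable rights label such as a Creative Commons licence URI. None of the structure below comes from RoMEO’s actual XML work; it is just an illustration of the filtering that labels would make possible:

```python
# Sketch of rights-aware filtering over labelled metadata. The record
# structure and licence URIs are illustrative, not taken from any
# existing repository or from the RoMEO implementation.
OPEN_LICENSE_PREFIXES = (
    "http://creativecommons.org/licenses/by/",
    "http://creativecommons.org/licenses/by-sa/",
    "http://creativecommons.org/publicdomain/",
)

records = [
    {"id": "oai:example:1", "rights": ["http://creativecommons.org/licenses/by/2.0/"]},
    {"id": "oai:example:2", "rights": ["Copyright © 2003 European Molecular Biology Organization"]},
    {"id": "oai:example:3", "rights": []},  # the common case today: no label at all
]

def allows_reuse(record):
    """True if the record's rights label points at a licence a robot can act on."""
    return any(r.startswith(OPEN_LICENSE_PREFIXES) for r in record["rights"])

# A text-mining robot would build its corpus from the labelled, permissive
# subset and skip everything whose permissions it cannot determine.
corpus = [r for r in records if allows_reuse(r)]
print([r["id"] for r in corpus])  # -> ['oai:example:1']
```

The point is that the filter itself is trivial once the labels exist; the hard part is getting labels onto the backlog.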
Nonetheless, the fundamental OA definitions include rights beyond simple reading access for good reason. As I discussed in my earlier entry, rights management is going to be at the heart of Open Data, and I have argued elsewhere that licensing and standards/metadata are also going to be crucial to bringing the “openness” of Open Access to science as a whole. I think the Open Science field is headed for some serious problems if permission barriers are not given more attention. I might concede that the most important thing to achieve right now is removal of access barriers to human eyeballs, but why make trouble for ourselves by — as seems to be happening — ignoring the rights issue? There’s no reason why the process of encouraging authors to self-archive, and building tools to make that easier, should not include information and tools that focus on rights management. At the very least, we should be making authors who are already on-side, who are self-archiving and using the SPARC Author Addendum and so on, aware of the issue — and giving them the tools to label their own papers with clear statements of the rights they wish to retain or waive. At least then the rate of growth of the backlog problem will begin to slow down, and should approach zero as we approach 100% OA (even on the green road) rather than continuing to grow unchecked.