Where are the data? Can I have them? What can I do with them?

There’s a new subversive proposal in town.  The original was Stevan Harnad’s landmark call for self-archiving of the scientific (“esoteric”) literature (see here for a ten-year update, and here for context).  Now, 12 years later, Open Access is gathering momentum and forward-looking advocates of knowledge as a public good are thinking about Open Data (some extra background here).  Peter Murray-Rust recently stepped up with a subversive proposal of his own:

The simplest thing that researchers can do [to promote Open Data] is to add a Creative Commons license to their data. It costs nothing, is a simple cut-and-paste, and could be trivially made a template in any data production tool. [...]
I think the effect of this would be dramatic. Scientists would start to see these messages and think: “Why should I give these data to the publisher?” And if the publisher simply adds a copyright notice saying “all these data are copyright the publisher – you cannot use them for X, Y, Z without permission” this would be in violation of the authors’ license. The author would have to deliberately remove this statement to hand over the IPR to the publisher.

I think Peter’s proposal is a good one, similar in form and effect to the SPARC author addendum.  Importantly, Science Commons also offers author addenda, and will soon offer them in the machine-, human- and lawyer-readable versions that come with all Creative Commons licenses; as Peter notes, the machine-readable version is crucial to full Open Data utility.  Use of the proposed Open Data addendum (in combination, where necessary, with an Open Access addendum) would clarify the legal status of an author’s data, provided we get the wording right.  Herewith some thoughts on how to do that, based on the questions in the title.

First, note that papers do not usually contain raw (useful, useable) data. They contain, say, graphs made from such data, or bitmapped images of it — as Peter says, the paper offers hamburger when what we want is the original cow.  Chris Surridge of PLoS puts it this way:

A figure in a paper is a way of representing the raw data in such a way to best illustrate the point the author is making. A figure then is the product of an operation upon the raw data, and that operation results in a loss of information.
The raw data could have been presented in a host of different ways possibly supporting other conclusions not thought of by the author. Equally if a reader had raw data compatible with that the author obtained wouldn’t it be useful if it could be processed in the same way for comparison? Wouldn’t it be much better for readers to have access not only to the figures in a paper but also to the underlying data and the transform that created it. In this way no information, neither implicit nor explicit, is lost.

So if authors want to make their data openly and usefully available, they will need to host it themselves or find someone to host it for them.  Many journals will host supplementary information, and many institutional repositories will take datasets as well as manuscripts.  I have been saying for some time that it should by now be de rigueur to make one’s raw data available with each publication. This is very rarely done — even supplementary information, when I have come across it, tends to be of the hamburger-rather-than-cow variety and so not very useful.  (The situation speaks sad volumes about the emphasis on competition over cooperation within the scientific community and, perhaps in many cases, about the quality of the raw data in question, if only one were ever able to see it; but I digress.)  Thus an effective Open Data addendum will first have to answer the question: where are the data?

Second, there is the issue of licensing (“Can I have them?  What can I do with them?”).  In comments on Peter’s proposal, Jonathan Eisen observes that publishing in Open Access journals should provide open access to data as well.  Peter replies that this is not always the case and points to Molbank as a problematic example, because they require a copyright transfer and it is simply not clear what rights they claim over raw data.  In fact, the situation is even worse.  In the same entry, Peter points approvingly to the BioMed Central OA charter, which is based on the Bethesda Statement:

Every peer-reviewed research article appearing in any journal published by BioMed Central is ‘open access’, meaning that:
  1. The article is universally and freely accessible via the Internet, in an easily readable format and deposited immediately upon publication, without embargo, in an agreed format – current preference is XML with a declared DTD – in at least one widely and internationally recognized open access repository (such as PubMed Central).
  2. The author(s) or copyright owner(s) irrevocably grant(s) to any third party, in advance and in perpetuity, the right to use, reproduce or disseminate the research article in its entirety or in part, in any format or medium, provided that no substantive errors are introduced in the process, proper attribution of authorship and correct citation details are given, and that the bibliographic details are not changed. If the article is reproduced or disseminated in part, this must be clearly and unequivocally indicated.

But what does that mean for Open Data?  Take any paper in any BMC journal: where are the data?  Can I have them?  What can I do with them?  It’s true but it’s simply not enough that, having published in BMC, the authors are probably amenable to giving me the data and allowing me to do with them as I please.  I need unfettered access to the data at the same time as I access the paper.  Even as a human I don’t have time to chase down permission for every dataset I want to re-use, and if I’m data-mining by web crawler I need machine-readable licenses that tell my robot what it can have.  Policies regarding data and materials are journal-specific within the BMC group, but I browsed a few and it seems they all use a standard template, which includes the following:

Submission of a manuscript to [BMC Journal in question] implies that readily reproducible materials described in the manuscript, including all relevant raw data, will be freely available to any scientist wishing to use them for non-commercial purposes. Nucleic acid sequences, protein sequences, and atomic coordinates should be deposited in an appropriate database in time for the accession number to be included in the published article. In computational studies where the sequence information is unacceptable for inclusion in databases because of lack of experimental validation, the sequences must be published as an additional file with the article. [There follows a list of databases that can be used to deposit nucleotide and protein sequences and structures, chemical structures and assays, microarray data, computer models and plasmids.]

Note though that these policies are not strict demands, and I’ll bet they are not policed in any way.  I think most journals include similar language in their instructions to authors, and have done for some time, but we still do not have widespread Open Data.  Further, the actual BMC license (which BMC says is identical to the Creative Commons Attribution License) refers only to “the work” which it defines as “the copyrightable work of authorship offered under the terms of this License”.  That seems to me to allow an interpretation that excludes data, which sit in the grey zone between creative works that can be copyrighted and, er, things (like gene sequences and chemical structures of drugs) that can be patented.

So how about Public Library of Science and Hindawi, the other major OA publishers?  Well, Hindawi seems to say nothing about data whatsoever, only that authors retain copyright and articles are published under a CC Attribution license.  PLoS also publishes everything under a CC Attribution license, which says nothing about data, but if you dig a bit you find encouraging things in the editorial/publishing policies:

Publication is conditional upon the agreement of authors to make freely available any materials and information associated with their publication that are reasonably requested by others for the purpose of academic, noncommercial research.

Data Availability
Open access applies to both the scientific literature and the data used to establish that literature. Publication is contingent on making data integral to a manuscript freely available without restriction, provided that appropriate attribution is given and that suitable mechanisms exist for sharing the data used in a manuscript.
  1. Data for which public repositories have been established that are in general use should be deposited before publication, and the appropriate accession numbers or digital object identifiers published with the paper.
  2. If an appropriate repository does not exist, data should be provided as supporting information with the published paper. If this is not practical, data should be made freely available upon reasonable request.
  3. The conclusions of a study must not be dependent solely on the analysis of proprietary data. If proprietary data were used to reach a conclusion, and the authors are unwilling or unable to make these data public, then the paper must include an analysis of public data that validates the conclusions so that others can reproduce the analysis and build on the findings.

Note that any restrictions on the availability or on the use of datasets might be judged to diminish the significance of a paper and will therefore influence the decision about whether a paper should be published. These policies have been developed in accordance with the principles established in Sharing Publication-Related Data and Materials (National Academies Press, 2003).

That’s better, stronger language — but why is there no mention of data in the actual license, and why is there a need for warnings about restrictions that “might be judged to diminish the significance, etc” if publication is truly conditional on open access to data?  I suspect another toothless tiger.  It’s not that I want the tiger to have teeth, that is, for journals to actively police data availability, but that I wonder why I have to go digging around the website just to find this wishy-washy nod in the general direction of Open Data.  To illustrate my point here, suppose I read a paper in PLoS Biology, and I want to get my hands on some raw data from that paper: where are they?  Can I have them?  What can I do with them?  All of these things are, basically, left up to the authors. 

Now remember that these highly unsatisfactory examples are drawn from the most prominent Open Access publishing houses, which might be expected to be much more supportive of Open Data than traditional commercial publishers.  Thus the power of Peter’s Open Data addendum becomes apparent: it is attached directly to the paper, so readers do not have to go hunting through journal websites to find out the intellectual property status and location of interesting datasets.  It allows authors to take control.

To be effective, then, an Open Data addendum must at least answer my opening questions: it must point to the online, freely accessible location of the raw, un-hamburgered data; it should make clear that yes, you can have them; and it should state clearly what you can do with them.  The last question probably requires the creation of multiple addenda, since some people (like Jonathan Eisen) will want to effectively copyleft their data, whereas others will prefer less restrictive licenses.  My preferred answer is “anything you want, so long as you do not remove information or materials from the scientific commons”.

So, finally, let me take a stab at a draft Open Data addendum.  This is largely copied from the SPARC author addendum, and my idea is that it should, like (and if necessary with) the SPARC addendum, be submitted to the publisher together with their publication agreement.


THIS ADDENDUM hereby modifies and supplements the attached Publication Agreement concerning the following Article:

[manuscript title]

and the following Raw Data from which the Article was prepared:

[list of data sets, including permanent web address/es from which they can be obtained]

The parties to the Publication Agreement and to this Addendum are:

[list of authors, indicating corresponding author] (individually, or if more than one author, collectively, the Author), and

[name of publisher] (the Publisher).


The parties agree that wherever there is any conflict between this Addendum and the Publication Agreement, the provisions of this Addendum are paramount and the Publication Agreement shall be construed accordingly.  Notwithstanding any terms in the Publication Agreement to the contrary, AUTHOR and PUBLISHER agree as follows:

1. Author’s Retention of Rights. In addition to any rights under copyright retained by Author in the Publication Agreement, Author retains all rights to the Raw Data underlying the Article, including but not limited to: (i) the rights to reproduce, distribute and publicly display the Raw Data in any medium; and (ii) the right to authorize others to make any use of the Raw Data so long as Author receives credit as author and the journal in which the Article has been published is cited as the source of first publication of the Article and Raw Data.

2. Licensing of Raw Data.  Author hereby releases the Raw Data under the terms of a Creative Commons Attribution Share-Alike License [or insert whatever license you prefer], where “the work” is understood to mean the data sets listed above.  Publisher agrees to include in the Article this statement of licensing terms and the above list of data sets and web address/es from which they can be freely obtained.

3. Publisher’s Acceptance of this Addendum. Author requests that Publisher demonstrate acceptance of this Addendum by signing a copy and returning it to the Author. However, in the event that Publisher publishes the Article in the journal identified herein or in any other form without signing a copy of the Addendum, Publisher will be deemed to have assented to the terms of this Addendum.

That’s not perfect, not by a long shot — most especially not for automated data mining, which requires machine-readable metadata and data. It should, however, do what Peter suggests: provide some relief from endless rounds of find-the-permissions, and get a much-needed conversation underway.
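To make the machine-readable point a little more concrete, here is a minimal sketch in Python of how the contents of such an addendum might be written down so that a crawler could answer the three title questions without a human in the loop. Everything in it is an assumption for illustration: the class and field names (`OpenDataAddendum`, `DataSet`, `attribution_required`, `share_alike`) are hypothetical, the dataset address is a placeholder, the license version is an arbitrary choice, and this is not the format used by Creative Commons, Science Commons or any publisher.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

# All names below are hypothetical; this is not an established schema.


@dataclass
class DataSet:
    title: str
    url: str  # permanent address from which the raw data can be obtained


@dataclass
class OpenDataAddendum:
    article_title: str
    authors: List[str]
    license_url: str            # address of the license chosen by the author
    attribution_required: bool
    share_alike: bool
    datasets: List[DataSet]

    def to_json(self) -> str:
        # asdict() also converts the nested DataSet entries to plain dicts.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "OpenDataAddendum":
        raw = json.loads(text)
        raw["datasets"] = [DataSet(**d) for d in raw["datasets"]]
        return cls(**raw)


def answers(record: OpenDataAddendum) -> dict:
    """Answer the three title questions from the record alone."""
    conditions = []
    if record.attribution_required:
        conditions.append("attribution")
    if record.share_alike:
        conditions.append("share-alike")
    return {
        # Where are the data?
        "where": [d.url for d in record.datasets],
        # Can I have them? Only if there is something listed and a license.
        "can_i_have_them": bool(record.datasets) and bool(record.license_url),
        # What can I do with them? Anything, subject to these conditions.
        "conditions": conditions,
    }


# Illustrative record with placeholder values.
example = OpenDataAddendum(
    article_title="[manuscript title]",
    authors=["[corresponding author]"],
    license_url="https://creativecommons.org/licenses/by-sa/4.0/",
    attribution_required=True,
    share_alike=True,
    datasets=[DataSet("Run 01", "https://example.org/data/run-01.csv")],
)

if __name__ == "__main__":
    print(example.to_json())
    print(answers(example))
```

The design choice here is simply that the same three facts the human-readable addendum states (location, permission, conditions) are carried as explicit fields, so a robot does not have to parse legal prose to find them.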

5 thoughts on “Where are the data? Can I have them? What can I do with them?”

  1. Great discussion. With the current journal publishing system, at least in organic chemistry, the raw data are not actually available through the publisher and thus are not part of that copyright. Supplementary data rarely contain actual raw data. The last time I looked up Supplementary Info in JACS, all I got was a table of yields. What I wanted was NMR and chromatography data. For it to be really useful you have to be able to interact with the raw data in the same way that the researcher did to arrive at their conclusions. For example:
    The way I look at it, I may filter, summarize and integrate the raw data when I publish in a journal but I never give it away. I don’t hand over the copyright to my lab notebooks. But I can reference a lab notebook page to support a statement in my article, including the experimental section. Using a wiki makes it easy to reference a specific page version. At least that is what I will try. I’ll let you know if the editors find that acceptable. If not then I think we’ve come to a fork in the road of scientific communication.

  2. I really think this is a great idea and is truly in the spirit of Open Science. There have been a variety of attempts to do this for some types of data that still need more work. For example, for DNA sequencing, Genbank is considered by many to be this raw data database for sequences. While Genbank is a phenomenal resource, it does not actually contain the raw data on sequences. The raw data comes in the form of the results of sequencing experiments themselves (e.g., electropherograms). The data in Genbank is a model/interpretation of the raw data. The distinction here is important as not all the bases in a DNA sequence in Genbank are of equal quality. And when you are using sequencing to identify subtle differences within or between species, the quality is really important. There is a place for the raw data for DNA sequencing – it is called the Trace Archive and it is something NCBI has set up. But not everyone deposits their data there. But at least there is a place for it (and the journals never ask for this to be put in Supplemental Data). But they ask for all sorts of things to be put there and I did not realize until your blog how this is in a way an insidious idea of the journals. Not only do they get the rights to the paper which they do not deserve, they sometimes get control over the raw data. Keep pushing this – I will try to help.

  3. Interesting discussion of Open Data. I like CC licenses, but they may not work well for “data”.
    Many scientific projects collect pretty “objective” factual data with little expressive content. Copyright only protects expressions, and since Creative Commons licenses only apply to copyright protected works, these licenses won’t work with many scientific datasets.
    A CC license won’t give contributors a legal assurance that they will be cited and attributed for data. Even if a dataset is pretty “expressive” (copyrightable and CC licensed), simple attribution may not be enough. Papers are the currency of achievement, not raw data.
    Making raw data available may put you at a competitive disadvantage, especially if you are less certain about being attributed for it.
