Via Peter Suber, I came across an editorial about data sharing in The Scientist. I disagree with the author, PNNL’s Steven Wiley, on a number of points:
Despite the appeal of making all biological data accessible, there are enormous hurdles that currently make it impractical. For one, sharing all data requires that we agree on a set of standards. This is perhaps reasonable for large-scale automated technologies, such as microarrays, but the logistics of converting every western blot, ELISA, and protein assay into a structured and accessible data format would be a nightmare — and probably not worth the effort.
Wiley is making two mistakes here: setting the perfect against the good, and vastly underestimating human ingenuity.
Standards are inarguably required for automated sharing and essential for the sharing of ALL data, but that doesn’t mean that sharing SOME data, with evolving standards or even without any standards, has no utility. My pet example is the long-standing practice of supporting scientific claims with the phrase “data not shown” in peer-reviewed papers, something I think should no longer be allowed. All scientific claims should be supported by data. “Data not shown” belongs to the print era, when space was limited and distribution relied on physical reproduction and transport. This is the era of the online supplement, to which no such restrictions apply.
Reasonable people might contend that I am stretching the concept of “data sharing” to cover my pet peeve there, but I chose the example deliberately as an edge case: there is, to me, clear utility in that kind of data sharing, even though it involves no standards, only some data, and only eyeball-by-eyeball access (whereas I myself frequently argue that the greater part of the value of Open distribution probably lies in the long term, in machine-to-machine access). I argue that more sharing, using — despite their current flaws — evolving standards, is likely to yield significant dividends well before reaching the eventual goal of sharing all data using universal standards.
This leads me to the second mistake. It seems odd to me to insist that because standards are difficult to develop and implement, the bulk of such work is futile. The key is the phrase “currently… impractical”. The whole concept of the internet was probably considered “currently impractical” by a great many people, until someone went and built it. There are plenty of people still willing to pronounce Free/Open Source software “currently impractical”, even as they (perhaps unwittingly) rely on it every time they go online or send email. Then-existing hurdles at various times surely made business on the internet “currently impractical”, and banking on the internet “currently impractical”, and — need I go on?
Moreover, I am not the only one who disagrees about the value of creating standards for difficult-to-share data. If you think western blots would be a nightmare, how about biodiversity data — like, say, museum specimens? How about anthropometric data, exchangeable biomaterials, neuroscience data, electron micrographs, magnetic resonance images or microscopy images? The MIBBI project has dozens of other examples, the Open Biomedical Ontologies Foundry is working on dozens more, and Bioformats.org might offer a lightweight solution to some of the same problems.
(In re: Wiley’s specific examples: I was easily able to find efforts underway to enable sharing of gel electrophoresis data, protein affinity reagents and molecular interaction experiments; and I can’t imagine ELISA data being much harder to share than microarray information — surely MIAME, for instance, could readily be adapted if it wouldn’t already serve? I’m not sure what kind of protein assay Wiley has in mind.)
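To make the ELISA point concrete, here is a minimal sketch, in Python, of what a structured, shareable ELISA record and its validator might look like. The field names are entirely hypothetical, loosely inspired by MIAME-style “minimum information” checklists; no real standard is implied.

```python
# Toy sketch of a structured ELISA record. Field names are hypothetical,
# loosely modeled on "minimum information" checklists like MIAME; they do
# not correspond to any real standard.

REQUIRED_FIELDS = {
    "analyte",           # what was measured, e.g. a cytokine
    "capture_antibody",  # identifier for the capture reagent
    "sample_type",       # e.g. serum, cell culture supernatant
    "standard_curve",    # list of (concentration, absorbance) pairs
    "readings",          # raw absorbance values per well
}

def validate_record(record: dict) -> list:
    """Return a sorted list of required fields missing from an ELISA record."""
    return sorted(REQUIRED_FIELDS - record.keys())

example = {
    "analyte": "IL-6",
    "capture_antibody": "clone MQ2-13A5",
    "sample_type": "cell culture supernatant",
    "standard_curve": [(0.0, 0.05), (31.25, 0.21), (62.5, 0.40)],
    "readings": {"A1": 0.38, "A2": 0.41},
}

print(validate_record(example))  # [] -> nothing missing
```

The point is only that a handful of key/value fields plus a checklist validator already gets you machine-checkable, shareable ELISA data; the hard part is agreeing on the checklist, not the format.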
I cannot begin to imagine how to build semantic and exchange standards for those kinds of data, but I’m not about to bet against the people currently trying to do so; nor do I believe that, once built, their systems will prove to have been “not worth the effort”.
As I mentioned, reasonable people might disagree about various points above. But Wiley goes on to say:
Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes.
which is just plain wrong. Much of the rationale for data sharing, the engine of much of its promise, is the simple observation that you cannot know what someone else will do with your data, particularly when they have access to lots of other people’s data to go with it. Re-use beyond the scope of the original author’s imagination is a primary impetus for data sharing, and innovative examples abound; for instance, just take a look at Tony Hirst’s blog. (If there is a dearth of examples from biomedical research, I’d call that an argument in favor of more, not less, data sharing.)
“Can rarely be used” is an empirical claim, and those should be backed by data — and I can think of only one way to get the relevant data in this case.
Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results.
[…] In order to sufficiently control the experimental context to allow reliable data sharing, biologists would be forced to reduce the plethora of cell lines and experimental systems to a handful, and implement a common set of experimental conditions.
Experimental results are supposed to provide useful information about the world of sense-perception. If a result cannot be repeated by different hands in a different lab, then it is probably not telling us what we think it is telling us about the way the world works. If, on the other hand, a particular result does mean what we think it means about the underlying system, then we should be able to design different experiments to be carried out with different hands, conditions, equipment etc., and obtain data that supports the same conclusions. That’s what we call a robust result, and standard practice is to aim for robust results.
Regarding integration and comparison of results from different conditions — just what does meta-analysis mean, if not exactly that? As an example, if you were to knock Pin1 down in HeLa cells, you’d block their growth, but Pin1 knockout mice survive just fine. Comparison of those results is not only possible, but extremely interesting, and is the way we learned that mice have an active Pin1 isoform, Pin1L, which is present but potentially inactive in humans.
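Since meta-analysis does exactly the integration Wiley calls impractical, a toy sketch of its simplest form, fixed-effect inverse-variance pooling, may help. The numbers below are invented for illustration; each “lab” contributes an effect estimate and a variance reflecting its precision.

```python
# Toy sketch of fixed-effect meta-analysis: pooling effect estimates from
# labs that ran "the same" experiment under different conditions.
# All numbers are made up for illustration.

def pool_fixed_effect(estimates, variances):
    """Inverse-variance-weighted pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_var = 1.0 / sum(weights)  # pooled estimate is more precise than any input
    return pooled, pooled_var

# Three hypothetical labs reporting the same effect with different precision:
estimates = [0.80, 1.10, 0.95]
variances = [0.04, 0.09, 0.06]

pooled, var = pool_fixed_effect(estimates, variances)
print(round(pooled, 3), round(var, 4))  # 0.911 0.0189
```

Note that the pooled variance is smaller than any single lab’s: combining results obtained under different conditions is not merely possible, it is how precision is gained.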
I think that variation in conditions between labs is a good reason to build finer-grained semantic structures, but no reason at all to throw up our hands and give up on linked data.
Wiley goes on to give as his sole concrete example the lack of uptake into published papers of data from the Alliance for Cell (sic) Signaling. It’s actually the Alliance for Cellular Signaling1; their website lists 20 publications, NextBio finds 35 and Google Scholar (which covers a lot more than peer-reviewed papers) finds 440. Scholarly papers are a somewhat limited measure of research impact, but that’s not at first glance an impressive showing. Consider, though, that the AfCS was established in the late 1990s, which puts it well ahead of its time, and then compare the first, second and ongoing third decades of the undisputed poster child of data sharing2:
There’s more to Wiley’s choice of example, though:
In my own case, I am interested in the EGF receptor and receptor tyrosine kinases. This aspect of cell signaling was not covered in their dataset, and thus it is of no interest to me.
I wish I had a dollar for every time I’d heard an argument against some new idea that boils down to: “I can’t figure this out, or find a use for it myself; therefore it’s no good and will never be any use to anyone”. I’m sure there’s a pithy Latin name for this particular logical fallacy.
Wiley continues in, as it turns out, a similar vein:
And soon, discussions about the importance of sharing may become moot, since the rapid pace of technology development is likely to eliminate much of the perceived need for sharing primary experimental data. High throughput analytical technologies, such as proteomics and deep sequencing, can yield data of extremely high quality and can produce more data in a single run than was previously obtained from years of work. It will thus become more practical for research groups to generate their own integrated sets of data than try to stitch together disparate information from multiple sources.
And just what does the PNNL’s Biomolecular Systems Initiative (of which Wiley is director) do? By a strange coincidence, this:
advancing our high-resolution, high-throughput technologies by exploiting PNNL’s strengths in instrument development and automation and applying these technologies to solve large-scale biological problems….
We are building a comprehensive computational infrastructure that includes software for bioinformatics, modeling, and information management. To be more competitive in obtaining programmatic funding, we will continue to invest in new capabilities and technologies such as cell fractionation, affinity reagents, high-speed imaging, affinity pull downs, and ultra-fast proteomics. This will help us build world-class expertise in the generation and analysis of large, heterogeneous sets of biological data. The ability to productively handle extremely large and complex datasets is a distinguishing feature of the biology program at PNNL.
The remainder of this post is left as an exercise for the reader; be sure to cover the question of how less well-heeled institutions are supposed to carry out work in proteomics and deep sequencing and so on, and don’t forget to ask for evidence showing that it is not important to share data even between such high-fliers, since presumably they can extract every last conceivable piece of useful information from their own data…
1 You’d be amazed how many things share that acronym — activity-friendly communities, antibody-forming cells, ataxia functional composite scale, antral follicle count, alveolar fluid clearance, age at first calving, amniotic something something — that’s where I gave up. Why oh why can’t we have a decent text search? Even just “match case” would have solved much of my problem here. /rant
2 graph from here