Lazy reporter, no donut.

Dennis Carter in an eCampus News article about NPG’s Scitable:

Scitable’s January launch came as elite universities across the United States are embracing open-access formats–making research articles available for free online. This marks an abrupt departure from the traditional model of printing research articles in academic journals, which can cost campuses as much as $20,000 annually, open-access experts say.

So, is it the traditional model that can cost campuses up to $20K/yr, or academic journals, each of which can cost etc?
It’s only obvious that what is meant is $20K/yr per journal subscription if you already know that libraries spend millions of dollars per year on serials.
I’d expect a publication that wants you to register to read its content1 to bother making that content accurate and unambiguous.
1 Sure, registration is free. Registration also provides the publisher with a great bolus of immensely valuable marketing information, to say nothing of the slimy opt-out spam opportunity. Which is why I recommend ~~poisoning such databases with fake information~~ providing minimal information unless you get content that you really value from the site. (Two wrongs etc, hence the edit.)

Someone else is fooling around with numbers.

Via Peter Suber, I came across this editorial in the Journal of Vision:

Measuring the impact of scientific articles is of interest to authors and readers, as well as to tenure and promotion committees, grant proposal review committees, and officials involved in the funding of science. The number of citations by other articles is at present the gold standard for evaluation of the impact of an individual scientific article. Online journals offer another measure of impact: the number of unique downloads of an article (by unique downloads we mean the first download of the PDF of an article by a particular individual). Since May 2007, Journal of Vision has published download counts for each individual article.

The author goes on to compare download vs citation (counts and rates, and downloads or citations over time). It’s a pretty good analysis of an important topic, but something vital is missing:

Where are the data? Can I have them? What can I do with them?1

In fact, the data are approximately available here. Why “approximately”? Well, I can get a range of predigested overviews: a DemandFactor (roughly, downloads per day over an article’s first 1000 days) Top 20, a total-downloads Top 20, and article distributions by DemandFactor and by total downloads. I can also get the download information for any given article — one article at a time, and once again predigested in the form of a graph from which I have to guesstrapolate if I want raw, re-usable data.
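For concreteness, here is how I read the DemandFactor metric as a calculation; the function name and the handling of articles younger than 1000 days are my own assumptions, not J Vis’s published definition:

```python
# Sketch of J Vis's "DemandFactor" as I understand it: average
# downloads per day over an article's first 1000 days online.
# NOTE: this is my reconstruction; the name and the treatment of
# younger articles are assumptions, not the journal's definition.

def demand_factor(downloads: int, days_online: int) -> float:
    """Downloads per day over the first 1000 days (or fewer, for younger articles)."""
    window = min(days_online, 1000)
    return downloads / window

# e.g. 2500 downloads over an article's first 1000 days:
print(demand_factor(2500, 1000))
```

Even a trivial formula like this would be far more re-usable published as numbers than locked into a graph.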
This is disappointing, for both general and specific reasons. It’s always disappointing to see data locked away in a graph or a PDF or some similar digital or paper oubliette, there to languish un(re)used. It’s also disappointing to see a journal getting way out ahead of the curve on something as important and valuable as download metrics (is there another journal besides J Vis that provides this information, even predigested?), and then missing an opportunity to continue to innovate by providing real Open Data.
It’s also disappointing in this specific instance, because I have a question: why is Figure 1 plotted on a log scale and, more importantly, was the correlation coefficient calculated from log-transformed data? I could understand showing the log scale for aesthetic reasons, but I can’t think of a reason to take logs of that kind of data — and doing so can alter the apparent correlation. For instance, remember Fig 1 from this post? Here it is again, together with a plot of log-transformed data, both shown on natural and log scales:


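Absent the underlying data, the general effect is easy to demonstrate with entirely made-up numbers. The sketch below uses a synthetic, skewed relationship (my invention, not J Vis’s data) to show that Pearson’s r computed after log-transforming can differ substantially from r computed on the raw values:

```python
import numpy as np

# Entirely synthetic data, intended only to show that the Pearson
# correlation of log-transformed values can differ substantially
# from the correlation of the raw values.
rng = np.random.default_rng(0)

x = rng.uniform(0.0, 5.0, size=200)
# y depends exponentially on x, with multiplicative noise: the kind
# of skewed relationship that count data (downloads, citations) often show.
y = np.exp(x + rng.normal(0.0, 0.5, size=200))

r_raw = np.corrcoef(x, y)[0, 1]          # correlation of raw values
r_log = np.corrcoef(x, np.log(y))[0, 1]  # correlation after taking logs

print(f"r on raw data: {r_raw:.2f}")
print(f"r on log-transformed data: {r_log:.2f}")
```

Whether the transform inflates or deflates r depends on the shape of the data, which is precisely why I’d want the raw numbers before trusting a reported correlation.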
I could answer my own question quickly and easily if I could get my hands on the underlying data — which leads me right back to one of the primary general arguments for Open Data. If I, statistical ignoramus and newcomer to these sorts of analyses, have questions after a brief skim through the paper, what questions might a better equipped and more thorough reader have? It’s simply not possible to know — the only way to find out is to make the data openly available!
I realise it’s not possible for journals to demand Open Data from their authors — that’s what funder-level mandates are for, though there’s much discussion still to be had regarding whether Open Data mandates would be a good idea. Nonetheless, when journals publish analyses of their own data, it would be great to see them leading the way by providing unrestricted access to that data.
1 Astute readers, both of you, will remember that howl-of-anguish refrain from this post.

Why don’t we share data? Not for the reasons Steven Wiley thinks we don’t.

Via Peter Suber, I came across an editorial about data sharing in The Scientist. I disagree with the author, PNNL’s Steven Wiley, on a number of points:

Despite the appeal of making all biological data accessible, there are enormous hurdles that currently make it impractical. For one, sharing all data requires that we agree on a set of standards. This is perhaps reasonable for large-scale automated technologies, such as microarrays, but the logistics of converting every western blot, ELISA, and protein assay into a structured and accessible data format would be a nightmare — and probably not worth the effort.

Wiley is making two mistakes here: setting the perfect against the good, and vastly underestimating human ingenuity.
Standards are inarguably required for automated sharing and essential for the sharing of ALL data, but that doesn’t mean that sharing SOME data, with evolving standards or even without any standards, has no utility. My pet example is the long-standing practice of supporting scientific claims with the phrase “data not shown” in peer-reviewed papers, something I think should no longer be allowed. All scientific claims should be supported by data. “Data not shown” belongs to the print era, when space was limited and distribution relied on physical reproduction and transport. This is the era of the online supplement, to which no such restrictions apply.
Reasonable people might contend that I am stretching the concept of “data sharing” to cover my pet peeve there, but I chose the example deliberately as an edge case: there is, to me, clear utility in that kind of data sharing, even though it involves no standards, only some data, and only eyeball-by-eyeball access (whereas I myself frequently argue that the greater part of the value of Open distribution probably lies in the long term, in machine-to-machine access). I argue that more sharing, using — despite their current flaws — evolving standards, is likely to yield significant dividends well before reaching the eventual goal of sharing all data using universal standards.
This leads me to the second mistake. It seems odd to me to insist that because standards are difficult to develop and implement, the bulk of such work is futile. The key is the phrase “currently… impractical”. The whole concept of the internet was probably considered “currently impractical” by a great many people, until someone went and built it. There are plenty of people still willing to pronounce Free/Open Source software “currently impractical”, even as they (perhaps unwittingly) rely on it every time they go online or send email. Then-existing hurdles at various times surely made business on the internet “currently impractical”, and banking on the internet “currently impractical”, and — need I go on?
Moreover, I am not the only one who disagrees about the value of creating standards for difficult-to-share data. If you think western blots would be a nightmare, how about biodiversity data — like, say, museum specimens? How about anthropometric data, exchangeable biomaterials, neuroscience data, electron micrographs, magnetic resonance images or microscopy images? The MIBBI project has dozens of other examples, the Open Biomedical Ontologies Foundry is working on dozens more, and might offer a lightweight solution to some of the same problems.
(In re: Wiley’s specific examples: I was easily able to find efforts underway to enable sharing of gel electrophoresis data, protein affinity reagents and molecular interaction experiments; and I can’t imagine ELISA data being much harder to share than microarray information — surely MIAME, for instance, could readily be adapted if it wouldn’t already serve? I’m not sure what kind of protein assay Wiley has in mind.)
I cannot begin to imagine how to build semantic and exchange standards for those kinds of data, but I’m not about to bet against the people currently trying to do so; nor do I believe that, once built, their systems will prove to have been “not worth the effort”.
As I mentioned, reasonable people might disagree about various points above. But Wiley goes on to say:

Unfortunately, most experimental data is obtained ad hoc to answer specific questions and can rarely be used for other purposes.

which is just plain wrong. Much of the rationale for data sharing, the engine of much of its promise, is the simple observation that you cannot know what someone else will do with your data, particularly when they have access to lots of other people’s data to go with it. Re-use beyond the scope of the original author’s imagination is a primary impetus for data sharing, and innovative examples abound; for instance, just take a look at Tony Hirst’s blog. (If there is a dearth of examples from biomedical research, I’d call that an argument in favor of more, not less, data sharing.)
“Can rarely be used” is an empirical claim, and those should be backed by data — and I can think of only one way to get the relevant data in this case.
Wiley continues:

Good experimental design usually requires that we change only one variable at a time. There is some hope of controlling experimental conditions within our own labs so that the only significantly changing parameter will be our experimental perturbation. However, at another location, scientists might inadvertently do the same experiment under different conditions, making it difficult if not impossible to compare and integrate the results.
[…] In order to sufficiently control the experimental context to allow reliable data sharing, biologists would be forced to reduce the plethora of cell lines and experimental systems to a handful, and implement a common set of experimental conditions.

Experimental results are supposed to provide useful information about the world of sense-perception. If a result cannot be repeated by different hands in a different lab, then it is probably not telling us what we think it is telling us about the way the world works. If, on the other hand, a particular result does mean what we think it means about the underlying system, then we should be able to design different experiments to be carried out with different hands, conditions, equipment etc., and obtain data that supports the same conclusions. That’s what we call a robust result, and standard practice is to aim for robust results.
Regarding integration and comparison of results from different conditions — just what does meta-analysis mean, if not exactly that? As an example, if you were to knock Pin1 down in HeLa cells, you’d block their growth, but Pin1 knockout mice survive just fine. Comparison of those results is not only possible, but extremely interesting, and is the way we learned that mice have an active Pin1 isoform, Pin1L, which is present but potentially inactive in humans.
I think that variation in conditions between labs is a good reason to build finer-grained semantic structures, but no reason at all to throw up our hands and give up on linked data.
Wiley goes on to give as his sole concrete example the lack of uptake into published papers of data from the Alliance for Cell (sic) Signaling. It’s actually the Alliance for Cellular Signaling1; their website lists 20 publications, NextBio finds 35, and Google Scholar (which covers a lot more than peer-reviewed papers) finds 440. Scholarly papers are a somewhat limited measure of research impact, but that’s not at first glance an impressive showing. Consider, though, that the AfCS was established in the late 1990s, which puts it well ahead of its time, and then compare the first, second and ongoing third decades of the undisputed poster child of data sharing2:


There’s more to Wiley’s choice of example, though:

In my own case, I am interested in the EGF receptor and receptor tyrosine kinases. This aspect of cell signaling was not covered in their dataset, and thus it is of no interest to me.

I wish I had a dollar for every time I’d heard an argument against some new idea that boils down to: “I can’t figure this out, or find a use for it myself; therefore it’s no good and will never be any use to anyone”. I’m sure there’s a pithy Latin name for this particular logical fallacy.
Wiley continues in, as it turns out, a similar vein:

And soon, discussions about the importance of sharing may become moot, since the rapid pace of technology development is likely to eliminate much of the perceived need for sharing primary experimental data. High throughput analytical technologies, such as proteomics and deep sequencing, can yield data of extremely high quality and can produce more data in a single run than was previously obtained from years of work. It will thus become more practical for research groups to generate their own integrated sets of data than try to stitch together disparate information from multiple sources.

And just what does the PNNL’s Biomolecular Systems Initiative (of which Wiley is director) do? By a strange coincidence, this:

advancing our high-resolution, high-throughput technologies by exploiting PNNL’s strengths in instrument development and automation and applying these technologies to solve large-scale biological problems….
We are building a comprehensive computational infrastructure that includes software for bioinformatics, modeling, and information management. To be more competitive in obtaining programmatic funding, we will continue to invest in new capabilities and technologies such as cell fractionation, affinity reagents, high-speed imaging, affinity pull downs, and ultra-fast proteomics. This will help us build world-class expertise in the generation and analysis of large, heterogeneous sets of biological data. The ability to productively handle extremely large and complex datasets is a distinguishing feature of the biology program at PNNL.

The remainder of this post is left as an exercise for the reader; be sure to cover the question of how less well-heeled institutions are supposed to carry out work in proteomics and deep sequencing and so on, and don’t forget to ask for evidence showing that it is not important to share data even between such high-fliers, since presumably they can extract every last conceivable piece of useful information from their own data…
1You’d be amazed how many things share that acronym — activity-friendly communities, antibody-forming cells, ataxia functional composite scale, antral follicle count, alveolar fluid clearance, age at first calving, amniotic something something — that’s where I gave up. Why oh why can’t we have a decent text search? Even just “match case” would have solved much of my problem here. /rant
2 graph from here

Fooling around with numbers, part 5b.

I’ve already assigned part 6 to a particular analysis in an effort to get me to actually do that work, but I felt that I just had to include this (via John Wilbanks) in the series:


I’m just sayin’. (I may have to get that graph as a tattoo.)

P.S. Never mind the date, this is not a trick; I hate online April Fool jokes with the fiery power of a thousand burning suns.