Someone else is fooling around with numbers.

Via Peter Suber, I came across this editorial in the Journal of Vision:

Measuring the impact of scientific articles is of interest to authors and readers, as well as to tenure and promotion committees, grant proposal review committees, and officials involved in the funding of science. The number of citations by other articles is at present the gold standard for evaluation of the impact of an individual scientific article. Online journals offer another measure of impact: the number of unique downloads of an article (by unique downloads we mean the first download of the PDF of an article by a particular individual). Since May 2007, Journal of Vision has published download counts for each individual article.

The author goes on to compare downloads with citations (counts and rates, and downloads or citations over time). It’s a pretty good analysis of an important topic, but something vital is missing:

Where are the data? Can I have them? What can I do with them? [1]

In fact, the data are approximately available here. Why “approximately”? Well, I can get a range of predigested overviews: DemandFactor (roughly, downloads/day/first 1000 days) Top 20, total downloads Top 20, and article distributions by DemandFactor and total downloads. I can also get the download information for any given article, but only one article at a time, and once again predigested in the form of a graph from which I have to guesstrapolate if I want raw, reusable data.

This is disappointing, for both general and specific reasons. It’s always disappointing to see data locked away in a graph or a PDF or some similar digital or paper oubliette, there to languish un(re)used. It’s also disappointing to see a journal getting way out ahead of the curve on something as important and valuable as download metrics (is there another journal besides J Vis that provides this information, even predigested?), and then missing an opportunity to continue to innovate by providing real Open Data.

It’s also disappointing in this specific instance, because I have a question: why is Figure 1 plotted on a log scale and, more importantly, was the correlation coefficient calculated from log-transformed data? I could understand showing the log scale for aesthetic reasons, but I can’t think of a reason to take logs of that kind of data, and doing so can alter the apparent correlation. For instance, remember Fig 1 from this post? Here it is again, together with a plot of log-transformed data, each shown on natural and log scales:


I could answer my own question quickly and easily if I could get my hands on the underlying data — which leads me right back to one of the primary general arguments for Open Data. If I, statistical ignoramus and newcomer to these sorts of analyses, have questions after a brief skim through the paper, what questions might a better equipped and more thorough reader have? It’s simply not possible to know — the only way to find out is to make the data openly available!
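With raw per-article counts in hand, the check I have in mind would take only a few lines. Here is a rough sketch in Python of what I mean. It computes the Pearson correlation on the raw numbers and again after log-transforming them. The numbers are synthetic, made up purely for illustration (they are not J Vis data). The function names, the lognormal shape and the log(1 + x) transform for citations (so that zero citations don't break the logarithm) are all my own assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def compare_raw_and_log(downloads, citations):
    """Return (r on raw data, r on log-transformed data).

    Assumes every download count is positive; citations use log(1 + c)
    so that articles with zero citations can still be included.
    """
    r_raw = pearson_r(downloads, citations)
    r_log = pearson_r(
        [math.log(d) for d in downloads],
        [math.log1p(c) for c in citations],
    )
    return r_raw, r_log


# Synthetic, heavy-tailed stand-in data (NOT real journal data).
rng = random.Random(0)
downloads = [rng.lognormvariate(6, 1) for _ in range(200)]
citations = [d * rng.lognormvariate(-4, 1) for d in downloads]

r_raw, r_log = compare_raw_and_log(downloads, citations)
print(f"r on raw data:             {r_raw:.3f}")
print(f"r on log-transformed data: {r_log:.3f}")
```

The two coefficients will generally differ, which is the whole point: without the underlying numbers, a reader can't tell which one a paper is reporting.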
I realise it’s not possible for journals to demand Open Data from their authors. That’s what funder-level mandates are for, though there’s much discussion still to be had regarding whether Open Data mandates would be a good idea. Nonetheless, when journals publish analyses of their own data, it would be great to see them leading the way by providing unrestricted access to those data.
[1] Astute readers, both of you, will remember that refrain (or howl of anguish) from this post.
