Fooling around with numbers, part 3; or, why would anyone pay for these journals?

Following on from part 2, I thought I’d ask a couple more questions about price-per-use, based on the online usage stats in the UCOSC dataset. I started on this because I noticed that in Fig 2 of part 2, I’d missed a point: there is an even-further-out outlier above the Elsevier set I pointed out:
It’s another Elsevier journal, Nuclear Physics B. In 2003, only 1,001 online uses were reported to UC by the publisher, but the 2004 list price was $15,360. The companion journal Nuc Phys A is not much better, $10,121 for 1,198 uses. Compare that with Nature, 286,125 uses at just $1,280!
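The price-per-use arithmetic behind those comparisons is just list price divided by reported uses; a quick sketch using the numbers above:

```python
# Price per use = 2004 list price / 2003 reported online uses,
# for the three journals mentioned above.
journals = {
    "Nuclear Physics B": (15360, 1001),
    "Nuclear Physics A": (10121, 1198),
    "Nature": (1280, 286125),
}

price_per_use = {name: price / uses for name, (price, uses) in journals.items()}
for name, ppu in sorted(price_per_use.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ${ppu:.2f} per use")
```

That works out to roughly $15 per use for Nuc Phys B and about $8.50 for Nuc Phys A, versus well under a cent per use for Nature.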
It gets worse, too, because I’m led to believe that anything that appears in a physics journal these days is available ahead of time from the arXiv. I tried to confirm that for Nuc Phys B, but either I’m missing something or the arXiv search function is totally for shit, so I couldn’t do it systematically. I did go through the latest table of contents (Vol 813 issue 3) on the Science Direct page, and was easily able to find every paper in the arXiv — mostly just by searching on author names, though in a couple of cases I had to put titles into Google Scholar. Still, they were all there, which leads me to wonder why any library would buy Nuc Phys B (or Nuc Phys A, assuming it’s also covered by the arXiv). Prices haven’t improved in the intervening 5 years, either:
[I had a table here but Movable Type keeps munging it. Piece of shit. Here’s a jpg until I sort it.]

That got me wondering how the rest of the journals are distributed by price/use and publisher:
The inset shows a zoomed view but even that wasn’t particularly informative, so I zoomed in a bit further:
The curve fits are for the whole of each dataset, even though it’s a zoomed view; the Nature set excludes British Journal of Pharmacology, the only NPG title that recorded 0 uses, and Nature itself. Colour coding by publisher is the same for each figure in this post. As in part 2, the correlation between price and use is weak at best and doesn’t change much from publisher to publisher. Also, each publisher subset shows a stronger correlation than the entire pooled set — score another one for Bob O’Hara’s suggestion that finer-grained analyses of this kind of data are likely to produce more robust results. Since cutoffs improved the apparent correlation for the pooled set, I tried that with the publisher subsets:
As in part 2, with uses restricted to 5000 or fewer there was improvement in price/use correlation in most cases, but nothing dramatic; I’m not sure why the Blackwell fit got worse. The Nature subset is close to being able to claim at least a modest fit to a straight line there, so not only does NPG boast some of the lowest prices and highest use rates, they are the closest of all the publishers to pricing their wares according to (at least one measure of) likely utility. Special note to Maxine Clarke, remember this post next time I tee off on Nature! 🙂
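To make Bob’s finer-grained point concrete: a correlation computed over the pooled set can look quite different from the correlations within each publisher subset. A toy sketch (the data here are invented, not the UCOSC numbers):

```python
import numpy as np

# Invented toy data: two "publishers" with very different price scales.
data = {
    "PubA": ([100, 200, 300, 400], [50, 120, 160, 240]),
    "PubB": ([1000, 2000, 3000, 4000], [300, 800, 1100, 1600]),
}

# Pooled correlation across everything...
prices = sum((p for p, _ in data.values()), [])
uses = sum((u for _, u in data.values()), [])
pooled_r = np.corrcoef(prices, uses)[0, 1]

# ...versus one correlation per publisher subset.
per_pub_r = {name: np.corrcoef(p, u)[0, 1] for name, (p, u) in data.items()}
print(pooled_r, per_pub_r)
```

When the subsets sit at different price scales, the pooled number is dominated by the between-publisher differences rather than the within-publisher relationship — which is exactly why the finer-grained analysis is more informative.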
Next, I broke the data out into intervals (for clarity the labels say 0-1, 1-2 etc, but the actual intervals used were 0-0.99, 1-1.99 etc):
Now it seems that we’re looking at some kind of long-tailed distribution, which is hardly surprising. The majority of the titles fall into the first few price/use intervals, say less than about $6/use. Since most pay-per-view article charges are between $25 and $40, I more-or-less arbitrarily picked $30/use as a cutoff and asked how many titles from each publisher fall above that cutoff, and what proportion of the total expenditure (viz, list price sum) does that represent? The inset shows that 161 titles, most of them from Kluwer and Springer (whose figures I combined because Springer bought most of Kluwer’s titles sometime after 2003), account for about 5% of the total in list price terms. That was a bit more useful, so I expanded it to ask the same question for each interval:
What becomes apparent now, I think, is that the UC librarians are doing a good job! Only 6% of the total number of journals (5% of the total list price cost) fall into the “more than $30/use” category, of which it could reasonably be said that the library might as well drop the subscription and just cover the pay-per-view costs of their patrons. Only a further 15% or so work out to more than $6/use, and around 80% of the collection (figured as titles or cost) comes in under $6/use, with around 30% less than $1/use.
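The bookkeeping behind the cutoff question is simple; a sketch of the $30/use case with invented numbers (not the UCOSC data):

```python
# Invented numbers, not the UCOSC data: how many titles exceed $30/use,
# and what share of total list-price expenditure do they represent?
titles = [  # (list_price, online_uses)
    (15360, 1001), (1280, 286125), (500, 10), (2000, 4000), (800, 20),
]

CUTOFF = 30.0  # dollars per use, roughly a pay-per-view article charge
over = [(p, u) for p, u in titles if u > 0 and p / u > CUTOFF]
n_over = len(over)
cost_share = sum(p for p, _ in over) / sum(p for p, _ in titles)
print(f"{n_over} titles over cutoff, {cost_share:.0%} of total list price")
```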
So, are these reasonable prices — $1 per use, $6 per use? I’m not sure I can answer that, but I’ll try to say something about it, using the UCOSC dataset, in Part 4.

Peters Murray-Rust and Sefton on “science and selfishness”

Peter Murray-Rust (welcome back to blogging!) has replied to Glyn Moody’s post about semantic plugins being developed by Science Commons in collaboration with the Evil Empire, which I discussed in my last post. Peter MR takes the view, with which I concur, that it’s more important to get scientists using semantic markup than to take an ideological stand against Microsoft:

Microsoft is “evil”. I can understand this view – especially during the Halloween document era. There are many “evil” companies – they can be found in publishing (?PRISM), pharmaceuticals (where I used to work; Constant Gardener), petrotechnical, scientific software, etc. Large companies often/always? adopt questionable practices. [I differentiate complete commercial sectors – such as tobacco, defence and betting where I would have moral issues]. The difficulty here is that there is no clear line between an evil company and an acceptable one.
The monopoly exists and nowhere more than in in/organic chemistry where nearly all chemists use Word. We have taken the view that we will work with what scientists actually use, not what we would like them to use. The only current alternative is to avoid working in this field – chemists will not use Open Office.

Another, to my mind even more important, point was raised by Peter Sefton in a comment on Peter MR’s entry:

I will have to talk about this at greater length but I think the issue is not working with Microsoft it’s working in an interoperable way. The plugins coming out of MS Research now might be made by well meaning people but unless they encode their results in something that can interop with other word processors (the main one is OOo Writer) then the effect is to prolong the monopoly. There is a not so subtle trick going on here – MS are opening up the word processing format with one hand while building addons like the Ontology stuff and the NLM work which depend on Word 2007 to work with the other hand. I have raised this with Jim Downing and I hope you can get a real interop on Chem4Word.

(Peter S, btw, blogs here and works on a little thing called The Integrated Content Environment (ICE), which looks to me like a good candidate for an ideal Electronic Lab Notebook…)
There’s a difference between the plugins being Open Source and the plugins being useful to the F/OSS community. If collaborators hold Microsoft to real interoperability, the “Evil Empire” concerns largely go away, because the project can simply fork to support any applications other than Word.
(I’ve emailed John Wilbanks to get his reaction to all this, but be patient because he’s insanely busy in general, and right now he’s on honeymoon!)

On science and selfishness.

Glyn Moody has a nice post up about fraternizing with the enemy in Open Science; you should read the whole thing, but here’s the gist:

One of the things that disappoints me is the lack of understanding of what’s at stake with open source among some of the other open communities. For example, some in the world of open science seem to think it’s OK to work with Microsoft, provided it furthers their own specific agenda. Here’s a case in point:

John Wilbanks, VP of Science for Creative Commons, gave O’Reilly Media an exclusive sneak preview of a joint announcement that they will be making with Microsoft later today at the O’Reilly Emerging Technology Conference. […] Microsoft will be releasing, under an open source license, Word plugins that will allow scientists to mark up their papers with scientific entities directly.

That might sound fine – after all, the plugins are open source, right? But no. Here’s the problem:

Wilbanks said that Word is, in his experience, the dominant publishing system used in the life sciences [and] probably the place that most people prepare drafts. “almost everything I see when I have to peer review is in a .doc format.”

In other words, he doesn’t see any problem with perpetuating Microsoft’s stranglehold on word processing. But it has consistently abused that monopoly […]
Working with Microsoft on open source plugins might seem innocent enough, but it’s really just entrenching Microsoft’s power yet further in the scientific community […]
It would have been far better to work with OpenOffice.org to produce similar plugins, making the free office suite even more attractive, and thus giving scientists yet another reason to go truly open, with all the attendant benefits, rather than making do with a hobbled, faux-openness, as here.

Let me say upfront that I mostly agree with Glyn here. Scientists should be at the forefront of abandoning closed for Open wherever possible, because in the long term Open strategies offer efficiencies of operation and scale that closed, proprietary solutions simply cannot match.
Having said that — and most expressly without wishing to put words into John Wilbanks’ mouth — my response to Glyn’s criticism is that I think he (Glyn) is seriously underestimating the selfish nature of most scientists. Or if you want to be charitable, the intense pressure under which they have to function. Let me unpack that:
Glyn talks about making Open Office more attractive and providing incentives for scientists to use Open solutions, but what he may not realize is that incentives mostly don’t work in that tribe. Scientists will do nothing that doesn’t immediately and obviously contribute to publications, unless forced to do so. Witness the utter failure of Open Access recommendations, suggestions and pleas vs the success of OA mandates. These are people who ignore carrots; you need a stick, and a big one.
For instance: I use Open Office in preference to Word because I’m willing to put up with a short learning curve and a few inconveniences, having (as they say here in the US) drunk the Open Kool-Aid. But I’m something of an exception. Faced with a single difficulty, one single function that doesn’t work exactly like it did in Word, the vast majority of researchers will throw a tantrum and give up on the new application. After all, the Department pays for the Word license, so it’s there to be used, so who cares about monopolies and stifling free culture and all that hippy kum-ba-yah crap when I’ve got a paper to write that will make me the most famous and important scientist in all the world?
The last part is a (slight) exaggeration, but the tantrum/quit part is not. Researchers have their set ways of doing things, and they are very, very resistant to change — I think this might be partly due to the kind of personality that ends up in research, but it’s also a response to the pressure to produce. In science, only one kind of productivity counts — that is, keeps you in a job, brings in funding, wins your peers’ respect — and that’s published papers. The resulting pressure makes whatever leads to published papers urgent and limits everything else to — at best — important; and urgent trumps important every time. Remember the old story about the guy struggling to cut down a tree with a blunt saw? To suggestions that his work would go faster if he sharpened the saw, he replies that he doesn’t have time to sit around sharpening tools, he’s got a tree to cut down!
I said above that scientists should move from closed to Open wherever possible because of long term advantages. I think that’s true, but like the guy with the saw, scientists are caught up in short-term thinking. Put the case to most of them, and they’ll agree about the advantages of Open over closed — for instance, I’ve yet to meet anyone who disagreed on principle that Open Access could dramatically improve the efficiency of knowledge dissemination, that is, the efficiency of the entire scientific endeavour. I’ve also yet to meet more than a handful of people willing to commit to sending their own papers only to OA journals, or even to avoiding journals that won’t let them self-archive! “I have a job to keep”, they say, “I’m not going to sacrifice my livelihood to the greater good”; or “that’s great, but first I need to get this grant funded”; or my personal favourite, “once I have tenure I’ll start doing all that good stuff”. (Sure you will. But I digress.)
So to return to the question at hand: it’s a fine thing to suggest that scientists should use Open Office, but I flat-out guarantee you that they never will unless somehow their funding comes to depend on it. Word is familiar and convenient; none of the advantages of Free/Open Source software are sufficiently important to overcome the urgency with which this paper or that grant has to be written up and sent.
It’s also a great idea to get researchers to start thinking about, and using, markup and metadata and all that chewy Semantic Web goodness, but again I guarantee 100% failure unless you fit it into their existing workflow and habits. If you build your plugins for Open Office, that won’t be another reason to use the Free application, it will be another reason to reject semantic markup: “oh yeah, the semantic web is a great idea, yeah I’d support it but there’s no Word plugin so I’d have to install Open Office and I just don’t have time to deal with that…”.
When it comes to scientists, you don’t just have to hand them a sharper saw, you have to force them to stop sawing long enough to change to the new tool. All they know is that the damn tree has to come down on time and they will be in terrible trouble (/fail to be recognized for their genius) if it doesn’t.

Fooling around with numbers, part 2

Following on from this post, and in the spirit of eating my own dogfood1, herewith the first part of my analysis of the U Cali OSC dataset.
The dataset includes some 3137 titles with accompanying information about publisher, list price, ISI impact factor, UC online uses and average annual price increase; these measures are defined here. The spreadsheet and powerpoint files I used to make the figures below are available here: spreadsheet, ppt.
As a first pass, I’ve simply made pairwise comparisons between impact factor, price and online use. There’s no apparent correlation between impact factor and price, for either the full set or a subset defined by IF and price cutoffs designed to remove “extremes”, as shown in the inset figure:
One other thing that stands out is the cluster of Elsevier journals in the high-price, low-impact quadrant, and the smaller cluster of NPG’s highest-IF titles at the opposite extreme. Note that n < 3137 because not all titles have impact factors, usage stats, etc. I’ve included the correlation coefficients mainly because their absence would probably be more distracting than having the (admittedly fairly meaningless) numbers available, at least for readers whose minds work like mine. Next I asked whether there was any clearer connection between price and online uses aggregated over all UC campuses:
Again, not so much. I played about with various cutoffs, and the best I could get was a weak correlation at the low end of both scales (see inset). And again, note Elsevier in the “low value” quadrant, and Nature in a class of its own. Nature is probably the one scientific journal every lay person can name; in terms of brand recognition it’s the Albert Einstein of journals. Interestingly, not even the other NPG titles come close to Nature itself on this measure, though they do when plotted against IF. I wonder whether that actually reflects a lay readership?
Finally (for the moment) I played the Everest (“because it’s there”) card and plotted use against impact factor:
The relationship here is still weak, but noticeably stronger than for the other two comparisons — particularly once we eliminate the Nature outlier (see inset). I’ve seen papers describing 0.4 as “strong correlation”, but I think for most purposes that’s wishful thinking on the part of the authors. I do wish I knew enough about statistics to be able to say definitively whether this correlation is significantly greater than those in the first two figures. (Yes yes, I could look it up. The word you want is “lazy”, OK?) Even if the difference is significant, and even if we are lenient and describe the correlation between IF and online use as “moderate”, I would argue that it’s a rich-get-richer effect in action rather than any evidence of quality or value. Higher-IF journals have better name recognition, and researchers tend to pull papers out of their “to-read” pile more often if they know the journal, so when it comes time to write up results those are the papers that get cited. Just for fun, here’s the same graph with some of the most-used journals identified by name:
Peter Suber has pointed out a couple of other (formal!) studies that have come to similar conclusions to those presented here. There are probably many such, because the relevant literature is dauntingly large. There’s even a journal of scientometrics! The FriendFeed discussion of my earlier post has generated some interesting further questions, for instance Bob O’Hara’s observation that a finer-grained analysis would be more useful. I’m not sure I’m up for manually curating the data, though, and I can’t see any other way to achieve what Bob suggests… I might do it for the smaller Elsevier Life Sciences set. For the moment I think I’ll concentrate more on slightly different questions regarding IF and price distributions, as in Fig 3 in my last post — tune in next time for more adventures in inept statistical analysis!
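As for my laziness above about whether one correlation is significantly greater than another: the textbook recipe for comparing two independent Pearson correlations is Fisher’s z-transform. A sketch with made-up r’s and n’s, not the actual figures from these plots:

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Two-sided test for a difference between two independent Pearson r's,
    via Fisher's z-transform."""
    z1 = math.atanh(r1)  # Fisher transform: 0.5 * ln((1+r)/(1-r))
    z2 = math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    # two-sided p-value from the standard normal distribution
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# made-up example: r = 0.4 (n = 2000) vs r = 0.15 (n = 2000)
z, p = compare_correlations(0.4, 2000, 0.15, 2000)
print(z, p)
```

With samples in the thousands, even modest-looking differences in r come out highly “significant” — which is one more reason not to read too much into any of these coefficients.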
1 I’m always on about Open Data and “publish early, publish often” collaborative models like Open Notebook Science, and it occurs to me that the ethos applies to blogging as much as to formal publications. So I’m going to try to post analyses like this in parts, so as to get earlier feedback, and of course I try to make all my data and methods available. Let me know if you think I’m missing any opportunities to practice what I preach.

Fooling around with numbers

A while back, there was some buzz about a paper showing that, for a particular subset of journals, there was essentially no correlation between Impact Factor and journal subscription price. I think, though my google-fu has failed me, that the paper was Is this journal worth $US 1118? (pdf!) by Nick Blomley, and the journals in question were geography titles. Blomley found “no direct or straightforward relationship” between price and either Impact Factor or citation counts. He also looked at Relative Price Index, a finer-grained measure of journal value developed by McAfee and Bergstrom. He didn’t plot that one out, so I will:
There is some circularity here, since RPI is calculated using price, but once again I’d call that no direct or straightforward relationship.
All this got me wondering about the same analyses applied to other fields and larger sets of journals. My first stop was Elsevier’s 2009 price list, handily downloadable as an Excel spreadsheet. It doesn’t include Impact Factors, but the linked “about” page for each journal displays the IF, if it has one, quite prominently. So I went through the Life Sciences journals by hand, copying in the IFs. I ended up with 141 titles with, and 90 titles without, Impact Factors. As with Blomley’s set, there was no apparent correlation between IF and price:
Interesting, no? If the primary measure of a journal’s value is its impact — pretty layouts and a good Employment section and so on being presumably secondary — and if the Impact Factor is a measure of impact, and if publishers are making a good faith effort to offer value for money — then why is there no apparent relationship between IF and journal prices? After all, publishers tout the Impact Factors of their offerings whenever they’re asked to justify their prices or the latest round of increases in same.
There’s even some evidence from the same dataset that Impact Factors do influence journal pricing, at least in a “we can charge more if we have one” kinda way. Comparing the prices of journals with or without IFs indicates that, within this Elsevier/Life Sciences set, journals with IFs are higher priced and less variable in price:
About the time I was finishing this up, I came across a much larger dataset from U California’s Office of Scholarly Communication. I’ve converted their html tables into a delimited text file, available here: UCOSC.txt. For my next trick I’ll see what information I can squeeze out of a real dataset (there are about 3,000 titles in there).
Oh, and if anyone wants it, the Elsevier Life Sciences data are in this Excel file: ElsevierLifeSciPriceList.xls.

No one goes into science to get rich.

A while back, Heather posted an entry about salaries in France, and just came right out and said what she makes:

The beginning junior professor (maitre de conférences, or MdC) fresh out of the Ph.D. (which never happens anymore) gets approximately 1700 euros in their pocket after benefits withholding each month, and this measure will bring it up to about 1800 euros. […] A MdC with 15 years’ seniority on the Le Monde comment thread earns 2600 euros a month; I earn 2300. (Unlike the French, I have an American indifference to revealing my salary to all; what with the fluctuating exchange rate it’s approximately equivalent to that of a tight-belted American high school teacher.)

I don’t know that it’s particularly American, but I’ve never minded telling everyone my income either. I understand that there are lots of reasons why one might be reticent to reveal this information, but by and large I’ve always felt that such reticence was mostly encouraged by those setting the salary levels, so that they could keep them as low as possible: divide and conquer (read: exploit), or something.
Anyway, Heather’s comments got me curious, and I’ve always been scornful of the numbers available from salary-survey sites, as they seem ridiculously inflated to me. Further, most of the survey data I’ve seen have been like this set from the AAUP or this one (warning: Word doc) from CPST — no mention of postdocs or grad students at all. When the CPST, for instance, reports a median salary of $80,000/year for “doctoral scientists”, believe me when I tell you their numbers are skewed towards faculty! Similarly, The Scientist’s annual life sciences survey for 2008 (free but requires registration) lists a median salary for academic scientists of $77,900. When you look at further breakdowns, though, you find that the median for scientists with no supervisory/managerial responsibilities is $49,400/year — tell that to the next TA, grad student or (junior) postdoc you meet!
So, I went ahead and posted a question — “how much money do you make?” — to the Life Scientists room on FriendFeed. There’s quite a conversation underway in that thread as I write this; Donnie pointed me to the AAUP survey I linked, others have posted reference material of various kinds, and Daniel reminded me of Mike Barton’s bioinformatician survey, the data from which can be downloaded from here. Some workup is available on OpenWetWare, but there’s not much there about salary so far, so I went ahead and did a little Excel spreadsheet-ing (shut up, ok, I’m just a biologist) of my own.
(Pause here to applaud Mike for all his hard work in collecting this data, and even more loudly for his decision to make it Open.)
I removed the entries with no salary information and made three arbitrary decisions: anyone reporting between $1K and $10K per year was actually reporting monthly salary; anyone under $1,000/year was probably reporting monthly salary too, but who knows, so I deleted them; and anyone reporting between $10K and $20K/year didn’t entirely make sense as monthly OR yearly, so I deleted them as well. (I couldn’t be arsed to make case-by-case decisions by, for instance, looking at how many years each person had worked in the field.) This left me with n = 490 and a healthy appreciation for careful survey design (read: never give your respondents a free-form field if you can help it!).
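Those cleanup rules amount to a little decision function; here’s my reading of them (the ×12 annualization for the $1K–$10K group is my assumption — the survey workup just says those figures were really monthly):

```python
# My reading of the cleanup rules above, as a single decision function.
# The x12 annualization for the $1K-$10K group is my assumption.
def clean_salary(reported):
    """Normalize a reported annual salary; return None for rows to drop."""
    if reported < 1_000:       # probably monthly, but who knows: drop
        return None
    if reported < 10_000:      # treat as monthly: annualize
        return reported * 12
    if reported < 20_000:      # makes no sense as monthly OR yearly: drop
        return None
    return reported            # take at face value

print([clean_salary(x) for x in [500, 4_000, 15_000, 45_000]])
# → [None, 48000, None, 45000]
```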
If you’re really keen, you can download the spreadsheet I used from here. The basic outcomes are these:


The categories are as follows:

  1. Masters / PhD / Entry Level (n = 211)
  2. Post Doc / Research Scientist (n = 138)
  3. Senior Post doc / Senior Scientist (n = 58)
  4. PI / Group Leader / Team Leader (n = 52)
  5. Professor / Senior Management (n = 31)

Means are shown +/- one standard deviation. I did break out categories 2+3 separately but it was not much different from 1+2+3. Plotting salary vs. years of service gives us this:


I dicked about with the outliers a little, but nothing I did improved the curve fit much — unsurprising, given the spread, and almost certainly meaningless (note, for instance, that it extrapolates to a negative starting salary!). Anyway, there it is; if I get another wild hair I might break out the categories by industry/academia/government, but right now I’m too lazy.
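The straight-line fit itself is just least squares; a sketch with invented numbers (not Mike’s data), via numpy:

```python
import numpy as np

# Invented toy data standing in for (years of service, salary) pairs.
years = np.array([0, 2, 4, 8, 12, 20])
salary = np.array([30_000, 38_000, 45_000, 60_000, 75_000, 95_000])

slope, intercept = np.polyfit(years, salary, 1)  # degree 1 = straight line
print(f"slope ≈ ${slope:,.0f} per year of service, intercept ≈ ${intercept:,.0f}")
```

On the real survey data the intercept comes out negative — one more sign the linear model shouldn’t be taken too literally.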
If all of this has whetted your appetite for more data, the NSF might have something for you (it’s getting late, so I’m not going to dig around in there myself today). The most believable numbers I’ve seen (viz, the numbers which accord most closely with my experience!) come from the Sigma Xi postdoc survey. You can get hold of the Sigma Xi data; briefly, data were collected from ~7,600 postdocs at >40 institutions, median salary in 1995 = $28,000 ($34,700 in 2004 dollars) and median salary in 2004 = $38,000.