Fooling around with numbers, part 4; or, those data — you keep using them — I don’t think they mean what you think they mean…

At the end of part 3, having looked at some of the ways in which prices and price/use were distributed, I said I’d try to say something about what constituted a fair price. I hadn’t thought that through at all, and it turns out that I really can’t get much leverage against that question from the UCOSC dataset alone.
In addition to the graphs in parts 1-3, here’s yet another way to look at the UCOSC data (again, this is a png from a screenshot because MT ate my perfectly good table1):
Table 1
Perhaps Elsevier doesn’t stand out quite so much as I might have expected — they still dominate by virtue of market share, but in terms of cost/use or use/title, Springer looks the worst of the bunch. Mean ($0.76) and median ($1.89) cost per use don’t mean much without context. I could argue that since libraries are having trouble keeping up with serials costs and usage is only likely to increase, those probably don’t represent fair prices… but I don’t know how much weight that argument would carry, and anyway you should go read Heather Morrison on why usage-based pricing is dangerous. (That’s one of the benefits of thinking-out-loud like this; knowledgeable people come along and point out stuff you need to know. Yay lazyweb!)
So, I need context: let’s start with, how many libraries are there? According to the American Library Association, there are more than 120,000 libraries in the USA — but for my purposes, I’m really only interested in those which carry the scholarly literature. The US Dept of Education’s National Center for Education Statistics runs a Library Statistics Program, which provides data specifically on academic libraries.
According to the ALA and the NCES, there are about 3700 academic libraries in the US. If all of them subscribed (at list price) to the 2904 journals in the UCOSC dataset, that would work out to $13,306,150,900 — about $13 billion — per year on scholarly journals alone. To put that into perspective, the entire NIH research budget for 2008 was less than $30 billion. I have been told that most libraries don’t pay list price, because publishers offer all kinds of deals, but I wondered whether that $13 billion was at least in the right ballpark, so I went looking for more data.
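For the record, here's the arithmetic behind that guess, using only the figures quoted above (Python as a glorified calculator):

```python
# Naive extrapolation: every US academic library paying list price for
# every UCOSC journal. All inputs are the figures quoted above.
n_libraries = 3700                     # ALA/NCES count of US academic libraries
total_spend = 13_306_150_900           # n_libraries x sum of UCOSC list prices

per_library = total_spend / n_libraries
print(f"${per_library:,.0f} per library")              # about $3.6 million
print(f"${per_library / 2904:,.0f} mean list price per journal")
```

So the guess amounts to assuming that every one of 3700 libraries carries a large research library's serials budget, which turns out to be the problem.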
Since the UCOSC dataset covers 2003-4, I looked at the NCES report for 2004 (the spreadsheet I used is here). The ALA has another division, the Association of College and Research Libraries, which keeps its own records; alas, these are not free, but I could get nearly everything I wanted from the summaries — again, I just looked at 2004. There’s also the Association of Research Libraries, which is “a nonprofit organization of 123 research libraries at comprehensive, research-extensive institutions in the US and Canada that share similar research missions, aspirations, and achievements”, mostly made up of very large libraries (think Harvard, Yale, etc). The ARL also compiles and makes available statistics on its members; I pulled out the 2004 data from the download page (spreadsheet here).
Finally, I added the UCOSC dataset for comparison, and for extra context I pulled out the University of California subset from the ARL data (Berkeley, Davis, Irvine, LA, Riverside, San Diego and Santa Barbara; I think these are the largest 7 of UC’s 10 main campus libraries). The resulting data look like this2:
Table 2
NA, not applicable; CC, couldn’t calculate. The ACRL data are derived mainly from two summaries, one showing expenditure (red) and one showing holdings (blue). The mean cost/serial is a fudge, since it was calculated using figures from both summaries, but I doubt it’s significantly different from the value I would get if I had all the data, since the number of libraries included in each set is so similar. The other values in green are also approximations derived from summary reports3. Note that the “per library” figures for the UCOSC dataset are actually just for that subset of journals (hence the “<<1” entry for “no. libraries”).
I’ve put some sanity checks — do these data make sense? — in a footnote4; to me, the data appear both externally and internally consistent.  I don’t, in other words, appear to have done anything egregiously stupid. Not with the numbers, anyway:
Two things jump out at me from Table 2, which together are responsible for the subtitle of this entry. First, my $13 billion guess was way off — the actual amount spent on serials by US academic libraries is probably closer to $1-2 billion.  Large (e.g. Ivy League) libraries might spend many tens of millions of dollars, small libraries maybe only a few hundred thousand.  That’s still an enormous amount of money, but it’s not half the NIH budget!  So why the discrepancy?
Quite apart from “list price” and “what libraries actually pay” being two very different things, I’ve been making a mistake in terminology.  When I think of “serials” in a library, I think of the peer-reviewed scholarly literature; I tend to use “journals” to mean the same thing.
This is very, very wrong.
(As, no doubt, any librarian could have told me, without the need to go ferreting through all those numbers.) From the NCES survey instrument used to collect their data (emphasis mine):

Current serial subscriptions (ongoing commitments) (line 13) – Report expenditures for current subscriptions to serials in all formats. These are publications issued in successive parts, usually at regular intervals, and, as a rule, intended to be continued indefinitely. Serials include periodicals, newspapers, annuals (reports, yearbooks, etc.), memoirs, proceedings, and transactions of societies.
Current serial subscriptions (line 26) — Report the total number of subscriptions in all formats. If the subscription comes in both paper and electronic form, count it twice. Count each individual title if it is received as part of a publisher’s package (e.g., Project MUSE, JSTOR, Academic IDEAL). Report each full-text article database such as Lexis-Nexis, ABI/INFORM as one subscription in line 27. Include paper and microfilm government documents issued serially if they are accessible through the library’s catalog.

From the ARL ditto:

Questions 4-5. Serials. Report the total number of subscriptions, not titles. Include duplicate subscriptions and, to the extent possible, all government document serials even if housed in a separate documents collection. Verify the inclusion or exclusion of document serials… Exclude unnumbered monographic and publishers’ series. Electronic serials acquired as part of an aggregated package (e.g., Project MUSE, BioOne, ScienceDirect) should be counted by title. A serial is

a publication in any medium issued in successive parts bearing numerical or chronological designations and intended to be continued indefinitely. This definition includes periodicals, newspapers, and annuals (reports, yearbooks, etc.); the journals, memoirs, proceedings, transactions, etc. of societies; and numbered monographic series.

Oy vey. Newspapers, yearbooks, government documents and a whole bunch of other things that aren’t scholarly journals are (or can be) serials too. “Periodicals” means National Geographic qualifies — hell, so does Playboy magazine!
As of today (March 17), Ulrich’s Periodicals Directory lists 224,151 “active” periodicals; of those, 65,461 are “academic/scholarly”; and of those, 25,425 are “refereed”.
What do all those things that aren’t part of the peer-reviewed literature cost? How does their inclusion in library data affect the means and medians I’ve been looking at?
Which brings me to the second item of note from Table 2: the mean cost/serial is on the order of ten times higher for the UCOSC dataset than for the other sets.  Does that mean that the scholarly literature is actually the powerhouse of the serials crisis (pdf!), and if we could zero in on the peer-reviewed fraction of the serials data we would see an even more dramatic rise in price? Or does it have more to do with the fact that the UCOSC dataset is deliberately composed of relatively high-end journals, thus artificially inflating the apparent costs? If every library in the NCES set subscribed to those journals at even one-tenth of list price, it would still account for pretty much the entire serials expenditure — so how many libraries subscribe to which journals? What of the roughly 22,000 peer-reviewed journals that aren’t included in the UCOSC dataset?  If libraries are subscribing to anywhere from a few thousand serials to well over 100,000 (e.g. ARL 2007 numbers for Columbia, Harvard and Illinois/Urbana), what proportion of those subscriptions are to peer-reviewed journals — or, conversely, to what proportion of the peer-reviewed literature does the average library subscribe?
In other words, I’ve made no headway at all on the question of a “fair price”; all I’ve managed to do here is to find more questions.  I guess that’s progress, because at least they are better-defined, more specific questions. Answering them will require much more fine-grained data, though: which libraries subscribe to which peer-reviewed journals, and at what cost?  I think the answers might be very useful to the research community, but collecting the data would be a full-time job. (I’m up for it, by the way, if anyone reading this is in a position to hire me to do it. Seriously, I’d love it. After all, look what I’m doing for fun.)
To return to where I started: there’s another angle of attack on the “fair price” question, which is to look at things from the other side.  How much does it cost to publish a paper in the peer-reviewed literature, and how does that compare to actual income at publishing companies? This information is notoriously hard to come by, but I’ve been collecting links and notes for a while, so in Part 6* I’ll try to put them all together and see if I’ve got anything useful.
* I’ve just remembered something else I want to do first: Part 5 will take a look at journal price distributions with and without impact factor, using the Elsevier Life Sciences (see Part 1 Fig 3) and the UCOSC datasets.
Update: if you’ve read this far, go read the FriendFeed discussion, you’ll like it.


1 If you want the data there’s a comma-delimited text version of the table here and the spreadsheet from which the table is derived is here.
2 Comma-delimited text file here.
3 The following table shows the figures used to calculate the sum total library expenditure for the ACRL dataset.  Numbers in black are taken from the summaries provided, numbers in pink are calculated from them.
Table 3
Mean total expenditure per library was calculated using an approximate average number of libraries of 1074.
4 Sanity checks:

  • the ARL and ACRL subsets of the NCES libraries spend less in sum than the NCES set, but the mean and median expenditures/library are lower for the NCES set because it includes more, and smaller, libraries
  • the mean and median number of serials/library is similar between the ARL dataset and its UC subset, both figures being much larger than the mean serials/library for the NCES or ACRL sets (again, more and smaller libraries)
  • the mean and median cost/serial is similar throughout, except for the UCOSC dataset which is a curated subset of high-end scholarly journals (discussed above)

Are those reasonable totals for the libraries to be spending?

  • The ARL 2004-5 report shows that member libraries spent $680,774,493, with a median per library of $5,904,464, on serials, and total library expenditure was $2,683,008,943 (median per library $20,210,171)
  • The NCES 2004 summary shows that 3653 libraries surveyed spent, in sum, $5,751,247,194 on total operating expenses, $1,363,671,792 on serials and $2,157,531,102 on information resources in general

Are those reasonable total numbers of journals per library?

  • OHSU (where I was until recently employed) has 20,857 entries in its “journals” catalog
  • The NCES 2004 summary shows that, all together, 3653 academic libraries held 12,763,537 serials subscriptions
  • The ARL 2004-5 report shows that 113 member libraries held 4,658,493 subscriptions, with a median per library of 37,668

Are those reasonable mean and median costs per serial?

  • I could only find unit costs for serials in the ARL report, in the “analysis of selected variables”, where the mean cost/serial is given as $247.55 per subscription (range $93.72 to $656.31, median $231.90, 88 libraries reporting).

So, at least in ballpark terms, the numbers in my tables appear to check out against summaries compiled by the various agencies from their own data (and the OHSU library catalog).  There are, e.g., no order-of-magnitude discrepancies — except perhaps in cost/serial, as discussed above.
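Just to show my working: the aggregate cost-per-subscription ratios can be recomputed directly from the quoted totals. (Note the result won't exactly match the ARL mean of $247.55, which averages each library's own unit cost rather than pooling the totals.)

```python
# Cross-check: aggregate cost per serial subscription, from the quoted totals.
nces_spend, nces_subs = 1_363_671_792, 12_763_537   # NCES 2004, 3653 libraries
arl_spend, arl_subs = 680_774_493, 4_658_493        # ARL 2004-5 report

print(f"NCES: ${nces_spend / nces_subs:.2f} per subscription")   # ~ $107
print(f"ARL:  ${arl_spend / arl_subs:.2f} per subscription")     # ~ $146
```

Both land at $100-150 per subscription, the same order of magnitude as the ARL per-library figure and an order of magnitude below the UCOSC mean cost/serial, as discussed above.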

Updates on “science and selfishness”

Update the first: now I feel bad for not waiting (though I did put “read AFTER honeymoon!!!” in the subject line), but John Wilbanks wrote back right away to say that it will take him a while to get to it, but he will ferret out specific answers regarding the Science Commons work and interoperability.
Update the second: Peter Sefton has more here, including specific recommendations for working with Microsoft while avoiding “a new kind of format lock-in; a kind of monopolistic wolf in open-standards lambskin”:

  • The product (eg a document) of the code must be interoperable with open software. In our case this means Word must produce stuff that can be used in and round tripped with OpenOffice.org Writer, and with earlier versions and Mac versions of Microsoft’s products. (This is not as simple as it could be when we have to deal with stuff like Sun refusing to implement import and preservation for data stored in Word fields as used by applications like EndNote.)

    The NLM add-in is an odd one here, as on one level it does qualify in that it spits out XML, but the intent is to create Word-only authoring so that rules it out — not that we have been asked to work on that project other than to comment, I am merely using it as an example.

  • The code must be open source and as portable as possible. Of course if it is interface code it will only work with Microsoft’s toll-access software but at least others can read the code and re-implement elsewhere. If it’s not interface code then it must be written in a portable language and/or framework.

MT weirdness

1. Comments are working again. Thanks to everyone who told me about the problem — I don’t know what it was, but my technical consultant (Spousal Unit) turned off the spam firewall and things look fine.
2. Help me, lazyweb! I can enter html tables just fine, unless there’s an image upstream — then MT inserts a dozen or more <br> tags above the table! I’ve tried <br clear=all> and every kind of spacing between the table and the character right before it. No amount of text between the table and the image seems to have any effect. My stylesheet is here and you can view source to see the main index, but I can’t see any obvious cause of the weirdness.

Fooling around with numbers, part 3; or, why would anyone pay for these journals?

Following on from part 2, I thought I’d ask a couple more questions about price-per-use, based on the online usage stats in the UCOSC dataset. I started on this because I noticed that in Fig 2 of part 2, I’d missed a point: there is an even-further-out outlier above the Elsevier set I pointed out:
It’s another Elsevier journal, Nuclear Physics B. In 2003, only 1,001 online uses were reported to UC by the publisher, but the 2004 list price was $15,360. The companion journal Nuc Phys A is not much better: $10,121 for 1,198 uses. Compare that with Nature: 286,125 uses at just $1,280!
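Put those in cost-per-use terms (all figures as just quoted) and the gap is stark:

```python
# 2004 list price divided by 2003 reported online uses, figures as quoted.
for title, price, uses in [("Nuclear Physics B", 15_360, 1_001),
                           ("Nuclear Physics A", 10_121, 1_198),
                           ("Nature",             1_280, 286_125)]:
    print(f"{title}: ${price / uses:.4f} per use")
# Nuc Phys B comes out over $15 per use; Nature, under half a cent.
```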
It gets worse, too, because I’m led to believe that anything that appears in a physics journal these days is available ahead of time from the arXiv. I tried to confirm that for Nuc Phys B, but either I’m missing something or the arXiv search function is totally for shit, so I couldn’t do it systematically. I did go through the latest table of contents (Vol 813 issue 3) on the Science Direct page, and was easily able to find every paper in the arXiv — mostly just by searching on author names, though in a couple of cases I had to put titles into Google Scholar. Still, they were all there, which leads me to wonder why any library would buy Nuc Phys B (or Nuc Phys A, assuming it’s also covered by the arXiv). Prices haven’t improved in the intervening 5 years, either:
[I had a table here but Movable Type keeps munging it. Piece of shit. Here’s a jpg until I sort it.]

That got me wondering how the rest of the journals are distributed by price/use and publisher:
The inset shows a zoomed view but even that wasn’t particularly informative, so I zoomed in a bit further:
The curve fits are for the whole of each dataset, even though it’s a zoomed view; the Nature set excludes British Journal of Pharmacology, the only NPG title that recorded 0 uses, and Nature itself. Colour coding by publisher is the same for each figure in this post. As in part 2, the correlation between price and use is weak at best and doesn’t change much from publisher to publisher. Also, each publisher subset shows a stronger correlation than the entire pooled set — score another one for Bob O’Hara’s suggestion that finer-grained analyses of this kind of data are likely to produce more robust results. Since cutoffs improved the apparent correlation for the pooled set, I tried that with the publisher subsets:
As in part 2, with uses restricted to 5000 or fewer there was improvement in price/use correlation in most cases, but nothing dramatic; I’m not sure why the Blackwell fit got worse. The Nature subset is close to being able to claim at least a modest fit to a straight line there, so not only does NPG boast some of the lowest prices and highest use rates, they are the closest of all the publishers to pricing their wares according to (at least one measure of) likely utility. Special note to Maxine Clarke, remember this post next time I tee off on Nature! 🙂
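For anyone who wants to replicate the pooled-vs-per-publisher comparison, here's a sketch of how I'd automate it. It assumes the UCOSC spreadsheet loaded as a pandas DataFrame with columns named publisher, price and uses; those names are my invention, so rename to match however you load the data.

```python
import pandas as pd

def correlations_by_publisher(df, max_uses=None):
    """Pearson r between list price and online uses, pooled and per publisher.

    Expects columns 'publisher', 'price' and 'uses' (my assumed names).
    """
    if max_uses is not None:                   # optional cutoff, e.g. uses <= 5000
        df = df[df["uses"] <= max_uses]
    pooled = df["price"].corr(df["uses"])      # one r for the whole pooled set
    by_pub = {pub: g["price"].corr(g["uses"])  # one r per publisher subset
              for pub, g in df.groupby("publisher")}
    return pooled, by_pub
```

Running it once without a cutoff and once with max_uses=5000 reproduces the two views above, and makes it easy to check Bob O'Hara's point that the per-publisher subsets correlate better than the pooled set.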
Next, I broke the data out into intervals (for clarity the labels say 0-1, 1-2 etc, but the actual intervals used were 0-0.99, 1-1.99 etc):
Now it seems that we’re looking at some kind of long-tailed distribution, which is hardly surprising. The majority of the titles fall into the first few price/use intervals, say less than about $6/use. Since most pay-per-view article charges are between $25 and $40, I more-or-less arbitrarily picked $30/use as a cutoff and asked how many titles from each publisher fall above that cutoff, and what proportion of the total expenditure (viz, list price sum) does that represent? The inset shows that 161 titles, most of them from Kluwer and Springer (whose figures I combined because Springer bought most of Kluwer’s titles sometime after 2003), account for about 5% of the total in list price terms. That was a bit more useful, so I expanded it to ask the same question for each interval:
What becomes apparent now, I think, is that the UC librarians are doing a good job! Only 6% of the total number of journals (5% of the total list price cost) fall into the “more than $30/use” category, of which it could reasonably be said that the library might as well drop the subscription and just cover the pay-per-view costs of their patrons. Only a further 15% or so work out to more than $6/use, and around 80% of the collection (figured as titles or cost) comes in under $6/use, with around 30% less than $1/use.
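The binning and share-of-cost bookkeeping above is easy to sketch in code; again, this assumes a pandas DataFrame with hypothetical price and uses columns:

```python
import pandas as pd

def cost_share_by_interval(df, edges=(0, 1, 2, 3, 4, 5, 6, 30, float("inf"))):
    """Bin titles by price-per-use; per bin, count titles and sum list price.

    Column names 'price' and 'uses' are my assumption. Zero-use titles are
    crudely clamped to 1 use here; in the post they were simply excluded.
    """
    ppu = df["price"] / df["uses"].clip(lower=1)
    bins = pd.cut(ppu, bins=list(edges), right=False)   # [0,1), [1,2), ...
    out = df.groupby(bins, observed=True)["price"].agg(titles="count", cost="sum")
    out["cost_share"] = out["cost"] / df["price"].sum()
    return out
```

The last column answers the "what proportion of total expenditure" question directly; the edges tuple puts everything above $30/use in its own bin, matching the cutoff chosen above.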
So, are these reasonable prices — $1 per use, $6 per use? I’m not sure I can answer that, but I’ll try to say something about it, using the UCOSC dataset, in Part 4.

Peters Murray-Rust and Sefton on “science and selfishness”

Peter Murray-Rust (welcome back to blogging!) has replied to Glyn Moody’s post about semantic plugins being developed by Science Commons in collaboration with the Evil Empire, which I discussed in my last post. Peter MR takes the view, with which I concur, that it’s more important to get scientists using semantic markup than to take an ideological stand against Microsoft:

Microsoft is “evil”. I can understand this view – especially during the Hallowe’en document era. There are many “evil” companies – they can be found in publishing (?PRISM), pharmaceuticals (where I used to work; see The Constant Gardener), petrotechnical, scientific software, etc. Large companies often/always? adopt questionable practices. [I differentiate complete commercial sectors – such as tobacco, defence and betting – where I would have moral issues]. The difficulty here is that there is no clear line between an evil company and an acceptable one.
The monopoly exists and nowhere more than in in/organic chemistry where nearly all chemists use Word. We have taken the view that we will work with what scientists actually use, not what we would like them to use. The only current alternative is to avoid working in this field – chemists will not use Open Office.

Another, to my mind even more important, point was raised by Peter Sefton in a comment on Peter MR’s entry:

I will have to talk about this at greater length but I think the issue is not working with Microsoft it’s working in an interoperable way. The plugins coming out of MS Research now might be made by well meaning people but unless they encode their results in something that can interop with other word processors (the main one is OOo Writer) then the effect is to prolong the monopoly. There is a not so subtle trick going on here – MS are opening up the word processing format with one hand while building addons like the Ontology stuff and the NLM work which depend on Word 2007 to work with the other hand. I have raised this with Jim Downing and I hope you can get a real interop on Chem4Word.

(Peter S, btw, blogs here and works on a little thing called The Integrated Content Environment (ICE), which looks to me like a good candidate for an ideal Electronic Lab Notebook…)
There’s a difference between the plugins being Open Source and the plugins being useful to the F/OSS community. If collaborators hold Microsoft to real interoperability, the “Evil Empire” concerns largely go away, because the project can simply fork to support any applications other than Word.
(I’ve emailed John Wilbanks to get his reaction to all this, but be patient because he’s insanely busy in general, and right now he’s on honeymoon!)

On science and selfishness.

Glyn Moody has a nice post up about fraternizing with the enemy in Open Science; you should read the whole thing, but here’s the gist:

One of the things that disappoints me is the lack of understanding of what’s at stake with open source among some of the other open communities. For example, some in the world of open science seem to think it’s OK to work with Microsoft, provided it furthers their own specific agenda. Here’s a case in point:

John Wilbanks, VP of Science for Creative Commons, gave O’Reilly Media an exclusive sneak preview of a joint announcement that they will be making with Microsoft later today at the O’Reilly Emerging Technology Conference. […] Microsoft will be releasing, under an open source license, Word plugins that will allow scientists to mark up their papers with scientific entities directly.

That might sound fine – after all, the plugins are open source, right? But no. Here’s the problem:

Wilbanks said that Word is, in his experience, the dominant publishing system used in the life sciences [and] probably the place that most people prepare drafts. “almost everything I see when I have to peer review is in a .doc format.”

In other words, he doesn’t see any problem with perpetuating Microsoft’s stranglehold on word processing. But it has consistently abused that monopoly […]
Working with Microsoft on open source plugins might seem innocent enough, but it’s really just entrenching Microsoft’s power yet further in the scientific community […]
It would have been far better to work with OpenOffice.org to produce similar plugins, making the free office suite even more attractive, and thus giving scientists yet another reason to go truly open, with all the attendant benefits, rather than making do with a hobbled, faux-openness, as here.

Let me say upfront that I mostly agree with Glyn here. Scientists should be at the forefront of abandoning closed for Open wherever possible, because in the long term Open strategies offer efficiencies of operation and scale that closed, proprietary solutions simply cannot match.
Having said that — and most expressly without wishing to put words into John Wilbanks’ mouth — my response to Glyn’s criticism is that I think he (Glyn) is seriously underestimating the selfish nature of most scientists. Or if you want to be charitable, the intense pressure under which they have to function. Let me unpack that:
Glyn talks about making Open Office more attractive and providing incentives for scientists to use Open solutions, but what he may not realize is that incentives mostly don’t work in that tribe. Scientists will do nothing that doesn’t immediately and obviously contribute to publications, unless forced to do so. Witness the utter failure of Open Access recommendations, suggestions and pleas vs the success of OA mandates. These are people who ignore carrots; you need a stick, and a big one.
For instance: I use Open Office in preference to Word because I’m willing to put up with a short learning curve and a few inconveniences, having (as they say here in the US) drunk the Open Kool-Aid. But I’m something of an exception. Faced with a single difficulty, one single function that doesn’t work exactly like it did in Word, the vast majority of researchers will throw a tantrum and give up on the new application. After all, the Department pays the Word license, so it’s there to be used, so who cares about monopolies and stifling free culture and all that hippy kum-ba-yah crap when I’ve got a paper to write that will make me the most famous and important scientist in all the world?
The last part is a (slight) exaggeration, but the tantrum/quit part is not. Researchers have their set ways of doing things, and they are very, very resistant to change — I think this might be partly due to the kind of personality that ends up in research, but it’s also a response to the pressure to produce. In science, only one kind of productivity counts — that is, keeps you in a job, brings in funding, wins your peers’ respect — and that’s published papers. The resulting pressure makes whatever leads to published papers urgent and limits everything else to — at best — important; and urgent trumps important every time. Remember the old story about the guy struggling to cut down a tree with a blunt saw? To suggestions that his work would go faster if he sharpened the saw, he replies that he doesn’t have time to sit around sharpening tools, he’s got a tree to cut down!
I said above that scientists should move from closed to Open wherever possible because of long term advantages. I think that’s true, but like the guy with the saw, scientists are caught up in short-term thinking. Put the case to most of them, and they’ll agree about the advantages of Open over closed — for instance, I’ve yet to meet anyone who disagreed on principle that Open Access could dramatically improve the efficiency of knowledge dissemination, that is, the efficiency of the entire scientific endeavour. I’ve also yet to meet more than a handful of people willing to commit to sending their own papers only to OA journals, or even to avoiding journals that won’t let them self-archive! “I have a job to keep”, they say, “I’m not going to sacrifice my livelihood to the greater good”; or “that’s great, but first I need to get this grant funded”; or my personal favourite, “once I have tenure I’ll start doing all that good stuff”. (Sure you will. But I digress.)
So to return to the question at hand: it’s a fine thing to suggest that scientists should use Open Office, but I flat-out guarantee you that they never will unless somehow their funding comes to depend on it. Word is familiar and convenient; none of the advantages of Free/Open Source software are sufficiently important to overcome the urgency with which this paper or that grant has to be written up and sent.
It’s also a great idea to get researchers to start thinking about, and using, markup and metadata and all that chewy Semantic Web goodness, but again I guarantee 100% failure unless you fit it into their existing workflow and habits. If you build your plugins for Open Office, that won’t be another reason to use the Free application, it will be another reason to reject semantic markup: “oh yeah, the semantic web is a great idea, yeah I’d support it but there’s no Word plugin so I’d have to install Open Office and I just don’t have time to deal with that…”.
When it comes to scientists, you don’t just have to hand them a sharper saw, you have to force them to stop sawing long enough to change to the new tool. All they know is that the damn tree has to come down on time and they will be in terrible trouble (/fail to be recognized for their genius) if it doesn’t.

Fooling around with numbers, part 2

Following on from this post, and in the spirit of eating my own dogfood1, herewith the first part of my analysis of the U Cali OSC dataset.
The dataset includes some 3137 titles with accompanying information about publisher, list price, ISI impact factor, UC online uses and average annual price increase; these measures are defined here. The spreadsheet and powerpoint files I used to make the figures below are available here: spreadsheet, ppt.
As a first pass, I’ve simply made pairwise comparisons between impact factor, price and online use. There’s no apparent correlation between impact factor and price, for either the full set or a subset defined by IF and price cutoffs designed to remove “extremes”, as shown in the inset figure:
One other thing that stands out is the cluster of Elsevier journals in the high-price, low-impact quadrant, and the smaller cluster of NPG’s highest-IF titles at the opposite extreme. Note that n < 3137 because not all titles have impact factors, usage stats, etc. I've included the correlation coefficients mainly because their absence would probably be more distracting than having the (admittedly fairly meaningless) numbers available, at least for readers whose minds work like mine. Next I asked whether there was any clearer connection between price and online uses aggregated over all UC campuses:
Again, not so much. I played about with various cutoffs, and the best I could get was a weak correlation at the low end of both scales (see inset). And again, note Elsevier in the “low value” quadrant, and Nature in a class of its own. Being probably the one scientific journal every lay person can name, in terms of brand recognition it’s the Albert Einstein of journals. Interestingly, not even the other NPG titles come close to Nature itself on this measure, though they do when plotted against IF. I wonder whether that actually reflects a lay readership?
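If you want to replicate these pairwise comparisons from the spreadsheet, something like the following should do it; the column names are my assumption, so match them to however you load the data:

```python
import pandas as pd

def pairwise_r(df, if_max=None, price_max=None):
    """Pearson correlations among impact factor, price and online uses.

    Column names are my assumption; rows with missing values are dropped,
    which is why n < 3137 (not all titles have IFs or usage stats).
    """
    sub = df[["impact_factor", "price", "uses"]].dropna()
    if if_max is not None:                 # optional "remove extremes" cutoffs
        sub = sub[sub["impact_factor"] <= if_max]
    if price_max is not None:
        sub = sub[sub["price"] <= price_max]
    return sub.corr()                      # 3x3 matrix of pairwise r values
```

Calling it once with no cutoffs and once with IF/price caps gives the full-set and inset-style numbers in one go.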
Finally (for the moment) I played the Everest (“because it’s there”) card and plotted use against impact factor:
The relationship here is still weak, but noticeably stronger than for the other two comparisons — particularly once we eliminate the Nature outlier (see inset). I’ve seen papers describing 0.4 as “strong correlation”, but I think for most purposes that’s wishful thinking on the part of the authors. I do wish I knew enough about statistics to be able to say definitively whether this correlation is significantly greater than those in the first two figures. (Yes yes, I could look it up. The word you want is “lazy”, OK?) Even if the difference is significant, and even if we are lenient and describe the correlation between IF and online use as “moderate”, I would argue that it’s a rich-get-richer effect in action rather than any evidence of quality or value. Higher-IF journals have better name recognition, and researchers tend to pull papers out of their “to-read” pile more often if they know the journal, so when it comes time to write up results those are the papers that get cited. Just for fun, here’s the same graph with some of the most-used journals identified by name:
Peter Suber has pointed out a couple of other (formal!) studies that have come to conclusions similar to those presented here. There are probably many more such studies out there; the relevant literature is dauntingly large. There’s even a journal of scientometrics! The FriendFeed discussion of my earlier post has generated some interesting further questions, for instance Bob O’Hara’s observation that a finer-grained analysis would be more useful. I’m not sure I’m up for manually curating the data, though, and I can’t see any other way to achieve what Bob suggests… I might do it for the smaller Elsevier Life Sciences set. For the moment I think I’ll concentrate more on slightly different questions regarding IF and price distributions, as in Fig 3 in my last post — tune in next time for more adventures in inept statistical analysis!
1 I’m always on about Open Data and “publish early, publish often” collaborative models like Open Notebook Science, and it occurs to me that the ethos applies to blogging as much as to formal publications. So I’m going to try to post analyses like this in parts, so as to get earlier feedback, and of course I try to make all my data and methods available. Let me know if you think I’m missing any opportunities to practice what I preach.

Fooling around with numbers

A while back, there was some buzz about a paper showing that, for a particular subset of journals, there was essentially no correlation between Impact Factor and journal subscription price. I think, though my google-fu has failed me, that the paper was Is this journal worth $US 1118? (pdf!) by Nick Blomley, and the journals in question were geography titles. Blomley found “no direct or straightforward relationship” between price and either Impact Factor or citation counts. He also looked at Relative Price Index, a finer-grained measure of journal value developed by McAfee and Bergstrom. He didn’t plot that one out, so I will:
There is some circularity here, since RPI is calculated using price, but once again I’d call that no direct or straightforward relationship.
All this got me wondering about the same analyses applied to other fields and larger sets of journals. My first stop was Elsevier’s 2009 price list, handily downloadable as an Excel spreadsheet. It doesn’t include Impact Factors, but the linked “about” page for each journal displays the IF, if it has one, quite prominently. So I went through the Life Sciences journals by hand, copying in the IFs. I ended up with 141 titles with, and 90 titles without, Impact Factors. As with Blomley’s set, there was no apparent correlation between IF and price:
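Prices and Impact Factors are both heavily right-skewed, so alongside Pearson's r it's worth checking a rank-based measure, which a handful of very expensive titles can't dominate. A numpy-only Spearman sketch (function name and toy data mine; scipy.stats.spearmanr does the same job if you have SciPy handy):

```python
import numpy as np

def spearman_rho(x, y):
    # Spearman's rho is just Pearson's r computed on ranks; tied values
    # get the average of the ranks they span.
    def ranks(a):
        a = np.asarray(a, float)
        order = np.argsort(a)
        r = np.empty(len(a), float)
        r[order] = np.arange(1, len(a) + 1)
        for v in np.unique(a):
            tied = a == v
            if tied.sum() > 1:
                r[tied] = r[tied].mean()
        return r
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Toy check: a monotone relationship scores 1.0 even when it isn't linear
rho = spearman_rho([1, 2, 3, 4], [10, 100, 1000, 10000])
```

If the rank version also comes out near zero, the "no relationship" conclusion isn't just an artifact of a few pricey outliers flattening Pearson's r.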
Interesting, no? If the primary measure of a journal’s value is its impact — pretty layouts and a good Employment section and so on being presumably secondary — and if the Impact Factor is a measure of impact, and if publishers are making a good faith effort to offer value for money — then why is there no apparent relationship between IF and journal prices? After all, publishers tout the Impact Factors of their offerings whenever they’re asked to justify their prices or the latest round of increases in same.
There’s even some evidence from the same dataset that Impact Factors do influence journal pricing, at least in a “we can charge more if we have one” kinda way. Comparing the prices of journals with or without IFs indicates that, within this Elsevier/Life Sciences set, journals with IFs are higher priced and less variable in price:
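A quick way to check whether that with-IF/without-IF price difference could be chance, without assuming anything about the (skewed) price distributions, is a permutation test on the difference in means. A sketch with invented prices (the real comparison would pool the 141 with-IF and 90 without-IF titles):

```python
import random
import statistics as st

def perm_test_mean_diff(a, b, n_iter=5000, seed=1):
    # Shuffle the pooled prices, re-split into groups of the original
    # sizes, and count how often the shuffled mean difference is at
    # least as large as the observed one (two-sided).
    random.seed(seed)
    observed = abs(st.mean(a) - st.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        random.shuffle(pooled)
        if abs(st.mean(pooled[:len(a)]) - st.mean(pooled[len(a):])) >= observed:
            hits += 1
    return hits / n_iter

# Invented prices: with-IF journals priced clearly higher than without-IF
with_if = [1800, 2100, 2400, 2600, 3100, 3500]
without_if = [600, 750, 900, 1100, 1300]
p_value = perm_test_mean_diff(with_if, without_if)
```

A small p here would say the price gap is unlikely to be a shuffling accident; it wouldn't, of course, say anything about *why* IF-bearing journals cost more.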
About the time I was finishing this up, I came across a much larger dataset from U California’s Office of Scholarly Communication. I’ve converted their html tables into a delimited text file, available here: UCOSC.txt. For my next trick I’ll see what information I can squeeze out of a real dataset (there are about 3,000 titles in there).
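For anyone wanting to repeat the html-to-delimited-text conversion, pages of simple static tables like these yield to even the standard library's HTMLParser. A sketch (class name mine; the sample row is invented, and a real page would need a glance first to rule out nested-table weirdness):

```python
from html.parser import HTMLParser

class TableToTSV(HTMLParser):
    """Minimal HTML-table scraper: flattens <tr>/<td>/<th> cells into
    tab-separated rows. Good enough for simple static tables; not a
    general-purpose HTML parser."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell, self._in_cell = [], [], [], False
    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell, self._cell = True, []
        elif tag == "tr":
            self._row = []
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append("\t".join(self._row))
    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

# Invented one-row table standing in for a real UCOSC page
html = ("<table><tr><th>Title</th><th>Price</th></tr>"
        "<tr><td>J. Foo</td><td>$1118</td></tr></table>")
parser = TableToTSV()
parser.feed(html)
tsv = "\n".join(parser.rows)
```

The resulting tab-delimited text loads straight into a spreadsheet or any stats package.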
Oh, and if anyone wants it, the Elsevier Life Sciences data are in this Excel file: ElsevierLifeSciPriceList.xls.