Fooling around with numbers, part 4; or, those data — you keep using them — I don’t think they mean what you think they mean…

At the end of part 3, having looked at some of the ways in which prices and price/use were distributed, I said I’d try to say something about what constituted a fair price. I hadn’t thought that through at all, and it turns out that I really can’t get much leverage against that question from the UCOSC dataset alone.
In addition to the graphs in parts 1-3, here’s yet another way to look at the UCOSC data (again, this is a png from a screenshot because MT ate my perfectly good table¹):
Table 1
MTsucksass.png
Perhaps Elsevier doesn’t stand out quite so much as I might have expected: they still dominate by virtue of market share, but in terms of cost/use or use/title, Springer looks the worst of the bunch. Mean ($0.76) and median ($1.89) cost per use don’t mean much without context. I could argue that since libraries are having trouble keeping up with serials costs and usage is only likely to increase, those probably don’t represent fair prices… but I don’t know how much weight that argument would carry, and anyway you should go read Heather Morrison on why usage-based pricing is dangerous. (That’s one of the benefits of thinking-out-loud like this; knowledgeable people come along and point out stuff you need to know. Yay lazyweb!)
So, I need context: let’s start with, how many libraries are there? According to the American Library Association, there are more than 120,000 libraries in the USA — but for my purposes, I’m really only interested in those which carry the scholarly literature. The US Dept of Education’s National Center for Education Statistics runs a Library Statistics Program, which provides data specifically on academic libraries.
According to the ALA and the NCES, there are about 3700 academic libraries in the US. If all of them subscribed (at list price) to the 2904 journals in the UCOSC dataset, that would work out to $13,306,150,900 — about $13 billion — per year on scholarly journals alone. To put that into perspective, the entire NIH research budget for 2008 was less than $30 billion. I have been told that most libraries don’t pay list price, because publishers offer all kinds of deals, but I wondered whether that $13 billion was at least in the right ballpark, so I went looking for more data.
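The back-of-envelope arithmetic above can be sketched in a few lines of Python. This is only a rough check under stated assumptions: the summed list price is not quoted in the text, so the script backs it out from the $13,306,150,900 total and the 3700-library count, and all variable names are illustrative.

```python
# Back-of-envelope check of the "$13 billion" figure.
N_LIBRARIES = 3700                       # approx. US academic libraries (ALA/NCES)
N_JOURNALS = 2904                        # journals in the UCOSC dataset
TOTAL_IF_ALL_SUBSCRIBE = 13_306_150_900  # dollars per year, as quoted in the text

# Assumption: implied list price of one full set of UCOSC journals,
# derived from the quoted total rather than read from the dataset.
list_price_total = TOTAL_IF_ALL_SUBSCRIBE / N_LIBRARIES
mean_list_price = list_price_total / N_JOURNALS

print(f"Implied list price for all {N_JOURNALS} journals: ${list_price_total:,.0f}")
print(f"Implied mean list price per journal: ${mean_list_price:,.2f}")
print(f"Every library at list price: ${TOTAL_IF_ALL_SUBSCRIBE / 1e9:.1f} billion per year")
```

The implied mean list price (a little over $1,200 per journal) is a by-product of the calculation, not a number taken from the dataset itself.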
Since the UCOSC dataset covers 2003-4, I looked at the NCES report for 2004 (the spreadsheet I used is here). The ALA has another division, the Association of College and Research Libraries, which keeps its own records; alas, these are not free, but I could get nearly everything I wanted from the summaries — again, I just looked at 2004. There’s also the Association of Research Libraries, which is “a nonprofit organization of 123 research libraries at comprehensive, research-extensive institutions in the US and Canada that share similar research missions, aspirations, and achievements”, mostly made up of very large libraries (think Harvard, Yale, etc). The ARL also compiles and makes available statistics on its members; I pulled out the 2004 data from the download page (spreadsheet here).
Finally, I added the UCOSC dataset for comparison, and for extra context I pulled out the University of California subset from the ARL data (Berkeley, Davis, Irvine, LA, Riverside, San Diego and Santa Barbara; I think these are the largest 7 of UC’s 10 main campus libraries). The resulting data look like this²:
Table 2
MTstillsucksass.png
Na, not applicable; cc, couldn’t calculate. The ACRL data is derived mainly from two summaries, one showing expenditure (red) and one showing holdings (blue). The mean cost/serial is a fudge, since it was calculated using figures from both summaries, but I doubt it’s significantly different from the value I would get if I had all the data, since the number of libraries included in each set is so similar. The other values in green are also approximations derived from summary reports³. Note that the “per library” figures for the UCOSC dataset are actually just for that subset of journals (hence the “<<1” entry for “no. libraries”).
I’ve put some sanity checks (do these data make sense?) in a footnote⁴; to me, the data appear both externally and internally consistent. I don’t, in other words, appear to have done anything egregiously stupid. Not with the numbers, anyway:
Two things jump out at me from Table 2, which together are responsible for the subtitle of this entry. First, my $13 billion guess was way off — the actual amount spent on serials by US academic libraries is probably closer to $1-2 billion.  Large (e.g. Ivy League) libraries might spend many tens of millions of dollars, small libraries maybe only a few hundred thousand.  That’s still an enormous amount of money, but it’s not half the NIH budget!  So why the discrepancy?
Quite apart from “list price” and “what libraries actually pay” being two very different things, I’ve been making a mistake in terminology.  When I think of “serials” in a library, I think of the peer-reviewed scholarly literature; I tend to use “journals” to mean the same thing.
This is very, very wrong.
(As, no doubt, any librarian could have told me, without the need to go ferreting through all those numbers.) From the NCES survey instrument used to collect their data (emphasis mine):

[expenditure]
Current serial subscriptions (ongoing commitments) (line 13) – Report expenditures for current subscriptions to serials in all formats. These are publications issued in successive parts, usually at regular intervals, and, as a rule, intended to be continued indefinitely. Serials include periodicals, newspapers, annuals (reports, yearbooks, etc.), memoirs, proceedings, and transactions of societies.
[...]
[holdings]
Current serial subscriptions (line 26) — Report the total number of subscriptions in all formats. If the subscription comes in both paper and electronic form, count it twice. Count each individual title if it is received as part of a publisher’s package (e.g., Project MUSE, JSTOR, Academic IDEAL). Report each full-text article database such as Lexis-Nexis, ABI/INFORM as one subscription in line 27. Include paper and microfilm government documents issued serially if they are accessible through the library’s catalog.

From the ARL ditto:

Questions 4-5. Serials. Report the total number of subscriptions, not titles. Include duplicate subscriptions and, to the extent possible, all government document serials even if housed in a separate documents collection. Verify the inclusion or exclusion of document serials… Exclude unnumbered monographic and publishers’ series. Electronic serials acquired as part of an aggregated package (e.g., Project MUSE, BioOne, ScienceDirect) should be counted by title. A serial is

a publication in any medium issued in successive parts bearing numerical or chronological designations and intended to be continued indefinitely. This definition includes periodicals, newspapers, and annuals (reports, yearbooks, etc.); the journals, memoirs, proceedings, transactions, etc. of societies; and numbered monographic series.

Oy vey. Newspapers, yearbooks, government documents and a whole bunch of other things that aren’t scholarly journals are (or can be) serials too. “Periodicals” means National Geographic qualifies — hell, so does Playboy magazine!
As of today (March 17), Ulrich’s Periodicals Directory lists 224,151 “active” periodicals; of those, 65,461 are “academic/scholarly”; and of those, 25,425 are “refereed”.
What do those things cost which aren’t part of the peer-reviewed literature? How does their inclusion in library data impact the means and medians I’ve been looking at?
Which brings me to the second item of note from Table 2: the mean cost/serial is on the order of ten times higher for the UCOSC dataset than for the other sets.  Does that mean that the scholarly literature is actually the powerhouse of the serials crisis (pdf!), and if we could zero in on the peer-reviewed fraction of the serials data we would see an even more dramatic rise in price? Or does it have more to do with the fact that the UCOSC dataset is deliberately composed of relatively high-end journals, thus artificially inflating the apparent costs? If every library in the NCES set subscribed to those journals at even one-tenth of list price, it would still account for pretty much the entire serials expenditure — so how many libraries subscribe to which journals? What of the roughly 22,000 peer-reviewed journals that aren’t included in the UCOSC dataset?  If libraries are subscribing to anywhere from a few thousand serials to well over 100,000 (e.g. ARL 2007 numbers for Columbia, Harvard and Illinois/Urbana), what proportion of those subscriptions are to peer-reviewed journals — or, conversely, to what proportion of the peer-reviewed literature does the average library subscribe?
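The "one-tenth of list price" scenario can be checked against the NCES totals quoted in the sanity-check footnote. A minimal sketch, assuming the summed UCOSC list price can be backed out from the $13,306,150,900 estimate for 3700 libraries (it is not read from the dataset here); the variable names are illustrative.

```python
# Scenario from the text: every NCES library subscribes to the whole
# UCOSC set at one-tenth of list price.
NCES_LIBRARIES = 3653               # libraries in the NCES 2004 summary
NCES_SERIALS_SPEND = 1_363_671_792  # dollars spent on serials, NCES 2004
UCOSC_JOURNALS = 2904               # journals in the UCOSC dataset
ULRICHS_REFEREED = 25_425           # "refereed" titles listed by Ulrich's

# Assumption: summed UCOSC list price, derived from the earlier
# $13,306,150,900 estimate for 3700 libraries.
UCOSC_LIST_TOTAL = 13_306_150_900 / 3700

discounted_spend = NCES_LIBRARIES * UCOSC_LIST_TOTAL * 0.10
share_of_serials_budget = discounted_spend / NCES_SERIALS_SPEND
refereed_outside_ucosc = ULRICHS_REFEREED - UCOSC_JOURNALS

print(f"UCOSC set at 10% of list, all libraries: ${discounted_spend:,.0f}")
print(f"Share of actual NCES serials spending: {share_of_serials_budget:.0%}")
print(f"Refereed journals not in the UCOSC set: {refereed_outside_ucosc:,}")
```

On these figures the discounted scenario comes to roughly 96% of what the NCES libraries actually spent on serials of every kind, which is why the question of who subscribes to what matters so much.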
In other words, I’ve made no headway at all on the question of a “fair price”; all I’ve managed to do here is to find more questions.  I guess that’s progress, because at least they are better-defined, more specific questions. Answering them will require much more fine-grained data, though: which libraries subscribe to which peer-reviewed journals, and at what cost?  I think the answers might be very useful to the research community, but collecting the data would be a full-time job. (I’m up for it, by the way, if anyone reading this is in a position to hire me to do it. Seriously, I’d love it. After all, look what I’m doing for fun.)
To return to where I started: there’s another angle of attack on the “fair price” question, which is to look at things from the other side.  How much does it cost to publish a paper in the peer-reviewed literature, and how does that compare to actual income at publishing companies? This information is notoriously hard to come by, but I’ve been collecting links and notes for a while so in Part 6* I’ll try to put them all together and see if I’ve got anything useful.
* I’ve just remembered something else I want to do first: Part 5 will take a look at journal price distributions with and without impact factor, using the Elsevier Life Sciences (see Part 1 Fig 3) and the UCOSC datasets.
Update: if you’ve read this far, go read the FriendFeed discussion, you’ll like it.

————-

1 If you want the data there’s a comma-delimited text version of the table here and the spreadsheet from which the table is derived is here.
2 Comma-delimited text file here.
3 The following table shows the figures used to calculate the sum total library expenditure for the ACRL dataset.  Numbers in black are taken from the summaries provided, numbers in pink are calculated from them.
Table 3
MTsucksassforever.png
Mean total expenditure per library was calculated using an approximate average number of libraries of 1074.
4 Sanity checks:
Internal:

  • the ARL and ACRL subsets of the NCES libraries spend less in sum than the NCES set, but the mean and median expenditures/library are lower for the NCES set because it includes more, and smaller, libraries
  • the mean and median number of serials/library is similar between the ARL dataset and its UC subset, both figures being much larger than the mean serials/library for the NCES or ACRL sets (again, more and smaller libraries)
  • the mean and median cost/serial is similar throughout, except for the UCOSC dataset which is a curated subset of high-end scholarly journals (discussed above)

External:
Are those reasonable totals for the libraries to be spending?

  • The ARL 2004-5 report shows that member libraries spent $680,774,493, with a median per library of $5,904,464, on serials, and total library expenditure was $2,683,008,943 (median per library $20,210,171)
  • The NCES 2004 summary shows that 3653 libraries surveyed spent, in sum, $5,751,247,194 on total operating expenses, $1,363,671,792 on serials and $2,157,531,102 on information resources in general

Are those reasonable total numbers of journals per library?

  • OHSU (where I was until recently employed) has 20857 entries in its “journals” catalog
  • The NCES 2004 summary shows that, all together, 3653 academic libraries held 12,763,537 serials subscriptions
  • The ARL 2004-5 report shows that 113 member libraries held 4,658,493 subscriptions, with a median per library of 37,668

Are those reasonable mean and median costs per serial?

  • I could only find unit costs for serials in the ARL report, in the “analysis of selected variables”, where the mean cost/serial is given as $247.55 per subscription (range $656.31 to $93.72, median $231.90, 88 libraries reporting).

So, at least in ballpark terms, the numbers in my tables appear to check out against summaries compiled by the various agencies from their own data (and the OHSU library catalog).  There are, e.g., no order-of-magnitude discrepancies — except perhaps in cost/serial, as discussed above.
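The external checks above can also be turned into a few derived ratios. This is a sketch only: it uses the ARL 2004-5 and NCES 2004 totals quoted in this footnote, the class and field names are hypothetical, and "cost per subscription" here is a ratio of totals, which need not match a mean of per-library unit costs.

```python
from dataclasses import dataclass


@dataclass
class SerialsSummary:
    """Totals quoted in the footnote; field names are illustrative."""
    name: str
    libraries: int       # libraries reporting subscription counts
    serials_spend: int   # dollars spent on serials
    total_spend: int     # total library / operating expenditure, dollars
    subscriptions: int   # serial subscriptions held

    @property
    def cost_per_subscription(self) -> float:
        return self.serials_spend / self.subscriptions

    @property
    def subscriptions_per_library(self) -> float:
        return self.subscriptions / self.libraries

    @property
    def serials_share(self) -> float:
        return self.serials_spend / self.total_spend


arl = SerialsSummary("ARL 2004-5", 113, 680_774_493, 2_683_008_943, 4_658_493)
nces = SerialsSummary("NCES 2004", 3653, 1_363_671_792, 5_751_247_194, 12_763_537)

for s in (arl, nces):
    print(f"{s.name}: ${s.cost_per_subscription:,.2f} per subscription, "
          f"{s.subscriptions_per_library:,.0f} subscriptions per library, "
          f"serials = {s.serials_share:.1%} of total spending")
```

The aggregate ARL figure (roughly $146 per subscription) comes out lower than the $247.55 mean unit cost in the ARL report. That is not necessarily a contradiction: the report's mean covers only the 88 libraries that supplied unit costs, and a mean of per-library ratios generally differs from a ratio of totals.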
