What use are research patents?

DrugMonkey has a conversation going about the ongoing kerfluffle over (micro)blogging of conference presentations (see also the FriendFeed discussion). I want to go off on a tangent from something that came up in his comment thread, so rather than derail it I thought I’d post here.
In his first comment in the thread, David Crotty made the following claim:

Lots of researchers support their families and labs through money generated by patents, and most universities are heavily dependent upon their patent portfolios for funding.

That doesn’t accord with my (limited!) experience — I know a few researchers who hold multiple patents, and none of them ever made any money that way — and my general impression is that the return on investment for tech transfer offices and the like is fairly dismal.
This seems like the sort of beans that beancounters everywhere should be counting, so I asked on FriendFeed whether anyone knew of any data to address the question of whether universities really make much money from patents. Christina Pikas pointed me to the Association of University Technology Managers, whose 2007 Licensing Activity Survey is now available.
I extracted data for 154 universities and 27 hospitals and research institutions. Between them, in 2007, these institutions filed 11,116 patent applications, were awarded 3,512 patents, and gave rise to 538 start-up companies. I calculated licensing income as a percentage of research expenditure:


Apart from New York University (I wonder what they own that’s so profitable?), it’s clear that none of these universities are “heavily dependent upon their patent portfolios for funding”. In fact, more than half of them (78/154) made less than 1% of their research expenditure back in licensing income, and the great majority (144/154) made less than 10%.
Licensing income for Massachusetts General Hospital and “City of Hope National Medical Ctr. & Beckman Research” (whoever they are) amounted to 65-70% of research expenditure, but none of the other hospitals or research institutions made more than 20%. More than half of this group (15/27) made less than 2%, and most of them (23/27) made less than 10%.
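For anyone who wants to check my arithmetic, the calculation is nothing fancy; here’s a quick Python sketch (the institution names and dollar figures below are invented for illustration, not actual AUTM numbers — the real ones are in the spreadsheet linked at the end of the post):

```python
# Licensing income as a percentage of research expenditure.
# Names and figures are made up; substitute the AUTM survey data.
institutions = {
    "Example State U": {"licensing": 2_500_000, "research": 450_000_000},
    "Example Tech":    {"licensing": 40_000_000, "research": 600_000_000},
}

for name, d in institutions.items():
    pct = 100 * d["licensing"] / d["research"]
    print(f"{name}: {pct:.1f}% of research expenditure recovered in licensing")
```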
The distribution looks just about as you would expect:


I also wondered whether there was any evidence that greater numbers of patents awarded, or more money spent per patent, resulted in higher licensing income. As you can see, the answer is no (insets show the same plots with the circled outliers removed):


I don’t know how representative this dataset is; there are several thousand universities and colleges in the US, and surely even more hospitals and research institutions, so the sample size is relatively small. It does include some big names, though – Harvard, Johns Hopkins, MIT, Stanford, U of California — and I would expect a list of schools answering the AUTM survey to be weighted towards those schools with an emphasis on tech transfer.
In any case, I’m not buying David’s assertion that “most universities”, or most hospitals or research institutes for that matter, rely heavily on licensing income. And that being so, I am also somewhat skeptical about the number of researchers’ families being supported by patents.
What’s the Open Science connection? Well, if you’re interested in patenting the results of your research, there are a lot of restrictions on how you can disseminate your results. You can’t keep an Open Notebook, or upload unprotected work to a preprint server or publicly-searchable repository, or even in many cases talk about the IP-related parts of your work at conferences. It seems from the data above that most universities would not be losing much if they gave up chasing patents entirely; nor would they be risking much future income, since so few seem to get significant funds from licensing. My own feeling is that any real or potential losses would be much more than offset by the gains in opportunities for collaboration and full exploitation of research data that come with an Open approach.
1. Christina left a comment pointing out that patents may be required for more than simply making money from licensing:

…an extremely important reason universities patent [is] to protect their work so that they may exploit it for future research… it turns out that universities have to patent in life sciences – even if they don’t actively market and license these patents – to be able to attract new research money from industry.

There are two distinct points here: first, that if you don’t patent you may not attract industry partners, and second, that if you don’t patent you may end up licensing your own tech back from someone else (I note that most tech licenses I know of are cheap or free “for research purposes” so the latter factor might not weigh so heavily). According to the 2007 AUTM data, industry investment in academic research amounted to about 7% of research expenditure and was up 15% over 2006.
2. David responded on DM’s thread with some counter evidence, on reading which I realise that the data above may (likely?) only show what the university received and not any money that went to the labs or researchers involved. Tech transfer may not be financially worth it for the university, except that it might still be doing good things for individual labs and PIs, and so would constitute a support service the university offers its research community. It also strikes me that my experience, such as it is, is mainly with Australian researchers, whereas David’s is in the US, so cultural differences may also apply.
3. More from Christina at her own place, here.
If you want the data, the spreadsheet I used is here.

What happened to serials prices in 1986-87? (Update: probably nothing.)

This could be nothing but an artifact (e.g. of the way the data were collected), but if you look at Fig 1 from this post, there’s a clear break in the serials expenses (EXPSER) curve that’s not evident in any of the others. Here’s the same plot reworked to emphasize what I’m talking about:


If you squint just right you can imagine a similar but much weaker effect, beginning a year or two later, in the total expenditures (TOTEXP) curve; and the salaries (TOTSAL) curve seems to start a similar upward trend at about the same time but then levels off after 1991 or so. I wouldn’t put any weight on either of those observations though — I’d never have noticed either if I hadn’t been comparing carefully with the EXPSER curve.
I’ve added linear regression lines for the 1976-1986 and 1987-2003 sections of the EXPSER data, just to emphasize the change in rate of increase. For those of you who will twitch until you know, just ‘cos: the regression coefficients of the two lines are 0.99 and 0.98 respectively. If you extrapolate from just the 76-86 section, TOTEXP exceeds the forecast for EXPSER after about 2000.
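If you want to reproduce that kind of two-segment fit and extrapolation, here’s a sketch in Python; the series below is synthetic stand-in data with an obvious 1987 break, not the real EXPSER figures (those are in the linked spreadsheet):

```python
# Fit separate least-squares lines to two segments of a yearly series
# and extrapolate the early trend forward. The data here are synthetic.
def linfit(xs, ys):
    """Ordinary least-squares intercept and slope."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

years1 = list(range(1976, 1987))                  # 1976-1986
years2 = list(range(1987, 2004))                  # 1987-2003
seg1 = [100 + 8 * (y - 1976) for y in years1]     # slower early growth
seg2 = [190 + 20 * (y - 1987) for y in years2]    # steeper later growth

a1, b1 = linfit(years1, seg1)
a2, b2 = linfit(years2, seg2)
print(f"1976-86 slope: {b1:.1f}/yr; 1987-2003 slope: {b2:.1f}/yr")
print(f"76-86 trend extrapolated to 2000: {a1 + b1 * 2000:.0f}")
```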
I have no idea if this means anything, but it is tempting to speculate. For instance: when did the big mergers begin in Big Publishing, and when did the big publishing companies start the odious practice of “bundling”, that is, selling their subscriptions in packages so that libraries are forced to subscribe to journals they don’t want just to get the ones they do?

Update: it’s probably nothing; the curve simply shows an increasing rate of increase, and you can break it up into at least five reasonably convincing-looking segments with breaks at 86-87 and 94-95. It’s possible there were two “pricing events” around those times, but I think this is most likely just an illustration of what can happen when you look a little too hard for patterns in your data!


Scholarly (scientific) journals vs total serials: % price increase 1990-2009

Following on from this post, I manually extracted historical data for average scholarly journal prices in a dozen broad disciplines from the Library Journal Annual Periodicals Price Surveys by Lee Van Orsdel and Kathleen Born, and compared these with three datasets from the earlier post: ARL libraries’ median total serials expenditures (ARL all serials), Abridged Index Medicus average journal price (AIM) and the consumer price index (CPI):


My concern with the AIM dataset was that it was too small and specialized to support broad conclusions, but it turns out that the AIM data sit somewhere in the middle of the disciplines analysed. Astronomy is closest to the ARL all serials median, with math and computer science not much worse; general science is the worst offender, with engineering and technology, chemistry and food science not far behind. From 1990 to 2008, total price increases ranged from 238% (astronomy) to 537% (general science); that’s 3.7 and 8.3 times the increase in the CPI, respectively.
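The “times the CPI” figures are just the ratio of the total percentage increases; using the numbers quoted in the paragraph above:

```python
# Ratio of each discipline's total 1990-2008 price increase to the
# CPI increase over the same period (~65%, as quoted in these posts).
cpi_increase = 65.0  # percent
increases = {"astronomy": 238.0, "general science": 537.0}  # percent

for field, inc in increases.items():
    print(f"{field}: {inc / cpi_increase:.1f}x the CPI increase")
```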
This dataset covers an average of around 3600 journals from 2005-2009, 3255 from 1997-2001 and 2655 from 1989-1990. I think this represents good evidence that historical price data for total serials, even though it shows a rate of increase far greater than that of the CPI, masks an even greater rate of increase among scholarly (scientific) journals. It’s difficult to look at that graph and believe that scholarly publishers are playing fair, particularly when one remembers that online publishing, with its attendant cost reductions, came of age during the same period of time.
The Van Orsdel/Born surveys include a number of other scholarly disciplines (art, architecture, business, history, language, law, music, etc etc). If I have the time I’ll work those up as well, to provide as broad a picture as possible. I should also include numbers of titles in each discipline, to give some idea of total influence. For instance: although general science (around 60 or 70 titles) shows the greatest increase, it likely contributes far less to the serials crisis than health sciences (more than 1500 titles).
(The data are available in this Excel spreadsheet.)

Some wishes come true.

A while back, I posted about my discovery (new to me, though not new to many others) that the serials crisis should probably be called something like the “scholarly journals crisis”. The term “serials” includes a wide range of publications, most of which are not peer-reviewed scholarly journals — newspapers, government reports issued in series, yearbooks, magazines and more. Only about 1/10 of the serials in Ulrich’s directory are peer-reviewed. The average scholarly journal costs around 10 times as much as the average serial, and while the cost of the scholarly literature continues to climb, median serial unit costs at ARL libraries have actually been falling for the last seven or eight years (Fig 1 below). It therefore appears that scholarly journals are the driving force behind the serials crisis.
At the time, I wished that I had some specific data to show the difference between scholarly and average serials — hence the title of this post: via medinfo, I learned that EBSCO Information Services has released a brief report (pdf!) on the price history of well regarded clinical journals, using 117 titles from the NLM’s Abridged Index Medicus (AIM). This is a curated list of biomed journals “of immediate interest to the practicing physician” and can be searched on PubMed as a subset limit named “core clinical journals”.
As a reminder, here’s that graph; it’s from the ARL stats report from 2004-5 and the reason it’s famous is the way that “Serials Expenditures” outstrips the Consumer Price Index (CPI) and other measures:


Here’s a comparison of that data with the price history of the AIM journals; the line labeled “expser/ARL libraries all serials” shows the 1990-2005 subset of the “Serials Expenditures” data from Fig 1, and “EBSCO/core clinical journals” shows the AIM data:


Data labels (ARL data from here):

  • serpur: Current Serials Purchased, median value from all ARL libraries
  • expser: Expenditures for Serials, median etc
  • totsal: Total Salaries & Wages, median etc
  • serunit: Serial Unit Cost; median value of expser/serpur calculated for all ARL libraries
  • EBSCO: average price per journal in the Abridged Index Medicus set
  • CPI-U: Consumer Price Index, all urban consumers, annual average, not seasonally adjusted

This is exactly what I wished for, hard evidence of the difference between scholarly and average serials; and what that evidence strongly indicates is that price increases in scholarly journals are driving the serials crisis. Scholarly journals far outstrip total serials in terms of annual price increase, even though the latter shows a much more rapid increase than the CPI. In contrast, library salary expenditure follows the CPI closely, and median serial unit cost (all serials) has been dropping slowly since 2000.
Frankly, I’m tempted to name this the Big Fat Ripoff Graph. Between 1990 and 2008, the CPI increased by about 65%, whereas over the same period the average price of an AIM journal increased by 415%, a 6.4-fold difference. I’ve seen publishers try to defend the “total serials expenditures” vs CPI discrepancy by pointing out that journals are proliferating — indeed, the “serials purchased” curve is headed upwards at an increasing rate, particularly over the last five years or so. But that defense is no good against the BFR Graph, on which the most damning curve shows average journal prices. I’ve also seen comments to the effect that if mean or median serial unit costs are dropping, publishers must be offering increasing value for money even if they are charging more in total. That might be true of the set of “all serials publishers”, but it’s apparent from the BFR Graph that scholarly journal publishers can make no such claim.
It must be remembered, of course, that we are only looking at a little over a hundred clinical journals here, a small and discipline specific subset. Nonetheless, the result is so striking that I think it is a considerable inducement to the gathering of more data. Since it seems my wishes for more work are coming true, I’ll make another: now I want price history data for other, larger journal subsets in other scholarly disciplines. I wonder what the BFR Graph looks like for those datasets?
(P.S. If you want the numbers I used, or to check my work, the spreadsheet is here.)

Update: ha! I just got around to reading this article, linked by Peter Suber a couple of days ago; turns out it’s full of annual price data, and Van Orsdel and Born have been doing these surveys for at least ten years. There doesn’t seem to be a central collection or data collation, so I’ll have to piece it together. Stay tuned!

Someone else is fooling around with numbers.

Via Peter Suber, I came across this editorial in the Journal of Vision:

Measuring the impact of scientific articles is of interest to authors and readers, as well as to tenure and promotion committees, grant proposal review committees, and officials involved in the funding of science. The number of citations by other articles is at present the gold standard for evaluation of the impact of an individual scientific article. Online journals offer another measure of impact: the number of unique downloads of an article (by unique downloads we mean the first download of the PDF of an article by a particular individual). Since May 2007, Journal of Vision has published download counts for each individual article.

The author goes on to compare download vs citation (counts and rates, and downloads or citations over time). It’s a pretty good analysis of an important topic, but something vital is missing:

Where are the data? Can I have them? What can I do with them?1

In fact, the data are approximately available here. Why “approximately”? Well, I can get a range of predigested overviews: DemandFactor (roughly, downloads/day/first 1000 days) Top 20, total downloads Top 20 and article distributions by DemandFactor and total downloads. I can also get the download information for any given article — one article at a time, and once again predigested in the form of a graph from which I have to guesstrapolate if I want raw, re-useable data.
This is disappointing, for both general and specific reasons. It’s always disappointing to see data locked away in a graph or a pdf or some similar digital or paper oubliette, there to languish un(re)used. It’s also disappointing to see a journal getting way out ahead of the curve on something as important and valuable as download metrics (is there another journal besides J Vis that provides this information, even predigested?), and then missing an opportunity to continue to innovate by providing real Open Data.
It’s also disappointing in this specific instance, because I have a question: why is Figure 1 plotted on a log scale and, more importantly, was the correlation coefficient calculated from log-transformed data? I could understand showing the log scale for aesthetic reasons, but I can’t think of a reason to take logs of that kind of data — and doing so can alter the apparent correlation. For instance, remember Fig 1 from this post? Here it is again, together with a plot of log-transformed data, both shown on natural and log scales:
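To show what I mean about log transforms altering the apparent correlation, here’s a toy example with synthetic exponential data (nothing to do with the J Vis numbers): the raw series correlates only loosely with x, while its log correlates perfectly.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, plain-Python version."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = list(range(10))
ys = [math.exp(x) for x in xs]  # strongly skewed, like count data

r_raw = pearson(xs, ys)
r_log = pearson(xs, [math.log(y) for y in ys])
print(f"raw r = {r_raw:.2f}, log-transformed r = {r_log:.2f}")
```

Same data, two very different correlation coefficients — which is exactly why it matters whether that Figure 1 coefficient was calculated before or after taking logs.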


I could answer my own question quickly and easily if I could get my hands on the underlying data — which leads me right back to one of the primary general arguments for Open Data. If I, statistical ignoramus and newcomer to these sorts of analyses, have questions after a brief skim through the paper, what questions might a better equipped and more thorough reader have? It’s simply not possible to know — the only way to find out is to make the data openly available!
I realise it’s not possible for journals to demand Open Data from their authors — that’s what funder-level mandates are for, though there’s much discussion still to be had regarding whether Open Data mandates would be a good idea. Nonetheless, when journals publish analyses of their own data, it would be great to see them leading the way by providing unrestricted access to that data.
1 Astute readers, both of you, will remember that refrain from this post.

Fooling around with numbers, part 5b.

I’ve already assigned part 6 to a particular analysis in an effort to get me to actually do that work, but I felt that I just had to include this (via John Wilbanks) in the series:


I’m just sayin’. (I may have to get that graph as a tattoo).

P.S. Never mind the date, this is not a trick; I hate online April Fool jokes with the fiery power of a thousand burning suns.

Should we talk about the “journals crisis” instead of the “serials crisis”?

I stumbled upon something new-to-me, and possibly even useful-to-others, in my fooling around with numbers (table 2 and discussion thereof here), but it’s somewhat buried under all the “how I made this figure” and “where I got these data” details. For that reason, and because I didn’t trust my idea until I had some external reinforcement, I thought I’d give it a separate post all its own.
Here’s the thing: what is widely known as the serials crisis in library costs is probably driven largely by the pricing of scholarly journals. In library parlance, “serials” includes, inter no doubt many alia, newspapers, goverment reports issued in series, yearbooks and magazines (periodicals), in addition to the scholarly literature. Of the 225, 000 or so periodicals in Ulrich’s, only about 25,000 are peer reviewed. In the FriendFeed discussion started by my post, Walt Crawford said

…some of us have long argued that there isn’t a serials crisis for library budgets, there’s a scholarly journal crisis. Magazines (and there are about 1/4 million magazines as compared to about 25,000 scholarly journals) tend to have very low prices and very modest increases.

Although non-refereed serials dominate product counts (and, apparently, library collections), the situation is reversed for unit expenditures. The average unit cost for the UCOSC dataset, which is composed entirely of scholarly journals, is roughly ten times the average unit cost for any of the other datasets I used, all of which were general data that included all types of serial. Here’s Walt again:

the 10:1 ratio for UC (that is, scholarly journals averaging 10x as expensive as all serials) sounds about right

When the numbers and Walt’s experience began to line up, I became much more confident in my conclusion, that the serials crisis is really a scholarly journals crisis. It’s not clear to me, in fact, why the phenomenon got the nickname it did; perhaps it’s just that “serials crisis” is a punchier phrase.
I’m not at all sure that any of this is more than semantic nitpicking, but giving things their proper name can be important. Most researchers who only hear the name won’t care about a “serials crisis” — that’s a library problem, nothing to do with us. But if they hear about a “scholarly literature crisis”, it becomes clearer that the issue is the potential loss of access to resources necessary to do our jobs. I suspect most researchers who’ve heard of the serials crisis are aware that it is, at least in part, about journal pricing, but I wonder how many know that it’s pretty much only about journal pricing? This little “discovery” of mine really did put things in a different perspective for me, and I’m probably more informed about library- and publishing-related issues than most benchmonkeys.
I doubt that an alternative name will catch on, and I’m not going to start campaigning for one — but I think that from now on I’ll at least occasionally refer to the “serials/scholarly literature” crisis, or something similar, if only to remind myself of my own little satori. (Question for the lazyweb: can anyone suggest a better phrase, one which would make it more apparent to researchers that they should care about this?)

Fooling around with numbers, part 5

As promised, here is the distribution of journal prices for the subsets of the Elsevier life sciences dataset which either have or don’t have impact factors, and for the entire UCOSC dataset (in which all journals have IFs):
Each interval is $500 wide: $0 to $499, $500 to $999, etc., and datapoints are plotted at the midpoint of each interval.
The conclusion is the same as in part 1, just a bit clearer now. Elsevier journals without an impact factor are priced lower than those which have an IF, and the price distributions are somewhat different between journals with and without an IF. Note, though, that if I’d used a $1000 interval instead of $500, the initial rise in the +IF curves would not appear; if these are power-law distributions the main difference is probably the scaling exponent. I think. (Math is not my friend.)
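For the record, the binning is just integer division on price; here’s a quick sketch with made-up prices (the real ones come from the Elsevier and UCOSC spreadsheets):

```python
from collections import Counter

# Bin journal prices into $500-wide intervals ($0-499, $500-999, ...)
# and report each bin by its midpoint. Prices below are invented.
prices = [120, 450, 480, 750, 990, 1500, 1990, 2600, 5200]

WIDTH = 500
bins = Counter(p // WIDTH for p in prices)

for idx in sorted(bins):
    lo, hi = idx * WIDTH, idx * WIDTH + WIDTH - 1
    midpoint = lo + (WIDTH - 1) / 2  # e.g. $249.50 for the $0-499 bin
    print(f"${lo}-{hi} (midpoint ${midpoint}): {bins[idx]} journals")
```

Note that the choice of WIDTH changes the shape of the curve, which is the point I was making about the $500 vs $1000 intervals above.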
It almost looks as though low-end journals are shunted out of the lowest price bracket as soon as they get an IF, any IF, and then tend to increase in price as the IF goes up. Update: no it doesn’t. I don’t know what I was thinking there.

The rest of the series: part 1, part 2, part 3, part 4.

Fooling around with numbers, part 4; or, those data — you keep using them — I don’t think they mean what you think they mean…

At the end of part 3, having looked at some of the ways in which prices and price/use were distributed, I said I’d try to say something about what constituted a fair price. I hadn’t thought that through at all, and it turns out that I really can’t get much leverage against that question from the UCOSC dataset alone.
In addition to the graphs in parts 1-3, here’s yet another way to look at the UCOSC data (again, this is a png from a screenshot because MT ate my perfectly good table1):
Table 1
Perhaps Elsevier doesn’t stand out quite so much as I might have expected — they still dominate by virtue of market share, but in terms of cost/use or use/title, Springer looks the worst of the bunch. Mean ($0.76) and median ($1.89) cost per use doesn’t mean much without context. I could argue that since libraries are having trouble keeping up with serials costs and usage is only likely to increase, those probably don’t represent fair prices… but I don’t know how much weight that argument would hold, and anyway you should go read Heather Morrison on why usage-based pricing is dangerous. (That’s one of the benefits of thinking-out-loud like this; knowledgeable people come along and point out stuff you need to know. Yay lazyweb!)
So, I need context: let’s start with, how many libraries are there? According to the American Library Association, there are more than 120,000 libraries in the USA — but for my purposes, I’m really only interested in those which carry the scholarly literature. The US Dept of Education’s National Center for Education Statistics runs a Library Statistics Program, which provides data specifically on academic libraries.
According to the ALA and the NCES, there are about 3700 academic libraries in the US. If all of them subscribed (at list price) to the 2904 journals in the UCOSC dataset, that would work out to $13,306,150,900 — about $13 billion — per year on scholarly journals alone. To put that into perspective, the entire NIH research budget for 2008 was less than $30 billion. I have been told that most libraries don’t pay list price, because publishers offer all kinds of deals, but I wondered whether that $13 billion was at least in the right ballpark, so I went looking for more data.
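Here’s the back-of-envelope arithmetic, in case anyone wants to check it; note that the per-library list-price total of about $3.6 million is back-calculated from my quoted grand total, not read directly from the spreadsheet:

```python
# Worst-case estimate: every US academic library subscribes to every
# journal in the UCOSC dataset at list price. The list-price sum is
# back-calculated from the $13,306,150,900 total quoted in the post.
N_LIBRARIES = 3700             # US academic libraries (ALA/NCES)
UCOSC_LIST_TOTAL = 3_596_257   # sum of list prices for 2904 journals, USD

total = N_LIBRARIES * UCOSC_LIST_TOTAL
print(f"${total:,} per year")  # → $13,306,150,900
```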
Since the UCOSC dataset covers 2003-4, I looked at the NCES report for 2004 (the spreadsheet I used is here). The ALA has another division, the Association of College and Research Libraries, which keeps its own records; alas, these are not free, but I could get nearly everything I wanted from the summaries — again, I just looked at 2004. There’s also the Association of Research Libraries, which is “a nonprofit organization of 123 research libraries at comprehensive, research-extensive institutions in the US and Canada that share similar research missions, aspirations, and achievements”, mostly made up of very large libraries (think Harvard, Yale, etc). The ARL also compiles and makes available statistics on its members; I pulled out the 2004 data from the download page (spreadsheet here).
Finally, I added the UCOSC dataset for comparison, and for extra context I pulled out the University of California subset from the ARL data (Berkeley, Davis, Irvine, LA, Riverside, San Diego and Santa Barbara; I think these are the largest 7 of UC’s 10 main campus libraries). The resulting data look like this2:
Table 2
Na, not applicable; cc, couldn’t calculate. The ACRL data is derived mainly from two summaries, one showing expenditure (red) and one showing holdings (blue). The mean cost/serial is a fudge, since it was calculated using figures from both summaries, but I doubt it’s significantly different from the value I would get if I had all the data, since the number of libraries included in each set is so similar. The other values in green are also approximations derived from summary reports3. Note that the “per library” figures for the UCOSC dataset are actually just for that subset of journals (hence the “<<1” entry for “no. libraries”).
I’ve put some sanity checks — do these data make sense? — in a footnote4; to me, the data appear both externally and internally consistent.  I don’t, in other words, appear to have done anything egregiously stupid. Not with the numbers, anyway:
Two things jump out at me from Table 2, which together are responsible for the subtitle of this entry. First, my $13 billion guess was way off — the actual amount spent on serials by US academic libraries is probably closer to $1-2 billion.  Large (e.g. Ivy League) libraries might spend many tens of millions of dollars, small libraries maybe only a few hundred thousand.  That’s still an enormous amount of money, but it’s not half the NIH budget!  So why the discrepancy?
Quite apart from “list price” and “what libraries actually pay” being two very different things, I’ve been making a mistake in terminology.  When I think of “serials” in a library, I think of the peer-reviewed scholarly literature; I tend to use “journals” to mean the same thing.
This is very, very wrong.
(As, no doubt, any librarian could have told me, without the need to go ferreting through all those numbers.) From the NCES survey instrument used to collect their data (emphasis mine):

Current serial subscriptions (ongoing commitments) (line 13) – Report expenditures for current subscriptions to serials in all formats. These are publications issued in successive parts, usually at regular intervals, and, as a rule, intended to be continued indefinitely. Serials include periodicals, newspapers, annuals (reports, yearbooks, etc.), memoirs, proceedings, and transactions of societies.
Current serial subscriptions (line 26) — Report the total number of subscriptions in all formats. If the subscription comes in both paper and electronic form, count it twice. Count each individual title if it is received as part of a publisher’s package (e.g., Project MUSE, JSTOR, Academic IDEAL). Report each full-text article database such as Lexis-Nexis, ABI/INFORM as one subscription in line 27. Include paper and microfilm government documents issued serially if they are accessible through the library’s catalog.

From the ARL ditto:

Questions 4-5. Serials. Report the total number of subscriptions, not titles. Include duplicate subscriptions and, to the extent possible, all government document serials even if housed in a separate documents collection. Verify the inclusion or exclusion of document serials… Exclude unnumbered monographic and publishers’ series. Electronic serials acquired as part of an aggregated package (e.g., Project MUSE, BioOne, ScienceDirect) should be counted by title. A serial is

a publication in any medium issued in successive parts bearing numerical or chronological designations and intended to be continued indefinitely. This definition includes periodicals, newspapers, and annuals (reports, yearbooks, etc.); the journals, memoirs, proceedings, transactions, etc. of societies; and numbered monographic series.

Oy vey. Newspapers, yearbooks, government documents and a whole bunch of other things that aren’t scholarly journals are (or can be) serials too. “Periodicals” means National Geographic qualifies — hell, so does Playboy magazine!
As of today (March 17), Ulrich’s Periodicals Directory lists 224,151 “active” periodicals; of those, 65,461 are “academic/scholarly”; and of those, 25,425 are “refereed”.
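Those counts work out to the rough fractions I’ve been quoting (about a tenth of active periodicals refereed):

```python
# Ulrich's Periodicals Directory counts as of the date quoted above.
active = 224_151
scholarly = 65_461
refereed = 25_425

print(f"scholarly: {100 * scholarly / active:.0f}% of active periodicals")
print(f"refereed:  {100 * refereed / active:.0f}% of active periodicals")
print(f"refereed:  {100 * refereed / scholarly:.0f}% of scholarly titles")
```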
What do those things cost which aren’t part of the peer-reviewed literature? How does their inclusion in library data impact the means and medians I’ve been looking at?
Which brings me to the second item of note from Table 2: the mean cost/serial is on the order of ten times higher for the UCOSC dataset than for the other sets.  Does that mean that the scholarly literature is actually the powerhouse of the serials crisis (pdf!), and if we could zero in on the peer-reviewed fraction of the serials data we would see an even more dramatic rise in price? Or does it have more to do with the fact that the UCOSC dataset is deliberately composed of relatively high-end journals, thus artificially inflating the apparent costs? If every library in the NCES set subscribed to those journals at even one-tenth of list price, it would still account for pretty much the entire serials expenditure — so how many libraries subscribe to which journals? What of the roughly 22,000 peer-reviewed journals that aren’t included in the UCOSC dataset?  If libraries are subscribing to anywhere from a few thousand serials to well over 100,000 (e.g. ARL 2007 numbers for Columbia, Harvard and Illinois/Urbana), what proportion of those subscriptions are to peer-reviewed journals — or, conversely, to what proportion of the peer-reviewed literature does the average library subscribe?
In other words, I’ve made no headway at all on the question of a “fair price”; all I’ve managed to do here is to find more questions. I guess that’s progress, because at least they are better-defined, more specific questions. Answering them will require much more fine-grained data, though: which libraries subscribe to which peer-reviewed journals, and at what cost? I think the answers might be very useful to the research community, but collecting the data would be a full-time job. (I’m up for it, by the way, if anyone reading this is in a position to hire me to do it. Seriously, I’d love it. After all, look what I’m doing for fun.)
To return to where I started: there’s another angle of attack on the “fair price” question, which is to look at things from the other side. How much does it cost to publish a paper in the peer-reviewed literature, and how does that compare to actual income at publishing companies? This information is notoriously hard to come by, but I’ve been collecting links and notes for a while, so in Part 6* I’ll try to put them all together and see if I’ve got anything useful.
* I’ve just remembered something else I want to do first: Part 5 will take a look at journal price distributions with and without impact factor, using the Elsevier Life Sciences (see Part 1 Fig 3) and the UCOSC datasets.
Update: if you’ve read this far, go read the FriendFeed discussion; you’ll like it.


1 If you want the data, there’s a comma-delimited text version of the table here, and the spreadsheet from which the table is derived is here.
2 Comma-delimited text file here.
3 The following table shows the figures used to calculate the sum total library expenditure for the ACRL dataset.  Numbers in black are taken from the summaries provided, numbers in pink are calculated from them.
Table 3
Mean total expenditure per library was calculated using 1074 as the approximate average number of libraries reporting.
4 Sanity checks:

  • the ARL and ACRL subsets of the NCES libraries spend less in sum than the NCES set, but the mean and median expenditures/library are lower for the NCES set because it includes more, and smaller, libraries
  • the mean and median number of serials/library is similar between the ARL dataset and its UC subset, both figures being much larger than the mean serials/library for the NCES or ACRL sets (again, more and smaller libraries)
  • the mean and median cost/serial is similar throughout, except for the UCOSC dataset which is a curated subset of high-end scholarly journals (discussed above)

Are those reasonable totals for the libraries to be spending?

  • The ARL 2004-5 report shows that member libraries spent $680,774,493, with a median per library of $5,904,464, on serials, and total library expenditure was $2,683,008,943 (median per library $20,210,171)
  • The NCES 2004 summary shows that 3653 libraries surveyed spent, in sum, $5,751,247,194 on total operating expenses, $1,363,671,792 on serials and $2,157,531,102 on information resources in general

Are those reasonable total numbers of journals per library?

  • OHSU (where I was until recently employed) has 20,857 entries in its “journals” catalog
  • The NCES 2004 summary shows that, all together, 3653 academic libraries held 12,763,537 serials subscriptions
  • The ARL 2004-5 report shows that 113 member libraries held 4,658,493 subscriptions, with a median per library of 37,668

Are those reasonable mean and median costs per serial?

  • I could only find unit costs for serials in the ARL report, in the “analysis of selected variables”, where the mean cost/serial is given as $247.55 per subscription (median $231.90, range $93.72 to $656.31, 88 libraries reporting).

So, at least in ballpark terms, the numbers in my tables check out against the summaries compiled by the various agencies from their own data (and against the OHSU library catalog). There are no order-of-magnitude discrepancies — except perhaps in cost/serial, as discussed above.
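The ballpark check above can be sketched in a few lines of arithmetic. This is just a quick reworking of the figures already quoted in the footnotes (ARL 2004-5 and NCES 2004 summaries) — no new data, and the crude totals-divided-by-totals estimates will differ from the ARL’s per-library mean, which was computed over only 88 reporting libraries:

```python
# Back-of-the-envelope check on the cost-per-serial figures quoted above.
# All numbers come from the ARL 2004-5 and NCES 2004 summaries cited in
# the footnotes; nothing here is new data.

arl_serials_spend = 680_774_493     # ARL members, total serials expenditure ($)
arl_subscriptions = 4_658_493       # ARL members, total subscriptions

nces_serials_spend = 1_363_671_792  # all 3653 NCES academic libraries ($)
nces_subscriptions = 12_763_537     # total subscriptions, NCES

# Crude mean cost per subscription: total spend / total subscriptions.
arl_cost_per_serial = arl_serials_spend / arl_subscriptions      # ~ $146
nces_cost_per_serial = nces_serials_spend / nces_subscriptions   # ~ $107

# The ARL "analysis of selected variables" reports a mean of $247.55/serial
# (computed per library, 88 libraries reporting, hence the difference).
# The point of the sanity check is only that all three figures fall
# within the same order of magnitude.
reported_mean = 247.55
for estimate in (arl_cost_per_serial, nces_cost_per_serial):
    assert reported_mean / 10 < estimate < reported_mean * 10

print(f"ARL:  ${arl_cost_per_serial:.2f} per subscription")
print(f"NCES: ${nces_cost_per_serial:.2f} per subscription")
```

The two crude estimates come out around $146 and $107 per subscription — lower than the reported $247.55, but comfortably within the same order of magnitude, which is all the ballpark claim requires.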

Updates on “science and selfishness”

Update the first: now I feel bad for not waiting (though I did put “read AFTER honeymoon!!!” in the subject line). John Wilbanks wrote back right away to say that it will take him a while to get to it, but that he will ferret out specific answers regarding the Science Commons work and interoperability.
Update the second: Peter Sefton has more here, including specific recommendations for working with Microsoft while avoiding “a new kind of format lock-in; a kind of monopolistic wolf in open-standards lambskin”:

  • The product (eg a document) of the code must be interoperable with open software. In our case this means Word must produce stuff that can be used in and round tripped with OpenOffice.org and with earlier versions, and Mac versions of Microsoft’s products. (This is not as simple as it could be when we have to deal with stuff like Sun refusing to implement import and preservation for data stored in Word fields as used by applications like EndNote.)

    The NLM add-in is an odd one here, as on one level it does qualify in that it spits out XML, but the intent is to create Word-only authoring so that rules it out — not that we have been asked to work on that project other than to comment, I am merely using it as an example.

  • The code must be open source and as portable as possible. Of course if it is interface code it will only work with Microsoft’s toll-access software but at least others can read the code and re-implement elsewhere. If it’s not interface code then it must be written in a portable language and/or framework.