estimating ullage

Ullage, the word for the empty space at the top of a wine bottle, is Peter Suber’s term for the gap between a library’s actual holdings and its patrons’ access needs. That’s a difficult thing to measure, but I might have found a way to estimate it with reference not to patron needs but to all published journals, as follows.

  • In 2003, Kathy Varjabedian at LANL compared the electronic holdings at 12 (large, well funded research) libraries with the ISI Journal Citation Report’s top 100 most-cited journals for the previous year, producing an estimate for the ullage of those libraries of between 2% and a startling 54% (or 0% and 40%, if clinical titles were excluded).
  • Also in 2003, Carol Tenopir estimated that there were around 44,000 scholarly journals in publication, just over 21,000 of them “refereed”, which is the best proxy that Ulrich’s Periodicals Directory allows for “peer-reviewed”. Repeating Tenopir’s search just now returned 26,677 active, refereed, academic/scholarly journals.
  • Last year, I used a UCOSC dataset from 2004, a curated list of about 3000 titles, to estimate the average subscription price for a peer-reviewed scholarly journal (table 2 here) at $1238/title.
  • Here are some more data from the Library Journal Periodicals Price Survey:


    Sorry about the jpg, I still can’t make MT cope with tables. The spreadsheet is here. In case the image goes awry: the dataset covers more than 5,000 titles from 30 disciplines, and mean price/title is $723 in 2003 and $791 in 2004.
  • The mean serials expenditure for an ARL member institution was around $5.5 million in 2003 and $5.8 million in 2004.

At $1200/journal, $5.8 million¹ would buy subscription access to about 4,800 titles, which is less than 23% of the just over 21,000 refereed journals Tenopir counted in 2003, and only about 18% of the 26,677 active, refereed, academic/scholarly journals listed now. At $700/journal, ARL members, some of the largest and best funded libraries in America (indeed, in the world), are able to afford access to about 8,300 titles: less than half of the scholarly literature.
This seems reasonably consistent with the earlier LANL estimate, given that Varjabedian looked only at the top 100 most-cited journals, which must surely be at the top of any research library’s “must-have” list.
It’s important to point out that what I’m estimating here is not ullage sensu Suber, but rather library holdings relative to all possible holdings. But I would argue that the access needs of all the scholars and other patrons served by ARL libraries are surely a decent proxy for “all possible journals”, if not a significantly larger body of information! Put another way, here I am estimating the gap between current access levels and the information availability of a 100% Open Access world.
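
For anyone who wants to check or vary the arithmetic, here is a minimal Python sketch. It uses only the figures quoted in this post; the helper name `affordable_share` and its `journal_fraction` parameter are my own illustrative inventions, and 21,000 is a rounding of Tenopir’s “just over 21,000”. Treat it as a back-of-envelope calculation, not a model of how any library actually spends its budget.

```python
# Back-of-envelope coverage estimate from the figures quoted in this post.
# affordable_share() is a hypothetical helper, not part of any library.

def affordable_share(budget, price_per_title, total_titles, journal_fraction=1.0):
    """Return (number of titles affordable, fraction of total_titles covered).

    journal_fraction is the assumed share of the serials budget that
    actually goes to scholarly journals (1.0 means all of it).
    """
    titles = (budget * journal_fraction) / price_per_title
    return titles, titles / total_titles


ARL_SERIALS_2004 = 5_800_000  # mean ARL serials expenditure in 2004, USD
REFEREED_2003 = 21_000        # Tenopir's 2003 count, rounded (assumption)
REFEREED_NOW = 26_677         # repeat of Tenopir's search in Ulrich's

if __name__ == "__main__":
    for price in (1200, 700):
        for total in (REFEREED_2003, REFEREED_NOW):
            titles, share = affordable_share(ARL_SERIALS_2004, price, total)
            print(f"${price}/title vs {total:,} journals: "
                  f"{titles:,.0f} titles, {share:.1%} coverage")
```

Passing `journal_fraction=0.9` applies the more conservative assumption from the footnote, that only 90% of the serials budget goes to scholarly journals; coverage drops accordingly.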

¹ This calculation assumes that 100% of the serials budget goes to scholarly journals. That’s not true, but I’ve argued elsewhere that it’s at least 90%.

an interesting mind

This entry is especially for those of my readers who do not work in science or related fields (librarians, publishers, etc), and who are not quite sure why I am so obsessed with Open Science. (Hi, Mom and Dad!)
This is Pawel Szczesny at TED Warsaw, describing for the lay public what Open Science is, and what it can mean. Pawel’s is the interesting mind to which I refer in the title. I finally met him in person at Science Online earlier this year, but I have been following him around online for years. He never fails to come at a question or problem from an interesting and useful angle, and his TED talk is just the latest example.

What if?
What if I explain in simple words my research area? What if I point you to additional information so you could learn more and understand the topic I am working on? What if I make sure you have access to all relevant literature for free? What if I make sure you have access to all the data so you can play with it on your own? What if I take off this laboratory coat, so there is one artificial difference less between you and me? What if the only thing that mattered in this game of solving nature’s mysteries was skills, knowledge and passion? We have a name for that utopian vision: it’s called Open Science.

Do yourself a favour, watch the whole thing.

Where indeed?

AJ Cann has a post up that neatly summarizes the dilemma facing Open Science advocates/enthusiasts, and asks useful questions arising therefrom. In the current competition-focused environment, says Alan:

Open science is an iterated prisoner’s dilemma, which is a messy and unpredictable business. Too unpredictable for most people to try to build a career on. Thinking about strategies which are likely to be successful leads me towards the concept of an open science community rather than unilateral complete openness – a long term multiplayer collaboration. Does such a community already exist? If not, how do we build one?

Having taken a job in biotech, I feel a bit cut off from any such community — industry is notoriously protective of IP and fond of secrecy besides. I feel a bit of a fraud, for instance, taking part in discussions of Open Science issues on FriendFeed (such as the conversation kicked off by Alan’s blog post), knowing that I can’t talk openly about my own work. It doesn’t keep me from shooting off my yap, of course, but it’s a nagging icky feeling — and I keep getting the meta-feeling that it doesn’t have to be this way. Just as secrecy in academia only makes sense within the existing reward structure, secrecy in industry could be at least partly offset by policy decisions that recognize the gains in efficiency that collaboration can bring. I’ve heard multiple times from multiple sources that industry may close itself off from the rest of the world, but within a company, the teamwork ethic is amazing. Clearly, the value of co-operation is recognized. Why shouldn’t that also work for (larger and larger) groups of companies? What you lose by not being the only company to know something from which profit can be made (call it X) is offset by the fact that you might never have learned X without the collaboration — and in the meantime, the world gets X that much faster.
It seems clear, though, that such top-down decisions are more likely to be made in academia, and perhaps the nonprofit sector, than in profit-driven industry — at least until there are enough concrete examples of success to tip the perceived balance of risk. If I’m — if we Open Foo types are — right, it’s actually riskier to compete than to cooperate in the long term. Better to own a share of X sooner than to delay any return on your investment in the hope of owning X outright later. This is especially true when the resources required to try to own X could be used to get you shares in multiple other projects at the same time.
Even then, openness in industry seems to me unlikely to go beyond consortia. Complete openness (open notebook science) precludes patent protection, and in the dog-eat-dog world of business driven by the insatiable demands of disconnected shareholders, I don’t think we are ever going to wean the beancounters off their patents. (We could improve the situation by overhauling the patent process so that teeny incremental changes were not granted full protection, of course; but I digress, and don’t get me started.)
So to return to Alan’s analogy, “multiplayer” means different things in academia (and perhaps the nonprofit sector) and in business. In business, it means defined communities of co-operation; in academia, I see no good reason why it shouldn’t mean everyone (except, perhaps, where the two intersect and academics enter a business-defined collaboration¹).
In academia, communities with an open science focus are beginning to form. The best example is still the one which continues to coalesce around Jean-Claude Bradley’s UsefulChem initiative, but it’s no longer the only one as it was just a few years ago. Chemist Mat Todd has funding for an open science project to improve synthesis of the anti-schistosomiasis drug, praziquantel. Biophysicist Steve Koch has a labful of open science enthusiast grad students. And so on; there’s a list of Open Notebook practitioners on wikipedia, and my own feeling is that technical rather than philosophical barriers are keeping quite a few labs from that list. By being discoverable on the public web, all of these labs can do what Jean-Claude is doing: accumulate collaborators and get more work done. Try searching Google for “DNA tweezers kinesin” — the second and fifth hits will hook you up with Steve Koch. “Praziquantel synthesis” — the third hit will take you to the schisto community on The Synaptic Leap, where you’ll soon meet Mat Todd, and the seventh hit will take you to a brief discussion of Mat’s project on the UsefulChem blog. “Antimalarial Ugi” — most of the first ten hits will introduce you to UsefulChem. If you’re doing something that’s in any way related to the work that goes on in these labs, you’re one Google search away from a collaboration.
In business, too, more and more companies are recognizing the benefits of wider sharing. Details of private collaborations are hard to come by, but just try searching for “precompetitive sharing” — even Big Pharma can see that they stand to make net gains from sharing their datasets. For an even better example, check out Sage Bionetworks. I was lucky enough to hear Stephen Friend speak at the Science Commons Symposium a couple of weeks ago, and one of the points he made was that the really big questions in biology require such immense amounts of data that the only way to collect them is to do it in the open. Any impediment at all, be it CC-BY attribution requirements or IP protections, will derail the whole process; the only answer in the end is the public domain.
So, the seeds are there. I think continued crystallization is inevitable, but it’s certainly worthwhile to try to monitor and direct the process — by way of questions like those Alan is asking.

¹ I don’t buy the argument, by the way, that unless academics work in secret and enable strong patent protection they will never get industry partners. If you invent something from which profit can be made, someone will want to make that profit. If, without outright patent ownership, it’s not enough money to tempt a Roche or an Intel, there will always be smaller, hungrier companies.

no art without

I remember reading somewhere about a school of philosophical thought which holds that there can be no art without the resistance of the medium — that the art is in the difficulty the artist overcomes when trying to make the medium express his or her message.
I don’t know that I buy the idea, but I do notice that my cell phone camera doesn’t have a very broad color or contrast palette, so it tends to blow out highlights and lose shadow detail — and that I’m starting to recognize opportunities to exploit those weaknesses:


I’m not sure I like being trained to a particular visual style like that, though. I picked up a camera in the first place in order to see differently, and I’ve been very pleased with the change in my world that this practice has rendered. I don’t think I want to put blinders on it.

Panton Principles for Open Data in Science

The Open Knowledge Foundation has just announced the Panton Principles for Open Data in Science. Here’s the point-form version of the Principles (but do go and read the whole thing, including the concise but important preamble; and please consider endorsing):

Formally, we recommend adopting and acting on the following principles:

  1. When publishing data make an explicit and robust statement of your wishes.
  2. Use a recognized waiver or license that is appropriate for data.
  3. If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.
  4. Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.

I’ve written elsewhere about my feeling that Open Data/Open Science will eventually need a set of core Declarations to do for the wider movement what the BBB definitions have done for Open Access. A set of widely accepted terms and definitions provides a framework within which ongoing discussions can be much more efficient, focused and useful, as well as a point of reference and a standard introduction for newcomers to a field. Kudos to OKF and partners for making a strong start in this direction.
I do have one small quibble. Following Peters Suber and Murray-Rust, I want Open licenses to be three things:

  • explicit
  • conspicuous
  • machine-readable

The Panton Principles come right out and say “explicit”, and “machine-readable” is largely covered because the recommended licenses are available in machine-readable versions (though I’d have preferred to see that actual phrase in the text of the Principles). What’s missing, to my mind, is “conspicuous”. The point of Open licensing is to enable and promote re-use, so it’s important to make your license as obvious as possible to potential users. This might seem trivial, but I think it bears spelling out.
My own Open Data mantra is:

  • where are the data?
  • can I have them?
  • what can I do with them?

and again, the PPs are 2 for 3 by my count. The licensing covers what I can have and what I can do with it, but there’s no mention of where I can find it in the first place. When we’re talking about a database, the question doesn’t arise since the license is in the same place as the data. But if we’re talking about data which underlie a published paper, those data are very often not in the same place as the paper, even if the license is there. So it’s important to make sure that your data are available: find or build them a stable online home and then let potential users know where it is. There’s not much point in placing something in the Public Domain if the only copy is on your desktop. I’d have liked to see an explicit discussion of storage, access and signposting in the Principles… though come to think of it, this is really a different (and enormous) set of questions. So perhaps “conspicuous” covers this as well, and the missing Principle is simply that there should be a highly visible link to the license and the data themselves in every place where they are used, mentioned or otherwise likely to be encountered.
Of course, there are always unresolved questions no matter how carefully you craft your Declarations and Statements and Principles — which is why the OKF has wisely built a companion tool, the Is It Open Data? web service. This is a brilliant way to remove ambiguity once and for all, on a case by case basis, by making public enquiry into the openness or otherwise of specific data sets. You can browse previous enquiries, so as to avoid redundant questioning of data owners; and naturally, recipients of multiple enquiries can use the service in a different way, simply linking to the record of their first response by way of answer to subsequent queries. Searchability might be a concern once the database of enquiries starts to grow, but that functionality can be added as needed. A central public service for asking questions about data availability and archiving the answers could go a long way towards improving access to data, simply by making clear the level of demand for Openness, and the degree to which supply falls short.

Science Commons Symposium, Redmond WA

I am going to follow Antony’s lead here and shamelessly steal Cameron’s post to introduce the topic:

… sometimes someone puts together a programme that means you just have to shift the rest of the world around to make sure you can get there. Lisa Green and Hope Leman have put together the biggest concentration of speakers in the Open Science space that I think I have ever seen for the Science Commons Symposium — Pacific Northwest to be held on the Microsoft Campus in Redmond on 20 February. If you are in the Seattle area and have an interest in the future of science, whether pro- or anti- the “open” movement, or just want to hear some great talks you should be there. If you can’t be there then watch out for the video stream.

Along with [Cameron Neylon] you’ll get Jean-Claude Bradley, Antony Williams, Peter Murray-Rust, Heather Joseph, Stephen Friend, Peter Binfield, and John Wilbanks. Everything from policy to publication, software development to bench work, and from capturing the work of a single researcher to the challenges of placing several hundred million dollars’ worth of drug discovery data into the public domain. All with a focus on how we make more science available and generate more innovation. Not to be missed, in person or online…

I’m going to be there, but don’t let that put you off — I’ll be sitting quietly in the audience soaking up the amazing array of expertise on offer. You won’t even notice me, I promise.
If you have any interest at all in Open Science (and why on Earth would you be reading me, if you didn’t?), you should make every effort to attend this symposium. I’m a bit skeeved out by its being held on a Microsoft campus — actually, I’m a lot skeeved out, and if it were any other lineup I probably wouldn’t go for that reason alone. But this is simply too good to miss. Seriously, do yourself a favor and be there if you possibly can.

no’ bad for a cameraphone

From my flash new company phone:



The top image is from my commute to work in the morning, at Beaverton Transit Center where I missed my connection by (quite literally) seconds and had to wait half an hour for the next bus. The bottom image is from my commute home in the evening, at SW 9th and Yamhill where I had ten minutes to wait for my train. I took a neat little video too, soundtrack courtesy of a busker with a violin, but it’s too big for snapfish and my phone is locked in to Microsoft-related email services so I haven’t figured out how to upload it yet.

“Guerrilla OA” done right.

I was reminded recently (when Graham Steel uploaded this photo) of something I’ve been meaning to write about for nearly two years.
For those who don’t know him (which must surely exclude nearly everyone involved with Open Access!), Graham (blog, FF) is a patient advocate, which work has made him a staunch supporter of OA and all things Open. (Those of us who promote OA from an academic or research perspective sometimes, I think, forget about the incalculable value that OA offers other professionals and the lay public.)
Graham’s first foray into “guerrilla OA” (most emphatically not to be confused with these well-meaning idiots) was in September 2007, when he attended a conference and ran a one-man unofficial promotional campaign. Do read his own description, but the basic strategy was to be a human signpost (wearing “Research Made Public” and “I’m Open” t-shirts) and distribute OA promotional materials in such a way as to give most of the delegates at least a brief exposure to the concept.
(Pause here to marvel at the dedication of the man whose belief in the possibilities of OA makes him willing, entirely at his own instigation, to arrange attendance, travel and accommodation, collect up the necessary materials and then physically go and do all this.)
Sadly, we can’t yet clone Graham; but perhaps we can duplicate some of his efforts. I wonder how much it would cost to make “guerrilla OA” kits like the one Graham made for himself, but aimed at conference delegates so that researchers could turn into “Steel lite” activists at every conference we attend. Here are a few ideas:

  • t-shirts to start conversations
  • a badge instead of a t-shirt (“free your research, ask me how”) for those who prefer more formal attire
  • “OA in a nutshell” cards the size and shape of regular business cards, for handing out in conversation and leaving on appropriate tables
  • slides for your talk: start with Cameron’s “Presentation Rights” and end with a “Basics of OA” slide
  • equivalent add-ons for your poster, such as a Copyright Notice and an OA Basics placard, about the size of a postcard so they should fit on most posters as an afterthought and would be easy to incorporate into the poster itself

Here’s another idea: it would only take half a dozen delegates to run an “OA stall”, similar to the vendor stalls with which we are all only too familiar. This would mean working with conference administration, so maybe they would even help with “recruiting”; alternatively, it should be simple to set up a website where one can advertise for help in running such a stall at a particular conference. OA publishers could contribute materials (perhaps in return for help with costs), but I think transparent independence from any particular commercial effort would help tremendously in establishing credibility and producing a positive response. A prominent “who are we and why are we doing this?” banner might be a good idea. Flyers could include “OA: what’s in it for you?”, “Why the Impact Factor should be retired”, and “Elsevier: just another greedy bottom-feeder, or SPAWN OF SATAN????”. (OK, maybe not that last one… though a single page with this graph on it, or a reprint of this if I ever get around to publishing it, might be a good idea.)

coming up for air

Whew. It’s been a trip so far. My new job is at a company that is starting up after a hiatus — it’s not what is usually meant by a start-up, but from what I can tell the atmosphere is pretty similar. I’ll link to it, and maybe talk about some of my work, when I have a better sense of where the boundaries are. I don’t want to be continually pestering the admin to vet my blog posts! For now all I’ll say is that we make HIV diagnostic tools, and it’s good to be back in that fight. I might post HIV-related content from time to time, but I’ll add a disclaimer about my corporate connection.
I don’t have a lot of free time, but I do want to keep talking and thinking about Open Science… now that I’m in biotech, it’s harder to see how to do things openly, but that doesn’t mean I shouldn’t try.
For the moment, in lieu of any original content, here are two must-reads for anyone who reads me:
Walt Crawford has devoted an entire issue of Cites&Insights to library access to scholarship, and you should read it for a useful overview of the state of scholarly communication in general and not just because he says nice things about my efforts to put some numbers to the questions. (At the risk of being ungrateful, I will add that I could have done with fewer swipes at Stevan Harnad, but then I must in fairness further add that I am under-informed about the library community perspective on the original archivangelist. Ymmv.)
Cameron Neylon has been thinking about science and society again. Just do yourself a favour and read it, OK? Here’s a quote to whet your appetite:

We need at core a much more sophisticated conversation with the wider community about the benefits that research brings; to the economy, to health, to the environment, to education. And we need a much more rational conversation within the research community as to how those different forms of impact are and should be tensioned against each other. We need in short a complete overhaul if not a replacement of the post-war consensus on public funding of research. My fear is that without this the current funding squeeze will turn into a long term decline. And that without some serious self-examination the current self-indulgent bleating of the research community is unlikely to increase popular support for public research funding.