This1 is an attempt to define and explain the semantic web for a lay audience, though it should be remembered that I am a member of that audience myself…
It is a commonplace that we are drowning in information, and nowhere is this “information overload” more apparent than in scientific research. The National Library of Medicine’s literature database, PubMed, is searched more than 60 million times a month and contains almost 19 million records from more than 5300 journals — still only a fraction of the approximately 15,000 active, refereed, scientific journals listed in Ulrich’s Periodicals Directory2. GenBank, the world’s foremost repository of nucleic acid sequence information, contains roughly 100 billion bases in 100 million sequence records, and is growing at an exponentially increasing rate that is currently in excess of 50,000 records per day. Unlike PubMed and GenBank, which are cross-disciplinary databases, the Nucleic Acids Research Molecular Biology Database Collection is a carefully curated list of high-value specialist resources; it currently lists 1170 distinct, largely non-overlapping databases. I could go on, but you get the point3.
As things stand, researchers talk to researchers and use computers to facilitate that conversation; what we need is for computers to be able to talk to computers. To cope with (literally) inhuman volumes of data, we need that data to start making sense to machines, so that they can do something no human brain can do: process all of it. We need to make it possible for machines to transfer richly interconnected data among themselves, mix and remix it, generate new connections, filter it, process it, transform it, and output the results to formats and interfaces that make sense to human brains — substrates on which we can carry out the sorts of synthetic, creative thinking that computers cannot do.
We need a man-machine partnership in which both partners can do what they do best, and that means we need the semantic web.
The semantic web is the outcome of processes and frameworks with which computers can manipulate data in a way that makes it accessible to human brains. It is built on the standards and metadata — information about data — that are required for automated data exchange and processing, which in turn is required to create machine-generated, human-scale summaries, skeletons, outlines and other representations of, and interfaces with, the entire knowledge corpus.
Here’s an example. Human brains have no trouble processing the following data:
Another reason for opening access to research. Wilbanks J. BMJ. 333:1306-8 (2006).
To you, that’s a reference; but to a computer, it’s just a string of text. What a computer needs is information (metadata) about each substring:
Title: Another reason for opening access to research.
Author: Wilbanks, J
Journal: British Medical Journal
Now the computer “knows” which letters identify John, which constitute the title of the article, and so on. If you set the standards up properly, it even “knows” that Wilbanks is the surname and J the first initial, and so on into ever finer-grained properties.
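To make that concrete, here’s a toy sketch of the same reference as labelled data — the field names are illustrative, not any real bibliographic standard:

```python
# The Wilbanks reference, with each substring labelled.
# (Field names are made up for illustration, not a real standard.)
reference = {
    "title": "Another reason for opening access to research.",
    "authors": [{"surname": "Wilbanks", "initials": "J"}],
    "journal": "British Medical Journal",
    "year": 2006,
    "volume": "333",
    "pages": "1306-8",
}

# Because the fields are labelled, a program can pull out exactly
# the piece it needs -- here, the first author's surname:
print(reference["authors"][0]["surname"])  # Wilbanks
```

The point is simply that once the labels exist, a machine no longer has to guess which characters are the author and which are the title.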
Now imagine you had, oh, say, about 19 million such records. A human brain cannot do anything useful with such a database, but a computer can — which is exactly why we can ask PubMed human-scale questions like “how many papers did J Wilbanks publish between 2000 and 2009?”, or “show me all the papers with ‘access to research’ in the title”.
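Those human-scale questions reduce to simple filters over labelled records. Here’s a toy illustration — this is not PubMed’s actual machinery, and the three records are invented for the example:

```python
# A tiny stand-in for a literature database: each record is labelled data.
# (Records here are invented; only the Wilbanks BMJ paper is real.)
records = [
    {"title": "Another reason for opening access to research.",
     "author": "Wilbanks J", "year": 2006},
    {"title": "Some unrelated paper.",
     "author": "Smith A", "year": 1999},
    {"title": "Widening access to research data.",
     "author": "Wilbanks J", "year": 2008},
]

# "How many papers did J Wilbanks publish between 2000 and 2009?"
by_wilbanks = [r for r in records
               if r["author"] == "Wilbanks J" and 2000 <= r["year"] <= 2009]
print(len(by_wilbanks))  # 2

# "Show me all the papers with 'access to research' in the title."
matches = [r["title"] for r in records
           if "access to research" in r["title"].lower()]
print(matches)
```

Swap the three toy records for 19 million real ones and the same two filters still work — that is the whole trick: questions a human can phrase, run at a scale no human can read.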
Now multiply that — the ability to ask human-scale questions of a mass of information far too large for any human brain to absorb or process — by thousands of different types of information (text, gene sequences, chemical formulae, microarray results, etc etc), millions of individual records within each data type, recorded in thousands of journals and databases, produced by hundreds of thousands of laboratories, libraries and garage hackers. Imagine what we could learn if we could query all of that information on a human scale.
There: that’s a glimpse of the potential power of the semantic web.
1 This entry started life as an early draft of a letter in support of John Wilbanks’ application for a TED fellowship. We didn’t get enough signatures in time, so it was never even sent. My apologies to those people who did sign on; if John re-applies I’ll try again, with better planning!
2 tickboxes = active, refereed, scholarly/academic; search = LC Classification Number for [Q* OR R* OR S* OR T* OR U* OR V*]
3 In fact, I’m always on the lookout for more good examples of the “data deluge” and the rapid progress of science and tech; post ’em (in comments) if you got ’em.