Peter Suber recently linked to a post on the LibraryLaw blog which asked why the Wayback Machine does not seem to archive National Science Foundation pages:
I was just looking on the National Science Foundation’s web site to try to find the Index of FOIA Frequently Requested Documents. The Index is mentioned in the NSF’s Public Information Handbook. When I couldn’t find the Index, I realized the Handbook was written in 1999, and perhaps an older version of the NSF website had a copy of the Index. So I went to the Internet Archive’s trusty Wayback Machine, and put in the NSF’s web address. Yesterday when I looked at the results page, there were no results, and the statement that the site had been blocked by robots.txt was the only information returned. Today, the Wayback Machine’s results page shows each instance when the site was archived, from 1997 to 2005, but when you click on a link, the resulting page is empty and has this message: “We’re sorry, access to http://www.nsf.gov/ has been blocked by the site owner via robots.txt.”
I thought this was weird, and wrote the NSF webmaster, who wrote back to say this:
NSF blocks all indexing of the site between 7AM and 7PM ET, our peak traffic hours, for the convenience of our users. However, there is no block on the site from 7PM to 7AM ET. This is standard policy for most high traffic sites. The owner of [the Wayback Machine] need only comply with our policy in order to index our pages.
So that made me wonder whether archive.org is aware that NSF has this policy, or whether there might be some other error somewhere. Searching the Wayback Machine for “www.nsf.gov” or “nsf.gov” produces a list of archived pages. Clicking on any of those links earlier today produced a file location error, but right now (some hours later) they work fine. The earliest available version of the relevant public information page says that the document Susan was looking for is “coming soon”, but I couldn’t find it even after going through about six versions of the public information page from 1997 to 2005. The Public Info Handbook actually says:
An index of FOIA Frequently Requested Records will be published, if applicable, on the Home Page under “Public Information – FOIA and Privacy Act Requests.” Where possible, this will include an electronic version of the actual records released.
(emphasis mine), so perhaps it was never added. Searching the current NSF site for “frequently requested” does not turn up the index in question, and neither does searching their publications for “FOIA”, but I did find a recent management plan (pdf) which includes “Review Agency posting of statements of policy, administrative staff manuals and copies of frequently requested records” in a list of areas “identified for review”. So perhaps it’s still “coming soon”, 9 years on. We are, after all, talking about a government agency.
Incidentally, the NSF’s robots.txt file is right where it should be:
# robots.txt for http://www.nsf.gov/
# Change history:
The Wayback Machine uses Alexa crawlers. As far as I can tell, the file as shown gives vspider (a commercial spiderbot) more limited access, while every other robot, Alexa’s included, can reach most of the site. The file doesn’t change (I checked before and after 7pm ET; same file), so NSF must be implementing their block some other way. F’rinstance, .htaccess can serve or block pages depending on the time of day.
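For anyone who wants to repeat the check, here is a rough Python sketch of both halves of that reasoning. Everything specific in it is an assumption for illustration: the rules in SAMPLE_ROBOTS are made up (the real NSF file is not reproduced here), the paths are hypothetical, ia_archiver is used as the name of Alexa's crawler, and blocked_at() only models the 7AM to 7PM window the webmaster described. I have no idea how NSF actually implements it.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real NSF robots.txt.
SAMPLE_ROBOTS = """\
User-agent: vspider
Disallow: /pubs/
Disallow: /cgi-bin/

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt lets the named crawler fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def blocked_at(eastern_time: time) -> bool:
    """Model the stated policy: indexing is blocked from 7AM up to 7PM ET.

    The caller is assumed to pass a time already expressed in ET.
    """
    return time(7, 0) <= eastern_time < time(19, 0)


if __name__ == "__main__":
    url = "http://www.nsf.gov/pubs/example.htm"  # hypothetical path
    for agent in ("ia_archiver", "vspider"):
        print(agent, "may fetch:", allowed(SAMPLE_ROBOTS, agent, url))
    print("blocked at noon ET:", blocked_at(time(12, 0)))
```

The point of keeping the two functions separate is that they answer different questions: the first is about what the robots.txt file says, the second about when the server honours requests at all. An unchanging robots.txt combined with a time-based block is exactly the situation described above.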
So, to sum up: NSF only restricts access during peak hours, and the Wayback Machine knows about this and archives the site just fine. The Index of FOIA Frequently Requested Documents that Susan was looking for, however, does not appear to be available. The person to ask would appear to be the FOIA Officer.