Sunday, October 12, 2014

Internet Archive Captures of This Blog

At work I have some involvement with web archiving. Our program is selective, focusing on certain subjects, in contrast to the Internet Archive, which seems to try to take in as much of what is on the Internet as it can. (There doesn't appear to be a better way of defining the scope of the Internet Archive's efforts, but it is clear that they don't harvest everything. For one thing, they respect robots.txt, so if a site uses that to prevent indexing or crawling of the site, then IA won't harvest it.)
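For anyone curious how a robots.txt rule translates into "don't harvest this", here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt text, the example.com URL, and the helper name may_crawl are all made up for illustration. The user-agent name "ia_archiver" is my assumption about what the Internet Archive's crawler has historically answered to, and I can't say this is how IA itself performs the check.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: shut out one named crawler entirely,
# and keep every other crawler out of /private/ only.
EXAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical page address, not a real post.
EXAMPLE_URL = "http://example.com/2014/10/some-post.html"

if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        print(agent, may_crawl(EXAMPLE_ROBOTS, agent, EXAMPLE_URL))
```

With these made-up rules the named crawler is refused everywhere, while any other crawler may fetch the post but nothing under /private/.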

This blog has existed since July 2010 and now has over 500 posts. It isn't clear to me whether IA now attempts to harvest all of the blogs in Blogger or has some mechanism for choosing (such as size, frequency of posting, popularity, or something else). Whatever it is, IA has been archiving this blog a few times a year since January 27, 2012.

Calendar of captures (harvests) of my blog by the Internet Archive

Each year for which there are captures has a calendar of the months, with dates circled when the site was harvested. In 2012 a harvest was made of the blog as it was on January 27, 2012, then not again until September 22 (which resulted in the capture shown below). After that it was harvested more frequently, but not on what looks like a regular schedule.
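The calendar view amounts to grouping capture dates by year. Here is a rough sketch of that grouping, assuming captures are identified by 14-digit timestamps in the form YYYYMMDDhhmmss, as they appear in Wayback Machine addresses. The function name captures_by_year is hypothetical, and while the two dates are the 2012 captures mentioned above, the times of day are placeholders because I don't know the real ones.

```python
from collections import defaultdict
from datetime import datetime


def captures_by_year(timestamps):
    """Group YYYYMMDDhhmmss capture timestamps into {year: [sorted dates]}."""
    by_year = defaultdict(list)
    for ts in timestamps:
        captured = datetime.strptime(ts, "%Y%m%d%H%M%S")
        by_year[captured.year].append(captured.date())
    # Sort the years and the dates within each year, like the calendar does.
    return {year: sorted(days) for year, days in sorted(by_year.items())}


# Dates taken from this post; the 000000 time-of-day parts are placeholders.
SAMPLE_CAPTURES = ["20120922000000", "20120127000000"]

if __name__ == "__main__":
    for year, days in captures_by_year(SAMPLE_CAPTURES).items():
        print(year, [day.isoformat() for day in days])
```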

IA capture of this blog from September 22, 2012

Looking at the archived version of my site reveals that I haven't changed its formatting since 2012. In fact, the only obvious difference between now and then is that the ranking of "popular posts" has changed - in September 2012, a post about a Soviet time trial bike was the most read, but now it is a post about the book "Bicycling for Ladies". I think this is the result of some outside sites linking to the "Bicycling for Ladies" post. It seems clear that for this not-that-much-read blog, the "popular posts" remain at the top mostly by virtue of readers seeing them there and clicking on them.

If one looks at a (far) more famous bicycle blog, Bike Snob NYC, as captured by the Internet Archive, it is clear that IA has been capturing some Blogger blogs for a long time - the captures for Bike Snob go back to July 7, 2007, for a blog that had only started in June 2007! Perhaps it was the frequency of posting that caused this. In this case, the comparison of the then-blog and the today-blog is more revealing - Bike Snob had zero advertising in July 2007, and the subtitle for the blog was "Finally--a catty, gossipy, nasty, and critical blog for bicycles!" rather than the present "Systematically and mercilessly disassembling, flushing, greasing, and re-packing the cycling culture." (At some point during the next year Bike Snob changed the subtitle to what it is now, according to the versions in the Internet Archive.)

What is the significance of this? Particularly in terms of cycling? Probably none. Except that even pretty obscure stuff that may disappear from the Internet, including stuff about bicycles and cycling, may be stored away in the Internet Archive.

