(Sorry for the unwieldy title - if some mod wants to change this to be somewhat legible, go ahead!)
http://condi.topcities.com/whrobots/index.html
Why is whitehouse.gov (the official White House website) disallowing "Iraq" directories from search engine crawling?
As of Oct 24, 2003, the robots.txt file at whitehouse.gov (you can access the current version here or an archived version here) is 1,631 lines long. There are two blank lines between sections, one line (at the very top) that identifies the file, and 8 lines at the very bottom that are instructions to a user-agent called "whsearch", which appears to be the internal whitehouse.gov crawler. The bulk of the file is the section directed to all external search engine robots/crawlers/spiders, which is 1,620 lines long and has 1,620 "Disallow" statements.
There are 862 instances of the term "text" in the file. This is easily explained: whitehouse.gov generally uses directory paths that end in "text" for printable pages, which duplicate the normal display pages except that they are formatted for printing. Disallowing these directories helps lessen the "clutter" in search by excluding the essentially duplicate pages.
There are 783 instances of the term "iraq" in this file, almost all of them appended to paths that already exist in the file. These appear to have been added haphazardly, since the term appears in many path names for which no such terminal "iraq" directory exists, such as:
Disallow: /holiday/2002/barney/iraq
Disallow: /kids/eggroll/iraq
However, this robots.txt file does exclude external search engine robots from some 75 directories that actually exist on whitehouse.gov.<...>
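For anyone who wants to reproduce tallies like these, here is a minimal Python sketch of one way to count them. It is not necessarily how the numbers above were obtained. The function name `summarize_robots` and the `SAMPLE` text are made up for illustration: the two "iraq" lines are the ones quoted above, and the "text" line is a hypothetical stand-in, not a line copied from the real file.

```python
def summarize_robots(robots_txt, terms=("text", "iraq")):
    """Tally lines, Disallow statements, and per-term occurrences."""
    lines = robots_txt.splitlines()

    # Collect the path portion of every "Disallow:" line.
    disallow_paths = [
        line.split(":", 1)[1].strip()
        for line in lines
        if line.strip().lower().startswith("disallow:")
    ]

    lowered = robots_txt.lower()
    return {
        "lines": len(lines),
        "disallow": len(disallow_paths),
        # Raw substring count, anywhere in the file.
        "terms": {t: lowered.count(t) for t in terms},
        # Number of Disallow paths whose final segment is the term.
        "ending_in": {
            t: sum(p.rstrip("/").lower().endswith("/" + t) for p in disallow_paths)
            for t in terms
        },
    }


# Illustrative sample only; the last line is a hypothetical path.
SAMPLE = """\
User-agent: *
Disallow: /holiday/2002/barney/iraq
Disallow: /kids/eggroll/iraq
Disallow: /kids/eggroll/text
"""

summary = summarize_robots(SAMPLE)
```

Note that a raw substring count and a count of paths ending in the term can differ on the real file, since a term may also appear in the middle of a path.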
Google's cache of the whitehouse.gov robots.txt (retrieved from Google on 10/26/03, but actual caching date unspecified): I've archived the cache as it is at this writing here. This file is 1,579 lines long, with 754 instances of "iraq."
The most recent whitehouse.gov file archived at the Internet Archive is from April 16, 2003. This file is 780 lines long, with only 10 instances of the word "iraq."
Sometime between April 2003 and late October, 2003, hundreds of instances of the term "iraq" were added to the whitehouse.gov robots.txt file.
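If you have two saved copies of the file (say, the April archive and the October version), a rough way to see exactly which "iraq" entries were added is to diff the sets of Disallow paths. The sketch below assumes both files are already loaded as strings. The helper names `disallow_paths` and `added_with_term` and the two tiny snapshots are hypothetical, for illustration only.

```python
def disallow_paths(robots_txt):
    """Return the set of paths named in Disallow lines."""
    paths = set()
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "disallow":
            paths.add(value.strip())
    return paths


def added_with_term(old_txt, new_txt, term="iraq"):
    """Paths disallowed in new_txt but not in old_txt that contain term."""
    added = disallow_paths(new_txt) - disallow_paths(old_txt)
    return sorted(p for p in added if term in p.lower())


# Hypothetical miniature snapshots, not the real archived files.
OLD = """\
User-agent: *
Disallow: /kids/eggroll/text
"""

NEW = """\
User-agent: *
Disallow: /kids/eggroll/text
Disallow: /kids/eggroll/iraq
Disallow: /holiday/2002/barney/iraq
"""

new_iraq_paths = added_with_term(OLD, NEW)
```

This only compares paths as strings; it says nothing about whether the directories actually exist on the server.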