Craigslist Delists Millions of Pages from Search Engine Indexes over at the Search Engine Roundtable Forums gives the impression that Craigslist has embarked upon a new policy of blocking search engine spiders, but talking with Craigslist along with some further poking at the situation shows that's not the case. A summary of the situation is below, and if you're a Search Engine Watch member, be sure to read the longer, more detailed version of this post.
Avi Wilensky, who posted at the forums, assumed some new change must be in place when he couldn't find a Craigslist real estate listing via a Google search that had brought up that same listing only a few days before. Checking the Craigslist robots.txt file, he noticed that the sections for community, housing, for sale, services, gigs and jobs listings seemed to be blocked.
At a quick glance, I could see why someone might assume that entire swaths of listings were being blocked. However, the listings themselves are not contained within these sections.
For example, here's the home page of the "blocked" housing area at Craigslist. The URL takes this form:
See the part in bold? Anything that begins with /hhh after the domain name is restricted by the Craigslist robots.txt file and not open to crawling by Google, Yahoo or others. So clearly all housing listings wouldn't be accessible! Wrong. That's because the listings within the housing section actually don't begin with the path of /hhh.
For example, here are the URLs for the first three listings shown on that housing area home page:
None of them begin with /hhh, as I've shown in bold, so all of them are fully open to being spidered.
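To make the prefix rule concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt text in it is an assumption reduced to the single /hhh rule discussed above, not a copy of the real Craigslist file, and the listing path /apa/123456789.html is a made-up placeholder rather than an actual listing URL.

```python
from urllib.robotparser import RobotFileParser

# Assumed, simplified robots.txt: only the /hhh rule discussed above.
# The real Craigslist file blocks other sections as well.
ROBOTS_TXT = """\
User-agent: *
Disallow: /hhh
"""


def is_crawlable(path, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if a crawler obeying robots_txt may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The housing section page starts with /hhh, so it is off limits.
section_page = "/hhh/"
# Hypothetical listing path that does not start with /hhh.
listing_page = "/apa/123456789.html"

print(section_page, is_crawlable(section_page))
print(listing_page, is_crawlable(listing_page))
```

A Disallow rule matches any path that starts with the given prefix, so the section page and anything beneath it is excluded, while a listing stored under a different path stays open to spiders.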
Why block those specific table-of-contents pages, plus any pages below those particular sections? Craigslist chief executive Jim Buckmaster told me via email:
The URLs in question are sectional header links, which from a crawler standpoint represent a duplicate pathway to our listings, one which I understand from our tech team is disproportionately load-intensive when hit by crawlers.
Am I off the mark, and have millions of pages of Craigslist listings really gone missing? Not from a few checks. At Google, site:craigslist.org shows nearly 12 million pages indexed from various Craigslist sites, such as sandiego.craigslist.org and charlotte.craigslist.org.
Here are 631 listings for rooms in the North Bay area of San Francisco, for example. Beyond the fact that anyone can run checks like these, Buckmaster himself wasn't aware of any reason that content should have gone missing from the major crawlers.