A thread at the Search Engine Roundtable Forums, Craigslist Delists Millions of Pages from Search Engine Indexes, gives the impression that Craigslist has embarked upon a new policy of blocking search engine spiders. But talking with Craigslist, along with some further poking at the situation, shows that's not the case.
Avi Wilensky, who posted at the forums, assumed some new change must be in place when he couldn't find a real estate listing from Craigslist via a Google search that had brought up that listing only a few days before. Checking the Craigslist robots.txt file, he noticed that sections with listings about community, housing, for sale, services, gigs and jobs items seemed to be blocked.
Specifically, the current robots.txt file for the main Craigslist site has this section at the end:
Below that section was this key to the codes:
ccc = community
hhh = housing
sss = for sale
bbb = services
ggg = gigs
jjj = jobs
At a quick glance, you could see why someone might assume that entire swaths of listings were being blocked. However, the listings themselves are not contained within these sections.
For example, here's the home page of the housing area at Craigslist. The URL takes this form:
See the part in bold? Anything that begins with /hhh after the domain name is restricted by robots.txt and not open to crawling by Google, Yahoo or others. So clearly all housing listings wouldn't be accessible! Wrong. That's because the listings within the housing section actually don't begin with the path of /hhh.
For example, here are the URLs for the first three listings shown on that housing area home page:
None of them begin with /hhh, as I've shown in bold, so all of them are fully open to being spidered.
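The prefix behavior described above can be checked with a short sketch using Python's standard urllib.robotparser module. The robots.txt text here is an assumption built from the six section codes in the key, not a copy of the actual Craigslist file, and the two paths and the ExampleBot agent name are hypothetical placeholders rather than real Craigslist URLs or crawlers.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the six section codes come from the key in the article,
# but the exact directives are a guess, not a copy of Craigslist's file.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /ccc
Disallow: /hhh
Disallow: /sss
Disallow: /bbb
Disallow: /ggg
Disallow: /jjj
"""

# Hypothetical paths, for illustration only.
SECTION_PAGE = "/hhh/"                 # a sectional header page
LISTING_PAGE = "/abc/123456789.html"   # a listing whose path does not start with /hhh


def is_crawlable(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the assumed robots.txt lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ASSUMED_ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "http://www.craigslist.org" + path)


if __name__ == "__main__":
    for path in (SECTION_PAGE, LISTING_PAGE):
        print(path, "crawlable" if is_crawlable(path) else "blocked")
```

Under these assumptions, only paths that begin with one of the listed prefixes are reported as blocked. A listing that lives under a different path stays open to crawling, even though the section page linking to it does not.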
Why block those specific table of contents pages plus any pages below those particular sections? Craigslist chief executive Jim Buckmaster told me via email:
The URLs in question are sectional header links, which from a crawler standpoint represent a duplicate pathway to our listings, one which I understand from our tech team is disproportionately load-intensive when hit by crawlers.
When was the change made? Publicly, it's difficult to tell. It wasn't in place as of April 1, 2005, the most recent archive available through the Internet Archive. Checking the Google cached copy, I can see that the sectional header links were barred as of Dec. 24, 2005 -- though the explanation text wasn't in place then. That explanation key was in place when I looked yesterday. Today, the key has gone again.
Buckmaster said the robots.txt file at Craigslist was last updated in October 2004, so he assumes that's when the lines were added. The Internet Archive tells a different story, so I'm guessing he meant October 2005 and will follow up. I'll add a postscript if it turns out to be October 2004 after all.
Am I off the mark, and have millions of pages with Craigslist listings now gone? Not from a few checks. At Google, site:craigslist.org shows nearly 12 million pages are indexed from various Craigslist sites, such as sandiego.craigslist.org and charlotte.craigslist.org.
Here are 631 listings for rooms in the North Bay area of San Francisco, for example. Aside from anyone being able to check on this, Buckmaster himself wasn't aware of any reason that content should have gone missing from the major crawlers.
After I followed up with Craigslist, Wilensky noted to me in email that Craigslist apparently adds 6 million listings per month. If so, then Google should easily have much more than 12 million listings. Good point, unless Craigslist is removing old listings. Again, I'll follow up with Craigslist on that, but this seems to be the case.
For example, here is the oldest page of listings I can find for apartments for rent in the Washington DC area. Try to visit the last listing. That brings up a 404 Not Found page. Other listings get removed by authors, such as this one.
Also keep in mind that each individual Craigslist site technically will have its own robots.txt file. It does appear that the same file contents are used for each individual site. However, it could be that one or more sites have greater restrictions on them.
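One rough way to check whether every site really serves the same file is to group hosts by the contents of their robots.txt. The sketch below assumes the file bodies have already been downloaded by some other means. The function names are made up for this illustration, and the sample bodies in the usage example are invented, not real Craigslist data.

```python
import hashlib
from collections import defaultdict


def robots_url(host: str) -> str:
    """Build the robots.txt URL for one host, e.g. a city subdomain."""
    return f"http://{host}/robots.txt"


def group_by_contents(files: dict[str, str]) -> list[list[str]]:
    """Group hosts whose robots.txt bodies match after trimming whitespace.

    `files` maps a host name to the text of its robots.txt, fetched elsewhere.
    More than one group in the result means at least one host differs.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for host, body in files.items():
        # Normalize line endings and surrounding whitespace before hashing.
        normalized = "\n".join(line.strip() for line in body.strip().splitlines())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(host)
    return sorted(sorted(hosts) for hosts in groups.values())


if __name__ == "__main__":
    # Invented sample bodies, for illustration only.
    sample = {
        "sandiego.craigslist.org": "User-agent: *\nDisallow: /hhh\n",
        "charlotte.craigslist.org": "User-agent: *\r\nDisallow: /hhh\r\n",
        "www.craigslist.org": "User-agent: *\nDisallow: /hhh\nDisallow: /jjj\n",
    }
    for hosts in group_by_contents(sample):
        print(hosts)
```

A single group would support the impression that one file is shared across sites. Any extra group points to a host worth inspecting by hand.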
craigslist permits you to display on your website, or create a hyperlink on your website to, individual postings on the Service so long as such use is for noncommercial and/or news reporting purposes only (e.g., for use in personal web blogs or personal online media). If the total number of such postings displayed or linked to on your website exceeds one hundred (100) postings, your use will be presumed to be in violation of these Terms, absent express permission granted by craigslist to do so.
But what about major search engines like Google and Yahoo? Clearly they carry hyperlinks to far more than 100 listings at Craigslist, which they obtain through crawling. Comments on this sidebar blog post to the San Jose Mercury News article on the Craigslist-Oodle situation have raised that issue.
Buckmaster said that the terms make an exception for general purpose search engines, and indeed, you'll find that there:
This license does not include any collection, aggregation, copying, duplication, display or derivative use of the Service nor any use of data mining, robots, spiders, or similar data gathering and extraction tools for any purpose unless expressly permitted by craigslist. A limited exception is provided to general purpose internet search engines and non-commercial public archives that use such tools to gather information for the sole purpose of displaying hyperlinks to the Service, provided they each do so from a stable IP address or range of IP addresses using an easily identifiable agent and comply with our robots.txt file. "General purpose internet search engine" does not include a website or search engine or other service that specializes in classified listings or in any subset of classifieds listings such as jobs, housing, for sale, services, or personals, or which is in the business of providing classified ad listing services.
Those terms have been specifically modified to bar classified ad listings search engines. Checking the most recent archived version of the terms from the Internet Archive, from April 1, 2005, shows what they used to say:
Additionally, you agree not to:....use automated means, including spiders, robots, crawlers, data mining tools, or the like to download data from the Service - exception is made for internet search engines (e.g. Google) and non-commercial public archives (e.g. archive.org) that comply with our robots.txt file
No mention of classified listings comes up in that, nor is the other section about not linking to more than 100 postings mentioned.
So overall, Craigslist doesn't seem to be blocking the major search engines. The terms definitely reflect what we've already seen, a desire to block specialized classified ad search engines. Certainly adding that type of block to the robots.txt file in addition to the terms would be helpful, because for spiders, it's the robots.txt file that serves as the first stop for what is or isn't allowed.
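As a sketch of what such a block might look like, a robots.txt file can carry a separate group of rules for a named crawler. The agent name ExampleClassifiedsBot below is invented for illustration, as are the rules and paths, and nothing here is taken from the actual Craigslist file. Python's standard urllib.robotparser module shows how a compliant spider would read it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named classifieds crawler is barred entirely,
# while every other crawler is only kept out of a sectional header path.
SKETCH_ROBOTS_TXT = """\
User-agent: ExampleClassifiedsBot
Disallow: /

User-agent: *
Disallow: /hhh
"""


def allowed(agent: str, path: str) -> bool:
    """Return True if the sketch robots.txt lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(SKETCH_ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "http://www.craigslist.org" + path)


if __name__ == "__main__":
    listing = "/abc/123456789.html"  # hypothetical listing path
    for agent in ("Googlebot", "ExampleClassifiedsBot"):
        print(agent, "may fetch" if allowed(agent, listing) else "may not fetch", listing)
```

Keep in mind that robots.txt is a voluntary convention. It only stops spiders that choose to honor it, so a block like this would complement the terms rather than replace them.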