Over the past month, Google has stepped up the number of pages that it is spidering on a daily basis in an effort to increase the freshness of its database. The search engine has also changed the way it reacts if given a 403 "forbidden" error message when asking for a robots.txt file and has picked up iWon as a new customer for its search results.
Last December, Google announced that it was spidering 3 million pages each day where freshness had been determined to be crucial. Now the search engine has gone beyond those core pages, though it won't say exactly how far.
"We are significantly increasing the size of the daily crawl and have been doing [so] over the past month," said spokesperson Nate Tyler.
In another change, Google will now spider web sites even if they return a 403 or "forbidden access" message when it asks for the site's robots.txt file. This raises two questions. Why would anyone who creates a robots.txt file for spiders then make the file forbidden, and why wouldn't Google, if it couldn't get the file, simply act as if it didn't exist and spider the site anyway?
Reader David Hoegerman, who reported the Google-403 problem to me, says that forbidding access to a robots.txt file is probably done accidentally.
"The problem lies in how some system administrators configure web servers. Frequently, the default permissions for a .txt file are set to deny access from the web. An unsuspecting web developer then adds a robots.txt file and doesn't know that the permissions have to then be changed to allow access," Hoegerman said.
So Google (and other spiders) couldn't read the file. Google, being cautious, responded by not accessing the site at all.
"We've always tried to be as conservative as possible when we crawl, so until recently if we got a 403 error on a robots.txt, we would not crawl pages from that domain. In fact, the draft RFC for robots.txt has a recommendation, but not a requirement, that a 403 for a robots.txt implies that access to the entire site is forbidden," said Matt Cutts, a software engineer at Google who deals with webmaster issues.
Now the behavior is changing, because experience shows that Google's cautious approach was actually causing problems for webmasters.
"We decided to change our crawling policy for a few reasons. First, the user feedback we collected on 403s shows that most webmasters with 403s on robots.txt still expected to be crawled and were unhappy if we didn't crawl their pages. Most of these cases were configuration errors. We also saw at least one instance where the webmaster made no robots.txt, but their ISP automatically returned a 403 instead of a 404 [page not found error], which meant that we didn't crawl the site," Cutts said.
Finally, Google has gained another search partner, iWon. Google's crawler-based results now appear in the "Web Sites" section of iWon's results and replace those previously provided by Inktomi. "Sponsored Listings" at iWon still come from Overture. Initial word from Google is that the deal with iWon is only for Google's web search results and not also for paid listings, but I'll update when I get the final confirmation.
Google Gets Bigger, Fresher, Offers Better News
The Search Engine Report, Jan. 7, 2002
Article on Google's move earlier this year to increase freshness. Search Engine Watch members -- use the link at the top of this page to reach the members-only version.
Google: Removing A Page From Google
Tips from Google on how NOT to get listed, including advice on robots.txt files.
Coping With Listing Problems At Google
The Search Engine Update, July 15, 2002
This is a special article available to Search Engine Watch members. Many web site owners have a Googlecentric view of the search engine world -- it's the only search engine that they care about. Unfortunately, that's not all good for Google. Site owners fret about problems being listed at Google, and the lack of a guaranteed channel to address their concerns could lead to a PR problem similar to that which Yahoo once endured. A look at some common problems, remedies and what might come from Google for site owners.