The News.com story "How to evade Google search" reports that, once again, a company (in this case Dell) has learned the hard way that whatever is put on a public web server is open to crawling, caching, and discovery.
Specifications for future Dell notebooks were accessible via Google’s search site before the content was pulled from a Dell file transfer protocol site and from Google’s cache.
It’s very likely, almost a given, that most of you know about keeping content from being crawled and/or cached using robots.txt or one of many other methods. If you don’t, or if you need a quick review, one of my favorite info compilations about robots.txt comes via SearchTools.com.
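For anyone who wants to see the robots.txt idea in action, here is a minimal Python sketch that uses the standard library's urllib.robotparser to check what a compliant crawler would be allowed to fetch. The rules, the /private/ directory, and the example.com URLs are made up for illustration. Keep in mind that robots.txt is only a request that well-behaved crawlers honor, not access control.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask every crawler to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative URLs only; nothing here is fetched over the network.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/private/specs.pdf"))  # False
    print(is_allowed(ROBOTS_TXT, "Googlebot", "http://example.com/index.html"))         # True
```

A check like this is a handy sanity test before publishing a new robots.txt, since a typo in a path quietly leaves the content open to crawling.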
It’s very possible that this article will reach many people who have little to no idea how crawlers operate or how to keep content out of Google.
The article would have been more useful if it stressed that this is a webmaster and web-wide issue and not a Google issue. Every webmaster who places content on publicly accessible servers should have a basic understanding of how web crawlers work and should know that many large engines (and even some verticals) cache content.
Google is the most widely used web engine, but a webmaster who focuses only on Google might not realize that a searcher who knows about cached content, and goes looking for it, will also know about many other web caches.
In other words, keeping content out of Google alone doesn’t mean it’s inaccessible elsewhere or gone from the web. SEOs know this to be true, but I often wonder about others.
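To make the point concrete, the sketch below compares a robots.txt that names only Googlebot with one that addresses every crawler. It uses Python's urllib.robotparser. The file contents, the example.com URL, and the short list of crawler names are assumptions chosen for illustration, not anything taken from Dell's servers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that shut out Google's crawler only.
GOOGLE_ONLY = """\
User-agent: Googlebot
Disallow: /
"""

# Hypothetical rules that ask every compliant crawler to stay away.
ALL_CRAWLERS = """\
User-agent: *
Disallow: /
"""

# A few crawler names for illustration (Google, Yahoo, MSN).
CRAWLERS = ["Googlebot", "Slurp", "msnbot"]
URL = "http://example.com/specs/notebook.html"


def blocked_crawlers(robots_txt, crawlers, url):
    """List the crawlers that these rules ask not to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, url)]


if __name__ == "__main__":
    print(blocked_crawlers(GOOGLE_ONLY, CRAWLERS, URL))   # ['Googlebot']
    print(blocked_crawlers(ALL_CRAWLERS, CRAWLERS, URL))  # ['Googlebot', 'Slurp', 'msnbot']
```

With the Google-only rules, the other two crawlers remain free to fetch and cache the page, which is exactly the gap a Google-focused webmaster can miss.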
Postscript: I noticed that this News.com article about what the Dell notebook specs contained does point out (at the very end) that the material was also cached by Yahoo.