UPDATE: Editors’ Note: At the request of Google, we’ve removed the photo of Google engineer Jayant Madhavan, co-author (with Alon Halevy) of the Google Webmaster Central blog post, Crawling through HTML forms, posted by Maile Ohye, Senior Support Engineer at Google. The photo was deleted at Google’s request to respect the privacy of Google’s corporate data and the personal privacy of Jayant Madhavan.
— Kevin Heisler, Executive Editor, Search Engine Watch
A few hours ago, Google announced to the world that the company has been crawling forms on “high-quality” Web sites to index “Invisible Web” content in the Google.com search engine.
Google’s intention, as always, is to improve the quality of search results for users of its search engine.
Crawling Web site forms, though, constitutes a sea change in terms of data privacy; specifically, the privacy of corporate data.
“In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google,” wrote Jayant Madhavan and Alon Halevy of Google’s Crawling and Indexing Team on an official Google blog.
Here’s how Googlebot does it, according to Google engineers:
“We might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.”
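For readers who want to picture the process the engineers describe, here is a minimal Python sketch of the URL-generation step. It is not Google's code. The form address, the field names, the sample values and the `candidate_urls` helper are all hypothetical, and the sketch assumes a form that submits its values in the URL query string (a GET form).

```python
from itertools import islice, product
from urllib.parse import urlencode


def candidate_urls(action, fields, limit=5):
    """Yield up to `limit` URLs, one per combination of chosen field values.

    `action` is the form's target URL; `fields` maps each input name to the
    list of values picked for it. Both are illustrative assumptions.
    """
    names = list(fields)
    # Every combination of one value per input, in a fixed order.
    combos = product(*(fields[name] for name in names))
    # "A small number of queries": stop after `limit` combinations.
    for combo in islice(combos, limit):
        yield action + "?" + urlencode(dict(zip(names, combo)))


# Hypothetical form on a hypothetical site.
form_fields = {
    "q": ["widgets", "gadgets"],     # text box: words taken from the site itself
    "category": ["tools", "toys"],   # select menu: option values from the HTML
}

urls = list(candidate_urls("https://example.com/search", form_fields, limit=3))
for url in urls:
    print(url)
```

Each printed URL stands for one query a visitor could have made by hand. Per the quoted description, the crawler would then fetch such URLs and decide whether the resulting pages are worth indexing, a step this sketch leaves out.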
Last year, as the search marketing analyst for JupiterResearch, I said that the biggest issue in 2007 would be the threat to the privacy of corporate data.
I was wrong. 2008 is the year corporate IT departments worldwide will be forced to spend time, money and resources to ensure that search engine spiders do not inadvertently index data a company would prefer to keep private.
The same holds true for non-profit organizations and other institutions.
I have full confidence that Google practices “good Internet citizenship.”
I’m confident Google has paved the road to relevance with good intentions.
This is not simply a “pioneering move” by Google.
I’m sorry, Sergey, Larry, Eric. I can’t in good conscience defend Google’s decision to our readers. The costs to CEOs, CIOs and CTOs at corporations far outweigh the benefits to consumers.
Do not make the robotic querying of Web site forms the default spidering practice for Google. As a search engine, Google has become the gateway to the Internet, and with great power comes great responsibility.
End this experiment now.
Stop this experiment before the backlash against Google develops. It’s not a subject you want to address when Wall St. analysts quiz you on the company’s performance on April 17th during the First Quarter earnings conference call.