Help LookSmart Crawl the Web

LookSmart is taking a new approach to discovering web content, offering a free downloadable screensaver program that also crawls the web when your computer is idle.

The program is Grub, the distributed crawling service that LookSmart bought in January for $1.4 million.

Most crawlers are centralized, run from each search engine’s data centers. Grub, on the other hand, runs from the computers of anyone who has downloaded and installed the Grub client. LookSmart plans to use the information gathered by Grub crawlers to supplement the centralized crawls run by its Wisenut search engine.

“Fundamentally, the first problem we’re trying to solve with our acquisition of Grub is that we know about many more documents than we can actually retrieve and analyze right now,” said Peter Adams, chief technology officer of LookSmart. “We know about over 10 billion URLs right now, and we see that trend growing in terms of web pages that are being added.”

Most search engines crawl many more documents than they actually index. Even after culling duplicate pages, spam, and other inappropriate content, search engines have a hard time keeping pace with the constantly changing web.

This causes problems with the freshness of search engine indexes. While all of the major search engines update at least a portion of their indexes on a daily basis, most settle for anywhere from two weeks to a couple of months to completely refresh their databases.

Crawling more frequently, while technically possible, has its downsides, including greater costs and greater bandwidth consumption. Grub’s distributed approach to crawling can alleviate some of these downsides, according to Adams.

“Our first objective is to build a community of distributed web crawlers that will allow us to crawl all of the web documents every day,” Adams said. “Not necessarily to index them all, but to assemble a database of information about them — what’s new, what’s dead, what’s changed.”

The Grub crawler visits a list of essentially random URLs sent down from a central server. It retrieves and analyzes each page, creating a “fingerprint” of the document: a short, unique code that summarizes its content. Each time a page is crawled, Grub compares the new fingerprint to the old one. If they differ, the page has changed.
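The fingerprint-and-compare step can be sketched with a simple content hash. The article doesn't specify Grub's actual fingerprint algorithm; SHA-256 here is an illustrative stand-in:

```python
import hashlib

def fingerprint(page_bytes: bytes) -> str:
    # A fingerprint is a fixed-size digest that uniquely describes the content.
    return hashlib.sha256(page_bytes).hexdigest()

def has_changed(page_bytes: bytes, old_fingerprint: str) -> bool:
    # Re-crawl the page, recompute the digest, and compare to the stored value.
    return fingerprint(page_bytes) != old_fingerprint

old = fingerprint(b"<html>original page</html>")
print(has_changed(b"<html>original page</html>", old))  # False: unchanged
print(has_changed(b"<html>edited page</html>", old))    # True: page changed
```

Because only the short digest needs to be kept between crawls, a client can detect changes without storing, or re-transmitting, the full page.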

“Instead of crawling and sending everything back, we only have the crawlers send back changed information,” said Adams. This intermediate analysis is something centralized crawlers can’t do: they must retrieve each full page and store it in the search engine’s database before any analysis can be performed.

LookSmart believes that this distributed approach to crawling will be vital to coping with the growth of the Internet, and assuring that search engines continue to produce relevant results.

“If you look back over the past ten years of search engines, until about five years ago what you were really seeing is a few large servers working on a smallish index,” said Andre Stechert, Grub’s director of technology.

“A little while ago, there was something called cluster computing that came along, and Google essentially capitalized on this in a big bad way. They took existing information retrieval algorithms and put them on this cheap computing model, which fundamentally changed search,” said Stechert.

Whereas Google uses clusters of thousands of computers, Stechert envisions yet another leap forward in search engine technology. Distributed “grid” programs like Grub will be hosted not on thousands of computers but millions.

“Google asked the question, ‘What happens when you have 10,000 computers?’ We’re asking, ‘What happens when you have a million?’” said Stechert. “This is going to yield another revolution in the quality of search results.”

The Grub client is easy to download and install. You have full control over its behavior — when it runs, how much bandwidth it consumes, and so on. In my tests, it crawled dozens of URLs in minutes over my cable modem connection without interfering with any of the other applications running on my computer.

It’s fascinating to watch the crawling process. The standard Grub interface shows you two graphs, displaying your bandwidth “history” and the number of URLs crawled per minute. Other statistics display information about the current crawl — pages that have changed, remain unchanged, are unreachable, and so on.

The screensaver is a visualization that graphically displays the crawling process. You can also switch to a view that scrolls the list of URLs as they’re being crawled.

You have no control over what’s crawled, with one exception that I’ll discuss below. Nonetheless, it’s fascinating to see the display of URLs from all over the world, most of them unfamiliar. It reminds me of the early days of the web, when random web page generators were popular.

If you own or operate a web site, Grub will allow you to run a “local” crawl of your site every night. This is a great way to ensure that all of the content on your site gets crawled. For large sites, it will also cut down on bandwidth consumption, since Grub compresses all of the data it sends back to its servers, at ratios of up to 20:1.
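The bandwidth saving from compression is easy to see with a quick sketch. The article gives only the up-to-20:1 ratio, not the codec, so gzip here is an assumption, and the report format is invented for illustration:

```python
import gzip

# Hypothetical crawl report: status lines like these are highly repetitive,
# so they compress extremely well.
report = ("http://www.example.com/page-1 unchanged\n" * 200).encode()
compressed = gzip.compress(report)
print(f"{len(report)} bytes -> {len(compressed)} bytes "
      f"(about {len(report) / len(compressed):.0f}:1)")
```

Real crawl data is less uniform than this toy report, which is why the article quotes “up to” 20:1 rather than a guaranteed ratio.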

There are a few simple steps you must take to enable local crawling. Log in using your Grub user name and password, and click the “local crawling” link in your Navigation box in the upper right of the screen.

Grub then generates a unique user key, and includes it in a file called “grub.txt.” Next, download this file, and install it in the topmost directory of any web site you control. This will work even if your content is on a shared server — just put the file in the same directory where your main or home page resides.
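Server-side, the authorization check presumably boils down to confirming that the grub.txt fetched from the claimed site contains the owner's key. This is a hypothetical sketch; the real file format and verification protocol aren't documented in the article:

```python
def site_authorized(grub_txt_contents: str, user_key: str) -> bool:
    # Authorize local crawling only if the grub.txt retrieved from the
    # site's top-level directory contains the owner's unique user key.
    # (Hypothetical sketch; field name "grub-key" is invented here.)
    return user_key in grub_txt_contents

print(site_authorized("grub-key: a1b2c3d4", "a1b2c3d4"))  # True
print(site_authorized("grub-key: deadbeef", "a1b2c3d4"))  # False
```

The key only proves ownership when it is served from the site itself, which is what makes the scheme resistant to copying, as the article notes below.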

Once you’ve installed the grub.txt file, use the form on the local crawling page to enter the top-level URL for your site. It’s important to “normalize” the URL you enter by appending a trailing slash.
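Normalizing just means making sure the URL ends with a slash. A minimal sketch (example.com is a placeholder, not a URL from the article):

```python
def normalize_site_url(url: str) -> str:
    # Append a trailing slash if the URL doesn't already end with one.
    return url if url.endswith("/") else url + "/"

print(normalize_site_url("http://www.example.com"))   # http://www.example.com/
print(normalize_site_url("http://www.example.com/"))  # http://www.example.com/
```

Treating both spellings as the same URL avoids registering the root page and the site directory as two different crawl targets.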

That’s all there is to it. Grub will verify your user key, and allow your client — and only your client — to crawl your web site. No need to worry about someone poaching your grub.txt file and placing it on their site. It won’t work, since the file has to be on the site you’ve specified, and your site will only be crawled by your client.

And if this sounds like an ideal way to spam LookSmart, or use techniques like doorway pages or cloaking, think again. Although your site won’t be crawled for indexing by other Grub clients, some of your pages will be crawled by other clients for comparison purposes.

LookSmart says that if it finds “serious” discrepancies it will disable your Grub client. As with all search engines, egregious spam will also likely get kicked out of any indexes maintained by LookSmart.

Why should you help LookSmart index the web? The altruistic reason is that it will help them broaden their coverage of the web, and potentially improve the relevance of search results. If Grub catches on, it’s likely to spur similar efforts by other search engines.

Grub also keeps stats for each user. You can see how much your client has crawled, and compare your “ranking” with other Grub users.

But the best reason, at least to me, is that watching a crawler in action is fascinating. It allows you to directly observe a process that’s normally hidden away in the black boxes we call search engines. Bottom line: it’s a heck of a lot of fun.


Grub Frequently Asked Questions

Patent Wars!
The Search Engine Report, Oct. 6, 1997

LookSmart isn’t the first search engine to use distributed computing. All the way back in 1997, Infoseek (whose technology is now owned by Disney) was issued a patent for “distributed searching,” a sort of federated meta-search process. Google is also experimenting with distributed computing with its Google Compute project.

Google’s New High Protein Diet
SearchDay, Mar. 25, 2002
Google is harnessing the collective computing power of its users to help model complex proteins, a project that could lead to the development of cures for Alzheimer’s, cancer, AIDS and other diseases.
