LookSmart is taking a new approach to discovering web content, offering a free downloadable screensaver program that also crawls the web when your computer is idle.
The program is Grub, the distributed crawling service that LookSmart bought in January for $1.4 million.
Most crawlers are centralized, run from each search engine’s data centers. Grub, on the other hand, runs from the computers of anyone who has downloaded and installed the Grub client. LookSmart plans to use the information gathered by Grub crawlers to supplement the centralized crawls run by its Wisenut search engine.
“Fundamentally, the first problem we’re trying to solve with our acquisition of Grub is that we know about many more documents than we can actually retrieve and analyze right now,” said Peter Adams, chief technology officer of LookSmart. “We know about over 10 billion URLs right now, and we see that trend growing in terms of web pages that are being added.”
Most search engines crawl many more documents than they actually index. Even after culling duplicate pages, spam, and otherwise inappropriate content, search engines have a hard time keeping pace with the constantly changing nature of the web.
This causes problems with the freshness of search engine indexes. While all of the major search engines update at least a portion of their indexes on a daily basis, most take anywhere from two weeks to a couple of months to completely refresh their databases.
Crawling more frequently, while technically possible, has its downsides, including greater costs and greater bandwidth consumption. Grub’s distributed approach to crawling can alleviate some of these downsides, according to Adams.
“Our first objective is to build a community of distributed web crawlers that will allow us to crawl all of the web documents every day,” Adams said. “Not necessarily to index them all, but to assemble a database of information about them — what’s new, what’s dead, what’s changed.”
The Grub crawler visits a list of essentially random URLs sent down from a central server. It retrieves pages and analyzes them, creating a “fingerprint” of a document, a unique sort of code that describes the document. Each time a page is crawled, Grub compares the new code to the old code. If it’s different, that signals there’s been a change to the page.
“Instead of crawling and sending everything back, we only have the crawlers send back changed information,” said Adams. This intermediate analysis of a page is impossible for centralized crawlers to perform, since they must retrieve a page and store it in the search engine’s database before any analysis can take place.
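LookSmart hasn’t published the details of Grub’s fingerprinting scheme, but the general technique is easy to illustrate. Here’s a minimal sketch in Python, assuming a simple content hash serves as the fingerprint; the function names and report format are my own invention, not Grub’s:

import hashlib
import urllib.request

# Fingerprints from previous crawls, keyed by URL. A real client
# would persist these between runs.
known_fingerprints: dict[str, str] = {}

def fingerprint(content: bytes) -> str:
    # Reduce a page to a short, unique code; a cryptographic hash
    # is the simplest way to build such a fingerprint.
    return hashlib.sha256(content).hexdigest()

def crawl(url: str) -> dict | None:
    # Fetch a URL and produce a report only if something changed.
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            content = resp.read()
    except OSError:
        return {"url": url, "status": "unreachable"}

    new_fp = fingerprint(content)
    old_fp = known_fingerprints.get(url)
    known_fingerprints[url] = new_fp

    if old_fp == new_fp:
        return None  # unchanged: nothing to send back
    status = "new" if old_fp is None else "changed"
    return {"url": url, "status": status, "fingerprint": new_fp}

Only the non-empty reports need to travel back to the central server, which is what makes the approach so frugal with bandwidth.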
LookSmart believes that this distributed approach to crawling will be vital to coping with the growth of the Internet, and to ensuring that search engines continue to produce relevant results.
“If you look back over the past ten years of search engines, beyond five years ago what you’re really seeing is a few large servers working on a smallish index,” said Andre Stechert, Grub’s director of technology.
“A little while ago, there was something called cluster computing that came along, and Google essentially capitalized on this in a big bad way. They took existing information retrieval algorithms and put them on this cheap computing model, which fundamentally changed search,” said Stechert.
Google uses clusters of thousands of computers; Stechert envisions yet another leap forward in search engine technology, with distributed “grid” programs like Grub hosted not on thousands of computers but on millions.
“Google asked the question, ‘What happens when you have 10,000 computers?’ We’re asking, ‘What happens when you have a million?’” said Stechert. “This is going to yield another revolution in the quality of search results.”
The Grub client is easy to download and install. You have full control over its behavior — when it runs, how much bandwidth it consumes, and so on. In my tests, it crawled dozens of URLs in minutes over my cable modem connection without interfering with any of the other applications running on my computer.
It’s fascinating to watch the crawling process. The standard Grub interface shows you two graphs, displaying your bandwidth “history” and the number of URLs crawled per minute. Other statistics display information about the current crawl — pages that have changed, remain unchanged, are unreachable, and so on.
The screensaver is a visualization that graphically displays the crawling process. You can also switch to a view that scrolls the list of URLs as they’re being crawled.
You have no control over what’s crawled, with one exception that I’ll discuss below. Nonetheless, it’s fascinating to see the display of URLs from all over the world — most of them unfamiliar. It reminds me of the early days of the web, when random web page generators were popular.
If you own or operate your own web site, Grub will allow you to run a “local” crawl of your site every night. This is a great way to ensure that all of the content on your site gets crawled. For large sites, it also cuts down on bandwidth consumption, since Grub compresses all the data it sends back to its servers by a factor of up to 20:1.
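Grub’s exact compression scheme isn’t documented, but the effect is easy to approximate with Python’s standard zlib module. The sample data below is invented for illustration; repetitive, text-heavy material like URL lists is exactly what compresses well, and real-world ratios depend entirely on the content:

import zlib

# A hypothetical batch of crawl reports to send back to the server.
reports = "\n".join(
    f"http://example.com/page{i} changed fingerprint=1a2b3c" for i in range(1000)
).encode()

compressed = zlib.compress(reports, 9)  # 9 = maximum compression
ratio = len(reports) / len(compressed)
print(f"{len(reports)} bytes -> {len(compressed)} bytes ({ratio:.0f}:1)")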
There are a few simple steps you must take to enable local crawling. Log in using your Grub user name and password, and click the “local crawling” link in your Navigation box in the upper right of the screen.
Grub then generates a unique user key, and includes it in a file called “grub.txt.” Next, download this file, and install it in the topmost directory of any web site you control. This will work even if your content is on a shared server — just put the file in the same directory where your main or home page resides.
Once you’ve installed the grub.txt file, use the form on the local crawling page to enter the top-level directory for your site. It’s important to “normalize” the URL you enter by appending a slash — for example, “http://www.searchenginewatch.com/”.
That’s all there is to it. Grub will verify your user key, and allow your client — and only your client — to crawl your web site. There’s no need to worry about someone poaching your grub.txt file and placing it on their site. It won’t work, since the file has to be on the site you’ve specified, and your site will only be crawled by your client.
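Grub doesn’t document its server-side verification logic, but conceptually all it has to do is fetch grub.txt from the root of the site you registered and look for your key. A hypothetical sketch, with the URL normalization from the previous step included:

import urllib.request
from urllib.parse import urljoin

def normalize(site_url: str) -> str:
    # Append the trailing slash the local crawling form requires.
    return site_url if site_url.endswith("/") else site_url + "/"

def verify_ownership(site_url: str, expected_key: str) -> bool:
    # Fetch grub.txt from the registered site and check that it
    # contains the user key generated for this account.
    grub_txt_url = urljoin(normalize(site_url), "grub.txt")
    try:
        with urllib.request.urlopen(grub_txt_url, timeout=10) as resp:
            return expected_key in resp.read().decode("utf-8", "replace")
    except OSError:
        return False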
And if this sounds like an ideal way to spam LookSmart, or use techniques like doorway pages or cloaking, think again. Although your site won’t be crawled for indexing by other Grub clients, some of your pages will be crawled by other clients for comparison purposes.
LookSmart says that if it finds “serious” discrepancies it will disable your Grub client. As with all search engines, egregious spam will also likely get kicked out of any indexes maintained by LookSmart.
Why should you help LookSmart index the web? The altruistic reason is that it will help them broaden their coverage of the web, and potentially improve the relevance of search results. If Grub catches on, it’s likely to spur similar efforts by other search engines.
Grub also keeps stats for each user. You can see how much your client has crawled, and compare your “ranking” with other Grub users.
But the best reason, at least to me, is that watching a crawler in action is fascinating. It allows you to directly observe a process that’s normally hidden away in the black boxes we call search engines. Bottom line: it’s a heck of a lot of fun.
Grub
http://www.grub.org
Grub Frequently Asked Questions
http://www.grub.org/html/help.php?op=main-faq
Patent Wars!
The Search Engine Report, Oct. 6, 1997
http://searchenginewatch.com/sereport/97/10-patent.html
LookSmart isn’t the first search engine to use distributed computing. All the way back in 1997, Infoseek (whose technology is now owned by Disney) was issued a patent for “distributed searching,” a sort of federated meta-search process. Google is also experimenting with distributed computing with its Google Compute project.
Google’s New High Protein Diet
SearchDay, Mar. 25, 2002
http://searchenginewatch.com/searchday/02/sd0325-googlecom.html
Google is harnessing the collective computing power of its users to help model complex proteins, a project that could lead to the development of cures for Alzheimer’s, cancer, AIDS and other diseases.