When Wikia Search was released last night, Jimmy Wales explained they used a "placeholder index" for the search. While this may be appropriate for the alpha search, I'd like to ask Jimmy exactly how Wikia plans to crawl and index a significant portion of the web.
The Grub distributed crawler, which was acquired from LookSmart, appeared to provide most of the solution. At the O'Reilly Open Source Convention, Wales announced that he would immediately release the crawler to the open source community.
Grub's downloadable client gives “the site owners the option of crawling their own data, with their own bandwidth. The client...is designed to connect to a central coordinating server, grab a batch of URLs, and then proceed to crawl them.” Grub claims a 20:1 savings in bandwidth for both Wikia and the hosting website.
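To make that batch-and-crawl loop concrete, here is a minimal in-memory sketch of the idea. It is not Grub's actual code or protocol: the `Coordinator` class, `fake_fetch`, the batch size, and the notion of reporting a page length back to the server are all assumptions made for illustration, and nothing here touches the network.

```python
from collections import deque


class Coordinator:
    """Hypothetical stand-in for the central coordinating server."""

    def __init__(self, urls, batch_size=3):
        self.pending = deque(urls)    # URLs still waiting to be crawled
        self.batch_size = batch_size  # assumed batch size, not Grub's
        self.results = {}             # url -> summary reported by a client

    def next_batch(self):
        # Hand out up to batch_size URLs; an empty list means no work is left.
        batch = []
        while self.pending and len(batch) < self.batch_size:
            batch.append(self.pending.popleft())
        return batch

    def report(self, url, summary):
        # Assumption: the client sends back a small summary, not the full page.
        self.results[url] = summary


def fake_fetch(url):
    # Placeholder for a real HTTP fetch so the sketch runs offline.
    return f"<html>content of {url}</html>"


def run_client(coordinator, fetch=fake_fetch):
    """Grab a batch, crawl it, report back, and repeat until no work remains."""
    crawled = 0
    while True:
        batch = coordinator.next_batch()
        if not batch:
            break
        for url in batch:
            page = fetch(url)
            coordinator.report(url, len(page))
            crawled += 1
    return crawled


server = Coordinator([f"http://example.com/page{i}" for i in range(7)])
print(run_client(server), "URLs crawled")
```

The point of the design is that the client does the fetching with its own bandwidth, while the server only hands out work and collects results.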
I'm not sure how much progress Wikia has made here since the summer. Grub's site stats list a "Wikia Search" team that has crawled around 918k URLs so far. That seems far too low.
Site stats about Grub members tell a more complete story: the top 100 members have crawled 350 million URLs so far. The remaining 293 members aren't shown, but if we assume they crawled 250k URLs each on average, then roughly 425 million URLs would have been crawled in total.
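The back-of-the-envelope estimate can be checked in a few lines. The 350 million and 293 figures come from the stats quoted above; the 250k average for the unlisted members is this post's assumption, not a reported number. The exact sum comes to a little over 423 million, close to the rounded total in the text.

```python
# Figures from the Grub member stats quoted above.
TOP_100_URLS = 350_000_000
HIDDEN_MEMBERS = 293

# Assumption from this post, not a published figure.
ASSUMED_AVG_PER_HIDDEN_MEMBER = 250_000

hidden_total = HIDDEN_MEMBERS * ASSUMED_AVG_PER_HIDDEN_MEMBER
estimated_total = TOP_100_URLS + hidden_total

print(f"Unlisted members: {hidden_total:,} URLs")
print(f"Estimated total:  {estimated_total:,} URLs")
```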
There are other planning considerations too, regarding what belongs in the index. Will they be able to include the "right" domains or exclude the "wrong" domains? Will they be able to crawl some domains more or less frequently? Will video, images or other media be included?
We would be interested in knowing the game plan for developing a substantial index over time. It's not just about numbers, although a billion or two could help with a 2009 launch.