When Wikia Search was released last night, Jimmy Wales explained they used a “placeholder index” for the search. While this may be appropriate for the alpha search, I’d like to ask Jimmy exactly how Wikia plans to crawl and index a significant portion of the web.
The Grub distributed crawler, which was acquired from LookSmart, appeared to provide most of the solution. At the O’Reilly Open Source Convention, Wales announced that he would immediately release the crawler to the open source community.
Grub’s downloadable client allows “the site owners the option of crawling their own data, with their own bandwidth. The client…is designed to connect to a central coordinating server, grab a batch of URLs, and then proceed to crawl them.” The project claims 20:1 savings in bandwidth for both Wikia and the hosting website.
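To make that batch-pull loop concrete, here is a rough sketch in Python. It is not Grub's actual code or protocol: the `Coordinator` class, `next_batch`, `report`, and `run_client` are hypothetical names, the coordinator lives in memory instead of on a server, and the step where the client reports results back is my assumption about how such a system would have to work.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class Coordinator:
    """Hypothetical in-memory stand-in for the central coordinating server."""

    def __init__(self, urls: List[str], batch_size: int = 2) -> None:
        self._queue: Deque[str] = deque(urls)
        self._batch_size = batch_size
        self.results: Dict[str, str] = {}

    def next_batch(self) -> List[str]:
        # Hand out up to batch_size URLs that have not been assigned yet.
        batch: List[str] = []
        while self._queue and len(batch) < self._batch_size:
            batch.append(self._queue.popleft())
        return batch

    def report(self, results: Dict[str, str]) -> None:
        # Assumption: clients send crawl results back to the coordinator.
        self.results.update(results)


def run_client(coordinator: Coordinator, fetch: Callable[[str], str]) -> int:
    """Grab batches and crawl them until the coordinator has no work left."""
    crawled = 0
    while True:
        batch = coordinator.next_batch()
        if not batch:
            break
        coordinator.report({url: fetch(url) for url in batch})
        crawled += len(batch)
    return crawled


def fake_fetch(url: str) -> str:
    # Placeholder for a real HTTP fetch, so the sketch runs offline.
    return f"<html>{url}</html>"


coordinator = Coordinator(
    ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
)
count = run_client(coordinator, fake_fetch)
print(count)
```

In a real deployment the fetch would hit the site owner's own pages locally, which is where the claimed bandwidth savings would come from.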
I’m not sure how much progress Wikia has made here since the summer. Grub’s site stats list a “Wikia Search” team that has crawled around 918k URLs so far, which seems far too low.
Site stats about Grub members tell a more complete story: the top 100 members have crawled 350 million URLs so far. The remaining 293 members aren’t shown, but if we assume 250k each on average, they would add about 73 million, for a total of roughly 423 million URLs crawled.
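For anyone who wants to check or tweak that back-of-the-envelope estimate, here it is as a few lines of Python. The 350 million and 293 figures come from the site stats quoted above; the 250k average is purely my guess, and the function name is just for illustration.

```python
# Figures quoted from Grub's site stats.
TOP_100_URLS = 350_000_000
REMAINING_MEMBERS = 293

# Assumption: average URLs crawled by each member outside the top 100.
ASSUMED_AVG_PER_MEMBER = 250_000


def estimate_total(top_urls: int, remaining_members: int, avg_per_member: int) -> int:
    """Known top-100 total plus an assumed average for everyone else."""
    return top_urls + remaining_members * avg_per_member


total = estimate_total(TOP_100_URLS, REMAINING_MEMBERS, ASSUMED_AVG_PER_MEMBER)
print(f"{total:,}")  # 423,250,000
```

Changing the assumed average barely moves the result, since the top 100 members account for most of the total either way.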
There are other planning considerations too, regarding what belongs in the index. Will they be able to include the “right” domains or exclude the “wrong” domains? Will they be able to crawl some domains more or less frequently? Will video, images or other media be included?
I would be interested in knowing the game plan for developing a substantial index over time. It’s not just about numbers, although an index of a billion or two URLs could help with a 2009 launch.