How Will Wikia Grow The Index?

Date published: 7 January 2008

Categories

Social

When Wikia Search was released last night, Jimmy Wales explained that it uses a “placeholder index” for now. While this may be appropriate for an alpha release, I’d like to ask Jimmy exactly how Wikia plans to crawl and index a significant portion of the web.

The Grub distributed crawler, which was acquired from LookSmart, appeared to provide most of the solution. At the O’Reilly Open Source Convention, Wales announced that he would immediately release the crawler to the open source community.

Once downloaded, the Grub client gives “the site owners the option of crawling their own data, with their own bandwidth. The client…is designed to connect to a central coordinating server, grab a batch of URLs, and then proceed to crawl them.” Grub claims a 20:1 savings in bandwidth for both Wikia and the hosting website.
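The loop described in that quote (connect to a coordinator, grab a batch of URLs, crawl them) can be sketched roughly as follows. This is only an illustration and not Grub's actual code or protocol: the `Coordinator` class, `run_client`, `fake_fetch`, the batch size, and the idea of reporting back only a page length are all assumptions made so the example runs offline.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class Coordinator:
    """Hypothetical in-memory stand-in for the central coordinating server."""

    def __init__(self, urls: List[str]) -> None:
        self._pending: Deque[str] = deque(urls)
        self.results: Dict[str, int] = {}

    def next_batch(self, size: int) -> List[str]:
        # Hand out up to `size` URLs that have not been assigned yet.
        batch: List[str] = []
        while self._pending and len(batch) < size:
            batch.append(self._pending.popleft())
        return batch

    def report(self, url: str, content_length: int) -> None:
        # Assumption: the client reports a small summary, not the whole page.
        self.results[url] = content_length


def run_client(coordinator: Coordinator,
               fetch: Callable[[str], str],
               batch_size: int = 2) -> int:
    """Request batches until none are left; return the number of pages crawled."""
    crawled = 0
    while True:
        batch = coordinator.next_batch(batch_size)
        if not batch:
            break
        for url in batch:
            body = fetch(url)
            coordinator.report(url, len(body))
            crawled += 1
    return crawled


def fake_fetch(url: str) -> str:
    # Placeholder for a real HTTP request so the sketch needs no network.
    return f"<html>{url}</html>"


coordinator = Coordinator([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
])
pages_crawled = run_client(coordinator, fake_fetch)
```

In this toy version the three URLs are handed out in two batches, and the coordinator ends up with one result per URL.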

I’m not sure how much progress Wikia has made here since the summer. Within Grub’s site stats, there’s a “Wikia Search” team that has crawled only around 918k URLs so far. That seems far too low.

Site stats about Grub members tell a more complete story: the top 100 members have crawled 350 million URLs so far. The remaining 293 members aren’t shown, but if we assume they average 250k each (about 73 million more), then roughly 425 million URLs would have been crawled in total.
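The back-of-the-envelope estimate above can be checked in a few lines. The 250k average for the unlisted members is the article's own guess, not a measured figure, and the constant names are just labels chosen for this sketch.

```python
# Figures quoted from Grub's site stats in the paragraph above.
TOP_100_TOTAL = 350_000_000   # URLs crawled by the top 100 members
REMAINING_MEMBERS = 293       # members not shown in the stats

# Assumption from the article: unlisted members average 250k URLs each.
ASSUMED_AVERAGE = 250_000

remaining_total = REMAINING_MEMBERS * ASSUMED_AVERAGE
estimated_total = TOP_100_TOTAL + remaining_total

print(f"{remaining_total:,}")   # 73,250,000
print(f"{estimated_total:,}")   # 423,250,000
```

The sum comes to about 423 million, which is close to the 425 million figure quoted above.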

There are other planning considerations too, regarding what belongs in the index. Will they be able to include the “right” domains or exclude the “wrong” domains? Will they be able to crawl some domains more or less frequently? Will video, images or other media be included?
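To make those questions concrete, here is a minimal sketch of what a per-domain crawl policy might look like. Nothing here reflects how Wikia or Grub actually handle this: the `CrawlPolicy` class, its field names, the one-week default, and the example domains are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set
from urllib.parse import urlparse


@dataclass
class CrawlPolicy:
    """Hypothetical policy: which domains to skip and how often to revisit."""

    excluded_domains: Set[str] = field(default_factory=set)
    recrawl_hours: Dict[str, int] = field(default_factory=dict)
    default_recrawl_hours: int = 168  # assumed default: once a week

    def allows(self, url: str) -> bool:
        # A URL is crawlable unless its host is on the exclusion list.
        host = urlparse(url).hostname or ""
        return host not in self.excluded_domains

    def interval_for(self, url: str) -> int:
        # Use the per-domain interval when one is set, else the default.
        host = urlparse(url).hostname or ""
        return self.recrawl_hours.get(host, self.default_recrawl_hours)


policy = CrawlPolicy(
    excluded_domains={"spam.example"},
    recrawl_hours={"news.example": 1},  # revisit a news site hourly
)
```

With this policy, `policy.allows("http://spam.example/x")` is `False`, while a URL on `news.example` is allowed and scheduled for an hourly revisit.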

I would be interested in knowing Wikia’s game plan for developing a substantial index over time. It’s not just about numbers, although a billion or two indexed URLs could help with a 2009 launch.
