Numbers, Numbers -- But What Do They Mean?

One of the latest trends these days is for crawler-based search engines to flaunt both how many pages they have in their index and the larger number of pages they visited to create that index. AltaVista says its collection of 250 million pages came from an original set of 400 million. FAST says its 400 million page index was developed from a group of 700 million. Excite's 250 million pages were retained after reviewing 920 million. And Inktomi says its core index of 110 million pages was created after analyzing over 1 billion across the web.
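To put those dual numbers side by side, here is a small Python sketch that works out what share of the visited pages each service says it kept. The figures are the ones quoted above; the names CLAIMS and kept_share are just labels I made up for the example, and Inktomi's "over 1 billion" is treated as exactly 1 billion, so its share is an upper bound.

```python
# Figures quoted in the article, in millions of pages: (kept in index, visited).
# Inktomi's visited figure is "over 1 billion", so its share is an upper bound.
CLAIMS = {
    "AltaVista": (250, 400),
    "FAST": (400, 700),
    "Excite": (250, 920),
    "Inktomi": (110, 1000),
}

def kept_share(kept, visited):
    """Percentage of visited pages that ended up in the index."""
    return 100.0 * kept / visited

for engine, (kept, visited) in CLAIMS.items():
    print(f"{engine}: {kept_share(kept, visited):.1f}% of visited pages kept")
```

On these claims, the shares work out to about 62.5 percent for AltaVista, 57.1 percent for FAST, 27.2 percent for Excite and no more than 11 percent for Inktomi.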

Citing dual numbers is a marketing spin we haven't seen since the first search engine size wars of 1996, when Lycos used to say it "knew" of millions of URLs even though its index contained only some of these. Journalists would then credit Lycos with the larger number, making the search engine look bigger than it was in comparison to its competitors -- no doubt exactly what the company had hoped would happen.

Why have dual numbers returned? Because no matter how big your competition is, the web is even bigger. Two well-publicized surveys in as many years have raised awareness that search engines actually index only a small portion of the documents available -- which has also led to the incorrect assumption that this must be why "no one can find anything." So when a search engine announces some size increase, it doesn't matter if it now has the biggest index of pages on the web. Someone somewhere is going to ask, "But what about the other millions of pages you aren't listing?"

By citing the larger number of pages visited, the search engines are able to answer this question in a better light. You'll be told that many of the pages they don't keep are spam, or are from "mirror" sites, or are not worth keeping for other reasons. This implies that while the web may have at least 1 billion documents, if not many more, only a small portion is actually worth indexing.

Much of this is certainly true, but it also leads to questions about what is being excluded and why. Are pages you may want being left out? Perhaps, but adding millions of pages to an index doesn't necessarily mean that you'll find what you are looking for.

To better understand this, let's start with Inktomi, where the dual numbers revolve around its use of link analysis or "link popularity" to improve results. Google is best known for this technique, but Inktomi has also made use of links since the middle of last year, and nearly all the major search engines now use link analysis to some degree.

In general, and I stress the "general," link analysis means that you look at the number of links pointing at different web pages, as well as the words in and around those links. So if 1 million pages link to Amazon and say "books" within the link or near it, then Amazon is more likely to come up tops for a search for "books" than some small bookseller that no one links to.
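For those who like to see the idea spelled out, here is a minimal Python sketch of that "general" description. It is not any search engine's actual algorithm. It only counts links whose own text contains the search term and ignores words near the link. The page names, the sample links and the function name rank_by_links are all invented for the illustration.

```python
from collections import Counter

# Hypothetical link data: (page containing the link, page linked to, link text).
LINKS = [
    ("fan-site.example", "amazon.example", "buy books online"),
    ("blog.example", "amazon.example", "great books"),
    ("review.example", "amazon.example", "music store"),
    ("friend.example", "smallshop.example", "rare books"),
]

def rank_by_links(links, term):
    """Rank target pages by how many inbound links mention the term."""
    votes = Counter()
    for _source, target, text in links:
        if term.lower() in text.lower().split():
            votes[target] += 1
    # most_common() sorts by vote count, highest first.
    return votes.most_common()

print(rank_by_links(LINKS, "books"))
```

With this made-up data, the page with two "books" links comes out ahead of the page with one, which is the Amazon-versus-small-bookseller effect in miniature.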

This link analysis is done against pages that are in the search engine's index. In the past, Inktomi's index held about 110 million pages, and it examined the links just on those pages. Now, Inktomi says its WebMap project allows it to leverage the links on over 1 billion pages from across the web, not just from the ones in its database.

To better understand this, let me introduce my "people in a room" metaphor. Previously, Inktomi would ask 110 million people to sit in a room and help it answer questions. When a question came in, those 110 million people would talk among themselves, see who spoke up as experts, then vote over which experts in the room should speak first. When it worked, you got great responses -- in the form of good sites appearing in the top search results. But if there weren't enough experts in the room, or if the people voting weren't knowledgeable, then the responses were disappointing -- nothing in the search results seemed satisfactory, or no matching web pages were found at all.

Today, Inktomi still has only 110 million people or so in the room who are allowed to answer questions. Voting, however, goes beyond the room. WebMap allows Inktomi to ask both those in the room and another 900 million people outside of it to decide which of the 110 million people in the room should speak first. By getting a broader range of opinions, Inktomi hopes it can deliver better results.
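The difference between the old and new approach can be sketched in a few lines of Python. This is only my reading of the metaphor, not Inktomi's code: the function rank_indexed_pages, the tiny INDEX and the LINKS list are all hypothetical. The point is that only pages in the index can be returned, while the set of pages allowed to "vote" with their links can be wider.

```python
from collections import Counter

def rank_indexed_pages(links, index, voters):
    """Order the pages in `index` by links received from pages in `voters`."""
    votes = Counter()
    for source, target in links:
        if source in voters and target in index:
            votes[target] += 1
    # Most votes first; ties are broken alphabetically to keep output stable.
    return sorted(index, key=lambda page: (-votes[page], page))

# Hypothetical data: pages "a" and "b" are in the index; "x", "y", "z" are not.
INDEX = {"a", "b"}
LINKS = [("a", "b"), ("x", "a"), ("y", "a"), ("z", "a")]

# Before: only pages inside the index get a vote.
print(rank_indexed_pages(LINKS, INDEX, voters=INDEX))
# After: every visited page votes, inside the index or not.
print(rank_indexed_pages(LINKS, INDEX, voters=INDEX | {"x", "y", "z"}))
```

In the first call, page "b" ranks first because the only vote counted comes from inside the room. In the second, the three outside votes push page "a" to the top.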

From this comes another important element, which relates to index size. If Inktomi can determine which documents are best, as rated in part by link analysis, then it can ensure that its index contains more of these documents. To return to the metaphor, by getting more opinions, Inktomi can better ensure that there are more experts among the 110 million people in the room who are allowed to answer questions. That should mean better answers in its search results.

Of course, some very useful pages don't have many links pointing at them. Is the index destined to be dominated by commercial sites that spend time building or buying links, while great pages at academic or educational sites are overlooked?

"This idea that the dot coms are filling up our index isn't real," said Matthew Hall, vice president of engineering in Inktomi's Search and Directory Division. "If we had only 5 million documents in our index, then that would be a worry," he said.

To explain further, Inktomi has found that only about 20 million pages across the web have at least one external link pointing at them, which makes them all prime candidates for inclusion in its index. That leaves room for an additional 90 million pages to be added, selected by sampling sites in a more traditional crawling manner, by doing a deeper crawl of important sites, or through the Add URL system.

As any crawler-based search engine will tell you, the bulk of Add URL submissions are spam. AltaVista recently estimated 95 percent of its submissions are spam, and Inktomi reports a similar percentage. Remember, some sites may attempt to submit thousands of documents per day. It's the sheer volume of this junk that causes the percentage of "good stuff" to be so low, not over-eager filtering on the part of crawlers.

This is where Inktomi's new use of clickthrough data plays an important role. New submissions done via Add URL that aren't automatically excluded as spam get to stay in the index for a few weeks. If they rank well, and people click on them, then Inktomi learns that these documents should perhaps be retained over the long term. Excite also uses a similar system.
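As a rough sketch of how such a trial period might work, consider the Python below. The 21-day trial and the five-click threshold are assumptions of mine, since Inktomi says only "a few weeks" and gives no click figure, and the Submission class and decide function are hypothetical names.

```python
from dataclasses import dataclass

# Assumed values for illustration only; the article says "a few weeks"
# and gives no click threshold.
TRIAL_DAYS = 21
MIN_CLICKS = 5

@dataclass
class Submission:
    url: str
    is_spam: bool
    days_in_index: int
    clicks: int

def decide(sub):
    """Return 'reject', 'trial', 'keep', or 'drop' for an Add URL submission."""
    if sub.is_spam:
        return "reject"            # filtered out before entering the index
    if sub.days_in_index < TRIAL_DAYS:
        return "trial"             # still inside its trial period
    return "keep" if sub.clicks >= MIN_CLICKS else "drop"

print(decide(Submission("http://new-site.example/", False, 30, 12)))
```

A page that survives the spam filter stays on trial until the period ends, and only then do its clicks decide whether it is retained.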

By the way, Inktomi also uses clickthrough measurements to improve ranking, not just indexing. As with Direct Hit, which popularized clickthrough rank improvement, top ranked pages at Inktomi that fail to be clicked on in response to particular searches may drop lower in its results, over time.
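Here is one simple way such a demotion could be sketched in Python. It is not Inktomi's or Direct Hit's formula, which neither company has spelled out here. The rerank function, the 2 percent clickthrough floor and the halving penalty are all assumptions chosen to make the effect visible.

```python
def rerank(results, min_ctr=0.02, penalty=0.5):
    """results: list of (url, score, impressions, clicks) for one query."""
    adjusted = []
    for url, score, impressions, clicks in results:
        # Pages never shown have no clickthrough rate and are left alone.
        ctr = clicks / impressions if impressions else None
        # Pages that were shown but rarely clicked lose part of their score.
        if ctr is not None and ctr < min_ctr:
            score *= penalty
        adjusted.append((url, score))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)

# Hypothetical results for one query, originally ordered a, b, c.
RESULTS = [
    ("a.example", 1.0, 1000, 5),    # top ranked, almost never clicked
    ("b.example", 0.8, 1000, 100),  # clicked often
    ("c.example", 0.6, 0, 0),       # new page, not yet shown
]
print([url for url, _score in rerank(RESULTS)])
```

In this example the formerly top-ranked page drops to third because searchers kept passing it over.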

As you can see, Inktomi is trying to be very smart in creating a relatively small but representative index of the web. But why not dump all 1 billion pages into the index and still somehow work relevancy magic? Wouldn't that be better? Perhaps in the future, Inktomi says, but at the moment, it believes a smaller index means faster responses for users and better relevancy.

"From our testing, the two search engines that have markedly better relevancy than other people are Google and Inktomi. The search engines with the largest indices have the poorest relevancy. You could say that's coincidental, but we don't think so," said Dennis McEvoy, senior vice president at Inktomi.

As you might expect, search engine size leader FAST Search takes issue with this. Over the past year, the service has continually announced new size increases, in line with the goal of indexing "all the web," as its motto goes. FAST Search thinks you can have both a giant index and good relevancy.

"We see no reason for leaving out these other pages. The argument is that a smaller index means faster response. For us, it doesn't matter. Five million documents or 5 billion, our response time will be the same," said John Lervik, FAST's chief technology officer.

FAST believes that its ability to succeed in this regard is tied to its architecture, which uses relatively inexpensive PCs rather than more traditional mainframes. FAST says this lets it scale both cheaply and effectively. So while Inktomi is using smarts to make the most of its smaller index, FAST is using brawn to wrestle the web into submission.

Lervik also notes that at his service, searches for the 2 million most popular topics make up only 25 percent of total queries. The remainder are for unique or rare terms, Lervik says. FAST feels a large index is thus essential, in order to provide matches for these queries.

"We are trying to visit as many pages as possible, fresh pages, and base our index on that, rather than potentially leaving out pages that might satisfy queries," Lervik said.

In other words, the more people you can get into the room, the greater the odds you'll have an expert among them, especially if you also implement a good system that allows those in the room to vote among themselves about who will speak first. That's essential, because otherwise, the best experts may have to wait until unknowledgeable people have spoken first. That translates into poor relevancy in search results, a turn off for any search engine user.

Like Inktomi, FAST quotes dual numbers of what it has visited and what it has kept in the index, as do AltaVista and Excite. But unlike Inktomi, the higher number quoted by these players has nothing to do with trying to leverage links. Instead, it relates to culling out duplicate pages from mirror sites and spam.

I'll explain this using the people in a room metaphor in a moment, but first I have to explain a bit more about mirror sites and multiple domain name usage, which pose a real challenge to all the crawler-based search engines. For instance, you can reach the Search Engine Watch home page via any of these addresses:

That means the same page could be recorded three times, even though it exists once. In other cases, the same page may be duplicated in various locations across the web. Either way, a search engine doesn't want to record these mirror instances. That just wastes space in the index. In fact, remember the 1 billion number from Inktomi? That doesn't include mirror sites and duplicate pages. Add them in, and Inktomi says there are at least 4 billion pages across the web.
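One common way to spot exact copies, sketched below in Python, is to fingerprint each page's content and keep only the first address seen for each fingerprint. I am not claiming this is how any of these search engines do it. The dedupe function and the example addresses are hypothetical, and the whitespace and lowercase normalization is deliberately crude.

```python
import hashlib

def dedupe(pages):
    """pages: iterable of (url, html). Keep the first URL seen per distinct content."""
    seen = {}
    for url, html in pages:
        # Collapse whitespace and case so trivial differences do not matter.
        normalized = " ".join(html.split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        seen.setdefault(digest, url)
    # Dictionaries keep insertion order, so first-seen URLs come out in order.
    return list(seen.values())

# Hypothetical crawl: the first three addresses serve the same page.
PAGES = [
    ("http://example.com/", "<h1>Hi</h1>"),
    ("http://www.example.com/", "<h1>Hi</h1>\n"),
    ("http://mirror.example.org/", "<h1>hi</h1>"),
    ("http://example.com/other", "<h1>Other</h1>"),
]
print(dedupe(PAGES))
```

Four addresses go in and two distinct pages come out, which is the same kind of shrinkage that turns Inktomi's 4 billion figure into 1 billion.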

Now back to the people in a room metaphor. FAST, Excite, AltaVista -- these search engines go out and get people to sit in a room to answer questions -- a lot more people than Inktomi invites. But because of mirrors on the wall and bad lighting, they think they have more people in the room than are really present. After carefully recounting, they end up with a more accurate figure. Hence the dual numbers you hear.

The plus to these search engines? By having more people in the room, they can potentially answer questions on a broader range of topics, assuming they also have systems in place to float the best experts to the top of the list.

The plus to Inktomi? By leveraging the opinions of people from outside the room, it can potentially do a better job at floating better documents to the top of its results for popular queries.

Then there's Google. It's unique because it does the opposite of the other major services. Google has indexed about 140 million web pages. But from those pages, it also knows about pages it has never seen or visited. It will even list some of these pages within its results -- something no other major search engine will do.

Again, the metaphor. It's almost as if Google gets 140 million people in a room. When a question comes in, they all vote over who gets to answer. They might also suggest you talk to friends of theirs who aren't in the room, if those friends can better answer your question. In total, this gives you access to about 250 million people across the web, even though all of them aren't in the room.
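One plausible way a search engine could list a page it has never visited is to match on the words other pages use when linking to it. The Python sketch below illustrates that idea only; it is not Google's implementation, and the search function and sample data are invented.

```python
def search(term, crawled, links):
    """crawled: {url: page text}; links: list of (source, target, link text)."""
    term = term.lower()
    hits = []
    for url, text in crawled.items():
        if term in text.lower().split():
            hits.append((url, "crawled"))
    known = set(crawled)
    for _source, target, link_text in links:
        # A page never visited can still match through the words that
        # other pages use when linking to it.
        if target not in known and term in link_text.lower().split():
            hits.append((target, "link only"))
            known.add(target)
    return hits

# Hypothetical data: a.example was crawled, b.example was not.
CRAWLED = {"a.example": "all about search engines"}
LINKS = [
    ("a.example", "b.example", "search tips"),
    ("a.example", "b.example", "more search help"),
]
print(search("search", CRAWLED, LINKS))
```

Going by the figures above, roughly 110 million of the 250 million pages (250 million minus 140 million) would be "friends outside the room" known only in this secondhand way.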

So many numbers, and so many methods -- whose works best? As always, that depends on you and what you are looking for. Among the crawler-based services, Google and Inktomi (as measured at a partner site) did very well in my recent test involving presidential candidate web sites. But that test specifically involved simple searches for popular web sites. Look for something obscure or in a complex manner, and AltaVista, FAST or Excite might come out on top instead. Or, look for something else that's popular, and any of these search engines might succeed. The best judge is yourself -- try different search engines and see what you like.

Inktomi WebMap

Details of Inktomi's WebMap project, along with some interesting statistics. Inktomi also says its WebMap is completely updated every 90 days, while the pages in its index are updated every three weeks. In addition to its core database of 110 million pages, Inktomi maintains a European database of 50 million pages and a Japanese-language index of 30 million pages.

Northern Light

I haven't mentioned Northern Light in this article, primarily because it hasn't played the dual numbers game. But it is very much a player in the search engine size wars and should be on your list if you are looking to do a comprehensive scan of the web.

Search Engine Sizes

Facts and figures relating to search engine sizes, as well as links to search engine size surveys and past articles about FAST, which talk about its use of inexpensive computers to index the web.

Can You Find Your Candidate?
The Search Engine Report, Feb. 29, 2000

Information about the presidential candidate survey I mentioned can be found here.