Yahoo! Birth of a New Machine

Yahoo is rolling out a brand new search engine today, with its own index and ranking mechanisms, casting aside its long-standing use of Google-powered search results. The move is bound to roil the industry and sets in motion a new race for the claim of web search champion.

Ever since Yahoo's acquisition of Inktomi nearly a year ago, speculation has focused on when the company would replace its Google-powered search results with those from the Inktomi index.

In a surprising move, Yahoo isn't replacing Google with Inktomi. Rather, the company has developed a brand new search engine, drawing on the lessons learned from what the company calls the "critical mass" of search engineering talent that it has brought together through hiring and acquisitions, as well as investment in infrastructure and product quality.

"High quality, talented search engineers are in very short supply these days," said Jeff Weiner, Yahoo's senior vice president of search and marketplace. "Regardless of how good your planning process is, at the end of the day it comes down to people and chemistry."

Weiner said Yahoo has waited until now to make the switch from Google to be certain users would have the best experience possible after the transition. "It was absolutely essential to us that we had a roadmap in place that not only let us sustain our quality, but build on it."

Although the switch to self-powered search results is a radical one, Yahoo has steadily made incremental improvements in its search capabilities for more than a year. In October 2002, the company made the most significant change to its operation since its birth, replacing its human-compiled directory listings with Google search results.

Then in April of last year, the company rolled out its new Yahoo Search, introducing a streamlined search page. It also added new tabs to search result pages offering access to its directory listings, news, images, and yellow pages.

Today's launch is the beginning of a progressive rollout that will take place over the next few weeks. It is also the beginning of numerous planned enhancements focusing on web search, personalization and vertical search.

It's important to note that the new search engine is for web results only. Image search is still powered by Google, and News search is still a combination of Yahoo's own editorial and technological resources.

How does the new Yahoo search engine differ from Google? The presentation of the results is very similar. Yahoo has wisely opted to keep things looking mostly the same, with a few exceptions. There's a link to the cached copy of each indexed page -- now being served from Yahoo, not Google -- but the "more pages from this site" link is gone. Just about everything else on search result pages looks the same.

How much the results returned by Google and Yahoo differ depends on the query. For popular or common queries, there seems to be very little difference between the two engines in the top few results. But once you get past those, the results tend to diverge dramatically. And for less common queries, Yahoo results look quite different from Google results.

While Yahoo and Google are likely using similar algorithms, one reason for the differences in what's displayed is that Yahoo's email and search teams are now working together to leverage what they've learned about spam. Since Yahoo Mail processes billions of email messages, this knowledge is likely quite helpful in giving Yahoo a much deeper understanding of the characteristics of spam -- and in helping keep the nasty stuff out of the web page index.

Bottom line: I'm impressed with the quality of results that Yahoo is delivering. It's a very viable alternative to Google and the other "last engine standing," Ask Jeeves/Teoma.

What's Being Indexed?

The Yahoo Search index is capturing the full text of web pages, up to a 500K limit. This is greater than the 101K maximum indexed by Google. A broad range of file types, including HTML, PDF, and Microsoft Office documents, is also included in the mix.

In addition to indexing the full text of web pages, the new service is also indexing the contents of the meta keyword tag, something that virtually no other search engine pays attention to. If a page has an associated RSS page containing XML metadata (such as this one for Search Engine Watch), the content of that page is also indexed.
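For readers who want to see what a crawler would pick up from the meta keyword tag, here is a minimal sketch using Python's standard html.parser module. It is not Yahoo's code; the class and function names (MetaKeywordParser, extract_meta_keywords) are hypothetical, and the assumption that keywords are comma-separated simply reflects common webmaster practice.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the terms listed in <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Assumption: keywords are comma-separated; blank entries are dropped.
            self.keywords.extend(
                term.strip() for term in content.split(",") if term.strip()
            )


def extract_meta_keywords(html):
    """Return the meta keyword terms found in an HTML string, in page order."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords
```

Feeding it a page whose head contains a keywords tag returns the individual terms as a list, which an indexer could then treat as additional text associated with the page.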

The new crawler is adaptive, meaning it attempts to understand how often documents change. Yahoo frequently crawls documents that change often. Documents that haven't changed in a long time are recrawled less frequently. Newly discovered documents and millions of frequently changing documents are refreshed in a daily crawl. New content and pages that have changed are pushed to the index twice per week.
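Yahoo has not disclosed how its crawler decides when to revisit a page, but the general idea of adaptive recrawling can be illustrated with a simple sketch. Everything here is an assumption made for illustration: the PageState record, the halve-or-double rule, the checksum comparison, and the one-day and 30-day bounds (only the daily floor echoes the daily crawl mentioned above).

```python
from dataclasses import dataclass

MIN_INTERVAL_DAYS = 1.0   # assumed floor, matching the idea of a daily crawl
MAX_INTERVAL_DAYS = 30.0  # assumed ceiling, purely illustrative


@dataclass
class PageState:
    """Hypothetical per-URL record kept by the crawl scheduler."""
    url: str
    interval_days: float = 7.0  # assumed starting interval
    last_checksum: str = ""


def update_interval(page, new_checksum):
    """Halve the recrawl interval if the page changed, double it if it did not."""
    if new_checksum != page.last_checksum:
        page.interval_days = max(MIN_INTERVAL_DAYS, page.interval_days / 2)
    else:
        page.interval_days = min(MAX_INTERVAL_DAYS, page.interval_days * 2)
    page.last_checksum = new_checksum
    return page.interval_days
```

Under this rule, a page that changes on every visit quickly settles at the one-day floor, while a page that never changes drifts out to the 30-day ceiling, which mirrors the behavior described in the paragraph above.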

There is no massive monthly rebuild of the index, in the fashion of the Google dance. Theoretically, this means that Yahoo's index may be fresher than Google's, though Google itself refreshes a lot of frequently changing content without waiting for the monthly rebuild. Until we can do more comparison testing between the two, the jury's out on whether Google or Yahoo has the fresher index.

How big is Yahoo's index? The company isn't saying, despite Google's announcement yesterday that it has expanded its index to nearly 4.3 billion documents (6 billion, if you count images and newsgroup postings, as Google does). In my tests, one Yahoo search result reported more than 2.1 billion documents found, so while Yahoo's index is apparently not quite as large as Google's, it's not tiny, either.

"We're very confident in the quality and size of our index, and we think the results speak for themselves," said Weiner.

What About AltaVista and AlltheWeb?

Last year, before Yahoo acquired Overture, Overture itself was busy acquiring AltaVista and AlltheWeb. Speculation at the time was that Overture would kill off AltaVista's technology, and power both search sites using the AlltheWeb index.

To the contrary, both search engines continued to maintain their own independent indexes. Then, in July 2003, Yahoo bought Overture. Less than a month later, Search Engine Watch editor Danny Sullivan and I visited AltaVista and AlltheWeb, and learned that the plan was to unify the two search engines, keeping the strongest technologies from both.

That was exciting news. But then nothing seemed to change. Today, both AltaVista and AlltheWeb continue to maintain separate indexes, and Yahoo isn't saying publicly whether this will change with the introduction of the new Yahoo Search Technology index.

Getting Pages Indexed in Yahoo Search

There are two ways to get content indexed in the new Yahoo search engine: by using a free URL submission box, or by using Yahoo's paid inclusion program.

Yahoo has added a free "add URL" box that you can access by becoming a registered Yahoo user and logging in to the service. Submitting a URL this way is a "suggestion" to have the page added to the crawl, but Yahoo stresses that this is not a guarantee that the page will be included.

To be certain your content is included, you'll need to use the Inktomi paid inclusion program. Somewhat confusingly, Yahoo continues to offer three paid inclusion programs: Inktomi's, AltaVista's, and AlltheWeb's (via Lycos). Until now, all three have been important to consider to ensure broad coverage.

But now, Inktomi's Search Submit (for small sites) and Index Connect (for sites with more than 1,000 pages) are the only paid inclusion programs that feed into the new Yahoo Search index.

Rumors have been circulating on many webmaster forums that Yahoo is planning major changes to its paid inclusion programs, perhaps unifying all three programs into a single solution, but the company declined to confirm that these changes are coming or when they will occur.

Should you consider using the Inktomi paid inclusion programs? If you want broad reach, the answer is an unequivocal yes. The new Yahoo Search Technology index not only provides web search results on Yahoo and its various properties, but is now powering Yahoo partners that formerly used the Inktomi index, such as MSN Search, About.com, Goo and others.

Spam guidelines are similar to those previously published by Inktomi and AlltheWeb. While the specific guidelines weren't yet available at the time I filed this story, Yahoo says they will be online soon.

What's Coming Next

In addition to continually working to improve the quality of its web search results, Yahoo plans to put particular emphasis in the coming months on personalization and vertical search. The company's My Yahoo portal already offers extensive content customization options.

The newly released SmartSort option in Yahoo Shopping, which provides very specific product advice for digital cameras, MP3 players, computers and other electronic devices based on criteria you enter, is one example. The ability to add RSS feeds to your My Yahoo page is another.

"Ultimately we want to understand the intention of the user, and I think we're going to get closer to that through personalization," said Weiner.

In the vertical search arena, Yahoo plans to focus on local, travel, personals, and its HotJobs search portal.

But these moves are clearly just the beginning of many more to come at Yahoo. "Over time you're going to see Yahoo extend our search technology, and ultimately into our media properties," said Weiner. "To a large extent that will help drive our growth."

And give Google, Ask Jeeves, and Microsoft good reason to be even more attentive to the quality of their search results. The coming year promises to be a very good one for searchers.

About the author

Chris Sherman is a frequent contributor to several information industry journals. He's written several books, including The McGraw-Hill CD ROM Handbook and The Invisible Web: Uncovering Information Sources Search Engines Can't See, co-authored with Gary Price. Chris has written about search and search engines since 1994, when he developed online searching tutorials for several clients. From 1998 to 2001, he was About.com's Web Search Guide.