Make Room For Teoma

After the loss of Go earlier this year and the expected departure of NBCi, you might have thought it was all over in the search engine game. However, just as consolidation seemed inevitable, new player Teoma has stepped up with an impressive debut of its new search service.

Opened to the public last month, Teoma leverages link structures from across the web not only to provide relevant results but also to present different views of information automatically.

Currently in beta, the site is primarily intended to demonstrate Teoma's technology to potential partners or buyers.

"We're in discussions with many of the major portals and also the major technology companies," said Paul Gardi, Teoma's president and chief operating officer.

The idea of running Teoma as a standalone search engine for the public hasn't been ruled out, and even though the current site is designed as a demonstration, it's already powerful enough that searchers may want to add it to their research arsenal.

Teoma is a crawler-based service and has a collection of about 100 million URLs. That's tiny compared to the 500 million URL and higher range that most of the other major crawlers have. However, even Teoma's smaller collection may produce some good results on popular queries.

Don't forget, Google was getting great buzz with only 25 million URLs when it debuted in 1998. Google's index was smaller than its competitors and stayed that way through much of 1999, but users gravitated to it anyway because of the quality of its results.

Of course, to be a serious contender in the search engine space, Teoma will need to grow, and it is planning to do so.

"Internally, we're working with a data [center” set that's larger than the demo site. We're capable of scaling. We could go to billions of pages, handle large volumes of traffic, and the quality of search would be excellent," Gardi said.

Database size is an important factor, but without good relevancy ranking, a large index isn't necessarily useful. Teoma hopes its own style of link analysis will give it the ability to take on the widely-acknowledged relevancy leader, Google.

To understand what Teoma is doing, it makes sense to summarize the Google system first. Google examines link structures all over the web. By doing so, it can give every page a popularity rating known as "PageRank" (named after Google cofounder Larry Page). When you do a search, URLs with high PageRanks are more likely to be listed first. However, this will only happen if the pages also match other criteria, such as containing your search terms or being identified as relevant to your search terms through analysis of link context.
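For readers who want to see the idea in concrete terms, here is a minimal sketch of the textbook PageRank calculation. It is an illustration only, not Google's production system: the function name, the damping value of 0.85, the iteration count and the tiny three-page link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    All names and defaults here are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: "a" and "b" both link to "c", and "c" links to "a".
example_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
example_rank = pagerank(example_links)
```

In this toy graph, "c" ends up with the highest score because two pages point at it, while "b" ends up lowest because nothing links to it.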

Teoma operates in an opposite fashion. When you do a search, Teoma looks across the entire web to find pages that contain your search terms or which are considered relevant to those terms based on link context. After finding a matching set of documents, which it calls a "community," Teoma then examines the links between just this set to determine which are the most popular.
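The difference between counting links across the whole web and counting them only within a matching set can be sketched in a few lines. Teoma has not published its method, so simple inbound-link counting is used here as a stand-in for "popularity," and the function name, page names and link data are all hypothetical.

```python
def inbound_counts(links, community=None):
    """Count inbound links per page.

    links maps each page to the list of pages it links to.
    If community is given, only links where both the source and the
    target belong to that set are counted (the "local" view).
    """
    counts = {}
    for source, targets in links.items():
        if community is not None and source not in community:
            continue
        for target in targets:
            if community is None or target in community:
                counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical link data: three hockey pages plus three off-topic pages
# that all point at one of them.
example_links = {
    "hockey-fans": ["nhl", "hockey-news"],
    "hockey-news": ["nhl"],
    "nhl": ["hockey-news"],
    "offtopic-1": ["hockey-fans"],
    "offtopic-2": ["hockey-fans"],
    "offtopic-3": ["hockey-fans"],
}
hockey_community = {"hockey-fans", "hockey-news", "nhl"}

global_view = inbound_counts(example_links)
local_view = inbound_counts(example_links, hockey_community)
```

Counted globally, "hockey-fans" looks like the most popular page with three inbound links. Counted only within the community, it has none, because all of its links come from pages that are not on the subject.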

"At the end of the day, we are ranking sites based on other sites that are on the subject," Gardi said. "We don't only use all the sites that are pointing at a site, we also use that are on the subject."

The implication is that Teoma's "community" generated results will be more relevant than those from Google or others that use a "global" system which examines the entire web, because links from irrelevant pages are excluded. However, this understates what Google does.

Yes, PageRanks at Google are computed from examining the entire web, but link context and the content of web pages are also taken into account. This is supposed to reduce the impact of "irrelevant" pages in Google's system.

"Topic specific PageRank versus general PageRank, I'm not sure how much of a difference there is," said Urs Holzle, a Google Fellow and the company's former vice president of engineering. "Suppose you search for something about ice hockey. The sites that come up, where are they getting their PageRank from? Most likely, other ice hockey sites."

While the value of calculating global versus relative page popularity may be arguable, there's no disagreement that Teoma's system is allowing it to do two things that Google does not: the presentation of "Expert Links" and the autoclassification of pages into topics.

Let's take Expert Links first. When you search at Teoma, a list of "Expert Links" appears along the right side of the page. These listings are pages that provide links to a wide range of resources on a particular topic. In other words, these are "link lists" or "weblogs" for a particular subject.

Here's another way to think of it. If you go to Yahoo and search for something, you'll usually be led to a matching category that lists a variety of web sites on your search topic. Other people create these types of topic-specific lists, and Teoma's Expert Links area is designed to help you easily find these types of resources from across the web.

"We're finding these Expert Link pages. We're the only engine that can find these pages in real time," Gardi said.

There's another name given to these types of pages: "hubs." This term came out of link analysis work done by Cornell University researcher Jon Kleinberg and refers to a page that links outward to a variety of other pages on a particular topic. Think of a bicycle wheel to understand this. At the center is the hub, and the spokes that radiate outward from the hub are like the links that lead out from a hub page.

In Kleinberg's system, there's another type of special page that was defined, "authorities." Imagine you needed an expert in a particular field, and you asked a bunch of people who they would recommend as an authority. If many people all directed you to the same person, then you'd likely consider that person an important authority on the topic. In the same way, authorities are pages with many inbound links pointing at them.
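Kleinberg's hubs and authorities reinforce each other: a good hub links to good authorities, and a good authority is linked to by good hubs. Here is a minimal sketch of that mutual scoring. It follows the standard published formulation of the algorithm, not the code of Clever or Teoma, and the function name, iteration count and sample pages are assumptions for illustration.

```python
import math


def hubs_and_authorities(links, iterations=20):
    """Score pages as hubs and authorities by mutual reinforcement.

    links maps each page to the list of pages it links to.
    Returns two dictionaries: (hub_scores, authority_scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical pages: three link lists of different sizes pointing at
# three content pages.
example_links = {
    "directory": ["team", "league", "stats"],
    "fanpage": ["team", "league"],
    "blog": ["team"],
}
hub_scores, authority_scores = hubs_and_authorities(example_links)
```

In this example, "directory" comes out as the strongest hub because it links to the most well-regarded pages, and "team" comes out as the strongest authority because every hub points to it.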

When I first wrote about Kleinberg's work back in 1998, I specifically avoided writing about hubs and authorities. Why not? The distinction didn't matter to the average web searcher. The Clever project, which grew out of Kleinberg's work, presented both hubs and authorities but wasn't available to the public. The other big link analysis search engine that was emerging, a little experimental service with the funny name of Google, didn't try to present authorities and hubs. It just gave you results.

That remains the case today with Google. Do a search, and you'll get one list of results. Authority pages will be there, but important hub pages may also emerge. The advantage to this is simplicity. Many users simply want a top ten list of good pages, rather than having to choose between a variety of different options.

Google also says that a disadvantage to showing a list of hub pages is that they present a spam problem.

"Hubs are very easy to fake. It's a common spamming mistake," said Holzle.

What keeps Google from getting faked out, Holzle says, is its use of global links to determine PageRank. By looking across the web, rather than within a specific collection of documents, it can spot the fake hubs.

"With PageRank, you could create the same page that looks like a hub, but we'd see that no one is referring to it [broadly”," Holzle said.

Teoma counters that it is able to filter out any fake hubs that occur.

"Typically the fake hubs generate their own fake communities", says Apostolos Gerasoulis, Teoma's founder and chief technology officer. "Teoma is able to separate the fake hubs from the good hubs using local rank and clustering to avoid spamming."

So, there are pros and cons to Teoma's presentation of hub sites. The big plus is that if you want these types of pages, Teoma is a great resource for finding them. The main downside is that you may prefer the simplicity of having a single results list. However, Teoma does note that any really popular hub pages should also appear in its main results list, just as Google mixes hubs and authorities together.

Teoma's other unique feature is the autoclassification of web pages. This was the original promise of Clever -- that it would be able to automatically create topic lists of web sites, relieving companies such as Yahoo of the burden of paying editors to do the work by hand. Of course, since that time, classification has turned into a profit center for Yahoo and LookSmart, with per submission fees being charged.

At the top of Teoma's results page is a section called "Web Pages Grouped By Topic." Underneath, all the pages found that match your query have been grouped into broad categories. You can click on a category link to narrow your focus, and you can further drill down, as desired.

Fans of Northern Light will see similarities between this and Northern Light's "Custom Search Folders," which also group results into categories in real time. A key difference is presentation. Northern Light's folders, which have always been a useful alternative way to scan results, have always been tucked off to the side of its main results. Teoma's categories are front and center before users, which will likely increase use.

To perform the categorization, Teoma looks at the results set, then seeks out "clusters" or "communities" of 300 or more pages that link to each other. When these clusters emerge, the link text is analyzed to find the most common words, which are then used to describe the category. This use of link analysis is also different from the pure text analysis that Northern Light does, Teoma says.
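The naming step described above, finding the most common words in a cluster's link text, can be sketched as a simple word count. This is a rough illustration rather than Teoma's actual code: the function name, the short stopword list and the sample link text are all assumptions.

```python
from collections import Counter

# A tiny, hypothetical list of words too common to make useful labels.
STOPWORDS = {"the", "a", "an", "of", "and", "for", "to", "in", "on", "at", "by"}


def label_cluster(link_texts, top_n=2):
    """Return the top_n most common meaningful words in a cluster's link text."""
    counts = Counter()
    for text in link_texts:
        for word in text.lower().split():
            word = word.strip(".,:;!?\"'()")
            if word and word not in STOPWORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical link text collected from pages in one cluster.
example_link_texts = [
    "NHL hockey scores",
    "Hockey scores and stats",
    "ice hockey news",
    "hockey scores",
]
category_words = label_cluster(example_link_texts)
```

Here the cluster would be described with the words "hockey" and "scores," since those appear most often in the link text pointing at its pages.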

Will such categorization come to Google? It's already there -- in the Google Web Directory, which is based on the Open Directory. This was seen as a more effective way to present categorized results than by using autocategorization, something Google has experimented with.

"A few years ago, we had a similar feature that never went public," said Holzle. "Our conclusion then was that no matter how much you do, you end up with something that works 70 percent of the time. How many clusters do you come up with? Are there too many, or are they not detailed enough?"

Other problems Google found were that folders were sometimes given strange names and that important sites were sometimes left out.

"Basically, we concluded something like the Open Directory does it better," Holzle said.

How about Teoma's main results, the "Web Page" section -- what's there? These are the authority pages I talked about earlier, the pages that are more likely to answer your questions, in contrast to the Expert Links hub pages that don't provide answers but may lead you to pages that do.

Teoma grew out of a federally funded project in 1998 at Rutgers University. The Teoma technology team is led by Gerasoulis and Professor Tao Yang, from University of California, Santa Barbara, who is chief scientist and vice president of research and development. Now a private company with funding from Hawk Holdings, Teoma hopes that it will establish some portal partnerships within the coming months. If not, then the Teoma site itself is likely to be expanded beyond the current demo.

The company is also considering enterprise and site search services in the future, as well as licensing its categorization tools to those who want to create their own directories or vertical portals.



Northern Light


Meta search tool that provides autocategorization similar to Northern Light's Custom Search Folders.

Counting Clicks and Looking at Links
The Search Engine Update, August 4, 1998

Discusses the emergence of clickthrough and link analysis as ways of refining search results. Focuses on the launch of Direct Hit, IBM's Clever and a former Ph.D project called Google.