Search Engine Algorithms & Research

As a searcher or search engine optimization specialist, do you really need to understand the algorithms and technologies that power search engines? Absolutely, said a panel of experts at a recent Search Engine Strategies conference.

A special report from the Search Engine Strategies conference, February 28-March 3, 2005, New York, NY.

The Search Engine Algorithm and Research panel featured Rahul Lahiri, Vice President of Product Management and Search Technology at Ask Jeeves, Mike Grehan, CEO of Smart Interactive (recently acquired by webSourced), and Dr. Edel Garcia from Mi Islita.com.

What’s the fuss all about?

“Do we really need to know all this scientific stuff about search engines?” asked Grehan. “Yes!” he answered unequivocally and proceeded to explain the practical competitive edge you gain when you understand search algorithm functions.

“If you know what ranks one document higher than another, you can strategically optimize and better serve your clients. And if your client asks, ‘Why is my competitor always in the top 20 and I’m not? How do search engines work?’ and you answer, ‘I don’t know, they just do,’ how long do you think you’re going to keep this account?”

Grehan illustrated his point by quoting Brian Pinkerton, who developed the first full-text retrieval search engine back in 1994. “Picture this,” he explained. “A customer walks into a huge travel outfitters store, stocked with every type of item for vacations anywhere in the world, looks at the guy who works there, and blurts out, ‘Travel.’ Now where’s that sales clerk supposed to begin?”

Search engine users want to achieve their goals with minimum cognitive load and maximum enjoyment. They don’t think carefully when entering queries; they type imprecise three-word searches and haven’t learned proper query formulation. This makes the search engine’s job more difficult.

Heuristics, abundance problems & the evolution of algorithms

Grehan went on to explain the important role that heuristics play in ranking documents. “A fascinating combination of things come together to produce a rank. We need to understand as much as we possibly can, so at least when we’re talking about what ranks one document higher than another, we have some indication about what is actually happening.”

Grehan described the progression of search algorithms over time. In early search engines, text was extremely important. But then search researcher Jon Kleinberg identified what he termed “the abundance problem,” which occurs when a query returns millions of pages that all contain the appropriate text. A search on the term “digital cameras,” for example, will return millions of pages. How do you know which are the most important or authoritative pages? How does a search engine decide which one becomes the listing that comes to the top? Search engine algorithms had to evolve in complexity to handle this over-abundance.

Social network theory & bibliometrics

The evolution of search algorithms incorporated principles from the social sciences and bibliometrics into document ranking. Grehan explained how social network theory considers the connectivity between things, and search engine algorithms take this into account, evaluating the connections between sites.

Bibliometrics can be fairly easily understood through the example of citation analysis, the practice in which an author cites influential or central papers as resources or references for their own work. This happens in the scientific community, where authors cite the work of other scientists in their papers and reports. A citation is the equivalent of a vote that says, “This guy over here is an expert. I read his stuff and then I was able to prepare my own.”

Similarly, search engines use the wisdom of bibliometrics, but in a slightly different way than humans do. Search engines look at the web as a graph: pages are the nodes, and the links between them connect those nodes. There are two distinct types of linkage: the directed edge and co-citation. A directed edge is fairly simple: we see one when web site A links to web site B. It’s a direct link. But because search engines view the web as one huge graph, they also get an alternate view called co-citation. Here page C points to both A and B, but A and B do not link to each other. Through its algorithms, a search engine understands that even though there isn’t a direct link between A and B, there is still a connection between the two sites.
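As a rough illustration (the tiny link graph below is an assumed example, not real crawl data), co-citation can be read straight off the link graph: two pages count as co-cited whenever some third page links to both of them.

```python
# Rough illustration of co-citation: A and B are co-cited when some page C
# links to both, even though A and B never link to each other.
# The tiny link graph is an assumed example.
from itertools import combinations

links = {
    "C": ["A", "B"],   # C cites both A and B
    "A": [],
    "B": [],
}

co_cited = set()
for source, targets in links.items():
    for a, b in combinations(sorted(set(targets)), 2):
        co_cited.add((a, b))

print(co_cited)  # {('A', 'B')} -- a connection the graph reveals without a direct link
```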

PageRank & HITS algorithms

Grehan said that there are two main algorithms based on links – PageRank and Hyperlink-Induced Topic Search (HITS). Google’s PageRank measures the prestige of a page independent of any query. Thus, PageRank is keyword independent, meaning all web pages already have a ranking before any searcher enters a query.
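As a loose sketch of the idea (the damping factor and toy link graph here are illustrative assumptions, not Google's actual parameters or data), PageRank can be approximated by repeatedly letting each page pass a share of its score along its outbound links:

```python
# Minimal PageRank sketch: iterate until each page's score stabilizes.
# The toy link graph and damping factor (0.85) are illustrative assumptions.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to; every page appears as a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: spread its score evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share      # each link passes on a share of prestige
        rank = new_rank
    return rank

# Scores exist before any query is entered -- PageRank is keyword independent.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_web))
```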

This is in direct contrast with the HITS algorithm. HITS is keyword dependent. When someone does a query on an engine using the HITS algorithm (Teoma, for example), the process is different. If the user enters “blue widgets” into the query box, the HITS algorithm pulls input from the community of “blue widgets” pages and then determines page rankings on the fly. Acquiring input from communities of people is closer to the real world. In real life, humans are split into different niches, communities, religions, groups, clubs, and so on. HITS tries to replicate those communities and extract input from them in a meaningful manner.
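A rough sketch of the hub/authority iteration at the heart of HITS might look like the following; the query-specific subgraph is an assumed toy example, since a real engine would build it from the pages matching the query:

```python
# Rough HITS sketch: hub and authority scores reinforce each other over a
# query-specific subgraph. The toy subgraph below is an assumption for
# illustration; a real engine builds it from pages matching the query.
import math

def hits(links, iterations=50):
    """links maps each page to the pages it links to (the query subgraph)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority comes from the hubs that point to it.
        auth = {p: sum(hub[q] for q, targets in links.items() if p in targets) for p in pages}
        # A page's hub score comes from the authorities it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth

subgraph = {"hub1": ["siteA", "siteB"], "hub2": ["siteA"], "siteA": [], "siteB": []}
hubs, authorities = hits(subgraph)
print(authorities)  # siteA should score highest: two hubs point to it
```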

A theory with holes

The problem, Grehan said, is that neither PageRank nor HITS actually works. At the very beginning, theoretically at least, they were great. Put into practice, they were full of problems. Grehan joked that the problem with PageRank is that no one, not even Google, uses it. He described HITS as suffering from “topic drift,” “nepotistic linking (spam),” and a “runtime analysis” problem. The runtime problem has been significant: Grehan said a team at IBM got HITS to return a search . . . in 11 minutes! Fortunately for Teoma, Apostolos Gerasoulis cracked the runtime problem and found a way to return the search in less than one second. That technology is now used at Ask Jeeves and Teoma.

Hubs & authorities

The concept of hubs and authorities is key to many search algorithms. Authorities are web pages with good content on a specific topic. Hubs are pages that point to many other pages, such as directories with many hyperlinks to different pages. Grehan explained that hubs and authorities can be mutually reinforcing: a good hub can be a good authority, and a good authority can be a good hub. Search engines now take this concept a step further by using subject-specific popularity. For example, Teoma ranks a site not on popularity alone, but rather on the number of same-subject pages that reference it.

Hilltop, local rank, Florida, & the valley beyond

Grehan then touched on another famous algorithm called Hilltop. Hilltop is based on HITS. It is locally based and relies on link anchor text and the text surrounding it. Krishna Bharat, one of the scientists involved with Hilltop’s development, was hired by Google. As part of search pioneering and claim staking, Grehan commented, Google later filed a new algorithm patent known as Local Rank that was based on Bharat’s work with Hilltop.

In November 2003 Google made a dramatic shift in its algorithm. Grehan believes that this shift, known as “Florida” to members of the search marketing community, was a move by Google from keyword-independent algorithms toward keyword dependence.

Insights from Ask Jeeves

Ask Jeeves is the seventh-ranked property on the web and the number four search engine, according to Rahul Lahiri of Ask Jeeves. Lahiri described a number of components that are key to Ask Jeeves’ search algorithms, including index size, freshness of content, and data structure. Ask Jeeves’ focus on the structure of data is unique and differentiates its approach from other engines, he said.

There are two key drivers in web search: content analysis and linkage analysis. Lahiri confirmed that Ask Jeeves looks at the web as a graph and examines the link relationships between pages, attempting to map clusters of related information.

By breaking down the web into different communities of information, Ask Jeeves can rely on the “knowledge” from authorities in each community to better understand a query and present more on-topic results to the searcher. If you have a smaller site, but one that is very relevant within your community, your site may rank higher than some larger sites that provide relevant information but are not part of the community.

Why co-occurrence is important

Dr. Edel Garcia was delayed and not able to be physically present at the panel, but had prepared a PowerPoint presentation with audio narration. Moderator Chris Sherman told everyone to pretend Dr. Garcia was “channeling” through him and presented in his stead.

Dr. Garcia is a scientist with a special interest in Artificial Intelligence and Information Retrieval. He explained that terms that co-occur more frequently tend to be related or “connected.” Furthermore, semantic associations affect the way we think of a term. When we see the term “aloha” we think of “Hawaii” because of the semantic associations between the terms. Co-occurrence theory, according to Garcia, can be used to understand semantic associations between terms, brands, products, services, etc.

Dr. Garcia then posed a question. Why should we care about term associations in a search engine? His answer: Think about keyword-brand associations. This has powerful implications for search marketing.

Co-occurrence sources & uses

Next Garcia introduced “c-indexes” or co-occurrence indices. He developed this method to quantify the degree of relatedness between different terms. There are three main types of sources of co-occurrence:

1. Global (databases, collections)
2. Local (answer sets, individual documents)
3. Fractal (word distributions, passage segmentation)

Dr. Garcia then offered a very simple example using two terms (k1 and k2). Think of a Venn diagram with two overlapping circles. The circles, with sizes n1 and n2, represent the number of search results containing k1 and k2 respectively. The overlapping area, n12, represents the number of results containing both k1 and k2. Garcia’s correlation index c is then c = n12 / (n1 + n2 − n12).
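A quick worked sketch of that formula, using made-up result counts purely for illustration:

```python
# c-index sketch using Garcia's formula c = n12 / (n1 + n2 - n12).
# The result counts below are made-up numbers for illustration.

def c_index(n1, n2, n12):
    """n1, n2: result counts for k1 and k2; n12: results containing both terms."""
    return n12 / (n1 + n2 - n12)

# e.g. k1 = "aloha", k2 = "hawaii" with hypothetical counts
n1, n2, n12 = 5_000_000, 20_000_000, 3_000_000
print(round(c_index(n1, n2, n12), 3))  # closer to 1 means more strongly related terms
```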

Garcia believes SEOs can use the c-index to properly identify semantically connected terms from a pool of candidate terms for a given database or search engine. (C-values can be computed with a simple calculator.) Garcia also showed how c-indices can be used to monitor keyword trends, word patterns, and topics over time. C-indices can reveal information about terms, including their competitiveness, their temporal trends, and even whether they are overused (spam). Garcia said his research has revealed that spam-related queries have extremely high c-indices.

Dr. Garcia then explained what he called EF-ratios. The “E” in EF stands for Exact and the “F” stands for Findall. Findall is the default mode for most search engines: results may contain the query terms in any order. Exact mode returns only results in which the terms appear in the prescribed, exact order. Garcia explained that EF-ratios can be used to estimate the relative frequency of natural sequences and phrases in a source. He said they can also be used to examine how easy or difficult it would be to rank for a given sequence in a search engine.
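A minimal sketch of that ratio, with hypothetical counts standing in for the Exact and Findall result totals an engine would actually report:

```python
# EF-ratio sketch: exact-phrase result count over findall (any-order) result count.
# The counts below are hypothetical; in practice they come from the engine's
# phrase ("...") query and its plain query for the same terms.

def ef_ratio(exact_count, findall_count):
    return exact_count / findall_count

# e.g. "digital cameras" as an exact phrase vs. the same words in any order
exact, findall = 8_000_000, 25_000_000
print(round(ef_ratio(exact, findall), 2))  # higher ratio: terms usually occur as a natural phrase
```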

EF-ratios can be used to identify and test natural term sequences, examine rank feasibility for a given sequence, and understand information retrieval behaviors of search engines. He successfully conveyed the necessity of understanding the mathematics behind search in order to really serve your search marketing clients.

For more information on Dr. Garcia’s theories, check out the Search Engine Watch forum thread Keywords Co-occurrence and Semantic Connectivity.

The panel ended with a lively Q&A session. Where is the evolution of the search algorithm headed? Grehan had a ready answer: he expects the introduction of probabilistic latent semantic indexing and probabilistic hypertext-induced topic search. What do those mouthfuls of jargon mean? You’ll have to attend the next SES to find out.

Christine Churchill is President of KeyRelevance.com, a full service search engine marketing firm offering organic search engine optimization, strategic link building, usability testing, and pay per click management.
