Search Engine Link Popularity: Winners Don't Take All

Though a small number of sites get the majority of inbound links and traffic, a new study reveals a previously unknown pattern of web page connectivity and shows how new, poorly connected sites can compete.

Over the past several years, search engines have increasingly relied on "link analysis" in addition to traditional information retrieval techniques to help determine relevance. Whereas traditional techniques look at factors like textual, semantic, and conceptual relationships between pages and queries, link analysis attempts to understand the structure of the web and the "social networks" that are formed when people link web pages to one another.

Google's PageRank algorithm is probably the best known (but least understood) approach to link analysis. In simple terms, PageRank counts the inbound links to a page and weights each link by the "reputation" of the page providing it, a reputation that is itself determined by the links that page receives. The result is the PageRank, and Google calculates this value for every page in its index.
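The idea of weighting each inbound link by the reputation of the linking page can be sketched in a few lines of Python. This is a minimal, textbook-style illustration and not Google's production system: the function name `pagerank`, the toy graph `example_links`, the iteration count, and the damping value of 0.85 are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to about 1.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally "important".
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
example_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page c collects three inbound links and ranks highest, while page a outranks b and d on the strength of a single link from the well-regarded c. That is the sense in which the reputation of the linking page matters, and not only the raw link count.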

Google's increasing popularity and importance, as well as the adoption of link analysis techniques by other search engines, have led many webmasters to aggressively seek out links from other web sites, hoping to boost their link popularity and thereby their position in Google and other search engine results.

The problem is that PageRank and other link analysis approaches tend to work well for large, well-established sites that the web community has overwhelmingly "voted" for by creating links to them. Link analysis techniques work less well for small, obscure, or, especially, new sites with few links pointing to them.

For this reason, Google and other search engines do many other things beyond and apart from link analysis to calculate relevance -- a fact many webmasters lose sight of in their frenzy to gain as many inbound links to their site as possible. Efforts at "link spamming" are easily detected, and search engines are increasingly penalizing sites that are caught attempting to artificially inflate their link popularity.

The good news is that there's hope for smaller, less well connected sites, according to "Winners Don't Take All: Characterizing the Competition for Links on the Web," a new study from computer scientists at the NEC Research Institute who have spent years studying the structure and size of the web.

The researchers found that the distribution of links within specific categories doesn't follow the same pattern as the overall web. Previous studies have shown that the distribution of links to web sites approximates a "power law," where a tiny fraction of sites receive a hugely disproportionate share of links, and the vast majority of sites are essentially ignored.
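In symbols, a power law for inbound links is usually written as shown here, where P(k) is the fraction of sites with k inbound links and the exponent \gamma is a positive constant. The notation is a standard convention and is not quoted from the study.

```latex
P(k) \propto k^{-\gamma}, \qquad \gamma > 0
\quad\Longrightarrow\quad
\frac{P(2k)}{P(k)} = 2^{-\gamma}
```

Each doubling of the link count therefore cuts the share of sites by the same factor, which is why a handful of heavily linked sites sit alongside a long tail of sparsely linked ones.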

The new study found that the "rich get richer" phenomenon enjoyed by large, popular web sites varies significantly across different categories and within online communities.

The scientists examined network structures for several subcategories of the web, including university homepages, newspaper homepages, and scientist homepages, as well as several e-commerce categories such as publications, consumer electronics, entertainment, sports, and photographers. They found that the degree of "winners take all" behavior varied greatly.

The distribution was closest to a pure power law ("winners take all") for companies, newspapers, and publications. In contrast, the distribution for universities, scientists, and photographers was much less biased -- individual members of these communities fared much better.
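The contrast between a near-pure power law and a less biased distribution can be illustrated with a toy simulation. The sketch below is an assumption made for illustration and is not claimed to be the study's own model: each new site adds one link, choosing its target either in proportion to existing popularity ("rich get richer") or uniformly at random. The names `simulate_link_growth`, `top_share`, and the `preferential` knob are hypothetical.

```python
import random


def simulate_link_growth(num_sites, preferential, seed=0):
    """Grow a toy web one site at a time; return inbound link counts.

    preferential: probability that a new site picks its link target in
    proportion to popularity; otherwise it picks uniformly at random.
    """
    rng = random.Random(seed)
    # Two seed sites that link to each other.
    inbound = [1, 1]
    # Each site appears in `tickets` once for existing and once per inbound
    # link, so a uniform draw from it favors already-popular sites.
    tickets = [0, 1, 0, 1]
    for new_site in range(2, num_sites):
        if rng.random() < preferential:
            target = rng.choice(tickets)      # rich get richer
        else:
            target = rng.randrange(new_site)  # every existing site equal
        inbound[target] += 1
        tickets.append(target)
        # Register the new site, which starts with no inbound links.
        inbound.append(0)
        tickets.append(new_site)
    return inbound


def top_share(inbound, fraction=0.01):
    """Share of all inbound links held by the top `fraction` of sites."""
    ranked = sorted(inbound, reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)


winner_heavy = top_share(simulate_link_growth(5000, preferential=1.0))
more_even = top_share(simulate_link_growth(5000, preferential=0.0))
print("top 1% share, popularity-driven linking:", round(winner_heavy, 3))
print("top 1% share, uniform linking:", round(more_even, 3))
```

Lowering `preferential` moves the simulated community away from "winners take all" and toward the flatter pattern the article describes for universities, scientists, and photographers.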

"We were surprised to find such drastic differences for competition within individual communities, as compared to competition viewed across the entire web," said Dr. David Pennock of NEC Research Institute, the study's lead author.

Another interesting finding was that these network structures bear strong resemblances to real-world networks, including research paper citations, movie actor collaborations, and U.S. power grid connections.

Pennock noted that the dynamics of information dissemination online have the potential to alter competition and diversity in commerce and society, and that an increasing percentage of commerce and communication is occurring on the web over time.

What are the implications of the study for searchers and webmasters? For searchers, the key point seems to be to stay alert to an ongoing potential "narrowing" of search results, as well-connected sites become increasingly dominant over time. If you want to broaden your results to incorporate smaller, less connected sites, use a search engine like Teoma that actively seeks "native" communities relevant to your query.

For webmasters, the implications are less clear, though one point seems certain -- spend less time working on your overall, global link popularity, and focus more on building up strong connections in the natural "community" of sites that share a similar focus to your own.

While this tactic of "sleeping with the enemy" might seem dubious in the real world, the NEC study strongly suggests that it's one of the best ways to strengthen your presence in the universe of the web.

As you might imagine, the paper is dense with complex formulae and nearly impenetrable mathspeak. But it's well worth a read for anyone interested in how the structure of the web itself ends up affecting our search results, and what webmasters can do to compete against the "big guys" who tend to dominate search results and web page traffic.

Winners Don't Take All
by David M. Pennock, Gary W. Flake, Steve Lawrence, Eric J. Glover and C. Lee Giles

An abstract and summary of the paper published in the Proceedings of the National Academy of Sciences, Volume 99, Issue 8, pp. 5207-5211, April 2002. This page also includes a link to download a PDF version of the complete paper.

Self-Organization and Identification of Web Communities

A recent paper by several of the same authors, discussing the web's self-organization characteristics and how link structure alone (without any textual analysis) can reveal communities of highly related information on the web.
