The problem of finding and storing the Internet's never-ending content is fast becoming a task even Google can't keep up with. As Mike Grehan pointed out to me a few weeks ago, millions more pages are added to the Web than are indexed by the major search engines.
The three major engines have all been using a variation of link analysis to determine what's relevant and how content gets ranked in their indexes. But the news of the week has been Microsoft's research into BrowseRank, an analysis that factors in the time spent at a site or page in addition to the source and number of links to the content.
The old way was borrowed from the academic practice of giving weight to articles cited by other academic articles. Given that the Internet began as a way for academics to share information, this initial approach made sense. However, with the sheer volume of information now online, the Web has reached a point where such simple methods no longer work well. Add the ability to game the system and you have a methodology that needs an overhaul.
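To make the citation idea concrete, here is a minimal sketch of link-based scoring in the style of PageRank. It is a simplified illustration, not any engine's actual formula: the function name `link_scores`, the `damping` value, and the sample pages are all assumptions made for the example.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages by who links to them (hypothetical, simplified model).

    links: dict mapping each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_score = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * score[p] / len(targets)
                for t in targets:
                    new_score[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for q in pages:
                    new_score[q] += damping * score[p] / n
        score = new_score
    return score


# Made-up example: three pages "cite" page c, and c cites page a.
example_links = {"a": ["c"], "b": ["c"], "d": ["c"], "c": ["a"]}
example_scores = link_scores(example_links)
```

In this toy graph, page `c` ends up with the highest score because three pages link to it. It also shows why the method can be gamed: anyone who controls pages `b` and `d` can raise the score of `c` simply by adding links.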
Will Microsoft's BrowseRank be the answer? That remains to be seen, but it brings its own problems. As Navneet Kaushal of SearchNewz points out, "As BrowseRank takes into account the time spent by a user on a particular website while compiling its data, it then becomes obvious that it highlights a lot of social networking websites. However, the issues with such websites is that the content of these websites isn't generally valuable or relevant to audiences at large. This factor makes the BrowseRank ineffective, as it could lead to a lot of results that are irrelevant, spam, or both."
The Microsoft paper explains the approach this way: "the user browsing graph can more precisely represent the web surfer's random walk process, and thus is more useful for calculating page importance. The more visits of the page made by the users and the longer time periods spent by the users on the page, the more likely the page is important. With this graph, we can leverage hundreds of millions of users' implicit voting on page importance."
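A rough sketch of the idea in that quote: build a graph from observed browsing sessions, estimate how often a random surfer following real user transitions lands on each page, then weight that by the average time spent there. This is a simplified illustration under stated assumptions, not Microsoft's implementation. The function name `browse_scores`, the `damping` reset probability, and the session data are all hypothetical.

```python
from collections import defaultdict


def browse_scores(sessions, damping=0.85, iterations=50):
    """Score pages from browsing sessions (hypothetical, simplified model).

    sessions: list of sessions, each a list of (page, seconds_on_page) tuples
    in the order the user visited them.
    """
    transitions = defaultdict(lambda: defaultdict(int))  # page -> next page -> count
    dwell_total = defaultdict(float)                     # page -> total seconds
    visits = defaultdict(int)                            # page -> number of visits

    for session in sessions:
        for i, (page, seconds) in enumerate(session):
            dwell_total[page] += seconds
            visits[page] += 1
            if i + 1 < len(session):
                transitions[page][session[i + 1][0]] += 1

    pages = sorted(visits)
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    # Power iteration over the observed user transitions.
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            out = transitions.get(p)
            total = sum(out.values()) if out else 0
            if total == 0:
                # Sessions always ended here: restart anywhere with equal chance.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q, count in out.items():
                    new_rank[q] += damping * rank[p] * count / total
        rank = new_rank

    # Weight visit frequency by mean time on page, then normalize.
    weighted = {p: rank[p] * dwell_total[p] / visits[p] for p in pages}
    norm = sum(weighted.values())
    if norm == 0:
        return rank
    return {p: w / norm for p, w in weighted.items()}


# Made-up sessions: a quick home page, a long article, a fast checkout.
example_sessions = [
    [("home", 5), ("article", 120), ("checkout", 20)],
    [("home", 4), ("article", 90)],
    [("home", 6), ("checkout", 15)],
]
example_scores = browse_scores(example_sessions)
```

With this toy data the long-read article page scores highest, while the home page, which every session starts on but nobody lingers on, scores lowest. That is the behavior the quote describes, and it is also the source of the concern that sites designed to hold attention will float to the top.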
The biggest question is how Microsoft would get the time information. Sites would need to carry a tracking pixel or some other way of passing this information to Microsoft. The push to have sites grant this level of access has been around for a while and hasn't met with much support.
My conversation with Grehan turned to the use of trust -- not directly tied to the TrustRank that Google has been dropping hints about -- but rather the combination of social networks and the ranking of Web sites. When a friend tells you about a site, you're more likely to visit it because you trust that friend's judgment. Building out communities that have their own ranked access to various places on the Web could well be the way of the future.
Everyone is trying to improve the quality of the information presented by the engines. Microsoft may have some short-term success, but BrowseRank doesn't appear to offer the answer needed for our rapidly multiplying information sources. A communal effort that combines social bookmarking with the ability to search through those bookmarks may grow more popular if engine results keep getting less relevant.
Kevin Newcomb Fires Back
It's good to see Microsoft doing some innovating around search, but I agree with you, Frank. BrowseRank, in its current iteration and used on its own, doesn't seem to be the answer. But it is a good step in the right direction. I do see a few obvious flaws, including the data-gathering methods you described.
In Microsoft's paper, the authors say that PageRank, which relies on links, is "not a very reliable data source, because hyperlinks on the Web can be easily added or deleted by Web content creators." I fail to see how user behavior can be any more reliable, as that can be gamed as well, just in different ways. Instead of buying links, we'll see spammers employing legions of low-cost Web surfer "farms" to spend time on their sites, for example.
Besides the ability to game BrowseRank, I'd also argue that it's not an entirely fair way to measure quality in the first place. While a content site may strive to keep readers on the site for as long as possible, a transactional site may focus on getting users to complete a transaction quickly. And a landing page that's mainly navigational may suffer from its efficiency in directing users to the proper page.
As you mentioned, the experimental results show that social network sites like MySpace and Facebook scored especially high when whole sites, rather than individual pages, were scored. Will they be the new Wikipedia, dominating the results for every search, no matter how irrelevant?