The second issue of Google's Newsletter for Librarians is now available. It features an article by Karen Schneider, the director of the Librarians' Internet Index, the wonderful and important searchable directory of high-quality web resources that I've mentioned on the blog and in SearchDay many times.
Schneider focuses on some of the critical information judgments needed to determine the trustworthiness of a site and the information it contains. Those of us who attended library school are aware of many of these concepts. I hope Karen's article reaches beyond information professionals to other audiences, including students, since these ideas should be taught and reinforced from the earliest grades forward.
Next, Matt "Jagger" Cutts is back with a look at how Google determines what sites are "most trusted." His article talks about the hundreds of factors (including some traditional information retrieval metrics) that Google looks at in addition to PageRank.
For a more in-depth discussion of this, you might want to pick up a copy of Chris Sherman's (yes, SearchDay's Chris Sherman) book, Google Power. You can preview the title via Amazon's Search Inside the Book. I was unable to find it using Google Book Search.
Keep in mind that Matt's article was written primarily for librarians and other information professionals. He explains that Google, like other engines, analyzes the actual content.
He points out that, "this [analysis] goes beyond scanning page-based text, which webmasters can easily manipulate through meta-tags."
While it's true that Google and other engines look to some degree at the meta-description tag, he doesn't mention that although the meta-keyword tag is still used by some, its value is not as great as it once was. Danny points this fact out in a 2002 article. You'll also find meta tags listed in this post from Barry.
Cutts goes on to write:
We also look at factors like fonts and the placement of words on a page. And we examine the content of neighboring pages, which can provide more clues as to whether the page we're looking at is trusted and will be relevant to users.
It would have been useful, particularly to the readers of this article, if Matt had explained that the factors listed above, and many others, can also be manipulated or, as others have termed it, "gamed."
As I've pointed out in many presentations to librarians, this is not a good or bad thing but simply the way large general-purpose web engines work. For the librarian, a knowledge and understanding of this is important and useful.
After reading both Karen's article and Matt's piece, we see something of a disconnect between trustworthiness in terms of inclusion and good placement on a results page and the trustworthiness concepts that a human might use to judge not only the quality of a web page itself but also the data it contains. Yes, I'll readily admit to being a bit prejudiced here, but I think Karen's article also illustrates the value of just one of the many skills a well-trained librarian can offer.
Matt concludes with links to a few more excellent papers.
Btw, many of the same concepts (what Google calls and has patented as PageRank) are in place at just about every other major web engine. In other places, the concept is referred to as link analysis.
As a librarian, I would have loved it if Matt had thrown a "shout out" to Dr. Eugene Garfield, the father of citation analysis. Citation analysis has been around since the 1950s, and librarians have been using it since day one. The relationship between citation analysis (something librarians understand) and link analysis (PageRank) is strong and is even noted in Brin and Page's seminal paper. One of the biggest differences is that web link analysis is much more open than traditional citation analysis and thereby harder to game (although to some degree it's also possible).
Yes, the concepts used in citation analysis are really what drive link analysis.
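For readers who like to see the idea in action, here is a minimal sketch of link analysis in the spirit of PageRank: a page's score depends on the scores of the pages that link (or "cite") to it. This is only a toy illustration, not how Google or any other engine actually computes rankings. The four page names and their links are made up, and the damping value of 0.85 and the iteration count are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration sketch of link analysis.

    `links` maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal scores

    for _ in range(iterations):
        # Every page gets a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it "cites".
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B, and D all link to C; C links back to A.
toy_links = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy web, C ends up with the highest score because three pages link to it, and A comes next because its single incoming link is from the heavily "cited" C. That is the same intuition behind citation analysis: a citation from a frequently cited paper counts for more.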
If you want to learn more, this post has tons of links and interviews about citation analysis. It also includes a link to Garfield's paper, "Citation Indexes for Science: A New Dimension in Documentation through Association of Ideas."
Finally, although this Scientific American article was written in 1999, I still think it's one of the best, especially for non-geeks, about web link analysis. It was written by members of IBM's Clever team.
Clever was a web search engine (never publicly released) developed by IBM. More about it here. The list of Clever team members reads like a "who's who" of web search, including Jon Kleinberg, Soumen Chakrabarti, and Prabhakar Raghavan, who is now the head of Yahoo Research.
As you review the article, take special note of the section where Clever and Google are compared. While Clever never made a public appearance, many of the concepts it offers are what power the Teoma/Ask Jeeves search technology.
Yahoo's Prabhakar Raghavan offers archived materials from his Stanford classes on text and information retrieval online. Must-have content for those interested in the subject.