
Danny & Tristan Talk About Link Counts, Site Counts & Index Auditing


I've been having a series of email conversations with Tristan Louis, who has been trying to understand how well "A-List Bloggers" do in the major search engines, based on links. The problem, I've been explaining, is that links and various other counts don't paint the picture you might expect. With his permission, I'm sharing much of our correspondence below. It's not as nice and neat as writing this all up as an article, but time doesn't allow for that. Hopefully you'll find it interesting to see all the complexities involved and why it's difficult to draw any conclusions. Remember, this was all email, so neither of us was particularly watching punctuation, capitalization or spelling!

June 14

Danny: Saw the article on Technorati versus Google links. Here's the problem. Google's only showing you a small percentage of the actual links it knows about. Go do a link count on Yahoo, and you'll see the link numbers are much larger than what Google shows you. Google does this on purpose to thwart search marketers looking to mine search data. As a result, your link percentages for Technorati are much higher than what Google really knows.

Tristan: Good point... That explains why my data set on Yahoo! seems so wildly out of proportion (that's for my next entry ;) )

June 21

Tristan: I thought the following might interest you and even possibly your readers: Technorati, Yahoo and Google Too. From the article:

In the last entry on the subject, we took a look at how Technorati and Google compared. From there, we discovered that Technorati was getting roughly a fourth of the links Google could locate. Which brought up some interesting questions: could we rely on the Google numbers? Were they so much larger than any other search engine that we were building an unfair comparison? And, as some alert readers pointed out in email, was Google under-reporting the number of links to a site? In order to answer some of those questions, I decided to build some more comparisons. So I decided to take a look at some of Google's competitors. Today, I'll go into how Yahoo! fared (Hint: I was surprised by the results).

Danny: Tristan, I don't really see how you are making your conclusions at the end from this. It's probably easiest if I run down the conclusions.

Yahoo! generally does a better job at indexing the blogosphere than Google does. We know they have been working hard to improve their index and here's proof that they are getting results

What proof? You are counting links that Yahoo has and comparing to Google links, which you know aren't all the links that Google knows about. So all the math is meaningless. Google may have MORE links than Yahoo, but you can't tell this.

Much more important, number of links does not equal number of pages indexed. If you want to measure indexing, you have to do a site: search, such as:

http://search.yahoo.com/search?p=site%3Aboingboing.net
http://www.google.com/search?q=site%3Aboingboing.net

In that, you see Yahoo has 137,000 pages indexed versus Google's 71,000 pages. On the face of that data, Yahoo seems to go twice as deep. But:

http://search.yahoo.com/search?p=site%3Awww.wilwheaton.net
http://www.google.com/search?q=site%3Awww.wilwheaton.net

Now Google isn't as far behind.
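Gathering these site: counts by hand gets tedious across many sites. As a rough sketch (there's no official API here -- at the time, the engines' result pages simply contained a phrase like "Results 1 - 10 of about 137,000"), the reported estimate can be pulled out of the page text with a regular expression:

```python
import re

def extract_result_count(html):
    """Pull the 'of about N' estimate out of a results page.

    Assumes the page contains a phrase like 'of about 137,000' --
    the exact wording varies by engine and changes over time.
    """
    match = re.search(r"of about ([\d,]+)", html)
    if match:
        return int(match.group(1).replace(",", ""))
    return None

# Compare two engines' reported estimates for the same site: query.
yahoo_snippet = "Results 1 - 10 of about 137,000 for site:boingboing.net"
google_snippet = "Results 1 - 10 of about 71,000 from boingboing.net"
print(extract_result_count(yahoo_snippet))
print(extract_result_count(google_snippet))
```

Actually fetching result pages programmatically is subject to each engine's terms of service, and of course the number you extract is only the engine's estimate, which is the whole problem being discussed here.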

More important, what's getting indexed? Could Yahoo be indexing the same page over and over but under slightly different URLs? Could Google? These types of issues plague making use of search counts to prove anything.
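One crude way to probe the duplicate-URL question yourself is to canonicalize URLs before counting them. The sketch below is a heuristic of my own, not what either engine actually does: it collapses a few common variants (www prefix, trailing slash, default index file, query strings) that often point at the same page.

```python
from urllib.parse import urlsplit

def canonical(url):
    """Collapse common URL variants that often point at one page.

    Heuristic only: strips 'www.', trailing slashes, default index
    files, and the query string. Real engines compare content too.
    """
    parts = urlsplit(url.lower())
    host = parts.netloc.removeprefix("www.")
    path = parts.path.rstrip("/")
    if path.endswith(("/index.html", "/index.htm")):
        path = path.rsplit("/", 1)[0]
    return f"{host}{path}"

urls = [
    "http://boingboing.net/page1.html",
    "http://www.boingboing.net/page1.html",
    "http://boingboing.net/page1.html?ref=rss",
    "http://boingboing.net/dir/",
    "http://boingboing.net/dir/index.html",
]
unique = {canonical(u) for u in urls}
print(len(urls), "raw URLs ->", len(unique), "canonical pages")
```

Five raw URLs collapse to two pages here -- exactly the kind of inflation that makes raw index counts hard to trust.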

Even if Google is the one with the motto about not doing evil, Yahoo! seems to be the one interested in giving equal opportunity to the little guy: smaller blogs seem to have a better chance of being recognized by Yahoo! than they do of being recognized by Google

That means nothing, either. Let's say Google really does only index say half the pages that Yahoo does. Now when you do a search, does Google still manage to bring the little blogs up while Yahoo doesn't? Google's come under accusations of being "blog clogged" in the past, and now you're suggesting that it's almost unfair to blogs. On what searches? Having pages means nothing if those pages don't surface into the first page of results. Run comparative queries on something where you think a little blog should surface at both places and see what happens.

While the front page of Google advertises they are currently indexing over 8 billion pages, it is very difficult to find ways to support that claim via the link feature they are offering: this can be seen as confirmation that Google does not tell you about all the links it has in its index.

The link: command is completely different from the site: command. The link: command tells you nothing about the size of the index. As for confirmation that all links aren't reported, this past blog post from SEW gives you confirmation, and this page on Google mentions that links are only a sampling of what Google knows, although this other Google page fails to make this clear.

As for confirming the number on the page, how much time have you got to go into issues about that? Start here to understand the minefield of search index sizes.

Sure volume counts but in the case of search indexes, they may count against sites: if one is less likely to appear in Google than it is to appear in Yahoo! and the Google index is much larger than the Yahoo! one, then, if Yahoo! and Google had the same amount of traffic, a single blog could find itself receiving more traffic from Yahoo! than it does from Google. This would be due to the fact that each individual page in Yahoo! has more weight than it does in Google.

So for all we know, Yahoo has as many pages as Google, or more. They don't report the number, though we do have one recent estimate that puts them a bit lower. And the idea that more pages means less visibility would only hold if ranking were purely random. It's not. What's going to rank well depends on the number of pages on a particular topic, plus the linking data and, in particular, whether the links are relevant in terms of anchor text. It's not just a pure popularity play.

I applaud what you are trying to do. It's just that it's difficult to draw any conclusions from what I've seen presented.

Tristan: Thanks for the feedback... Let's go through a little polemic on this... (Tristan quotes what I've written above, which I've shown as italicized indented text, then poses new questions).

What proof? You are counting links that Yahoo has and comparing to Google links, which you know aren't all the links that Google knows  about. So all the math is meaningless. Google may have MORE links than Yahoo, but you can't tell this.

A quick question here: if they do have more links, why are they not advertising them? It seems odd that someone would claim to have more of something and, upon closer inspection, report less. It's as if I said that I had a billion visitors a month and, when someone examined my logs, they found only a few hundred thousand. Would you trust me if I then said, "well, you know, we don't report on all the visitors"?

More important, what's getting indexed? Could Yahoo be indexing the same page over and over but under slightly different URLs? Could Google? These types of issues plague making use of search counts to prove anything.

Important point here, though, is that there still seems to be a difference. Yahoo! does seem to index more pages in both of the cases you demonstrated. I understand that duplicates might be dropped, but shouldn't they at least be listed in the raw number? I mean, Google provides you with an option to see them (the "In order to show you the most relevant results, we have omitted some entries very similar to the 987 already displayed. If you like, you can repeat the search with the omitted results included." message).

So what do the numbers from Google mean? With omitted results included, I can't get past 1,000; without them, I can't get past 987... How do we know that the 9,500 pages number is correct?

Let's say Google really does only index say half the pages that Yahoo does. Now when you do a search, does Google still manage to bring the little blogs up while Yahoo doesn't? Google's come under accusations of being "blog clogged" in the past, and now you're suggesting that it's almost unfair to blogs. On what searches? Having pages means nothing if those pages don't surface into the first page of results. Run comparative queries on something where you think a little blog should surface at both places and see what happens.

Well, what I'm saying here is that Google may not be as blog clogged as Yahoo! is, if claims on the size of indexes are correct (remember we're all assuming the claims are correct... no one ever challenged that assumption until now)

As for confirming the number on the page, how much time have you got to go into issues about that? Start here to understand the minefield of search index sizes.

But since it's a major marketing tool (as in "our index is bigger"), shouldn't someone investigate this stuff? Maybe we need some audits of all the major search engines in order to see if the claims are correct.

What's going to rank well depends on the number of pages on a particular topic, plus the linking data and in particular whether the links are relevant in terms of anchor text. It's not just a pure popularity play.

I agree there are many factors in terms of rankings. However, wouldn't a page in an index of 100 pages have more of a chance (1/10) of appearing in the first 10 results (i.e., on the first page) than a page in an index of 1,000 pages (1/100 chance), all things being equal? So, starting from that, if the Google index is much larger, then the chances for a blog to appear on the front page are lower than they would be on Yahoo.

I applaud what you are trying to do. It's just that it's difficult to draw any conclusions from what I've seen presented.

No problem... Any kind of input is good. Basically, I managed to get a set of numbers and want to get other people to start playing with them (400 data points across 4 indexes (MSN is next in line) ). I can't help but feel like no one has actually attempted to do this kind of side by side mathematical comparison. I was hoping someone would and, when no one else went out and did it, I decided to undertake it.

Please provide information as to how to do this properly. Maybe someone will be able to then go and get the data in a way that's more in line with what you think is right (I'm a neophyte in that space and my blog is something I do for fun so a REAL analysis would be better :) ). I'd love to see an expert do an analysis on this (... and I wish there were an automated way to get to the data, it took me a long time to gather all the raw numbers :) )

Danny: (As above, I quoted parts of what Tristan was asking before responding. I've shown those quotes in italic font and indented.)

A quick question here: if they do have more links, why are they not advertising them? It seems odd that someone would claim to have more of something and, upon closer inspection, report less. It's as if I said that I had a billion visitors a month and, when someone examined my logs, they found only a few hundred thousand. Would you trust me if I then said, "well, you know, we don't report on all the visitors"?

OK, first, remember that the number of links isn't the same as the number of pages. Google knows about far more links to pages than actual pages it lists.

What does it advertise on the home page? Pages that it has indexed, not links it knows about. And no one is really suggesting that that number is super inflated. If anything, people tend to wonder if they are undercounting.

Now to links. Why aren't they showing all the links they know about? Because they fear site owners and marketers will take that data and figure out some way to better manipulate Google. It's also query-intensive to generate that data, and the command is run by relatively few people, so Google has no great interest in trying to do a great job there.

I'm not saying it's good that they do this, by the way. I think if you're going to offer a command, the command ought to work as you'd expect, and show everything. But the point is, you can't trust those numbers to do what you're trying to do.

Yahoo! does seem to index more pages in both of the cases you demonstrated. I understand that duplicates might be dropped, but shouldn't they at least be listed in the raw number? I mean, Google provides you with an option to see them (the "In order to show you the most relevant results, we have omitted some entries very similar to the 987 already displayed. If you like, you can repeat the search with the omitted results included." message).

Yes, in the two cases I checked. That's not enough to be confident of anything. If you do want to play the numbers game, investigate all 100 of the sites on the list.

So what do the numbers from Google mean? With omitted results included, I can't get past 1,000; without them, I can't get past 987... How do we know that the 9,500 pages number is correct?

We don't. And we don't necessarily for Yahoo. See the reference material I sent you, if you want to understand what a real challenge you're just dipping your toe into.

The best way to know is to find a site small enough that you can actually review all 1,000 or fewer pages that will be displayed, then literally count and see if duplicates and other junk are being eliminated. In lieu of that, you can go with the raw count figures and hope that this other stuff isn't going on.

Well, what I'm saying here is that Google may not be as blog clogged as Yahoo! is, if claims on the size of indexes are correct (remember we're all assuming the claims are correct... no one ever challenged that assumption until now)

The claims of blog clogging have come from the idea that blogs have better ranking power, not that they've got more pages indexed. That goes to the main point. Number of pages means little. What are you finding when you actually search? If I have 100,000 pages from your site and 10 from another, no great help to you if your 100,000 pages are deemed not good enough and never rank well.

But since it's a major marketing tool (as in "our index is bigger"), shouldn't someone investigate this stuff? Maybe we need some audits of all the major search engines in order to see if the claims are correct.

People have. My Search Engine Sizes page documents this type of stuff in great detail, efforts that have happened over the years. It's neither a new issue nor an easy one to solve. Here's a recent and fairly short summary of what's involved: Search Engine Size Wars V Erupts.

I agree there are many factors in terms of rankings. However, wouldn't a page in an index of 100 pages have more of a chance (1/10) of appearing in the first 10 results (i.e., on the first page) than a page in an index of 1,000 pages (1/100 chance), all things being equal? So, starting from that, if the Google index is much larger, then the chances for a blog to appear on the front page are lower than they would be on Yahoo.

Things are not equal. Search results are not like a lottery. Every page is different. Every page is going to be slightly better or worse for a particular query. Linkage data skews things even more. There is no level playing field out there, where just number of pages gives you a better chance. Yes, you have more actual chances, but it's still not a case that it will skew in your favor.
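The lottery math can be tested with a toy simulation (my own illustration, with made-up relevance scores, not any engine's actual algorithm): pages are ranked by score, and we estimate a given page's odds of making the top 10 as the index grows.

```python
import random

random.seed(42)

def front_page_odds(index_size, my_score, trials=300):
    """Chance 'my page' lands in the top 10 when results are
    ordered by a relevance score, not drawn at random."""
    hits = 0
    for _ in range(trials):
        rivals = [random.random() for _ in range(index_size - 1)]
        # Top 10 means fewer than 10 rivals outscore us.
        if sum(score > my_score for score in rivals) < 10:
            hits += 1
    return hits / trials

for size in (100, 1000, 5000):
    strong = front_page_odds(size, my_score=0.999)
    weak = front_page_odds(size, my_score=0.5)
    print(f"index={size:>5}: strong page {strong:.2f}, weak page {weak:.2f}")
```

Under the uniform "lottery" assumption the weak page would surface 10% of the time in the 100-page index and 1% of the time in the 1,000-page one; ranked by score, it surfaces essentially never in either, while the clearly strong page stays on the front page almost regardless of index size.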

No problem... Any kind of input is good. Basically, I managed to get a set of numbers and want to get other people to start playing with them (400 data points across 4 indexes (MSN is next in line) ). I can't help but feel like no one has actually attempted to do this kind of side by side mathematical comparison. I was hoping someone would and, when no one else went out and did it, I decided to undertake it.

People have to some degree, as the stuff I've previously sent points out.

Honestly, skip the numbers. It's the results. You want to measure how well blogs do on search engines, pick queries, do the searches and see who comes up. That's the very best test you can do.

NOTE TO READERS: I've put that last part of my correspondence in bold for a reason. It's probably the most important point in all of this. Look at queries, not counts, to measure how well things are working.

Tristan: (quoting Danny in responses, those quotes in italic, indented copy)

Remember that the number of links isn't the same as the number of pages. Google knows about far more links to pages than actual pages it lists.

That makes sense since most pages will generally have more than 1 link.

What does it advertise on the home page? Pages that it has indexed, not links it knows about. And no one is really suggesting that that number is super inflated. If anything, people tend to wonder if they are undercounting.

It's true. However, it would be nice to see an actual audit of those indexes to see what the numbers really are.

Now to links. Why aren't they showing all the links they know about? Because they fear site owners and marketers will take that data and figure out some way to better manipulate Google. It's also query-intensive to generate that data, and the command is run by relatively few people, so Google has no great interest in trying to do a great job there.

But what does it present as a # when I do a link: search? What is that number? That's the question I'm trying to pose (albeit maybe not clearly enough). When Google says "Results 1 - 10 of about XXXXXXXX linking to foo.com", what does that number mean? That data is being generated (whether it's query-intensive or not is the problem of the search engines: if they're going to display something, they'd better make sure it's correct). Furthermore, I don't think it would be any more intensive to show all the links (since each query returns only 10 to 100 results per page max): the processing cost should be about the same for each page anyway, since the tough part of the processing is the ordering, and that is already being done for the pages it shows.

I'm not saying it's good that they do this, by the way. I think if you're going to offer a command, the command ought to work as you'd expect, and show everything. But the point is, you can't trust those numbers to do what you're trying to do.

If you can't trust them, why are they even offering them, then? Wouldn't it make more sense for them not to display that info? I think there are quite a few people working at Google on the UI, and generally they do not throw information on that screen just because it looks pretty. So what is that number? If the agreement is that the number is meaningless, why is it there?

If you do want to play the numbers game, investigate all 100 of the sites on the list.

OK... let's try the top 5, then (you had boingboing, so I'm doing the other 4 :) )

Instapundit:
http://search.yahoo.com/search?p=site%3Ainstapundit.com&prssweb=Search&ei=UTF-8&fl=0&x=wrt
58,300
http://www.google.com/search?hl=en&lr=&biw=1024&q=site%3Ainstapundit.com&btnG=Search
80,300

Daily Kos:
http://search.yahoo.com/search?p=site%3ADailyKos.com&prssweb=Search&ei=UTF-8&fl=0&x=wrt
19,000
http://www.google.com/search?hl=en&lr=&biw=1024&q=site%3ADailyKos.com&btnG=Search
682,000

Gizmodo:
http://search.yahoo.com/search?p=site%3AGizmodo.com&prssweb=Search&ei=UTF-8&fl=0&x=wrt
195,000
http://www.google.com/search?hl=en&lr=&biw=1024&q=site%3AGizmodo.com&btnG=Search
38,100

Fark:
http://search.yahoo.com/search?p=site%3AFark.com&prssweb=Search&ei=UTF-8&fl=0&x=wrt
1,940
http://www.google.com/search?hl=en&lr=&biw=1024&q=site%3AFark.com&btnG=Search
1,030,000

Hmmmm.... Looks like I'm going to have to extend the data set, this looks all over the place :)
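To put the spread of those figures in one place (using the counts quoted above, plus the boingboing.net pair from earlier in the exchange), the Yahoo-to-Google ratio swings across more than three orders of magnitude:

```python
# site: result counts quoted above (Yahoo, Google), June 2005.
counts = {
    "boingboing.net": (137_000, 71_000),
    "instapundit.com": (58_300, 80_300),
    "dailykos.com": (19_000, 682_000),
    "gizmodo.com": (195_000, 38_100),
    "fark.com": (1_940, 1_030_000),
}

for site, (yahoo, google) in counts.items():
    print(f"{site:<16} Yahoo/Google ratio: {yahoo / google:>8.3f}")

ratios = [y / g for y, g in counts.values()]
print(f"spread: min {min(ratios):.3f}, max {max(ratios):.3f}")
```

With ratios running from roughly 0.002 (Fark) to over 5 (Gizmodo), neither engine consistently reports more pages -- which is exactly why these estimates are so hard to build conclusions on.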

The best way to know is to find a site small enough that you can actually review all 1,000 or less pages that will be displayed, then literally count and see if duplicate and other junk is being eliminated. In lieu of that, you can go with the raw count figures and hope that this other stuff isn't going on.

There's got to be a site of that small a size somewhere in the Internet.com network. Could you ask around internally? If you identify one, maybe we can get a group effort started on this. I figure if we throw it out as a challenge in an SEO forum, we could get some good responses.

If I have 100,000 pages from your site and 10 from another, no great help to you if your 100,000 pages are deemed not good enough and never rank well.

Number of pages does mean a lot in terms of marketing. It can also have an impact on results: the higher the number of pages, the higher the possibility that you have the best set of pages (hence the race to build bigger indexes).

People have [tried to audit sizes] to some degree, as the stuff I've previously sent points out.

Actually, what I'm asking for is independent confirmation of the numbers. The pages you sent me provide useful info about what the claims are but how do we investigate whether the claims are correct? How do we move from reported size figures to actual size figures?

Things are not equal. Search results are not like a lottery. Every page is different. Every page is going to be slightly better or worse for a particular query. Linkage data skews things even more. There is no level playing field out there, where just number of pages gives you a better chance. Yes, you have more actual chances, but it's still not a case that it will skew in your favor.

I know things are not equal. I'm just trying to establish a base line here. Think of it as dissection. Trying to get one piece sorted and then the next. Maybe we can learn something out of careful dissection of this type.

Honestly, skip the numbers. It's the results. You want to measure how well blogs do on search engines, pick queries, do the searches and see who comes up. That's the very best test you can do.

The results are interesting and there's a fair amount of research being done there (as you know and chronicle :) ). What I'm trying to understand is how the indexes are built. It's definitely not as exciting to most people but it is important in the long run (I'm working under the assumption that crawling is not going to work in the long run as a way to keep a relatively fresh index).

June 22

Danny: (I'm making these responses to Tristan's last email as part of this post, rather than through email)

It's true. However, it would be nice to see an actual audit of those indexes to see what the numbers really are.

Sure, but the time and energy to focus on size numbers detracts from the real figure you want, a relevancy figure. The size marketing game comes around from time to time, as Search Engine Size Wars V Erupts explains. But overall, it's not worth the time to deconstruct. If Google is 8 billion and MSN is 6 billion, they are both BIG. The question isn't whether Google really has an extra billion or two. The question is whether it has massively more info indexed than MSN. On this scale, no. See also Search Engine Size Wars & Google's Supplemental Results for more on this and In Search Of The Relevancy Figure on how size is used as a surrogate for the real figure we need, a relevancy figure.

But what does it present as a # when I do a link: search? What is that number? That's the question I'm trying to pose (albeit maybe not clearly enough). When Google says "Results 1 - 10 of about XXXXXXXX linking to foo.com" what does that number mean? That data is being generated (whether it's query intensive or not is the problem of the search engines: if they're going to display something, they better make sure it's correct). Furthermore, I don't think it would be any more intensive to show the all the links (since each query is only for 10 to 100 results per page max): the processing power is such that it would be the same for each page anyways since the tough part of the processing is in the ordering and it is being done in that way for the pages it shows.

That number means the number of links Google chooses to display to you, a sample of all the links it knows about. It is correct -- just not correct in that you assumed it meant ALL the links it knows about. A disclaimer would be nice. After banging on them about this issue, they finally got at least a note about sampling onto one of their help pages that webmasters read. As I said, the page searchers might read doesn't explain this. As for query power, yes, search engines commonly report that generating things like link lists takes a lot more work. For one thing, lots of people search on the same things every day, so common searches can come out of cached memory. But a link list? Do that, and you're likely the first person that day to search for that set of links. You've got to go to disk and pull up the data anew, is my understanding from talking with them. They're still fast at it -- but if lots and lots of people did it, it would be a drain.
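The caching point can be sketched with a toy model (my own illustration, not Google's actual architecture): put a simple in-memory cache in front of a slow backend lookup. A popular query repeated all day is one miss followed by cheap hits; a long tail of unique link: queries misses every single time.

```python
class CachedSearcher:
    """Toy model: popular queries hit an in-memory cache; one-off
    queries (like most link: lookups) always miss and go to 'disk'."""

    def __init__(self, backend):
        self.backend = backend      # the slow lookup, e.g. disk/index
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def search(self, query):
        if query in self.cache:
            self.hits += 1
            return self.cache[query]
        self.misses += 1
        result = self.backend(query)
        self.cache[query] = result
        return result

searcher = CachedSearcher(backend=lambda q: f"results for {q!r}")
# The same popular query repeated: one miss, then cheap hits.
for _ in range(1000):
    searcher.search("britney spears")
# A long tail of unique link: queries: every one misses.
for n in range(1000):
    searcher.search(f"link:site{n}.example.com")
print(f"hits={searcher.hits} misses={searcher.misses}")
```

Two thousand popular-query lookups cost one backend call; two thousand unique link: lookups would cost a thousand -- which is the drain Danny describes.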

If you can't trust them, why are they even offering them, then? Wouldn't it make more sense for them not to display that info? I think there are quite a few people working at Google on the UI, and generally they do not throw information on that screen just because it looks pretty. So what is that number? If the agreement is that the number is meaningless, why is it there?

Because when search engines remove these numbers, they get complaints. That's one reason they've given. Also, it gives you some degree of feel for how much is out there. And Google did say "of about" with the numbers it reports. That's not an accident. They're saying that this is an estimate. But no disagreement from me. If you put up a count, it would be nice if the count were as accurate as possible. Google's counts have come under question. See Revisiting Google's Counts & Drops When Searching The Same Word Twice and Questioning Google's Counts. That latter article highlights a series of other articles on count issues, including just how historic an issue this is, going back to problems with AltaVista. Also see Tim Bray's recent On Search: Sorting Result Lists.

There's got to be a site of that small a size somewhere in the Internet.com network. Could you ask around internally? If you identify one, maybe we can get a group effort started on this. I figure if we throw it out as a challenge in an SEO forum, we could get some good responses.

If people want to audit search sizes, they can start by visiting Greg Notess's Search Engine Showdown site, where he illustrates how to run a set of queries, rather than doing site: searches, to determine sizes. He's even been contracted in the past by search engines that wanted an audit, as with Northern Light. Anyone can do what he's done. Even better, heck, people could just fund him to do a new study.
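The query-sampling idea can be sketched as capture-recapture estimation (in the spirit of the Bharat-Broder 1998 overlap study; the indexes and samples below are invented toy data): sample pages from each engine, check what fraction of them the other engine also has, and the overlap fractions give a relative-size estimate without trusting anyone's advertised number.

```python
import random

def relative_size(sample_a, engine_b, sample_b, engine_a):
    """Estimate size(A)/size(B) from overlap fractions.

    Under uniform sampling, size(A)/size(B) ~= frac_b_in_a / frac_a_in_b,
    where frac_a_in_b is the share of pages sampled from A that B indexes.
    """
    frac_a_in_b = sum(p in engine_b for p in sample_a) / len(sample_a)
    frac_b_in_a = sum(p in engine_a for p in sample_b) / len(sample_b)
    return frac_b_in_a / frac_a_in_b

# Invented toy indexes: A holds 8,000 pages, B holds 4,000, with overlap.
engine_a = set(range(0, 8000))
engine_b = set(range(6000, 10000))

random.seed(0)
sample_a = random.sample(sorted(engine_a), 500)
sample_b = random.sample(sorted(engine_b), 500)

estimate = relative_size(sample_a, engine_b, sample_b, engine_a)
print(f"estimated size(A)/size(B): {estimate:.2f} (true ratio: 2.00)")
```

The real difficulty, which the toy hides, is drawing anything like a uniform sample of pages from a live engine -- that's where methods like Notess's query sets come in.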

Number of pages does mean a lot in terms of marketing. It can also have an impact on results: the higher the number of pages, the higher the possibility that you have the best set of pages (hence the race to build bigger indexes).

Yes, after all, how can you find the needle in the haystack if you search only half the haystack? Wait! What if I dump the haystack on your head? Can you find the needle now, even though you have everything? That's how I've long tried to explain this issue when speaking or in this article, Search Engine Size Wars & Google's Supplemental Results. We do want index growth, but having an extra 1 billion pages almost certainly won't make your search for "britney spears" any better.

Actually, what I'm asking for is independent confirmation of the numbers. The pages you sent me provide useful info about what the claims are but how do we investigate whether the claims are correct? How do we move from reported size figures to actual size figures?

Fund Greg, test yourself in a new way, maybe lobby the search engines to come together on this. Auditing was big in 1999, back especially when Northern Light was annoyed at AltaVista's claims, despite its habit of timing out. See Who's The Biggest Of Them All? and Northern Light Claims Largest Index. As said, Northern Light funded one such audit. It is an issue, but I'd rather see them come together on a commonly accepted set of metrics on how relevant they are, if they're going to start anywhere. But maybe after writing on the size issue for almost ten years, I'm beaten down :) Really, it's more that it's not a huge issue to me and my co-editors, because we don't look to size figures to know who is best.

And Now For Something Completely Different

While writing this up, I noticed that Technorati's David Sifry had been following Tristan's reports and posted some questions of his own in On Search Engine results comparisons: Where's the remaining 99.8% of the results?. So, I'll conclude with a quick answer to something he raised:

If you can only view 703 results of about 575,000, where are the other 573,297 results? That's only 0.2% of the search results that the estimate claims. Where's the missing 99.8% of the search results?

No major search engine lets you go beyond 1,000 results, last time I looked. This is something that's been in place for ages and ages and ages. Some key reference material:

  • On Search: Sorting Result Lists: Consider this a classic post, from Tim Bray, on why it doesn't matter how many pages you think the search engine is sorting through; it's not really happening like that.

  • Sorry, no more results: From Michael Bazeley, following up on Tim's post and discovering that you can't get more than 1,000 results on Google -- so when it says it searched X number of results, should you really believe it?

  • Search Engine Size Test: July 2000: Gives an in-depth look at how search engines have long not allowed you to see all the results they have. It's not just a Google thing -- or even a new thing for the search industry.

  • Postscript (from Gary): I've been posting about the unreliability of page estimates from large web engines for several years on ResourceShelf.com. Here are a few links.
