Number one on my 25 Things I Hate About Google list from March was "web search counts that make no sense." This week's fiasco with the "5 billion spam pages" in Google only underscores that those counts really are a big issue that can be noticed by more than a few tech heads. Fix them or get rid of them, I say.
Adam Lasnik from Google's search quality team has been running around to various public forums explaining that it really wasn't 5 billion pages that got indexed from one master domain but instead a counting glitch that makes the problem seem worse than it was. We noted Monday that he commented over at Threadwatch:
We have noticed that some site: queries are showing bizarre results and it's turned out to be tied to a bad data push. We're fixing it now....
I'm saying that the results counts are drastically off.
Adam's also been at Digg:
Our engineers recently noticed that our site: queries (number of results listed for a search) were showing bizarre results. This has turned out to be tied to a bad data push, and we're fixing this right now.
In the case being discussed above, the number in "about [x billion]" is currently incorrect. We haven't indexed anywhere close to as many pages of these sites as is currently suggested. It's a significant results estimation error, thankfully limited in scope but clearly pretty stark when it appears.
And over at John Battelle's blog:
Compounding the issue, our result count estimates in these contexts was MANY orders of magnitude off. For example, the one site that supposedly had 5.5 billion pages in the index actually had under 1/100,000th of that.
John's post is probably the most important illustration of why those counts really do matter, given that he took them at face value -- and so many others will, as well.
When I saw the story on Monday, I doubted Google really had indexed so many pages, especially given the known problems with the site: command recently. While Google doesn't report the total number of pages it indexes any longer, it wasn't that long ago when 5 billion pages would have been over half the reported size, as John noted:
5 billion pages is the entire size of the Google index just a year or so ago. The last claim, before they stopped MAKING claims, was 8 billion...think about that.
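To put the quoted figures in proportion, here is a quick back-of-the-envelope sketch. The numbers come straight from the quotes above; the variable names are mine and purely illustrative.

```python
# Figures taken from the quotes above; names are illustrative only.
reported_pages = 5_500_000_000   # count Google displayed for the one site
max_fraction = 100_000           # Google: "under 1/100,000th of that"

# Upper bound on the pages actually indexed for that site.
upper_bound = reported_pages // max_fraction
print(upper_bound)               # 55000

last_claimed_index = 8_000_000_000  # last index size Google publicly claimed
spam_figure = 5_000_000_000         # the "5 billion" in the headlines

# Share of the last claimed index that the spam figure would represent.
share = spam_figure / last_claimed_index
print(f"{share:.1%}")            # 62.5%
```

In other words, if Google's correction is right, the site had fewer than 55,000 pages indexed, while the displayed figure would have amounted to well over half of the last index size Google ever claimed.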
Now sure, maybe Google really did index that many pages. Maybe they've expanded so much that there's plenty of room. More likely, though, adding that massive number of pages would have pushed a lot more good pages out to make room for them, and there would have been a ton of screaming *widely* across the web from site owners big and small.
I know, I know -- some believe Google's running out of space, and Eric Schmidt even commented on a "machine crisis" which the company later denied was an issue with web search. Certainly many webmasters have long been reporting missing pages in the wake of the shift to Google's BigDaddy crawling infrastructure. But many webmasters have not been having problems, either.
Maybe Google is so screwed up that it IS picking up billions of spam pages from a few sites and dumping good stuff. However, I think that's unlikely. I think lots of pages did get in from this site, though maybe in the millions rather than billions. And perhaps collectively, millions of pages of spam from a number of sites are pushing good stuff out. But that 5 billion figure for this particular site (and its subdomains)? I do think it was a counting error.
That counting error is a big problem in and of itself. As I said, many people take the counts at face value, even trying to use these meaningless figures in court cases, as Fox News once did and as the US Attorney General once did before the US Supreme Court.
Enough is enough. Make the figures accurate or stop reporting them at all. Last year, I lobbied for Google to drop the index count on its home page, something that eventually happened. Now they should strongly consider doing the same thing with results counts.
"Time For Results Counts / Number Of Matches To Go?" from Gary Price last year suggested that dropping them might be a good next move for Google and the other search engines to make. Certainly the time now seems right.
Google, like Yahoo, won't let you go past the first 1,000 matches anyway (Ask goes to 200; MSN to 250). So who cares about showing how many matches there are? Counts like these are remnants of the days when search engines first appeared and showing that they had lots of matches perhaps helped make you think they must be good or comprehensive. But if the counts mean nothing, why keep using them?
Ah -- but it's only an issue with the counts if you do a site: command, you might say. Certainly we've known about a bug with that since May. We've been told some of it has been fixed, but clearly bugs are still being worked out.
But are regular search counts accurate? If I search for djkfdkjfdkjddfdfdd, I get told there are no matches. So if I shift to -djkfdkjfdkjddfdfdd, I should get a count of all pages in the index that don't contain that word -- and since we know there are no pages with it in the index -- I should get a count of ALL pages Google has indexed. And that count?
Results 1 - 10 of about 25,270,000,000 for -djkfdkjfdkjddfdfdd. (0.07 seconds)
So there we have it -- Google has 25 billion pages indexed. Maybe. Or maybe not. This type of search has sometimes produced figures in the past that you knew couldn't be right. Plus, as I wrote before, Google's long had counting problems. I don't know whether to trust that count or not. And if I can't trust it, why offer it to me? Especially, why offer it to me if, after a glitch, you have to run around doing damage control to say the count is wildly inaccurate? Just get rid of it.
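The logic behind that negated search can be sketched with a toy inverted index. This is only an illustration of the reasoning, assuming exact counting over a tiny made-up collection; it says nothing about how Google actually computes its figures, which are estimates. All function names and documents here are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def count_matching(index, term):
    """Exact number of documents containing the term."""
    return len(index.get(term.lower(), set()))

def count_excluding(index, all_ids, term):
    """Exact number of documents that do NOT contain the term."""
    return len(all_ids - index.get(term.lower(), set()))

# A made-up three-page "web" for illustration.
docs = {
    1: "google results counts",
    2: "site command bug",
    3: "spam pages indexed",
}
index = build_index(docs)
all_ids = set(docs)

nonsense = "djkfdkjfdkjddfdfdd"
matches = count_matching(index, nonsense)             # no page has the word
everything = count_excluding(index, all_ids, nonsense)  # so excluding it leaves all pages
print(matches, everything, len(docs))
```

With exact counting, excluding a word that appears nowhere returns the whole collection, which is why the negated query ought to reveal the index size. The catch is that a search engine reporting "about" so many results is estimating, so the real figure carries whatever error the estimate does.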
Instead, this is what I want to see in the future:
Results 1 - 10
OK? And how about giving an option to have a number show up next to each result, for those who want it? That would be nice if I want to refer someone else to the exact position of a particular listing. But the total number of matches? It's meaningless. And the time it took to search? Chest thumping we don't need anymore.
Keep in mind that a site: command is incredibly processor intensive. It's not something most searchers do, so spending the time, energy and machine power to make its counts hyper-accurate within regular Google search isn't a priority.
Instead, move site: searches to work within Google Sitemaps, and you take the burden off your main machines. It's also something you can perhaps have scheduled to run as a report, something generated en masse during slower periods for anyone who wants to get that type of data. If three people all want site:amazon.com data, you run that once and give all three the info on a scheduled basis.
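The "run it once, give it to everyone" idea might look something like the sketch below. This is a minimal illustration under my own assumptions, not anything Google Sitemaps actually offers: `run_site_report` is a hypothetical stand-in for the expensive site: computation, and the user names are invented.

```python
from collections import defaultdict

def run_site_report(site):
    """Hypothetical placeholder for the expensive site: computation."""
    return {"site": site, "status": "report generated off-peak"}

def run_scheduled_batch(requests, report_fn):
    """Group pending requests by site so each site is processed only once."""
    subscribers = defaultdict(list)
    for user, site in requests:
        subscribers[site].append(user)

    deliveries = {}
    calls = 0
    for site, users in subscribers.items():
        report = report_fn(site)   # one expensive run per distinct site
        calls += 1
        for user in users:         # the same report goes to every requester
            deliveries[(user, site)] = report
    return deliveries, calls

# Three people all want the same site's data (invented user names).
requests = [
    ("alice", "amazon.com"),
    ("bob", "amazon.com"),
    ("carol", "amazon.com"),
]
deliveries, calls = run_scheduled_batch(requests, run_site_report)
print(calls, len(deliveries))
```

Here the three requests cost a single run of the report, and each person still gets a copy. Scheduling that batch for slower periods is what would keep the load off the machines serving regular searches.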
Yahoo rolled out a similar Yahoo Site Explorer tool last September. It was a good move. It would be a good move for Google to also make, along with dropping the general results counting on Google results pages.
Want to comment? Please join our Search Engine Watch Forums thread, Get Rid Of Results Counts On Google?