Number one on my 25 Things I Hate About Google list from March was "web search
counts that make no sense." The fiasco with the "5 billion spam pages" in Google
only underscores that those counts really are a big issue, one that can be
noticed by more than a few tech heads.
Fix them or get rid of them, I say.
Adam Lasnik from Google’s search quality team has been running around to
various public forums explaining that it really wasn’t 5 billion pages that got
indexed from one master domain but instead a counting glitch that makes the
problem seem worse than it was. He commented over at Threadwatch:
We have noticed that some site: queries are showing bizarre results and
it’s turned out to be tied to a bad data push. We’re fixing it now….
I’m saying that the results counts are drastically off.
He’s also been at Digg:
Our engineers recently noticed that our site: queries (number of results
listed for a search) were showing bizarre results. This has turned out to be
tied to a bad data push, and we’re fixing this right now.
In the case being discussed above, the number in "about [x billion]" is
currently incorrect. We haven’t indexed anywhere close to as many pages of
these sites as is currently suggested. It’s a significant results estimation
error, thankfully limited in scope but clearly pretty stark when it appears.
And at John Battelle’s blog:
Compounding the issue, our result count estimates in these contexts was
MANY orders of magnitude off. For example, the one site that supposedly had
5.5 billion pages in the index actually had under 1/100,000th of that.
John’s post is probably the most important illustration of why those counts
really do matter, given that he took them at face value — and so many others
will, as well.
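
To put the quoted "under 1/100,000th" figure in perspective, here is a quick back-of-the-envelope calculation. It is only a sketch of the arithmetic implied by the quote; the variable names are mine, and the real page count was never published.

```python
import math

# Figures taken from the quoted comment; everything else is illustrative.
reported_pages = 5_500_000_000   # "5.5 billion pages in the index"
overcount_factor = 100_000       # "under 1/100,000th of that"

# Upper bound on the real page count implied by the quote.
max_actual_pages = reported_pages // overcount_factor

# Minimum size of the error, in orders of magnitude (powers of ten).
min_orders_of_magnitude = math.log10(overcount_factor)

print(f"Implied upper bound: fewer than {max_actual_pages:,} pages")
print(f"Error: at least {min_orders_of_magnitude:.0f} orders of magnitude")
```

In other words, a count that suggested billions of pages apparently described a site with fewer than 55,000 pages in the index.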
When I saw the story on Monday, I doubted Google really had indexed so many
pages, especially given the known problems with the site: command recently.
While Google doesn’t report the total number of pages it indexes any longer, it
wasn’t that long ago when 5 billion pages would have been over half the reported
size, as John noted:
5 billion pages is the entire size of the Google index just a year or so
ago. The last claim, before they stopped MAKING claims, was 8 billion…think
Now sure, maybe Google really did index that many pages. Maybe they’ve
expanded so much that there’s plenty of room. More likely, adding that massive
number of pages would have caused a lot more good pages to go missing
to make room for them. There would have been a ton of screaming *widely* across
the web from site owners big and small.
I know, I know. Some believe Google’s running out of space, and Eric Schmidt
commented on a "machine crisis," which the company later denied was an
issue with web search. Certainly many webmasters have long been having problems
with pages in the wake of the shift in Google’s infrastructure. But many
webmasters also have not been having problems.
Maybe Google is so screwed up that it IS picking up billions of spam pages
from a few sites and dumping good stuff. However, I think that’s unlikely. I
think lots of pages did get in from this site, though maybe in the millions
rather than billions. And perhaps collectively, millions of pages of spam from a
number of sites are pushing good stuff out. But that 5 billion figure for this
particular site (and its subdomains)? I do think it was a counting error.
That counting error is a big problem in and of itself. As I said, many people
take the counts at face value, even trying to use these meaningless figures in
court cases, as Fox News did, or as the US Attorney General did
before the US Supreme Court.
Enough is enough. Make the figures accurate or stop reporting them at all.
Last year, I lobbied for Google to drop the index count on its home page,
which happened. Now they should strongly consider doing the same thing with
results counts.
"Results Counts / Number Of Matches To Go?" from Gary Price last year talked
about this perhaps being a good next move for Google and the other search
engines to make. Certainly the time now seems right.
Google, like Yahoo, won’t let you go past the first 1,000 matches anyway (Ask
goes to 200; MSN to 250). So who cares about showing how many matches there are?
Counts like these are remnants of the days when search engines first appeared
and showing that they had lots of matches helped perhaps make you think they
must be good or comprehensive. But if the counts mean nothing, why keep using
them?
Ah, but it’s only an issue with the counts if you do a site: command, you
might say. Certainly we’ve known about a bug with that. We’ve been
told some of it has been fixed, but clearly bugs are still being worked out.
But are regular search counts accurate? If I search for djkfdkjfdkjddfdfdd,
I get told there are no matches. So if I shift to -djkfdkjfdkjddfdfdd,
I should get a count of all pages in the index that don’t contain that word —
and since we know there are no pages with it in the index — I should get a
count of ALL pages Google has indexed. And that count?
Results 1 – 10 of about 25,270,000,000
for -djkfdkjfdkjddfdfdd. (0.07 seconds)
So there we have it — Google has 25 billion pages indexed. Maybe. Or maybe
not. This type of search sometimes has produced figures in the past that you
knew couldn’t be right. Plus, as I’ve written before,
Google’s long had counting problems. I don’t know whether to trust that count or
not. And if I can’t trust it, why offer it to me? Especially, why offer it to me
if, after a glitch, you have to run around doing damage control to say the count
is wildly inaccurate? Just get rid of it.
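
For anyone wondering why the negated nonsense word ought to return every page, here is a toy sketch of the logic. It uses a tiny exact inverted index that I made up for illustration (the documents and the `build_index` and `count_matches` helpers are hypothetical). It is not how Google computes its counts, which are estimates, and that is presumably why the real numbers can drift so far.

```python
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each word to the set of document IDs that contain it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


def count_matches(index: Dict[str, Set[int]], all_ids: Set[int], query: str) -> int:
    """Count matches for a one-word query; a leading '-' means 'does not contain'."""
    if query.startswith("-"):
        excluded = index.get(query[1:].lower(), set())
        return len(all_ids - excluded)
    return len(index.get(query.lower(), set()))


# Three made-up documents standing in for "the whole index".
docs = {
    1: "google results counts",
    2: "yahoo site explorer",
    3: "search engine watch",
}
index = build_index(docs)
all_ids = set(docs)

print(count_matches(index, all_ids, "djkfdkjfdkjddfdfdd"))   # word appears nowhere
print(count_matches(index, all_ids, "-djkfdkjfdkjddfdfdd"))  # so excluding it keeps everything
```

With exact counting, the first query matches nothing and the second matches all three documents, which is the whole index. The trick only tells you the index size if the engine's count is exact, and that is precisely what is in doubt.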
Instead, this is what I want to see in the future:
Results 1 – 10
OK? And how about giving an option to have a number show up next to a result,
for those who want it. That would be nice if I want to refer to the exact
position of a particular listing to someone else. But the total number of
matches? It’s meaningless. And the time it took to search? Chest thumping we
don’t need anymore.
Keep in mind that a site: command is incredibly processor intensive. It’s not
something most searchers do, so spending the time, energy and machine power to
get hyper-accurate site: results within regular Google search isn’t a priority.
Instead, move site: searches to work within Google Sitemaps, and you take the
burden off your main machines. It’s also something you can perhaps have
scheduled to run as a report, something generated en masse during slower periods
for anyone who wants to get that type of data. If three people all want
site:amazon.com data, you run that once and give all three the info.
Yahoo rolled out a similar tool, Yahoo Site Explorer.
It was a good move. It would be a good move for Google to also make, along with
dropping the general results counting on Google results pages.
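
The "run it once for everyone who asked" idea can be sketched in a few lines. This is only an illustration under my own assumptions: `run_batch` and `fake_count_pages` are hypothetical names, the user names are invented, and the page count is a placeholder rather than real data. Nothing here reflects how Google Sitemaps or Yahoo Site Explorer actually work.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def run_batch(
    requests: List[Tuple[str, str]],
    count_pages: Callable[[str], int],
) -> Dict[Tuple[str, str], int]:
    """Answer queued (user, domain) requests, running the costly count once per domain."""
    users_by_domain: Dict[str, List[str]] = defaultdict(list)
    for user, domain in requests:
        users_by_domain[domain].append(user)

    reports: Dict[Tuple[str, str], int] = {}
    for domain, users in users_by_domain.items():
        page_count = count_pages(domain)  # the one expensive run for this domain
        for user in users:
            reports[(user, domain)] = page_count
    return reports


calls: List[str] = []


def fake_count_pages(domain: str) -> int:
    """Stand-in for the expensive site: count; the number is a placeholder."""
    calls.append(domain)
    return 1234


requests = [
    ("alice", "amazon.com"),
    ("bob", "amazon.com"),
    ("carol", "amazon.com"),
]
reports = run_batch(requests, fake_count_pages)

print(len(calls))    # how many expensive runs actually happened
print(len(reports))  # how many people got an answer
```

Three requests, one expensive run. Schedule the batch for a slow period and the main search machines never feel it.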
Want to comment? Please join our Search Engine Watch Forums thread,
Get Rid Of Results Counts On Google?