CNN reports that the Internet has now crossed a significant milestone; there are 100 million operating websites. The Web's growth has been accelerating: "There were just 18,000 Web sites when Netcraft, based in Bath, England, began keeping track in August of 1995. It took until May of 2004 to reach the 50 million milestone; then only 30 more months to hit 100 million, late in the month of October 2006."
This is kind of like human population growth. The bottom line here is that the more unwieldy the Internet becomes, the more central search becomes as the main navigational tool. And that means -- ka-ching -- paid search will continue to grow for the foreseeable future.
Posted by Greg Sterling at 12:24 PM | Permalink
In Google Has the Largest Number of Dead and Old Pages, the Google Operating System Blog points to a video and some research from Google's Ziv Bar-Yossef that discusses how to grab a random sample of pages from major search engines and extrapolate from those pages information about the search engines. This can be used in a number of ways.
One interesting piece of information that you can determine from the method he discusses is the percentage of dead and of old pages that a search engine may contain. In comparing Google, MSN, and Yahoo! following these methods, Google appears to contain the largest number of dead pages. The video is from an August 17th Techtalk from Google covering this.
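To make the sampling idea concrete, here is a minimal sketch of the last step only: turning a random sample of pages into an estimated share of dead pages, with a rough margin of error. It is not Bar-Yossef's method (the hard part of his research is drawing a near-uniform sample from a search engine in the first place). The function name, the rule that any 4xx or 5xx status counts as "dead", and the sample data are all assumptions made up for illustration.

```python
import math

def estimate_dead_fraction(statuses, z=1.96):
    """Estimate the share of dead pages from a random sample of HTTP status codes.

    Returns (point_estimate, lower_bound, upper_bound) using a normal
    approximation; z=1.96 corresponds to roughly a 95% interval.
    """
    n = len(statuses)
    if n == 0:
        raise ValueError("sample is empty")
    # Assumption: a page counts as dead if it answers with a 4xx or 5xx status.
    dead = sum(1 for code in statuses if code >= 400)
    p = dead / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Made-up sample: 1,000 sampled pages, 50 of which return 404.
sample = [200] * 950 + [404] * 50
p, low, high = estimate_dead_fraction(sample)
print(f"dead pages: {p:.1%} (roughly {low:.1%} to {high:.1%})")
```

The estimate is only as good as the sample: if the sampled pages are not close to a uniform draw from the index, the percentage says little about the engine as a whole.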
In addition to the information that it provides about search engines, and this method of sampling them, the video also discloses that Ziv Bar-Yossef joined Google a couple of weeks before the video footage was shot. Ziv Bar-Yossef previously worked at the IBM Almaden Research Center, and was most recently at Technion - Israel Institute of Technology, Israel.
I also wrote a little about this research at the SEO by the Sea Blog in How Do You Estimate the Size of A Search Engine?, and included with that post a listing of some of the patents that he was involved in developing while at IBM. One of the more interesting was one on Methods and apparatus for assessing web page decay.
Ziv Bar-Yossef brings a wealth of knowledge to Google. Another interesting recent paper he was involved with, while at Technion, looks carefully at Different URLs with Similar Text, and ways that search engines could identify those more easily.
Posted by Bill Slawski at 1:42 PM | Permalink
I've written before about Google giving strange results counts and why maybe it's time for them to go. Yesterday, I came across the oddest ones ever, when doing some typical searches to gauge the size of the index.
Here's an example. Search for xxkjdiuenmnmd8i, which when I just did it came back with no results. Now search for -xxkjdiuenmnmd8i. In theory, that should show the size of the Google index, all the pages it has.
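The theory behind the negative query can be sketched with a toy inverted index. Nothing here reflects how Google actually evaluates queries; the three documents and the `search` and `search_not` helpers are hypothetical, and the point is only that excluding a term no page contains should leave every page in the index.

```python
# Toy inverted index: each term maps to the set of document ids containing it.
docs = {
    1: "cheap cars for sale",
    2: "the history of movies",
    3: "cars and movies",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

def search(term):
    """Return ids of documents that contain the term."""
    return index.get(term, set())

def search_not(term):
    """Return ids of documents that do NOT contain the term."""
    return all_docs - search(term)

nonsense = "xxkjdiuenmnmd8i"
print(len(search(nonsense)))      # no document contains the nonsense term
print(len(search_not(nonsense)))  # so excluding it leaves every document
print(len(search_not("cars")))    # a real term excludes the pages that use it
```

A real engine reports an estimate rather than an exact set size, which is one reason the count it shows for such a query may not match the true index size.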
In reality, that type of search hasn't often worked. It was only last September that this type of index estimation technique gave any results at all. Even then, I didn't trust that the numbers were accurate. Still, they seemed better than what's coming up now. Look at the screenshot below:
Ten results? Only ten results, for a search technique that last month would have come up with more than 25 billion? Something funky is going on.
Finding it odd, I tried a search for the, often useful as a fast way to get a sense of how big Google might be, at least for the number of English language pages it has. The query came back with 23 billion matches. So how about -the, I tried, just out of curiosity. Ten matches:
Ten? Ten?!!! And more strangeness. A search for -and, -cars, -movies all did the same thing. The results were different in various ways, but the count was always only 10 matches, when it should be much more.
Note that the results all have additional information that make them appear to come out of Google Base. It all suggests that Google has disabled counting for queries involving a single word, but that somehow, Google Base integration is still happening to throw things off. It might be that Google is still doing a call to Google Base, asking for the top 10 results that it has, in order to integrate those results into a regular web search listing. But because it also has disabled display of regular web search results for a single negative word query, it's only Google Base that shows.
Going back to my post from last month, Google, Kill The Web Search Counts!, I explained how Google had stated that the counts reported for a spam site that was removed were much inflated by a counting glitch. I talked with Google about this and some other issues last week just before leaving for my trip to SES Latino in Miami, where I am now.
Some of what I talked about with Google's Matt Cutts and other engineers at Google has already been addressed in a recent blog post. The issue of counts came up, and I'll do a longer post on what Google said after I get back from this trip and clear what I can discuss. The short answer is that they are aware of the issues and are looking to correct things. These strange results counts might be part of that.
More later when I'm back from my current trip, or watch Matt's blog, in case he posts before me.
Posted by Danny Sullivan at 8:12 AM | Permalink
Number one on my 25 Things I Hate About Google list from March was "web search counts that make no sense." This week's fiasco with the "5 billion spam pages" in Google only underscores that those counts really are a big issue that can be noticed by more than a few tech heads. Fix them or get rid of them, I say.
Adam Lasnik from Google's search quality team has been running around to various public forums explaining that it really wasn't 5 billion pages that got indexed from one master domain but instead a counting glitch that makes the problem seem worse than it was. We noted Monday that he commented over at Threadwatch:
We have noticed that some site: queries are showing bizarre results and it's turned out to be tied to a bad data push. We're fixing it now....
I'm saying that the results counts are drastically off.
Adam's also been at Digg:
Our engineers recently noticed that our site: queries (number of results listed for a search) were showing bizarre results. This has turned out to be tied to a bad data push, and we're fixing this right now.
In the case being discussed above, the number in "about [x billion]" is currently incorrect. We haven't indexed anywhere close to as many pages of these sites as is currently suggested. It's a significant results estimation error, thankfully limited in scope but clearly pretty stark when it appears.
And over at John Battelle's blog:
Compounding the issue, our result count estimates in these contexts was MANY orders of magnitude off. For example, the one site that supposedly had 5.5 billion pages in the index actually had under 1/100,000th of that.
John's post is probably the most important illustration of why those counts really do matter, given that he took them at face value -- and so many others will, as well.
When I saw the story on Monday, I doubted Google really had indexed so many pages, especially given the known problems with the site: command recently. While Google doesn't report the total number of pages it indexes any longer, it wasn't that long ago when 5 billion pages would have been over half the reported size, as John noted:
5 billion pages is the entire size of the Google index just a year or so ago. The last claim, before they stopped MAKING claims, was 8 billion...think about that.
Now sure, maybe Google really did index that many pages. Maybe they've expanded so much that there's plenty of room. More likely, adding that massive number of pages really should have caused a lot more good pages to go missing, to make room for them. There would have been a ton of screaming *widely* across the web from site owners big and small.
I know, I know -- some believe Google's running out of space, and Eric Schmidt even commented on a "machine crisis" which the company later denied was an issue with web search. Certainly many webmasters have long been reporting missing pages in the wake of shifting to Google's BigDaddy crawling infrastructure. But many webmasters also have not been having problems.
Maybe Google is so screwed up that it IS picking up billions of spam pages from a few sites and dumping good stuff. However, I think that's unlikely. I think lots of pages did get in from this site, though maybe in the millions rather than billions. And perhaps collectively, millions of pages of spam from a number of sites are pushing good stuff out. But that 5 billion figure for this particular site (and its subdomains)? I do think it was a counting error.
That counting error is a big problem in and of itself. As said, many people take the counts at face value, even trying to use these meaningless figures in court cases as Fox News once did or the US Attorney General once did before the US Supreme Court.
Enough is enough. Make the figures accurate or stop reporting them at all. Last year, I lobbied for Google to drop the index count on its home page, something that eventually happened. Now they should strongly consider doing the same thing with results count.
Time For Results Counts / Number Of Matches To Go? from Gary Price last year talked about this perhaps being a good next move for Google and the other search engines to make. Certainly the time now seems right.
Google, like Yahoo, won't let you go past the first 1,000 matches anyway (Ask goes to 200; MSN to 250). So who cares about showing how many matches there are? Counts like these are remnants of the days when search engines first appeared and showing that they had lots of matches helped perhaps make you think they must be good or comprehensive. But if the counts mean nothing, why keep using them?
Ah -- but it's only an issue with the counts if you do a site: command, you might say. Certainly we've known about a bug with that since May. We've been told some of it has been fixed, but clearly bugs are still being worked out.
But are regular search counts accurate? If I search for djkfdkjfdkjddfdfdd, I get told there are no matches. So if I shift to -djkfdkjfdkjddfdfdd, I should get a count of all pages in the index that don't contain that word -- and since we know there are no pages with it in the index -- I should get a count of ALL pages Google has indexed. And that count?
Results 1 - 10 of about 25,270,000,000 for -djkfdkjfdkjddfdfdd. (0.07 seconds)
So there we have it -- Google has 25 billion pages indexed. Maybe. Or maybe not. This type of search sometimes has produced figures in the past that you knew couldn't be right. Plus, as I wrote before, Google's long had counting problems. I don't know whether to trust that count or not. And if I can't trust it, why offer it to me? Especially why offer it to me if after a glitch, you have to run around doing damage control to say the count is wildly inaccurate. Just get rid of it.
Instead, this is what I want to see in the future:
Results 1 - 10
OK? And how about giving an option to have a number show up next to a result, for those who want it. That would be nice if I want to refer to the exact position of a particular listing to someone else. But the total number of matches? It's meaningless. And the time it took to search? Chest thumping we don't need anymore.
One exception, however. Google Sitemaps has just added a bunch of expanded reporting. I want them to go further and let site owners get accurate index counts through that system.
Keep in mind that a site: command is incredibly processor intensive. It's not something most searchers do, so spending the time, energy and machine power to get hyper-accurate results for regular Google searches isn't a priority.
Instead, move site: searches to work within Google Sitemaps, and you take the burden off your main machines. It's also something you can perhaps have scheduled to run as a report, something generated en masse during slower periods for anyone who wants to get that type of data. If three people all want site:amazon.com data, you run that once and give all three the info on a scheduled basis.
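As a rough sketch of the "run it once, share the result" idea: group pending requests by site, do the expensive count a single time per site, and hand the same figure to everyone who asked. The `count_indexed_pages` function is a hypothetical stand-in that returns made-up numbers, not a real Google Sitemaps call, and the requester names are invented.

```python
from collections import defaultdict

def count_indexed_pages(site):
    """Hypothetical stand-in for an expensive site: count; returns made-up data."""
    count_indexed_pages.calls += 1
    fake_counts = {"amazon.com": 1_200_000, "example.com": 42}
    return fake_counts.get(site, 0)

count_indexed_pages.calls = 0

def run_scheduled_reports(requests):
    """Run each site's count once, then fan the result out to every requester.

    `requests` is a list of (requester, site) pairs collected since the last run.
    """
    requesters_by_site = defaultdict(list)
    for requester, site in requests:
        requesters_by_site[site].append(requester)

    reports = {}
    for site, requesters in requesters_by_site.items():
        count = count_indexed_pages(site)  # one expensive call per distinct site
        for requester in requesters:
            reports[(requester, site)] = count
    return reports

pending = [("alice", "amazon.com"), ("bob", "amazon.com"),
           ("carol", "amazon.com"), ("dave", "example.com")]
reports = run_scheduled_reports(pending)
print(count_indexed_pages.calls, "expensive counts for", len(pending), "requests")
```

The trade-off is freshness: requesters get the figure from the last scheduled run rather than a live count, which seems acceptable for this kind of reporting.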
Yahoo rolled out a similar Yahoo Site Explorer tool last September. It was a good move. It would be a good move for Google to also make, along with dropping the general results counting on Google results pages.
Want to comment? Please join our Search Engine Watch Forums thread, Get Rid Of Results Counts On Google?
Posted by Danny Sullivan at 10:32 AM | Permalink
We've noticed that the publicly announced total of web pages permanently archived in The Internet Archive's Wayback Machine (a permanent archive, as opposed to a web search cache) grew by 15 billion pages today, from 40 billion to 55 billion archived web pages.
The increased page total has now been updated on the Wayback home page, and it's also clearly visible on The IA's home page. Is The Wayback Machine a complete archive of every page on the web? No, not at all. However, it's the largest one-stop permanent web archive out there and it's a very important tool for all web researchers. Material in The Wayback Machine dates back to 1996. Kudos and congrats to Brewster and his team. Now, let's hope keyword search capabilities come back soon. Btw, you can also find direct links to The Wayback Machine from Gigablast (look for the "Older Copies" link next to each snippet) and Yahoo. Look for The Wayback Machine link in the top box when reviewing a page found in the Yahoo cache.
Postscript: You can also find direct links to The Wayback Machine via Alexa.com. On a results page, click the "Site Info" link and then look for The Wayback Machine links in the left column.
Posted by Gary Price at 6:48 PM | Permalink
As Danny mentions, it's good to see the total size war go away for at least the time being. Danny also points out this page from Google that lays out their thoughts on comprehensiveness. A couple of quick comments, including wondering if the results counts that every search engine shows should now go away.
From the page:
The basic test for search engine comprehensiveness is whether you can find uncommon information. Popular queries return millions of results, but even the most obsessive searcher isn't about to surf a few million pages, or even a tiny fraction of them; in most of these cases, you'll either quickly find what you're looking for or refine your search to be more focused.
Perhaps it's time to take a look at the usefulness (aside from their marketing value, which is likely the reason they don't point out this fact) of the page estimates that Google and others provide at the top of results pages.
Just how accurate are they? What are they telling the typical searcher? It would be useful if all search companies (not only Google) would let the public (including many journalists) know that they're just estimates and often far from accurate.
Yes, some people will refine (if they know how, do they?) their searches. However, don't forget that even if you wanted to view all of the results, you couldn't. Most web engines will only show the first 1000 results.
Are the estimates on web results pages going to be the next battleground? I wonder how many people even noticed the total that Google used to list on their home page vs. the estimates they see each and every time they run a search?
More from the Google page:
To see for yourself, try searching for something very specific, or try a query that previously returned very few results. For example, you could enter your name or hometown, along with your favorite color or animal. Navigate to the last page to see how many results the search engine really delivered. (On the last page, you may have to click the "repeat the search with the omitted results included" link to see all the results.) Do this on different search engines for several queries and see what you come up with. As you can imagine, we've run quite a few tests like this, and we expect your results will be very similar to ours.
Sure, you'll likely find a result for this type of query but the real question is how useful is the info to the searcher? Is it a page simply scraping or reposting (possibly without permission) content from another page that's already in the index? Are random words (note the Google suggested search above) simply appearing on a word list? Is it one of the thousands of versions (technically different pages) of the Open Directory Project appearing in the index? How about nearly identical pages for a book appearing at Amazon.com and many affiliates?
These pages will show up on results pages and be included in the total count but, in many cases, the material could prove to be of little value to most searchers.
Don't get me wrong, comprehensiveness can be a VERY good thing. However, larger indices can also be a challenge, especially for the unsophisticated searcher. That's why verticals and specialized search tools that focus on a specific type of material can be very valuable.
As I said yesterday, Google and all of the major engines would be doing all searchers a favor by using their notoriety to teach people, even in a small way, to use ALL the tools they offer to build better queries that offer more precise results.
Posted by Gary Price at 10:02 AM | Permalink
Roundup Of Google Size Announcement Coverage
Yesterday was pretty much spent by me writing my story about Google claiming to be most comprehensive search engine but also dropping any page count from its home page. That story, if you missed it, is up here: End Of Size Wars? Google Says Most Comprehensive But Drops Home Page Count. Now that I've emerged from my writing cocoon, here's a roundup of what others are saying on the subject:
"We congratulate Google on removing the index size number from its home page and for recognizing it is a meaningless number," Yahoo said in a statement. "As we've said in the past, what matters is that consumers find what they are looking for, and we invite Google users to compare their results to Yahoo search at http://search.yahoo.com."
I may add further links as I see unique stuff flow in.
Want to discuss? Visit the Google Drops The Home Page Count thread in our Search Engine Watch Forums.
Posted by Danny Sullivan at 8:57 AM | Permalink
Google Claims To Be Most Comprehensive - But Helps Defuse Size Wars By Dropping Home Page Count
Today's SearchDay article Google Says Now Biggest, Most Comprehensive - But Size Wars Defused By Dropped Home Page Count covers the latest chapter in the dispute over search engine size that started with last month's claim by Yahoo to have outdistanced Google in index size.
Google now says it is three times larger than its closest competitor (ie, Yahoo) and is the most comprehensive search engine available. However, it's not offering proof of that through an actual count. Indeed, Google is dropping the famous number of web pages it is "searching" from its home page.
Why? Because comparison counts don't mean much any more, something Yahoo has said itself. In short, Google is leaving it to users to prove to themselves whether it does -- or does not -- measure up as most comprehensive.
More in my story, as well as a long look at why count figures themselves aren't the comprehensiveness metric they've sometimes been in the past.
Want to discuss? Visit the Google Drops The Home Page Count thread in our Search Engine Watch Forums
Posted by Danny Sullivan at 12:01 AM | Permalink
Further to my previous post on the Google index update/size increase, there appears to be a new way to count all the pages within Google. Find a term that doesn't exist, then search for minus that term, and you get a full count. Well, sort of.
This was the technique that we used to be able to use at Northern Light, to verify all the pages it had. AllTheWeb used to have a similar method it gave to Greg Notess, as he says here, that he used as part of his long time documentation of search sizes.
I was emailing with Google last week about wanting that type of command to exist at Google. It didn't work last week when I tried, nor had I seen it working before. But if we had it, then anyone could see exactly the total number of pages Google should have in its index.
That's important, because as I've written, the count on Google's home page doesn't change in line with the index growing. In addition, searching for a common word like "the" sometimes doesn't work well because of stopword issues. There are also plenty of non-English language pages that won't contain the word "the."
Today, I noticed the technique suddenly did work! To see it in action, I've provided an example below. This won't work in the long term, because once this post gets indexed, the word will suddenly exist in Google's index. But you can easily do it with other words.
A search for djfdkjkfjkdjdfk comes back with "Your search - djfdkjkfjkdjdfk - did not match any documents." OK, then we know there are no documents in the Google index with this term.
Now I do a search for -djfdkjkfjkdjdfk. That means, "Show me all the pages you have that don't have this word on them." Since we know that NO pages have that word, asking for all pages without it should show us everything.
Count? About 9,560,000,000 pages. Count on the Google home page? "Searching 8,168,684,336 web pages." So at least, Google should have about 1.4 billion more pages in its index than it currently claims.
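For anyone who wants to check the gap between the two figures quoted above, the subtraction works out to roughly 1.39 billion pages. The variable names below are just labels for this sketch, and the query figure is itself only Google's "about" estimate.

```python
query_estimate = 9_560_000_000      # "about" count reported for the negative query
home_page_claim = 8_168_684_336     # count shown on the Google home page

gap = query_estimate - home_page_claim
print(f"gap: {gap:,} pages ({gap / home_page_claim:.1%} above the home page claim)")
```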
I actually think that's much higher, as I'll explain in a future post. That's why I'm saying this may "sort of" work to show all the pages. Certainly PhilC on our SEW Forums has tried this technique and gotten 11.3 billion results. I can't get the same, but it's just another sign that the counts aren't adding up in the many ways you want to slice them.
By the way, I tried the negative technique at some other places. It won't work for Yahoo and Ask Jeeves. But at MSN, -djfdkjkfjkdjdfk came back with a count of 5,304,186,736, which is right in line with the self-reported figure of 5 billion MSN gave last year.
Of course, even if all the search engines make this technique work, it doesn't necessarily mean we've got apples-to-apples comparisons. What depths are the pages indexed to? How well are duplicates removed? Are these pages actually indexed or just links to pages you know about? Those are just some of the issues.
More important, as I've written before and will come back to again, having higher counts won't mean you're more comprehensive. For more on this, see my post from yesterday, Googlewhacks Show More Signs That Google's Increased Its Index; Time To Drop The Hamburger Count.
Want to comment or discuss? Visit our forum thread, Sept. 2005 Google Index Update & Size Increase Coming?
Postscript: Spotted via Inside Google, Google: Spot the mistake charts how queries on Google are now bringing back more results and estimates the index may now be at 21 billion.
Posted by Danny Sullivan at 9:45 AM | Permalink
Yesterday, I wrote about signs that the Google index size has increased despite the fact that the home page number, as usual, remains firmly stuck around 8 billion. Today, Gary Stock over at Googlewhack tells us that several recent Googlewhacks -- queries that yield only one result on Google -- now are showing more than one result. As I'll get into below, it's also another reason why the "hamburger count" of pages on the Google home page should go.
It's not that the searches have suddenly gained more popularity, producing more content, overnight. It's that Google appears to have gained more pages in its index. Gary says repairwomen falter is a Googlewhack that eight hours ago had one result but now comes up with 10 (or 11 if you go into the omitted results). Note that what's showing up are word lists of little value to the searcher. But hey, they get the counts up!
Ironically, the original and only page that had been coming up for this query -- what Gary calls the "legitimate" page -- is no longer showing up when I look. For him, it was showing up buried under the other results. You'll see it here, but the cached copy is easier to read.
Yeah, I'm still trying to finish my piece on why self-reported index numbers don't mean a search engine is more comprehensive than another. Here's the short answer.
We know McDonald's makes millions of hamburgers. Well, billions. Yet I don't even see that mentioned on the vast majority of signs when I go to McDonald's on occasion. It no longer matters. We know they're big. Big, big, big.
Google and Yahoo are big. When Yahoo stated its new size, I didn't think it was suddenly X percentage better than Google, and neither should you, and neither should Google.
Realistically, Google clearly isn't going to relax until it can get the bigger number and tick over the figures on their home page. And this time, I expect they'll trot out some "proof" to back up their claims, which has never been offered in the past.
But why does it matter? Did Google suddenly discover over the past three weeks that there were billions of pages that they should have had but they didn't? No -- it discovered that the size games that have worked for Google in the past suddenly went south, so let's try to play it again by getting our count up. Search relevancy probably won't increase one bit, as a result. But the PR aspect might -- at least in some quarters.
Want the PR aspect to go up with me? Drop the count on the home page, which isn't accurate now and generally hasn't been accurate over time.
Otherwise, maybe Pat Herr from McDonald's is available. She's the "tracker of McDonald's hamburger count" and describes the auditing she's done over the years here. It really won't mean much in terms of quality, but heck, it'll be another fun factoid to trot out. And maybe the size figure on the Google home page will update when the index actually grows larger and reflect everything in it, rather than being updated in reaction to a competitor suddenly threatening to be bigger with their own self-reported figure.
Want to discuss? Visit our forum thread, Sept. 2005 Google Index Update & Size Increase Coming?
Posted by Danny Sullivan at 8:13 AM | Permalink
I wrote earlier that there were signs that Google was increasing its index size. More signs are coming now -- as well as more chatter about a major update overall, at least in terms of link counts and counts of pages spotted by those on search forums.
The usual complaints over lost rankings and search result listing changes are largely quiet compared to normal. Barry did an earlier post about backlink and PR updates being seen and discussed at various forums. WebmasterWorld has a fresh thread up at Update Gilligan Google Upate Sept 2005. Our own fresh thread is here: Sept. 2005 Google Index Update & Size Increase Coming?
So much for getting a "weather report" ahead of coming changes, it seems.
Need further proof? Until recently, a query for "the" was bringing back around 3 billion or 5 billion matches (I can't recall offhand, but it was well below the 8 billion pages claimed in the Google index). Today, Eric Baillargeon notes how a search for "the" brings back "about 8,000,000,000" pages. The Google home page reports, "Searching 8,168,684,336 pages."
So either only 168 million of those pages don't include the word "the" or something screwy is going on.
Posted by Danny Sullivan at 12:37 PM | Permalink
I'm still working on my revisit to the entire Yahoo-Google size debate, but we may be about to see the long-expected response of Google raising its figures shortly. A reader drops a note that on some Google data centers, you can find major increase in counts being reported.
Below are some examples comparing results from Google.com to one of its data centers:
But it's not always an increase:
Want to play? Here are the data centers to try:
Anyway, as others have expected including myself and Gary, the script to this battle is sadly too clear. Google will obviously want to get above Yahoo's figure so it can say it's once again the "biggest."
The twist is that this time, folks won't be taking those numbers as they have before. The counts as I've said mean nothing, and I really am trying to finish off my piece to hammer home why this is so.
The real brave move by Google? Drop the count from the home page, so we can finally get beyond this nonsense.
Posted by Danny Sullivan at 12:28 PM | Permalink
I wrote earlier of the dispute over index size between Google and Yahoo in my Screw Size! I Dare Google & Yahoo To Report On Relevancy post. Over the past week or so, Gary Price and I have had several conversations with both sides and have been working on a piece to come out hopefully later this week. We're only some of the many analysts that both sides have been lobbying to explore the claims and issues more. John Battelle's been hit, as has Charlene Li. Today, Charlene gives an excellent update on her views in Why search index size no longer matters.
I'm right with her. Many of the things she talks about are exactly the same points I'm making in my own piece that's underway. In short, the self-reported figures present great difficulty in auditing for accuracy, further reducing their utility as a measure of how good a search engine may be. In particular, I've written before on how size is mistakenly used as a surrogate for relevancy. My next piece will explore how sheer size isn't even a good surrogate for comprehensiveness. More to come.
Posted by Danny Sullivan at 7:49 AM | Permalink
Last week, prior to the Yahoo index size announcement, we were notified that Google was planning to increase the size of their web index. To be clear, the new number that follows is NOT in response to the recent Yahoo size increase. In case you haven't noticed by now and care, the current total index size listed on the Google home page is: 8,168,684,336. Previously, the total listed was: 8,058,044,651. It's an increase of about 110 million pages. Google also upped their image database total last week. Like we've said many times, these numbers are only claims and mean little (except in the PR/bragging rights area), especially when comparing total database size with other engines, since we're likely not comparing apples with apples. I posted about this last week. Danny's blog post from last November (when Google issued its 8.0 number only hours before MSN said their index was indexing about 5 billion pages) also discusses this topic.
Posted by Gary Price at 3:19 PM | Permalink
Ah, summer. Time to play on the beach, head out on vacation and if you're a search engine, announce to the world that you've got the largest index. -- Search Engine Size Wars & Google's Supplemental Results, Search Engine Watch, Sept. 3, 2003
The quote above is from an article I wrote after Google and AllTheWeb played a game of "who's biggest" in August 2003. They'd done the same thing in August 2002. Now here we are in August 2005, and it's another spat over size once again, this time between Yahoo and Google.
I cannot believe we're going through this again. This is Search Engine Size Wars VI, by my count. It's absurd. It's annoying. It's a friggin' waste of time. Instead of advancing to a commonly accepted relevancy figure, the search engines want to keep us mired in the mud of who's biggest.
Who's biggest really doesn't matter, as I and others have written so, so, so, so, so many times before. Reasons? There are many. How about...
Pick your metaphor, your explanation, your qualification (Gary gives you even more here) -- we've been through this all before.
Nothing has changed. Size hasn't suddenly gotten more important overnight. What has happened is for the first time, one search engine is strongly disputing the claims of another. Google doesn't believe the figures Yahoo is bandying about, as Gary covered earlier. Yahoo has been steadfast that it's not lying.
Well let's do some testing! Let's come up with some standards! Let's audit the figures! Yeah, let's do that. After all, it's been discussed since 1999, when Northern Light wanted to say definitively that it was biggest. Surely it's time for that to happen, right?
No, it's not. If the search engines are all going to come together to figure out a standard on something, move forward! Move forward! Pull it together and unite to come up with a way to test relevancy! That's what matters, not this squawking and time wasting over size.
In Search Of The Relevancy Figure from me in 2002 looks at the need for a relevancy figure and how without it, we'll continue to have search engines use surrogates such as size for relevancy:
A relevancy figure would also free us from search engines playing the "size card" or the "freshness card" to quantify themselves as better than the competition. Yes, having a large index is generally good. Yes, having a fresh index is desirable. However, neither of these stats indicates how relevant a search engine is. Nevertheless, the search engines keep pushing them at us, and in particular at journalists, in an effort to trump their competitors.
Here we are in 2005 and what's happening? Size is pushed again in our faces. Sure, Yahoo didn't do a release on it. But it knew exactly the reaction it would get by announcing via its blog that it was twice as big as Google. And Google? The company has pulled out all the stops in lobbying us at Search Engine Watch along with other analysts to poke hard at the Yahoo numbers, because it doesn't want to be seen as "second best" in any area.
The irony is deep. Google has never provided any proof when it trumped others on the size front. MSN says it's at 5 billion in November? No problem -- Google magically announces on its home page that it's at 8.1 billion. While MSN didn't seriously question that Google was larger than it, plenty of other rumblings went around that the count might not be correct. But since it had trumped everyone else, Google apparently didn't feel the burning concern it now has that size should somehow be verified. Sure, maybe Yahoo isn't at 19 billion. But maybe Google isn't at 8 billion, either.
This game is going to go on and on until someone is brave enough to change the rules. I'm daring either of the leaders, Google or Yahoo, to do just that. Both of them say that size is one of only many factors to consider. Both of them tell you relevancy matters most. SO PROVE IT!
Ideally, I want to see the major search engines come together to develop a unified, accepted way to measure relevancy in various ways: web search, local search, advanced queries, whatever. Establish a research center, a consortium or something and a methodology that all will agree upon. Then test every four to six months and pledge you'll accept the results publicly. Someone wins? Kudos all around! Didn't win? Then do better next time.
That's the challenge. Let's see if someone steps up. As for size -- yes, Gary and I will revisit the various claims and counter-claims in more depth later this week. In the meantime, some past reading on the subject of size and the complications in measuring it:
Posted by Danny Sullivan at 11:31 AM | Permalink
Gigablast has posted a new total size count. The new total listed on the Gigablast home page is 2,068,530,608 pages indexed. That's up from the 2,024,193,536 total that Gigablast posted in May. In addition to running the feature-filled Gigablast site, Matt and his team also provide web results to Snap.com.
Posted by Gary Price at 1:50 PM | Permalink
The total web database size claims that Yahoo released this week continue to have people talking. It's so not a big deal, at least for me, the searcher. For Gary, the search industry watcher, it's interesting to see another round of database size wars up and running, but it's still not a big deal in the searching sense. We've been through this before. What Total Size Wars 2005 illustrates is that pr/bragging rights and mindshare are so crucial in today's search business.
Yesterday, Google went on the record with John Battelle:
"Our scientists are not seeing the increase claimed in the Yahoo! index. The data we have doesn't support the 19.2 (billion page) claim and we're confused by that."Both Google and Yahoo officials have also talked to the Search Engine Watch team. In fact, during the GoogleDance the other night, Danny, Chris, and I spent about 90 minutes chatting and looking over some of the same reasearch that they also shared with Battelle. I'm sure Danny will have more to say about our meeting next week.
Total size battles are nothing new. For example, back in the summer of 2003, we had something similar go down with total size claims between Google and AllTheWeb. Of course, the search biz in 2003 wasn't what it is today.
So, what are possible next steps or is this something that will be repeated over and over again and not just between Google and Yahoo?
First, remain calm, all is well. Enjoy the weekend.
Second, if total size claims are so important to Yahoo, Google, and others, how about these companies sitting down and agreeing to have an independent third party audit and certify future size claims? I just wonder if each company would be willing to disclose the info a third party would need to make accurate verifications. Btw, in the short term, I'm also hoping that noted search engine expert Greg Notess will run some tests and offer his search size estimates.
Again, do total size numbers mean anything in the first place to the searcher? No. We all know what does matter. However, as we've seen this week, this number sure means something in the pr/marketing/branding/press coverage game.
Those of you who decide to do your own size tests need to remember that without knowing precisely what each company is counting, it's very difficult to measure apples with apples.
For example, to get accurate total size numbers it would be useful to know how each engine handles the following and other variables:
Also, attempting to make estimates simply by running a bunch of searches on both engines and only looking at total page estimates is not productive. The page estimates listed at the top of a web results page are not accurate, especially as a measurement tool. To get total page counts you're going to have to literally count each and every result and know how each engine handles the variables listed above (and others). Very time consuming.
Googlewhacking with Yahoo
Since a Googlewhack is only a Googlewhack if it returns just one result, I thought it would be interesting, but far from scientific, to see if that one result would or wouldn't appear if you ran the same Googlewhack-producing search with Yahoo. Would more results appear? Fewer? Since these searches produce just one result, counting would likely be easy. I selected 20 recent "whacks" from the current Googlewhack stack.
I'll leave the interpretation, if any, up to you. For me, the following was just a fun exercise and proves nothing. Btw, I wonder if Googlewhack founder, Gary Stock, and his crew of "whackers" are going to start Yahoowhacks?
Results
A Googlewhack equals a Google search producing one result.
+ 10 Googlewhacks were not found (zero results) in Yahoo.
+ 6 Googlewhacks were found in Yahoo. In other words, the same single result at both Yahoo and Google.
+ 4 Googlewhacks found more than one unique result at Yahoo.
++ 3 Googlewhacks searched with Yahoo found one additional result.
++ 1 Googlewhack searched with Yahoo found 6 additional results.
Specific Searches
+ tartiest dieing Not found in Yahoo
+ intergalactically janitorial Not found in Yahoo
+ icebreaking snaggletooth Not found in Yahoo
+ poboys moneybag Not found in Yahoo
+ pangea anthropocentrically Not found in Yahoo
+ bedtimes downshifter Not found in Yahoo
+ obverse tartiness Not found in Yahoo
+ hubristic sweatsuits Not found in Yahoo
+ overload underkills Not found in Yahoo
+ tailgated winnebagoes Not found in Yahoo
------------------------------------
+ supercharged disestablishmentarianism 2 results found in Yahoo. One unique
+ wildebeest colonoscopies 7 results found in Yahoo, Six unique
+ fictionizing rumsfeld 2 found in Yahoo. One unique
+ arachnophobic swashbuckler 2 results, One unique
------------------------------------
+ semipublicly popularized Same result found in both
+ gifting twoonies Same result found in both
+ cruddiness pretentiousness Same result found in both
+ congratulating schoolchilds Same result found in both
+ fabulator marsupial Same result found in both
+ overaggressively tapped Same result found in both
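For anyone who wants to re-check the tallies, here is a minimal Python sketch that recounts the three outcome groups from the 20 searches listed above. The result counts are transcribed from that list; the category labels and function names are made up for illustration, and reading a count of 1 as "same result in both" relies on what the list reports rather than on anything the code verifies.

```python
from collections import Counter

# Yahoo result counts for the 20 Googlewhack queries listed above.
# 0 = not found in Yahoo, 1 = the same single result Google returned,
# more than 1 = the Google result plus additional unique results.
YAHOO_RESULTS = {
    "tartiest dieing": 0,
    "intergalactically janitorial": 0,
    "icebreaking snaggletooth": 0,
    "poboys moneybag": 0,
    "pangea anthropocentrically": 0,
    "bedtimes downshifter": 0,
    "obverse tartiness": 0,
    "hubristic sweatsuits": 0,
    "overload underkills": 0,
    "tailgated winnebagoes": 0,
    "supercharged disestablishmentarianism": 2,
    "wildebeest colonoscopies": 7,
    "fictionizing rumsfeld": 2,
    "arachnophobic swashbuckler": 2,
    "semipublicly popularized": 1,
    "gifting twoonies": 1,
    "cruddiness pretentiousness": 1,
    "congratulating schoolchilds": 1,
    "fabulator marsupial": 1,
    "overaggressively tapped": 1,
}


def classify(yahoo_count):
    """Map a Yahoo result count to one of the three outcome groups."""
    if yahoo_count == 0:
        return "not found in Yahoo"
    if yahoo_count == 1:
        return "same result in both"
    return "more results in Yahoo"


def tally(results):
    """Count how many queries fall into each outcome group."""
    return Counter(classify(n) for n in results.values())


if __name__ == "__main__":
    for outcome, count in tally(YAHOO_RESULTS).items():
        print(f"{outcome}: {count}")
```

Like the exercise itself, this only counts what was observed; it says nothing about the total size of either index.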
Notes: I'll run a new random Googlewhack test next week and report if I find anything different or interesting. Also, no word on whether I was searching the "new" larger Yahoo web database. OK, that's it. Remember, TGIF!
Postscript: I just noticed that John posted a few more comments on his blog including the following. He writes, "Would I be surprised if Google announced shortly that its index was magically up to, oh, 22 billion or so? No, I would not." I agree with John. Total size numbers from all engines are just claims. They've always just been claims. To move beyond this, some type of agreed upon standards and methods are needed. Otherwise, this week's headlines will likely happen over and over again.
Posted by Gary Price at 11:14 PM | Permalink
If you asked me yesterday after Yahoo's total index size announcement (web, images, audio databases) what I would be posting today I would have said that Google would post a new total size to one or more of their databases. I would have been correct.
Google has now posted a new total image size count on the Google Images home page.
The new total listed is 2,187,212,422, up from 1,305,093,600. Yes, that's an increase of about 68 percent. In yesterday's announcement Yahoo said their image database currently contains 1.6 billion images.
Remember, Yahoo and Google's numbers are just claims and they're more about bragging rights and keeping the buzz going, especially in the non-search community. Many other factors (relevance, freshness, etc.) are what really matter.
Posted by Gary Price at 8:03 PM | Permalink
Looks like we might have a search engine total size war beginning.
As I've said in the past, total size numbers are primarily used for marketing purposes, bragging rights if you like. Michael Liedtke from the AP reports that Yahoo is announcing a total size count. The number Yahoo is announcing is 20 billion "web objects." The number is a combination of total web pages and total images.
Yahoo said its index, boosted by a recent upgrade, covers 20.5 billion online "objects," comprised of about 19 billion documents and 1.5 billion images. By comparison, Google said it tracks 11.3 billion objects.
Tim Mayer points out on the Yahoo Search Blog that the "total objects" number also includes more than 50 million audio files.
This is the first time Yahoo has publicly announced a total web count. However, they have announced total image and audio file counts in the past.
Interesting numbers but don't get carried away with them. Yahoo will have "the largest" bragging rights until (I would bet) Google announces a larger number. Then, it will be Yahoo's turn again. Will MSN join in the fun? What about Ask Jeeves? And so it goes. What really matters is relevance and other metrics. Hat tip to Tim Mayer for not forgetting this important point and mentioning this in his Yahoo blog post.
I want to make clear that while Yahoo's total size number is just a number (a claim, really, since size numbers from just about all web engines are difficult to verify), Yahoo Search does deserve lots of credit for building some first-rate products over the past couple of years. That means web search, but also several excellent specialty indexes including image search, audio search (here's my overview), and a great news search engine. Yes, competition means good things for the searcher. (-:
Don't forget that very often smaller, focused web databases are also very capable of providing excellent results. Finally, since many searchers only look at the first few results, just because a page is listed somewhere in a results set doesn't mean it will be seen. Again, this is why relevance is so important. Maybe the Invisible or Deep Web in 2005 is everything beyond the first 10 results?
Postscript: As web engines grow larger, searchers would be doing themselves a favor by learning to take advantage of some of the many advanced features web engines offer, which could do wonders in providing more precise queries and more relevant results. A little learning can go a long way.
Posted by Gary Price at 7:43 PM | Permalink
Every day we read estimates of the total number of blogs and feeds out there. Of course, we rarely get solid definitions of just what a blog is. Does every feed belong to a blog? Do blogs or feeds that haven't been updated in x amount of time count? Do all the sites that post totals use the same criteria? I'm sure you've asked these and other questions. Just like the total database sizes that we see from some web engines, total blog and feed numbers are primarily marketing tools.
Jim Lanzone, Senior Vice President of Search Properties at Ask Jeeves, has just posted some interesting numbers and graphs on the Ask.com Blog that reveal the total number of feeds that have at least one subscriber who accesses the feed with Bloglines.
Lanzone believes this is a more accurate number of the total amount of feeds since someone has taken the time to subscribe to it. He calls these, "feeds that matter."
According to Bloglines members around the world, 1,121,655 feeds "matter" to date. Note this includes only content feeds tracked, and not topics tracked via "saved" or "persistent" searches using the Bloglines service.
Findory's Greg Linden adds an excellent comment to the post saying that a feed might need more than a single subscriber to really "matter." He thinks 20 subscribers might be a better number to use. I think Greg makes an excellent point. Lanzone promises more breakdowns in the near future. I would also like to see how many of these 1 million plus feeds are updated at least once or twice a month.
I'll add that in some cases Bloglines has more than one feed listed for the same blog. I can speak from experience on this one since Bloglines currently lists seven feeds (one official, others unofficial, several broken) for my ResourceShelf site. All of these feeds have at least one subscriber.
Bottom Line? This post is worthy of your attention and, at the least, helps to provide a more realistic idea about the number of feeds out there.
Posted by Gary Price at 8:58 PM | Permalink
I've been having a series of email conversations with Tristan Louis, who has been trying to understand how well "A-List Bloggers" do in the major search engines based on links. The problem, I've been explaining, is that links and various other counts don't paint the picture you might expect. With his permission, I'm sharing much of our correspondence below. It's not as nice and neat as writing this all up as an article, but time doesn't allow for that. Hopefully you'll find it interesting to see all the complexities involved and why it's difficult to draw any conclusions. Remember, this was all email, so neither of us was particularly watching punctuation, capitalization or spelling!
June 14
Danny: Saw the article on Technorati versus Google links. Here's the problem. Google's only showing you a small percentage of the actual links it knows about. Go do a link count on Yahoo, and you'll see the link numbers are much larger than what Google shows you. Google does this on purpose to thwart search marketers looking to mine search data. As a result, your link percentages for Technorati are much higher than what Google really knows.
Tristan: Good point... That explains why my data set on Yahoo! seems so wildly out of proportion (that's for my next entry ;) )
June 21
Tristan: I thought the following might interest you and even possibly your readers: Technorati, Yahoo and Google Too. From the article:
In the last entry on the subject, we took a look at how Technorati and Google compared. From there, we discovered that Technorati was getting roughly a fourth of the links Google could locate. Which brought up some interesting questions: could we rely on the Google numbers? Were they so much larger than any other search engine that we were building an unfair comparison? And, as some alert readers pointed in email, was Google under-reporting the number of links to a site? In order to answer some of those questions, I decided to build some more comparisons. So I decided to take a look at some of Google's competitors. Today, I'll go into how Yahoo! fared (Hint: I was surprised by the results).
Danny: Tristan, I don't really see how you are making your conclusions at the end from this. It's probably easiest if I run down the conclusions.
Yahoo! generally does a better job at indexing the blogosphere than Google does. We know they have been working hard to improve their index and here's proof that they are getting results
What proof? You are counting links that Yahoo has and comparing to Google links, which you know aren't all the links that Google knows about. So all the math is meaningless. Google may have MORE links than Yahoo, but you can't tell this.
Much more important, number of links does not equal number of pages indexed. If you want to measure indexing, you have to do a site: search, such as:
http://search.yahoo.com/search?p=site%3Aboingboing.net
http://www.google.com/search?q=site%3Aboingboing.net
In that, you see Yahoo has 137,000 pages indexed versus Google's 71,000 pages. On the face of that data, Yahoo seems to go twice as deep. But:
http://search.yahoo.com/search?p=site%3Awww.wilwheaton.net
http://www.google.com/search?q=site%3Awww.wilwheaton.net
Now Google isn't as far back.
More important, what's getting indexed? Could Yahoo be indexing the same page over and over but under slightly different URLs? Could Google? These types of issues plague making use of search counts to prove anything.
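To repeat the site: check for other domains, the example links above can be rebuilt programmatically. This is only a sketch: the base URLs and parameter names ("p" for Yahoo, "q" for Google) are copied from those links, the function name is made up, and nothing here fetches or parses the result counts, which would still need to be read off the results page by hand.

```python
from urllib.parse import urlencode

# Base URLs and query parameter names taken from the example links above.
# Whether these endpoints still behave the same way is not checked here.
ENGINES = {
    "yahoo": ("http://search.yahoo.com/search", "p"),
    "google": ("http://www.google.com/search", "q"),
}


def site_query_url(engine, domain):
    """Build the URL for a site: search of `domain` on the given engine."""
    base, param = ENGINES[engine]
    # urlencode percent-encodes the colon, giving site%3Adomain as in the links above.
    return f"{base}?{urlencode({param: 'site:' + domain})}"


if __name__ == "__main__":
    for engine in ENGINES:
        print(site_query_url(engine, "boingboing.net"))
```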
Even if Google is the one with the motto about not doing evil, Yahoo! seems to be the one interested in giving equal opportunity to the little guy: smaller blogs seem to have a better chance of being recognized by Yahoo! than they do of being recognized by Google
That means nothing, either. Let's say Google really does only index say half the pages that Yahoo does. Now when you do a search, does Google still manage to bring the little blogs up while Yahoo doesn't? Google's come under accusations of being "blog clogged" in the past, and now you're suggesting that it's almost unfair to blogs. On what searches? Having pages means nothing if those pages don't surface into the first page of results. Run comparative queries on something where you think a little blog should surface at both places and see what happens.
While the front page of Google advertises they are currently indexing over 8 billion pages, it is very difficult to find ways to support that claim via the link feature they are offering: this can be seen as confirmation that Google does not tell you about all the links it has in its index.
The link: command is completely different than the site: command. The link command tells you nothing about the size of the index. As for a confirmation that all links aren't reported, this past blog post from SEW gives you confirmation and this page on Google mentions links are only a sampling of what Google knows although this other Google page fails to make this clear.
As for confirming the number on the page, how much time have you got to go into issues about that? Start here to understand the minefield of search index sizes.
Sure volume counts but in the case of search indexes, they may count against sites: if one is less likely to appear in Google than it is to appear in Yahoo! and the Google index is much larger than the Yahoo! one, then, if Yahoo! and Google had the same amount of traffic, a single blog could find itself receiving more traffic from Yahoo! than it does from Google. This would be due to the fact that each individual page in Yahoo! has more weight than it does in Google.
So for all we know, Yahoo has as many pages or more than Google has. They don't report the number. We do have one recent estimate that puts them a bit lower. But more pages would mean less visibility only if ranking were purely random. It's not. What's going to rank well depends on the number of pages on a particular topic, plus the linking data and in particular whether the links are relevant in terms of anchor text. It's not just a pure popularity play.
I applaud what you are trying to do. It's just that it's difficult to draw any conclusions from what I've seen presented.
Tristan: Thanks for the feedback... Let's go through a little polemic on this... (Tristan quotes what I've written above, which I've shown as italicized indented text, then poses new questions).
What proof? You are counting links that Yahoo has and comparing to Google links, which you know aren't all the links that Google knows about. So all the math is meaningless. Google may have MORE links than Yahoo, but you can't tell this.
A quick question here: if they do have more links, why are they not advertising them? It seems odd that someone would possibly claim to have more of something and, upon closer inspection, would report less. It's as if I said that I had a billion visitors a month and, when someone examined my logs, they found only a few hundred thousands. Would you trust me if I then said, "well, you know, we don't report on all the visitors"
More important, what's getting indexed? Could Yahoo be indexing the same page over and over but under slightly different URLs? Could Google? These types of issues plague making use of search counts to prove anything.
An important point here, though, is that there still seems to be a difference. Yahoo! does seem to index more pages in either of the cases you demonstrated. I understand that duplicates might be dropped but shouldn't they at least be listed in the raw number? I mean Google provides you with an option to see them (the "In order to show you the most relevant results, we have omitted some entries very similar to the 987 already displayed. If you like, you can repeat the search with the omitted results included." message).
So what do the numbers from Google mean? With omitted results, I can't get past 1000, without I can't get past 987... How do we know that the 9500 pages number is correct?
Let's say Google really does only index say half the pages that Yahoo does. Now when you do a search, does Google still manage to bring the little blogs up while Yahoo doesn't? Google's come under accusations of being "blog clogged" in the past, and now you're suggesting that it's almost unfair to blogs. On what searches? Having pages means nothing if those pages don't surface into the first page of results. Run comparative queries on something where you think a little blog should surface at both places and see what happens.
Well, what I'm saying here is that Google may not be as blog clogged as Yahoo! is, if claims on the size of indexes are correct (remember we're all assuming the claims are correct... no one ever challenged that assumption until now)
As for confirming the number on the page, how much time have you got to go into issues about that? Start here to understand the minefield of search index sizes.
But since it's a major marketing tool (as in "our index is bigger"), shouldn't someone investigate this stuff. Maybe we need some audits on all the major search engines in order to see if the claims are correct.
What's going to rank well depends on the number of pages on a particular topic, plus the linking data and in particular whether the links are relevant in terms of anchor text. It's not just a pure popularity play.
I agree there are many factors in terms of rankings. However, wouldn't a page in an index of 100 pages have more of a chance (1/10) of appearing in the first 10 results (i.e., on the first page) than a page in an index of 1000 pages (1/100 chance), all things being equal? So, if you start with this, all things being equal, if the Google index is much larger, then the chances for a blog to appear on the front page are lower than they would be on Yahoo.
I applaud what you are trying to do. It's just that it's difficult to draw any conclusions from what I've seen presented.
No problem... Any kind of input is good. Basically, I managed to get a set of numbers and want to get other people to start playing with them (400 data points across 4 indexes (MSN is next in line) ). I can't help but feel like no one has actually attempted to do this kind of side by side mathematical comparison. I was hoping someone would and, when no one else went out and did it, I decided to undertake it.
Please provide information as to how to do this properly. Maybe someone will be able to then go and get the data in a way that's more in line with what you think is right (I'm a neophyte in that space and my blog is something I do for fun so a REAL analysis would be better :) ). I'd love to see an expert do an analysis on this (... and I wish there were an automated way to get to the data, it took me a long time to gather all the raw numbers :) )
Danny: (As above, I quoted parts of what Tristan was asking, before responding. I've shown those quotes italic font and indented)
A quick question here: if they do have more links, why are they not advertising them? It seems odd that someone would possibly claim to have more of something and, upon closer inspection, would report less. It's as if I said that I had a billion visitors a month and, when someone examined my logs, they found only a few hundred thousands. Would you trust me if I then said, "well, you know, we don't report on all the visitors"
OK, first, remember that the number of links isn't the same as the number of pages. Google knows about far more links to pages than actual pages it lists.
What does it advertise on the home page? Pages that it has indexed, not links it knows about. And no one is really suggesting that that number is super inflated. If anything, people tend to wonder if they are undercounting.
Now to links. Why aren't they showing all the links they know about? Because they fear site owners and marketers will take that data and figure out some way to better manipulate Google. It's also query intensive to generate that data, so it has no great interest in trying to do a great job there. It's run by relatively few people.
I'm not saying it's good that they do this, by the way. I think if you're going to offer a command, the command ought to work as you'd expect, and show everything. But the point is, you can't trust those numbers to do what you're trying to do.
Yahoo! does seem to index more pages in either of the cases you demonstrated. I understand that duplicates might be dropped but shouldn't they at least be listed in the raw number? I mean Google provides you with an option to see them (the "In order to show you the most relevant results, we have omitted some entries very similar to the 987 already displayed. If you like, you can repeat the search with the omitted results included." message).
Yes, in the two cases I checked. That's not enough to be confident of anything. If you do want to play the numbers game, investigate all 100 of the sites on the list.
So what do the numbers from Google mean? With omitted results, I can't get past 1000, without I can't get past 987... How do we know that the 9500 pages number is correct?
We don't. And we don't necessarily for Yahoo. See the reference material I sent you, if you want to understand what a real challenge you're just dipping your toe into.
The best way to know is to find a site small enough that you can actually review all 1,000 or less pages that will be displayed, then literally count and see if duplicate and other junk is being eliminated. In lieu of that, you can go with the raw count figures and hope that this other stuff isn't going on.
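The counting step could be sketched roughly as follows in Python. Everything about the normalization is an assumption made for illustration (ignoring letter case in the host, a leading "www.", trailing slashes, query strings and fragments); real duplicate detection would need to compare page content, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Collapse trivially different URLs for the same page (hypothetical rules)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    # Query strings and fragments are dropped. That is an assumption, and it
    # will wrongly merge pages on sites that use query parameters for content.
    return host + path


def count_unique(urls):
    """Return how many distinct pages remain after normalization."""
    return len({normalize(u) for u in urls})


if __name__ == "__main__":
    listed = [
        "http://example.com/post/1",
        "http://www.example.com/post/1/",
        "http://EXAMPLE.com/post/1?ref=rss",
        "http://example.com/post/2",
    ]
    print(len(listed), "listed,", count_unique(listed), "unique")
```

Comparing the listed total with the unique total for a small site gives a rough sense of how much duplication sits behind a raw count.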
Well, what I'm saying here is that Google may not be as blog clogged as Yahoo! is, if claims on the size of indexes are correct (remember we're all assuming the claims are correct... no one ever challenged that assumption until now)
The claims of blog clogging have come from the idea that blogs have better ranking power, not that they've got more pages indexed. That goes to the main point. Number of pages means little. What are you finding when you actually search? If I have 100,000 pages from your site and 10 from another, no great help to you if your 100,000 pages are deemed not good enough and never rank well.
But since it's a major marketing tool (as in "our index is bigger"), shouldn't someone investigate this stuff. Maybe we need some audits on all the major search engines in order to see if the claims are correct.
People have. My Search Engine Sizes page documents this type of stuff in great detail, efforts that have happened over the years. It's neither a new issue nor an easy one to solve. Here's a recent and fairly short summary of what's involved: Search Engine Size Wars V Erupts.
I agree there are many factors in terms of rankings. However, wouldn't a page in an index of 100 pages have more of a chance (1/10) of appearing in the first 10 results (i.e., on the first page) than a page in an index of 1000 pages (1/100 chance), all things being equal? So, if you start with this, all things being equal, if the Google index is much larger, then the chances for a blog to appear on the front page are lower than they would be on Yahoo.
Things are not equal. Search results are not like a lottery. Every page is different. Every page is going to be slightly better or worse for a particular query. Linkage data skews things even more. There is no level playing field out there, where just number of pages gives you a better chance. Yes, you have more actual chances, but it's still not a case that it will skew in your favor.
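The difference between the two views can be made concrete with a small Python sketch. The first function is the "lottery" arithmetic from the question (10 slots divided by index size); the second ranks pages by a made-up relevance score, which stands in for whatever signals an engine actually uses. The scores and names are purely illustrative assumptions.

```python
from fractions import Fraction


def lottery_chance(index_size, page_size=10):
    """Chance of landing on the first page if results were drawn uniformly at random."""
    return min(Fraction(1), Fraction(page_size, index_size))


def on_first_page(scores, page, page_size=10):
    """True if `page` is among the top `page_size` pages when ranked by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return page in ranked[:page_size]


if __name__ == "__main__":
    # The lottery view: 1/10 for a 100-page index, 1/100 for a 1000-page index.
    print(lottery_chance(100), lottery_chance(1000))

    # The ranking view, with made-up scores: page0 scores lowest in the small
    # index, page999 scores highest in the large one.
    small_index = {f"page{i}": i for i in range(100)}
    large_index = {f"page{i}": i for i in range(1000)}
    print(on_first_page(small_index, "page0"))
    print(on_first_page(large_index, "page999"))
```

Under the ranking view, the weak page misses the first page of the small index while the strong page makes the first page of the index ten times larger, which is the sense in which index size alone settles nothing.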
No problem... Any kind of input is good. Basically, I managed to get a set of numbers and want to get other people to start playing with them (400 data points across 4 indexes (MSN is next in line) ). I can't help but feel like no one has actually attempted to do this kind of side by side mathematical comparison. I was hoping someone would and, when no one else went out and did it, I decided to undertake it.
People have to some degree, as the stuff I've previously sent points out.
Honestly, skip the numbers. It's the results. You want to measure how well blogs do on search engines, pick queries, do the searches and see who comes up. That's the very best test you can do.
NOTE TO READERS: I've put that last part of my correspondence in bold for a reason. It's probably the most important point in all of this. Look at queries, not counts, to measure how well things are working.
Tristan: (quoting Danny in responses, those quotes in ital indented copy)
Remember that the number of links isn't the same as the number of pages. Google knows about far more links to pages than actual pages it lists.
That makes sense since most pages will generally have more than 1 link.
What does it advertise on the home page? Pages that it has indexed, not links it knows about. And no one is really suggesting that that number is super inflated. If anything, people tend to wonder if they are undercounting.
It's true. However, it would be nice to see an actual audit of those indexes to see what the numbers really are.
Now to links. Why aren't they showing all the links they know about? Because they fear site owners and marketers will take that data and figure out some way to better manipulate Google. It's also query intensive to generate that data, so it has no great interest in trying to do a great job there. It's run by relatively few people.
But what does it present as a # when I do a link: search? What is that number? That's the question I'm trying to pose (albeit maybe not clearly enough). When Google says "Results 1 - 10 of about XXXXXXXX linking to foo.com" what does that number mean? That data is being generated (whether it's query intensive or not is the problem of the search engines: if they're going to display something, they better make sure it's correct). Furthermore, I don't think it would be any more intensive to show all the links (since each query is only for 10 to 100 results per page max): the processing power is such that it would be the same for each page anyways since the tough part of the processing is in the ordering and it is being done in that way for the pages it shows.
I'm not saying it's good that they do this, by the way. I think if you're going to offer a command, the command ought to work as you'd expect, and show everything. But the point is, you can't trust those numbers to do what you're trying to do.
If you can't trust them, why are they even offering them, then? Wouldn't it make more sense for them to not display that info? I think there are quite a few people working at Google on the UI and generally, they do not throw information on that screen just because it looks pretty. So what is that number? If the agreement is that the number is meaningless, why is it there?
If you do want to play the numbers game, investigate all 100 of the sites on the list.
OK.. let's try the top 5 then (you had boingboing so I'm doing the other 4 :) )
Instapundit:
http://search.yahoo.com/search?p=site%3Ainstapundit.com&prssweb=Search&ei=UTF-8&fl=0&x=wrt 58,300
http://www.google.com/search?hl=en&lr=&biw=1024&q=site%3Ainstapundit.com&btnG=Search 80,300
Daily Kos:
http://search.yahoo.com/search?p=site%3ADailyKos.com&prssweb=Search&ei=UTF-8&fl=0&x=wrt 19,000
http://www.google.com/search?hl=en&lr=&biw=1024&q=site%3ADailyKos.com&btnG=Search 682,000
Gizmodo:
http://search.yahoo.com/search?p=site%3AGizmodo.com&prssweb=Search&ei=UTF-8&fl=0&x=wrt 195,000
http://www.google.com/search?hl=en&lr=&biw=1024&q=site%3AGizmodo.com&btnG=Search 38,100
Fark:
http://search.yahoo.com/search?p=site%3AFark.com&prssweb=Search&ei=UTF-8&fl=0&x=wrt 1,940
http://www.google.com/search?hl=en&lr=&biw=1024&q=site%3AFark.com&btnG=Search 1,030,000
Hmmmm.... Looks like I'm going to have to extend the data set, this looks all over the place :)
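To put a rough number on "all over the place", one quick check is the ratio of Google's reported count to Yahoo's for each site. The sketch below uses only the four pairs of figures listed above; the variable names are illustrative, and the counts are the engines' own estimates, so this measures disagreement between the reports, not actual index size.

```python
# Reported site: counts from the searches above, as (Yahoo, Google).
counts = {
    "Instapundit": (58_300, 80_300),
    "Daily Kos": (19_000, 682_000),
    "Gizmodo": (195_000, 38_100),
    "Fark": (1_940, 1_030_000),
}

# Ratio of Google's reported count to Yahoo's for each site.
ratios = {site: google / yahoo for site, (yahoo, google) in counts.items()}

for site, ratio in ratios.items():
    print(f"{site}: Google reports {ratio:.2f}x Yahoo's count")

# How far apart the most and least Google-favoring sites are.
spread = max(ratios.values()) / min(ratios.values())
print(f"Largest ratio is {spread:.0f}x the smallest")
```

For Gizmodo the ratio falls below 1 (Yahoo reports more), while for Fark it is in the hundreds, so no single correction factor would reconcile the two engines' counts.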
The best way to know is to find a site small enough that you can actually review all 1,000 or fewer pages that will be displayed, then literally count and see if duplicate and other junk is being eliminated. In lieu of that, you can go with the raw count figures and hope that this other stuff isn't going on.
There's got to be a site of that small a size somewhere in the Internet.com network. Could you ask around internally? If you identify one, maybe we can get a group effort started on this. I figure if we throw it as a challenge in an SEO forum, we could get some good response.
If I have 100,000 pages from your site and 10 from another, no great help to you if your 100,000 pages are deemed not good enough and never rank well.
Number of pages does mean a lot in terms of marketing. It can also have an impact on results: the higher the number of pages, the higher the possibility that you have the best set of pages (hence the race to build bigger indexes).
People have [tried to audit sizes] to some degree, as the stuff I've previously sent points out.
Actually, what I'm asking for is independent confirmation of the numbers. The pages you sent me provide useful info about what the claims are but how do we investigate whether the claims are correct? How do we move from reported size figures to actual size figures?
Things are not equal. Search results are not like a lottery. Every page is different. Every page is going to be slightly better or worse for a particular query. Linkage data skews things even more. There is no level playing field out there, where just number of pages gives you a better chance. Yes, you have more actual chances, but it's still not a case that it will skew in your favor.
I know things are not equal. I'm just trying to establish a base line here. Think of it as dissection. Trying to get one piece sorted and then the next. Maybe we can learn something out of careful dissection of this type.
Honestly, skip the numbers. It's the results. You want to measure how well blogs do on search engines, pick queries, do the searches and see who comes up. That's the very best test you can do.
The results are interesting and there's a fair amount of research being done there (as you know and chronicle :) ) . What I'm trying to understand is how the indexes are built. It's definitely not as exciting to most people but it is important in the long run (I'm working under the assumption that crawling is not going to work in the long run as a way to keep a relatively fresh index)
June 22
Danny: (I'm making these responses to Tristan's last email as part of this post, rather than through email)
It's true. However, it would be nice to see an actual audit of those indexes to see what the numbers really are.
Sure, but the time and energy to focus on size numbers detracts from the real figure you want, a relevancy figure. The size marketing game comes around from time to time, as Search Engine Size Wars V Erupts explains. But overall, it's not worth the time to deconstruct. If Google is 8 billion and MSN is 6 billion, they are both BIG. The question isn't whether Google really has an extra billion or two. The question is whether it has massively more info indexed than MSN. On this scale, no. See also Search Engine Size Wars & Google's Supplemental Results for more on this and In Search Of The Relevancy Figure on how size is used as a surrogate for the real figure we need, a relevancy figure.
But what does it present as a # when I do a link: search? What is that number? That's the question I'm trying to pose (albeit maybe not clearly enough). When Google says "Results 1 - 10 of about XXXXXXXX linking to foo.com" what does that number mean? That data is being generated (whether it's query intensive or not is the problem of the search engines: if they're going to display something, they better make sure it's correct). Furthermore, I don't think it would be any more intensive to show all the links (since each query is only for 10 to 100 results per page max): the processing power is such that it would be the same for each page anyway, since the tough part of the processing is in the ordering and it is being done in that way for the pages it shows.
That number means the number of links Google chooses to display to you, a sample of all the links it knows about. It is correct -- just not correct in that you assumed it meant ALL the links it knows about. A disclaimer would be nice. After banging on them about this issue, they finally got at least a note about sampling on one of their help pages that webmasters read. As I said, the page searchers might read doesn't explain this. As for query power, yes, search engines commonly report that generating things like link lists takes a lot more work. For one thing, lots of people search on the same things every day, so common searches can come off of cached memory. But a link list? Do that, you're likely the first person that day to search for that set of links. You've got to go to disk and pull up the data anew, is my understanding from talking with them. They're still fast at it -- but if lots and lots of people did it, it would be a drain.
If you can't trust them, why are they even offering them, then? Wouldn't it make more sense for them to not display that info? I think there are quite a few people working at Google on the UI and generally, they do not throw information on that screen just because it looks pretty. So what is that number? If the agreement is that the number is meaningless, why is it there?
Because when search engines remove these numbers, they get complaints. That's one reason they've said. Also, it gives you some degree of feel for how much is out there. Also, Google did say "of about" with the numbers it reports. That's not an accident. They're saying that this is an estimate. But no disagreement from me. If you put up a count, it would be nice if the count was as accurate as possible. Google's have come under question. See Revisiting Google's Counts & Drops When Searching The Same Word Twice and Questioning Google's Counts. That latter article highlights a series of other articles on count issues, including just how historic an issue this is going back to problems with AltaVista. Also see Tim Bray's recent On Search: Sorting Result Lists.
There's got to be a site of that small a size somewhere in the Internet.com network. Could you ask around internally? If you identify one, maybe we can get a group effort started on this. I figure if we throw it as a challenge in an SEO forum, we could get some good response.
If people want to audit search sizes, they can start by visiting Greg Notess's Search Engine Showdown site, where he illustrates how to run a set of queries rather than doing site: searches to determine sizes. He's even been contracted in the past by search engines who wanted to prove an audit, as with Northern Light. Anyone can do what he's done. Even better, heck, people could just fund him to do a new study.
Number of pages does mean a lot in terms of marketing. It can also have an impact on results: the higher the number of pages, the higher the possibility that you have the best set of pages (hence the race to build bigger indexes).
Yes, after all, how can you find the needle in the haystack if you search only half the haystack? Wait! What if I dump the haystack on your head? Can you find the needle now, even though you have everything? That's how I've long tried to explain this issue when speaking or in this article, Search Engine Size Wars & Google's Supplemental Results. We do want index growth, but having an extra 1 billion pages almost certainly won't make your search for "britney spears" any better.
Actually, what I'm asking for is independent confirmation of the numbers. The pages you sent me provide useful info about what the claims are but how do we investigate whether the claims are correct? How do we move from reported size figures to actual size figures?
Fund Greg, test yourself in a new way, maybe lobby the search engines to come together on this. Auditing was big in 1999, back especially when Northern Light was annoyed at AltaVista's claims despite its timing out habit. See Who's The Biggest Of Them All? and Northern Light Claims Largest Index. As said, Northern Light funded one such audit. It is an issue, but I'd rather see them come together on a commonly accepted set of metrics on how relevant they are, if they're going to start anywhere. But maybe after writing on the size issue for almost ten years, I'm beaten down :) Really, it's more that it's not a huge issue to me and my coeditors because we don't look to size figures to know who is best.
And Now For Something Completely Different
While writing this up, I noticed that Technorati's David Sifry had been following Tristan's reports and posted some questions of his own in On Search Engine results comparisons: Where's the remaining 99.8% of the results?. So, I'll conclude with a quick answer to something he raised:
If you can only view 703 results of about 575,000, where are the other 573,297 results? That's only 0.2% of the search results that the estimate claims. Where's the missing 99.8% of the search results?
No major search engine lets you go beyond 1,000 results, last time I looked. This is something that's been in place for ages and ages and ages. Some key reference material:
Posted by Danny Sullivan at 11:09 AM | Permalink
Although they're just rough estimates that are primarily best used for marketing purposes (aka bragging rights), I guess it's worth noting that Google has just increased the public number it posts for the total number of images searchable at Google Images. The new total is 1,305,093,600 images, an increase of 117,463,600 from 1,187,630,000.
I would have thought that when Google posted an increase to the total images number it would have been greater than 1.5 billion. Why? This is the number that Yahoo used in an announcement about 4 months ago.
Posted by Gary Price at 6:58 PM | Permalink
Those of you who track the steady stream of blog, blogosphere, and RSS stats might want to add some new numbers from Bloglines that were released today to your files.
* John pointed out a few weeks ago that Bloglines is planning on launching an improved search tool sometime this summer.
Posted by Gary Price at 9:54 AM | Permalink
It has been ages since I've seen anyone try to estimate the size of the web. Now a new paper puts it at 11.5 billion pages or more, for January 2005. The Indexable Web is more than 11.5 billion pages has the details.
The paper from Antonio Gulli of Università di Pisa (who is also director of advanced products for Ask Jeeves) and Alessio Signorini of the University of Iowa estimates what percentage of the web is covered by each search engine. I've shown that in the chart below. (Please note that if you saw an earlier edition of this post, it didn't have the figures that were sent separately from the study.)
Search Engine   Self-Report. Size (Billions)   Est. Size (Billions)   Coverage Of Indexed Web   Coverage Of Total Web
Google          8.1                            8.0                    76.2%                     69.6%
Yahoo           4.2 (est)                      6.6                    69.3%                     57.4%
Ask             2.5                            5.3                    57.6%                     46.1%
MSN (beta)      5.0                            5.1                    61.9%                     44.3%
Indexed Web                                    9.4
Total Web                                      11.5
Earlier, I said the web was estimated to have 11.5 billion or more pages by the study. The "indexed web" refers to the part of that considered to have been indexed by search engines. That amount is estimated at 9.4 billion pages, or 82 percent of the entire web. The chart shows you what percentage of both the indexed web and the total web each search engine covers.
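As a quick sanity check on the chart, the "Coverage Of Total Web" column can be reproduced by dividing each estimated size by the 11.5 billion total. This is only a sketch using the figures above, with illustrative names. The "Coverage Of Indexed Web" column does not reduce to the same simple division by 9.4 billion (8.0 / 9.4 is about 85 percent, not 76.2 percent), so those figures are taken as the study reports them and not recomputed here.

```python
# Figures from the chart above, in billions of pages.
TOTAL_WEB = 11.5
INDEXED_WEB = 9.4
estimated_size = {"Google": 8.0, "Yahoo": 6.6, "Ask": 5.3, "MSN (beta)": 5.1}


def share(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


# Each engine's estimated size as a share of the estimated total web.
coverage_of_total = {name: share(size, TOTAL_WEB) for name, size in estimated_size.items()}

# The indexed web as a share of the total web (the "82 percent" above).
indexed_share = share(INDEXED_WEB, TOTAL_WEB)

for name, pct in coverage_of_total.items():
    print(f"{name}: {pct}% of the total web")
print(f"Indexed web: {indexed_share}% of the total web")
```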
OK, the first thing you wonder is whether any of the search engines are lying when they say how big they are. Google has claimed to have the biggest search index, with 8.1 billion pages.
The estimate shows that this is right on target -- off by a tiny amount, so no apparent deceit by Google, at least in the sense of overstating! The same is true for MSN and Ask Jeeves. Ask is actually estimated to have more than claimed, while MSN is right on target.
Yahoo doesn't provide an estimate of its index. The 4.2 figure is the last we have, from back in 2004, when it said it was comparable to Google. More on this in my past article, Search Engine Size Wars V Erupts. So the estimate we have now from this paper is nice, in that we finally have an updated sense of where Yahoo might be.
There are a ton of caveats to throw out. The estimates are for the "visible" web, URLs that search engines can easily reach. The "invisible" or "deep" web refers to content locked behind databases or other systems that search engines haven't extracted. We've had estimates that the deep web might be 500 billion pages, in the past.
Also, while some URL normalization was done by the study, it still seems like mirror or duplicate pages may have been counted. So while there may be a certain number of pages, the number of unique pages might be lower.
Finally, as we've repeatedly said, size should not be taken as a surrogate for relevancy. Having a ton of pages doesn't mean anything if you can't return the best pages in the top results. It is nice to know that a search engine has good coverage of the web, but it's only one of many factors to consider.
Still -- it's great to have some updated estimate of the web's size, as well as search coverage. For background on size issues, see my Search Engine Size Wars V Erupts from last November and some historic articles on the Search Engine Sizes page. Yes, I'm still planning to update figures there! But the reference material is all still valid, if you want to understand more on this subject.
Posted by Danny Sullivan at 7:56 AM | Permalink
Matt Wells over at Gigablast has had his web crawler really cranking lately. According to the Gigablast homepage the database now provides access to 2,024,193,536 pages. Previously, the total listed was 1.504 billion pages.
Posted by Gary Price at 10:48 PM | Permalink
Dogpile has released a significant upgrade to its meta search engine, allowing easy comparison of search results across the major search engines. Dogpile has also introduced a new comparison tool that visually illustrates search engine overlap (or lack thereof) in the top results for Ask Jeeves, Google and Yahoo.
In today's SearchDay article, Dogpile Enhances Meta Search, Offers Comparison Tools, I take an in-depth look at these new services, and also comment on some new research that quantifies search engine overlap and why it's important for both searchers and search marketers alike.
Posted by Chris Sherman at 1:39 PM | Permalink
Over the weekend, Gigablast increased their total page count by about 4 million pages, from 1.5 billion pages to 1.504 billion pages. I also noticed that Gigablast now provides access to the Open Directory (DMOZ) database.
Posted by Gary Price at 1:02 PM | Permalink
Those of you who track total index size (or at least what the engines tell us) might be interested in learning that Matt Wells and his team at Gigablast have posted an increase to their total.
Gigablast now lists 1,500,103,760 pages indexed on their home page. The total previously listed was 1,014,325,120.
Posted by Gary Price at 1:01 AM | Permalink
A new search engine "Web's Biggest" has come out claiming they are bigger than the other major search engines. Wow, rush on over! Don't waste your time.
First, I highly doubt the claim. The search engine provides no count numbers with its results, so there's no way to run comparisons. Doing comparisons always is problematic anyway, but counts are a basic starting point.
It does provide a page that purports to show how it is bigger than the others. Enter a number, and it supposedly generates a random list of sites that supposedly have no or few pages listed at Google, Yahoo and MSN.
Oddly, no matter what number I enter, I get the same sites listed. And the links showing results at the other search engines? They don't use the right commands to bring back accurate results. And when I do use the right command? Over at Google, I get signs that the sites may have been banned. For comparison purposes, this "proof" shows nothing.
But let's assume that this site really was bigger than the others. Time to roll out the trusty haystack analogy of why bigger is better. How can you find the needle in the haystack if "small" search engines hunt through only half of it? That's something we used to hear in the early days of the search engine size wars.
I have my own haystack response that I've long used in these situations. If I dump the entire haystack on your head, can you find the needle then?
Going back to this site, we get plenty of proof on why having the entire haystack is no help if you don't have a powerful magnet to pull the good needles to the top. A search for "movies" brings up a list dominated by porn sites (OK, I suppose they ARE movies). "Cars" brings up travel search engines and give away sites. "US patents" fails to find the US Patent Office.
All in all, I find a good use for the nofollow attribute for the first time. For more on size issues, see my recent Search Engine Size Wars V Erupts post.
Posted by Danny Sullivan at 7:56 AM | Permalink
Search engine counts are never something you should depend on, a topic we've discussed many times before. Still, if you're going to get a count, it's nice if it doesn't seem to change much or simply seem absurd depending on the query you do.
Google's counting has been shaky for ages. But the Web: Google's counts faked? article does a lot of math to find the counts have even more weirdness to them.
Over at our forums, the Impossible Counts thread discusses the article and also skips the math and looks at why searches you know should bring back fewer results nevertheless don't. Also see these related articles:
Posted by Danny Sullivan at 1:49 PM | Permalink
News that Exalead's database of web pages has passed the one billion page mark. The total page count on the home page now reads: 1,031,065,733 pages.
If you've never tried Exalead, I think it's more than worth a look. I blogged an overview post focusing on a few of its numerous advanced search features back in October. A couple of weeks ago I posted about personalizing the Exalead home page.
Exalead's Paris-based CEO, Francois Bourdoncle, tells me that the company plans to have a two billion page web index online in the near future. He also said that his company is just about ready to introduce a desktop search tool.
Posted by Gary Price at 2:33 PM | Permalink
A congrats and kudos goes out to Matt Wells (and his team) as the Gigablast web index passes the one billion page mark. The official number listed is: 1,014,363,952. Previously, Gigablast was using a total page count of about 640 million pages.
Gigablast has been a very busy place lately. In the past few weeks they've launched several new services.
If you're interested in learning more about Matt Wells and Gigablast, take a look at this interview he did with Infoseek founder and Matt's former boss, Steve Kirsch.
Posted by Gary Price at 1:19 PM | Permalink
As Chris blogged, Google has raised the stakes in the search engine size wars by claiming an index of 8 billion pages. Microsoft had planned to seize the title of biggest search engine by announcing 5 billion pages indexed today. That would have put it above the 4.2 billion mark Google has self-reported for about a year.
We've been through these size wars before. They erupt any time a search engine seeks some type of concrete evidence that it is better than another. Size figures don't "prove" this at all, of course. A search engine with lots of pages might actually be worse than one with fewer, if the index isn't refreshed often or if the relevancy simply isn't there.
My Search Engine Sizes page lays out the past size wars we've had, for the curious, along with plenty of reference material and past articles. The figures haven't been updated since Size Wars IV in 2003, so I'll be off to fix that soon. Meanwhile, here's where we stand:
Search Engine   Reported Size            Page Depth
Google          8.1 billion              101K
MSN             5.0 billion              150K
Yahoo           4.2 billion (estimate)   500K
Ask Jeeves      2.5 billion              101K+
Now time for all the caveats!
Reported Size Figures
Reported Size is just that -- whatever the search engines claim. With Google, this has sometimes included what they call "partially-indexed" pages or what would more fairly be called link-only pages. These were pages Google knows about solely by links pointing at them. Nothing on the pages themselves has been indexed.
Typically, search engine sizes shouldn't count duplicate pages, spam pages and so on. But we're not auditing here, so they might.
As for Yahoo, it's trying to stay out of the size game. When it launched its own search technology earlier this year, it refused to provide a size figure, instead saying it was "comparable" to others. The company is sticking with this.
"As in the past, we are not disclosing the size of our index for competitive reasons. That said, we believe our index is highly competitive. Search quality is comprised of a variety of factors including freshness, relevance etc. and we continue to deliver high quality results for our consumers to ensure that they are able to find the best results for what they are looking for," said spokesperson Stephanie Iwamasa.
I both love and hate Yahoo for this. I love the idea of not getting into the size wars again, which are never that productive. But I also hate the idea we don't have a clue where they are at. I want those numbers -- I just want the search engines to put them out without the hype.
Since Yahoo won't release a figure, I'm putting them at 4.2 billion. That was the figure Google had long claimed -- and I read Yahoo's past statements of being comparable to mean they were at least equal with where Google was at.
Page Depth Amount
Page Depth is much more interesting. So you've got tons of pages -- do you actually index the full text on them, every word? That used to be how some search engines operated. Google almost singlehandedly made it acceptable to only partially index some pages.
In the past, if a page were longer than 101K, only the first 101K worth of text was indexed by Google. Everything else was ignored. My assumption right now is that Google still operates this way. If not, we'll bring an update as more information is gained.
MSN's page depth figure comes from statements they gave during the Meet The Crawlers session at Search Engine Strategies San Jose last August. It may not be true for the current release. I'll double-check on this and update, if so.
Yahoo's figure is from that same session. Ask Jeeves declined to state a figure during the session, going with, "We're in the ballpark of others." So, I've made them equal to Google, for now.
It's pretty easy to figure this stuff out. You just find a big long page, then do searches to see which search engines find text at the bottom of it. Tara Calishain did this recently to Yahoo and found Yahoo actually picking up some pages to a depth of 800K.
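The manual test described above can be set up with a small helper that picks a run of words from deep inside a long page, which you would then search for as an exact phrase at each engine. This is only a sketch: `probe_phrase` is a hypothetical helper, the depth is measured in bytes of the UTF-8 text you pass in (which may differ from how a crawler measures a page), and running the searches and reading the results is still done by hand.

```python
def probe_phrase(page_text: str, depth_kb: int, words: int = 8) -> str:
    """Return a run of words that starts after the first depth_kb kilobytes.

    Returns an empty string if the page is not longer than depth_kb.
    """
    data = page_text.encode("utf-8")
    offset = depth_kb * 1024
    if offset >= len(data):
        return ""
    # Ignore any multi-byte character that the offset happens to split.
    tail = data[offset:].decode("utf-8", errors="ignore")
    # Drop the first token, which may be a word cut in half at the offset.
    tokens = tail.split()[1:]
    return " ".join(tokens[:words])


# Example with a synthetic long page (roughly 480K of text).
page = " ".join(f"word{i}" for i in range(50_000))
phrase = probe_phrase(page, depth_kb=101)
print(phrase)
```

If an engine finds the page for a phrase taken past 101K but another does not, that suggests the second engine stops indexing earlier in the page. Picking a phrase that is distinctive to the page matters, since common wording could match other documents.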
Greg Notess of Search Engine Showdown is also the historic star of this type of auditing. In the past, Greg has run tests to try and determine if search engine sizes as reported seem to measure up. If he jumps back into this, we'll let you know. We may also jump in on the page depth side. Trying to audit the index size is much more time consuming.
In the meantime, I'll leave you with the refrain that Chris, Gary and I all agree with. Search engine size figures are useful but by no means should they be taken as a surrogate for a relevancy figure. Google having an index twice as large as Yahoo does NOT mean it is twice as good.
I'll leave you with a reference to my past article, Search Engine Size Wars & Google's Supplemental Results. It goes into even more depth on all the issues relating to search engine index size and the games that can -- and have -- been played.
Our recent Search Memories article is also a good read for those who want to hear some first-hand accounts of the first search engine sizes wars that were sparked by AltaVista.
Want to comment on this story? Visit our forum thread: It's Official: Google Now Searching 8,058,044,651 web pages.
Posted by Danny Sullivan at 8:42 AM | Permalink | Comments (0)
On the eve of Microsoft's long anticipated launch of MSN Search, Google is reporting on its home page that its index size has nearly doubled. Google now claims that it is "Searching 8,058,044,651 web pages." Earlier today, a search for the word "the" returned nearly 11 billion results, a far larger number than officially reported on the home page. No matter which numbers you believe, it's a significant expansion of Google's web database.
Will this big increase in Google's index make a difference to searchers? Perhaps. Traditionally these volleys in the search engine size wars have meant little, but have been picked up by the media because they are tangible and easy to report.
But Google hasn't just increased the size of its index. It has also been working hard on other aspects of the search engine, dropping hints for the past six months of major impending changes in the way the search engine calculates results. Algorithm changes combined with a much larger database may ultimately result in major changes for our web searches.
So yes, this is yet another brilliant PR move by Google that will certainly steal some of Microsoft's thunder on its big announcement day. But it may also portend significant changes in Google search results. Or not. Only time will tell.
Posted by Chris Sherman at 8:54 PM | Permalink | Comments (0)