In December, Google became the first crawler-based search engine to break the 1.5 billion web page mark. In addition, the service rolled out changes designed to improve the freshness of its results and the ability for users to find news.
The Google index now contains more than 1.5 billion web pages that Google has actually visited, as well as an additional half-billion pages that it knows about through links. There are also 330 million image files and 700 million Usenet posts, which stretch back to 1981.
The enlargement of Google's Usenet information makes it a fantastic resource for researching the early days of the Internet, and Search Engine Watch's associate editor Chris Sherman takes a closer look at the enhanced Google Groups, in his story below.
Sherman's story also provides more details about Google's improved news search results. Since the middle of 2000, Google has provided links to news stories at the top of its results page, in response to certain queries. The news content was pulled from major wire services.
The latest changes now pull that content from hundreds of web sites that Google says it has identified as having news content. Google did not specify exactly how news sites are identified nor is there any specialized news submission option. The company simply said that it has an automated process that it believes will find sites with good news content.
"If it looks even remotely like a news site, then it should be part of it," said Urs Hölzle, Google Fellow and member of Google's executive management team.
Google says that news links are three times more likely to appear in results than in the past. When it appears, news content shows up at the top of the standard Google results page, with the word "News" to the left of any links. Try a search for "euro" or "argentina," and you'll see examples of news links.
Users are also apparently pleased to get this news content, because the clickthrough rate on news links is five times better than before, Google said.
Unfortunately, the changes still leave Google weak in the news search arena. At competitors AltaVista and FAST, there are dedicated news search offerings. There are also a variety of good, new news search sites such as Daypop and RocketNews available, in addition to established ones such as Moreover. In any of these places, users who specifically want to find news content can be guaranteed to find it.
In contrast, there's no way to specifically perform a news-only search at Google, in the way you can an image search or a newsgroup search. Instead, you have to hope that the Google search algorithm manages to float news search results up in response to your query. To stay competitive, given the huge interest in news search, Google needs to finally make a dedicated news search option available.
Google also rolled out a "Headline News" search service in December, but that's not the same thing. This service aggregates top headlines from more than 100 leading English language newspapers into a single page and groups them into six categories: World, US, Business, Entertainment, Technology and Sports.
Google is promising future changes, such as more news sources and interface enhancements. Hopefully, one of those enhancements will be the ability to do keyword searching against the Google news search index used to feed its main results page.
Google is also trying to improve the freshness of its web page index. Previously, Google updated its web page index on a roughly monthly basis. This meant that pages could be around a month old, if you used Google just before the latest refresh happened.
The monthly refresh is still continuing, but a new daily refresh now also runs. A few million pages identified as being time-sensitive are being spidered regularly, so that the latest information from them is available.
Google is even highlighting recently refreshed pages with a new "Fresh!" tag that appears next to a page's URL. The tag shows the exact time the page was respidered.
For instance, search for "white house," and you'll see that the US White House site is noted as "Fresh!," having last been visited on January 6.
The Fresh notations are welcome, but even better would be if Google showed dates for all the pages it lists, in the way AltaVista used to offer. Then, it would be extremely easy to know exactly when a page was last visited by the Google spider.
By the way, that long-standing page date option was available at AltaVista until recently. It now appears to have been pulled, probably because it made it so easy to understand how fresh -- or stale -- AltaVista's index was.
Google didn't explain exactly how pages are selected for regular visits, but there are some common factors that can help. A good PageRank is one. Pages deemed more important in Google's link analysis system have an improved chance of being visited more often. However, it's also important that the page is "relevant" for regular updating, Google says.
What's that mean? Google wouldn't go into more detail at the moment, but it's something I'm looking to revisit with them. I think it's fair to assume that it means the page has shown some tendency to change frequently. However, "Just changing your page every day isn't going to help," Hölzle said.
Google also said that the daily refresh of web pages operates completely separately from the news search service. In other words, just because your URL shows up with a "Fresh" notation doesn't mean you've been selected as part of Google news search. However, it does mean that for the near future, that particular page will be revisited on a regular basis.
Overall, I wouldn't get too worried about trying to get your content "Fresh" noted. If your content changes often, then with luck, it will happen automatically. If it changes often and you seem to be overlooked, then use the general Google feedback form to alert them to this. And if your content doesn't change frequently, then don't waste time or effort trying to make it seem as if it does. The Fresh moniker does not come with any type of ranking boost.
Given all these crawling changes, it's also a good time to revisit the basic architecture of how Google serves its web pages. The company says it currently runs four data centers, two on the West Coast of the United States and two on the East Coast.
Each data center consists of around 2,000 to 4,000 separate computers, which store the Google web page index. They are roughly mirror images of each other, but there may be small differences. For example, changes to a master data center will propagate to the others, but there can be a short delay until this happens.
For the most part, users shouldn't notice any differences. When you do a query, Google tries to automatically route you to the closest data center. For example, European users would tend to get results from one of the East Coast data centers. Google also tries to ensure that you stay with a particular data center during a search session, so that if you repeat a search, you aren't surprised by minor differences.
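The routing behavior described above can be illustrated with a small sketch. Google has not said how it actually implements this, so everything here is an assumption made for illustration: the data center names, the region labels, the preference order and the idea of keying on a session identifier are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical data center names and region preferences, for illustration only.
# The article says only that there are two West Coast and two East Coast
# centers and that European users tend to reach the East Coast.
PREFERRED = {
    "europe": ["east-1", "east-2", "west-1", "west-2"],
    "us-east": ["east-1", "east-2", "west-1", "west-2"],
    "us-west": ["west-1", "west-2", "east-1", "east-2"],
}


@dataclass
class Router:
    # Maps a session identifier to the data center it was last sent to.
    sessions: dict = field(default_factory=dict)

    def route(self, session_id: str, region: str, available: set) -> str:
        """Return a data center, reusing the session's earlier choice if possible."""
        pinned = self.sessions.get(session_id)
        if pinned in available:
            # Same session: stay put, so repeated searches see the same index.
            return pinned
        # New session (or pinned center is down): take the closest one that is up.
        # Unknown regions fall back to the "us-east" order (an arbitrary choice).
        for center in PREFERRED.get(region, PREFERRED["us-east"]):
            if center in available:
                self.sessions[session_id] = center
                return center
        raise RuntimeError("no data center available")


router = Router()
all_up = {"east-1", "east-2", "west-1", "west-2"}
first = router.route("session-a", "europe", all_up)
repeat = router.route("session-a", "europe", all_up)  # same center as `first`
```

The point of pinning a session is the one the article gives: because the data centers are only roughly mirror images of each other, sending a repeated search to a different center could produce slightly different results.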
Things are a bit different with Google Groups and Google Images searches. Because there is less demand on these indexes, they are not mirrored at every data center, Google says.
Overall, about 10,000 computers are involved in maintaining all of Google's search indexes. A fifth data center is currently being built on the East Coast, so expect that number to rise further.
Also, it's popular on the major search engine forums to talk about changes that happen on the Google "numbered" servers, "www2.google.com," "www3.google.com," and "www4.google.com." Google says these are "production" servers, not live servers that users are automatically switched to at a particular data center.
Google engineers use the production servers to test different things, so some people treat them as a guide to what may happen on Google in the future. Then again, they may not follow what goes live on Google at all. About the only thing Google said they were useful for, from a webmaster perspective, is knowing whether a previously unindexed site has been visited.
In other words, if you aren't in the live Google results but see your site listed on one of the production servers, you'll probably show up in the live results in the near future. The caveat here -- and it is important -- is that while you may end up listed in the live results, there's no guarantee that you'll rank for a particular term just as you did on one of the production servers.
Another popular topic on the forums at the moment is that some web site owners have found that their PageRanks have dropped. One reason behind the drops was a glitch in the Google Toolbar, which site owners can use to determine PageRank, Google said.
"As it turns out, while upgrading a server, we made a slight change that affected the toolbar. Once we discovered the issue, a fix was implemented," said Google spokesperson Nate Tyler.
I had asked about any potential problems last Thursday, after noticing that my toolbar was giving sites such as Yahoo and even Google itself no PageRank. Sure enough, the next day when the response came, I found that both sites were again getting top scores of 10 out of 10.
Finally, Google looks to be readying plans to syndicate its AdWords paid listings to other sites, or perhaps to the search results it provides to partners and others. The FAQ page about AdWords recently got this new addition:
"AdWords advertisers already know the power of keyword targeted text ads. Google has made arrangements to extend the reach of these ads to a wider range of people conducting searches on Google's partner sites. Google's new syndication program will put your ad on several popular sites, which means more new users see your ad -- increasing your reach to a wider audience of potential customers across the Web. The Google syndication program launches in the next 1-2 months."
Google said it couldn't talk about the plans yet, but as soon as it can, I'll bring you more details.
Google Launches New Salvo in Search Engine Size Wars
SearchDay, Dec. 11, 2001
More details on Google getting bigger, enhancing its Google Groups area and making freshness changes.
Google Headline News
News Search Engines
Freshly-updated, a guide to major news search resources.
Google Production Server Search
Lets you conduct the same search against all of Google's production servers, at the same time.
The Google Toolbar makes it easy to see a page's precise web rank.
Google AdWords: About Syndication
More about Google's plans to syndicate its ads and how to opt-out, if you don't want this to happen automatically, when syndication begins.
Newly updated page that's an entertaining read of Google's Cinderella story.