
April 30, 2007

Google Helping State Government Sites Get Indexed

Google has teamed up with four state governments – Arizona, California, Utah and Virginia – to make public information on their Web sites more searchable. The four states have made their public databases more accessible to Google's crawler by using sitemaps to identify the structure of their sites. They have also used Google's Custom Search Engine service to include the Web sites of various state agencies in a site search.

"Connecting citizens with their government by offering the public better access to public sector information and services is consistent with our broader vision – to organize the world's information and make it universally accessible and useful," Eric Schmidt, Google's chairman and CEO, said in a statement. "These partnerships are among many that Google is pursuing with government agencies to better serve the public."

Google has a page on its site dedicated to helping public sector groups to use Google services.

Posted by Kevin Newcomb at 12:09 AM | Permalink

April 12, 2007

Enhancements to Sitemaps Announced At SES New York

In November Google, MSN and Yahoo! announced that they were all going to support a unified protocol whereby webmasters could notify the search engines of the URLs on their site that they wanted crawled. As expected, enhancements have been made and were announced yesterday at SES. First, Ask is now supporting the Sitemaps protocol. Second, support for auto-discovery has been added. All it takes is a simple line added to the robots.txt file. Here is an example:

Sitemap: http://www.mysitename.com/sitemap.xml

Since crawlers check the robots.txt file when they initially visit a site, this directive will provide immediate notice of where the crawler should look to find the sitemap. Webmasters can also use an HTTP request to submit their sitemap. For more information on this, readers are urged to check Sitemaps.org for the most current information.
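To make the auto-discovery step concrete, here is a minimal Python sketch of how a crawler might pull Sitemap lines out of a robots.txt body. The function name `find_sitemaps` and the sample file are illustrative assumptions, not code from any search engine, and real crawlers may handle edge cases differently.

```python
def find_sitemaps(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body.

    Illustrative only: the field name is matched case-insensitively and
    anything after a '#' is treated as a comment.
    """
    urls = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments and padding
        field, sep, value = line.partition(":")    # split at the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content using the example line from the post.
example = """User-agent: *
Disallow: /private/
Sitemap: http://www.mysitename.com/sitemap.xml
"""

print(find_sitemaps(example))
```

Splitting at the first colon only matters here, because the sitemap URL itself contains a colon after the scheme.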

During the SES session, Sitemaps and URL Submission, a show of hands indicated that there are still quite a number of webmasters who are not submitting xml versions of their sitemaps and rely on the alternative text versions. It will be interesting to see if a year from now the adoption rate has changed as webmasters discover how useful this protocol is for submitting their sites.

Posted by Amanda Watlington at 10:25 AM | Permalink

March 2, 2007

Google and the Site: Command Glitch

Is this worth a 4.0 on the SEO Richter scale? Probably not. Just a rumble really, but oh how nothing shakes up the SEM industry and gets SEOs chatting like a nice bug in Google search results. Bring up the topic of the Supplemental Index and duplicate content, and the story gets even juicier.

As has just been confirmed by Google's Vanessa Fox, there is in fact something amiss with the current "site:" command, which is being rectified 'as quickly as possible'. It is merely the result of a display issue that shouldn't have any impact on search queries or ranking. (Special thanks to Vanessa for working with us on sorting out this issue and finding a solution so quickly!)

But let's dig deeper into why this is such a big deal in the SEO world.

The "site:" command tells you how many of your site's pages are indexed in Google. In Google's Webmaster Central, the official syntax is "site:domain.com", and many SEO experts look at this as a real number.

So when Google suddenly starts to return disparate results for that command, it raises a red flag in the industry, and the conspiracy theories fly. For SEOs and webmasters, the questions that immediately come to mind are along these lines:

  • Is this the sign of a stronger "duplicate content" filter?
  • Does it mean I'm really in the Supplemental Index or possibly banned for life?
  • Did I mess up something on my site?

Probably nothing to raise your blood pressure over, but definitely this glitch is an anomaly in Google SERPs.

As is well documented here at SEW and other sites around the Web, typing "site:www.domain.com", "site: www.domain.com", or "site: domain.com" will return drastically different results. Note the differences when using a space after the colon, as well as when using the www vs. non-www version of a domain.

At SEW, we were alerted to this problem yesterday when the effervescent David Naylor posted that something was amiss with the results for SEW. The "site:" command site:searchenginewatch.com shows only 1 page, with "about 268" similar pages whose results are omitted.

Rest assured, at SEW, we do still have a vibrant pulse, and have not experienced any significant drops in traffic due to this problem. So, it's too early to plan a funeral. I am happy to report that traffic is normal at Search Engine Watch. In fact, it has actually been growing fairly steadily since January 1, and that deserves a post of its own.

As it turns out, Dave Naylor was not the first to discover this problem. As Danny Sullivan points out in his SEL post, Webmaster World has had a discussion going on this for almost a month now. Several large, authority sites, with total numbers of indexed pages reaching into the tens or hundreds of thousands, were seeing this result as well.

Because of the strange coincidences in the number of results, Danny Sullivan does get credit for dubbing this the "About 260" problem. However, that may not be an entirely correct title, because in some datacenters, the result is "about 359" for the same search. Try the searches among different browsers (Firefox/IE) and with personalized search on/off. While some are not dramatically different and do still fall into the "About 260" category, other searches are up by at least 100 more results.

SEW blogger Eric Enge dug up similar examples of other authoritative sites exhibiting this problem:

Posted by Elisabeth Osmeloski at 2:44 PM | Permalink

November 16, 2006

Search Engines Unite On Unified Sitemaps System

In alphabetical order, Google, Microsoft and Yahoo have agreed to all support a unified system of submitting web pages through feeds to their crawlers. Called Sitemaps, taking its name from the precursor system that Google launched last year, all three search engines will now support the method.

More about Sitemaps is to be provided through the new Sitemaps.org site. As part of the announcement, the existing sitemaps protocol from Google gets a version upgrade to Sitemaps 0.9. However, no actual changes to the system have taken place. The new version number was simply done to reflect the protocol moving from an exclusive Google system to one that all three search engines now support.

Anyone already using Google Sitemaps needn't do anything different. The only change is that those sitemaps will now be read by Microsoft and Yahoo, as well. More information will be posted at the Sitemaps.org site, or see these sections from each of the search engines, which I expect to be updated soon:

Other search engines are also invited to use the system -- it has specifically been placed as open property through Creative Commons so that others can make use of it. FYI, Ask isn't part of this announcement because it wasn't invited by the other three to take part, which I find unfortunate. Then again, among all four, Ask is the only one that doesn't already accept submissions in some way.

How can others contribute to its development? That remains to be worked out. So far, there's a working committee involving the three major search engines named. They say they are open to participation from other search engines, as well as content owners, to see the system grow and develop. I expect we'll find more structure to this emerging soon. At the moment, the key work has been in getting all three to agree to support the existing standard.

How about unification around other search standards, such as improving the robots.txt system of blocking pages? Again, this is something the search engines (specifically Google and Yahoo when I spoke to them) say they're interested in. So fingers crossed, we'll see more of this down the line.

Overall, I'm thrilled. It took nearly a decade for the search engines to go from unifying around standards for blocking spidering and making page descriptions to agreeing on the nofollow attribute for links in January 2005. A wait of nearly two years for the next unified move is a long time, but far less than 10, and it is progress that's very welcome. I applaud the three search engines for all coming together and look forward to more to come.

(Postscript: Announcements are up now from Yahoo, Microsoft and Google)

Below is more from the press release. Sorry I can't do a longer post about the system, but I'm also busy attending the PubCon conference, where the announcement has happened.

Las Vegas, November 16, 2006 - In the first joint and open initiative to improve the Web crawl process for search engines, Google, Yahoo! and Microsoft today announced support for Sitemaps 0.90 (www.sitemaps.org), a free and easy way for webmasters to notify search engines about their websites and be indexed more comprehensively and efficiently, resulting in better representation in search indices. For users, Sitemaps enables higher quality, fresher search results. An initiative initially driven by Yahoo! and Google, Sitemaps builds upon the pioneering Sitemaps 0.84, released by Google in June of 2005, which is now being adopted by Yahoo! and Microsoft to offer a single protocol to enhance Web crawling efforts.

Together, the sponsoring companies will continue to collaborate on the Sitemaps protocol and publish enhancements on a jointly maintained website www.sitemaps.org, which provides all of the details about the Sitemaps protocol.

How Sitemaps Work

A Sitemap is an XML file that can be made available on a website and acts as a marker for search engines to crawl certain pages. It is an easy way for webmasters to make their sites more search engine friendly. It does this by conveniently allowing webmasters to list all of their URLs along with optional metadata, such as the last time the page changed, to improve how search engines crawl and index their websites.

Sitemaps enhance the current model of Web crawling by allowing webmasters to list all their Web pages to improve comprehensiveness, notify search engines of changes or new pages to help freshness, and identify unchanged pages to prevent unnecessary crawling and save bandwidth. Webmasters can now universally submit their content in a uniform manner. Any webmaster can submit their Sitemap to any search engine which has adopted the protocol.
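As an illustration of the file described above, the following Python sketch builds a small Sitemap with a `loc` entry per URL and an optional `lastmod` date. The namespace is the one published for Sitemaps 0.9 at sitemaps.org; the helper name `build_sitemap` and the example.com URLs are assumptions made for the example, and a production file would also need to be saved as UTF-8 and kept within the protocol's size limits.

```python
import xml.etree.ElementTree as ET

# Namespace published for the Sitemaps 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build Sitemap XML from (url, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc            # required: the page URL
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod  # optional metadata
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages; the second has no known modification date.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", "2006-11-16"),
    ("http://www.example.com/about", None),
])
print(sitemap_xml)
```

The optional `lastmod` value is what lets a crawler skip pages that have not changed since its last visit.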

The Sitemaps protocol used by Google has been widely adopted by many Web properties, including sites from the Wikimedia Foundation and the New York Times Company. Any company that manages dynamic content and a lot of web pages can benefit from Sitemaps. For example, if a company that utilizes a content management system (CMS) to deliver custom web content (e.g., pricing, availability and promotional offers) to thousands of URLs places a Sitemap file on its web servers, search engine crawlers will be able to discover what pages are present and which have recently changed and to crawl them accordingly. By using Sitemaps, new links can reach search engine users more rapidly by informing search engine “spiders” and helping them to crawl more pages and discover new content faster. This can also drive online traffic and make search engine marketing more effective by delivering better results to users.

For companies looking to improve user experience while keeping costs low, Sitemaps also helps make more efficient use of bandwidth. Sitemaps can help search engines find a company's newest content more efficiently and avoid the need to revisit unchanged pages. Sitemaps can list what is new on a site and quickly guide crawlers to that new content.

“At industry conferences, webmasters have asked for open standards just like this,” said Danny Sullivan, editor-in-chief of Search Engine Watch. “This is a great development for the whole community and addresses a real need of webmasters in a very convenient fashion. I believe it will lead to greater collaboration in the industry for common standards, including those based around robots.txt, a file that gives Web crawlers direction when they visit a website.”

"Announcing industry supported Sitemaps is an important milestone for all of us because it will help webmasters and search engines get the most relevant information to users faster. Sitemaps address the challenges of a growing and dynamic Web by letting webmasters and search engines talk to each other, enabling a better web crawl and better results," said Narayanan Shivakumar, Distinguished Entrepreneur with Google. "Our initial efforts have provided webmasters with useful information about their sites, and the information we've received in turn has improved the quality of Google's search.”

“The launch of Sitemaps is significant because it allows for a single, easy way for websites to provide content and metadata to search engines," said Tim Mayer, senior director of product management, Yahoo Search. "Sitemaps helps webmasters surface content that is typically difficult for crawlers to discover, leading to a more comprehensive search experience for users.”

“The quality of your index is predicated by the quality of your sources and Windows Live Search is happy to be working with Google and Yahoo! on Sitemaps to not only help webmasters, but also help consumers by delivering more relevant search results so they can find what they're looking for faster,” said Ken Moss, General Manager of Windows Live Search at Microsoft.

The protocol will be available at sitemaps.org, and the companies plan to have Yahoo Small Business host the site. Any site owner can create and upload an XML Sitemap and submit the URL of the file to participating search engines.

Posted by Danny Sullivan at 12:00 AM | Permalink

November 13, 2006

Google Updates 3rd Party Sitemaps Generator Tools Directory

The Google Webmaster Central Blog informed us that they have updated their Sitemaps Third Party Programs & Websites page. So if some past tools did not work, give this new list a chance.

Posted by Barry Schwartz at 8:46 AM | Permalink

September 20, 2006

Google Webmaster Central's Vanessa Fox & Amanda Camp Interviewed

Seattle 24x7 has an excellent conversation with Vanessa Fox and Amanda Camp of Google on Google Webmaster Central and working at Google. Both women began working at Google in April of 2005 in Seattle. They discuss the conception of Google Webmaster Central (also known as Google Sitemaps). The discussion also goes into the 20% time and recruiting women at Google. You can also see a picture of "Seattle's Sisters of Search."

Posted by Barry Schwartz at 9:02 AM | Permalink

August 25, 2006

Google's Webmaster Central Adds A Blog

The folks who are responsible for Webmaster Central, the old Google Sitemaps, announced at the Google Blog that they have a blog specifically for Webmasters at http://googlewebmastercentral.blogspot.com/.

This blog will be more focused on the technical side of things, from what it appears. Currently, the blog has topics on feature enhancements to Webmaster Central, googlebot tips, system maintenance updates and more.

Posted by Barry Schwartz at 2:32 PM | Permalink

August 4, 2006

Google Sitemaps Becomes Google Webmaster Central; Preferred Domain Tool Launched

Google Sitemaps has gained a new name along with new features. Google Webmaster Central is the new name of the former Google Sitemaps service, which now has evolved into a central place for Google to provide help information, statistics, reports and tools to help webmasters.

Google Sitemaps launched last year primarily as a way for site owners to submit lists of URLs to be crawled. Since that time, it has steadily gained features that took it beyond being a submission tool. It has offered the ability to view stats on how people are finding your site, verify robots.txt files and much more.

I've actually just come from Google's office in Kirkland, Washington, which is home to the Google Sitemaps team. Here's a rundown on some of the new features offered within the Google Sitemaps / Webmaster Tools component of Google Webmaster Central.

  • Preferred domain: Is your site available with and without a www prefix? Until now, the recommendation was to do a 301 permanent redirect of one to the other. But some people can't easily do this. Now sitemaps has a preferred domain tool that lets you pick which you prefer. Make your choice, and Google will list the domain you choose. Behind the scenes, Google will understand the two domains are one and the same for purposes of things like link calculations. Keep in mind that Google says it will take some time before the changes are visible. Also keep in mind that you'll still need to do 301 redirection for other search engines. Still, I'm thrilled to see Google has moved ahead with yet another suggestion I've wanted, as have many others.
  • Crawl Rate: A small percentage of webmasters will see a new alpha feature under the Tools menu called "crawl rate." This will allow site owners to tell Google to crawl them at a particular speed: Fastest, Faster, Normal, Slower, Slowest. Feel like Google is hammering your server and slowing it down? Choose Slower or Slowest. Got a super server and want to help Google? Tell it that it can crawl you faster or fastest. Then it will crawl you more quickly and move on to other sites. And, if you've got so many pages that Google doesn't seem to be getting them all in the usual time it crawls, using a higher setting gives you a chance of getting more pages in. Note that Google says that many sites are already getting comprehensively crawled at a normal setting. "Ninety nine percent of website owners don't need to change this," said Amanda Teal, a software engineer on the Google Webmaster Central team. "This is for the tiny percentage of site owners who contact us with issues."  
  • Summary Page Changes: The summary page has new icons and colors to try to highlight the good things and bad things going on with your site.
  • Better Crawl Error Reporting: The crawl error reporting feature now shows a full rundown on all errors that have happened over the past two weeks, and you can filter them by date, making it easy to see what errors have happened since your last check.  
  • Manage Site Verification: This is a new tab that shows you all methods that have been used to verify a site you control. Huh? Let's say you manage a site, but others have access to it as well. You verify the site using your own Google Sitemaps account, but two other people in your company who can also insert meta tags or place files on your server also verify the site in their own names. This new page helps you understand that these other people have also verified. There's also a way to reverify all accounts associated with a site, even if you didn't originally do the verification. This is handy to wipe out verifications others may have done, if they are no longer associated with your company or site. For example, say one of those other people has left your company. Do you still want them having access to your stats? Probably not. The reverify feature lets you see exactly how they verified your site in their account. You control the site, so wipe out the verification file or meta tag. Then do a reverify. This will remove their access to stats from your site immediately, since the file or meta tag originally used can't be found.
  • Improved Sitemap Error Reporting: If one of your sitemaps has a problem, such as a missing XML tag or formatting problem, these errors are now better explained and reported.
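The preferred domain item above notes that a 301 permanent redirect is still needed for other search engines. As a rough sketch of that logic in Python, the helper below decides whether a requested URL sits on the non-preferred www or non-www twin and, if so, where to send it. The function name `canonical_redirect` and the example.com host are assumptions for illustration; in practice this rule usually lives in the web server configuration, and the sketch ignores ports.

```python
from urllib.parse import urlsplit, urlunsplit


def _strip_www(host):
    """Drop a leading 'www.' so the two variants of a host compare equal."""
    return host[4:] if host.startswith("www.") else host


def canonical_redirect(url, preferred_host):
    """Return (301, target_url) when url is on the non-preferred twin, else None."""
    preferred = preferred_host.lower()
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host == preferred:
        return None                      # already on the preferred domain
    if _strip_www(host) != _strip_www(preferred):
        return None                      # a different site entirely
    target = urlunsplit(
        (parts.scheme, preferred, parts.path, parts.query, parts.fragment)
    )
    return 301, target


print(canonical_redirect("http://example.com/page?x=1", "www.example.com"))
```

Keeping the path and query string intact in the target is what lets deep links carry over to the preferred domain.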

There are a number of other changes and tweaks, as well. Google has a post with more here, and there's further info you'll find when exploring Google Webmaster Central.

Posted by Danny Sullivan at 11:07 PM | Permalink

July 21, 2006

Site Diagnostics Tab Added to Google AdSense Console

Google has added a new tab, one they have been beta testing for a couple of months, named Site Diagnostics. What this tool does is show you which pages the AdSense crawler is having problems getting to. Why would the crawler have a problem getting to those pages? The several possible reasons include a robots.txt file blocking them, password protected pages, a server that is down or slow, and other reasons explained in the AdSense help pages.

I have posted screen captures at the Search Engine Roundtable.

Posted by Barry Schwartz at 8:37 AM | Permalink

July 13, 2006

More Details On Google Sitemaps Query Stats

DaveN at ThreadWatch posted his love/hate for Google Sitemaps, but what I find to be the most interesting part is the discussion taking place in his post at his blog. Vanessa Fox, Google Engineering, from the Inside Google Sitemaps blog posted a comment at Dave's blog explaining why the Sitemaps query stats may say you come up for a popular term even though you don't mention that term or phrase on the pages of your site.

I do not want to miss anything from her comment, so let me quote it.

(1) Stats are based on three week averages; "They are averaged over a three week period, so any big fluctuations during that period may make the stats seem off."

(2) "They are top overall queries. For instance, say your site isn't about Britney Spears, but you've mentioned her a few times and so your site ranks for her (although likely doesn't rank well). Your site is actually about purple apples. So, if a million people search for Britney and 10 people search for purple apples, then Britney is going to show as a top query. And you might look at that and say, my site isn't even about her. How can that query be higher for my site than what my site is actually about? But in sheer number of searches, Britney is a top query for the site."

And to clarify number two, we have this comment;

My early morning, under-caffeinated guess is that you linked to this threadwatch story (http://www.threadwatch.org/node/7076) in your "industry news" section and at some point, that may have been on the same page as links pointing to this post: http://www.davidnaylor.co.uk/archives/2006/03/21/naked-truth-about-shoemony/ and possibly some anchor text pointing to your site includes the word "nude" (the cached page info seems to indicate so). And when searching for christine dolce naked became a popular thing to do, your site may have been an early one to have all the keywords.

This explains a bit more about how Google Sitemaps query stats data works.

Posted by Barry Schwartz at 9:21 AM | Permalink

June 22, 2006

More Stats & Features From Google Sitemaps

The Inside Google Sitemaps Blog announced more features and statistics added to the Google Sitemaps product. The features mainly include additional statistics, but you can also find additional tools. Here is a quick rundown of the new items you can find at Google Sitemaps.

  • Unlimited crawl errors in reports
  • More query stats, a lot more, including reporting on subfolders
  • Common words report increased to show 75 words from 20
  • Submit up to 500 sitemaps under one Google Account, up from 200
  • Adsbot-Google useragent added to robots.txt tool
  • Added a rate this tool poll

That is it.

Posted by Barry Schwartz at 8:32 AM | Permalink

April 27, 2006

Run Down Of Recent Google Bugs & Glitches

Over the past two days or so, I have reported at my other blog on four new bugs and glitches at Google. This is a high number of real bugs in Google in a short period of time. Here is a rundown of them and whether they have been fixed or not.

(1) Google Fixes Extended URL Broken Page Issue (fixed, not confirmed by Google)
(2) Google AdWords Glitch: Bid Tool Conflicts With Position Preference Tool (not fixed and unconfirmed by Google)
(3) Google AdWords Showing Same Two Ads On Search Results Pages at Google.com (not fixed and unconfirmed by Google)
(4) Robots.txt Google Sitemaps Bug Fixed (fixed and confirmed by Google)

Posted by Barry Schwartz at 8:29 AM | Permalink

April 26, 2006

Google Sitemaps Adds Spam Checking, New Webmaster Help Center & Other Features

I just came out of the Meet the Crawlers session, where Google announced new features and a new layout for Google Sitemaps. The Sitemaps blog just posted the details as well. One huge feature is that Google tells you if your site is in the index or not and if it is not, they won't tell you why.

Here is a break down of the new features:

  • New verification method
  • Indexing snapshot
  • Notification of violations of the webmaster guidelines
  • Reinclusion request form
  • Spam report
  • New webmaster help center
  • More about our new look
  • Adding a Sitemap
  • Navigating the tabs

Full feature list at sitemaps blog.

Postscript: Matt Cutts just pinged me to let me know he has posted an entry named Notifying webmasters of penalties. That entry explains that the Google Web Search Team and Google Sitemaps Team are working together to notify "some (but not all)" webmasters of Google site penalties.

Posted by Barry Schwartz at 1:59 PM | Permalink

March 1, 2006

Google Sitemaps Adds Top Keyword Positions, Top Mobile Queries and CSV Downloads

Google Sitemaps has announced the ability to see your average position for search queries and your top search queries from mobile devices, plus the ability to download "details, stats, and errors" to a CSV file that you can then use however you like. More details at Google Sitemaps Blog.

Posted by Barry Schwartz at 6:35 PM | Permalink

February 6, 2006

Google Sitemaps Stats On Most Common Words In Your Anchor Text & Site Content

Along with the cool new robots.txt checker, Google Sitemaps has also released stats showing the most common words used on pages within your web site and the most common words in anchor text pointing at your site.

The common words in site content stats will be good fodder for those who believe Google somehow tries to figure out a word "theme" for your entire site. Google's never claimed to do this before -- and seeing sites like Amazon or Wikipedia rank for anything when they are about nothing in particular should demonstrate that you don't need to target all your pages around a particular term or theme.

Still, if Google's generating stats like this for a site, it'll probably tip some people back to worry more about this. I wouldn't - but do as you deem best.

The anchor text analysis is far more intriguing. Again, Google has generally said that each page is measured by the links pointing at that particular page. So if someone points at a deep page in your site, that helps that particular deep page, not the site as a whole. And if someone points at your home page, that helps the home page, not the entire site (Yahoo, in contrast, has said it does some sitewide link crediting).

Now Google's reporting anchor text terms for an entire site -- which suggests that any link to any page in your site might have an impact on other pages. Or not!

Questions, questions. I'll drop a word over to Google blogmeister Matt Cutts to see about getting some answers. I'll postscript here, but I'd also say to watch his blog as well.

Finally, while these stats are promised, I don't see them live for all of my sites in Sitemaps yet. If you don't either, there's probably a delay in getting them rolled out and live.

Posted by Danny Sullivan at 8:24 PM | Permalink

Google Launches Robots.txt File Checker; Now We Need Robots.txt Standardization

Very nice. Wondering how a search engine will process your robots.txt file? Google now provides a way to check on that through the Google Sitemaps program. More stats and analysis of robots.txt files from the official Inside Google Sitemaps blog explains more.
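For a rough local approximation of this kind of check, Python's standard library includes a robots.txt parser. The sketch below is not Google's checker and may interpret some rules differently from any given search engine; the `ExampleBot` user agent, the rules and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for a crawler called ExampleBot.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 10
"""


def load_rules(robots_txt):
    """Parse a robots.txt body held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Ask how the standard-library parser would treat two URLs for ExampleBot.
print(rules.can_fetch("ExampleBot", "http://www.example.com/index.html"))
print(rules.can_fetch("ExampleBot", "http://www.example.com/private/report.html"))

# Crawl-delay is an extension that not every crawler honors.
print(rules.crawl_delay("ExampleBot"))
```

Because each engine can read the same file slightly differently, a check like this is only a first pass before using an engine's own tool.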

For Search Engine Watch members, the longer version of this article gives a real life example of how nice the checker is in action.

Overall, I'm thrilled with the new tool. I'd like to see the other search engines add similar ones. Even better, I'd like to see them all come together on creating an enhanced and more standardized robots.txt standard. Consider:

Postscript: Matt Cutts from Google has some good comments over here, pointing out Google also has an allow command (I've updated my list above) and further in comments to the post, explaining why they don't support crawl-delay yet because of concerns it might be set too low by mistake by some webmasters.

Posted by Danny Sullivan at 8:08 PM | Permalink

January 11, 2006

Google Sitemaps Now Available in Four More Languages

It's quite the international day at the Googleplex. First, new languages and interfaces for Google Scholar, and now a post from Vanessa Fox on Inside Google Sitemaps mentions that support and discussion groups for the service are now available in:

  • Danish
  • Finnish
  • Norwegian
  • Swedish

Posted by Gary Price at 6:32 PM | Permalink

December 8, 2005

Google Releases New Version of Sitemap Generator Tool

A new version (v1.4) of the Google Sitemap Generator Tool is now available according to Vanessa Fox from Google Engineering.

She writes: This version has the same features as the last one, but fixes a subtle bug in writing GZip compressed Sitemap files. The old version stored more path information than it needed to when it created GZip files, and this was a point of concern for some webmasters.
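The post does not say exactly how the old version wrote its files, but the gzip format does have an optional header field for the original file name, which is one place path information can end up. As a hedged Python sketch of the general idea, the helper below passes only the base name when compressing, so no directory names are recorded. The function name `gzip_sitemap` and the `/var/www/site` path are hypothetical.

```python
import gzip
import io
import os


def gzip_sitemap(xml_bytes, source_path):
    """Compress Sitemap bytes, recording only the base file name in the header."""
    name_only = os.path.basename(source_path)   # e.g. 'sitemap.xml', no directories
    buf = io.BytesIO()
    # mtime=0 keeps the output reproducible; filename feeds the gzip header field.
    with gzip.GzipFile(filename=name_only, mode="wb", fileobj=buf, mtime=0) as gz:
        gz.write(xml_bytes)
    return buf.getvalue()


# Hypothetical server path; only 'sitemap.xml' should appear in the output.
compressed = gzip_sitemap(b"<urlset/>", "/var/www/site/sitemap.xml")
print(len(compressed), "bytes written")
```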

Posted by Gary Price at 10:48 PM | Permalink

November 18, 2005

Failure Most Popular Term Sending Traffic From Google To US White House Site

Turns out the Google Sitemaps stats security problem means anyone can access the top terms driving traffic to the US White House web site. The top term isn't going to make President George W. Bush very happy. That's because it's "failure."

Many are aware that Bush was targeted with a link bomb for the words "miserable failure," and recently just the word "failure" has worked as well. Googlebombing Now A "Prank" And Not Web's Opinion, Says Google from us in September explains more.

Now we see the flip side. Not only is the White House ranking well for that word, but it's also the biggest driver of traffic to the web site. Lots of people are clicking after searching on the term.

Specifically, here are the "Top Search Query Clicks" for the site, as reported by the Google sitemaps system:

  1. failure
  2. failure
  3. white house
  4. abraham lincoln
  5. george washington

These show the five most popular queries that are sending the site traffic. In other words, of all the ways the White House web site might be searched for and rank well on Google, these are the terms sending the most visitors "downstream" to the White House.

Nope, I have no idea why failure appears twice. But it might be related to something that can at least soothe President Bush's feelings a bit. In the past, a search for [miserable failure] would bring up Bush's bio first, then bring up President Jimmy Carter's bio second. So some of this traffic might be related to Carter clicks.

It's difficult to know, however, because Google doesn't state how far back these stats go. Are they top searches in the past week, month, year? That's not defined in help pages about the tool. At the moment, Carter doesn't show for either [miserable failure] or [failure].

Aside from clicks, what are the most popular queries on Google that the White House site ranks for? Those are:

  1. failure
  2. w
  3. failure
  4. house
  5. bush

The difference with these stats and the ones above is that while the White House ranks well for these popular queries, they aren't the biggest driver of visits. In other words, Bush may be number one for [W], but people are likely clicking on other results.

By the way, I debated whether to expose the data, not wanting to violate the privacy of another web site. However, in the end, it's public information. If I put in a freedom of information request to the White House for the data, there's absolutely no reason it wouldn't be granted. In addition, other companies such as Hitwise and comScore can determine search queries to any particular site, so it's hardly inaccessible -- just something I'd never seen or thought about before.

Posted by Danny Sullivan at 10:52 AM | Permalink

Major Security Flaw With Google Sitemaps Stats

David Naylor points out, as does this WebmasterWorld thread spotted via Threadwatch, a pretty surprising security oversight with Google's new Sitemaps stats system that can allow anyone access to stats of other web sites, if those web sites don't report 404/File Not Found errors correctly. Right now, I'm looking at stats for eBay and AOL, as well as Google's own Orkut!

In order to see stats for a site, you have to verify you own it by installing a special file on your server. Google randomly generates a filename to use, you install this file, then Google checks to see if it exists. If it does, you can view stats for that site.

The problem is, some web sites will respond that any page exists, even if it doesn't. Rather than sending out a 404 File Not Found error message, they'll dynamically generate the page with content anyway or they'll tell the user the file doesn't exist, but the server code sent to a browser says differently.

For example, try this:

http://www.ebay.com/djkfjkdjfkjd

You'll see that eBay responds that the page doesn't exist. However, behind the scenes it redirects the request (sending a 301 server code) to another page that has a 200 Page Found code. As a result, like Dave and Barry, I'm now looking at eBay's stats, as well as AOL's stats.

How could all three of us get access? Because both eBay and AOL will turn any request into a page found code -- and remember, we were all given unique file URLs to enter. As far as Google is concerned, we all have correctly installed these files.

That's another security issue. You'd think the system was smart enough that if one person verified ownership, no one else could. Not so, not at the moment.

Want to ensure you are protected? Be sure you are sending out proper 404 error codes for pages that don't exist. Rex Swain's HTTP Viewer is an excellent place to check this.
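If you'd rather script that check, here is a minimal Python sketch. It builds a URL for a randomly named page that almost certainly doesn't exist on your site, fetches it while following redirects, and reports the final status code. The function names (random_probe_url, final_status, answers_found_for_missing_page) are invented for this example, and this is only one way to test your own server, not a description of how Google's verifier works.

```python
import secrets
import urllib.error
import urllib.request
from urllib.parse import urljoin


def random_probe_url(base_url):
    """Build a URL for a randomly named page that should not exist."""
    return urljoin(base_url, "/" + secrets.token_hex(16))


def final_status(url, timeout=10):
    """Return the final HTTP status code, after any redirects are followed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is what we want.
        return err.code


def answers_found_for_missing_page(status):
    """True if a request for a nonexistent page ended in a 2xx 'found' code."""
    return 200 <= status < 300


if __name__ == "__main__":
    # www.example.com is a placeholder; substitute your own site.
    probe = random_probe_url("http://www.example.com/")
    status = final_status(probe)
    if answers_found_for_missing_page(status):
        print(probe, "returned", status, "- missing pages look like real ones")
    else:
        print(probe, "returned", status, "- missing pages are reported as errors")
```

A site that redirects every unknown URL to a page answering 200, as described above, would land in the first branch.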

When the stats system came out, I did ask Google why they didn't go with a more common verification system of putting special code on a page. That would have been safer, plus easier for some people who don't have the ability in content management systems to easily generate files of a particular nature. I never got a reply to that.

Another solution would be for special code to have been installed within a robots.txt file as a way of verifying a site with Google.

Want to discuss or comment? Visit our forum thread, Google Loses Trust with Sitemaps.

Postscript: It should be stressed that top query data isn't particularly private. Anyone with enough money can buy more extensive data through companies like Hitwise or comScore. The seriousness is really in that what was supposed to be a secure verification system failed. Especially consider Google's words on the system:

8. What is being done to protect my privacy?

We use the verification process to keep unauthorized users from seeing detailed statistics about your site. Only you can see these details, and only once we verify you own the site. We don't use the verification file we ask you to create for any purpose other than to make sure you can upload files to the site. You can read more about our commitment to privacy here.

Postscript 2: Google has sent this statement:

This morning we learned of an issue with the Google Sitemaps tool that may have temporarily enabled users to view statistics about sites they do not own. We acted quickly and fixed the issue. To ensure the security of all sites using the Google Sitemaps tool, we will re-verify all sites added in the last 48 hours.

Posted by Danny Sullivan at 9:22 AM | Permalink

November 16, 2005

Google Sitemaps Expands To Give Query & Indexing Stats!

Google's just added some new and long desired tools as part of its Sitemaps system. You can now get query stats and see top keywords driving traffic to your web site (wow!). Crawl stats also show you how often you've been visited, any particular errors and messages why, such as "You banned us with robots.txt, you idiot." OK, it doesn't say that, but it should. You don't have to submit to Sitemaps to play with the new tools -- you just need to have a free Sitemaps account. Go check it out. More details here on the Google Sitemaps blog. I'm off to play and will do a follow-up afterward. Want to discuss or comment? Visit our SEW Forums thread, New Stats On Queries & More With Google Sitemaps.

Postscript: Remember, to see the more detailed stats, you have to verify your site with Google first. Once verified, you have access to them.

Posted by Danny Sullivan at 3:52 PM | Permalink

August 30, 2005

Goin' Mobile with Google Sitemaps

Word just up on the Google Blog that an extension for Google Sitemaps is now available that allows webmasters to submit content for inclusion in Google's mobile web index. Details and examples here. More about Google Sitemaps in this blog post. In June, Google introduced an index of content that has been written/optimized for mobile web browsers.

Posted by Gary Price at 1:54 PM | Permalink

August 25, 2005

Yahoo Bulk Submit Now Live

Gary wrote earlier that part of the new Yahoo Site Explorer (placeholder page for now) service to come is a new bulk submit option for Yahoo. While we still wait for Site Explorer, Barry Schwartz at Search Engine Roundtable reports that the bulk submit part is now live.

It's rudimentary compared to Google Sitemaps, in that you can't prioritize pages, ping Google that you have updates or anything like that. On the other hand, there's a big, big plus in the simplicity. Just make a text file with a list of your URLs, one URL per line. Then submit the location of that file via this page. Have fun!

By the way, Google Sitemaps will also accept a text file in the same manner. So if you've done one for Google, you're set for Yahoo. Doing one for Yahoo? Then you're OK for Google.
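For anyone who wants to generate that text file rather than type it, here is a minimal Python sketch. It assumes nothing beyond the "one URL per line" rule described above; the function names and the urllist.txt file name are placeholders of mine, not anything either search engine requires.

```python
def build_url_list(urls):
    """Return text with one URL per line, skipping blanks and duplicates."""
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "".join(line + "\n" for line in lines)


def write_url_list(urls, path="urllist.txt"):
    """Write the list to a file you can upload to your server."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_url_list(urls))


if __name__ == "__main__":
    # Placeholder URLs; replace with the pages you want crawled.
    write_url_list([
        "http://www.example.com/",
        "http://www.example.com/about.html",
    ])
```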

FYI, for the "what's old is new" set, this is exactly how Infoseek worked back in 1997, except for the instant inclusion. When you gave Infoseek your list, all the URLs got in. Yahoo's service, like Google's, is merely a way to suggest that pages get crawled and added. There's no guarantee they will.

Posted by Danny Sullivan at 7:43 AM | Permalink

June 14, 2005

Google Sitemaps Gains List Of Supporting Programs & Web Sites

Via Google Blogoscoped, news that Google has added a new Sitemaps Third Party Programs & Websites page that lists third party programs and web sites relating to Google Sitemaps. It's a much welcomed resource, as support for sitemaps has been springing up from various vendors. Expect that the page will grow further. I see some things not on the list and will be sending across contributions that I've spotted. Use the contact link at the bottom of the page if you wish to contribute or be listed, as well. Oddly, the page doesn't appear linked off the main Google Sitemaps page itself. For more info on Google Sitemaps, see the FAQ, our Google Sitemaps Q&A from when it launched and our continuing forum thread that lists a variety of programs, answers and information: Google Sitemaps Now Accepting Web Page Feeds.

Posted by Danny Sullivan at 10:07 AM | Permalink

June 7, 2005

Should Google Sitemaps Take Pings? How About More General Compatibility Generally?

Jeremy Zawodny says he's scratching his head over the new Google Sitemaps system and wondering why Google doesn't adopt a ping-based page updating system instead. And why didn't I ask them, as well, he wonders. Well...

First, forget a ping-based system. The bigger question is why not have some type of system that works with other feed systems already? In particular, why not have a system that works with the XML feeds that Yahoo itself already takes in through its paid inclusion program? After all, if I'm already feeding Yahoo, I don't want to rework things to feed Google as well.

Answer? I did ask. But the answer I got wasn't on the record. Google often provides information but doesn't put it on the record. Yahoo does the same.

I can say the answer came back under the "we're always thinking; we're always open" type of response that Google and Yahoo both typically give.

The more important point is that Google Sitemaps is new, not necessarily set in stone, and certain to develop. I'm just glad to finally have something that lets webmasters -- any webmaster -- feed URLs into the major search engines for consideration. We haven't had this for free, for everyone, since Infoseek back in like 1998 or 1999.

Now that Google's offering it, I'd love to see Yahoo and MSN and Ask Jeeves jump in with suggestions on how it could be made better and work for everyone.

Well, what about a ping-based method to say something's new? Well, pinging doesn't work as well as you think. Just today, I watched a blog feed me 25 "new" posts. They weren't new. Instead, the feed glitched, or was updated or something. But pings went out as if things had changed.

In addition, although we're told with each passing day that there are 100 million blogs and counting, not everyone has a system set up to ping, much less put out feeds. Yep, millions are. But millions aren't. In fact, it may very well be that the majority of content on the web is not set up to be fed or to ping at all.

Add to that the fact that you've got some site owners and marketers who will be more than happy to ping you every day, if they think convincing you that they are fresh will boost a ranking. They aren't fresh, but they'll ping, ping, ping away.

It's also not necessarily the best case to be in a ping-and-retrieve situation. It's a waste of bandwidth, for one. If you trust me, far better I feed you the actual page content for inclusion. If you trust me, of course.

I don't know what the exact solution is. I know Google certainly doesn't have it perfectly right. In fact, we probably need a range of solutions. Look at our forum thread on the new Google system -- Google Sitemaps Now Accepting Web Page Feeds -- and you can get a sense of further wants, gripes, needs, suggestions and workarounds.

In particular, note that people want some easy solutions. They don't necessarily want to generate an XML feed using Python. Can't I just send you a list of URLs via Excel? They ought to be able to. For many people, that would be fine.

Jump into that thread and add what you want, from Google or from search engines in general. I'll also be watching and coming back to this. I've been collecting a number of posts and comments on the subject since it came out last week for a follow-up. Sometimes it's also nice to sit back and see the comments and how something actually develops, then go back and discuss further how it might need to change or be shaped. But I thought I'd touch on this important point now.

Postscript: First, I'm an idiot! It was right in our own interview with Google that a simple text file listing all your URLs one by one is fine. Specific FAQ info is here. Also worth noting there are many different alternative formats you can use, including RSS 2.0. FAQ on that here. As for pinging, while you cannot ping that individual pages have been updated, you can ping that your sitemap overall has been updated. FAQ here. As said, I'm working on a follow-up to come.

Posted by Danny Sullivan at 6:05 PM | Permalink

June 2, 2005

New "Google Sitemaps" Web Page Feed Program

Today, Google has unveiled a new Google Sitemaps program allowing webmasters and site owners to feed it pages they'd like to have included in Google's web index. Participation is free. Inclusion isn't guaranteed, but Google's hoping the new system will help it better gather pages than traditional crawling alone allows. Feeds also let site owners indicate how often pages change or should be revisited. Below, a Q&A on the new program with Shiva Shivakumar, engineering director and the technical lead for Google Sitemaps.

Can you give us a summary of how the new feed program will work?

Webmasters create XML files containing the URLs they want crawled, along with optional hints about the URLs, such as when the page last changed and its rate of change. They host the Sitemap on their server and tell us where it is. We provide an open-source tool called Sitemap Generator to assist in this process. Eventually, we are hoping webservers will natively support the protocol so there are no extra steps for webmasters. When a Sitemap changes, we support auto-notifying us so we can pick up the newest version.

Why are you doing this?

We want to index all publicly available information so we can offer better search results. However, currently web crawling is limited. Crawlers don't know all the pages at a website (e.g., dynamic pages), when those pages change, how often to recrawl pages, how much load to put on a website. So they try to guess. We want to work collaboratively with webmasters to get a big picture of all the URLs we should be crawling, and how often they should be recrawled. Ultimately this benefits our users by increasing the coverage and freshness of our index.

What are the technical details? Just a list of URLs? An XML feed?

We defined a simple XML format that includes the URLs plus optional last modification date, change frequency, and relative priority. We do support a simple list of URLs as well, but using the XML format will help us crawl the sites better.
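To give a feel for what such a file might look like, here is a rough Python sketch that builds one. It is an illustration, not Google's Sitemap Generator. It assumes the element names urlset, url, loc, lastmod, changefreq and priority, and it uses the namespace from the later Sitemaps.org 0.9 protocol, which may differ from the schema Google published at launch. The build_sitemap function and the sample URLs are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the later Sitemaps.org 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from dicts with 'loc' plus optional hint fields."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = entry["loc"]
        # The three hints are optional, so only emit the ones supplied.
        for field in ("lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = str(entry[field])
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        {"loc": "http://www.example.com/", "lastmod": "2005-06-02",
         "changefreq": "daily", "priority": 0.8},
        {"loc": "http://www.example.com/about.html"},
    ]))
```

The second entry carries no hints at all, which is the XML equivalent of the plain URL list.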

Do you need for me to prove in some way that I'm associated with the site I'm submitting for?

We accept all the URLs under the directory where you post the Sitemap. For example, if you have posted a Sitemap at www.example.com/abc/sitemap.xml, we assume that you have permission to submit information about URLs that begin with www.example.com/abc/.
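That directory rule can be sketched in a few lines of Python. This is only my reading of the answer above, not Google's actual check: a page is treated as in scope when it sits on the same host as the Sitemap and its path begins with the Sitemap's directory. The function names are invented, and the sketch ignores the URL scheme and port differences.

```python
import posixpath
from urllib.parse import urlsplit


def sitemap_scope(sitemap_url):
    """Return (host, directory) covered by a Sitemap at the given URL."""
    parts = urlsplit(sitemap_url)
    directory = posixpath.dirname(parts.path)
    if not directory.endswith("/"):
        directory += "/"
    return parts.netloc.lower(), directory


def in_scope(sitemap_url, page_url):
    """True if page_url falls under the directory holding the Sitemap."""
    host, directory = sitemap_scope(sitemap_url)
    page = urlsplit(page_url)
    return page.netloc.lower() == host and page.path.startswith(directory)


if __name__ == "__main__":
    sitemap = "http://www.example.com/abc/sitemap.xml"
    for candidate in ("http://www.example.com/abc/page.html",
                      "http://www.example.com/xyz/page.html"):
        print(candidate, in_scope(sitemap, candidate))
```

Under this reading, a Sitemap placed at the root of a site covers every URL on that host.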

Will all my URLs get in? Some? Any guarantee? And how quickly?

At this early stage, we cannot guarantee that we'll crawl or index all your URLs. But as we understand the data better, we hope to get more of the data into our crawl and indices.

How does someone sign-up?

Go to Google Sitemaps and use your Google Account or create a new one to sign in. If you already use Gmail, Groups, My Search History, Alerts, or Froogle Shopping List, you already have a Google Account.

And this is all for free?

Absolutely. Also, this is an open protocol. We are hoping all webservers and search engines adopt this protocol and benefit from the increased collaboration.

Any chance you may provide a reporting tool down the line, so people can tell what searches are sending them clicks?

We are starting with some basic reporting, showing the last time you've submitted a Sitemap and when we last fetched it. We hope to enhance reporting over time, as we understand what the webmasters will benefit from. If you have ideas on more of what you would like to see, let us know at the new Google-Sitemaps area at Google Groups.

How will you prevent people from using this to spam the index in bulk?

We are always developing new techniques to manage index spam. All those techniques will continue to apply with the Google Sitemaps.

If I don't use the program, you may still find pages through the regular way of crawling, correct?

Yes. This program is a complement to, not a replacement of, the regular crawl. However, we hope that the hints you offer in the Sitemap will help us do a better job than the regular crawl.

Still have more questions or comments? The Sitemaps FAQ goes into depth on many more details. The Google Sitemaps team will be taking questions and responding all day at our Search Engine Watch Forums thread, Google Sitemaps Now Accepting Web Page Feeds. Long-term, the team will also be monitoring the new Google-Sitemaps area at Google Groups.

Posted by Danny Sullivan at 7:52 PM | Permalink
