As Google's popularity has swelled, so has webmasters' interest in getting listed in the service. To help meet this interest, Google has been moving forward on a number of fronts. It has posted new information for site owners, opened an automated removal tool and even created an online forum for Google questions.
Last month, the "Google Information for Webmasters" area was unveiled. It provides answers to many questions webmasters have about how their pages are listed with Google, such as:
- Getting Listed
- Not Listed
- Incorrect Listing
- Rank Questions
- Dos and Don'ts
- Facts & Fiction
If you are a regular newsletter reader, much of the information will already be familiar. Nevertheless, it's well worth reviewing the information that Google has published as a refresher and to get policies and information directly from Google itself.
Removing pages, page descriptions or cached copies of pages has also been made easier for webmasters. In particular, an automatic removal tool went up in the summer that lets you remove web pages, images, dead links or newsgroup posts in about 24 hours.
One innovation of the tool is that when using it, the robots.txt file need not be in a web server's root directory. That's helpful to those who have web space within another person's domain. These people may not have the ability to install a robots.txt exclusion in the root directory of the web server. However, exclusions done via subdirectory robots.txt files must be renewed every 90 days. Otherwise, the pages will again appear within Google.
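Before submitting a robots.txt file to the removal tool, it can help to confirm that its rules really do block the pages you mean to remove. Here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the `www.example.com` URLs and the helper names `is_blocked` and `renewal_due` are all made up for illustration. The only figure taken from above is the 90-day renewal period, and nothing here talks to Google itself.

```python
from datetime import date, timedelta
from urllib.robotparser import RobotFileParser

# Hypothetical rules for someone whose pages live in a subdirectory
# of another person's domain.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /members/jane/private/
"""

# Renewal interval for subdirectory-based exclusions, as stated in the article.
RENEWAL_PERIOD = timedelta(days=90)


def is_blocked(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given rules forbid `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def renewal_due(submitted: date) -> date:
    """Date by which a subdirectory-based exclusion would need renewing."""
    return submitted + RENEWAL_PERIOD


if __name__ == "__main__":
    base = "http://www.example.com/members/jane/"
    print(is_blocked(ROBOTS_TXT, base + "private/notes.html"))
    print(is_blocked(ROBOTS_TXT, base + "index.html"))
    print(renewal_due(date(2001, 10, 15)))
```

Note that the paths in the Disallow lines are still written relative to the top of the web site, even when the file itself sits in a subdirectory.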
In addition to fast robots.txt file removals, pages marked for exclusion with a meta robots tag can also be removed in about 24 hours, when using the automated removal tool.
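If you rely on meta robots tags for exclusion, it is worth checking that the tags are actually present on your pages before asking for removal. The sketch below uses Python's standard `html.parser` module to collect the directives a page declares. The class and function names are my own, and the sample page is invented. This shows one way to audit your own pages, not how Google reads them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # e.g. {"robots": {"noindex", "nofollow"}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        if name in ("robots", "googlebot"):
            values = {
                part.strip().lower()
                for part in attr.get("content", "").split(",")
                if part.strip()
            }
            self.directives.setdefault(name, set()).update(values)


def robots_directives(html: str) -> dict:
    """Return the robots directives declared in a page's meta tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def is_excluded(html: str) -> bool:
    """True if any robots or googlebot meta tag on the page says noindex."""
    return any("noindex" in values for values in robots_directives(html).values())


if __name__ == "__main__":
    sample = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    print(robots_directives(sample))
    print(is_excluded(sample))
```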
In both the cases above, the automated removal tool should only be used if you want to REMOVE your pages quickly from Google. It offers absolutely no mechanism for ADDING your pages to the search engine.
The removal tool can also be used to quickly remove page descriptions, or "snippets," as Google calls them, as well as the cached copies of your web pages that Google makes available to its users. Both of these actions are conducted using options within the "Remove an outdated link" area of the removal tool -- which can also quickly kill off dead links to your site.
Removing snippets, either via the automatic removal tool or through regular crawling, depends on installing the NOSNIPPET meta tag. You place this on any page you don't want to have described, and it looks like this:
<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">
This tag only works with Google. It will not prevent descriptions from appearing in other search engines. In addition, it will not prevent a description appearing at Google for your web page if that web page is also described in the Open Directory.
Huh? You see, Google will display both its own snippet and an Open Directory description, if a page within Google's web page index is also listed in the Open Directory. For example, look at this listing for Microsoft:
Welcome to the Microsoft Corporate Web Site
... See how Microsoft Research is working to deliver
digital butlers and more. ...
Description: Official homepage of Microsoft Corporation
Category: Computers > Companies > ... > Consumer Software > Microsoft Corporation
The portion under the title is the Google snippet, formed by seeking the first text on the page that contains the search term. If you used the nosnippet meta tag, this portion would be removed. However, the line that begins with "Description" is the description of the site from the Open Directory. This description would NOT be removed. And, the category link at the end of the listing takes you to where the site is listed, within Google's version of the Open Directory.
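To make the idea of a snippet concrete, here is a rough sketch of pulling out the first stretch of page text that contains a search term. It only illustrates the general approach described above. Google's actual method is not public, and the `make_snippet` function, its `width` parameter and the ellipsis formatting are all my own assumptions.

```python
import re


def make_snippet(page_text: str, term: str, width: int = 60) -> str:
    """Return a window of text around the first occurrence of `term`.

    Returns an empty string if the term does not appear on the page.
    `width` is the approximate number of context characters to keep.
    """
    text = " ".join(page_text.split())  # collapse runs of whitespace
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        return ""
    start = max(0, match.start() - width // 2)
    end = min(len(text), match.end() + width // 2)
    return "... " + text[start:end].strip() + " ..."


if __name__ == "__main__":
    page = (
        "Welcome. See how Microsoft Research is working to deliver "
        "digital butlers and more. Other news follows."
    )
    print(make_snippet(page, "microsoft"))
```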
Removing a page's snippet also causes any cached copy of the page to be removed. In turn, this could mean that your page might receive a ranking decrease, so you'll want to be careful about removing your snippets. To understand why, let's look more closely at cached pages and the specific page caching removal command at Google.
Google makes it possible to see what its spider saw, when it visited a web page. For example, when you do a search, you'll see a link called "Cached" appearing below each page that is listed. Clicking on this link brings up a copy of the page out of Google's web page index, not from the site itself.
For instance, when I searched for CNN on Monday, Oct. 15, the CNN site was listed. I clicked on the cached link and saw a copy of the CNN home page from the last time Google visited it -- which was September 13, based on the date that page showed.
This is one plus to the cached page option -- you can easily measure how fresh (or not) the Google index is. The option is also a great way to see copies of pages that no longer exist or which have changed recently. Finally, Google will also highlight your search terms on the cached pages, making it easy to spot the information you seek.
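The freshness check is simple date arithmetic. As a small sketch, here it is for the CNN example, assuming both dates fall in 2001 (the text gives no year for them) and using a helper name, `cache_age_days`, that I made up:

```python
from datetime import date


def cache_age_days(searched_on: date, cached_on: date) -> int:
    """Number of days between the cached copy's date and the day of the search."""
    return (searched_on - cached_on).days


# Searched on Oct. 15; the cached copy was dated September 13 (year assumed to be 2001).
print(cache_age_days(date(2001, 10, 15), date(2001, 9, 13)))  # 32
```

In other words, the copy of the CNN home page was a little over a month old.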
As you might imagine, not every site owner wants their page cached. Indeed, the legality of Google's page caching is unknown. No one has yet sued a web-wide, crawler-based search engine over the copies of pages they make to form their listings. However, when asked in the past about possible legal concerns, the search engines have generally taken the line that because they are not making entire pages available, what they are doing to make listings isn't copyright infringement.
Google can't make that argument when it comes to page caching. One can indeed see a copy of a web page, at least the text of the page. Google does not cache images, though if those images are still online, the page will often be reconstructed with them.
To alleviate concerns (and probably possible lawsuits), Google allows site owners to "opt-out" of page caching. They can place a special meta tag on each page they do not want to be cached:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
When the tag is installed, the page will still be listed, along with a snippet, but users cannot see a cached copy of the page.
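The two opt-out tags interact in a way that is easy to lose track of, so here is a small sketch that models the rules described so far: the snippet tag removes both the snippet and the cached copy, the caching tag removes only the cached copy, and neither touches an Open Directory description. The `Page` class, its field names and the `render_listing` function are hypothetical. This is a model of the behavior as reported here, not of Google's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    title: str
    snippet: str
    odp_description: Optional[str] = None  # Open Directory description, if listed there
    nosnippet: bool = False                # page carries the snippet-removal tag
    noarchive: bool = False                # page carries the cache opt-out tag


def render_listing(page: Page) -> dict:
    """Return the parts of a listing that would be shown for `page`."""
    show_snippet = not page.nosnippet
    # Removing the snippet also removes the cached copy.
    show_cached = not (page.noarchive or page.nosnippet)
    return {
        "title": page.title,
        "snippet": page.snippet if show_snippet else None,
        "description": page.odp_description,  # unaffected by either tag
        "cached_link": show_cached,
    }


if __name__ == "__main__":
    page = Page(
        title="Welcome to the Microsoft Corporate Web Site",
        snippet="... See how Microsoft Research is working to deliver ...",
        odp_description="Official homepage of Microsoft Corporation",
        nosnippet=True,
    )
    print(render_listing(page))
```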
I think this is a great compromise, one that allows Google to offer the service of page caching to its users while giving site owners control, if they don't want to be cached. Indeed, I love page caching because I think it's a great way to discover people who may be infringing your copyright via IP cloaking.
With cloaking, it is possible for someone to take a copy of your page, present it to a search engine as their own and prevent you from easily knowing what they have done. Because of this, I've long said that I think every search engine should make it possible to see exactly what they have spidered, so that you can determine if copyright theft has occurred.
There are only two real arguments against having this type of mandatory page caching feature. First, there's the issue that the search engine may violate copyright by providing cached copies, as I've described as possibly being the case with Google. This is easily solved by saying that as part of the terms of being listed in a search engine, you allow a cached copy to be presented. If you don't agree, you don't get listed -- simple as that.
The second argument would be that mandatory page caching would wipe out the "advantage" that cloaking can offer, which is to show search engines pages with code optimized for their crawlers while simultaneously showing human visitors more attractive versions of those pages. However, page caching wouldn't harm this, because if a user clicks on a listing, they'll still see the human-optimized version.
Of course, one "advantage" that would be lost is the ability to prevent people from seeing highly-optimized web pages. Some pages are constructed in such a way that the optimizers don't really want what they consider to be their "secret recipe" to search engine success to be seen by others. Page cloaking protects their secrets, and making page caching mandatory would definitely remove this security. However, such heavily-engineered pages are also likely to appear as gibberish to the average human visitor, and such gibberish pages are already generally seen as spam by most search engines.
Mandatory page caching doesn't exist at any search engine, even Google, since you can use the opt-out option there and stay listed. However, Google does view pages that choose to opt-out with great suspicion. Why? Because those opting-out tend to be those who are cloaking -- which Google flat out does not allow -- or those who aren't necessarily cloaking but trying to do other things that Google considers to be manipulative.
"I was pretty struck by how few people use the noarchive tag for the reason it was intended [to protect copyright]," said Matt Cutts, the software engineer at Google who deals often with spam and webmaster issues.
Because of this, site owners should be careful in using the noarchive tag, as doing so will probably subject the page to greater scrutiny and a stronger penalty, if it is found to be spam.
"We can use the fact that they say noarchive to single them out," Cutts said. "And, while we don't penalize pages for using it, if a spam page uses the noarchive tag, then the penalty for that page becomes more severe."
In recent months, Google has also taken a stronger stance against spam. It's a change for a search engine whose founders used to quip back in 1999 that they weren't worried about spam. However, in 2001, attempts to spam the engine and tap into Google's rising popularity have become a problem.
"It's a growing priority. We've seen hundreds making attempts," Cutts said.
Google's hit with all the usual suspects, such as mirror sites, low-quality doorway pages and pages with invisible text. Link farms are also a problem, where sites are creating artificial link structures in hopes of boosting their popularity in Google's link analysis system. And cloaking, which Google considers to be spam, nonetheless can still get through the company's filters.
Going forward, Google is planning to tighten its spam filters even more. It's also taking new steps to adjust scoring. For example, text in the no frames area of frame pages is weighted less than ordinary HTML text. Why? Because no frames text is more or less invisible to users and thus seen as less trustworthy, since it is not constantly being evaluated by human visitors. Similarly, text or links that are hidden in some way from easy viewing are also likely to be downgraded in importance.
Google doesn't see such changes as specifically going after spam, however. Instead, the company views this as part of its overall job in producing the best search results possible.
"I would not cast our efforts as a growing hard-line on spam. I'd say that we're strengthening our algorithms in many different ways to improve quality, and that naturally has effect on spam as well. It's certainly true that people try all sorts of tricks, but Google is still more resistant to spam than other engines, and I expect it to become even better at scoring pages over time," Cutts said.
Part of improving results also means helping and educating site owners.
"I think that the overarching trend lately is becoming more responsive to webmasters. The 'info for webmasters' page is a good start, as is the URL removal tool. The Google newsgroup is another way to reach out to webmasters. We have even had employees join forums in a friendly but unofficial capacity. All of these efforts are in an attempt to help webmasters," Cutts said.
That Google newsgroup is a public forum that opened in early September. It's meant to be for all things Google, not just webmaster issues. Nevertheless, site owners may find help with listing issues there.
Google says it doesn't monitor the group on a regular basis but rather now and then. It primarily relies on Google users to help each other, though it will provide assistance directly, if seen as necessary or useful. Google says it also may break out sub-groups for particular topics in the future, if appropriate.
The "unofficial" capacity is a reference to the appearance of "Google Guy" at the Webmaster World web site. A real Google employee, he has asked for webmaster feedback and occasionally offered to bust myths for those in the forum.
Google Information for Webmasters
Google hitting your server too fast? Want to know the best way to get listed? Can a competitor hurt your rankings? Answers to these and more can be found in this area at Google.
Remove Content from Google's Index
Detailed information on removing pages, newsgroup posts, dead links or snippets from Google. Also has links to the fast, automatic removal tool.
Google Public Support Group
This is the Usenet area where Google questions are discussed.
Webmaster World: Greetings From Google
Lots of questions to Google are being posted here, with some infrequent answers. In particular, don't expect Flash content to get indexed anytime soon because there's not really a lot of text to Flash content that can be indexed.
How Google Works
Information for Search Engine Watch members that covers key details on how Google operates.
How To Block Search Engines
Beginner's guide to the robots.txt file, to block pages from being spidered. Also links to information on the similar meta robots tag.
Search Engines and Legal Issues
You'll find articles here that discuss issues about the legality of robot crawling as well as issues involving pagejacking and cloaking.