May 8, 2008

SEW Experts: Giving Links Away

There are a few ways of controlling what pages of your site share their link love. In today's Link Building column, "Giving Links Away," Sage Lewis explains the concepts of PageRank "sculpting" and siloing: two methods that use the "nofollow" attribute to control which links are counted in search engine ranking algorithms.

Posted by Kevin Newcomb at 12:00 AM | Permalink | Comments (0)

April 23, 2007

David Naylor Launches Robots.txt File Builder

I was actually outside the Hilton discussing this project less than two weeks ago with David Naylor and he has already delivered the first part of his idea.

Many people screw up their robots.txt file and deny the search engine spiders access to their sites. Dave thought it would be a great idea to create a central site that acts autonomously, where people can have their robots.txt file created and stored to ensure good interaction with the spiders.

His initial offering allows people to create the file and then copy and paste it into a page they can upload to their own site. Eventually Dave wants to host the pages himself and make sure the spiders correctly spider them. The site would be the central location for all spiders to get the right written file for any website.
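To give a feel for what a builder like this has to do, here is a minimal sketch in Python. It is not Dave's tool: the function name, the `allow_block_all` flag and the dictionary input format are all assumptions made up for illustration. It renders a set of rules as robots.txt text and refuses, by default, to emit the classic mistake of a bare "Disallow: /" that shuts a spider out of the whole site.

```python
def build_robots_txt(rules, allow_block_all=False):
    """Render {user_agent: [disallowed paths]} as robots.txt text.

    Hypothetical helper: the input format and the safety flag are
    assumptions for this sketch, not part of any real builder.
    """
    blocks = []
    for agent, paths in rules.items():
        # Guard against the common accident of blocking the entire site.
        if "/" in paths and not allow_block_all:
            raise ValueError(
                f"'Disallow: /' would shut {agent} out of the whole site"
            )
        lines = [f"User-agent: {agent}"]
        # An empty Disallow value means nothing is blocked for that agent.
        lines += [f"Disallow: {p}" for p in paths] or ["Disallow:"]
        blocks.append("\n".join(lines))
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(build_robots_txt({"*": ["/private/"], "Googlebot": []}))
```

The output can be pasted into a file named robots.txt at the root of the site, which is the copy-and-paste workflow described above.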

The subtle differences between the spiders can be adapted, but Dave felt it would also be a way to get uniformity from the engines once they saw people using the site in sizable numbers.

The other ancillary benefits he discussed were the ability to determine load times for a given site and to get the spiders to visit at low-traffic times so as not to overload the client's site.

I am impressed by how quickly he got started on this. But then again, he did share it at SES NYC with other fast-to-market players... maybe he correctly guessed he had better move on it quickly before someone else did.

Great job so far Dave... now don't forget the rest!

Posted by Frank Watson at 2:22 PM | Permalink

November 30, 2006

Google Ordered By Another North Carolina Court To Remove Pages

Apparently, North Carolina is going to start a trend of people who get court orders to remove material Google has spidered when left out in public view. This week, Google was ordered to remove material by a court in that state. It follows a similar court order in a different case earlier this year.

North Carolina County Gets Restraining Order Against Google from the Associated Press covers how social security numbers, cell phone numbers and other personal information were left online by Johnston County, which means Google (and likely other search engines) spidered the material.

When the county realized this, they sought to have it removed. However, they were told it might take up to five days to remove, prompting the county to go the legal route:

Fearing the possibility of identity theft, Johnston County officials asked Google on Monday to remove the information. It was first posted on the county's Web site by accident six weeks ago and discovered Friday. Mountain View, Calif.-based Google responded that removal could take up to about five days, said county attorney Mark Payne.

"It surprised me that Google didn't immediately recognize that this was something that posed a real danger of real damage to our citizens," Payne said.

Hey, it surprised me that Johnston County didn't immediately recognize that the information shouldn't have been put on the public web in the first place. However, that appears to have happened because of a third party contractor.

What about the automatic URL removal system? I seem to recall that as getting pages out in 48 hours or less (but I might be remembering incorrectly). Checking today, officially it is longer (unofficially, I hear it goes faster):

You may process your URL for removal from Google's search results. URLs will be removed after we've verified your request. Bear in mind that verification can take several days or longer and all pages submitted via the automatic URL removal system will be removed from the Google index temporarily for six months.

Google Blamed For Indexing Student Test Scores & Social Security Numbers and Follow-Up: School Couldn't Reach Google Until Injunction Filed cover how a school authority in North Carolina went to the courts to remove pages from Google in June.

Posted by Danny Sullivan at 1:06 PM | Permalink

Microsoft On How To Let MSNBot In, Keep Bad Bots Out

The Live Search Blog described how you can verify whether the MSNBot you see crawling your site is truly the MSNBot from Microsoft or some rogue spider trying to steal your content. Microsoft has added a way to look up the reverse DNS information for the bot's IP address and described what you should see to confirm that it is the official MSNBot. If it is not, you may want to block it or report it to Microsoft. A step-by-step guide is at the Live Search Blog.
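For those who want to automate the check, here is a rough Python sketch of a reverse DNS lookup. Treat the details as assumptions: the hostname suffixes in `TRUSTED_SUFFIXES` are illustrative and should be replaced with whatever the search engine currently documents, and the forward lookup that confirms the hostname maps back to the same IP is an extra precaution added here, not something the post above spells out. The function name and the injectable `reverse` and `forward` parameters are hypothetical.

```python
import socket

# Assumed suffixes for illustration only; check the search engine's own
# documentation for the hostnames its crawler really uses.
TRUSTED_SUFFIXES = (".search.msn.com", ".search.live.com")


def is_trusted_bot(ip, suffixes=TRUSTED_SUFFIXES,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the hostname suffix, then forward-confirm."""
    try:
        host = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:                    # no reverse DNS record
        return False
    if not host.lower().endswith(suffixes):
        return False
    try:
        addresses = forward(host)[2]   # addresses the hostname resolves to
    except OSError:
        return False
    # Assumed extra step: the hostname must point back at the same IP.
    return ip in addresses
```

Calling `is_trusted_bot(ip)` with the visitor's IP address from your server logs performs live DNS lookups. The `reverse` and `forward` parameters exist only so the logic can be exercised without touching the network.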

What about Googlebot? We covered that here.

Posted by Barry Schwartz at 9:01 AM | Permalink

November 2, 2006

Yahoo Slurp Adds Wildcard Support For Robots.txt

The Yahoo Search Blog announced that Yahoo's web crawler, aka Yahoo Slurp, now supports wildcards in the robots.txt file. The two characters Yahoo now supports are "*" and "$". The * tells Yahoo to "wildcard match a sequence of characters in your URL." The $ tells Yahoo to "anchor the match to the end of the URL string." Many more details are at the Yahoo Search Blog.
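To see how the two characters behave, here is a small Python sketch that translates a robots.txt path pattern into a regular expression. It is an illustration of the matching rules as quoted above, not Yahoo's implementation, and the function names and sample paths are made up.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using '*' and '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then let each '*' match any sequence.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, disallow_pattern):
    # Rules match from the start of the URL path, hence re.match.
    return robots_pattern_to_regex(disallow_pattern).match(path) is not None


if __name__ == "__main__":
    for path in ("/docs/report.pdf", "/docs/report.pdf?x=1"):
        print(path, is_blocked(path, "/*.pdf$"))
```

Without the trailing $, a pattern behaves as a prefix match, so "/private" would also cover "/private/anything".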

Posted by Barry Schwartz at 4:04 PM | Permalink

September 26, 2006

Some Google Belgium Follow-Ups

Just a quick note that Google's posted on its official blog about the Google Belgian news issue that I've been covering, while William Slawski has a nice translation in the works on the ruling itself.

About the Google News case in Belgium from the Official Google Blog doesn't really provide much new information that you haven't already gotten in reports from me and others. What should it provide? How about answers to:

  • Exactly how did Google fail to react to the legal action before it went to trial? Information was sent to Google's headquarters in Belgium. If it had been acted upon, Google might have won in the first round of the case by actually presenting a defense, rather than being absent.  
  • Why did Google initially refuse to post the ruling on the Google web sites in Belgium after last Friday's decision, then change its mind?

The post does stress that there are ways for publishers to easily stay out of Google. Those ways don't appear to have been presented to the court itself. Writes William Slawski in Belgian Copyright Ruling Against Google News:

I'm surprised by the lack of mentions of the use of a noarchive meta tag or noindex meta tags or by the use of robots.txt to disallow Google from indexing or archiving the pages of the newpapers in question.

While the Court does note that the onus of keeping copyright from being infringed falls upon the owner of the technology used to take text from the newspapers in question, this seems like an omission worth noting.

Regardless of how the Court may have felt about those options, I think that they should have been addressed in some manner. The failure to do so makes it appear that they either weren't provided information about those by their expert, or didn't understand them, or may not have addressed those issues on purpose.

A simple noarchive tag would have kept information on those pages from being cached by Google. A noindex tag or disallow directive should have kept their pages from being indexed at all by Google. Were they using these and Google ignored them? I suspect that they weren't.
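For anyone who wants to check whether a page actually carries these tags, here is a minimal Python sketch that reads the directives out of a meta robots tag. The class and function names are invented for this example, and the sample page is hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # Directives are comma separated, e.g. "noindex, noarchive".
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noarchive"></head></html>'
    print(robots_directives(page))
```

A page that returns "noarchive" is asking not to be shown as a cached copy, while "noindex" asks to be left out of the index altogether.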

After some more analysis, including an important argument over whether Google is a portal competing with newspapers or a search engine (answer, in my view, probably both depending on whether you keyword search Google News or read by browsing), he provides a long and what seems fairly complete English translation of the French-language ruling.

For more background on the case, see my prior posts:

Posted by Danny Sullivan at 9:03 AM | Permalink

September 25, 2006

Google Changes Mind, Posts Belgian Ruling

Google has now posted the text of a Belgian ruling finding it violated copyright on the Google Belgium home page. The ruling has also been posted to the home pages of Google Images Belgium and Google News Belgium, but not Google Groups Belgium.

Last week, a court ruled Google had violated the copyright of several Belgian newspapers by listing them within Google News. The court ordered the removal of those papers from Google, which the company quickly complied with.

The court also ordered Google to post the ruling on its Belgian web site within 10 days or face a heavy fine. Google appealed that punishment, but it was upheld last Friday.

Despite losing its appeal, Google looked ready to defy the order to post the ruling and take the fines, until a second appeal could be heard in November. Now, the company has reversed course. The ruling went up on Saturday. The company gave no reason for the reversal to Reuters:

A spokesperson for Google declined to elaborate on the reasons that made the company change its mind but said it would seek to cancel the ruling.

"We are pleased that a judge has given Google the opportunity to appeal the substance of this case. This will be heard in November," the spokesperson said.

From Dow Jones newswire:

Google spokeswoman Rachel Whetstone told Dow Jones Newswires the company had agreed to publish the ruling on its Web site after studying the court judgment.

Technically, Google never failed to comply with the court ruling. It has 10 days from receipt of the ruling to act, and it has done so within that time, saving it exposure to fines. As noted, a second appeal on the ruling will happen in November.

Past coverage is below:

Also, I note that Microsoft's Windows Live is now operating illegally under Belgian law. For example, site:www.lesoir.be shows that Le Soir -- one of the publications involved in the lawsuit against Google -- has pages listed in Windows Live, as well as cached pages. In fact, here's an example of an article from Le Soir about the Belgian ruling against Google that I can read at Windows Live through its cached copy. To date, there is no news that Microsoft is about to be sued.

Finally, over at Threadwatch, an interesting comment points out that Google might have been OK in Belgium if it didn't show cached copies of pages:

The truly critical essence of this Belgian court ruling concerns Google's caching functionality. Here, protected content is being displayed a) in modified form; b) more often than not in its entirety (i.e. not restricted to mere snippets); and c) without copyright holders' permission. In most countries this would be viewed as a flagrant violation of copyright law - and obviously this is the stance the Belgian court has adopted. (And yes, there's been a contrary ruling by a US court, but that specific case seems to be rather more complicated on closer view; also, there's some indication that it was decided on arguably faulty assumptions, but that's another story.)

It is interesting to note that the Belgian ruling specifically acknowledges Google's right to store third party content (no mean concession, that, and far from self-evident) for search purposes only. But displaying it in the cache for everyone to see constitutes an act of re-publication which, like it or not, demands copyright holders' express permission.

This is a very important point. Search engines make copies of pages in order to make content searchable, as my Indexing Versus Caching & How Google Print Doesn't Reprint article explains in more detail. It's very difficult to argue this type of copying harms a site owner, especially when opting out is so easy.

Showing these actual copies through cached pages has long been disturbing for many people. While it's easy to opt out of such display, it feels a step beyond what a content owner should have to do. With cached pages, content is literally being reprinted rather than made searchable. It seems absurd for content owners to have to opt out in that instance.

Within the US, cached copies have so far been upheld, something I disagree with. But if Google were to eliminate them -- along with picture thumbnails -- it sounds like it might have a better chance of winning in Belgium.

Posted by Danny Sullivan at 5:51 AM | Permalink

September 22, 2006

Google Loses Appeal On Posting Belgian Ruling

Google loses appeal on posting court ruling from Reuters covers Google losing its appeal against a requirement to post the ruling of a Belgian court in a copyright infringement lawsuit on its Belgian web search and news sites. It will now be fined 500,000 euros for each day it fails to comply. Google has a further appeal on the entire case, including posting the ruling, that will be heard in November. My past article Google's Belgium Fight: Show Me The Money, Not The Opt-Out, Say Publishers has more about that and the entire case.

Posted by Danny Sullivan at 12:10 PM | Permalink

Publisher Groups To Test New Search Engine Rights Management System (Updated)

Several mostly print publisher groups say they are to test a new "Automated Content Access Protocol" that they feel will head off conflicts with search engines. A release with more information is below.

Exactly how the system will work, or why it is different from or better than existing systems like robots.txt or meta robots tags, isn't explained. More details are promised at the Frankfurt Book Fair on October 6.

I'm planning to talk with the World Association Of Newspapers to learn more about their plans next week, so I may have more before the formal unveiling. I've had a very informal talk already, and the view seems to be to find a way to make the existing systems work better. That's appreciated, and it's something the search marketing community has long wanted. But it's something I hope will involve more than just a group of publishers with mostly print interests.

My Google's Belgium Fight: Show Me The Money, Not The Opt-Out, Say Publishers article from earlier this week explains how, in my view, the entire issue that has erupted in Belgium is less about keeping content out of search engines and more about trying to force them to pay publishers for inclusion. Right now, any publisher that feels copyright is somehow infringed by being in a search engine has a very easy, very selective way to keep whatever it wants out: robots.txt files or meta robots tags. These work on a web-wide basis, are supported by all the major search engines and have been used by publishers of all types. They could definitely be improved -- but in the Belgium case in particular, using them would have solved the exact problem that was raised.
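To show how selective that control can be, here is a short Python sketch using the standard library's robots.txt parser. The robots.txt content, the example.com URLs and the `can_crawl` helper are all hypothetical. The rules keep one named crawler out of a /news/ section while leaving everything open to every other crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep Googlebot out of /news/, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /news/

User-agent: *
Disallow:
"""


def can_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "http://example.com/news/story.html"
    for bot in ("Googlebot", "msnbot"):
        print(bot, can_crawl(bot, story))
```

A handful of lines like these is all it takes for a publisher to exclude one search engine from one section of a site.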

Here's the release:

GLOBAL PUBLISHERS HEAD OFF LEGAL CLASH WITH SEARCH ENGINES: NEW RIGHTS MANAGEMENT PILOT IMMINENT

In the week that the publishers of Le Soir and La Libre Belgique won their case in the Belgian Courts against Google for illegally publishing content on its news service without prior consent, the World Association of Newspapers (W.A.N.), the European Publishers Council (E.P.C.) the International Publishers Association (I.P.A.) and the European Newspapers Association (E.N.P.A), are preparing to launch a global industry pilot project that aims to avoid any future clash between search engines and newspaper, periodical, magazine and book publishers.

The new project, ACAP (Automated Content Access Protocol), is an automated enabling system by which the providers of content published on the World Wide Web can systematically grant permissions information (relating to access and use of their content) in a form that can be readily recognised and interpreted by a search engine “crawler”, so that the search engine operator (and ultimately, any other user) is enabled systematically to comply with such a policy or licence. Effectively, ACAP will be a technical solutions framework that will allow publishers worldwide to express use policies in a language that the search engine's robot “spiders” can be taught to understand.

Gavin O'Reilly, Chairman of the W.A.N., said: “This system is intended to remove completely any rights conflicts between publishers and search engines. Via ACAP, we look forward to fostering mutually beneficial relationships between publishers of original content and the search engine operators, in which the interests of both parties can be properly balanced. Importantly, ACAP is an enabling solution that will ensure that published content will be accessible to all and will encourage publication of increasing amounts of high-value content online. This industry-wide initiative positively answers the growing frustration of publishers, who continue to invest heavily in generating content for online dissemination and use.”

Francisco Pinto Balsemão, Chairman of the E.P.C., said: “ACAP will unambiguously express our preferred rights and terms and conditions. In doing so, it will facilitate greater access to our published content, making it more, not less available, to anyone wishing to use it, whilst avoiding copyright infringement and protecting search engines from future litigation.”

ACAP will be presented in more detail at the forthcoming Frankfurt Book Fair on 6th October and will be launched officially by the end of the year. W.A.N., the E.P.C. and I.P.A. will run the pilot for a period of up to 12 months and it will be managed by Rightscom Ltd.

===

The European Publishers Council is a high level group of Chairmen and CEOs of European media corporations actively involved in multimedia markets spanning newspaper, magazine and online database publishers. Many EPC members also have significant interests in commercial television and radio.

The World Association of Newspapers groups 72 national newspaper associations, individual newspaper executives in 100 nations, 13 news agencies, and nine regional press organizations, representing more than 18,000 publications in all international discussions on media issues, to defend both press freedom and the professional and business interests of the press. The International Publishers Association is a Non Governmental Organisation with consultative relations with the United Nations. Its constituency is of book and journal publishers world-wide, assembled into 78 publishers associations at national, regional and specialised level. The European Newspaper Publishers' Association is a non-profit association currently representing 5 100 national, regional and local newspapers. These daily, weekly and Sunday titles are published in 24 European countries where ENPA's members are operating in their national markets.

Postscript: I've just received this briefing paper that explains more. I've skimmed it and attached one note marked in bold. Basically, the existing robots.txt or meta robots systems can do a lot of what's already described here. What they cannot do is help search engines access content because the publisher allows this only through a licensing agreement, something the Belgian publishers seem to want. In addition, the pilot can do all it wants. Unless some major search engines agree to cooperate, the pilot will go nowhere. Again, I'll follow up more on this next week after talking with the groups involved.

ACAP: Automated Content Access Protocol

A briefing paper for publishers on a project in planning

1       Executive summary

All sectors of publishing face a “search engine dilemma”. The value of search engines to users – and to those who publish on the network – is incontrovertible. However, search engine activities can be very damaging to specific online publishing models. The undifferentiated model of permissions management (essentially either allowing or forbidding search of content) is inadequate to support the diverse present and future internet strategies and business models of online publishers.

At the beginning of 2006, the major publishing trade associations established a Working Party, chaired by Gavin O'Reilly, Chairman of the World Association of Newspapers, to consider the issues that this has raised. As a result, the World Association of Newspapers and the European Publishers Council are planning a project which will develop and pilot a technical framework which will allow publishers to express access and use policies in a language which the search engine's robot “spiders” can be taught to understand. This will make it possible to establish mutually beneficial business relationships between publishers and search engine operators, in which the interests of both parties can be properly balanced.

The project is provisionally called ACAP (for Automated Content Access Protocol). ACAP will develop and pilot a system by which the owners of content published on the World Wide Web can provide permissions information (relating to access and use of their content) in a form in which it can be recognised and where necessary interpreted by a search engine “crawler”, so that the search engine operator (and perhaps, ultimately, any other user) is enabled systematically to comply with such a policy or licence.

This paper is intended to brief publishers on the outline of this project and to encourage their active support and participation when the project is launched in September 2006.

2       Background – the “search engine” problem

At the beginning of 2006, the major Europe-based publishing trade associations – including the World Association of Newspapers (WAN); the European Publishers Council (EPC); the European Newspaper Publishers Association (ENPA); the International Publishers Association (IPA); the European Federation of Magazine Publishers (FAEP); the Federation of European Publishers (FEP); the World Editors Forum (WEF); the International Federation of the Periodical Press (FIPP) and Agence France Presse – established a Working Party to consider the issues that are posed by search engines for publishers, and to look at ways in which mutually beneficial relationships can be established between publishers and search engine operators, in which the interests of both parties can be properly balanced.

All sectors of publishing have a “search engine dilemma” (even if we disregard the particular problems that book publishers have with mass digitisation programmes). Search engines are an unavoidable and valued port of call for anyone seeking an audience on the internet. Search engines sit between internet users and the content they are seeking out and have found brilliantly simple and effective ways to make money from that audience. They have become so dominant that no individual website owner is large enough to have any serious impact on their commercial fortunes.

The benefits of powerful search technology to both users and providers of content are well recognised by publishers – although even “mere” search functionality can have a negative impact on some publishing business models. At the same time, publishers are aware that search engines are, in following their business logic, inevitably and gradually moving into a publisher-like role, initially merely pointing, then caching and, finally, aggregating and “publishing” and perhaps even creating content themselves, while using publishers' content at will.

In the current state of technology, there can be none of the differentiation of terms of access and use which characterises copyright-based relationships in publishing environments, whether electronic or physical. The search engines can and do reasonably argue that, since their systems are completely automated, and they cannot possibly enter into and manage individual and different agreements with every website they encounter, there is no practical alternative to their current modus operandi.

Whether this (technological and political) gap is there by design or by accident, the search engines are able to make their own rules and decide for themselves whose interests are worth considering.

If publishers are to take the initiative in establishing orderly business relationships with the search engine operators, the response must be to help them to address the problem, both to fill the technical gap and ensure its political implementation. To paraphrase the former copyright adviser to the UK Publishers Association Charles Clark's famous claim that “the answer to the machine is in the machine”, the challenges that are created by technology are best resolved by technology. Since search engine operators rely on robotic “spiders” to manage their automated processes, publishers' web sites need to start speaking a language which the operators can teach their robots to understand. What is required is a standardised way of describing the permissions which apply to a website or webpage so that it can be decoded by a dumb machine without the help of an expensive lawyer.

In this way, one of the search engines' most reliable rationalisations of their “our way or no way” approach will have been removed, and a structure which embraces and supports the diverse present and future internet strategies and business models of online publishers will have been created.

As a result of the work of the Working Party, a proposal was made to develop a permissions based framework for online content. This would be a technical specification which would allow the publisher of a website or any piece of content to attach extra data which would specify what use by search engines was allowable for that piece of content or website. The aim will be for this to become a widely implemented standard, ultimately embedded into website and content creation software.

Following the commissioning of a brief feasibility study, WAN and EPC have taken the initiative to establish a project to develop and pilot this framework to express publishers' access and use policies. A detailed plan for this project – provisionally called ACAP (for Automated Content Access Protocol) – is currently in development.

This paper is intended to brief publishers on the outline of this project and to encourage their active support and participation when the project is launched in September 2006.

3       ACAP – the vision

ACAP will develop and pilot a system by which the owners of content published on the World Wide Web can provide permissions information (relating to access and use of their content) in a form in which it can be recognised and where necessary interpreted by a search engine “crawler”, so that the search engine operator (and perhaps, ultimately, any other user) is enabled systematically to comply with such a policy or licence. Permissions may be in the form of:

  • policy statements which require no formal agreement on the part of a user
  • formal licences agreed between the content owner and the search engine operator.

There are two distinct levels of permissions which need to be managed within this framework:

  • The permission given to the search engine operators for their own operations (access, copy and download, cache, index, make available for display)
  • The delegation of rights given to the search engine operators to grant permissions of access and use to search engine users (search, access, view, copy, download, etc)

Although these can be managed within the same framework, it is important that the differences between them are recognised.

4       Use Cases

We include two informal Use Cases which are illustrative of the type of challenge that we seek to solve through ACAP.

4.1     USE CASE A: NEWSPAPERS

Newspaper publisher A would like all search engines to index his site, but only search engines X, Y and Z may display articles (because they have paid a royalty) on their news pages, and then only for 30 days. All images must be fully attributed as they are in the newspaper. The newspaper publisher uses articles syndicated by other newspapers and news agencies and cannot grant permission for those items, to the extent of the third party rights. Articles should not be permanently cached.

NOTE FROM DANNY: Using existing systems, publishers privileged enough to be included in news search engines don't have their articles displayed. They have links to those articles displayed, along with a description, something that people do all over the web and is generally accepted as fair use. Specific search engines can be blocked, if that's the desire. Specific images can also be blocked. Publishers can require those reprinting their content to install blocks as well.

4.2     USE CASE B: BOOKS

Book Publisher B invites search engine operators X, Y and Z to index the full text of his latest college text books. The web site where the full text is stored should not be made visible to search engine clients. He wishes that search engine users can browse only 2 pages of a maths book, but 20 pages of a philosophy text book. Search engine users should be able to buy individual chapters for private use, at $5 and $3 per chapter respectively.

5       Business requirements

Although it will be an integral part of the ACAP project to further develop and confirm the business requirements of publishers for the operation of the framework, significant progress has already been made in identifying the high level business requirements against which any technical solution must be measured. In summary, the solution must be:

  • enabling not obstructive: facilitating normal business relationships, not interfering with them, while providing content owners with proper control over their content
  • flexible and extensible: the technical approach should not impose limitations on individual business relationships which might be agreed between content owners and search engine operators; and it should be compatible with different search technologies, so that it does not become rapidly obsolete
  • able to manage permissions associated with arbitrary levels of granularity of content: from a single digital object to a complete website, to many websites managed by the same content owner
  • universally applicable: the technical approach should initially be suitable for implementation by all text-based content industries, and so far as possible should be extensible to (or at the very least interoperable with) solutions adopted in other media
  • able to manage both generic and specific: able to express default terms which a content owner might choose to apply to any search engine operator and equally able to express the terms of a specific licence between an individual search engine operator and an individual content owner
  • as fully automated as possible: requiring human intervention only where this is essential to make decisions which cannot be made by machines
  • efficient: inexpensive to implement, by enabling seamless integration with electronic production processes and simple maintenance tools
  • open standards based: a pro-competitive development open to all, with the lowest possible barriers to entry for both content owners and search engine operators
  • based on existing technologies and existing infrastructure: wherever suitable solutions exist, we should adopt and (where necessary) extend them – not reinvent the wheel

The approach taken should also be capable of staged implementation – it should be possible for initial applications to be relatively simple, while providing the basis for seamless extension into more sophisticated permissions management.

Although the scope of the project is initially limited to the relationship between publishers and search engine operators, a framework which meets these requirements should be readily extensible to other business relationships (although details of implementation would not be the same in every case).

6       The Pilot Project

The ACAP pilot project is expected to last for around 12 months. In outline, it is anticipated that the project will:

  • confirm and prioritise the business and technical requirements with the widest possible constituency: agreement with all stakeholders is essential if the project is to succeed in the long term
  • agree which specific Use Cases should be implemented in the pilot phase of the project, starting with a relatively simple approach
  • develop the elements of the technical solution: it is anticipated that this will primarily involve the development of standards for policy expression, although it will also be necessary to develop the tools for the implementation of those standards
  • identify a suitable group of organisations willing and able to participate in the pilot project; it is currently anticipated that this could involve four or five publishers and one of the major search engines; participants will need to be in a position to dedicate technical and time resources to the project to enable it to succeed
  • pilot the standards and the tools, to prove the underlying concepts

In parallel with the development of the technical solution, a significant stream of project work will involve the development of a sustainable governance structure to manage and extend the standards (and any related technical services) which will be needed after the project phase of ACAP is complete.

To avoid duplication of effort, ACAP will also establish liaisons with relevant standards developments elsewhere. In particular, the project is already in contact with EDItEUR with respect to its development of ONIX for Licensing Terms; and, in view of the significance of identification issues, with the International DOI Foundation.

7       Next steps

It is anticipated that the project will be launched publicly in September 2006; there is a great deal to be achieved between now and then, and at launch it will be possible to be much more explicit about plans and expectations. However, it is very important that the publishing community as a whole is ready and willing to respond positively when the project is launched.

The feasibility study commissioned by WAN, EPC and ENPA concluded that this project is technically feasible – and indeed requires little in the way of genuinely new technology. Rather, it requires the integration and implementation of identification and metadata technologies that are already well understood. It is also possible to chart a developmental path which does not demand that every element of the framework must be in place before any of it can be usefully implemented.

However, this is not to suggest that everything will be simple, nor that it can be achieved without cost. A significant part of the project cost will have to be borne by those organisations that agree to participate in the pilot, in the development of their own systems; however, there will also be central costs, to which it is hoped that other publishers will be prepared to contribute.

If you have any questions about this project, or would simply like to express your support, please contact: info@the-acap.org

Posted by Danny Sullivan at 10:41 AM | Permalink

Google On How To Let Googlebot In, Keep Bad Bots Out

One of the things that came out of our Bot Obedience Course at SES San Jose last month was a wish that search engines somehow made it possible for site owners to know they were sending "trusted" or "certified" spiders. Now Google's suggested one way this can be done.

Those blocking rogue spiders through IP filtering run the risk that they might accidentally keep some of the "good" bots out. If you don't know all the Google IP addresses, there's a chance you might reject a Google spider accidentally. That might cause your pages to be dropped from Google.

How to verify Googlebot from Matt Cutts at the Official Google Webmaster Central Blog covers a suggested technique to avoid this. Basically, all Google spiders will report they are from the googlebot.com domain. So do a DNS lookup on the IP address. If it comes back as googlebot.com, then you're halfway there. Halfway? Yes, that's because people can lie about domain names. To avoid spoofers, you then have to look up the domain name you found to see if it matches the original IP range.
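For the tech-savvy, here is a rough sketch of that two-step check in Python. It is only an illustration under stated assumptions, not Google's own code: the function name verify_googlebot and the injectable lookup parameters are made up for this example, and it checks that the forward lookup returns the exact IP you started with, which is a stricter reading of "matches the original IP range."

```python
import socket


def verify_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes the reverse-then-forward DNS check.

    `reverse_lookup` and `forward_lookup` are hypothetical hooks so the
    logic can be exercised without touching real DNS; by default they
    use the standard library's socket functions.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    # Step 1: reverse DNS. The hostname must sit under googlebot.com.
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False
    host = host.lower().rstrip(".")
    if not host.endswith(".googlebot.com"):
        return False

    # Step 2: forward DNS. Anyone can fake a reverse record, so look the
    # name up again and make sure it points back to the same address.
    try:
        addresses = forward_lookup(host)
    except OSError:
        return False
    return ip in addresses
```

Calling verify_googlebot(ip) with no extra arguments performs live DNS lookups, so in practice you would probably want to cache the results rather than run the check on every request.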

The blog post explains more, and it's going to make the most sense to tech-savvy webmasters that are implementing some type of IP filtering or blocking already. Not doing that? Then don't worry about this -- it's not really for you.

Down the line, perhaps we'll see less tech-savvy solutions come up, for those sites getting slammed by bad bots but without IP filtering. But this is a great start for now.

Matt's also mentioned this on his personal blog, where people are commenting on the technique.

Posted by Danny Sullivan at 5:25 AM | Permalink

August 31, 2006

When Good Search Bots go Bad

Most people realize the importance of creating a search engine friendly site, but many don't take the final step of assuring that search engine spiders or bots can fully access the site. Even worse, they fail to block bots from non-public parts of the site, or don't recognize rogue bots that are crawling a site to steal content or for other nefarious reasons. In today's SearchDay article, The Taming of the Bots, guest writer Tony Wright has coverage of a recent SES panel where search marketers and representatives from search engines offered tips on managing bots, whether their intent is good or ill.

Posted by Chris Sherman at 11:39 AM | Permalink

Search Engines Handle No Index Inconsistently

Matt Cutts has a nice illustrated survey of how various major search engines deal with the meta noindex tag in Handling noindex meta tags. He finds inconsistency, with this being the summary:

  • Google doesn't show the page in any way
  • Ask doesn't show the page in any way
  • MSN shows a url reference and Cached link, but no snippet. Clicking the cached link doesn't return anything.
  • Yahoo! shows a url reference and Cached link, but no snippet. Clicking on the cached link returns the cached page.

Interestingly, if you use a robots.txt file to ban indexing, in that case Google DOES show the page in some ways. Matt acknowledges this, but it still raises the question why Google operates differently when the intent of both mechanisms (explained here) is the same. I've commented in his blog on the issue as follows:

Why would Google want to treat meta noindex and robots.txt differently? They are both intended to do the same thing: keep pages out of an index. The only reason we have two options is simply because some people can't set up robots.txt files for their sites, which might be within the domains of others. However technically they are implemented, it seems like they should be treated the same way.

My gut tells me most webmasters would prefer that all the search engines not list any pages that use either a robots.txt or meta noindex command.

From a user perspective, I think the technique of showing a link to a site if you can learn about it another way is fine, such as being listed in the Open Directory or from links on the public web to those sites.

The Yahoo implementation of meta noindex is odd: why show a cached page? But I can see a hole here. They might not be actually indexing the page but still caching it since the specific noarchive tag isn't also being used:

Sounds like summit time! Not only would a standard on how meta robots and robots.txt are handled be handy, but it would also be nice to know if blocking a page also inherently blocks caching.
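To make the noindex versus noarchive distinction concrete, here is a small Python sketch that pulls the directives out of a page's meta robots tag. It is only an illustration of what a page declares, not of how any search engine actually processes the tag; the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of meta robots directives declared in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex"></head></html>'
directives = robots_directives(page)
# This page asks not to be indexed but says nothing about caching.
print("noindex" in directives, "noarchive" in directives)
```

A page that wants to rule out both indexing and caching explicitly would list both directives, for example content="noindex, noarchive".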

A summit, or consistent standards, is something the first person commenting on Matt's blog is calling for. If it happens, perhaps it could also be extended to feeds. Ask.com & Bloglines Proposes Blog Search Exclusion Tag from us earlier this month covers a proposed standard from Ask.com. The robots are coming! The robots are coming! over at SEOmoz gives some brief examples of why this might be useful.

Matt's blog already has a good discussion going on this topic, so if you have thoughts and ideas, add more over there.

Posted by Danny Sullivan at 9:39 AM | Permalink

August 2, 2006

Ask.com & Bloglines Proposes Blog Search Exclusion Tag

TechCrunch reports that Ask.com & Bloglines have released a new tag that can be added to your feeds, named access:restriction. The tag will tell Ask.com and Bloglines that this content is private and you do not want it included in the Ask.com or Bloglines blog search engines. The goal is for other RSS search engines, from blogs, news, pictures, movies and so on, to make this a standard. More details at Bloglines.

Posted by Barry Schwartz at 10:28 AM | Permalink

August 1, 2006

MSN Assigns Names To Vertical Search Crawlers

I covered news at my blog this morning that MSN has assigned names to all their robots or crawlers. When MSN Search first launched, they had one robot, named msnbot. MSNbot did the work of all, from normal web search to image search to news. Now, MSN has clarified the roles and assigned names to each robot.

The MSN Shopping bot is msnbot-products, the MSN News bot is msnbot-news, the MSN Image Search bot is msnbot-media and the MSN Search bot is still msnbot. This is important for SEOs: you can now define in your robots.txt file whether you want msnbot-media to index your images or not.
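As a rough sketch of what that looks like, the Python below feeds a sample robots.txt to the standard library's urllib.robotparser and asks what each crawler name may fetch. The /images/ path and example.com URLs are made up, and the results reflect how Python's parser reads the file, which is not necessarily how MSN's crawlers match user-agent names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the image crawler out of /images/
# while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: msnbot-media
Disallow: /images/
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("msnbot", "msnbot-media"):
    allowed = can_crawl(ROBOTS_TXT, bot, "http://www.example.com/images/logo.gif")
    print(bot, "allowed" if allowed else "blocked")
```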

Posted by Barry Schwartz at 9:53 AM | Permalink

July 21, 2006

Site Diagnostics Tab Added to Google AdSense Console

Google has added a new tab, a tab they have been beta testing for a couple months, named Site Diagnostics. What this tool does is show you which pages the AdSense crawler is having problems getting to. Why would the crawler have a problem getting to those pages? The several possible reasons include a robots.txt file blocking them, password protected pages, server down or slow and other reasons explained in the AdSense help pages.

I have posted screen captures at the Search Engine Roundtable.

Posted by Barry Schwartz at 8:37 AM | Permalink

July 18, 2006

AFP Content Still In Google News, Probably Via AFP's Own Partners

"Despite suit, Google News still indexing AFP content" from IDG News Service covers Agence France-Presse content still appearing in Google News after the company said last year that it would no longer carry AFP content, following a copyright infringement lawsuit. The problem seems to be that AFP content is distributed by other publishers, such as the New York Times.

There's no foolproof way for Google to flag these articles as AFP content and thus remove them. Honestly, it's down to AFP itself to teach its distributors how to use the meta robots tag to flag this content as not to be indexed.

Then again, I'm sure that over time, the situation will resolve itself. After all, if AFP is stupid enough not to understand the value of search traffic, smarter publications that do understand this like the New York Times itself will overtake it as people turn to them for content online.

Posted by Danny Sullivan at 5:18 AM | Permalink

June 29, 2006

MSN Search Requires Unique Robots.txt Files In HTTPS Vs. HTTP Cases

A quick note from the MSN Search Blog on secure content: https and http are considered separate "from a robots.txt perspective and requires its own robots.txt file." Nothing more to add here, but thought some of you would like to know this bit of information.
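Put another way, the robots.txt file that governs a page is the one at the root of that page's own scheme and host. Here is a minimal Python sketch of that lookup; the function name and the example.com URLs are just for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the scheme and host of `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/catalog/item?id=1"))
print(robots_txt_url("https://www.example.com/account/login"))
```

The two calls return different URLs, so a site serving both http and https needs a robots.txt file available at each.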

Posted by Barry Schwartz at 9:10 AM | Permalink

June 26, 2006

Follow-Up: School Couldn't Reach Google Until Injunction Filed

Catawba County Schools in North Carolina obtained an injunction to remove private material from Google because it had no luck getting action from the search engine after trying other routes, the district tells me. The school district also stressed that it didn't claim that Google had somehow hacked into its servers. Here's what Catawba County School's chief technology officer Judith Ray emailed me about the situation:

We asserted that Google had somehow bypassed our login information, not that they had hacked their way into the system. Hacking, to me assumes malicious intent and we never intended to imply that Google was doing anything other than spidering all the web sites available.

There is also miscommunication about "all users" being required to log in. The DocuShare server is a repository for both public and private information with logins being required for users who are authorized to view the restricted information. There are hundreds of pages of information that we share from DocuShare with users around the state. These are completely open and are not supposed to [be] password protected.

We did troubleshoot this situation by searching for the students' information at Yahoo, Dogpile, and AltaVista. We did not find any information on these three search engine returns and we attempted the searches over a three-day period.

We acted so aggressively with Google because, until the media got involved, we could not get beyond an operator at Google. We could not get operators to connect us with technical support, the legal department, or to anyone higher up in the organization. We were only given an email address to which we could submit a complaint - which we did but got no response. Google has a link to submit an emergency request [see here] but on both Thursday and Friday of last week, the link took you to a dead page. Only when the news media submitted its own inquiry to Google did we get a call regarding the situation. And [Google] has been most helpful in working through this situation with us.

Of course, none of us who are employed with Catawba County Schools at the current time were involved when Xerox set up this server. We are trying to ascertain if the server was incorrectly setup/protected or if the appropriate include meta tags or strings were not included.

Google Blamed For Indexing Student Test Scores & Social Security Numbers from us earlier has more background on the injunction plus how I was finding pages from what the district said was a password protected area to still be available through Yahoo. As clarified above, some of these pages indeed didn't require a login to view.

Our story originally was headlined "Google Blamed For Hacking & Indexing Students Test Scores & Social Security Numbers" and said in one part, "the school [district] blames Google for some how breaking into a password protected area and indexing the content."

As stated above, the school district itself never appears to have said anything about being hacked, only that Google somehow got into information it believed was password protected, as it says on the home page of the district site:

We do not know how Google was able to access the secure, password-protected site. Once Google does access a site, it places a copy of the data on its own server. We immediately called and emailed Google, requesting the urgent removal of the link and site data. We have eliminated the link from our end and it appears that as of Friday night, June 23, 2006, Google eliminated the site from their end.

The hacking reference seems to come from the "Google 'hacked our website'" story at The Inquirer, which we linked to in our original story. While the headline says "hacked" in quotes, the story itself doesn't have anyone from the school district saying this.

Digg also has a School claimed google hacked it's private servers and then posted that data article. Again, the school district isn't alleging hacking, only that Google somehow got into information it believed was restricted. How that happened is still being investigated.

As for the reference to Xerox in the school district's explanation, in doing some investigating in our original piece, I noted that the server seemed to be managed by Xerox and shared by other companies as well, with material for those companies appearing to be hosted on the school district's domain. As noted, the school district doesn't know why this was happening, and it remains something they are looking at.

Finally, Google's had problems with the automated page removal tool before, though the problem then was not that it was down but that it allowed people to remove pages from sites they didn't own. More on that in our 2004 story, Google Confirms Automated Page Removal Bug.

Posted by Danny Sullivan at 1:35 PM | Permalink

June 15, 2006

SEO for All the News That's Fit to Search

The New York Times has one of the most popular news web sites, but until this year that was largely because of the strength of its brand. After its acquisition of About.com, the Times embarked on an aggressive campaign to make its web site more search friendly, a complex process that's paid off with notable traffic gains for the company. Today's SearchDay article, Getting The New York Times More Search Engine Friendly, takes a look behind the scenes at how the Times and its vice president of enterprise search, Marshall Simmonds, pulled it off.

Posted by Chris Sherman at 5:48 AM | Permalink

June 14, 2006

Google Not Obeying NoIndex Meta Tag?

I reported at the Search Engine Roundtable that Google.com Displaying Pages in Index with NoIndex Meta Tags. The details come from a WebmasterWorld thread where two members I would trust claim Google is not obeying the noindex meta tag. Currently, I have no evidence, since examples are not allowed at WebmasterWorld. If you have examples of this in action, please let us know by starting a thread in our Google Web Search Forum at Search Engine Watch Forums.

Posted by Barry Schwartz at 9:59 AM | Permalink

May 26, 2006

Google AdsBot Now Coming To Assess Your Landing Pages, Will Impact Your AdRank

Google's rolling out a new system where ad landing pages will be automatically spidered by a new AdsBot. The content of landing pages will help determine the quality of an ad campaign. That quality score, along with the amount you are willing to pay, is then used to determine an ad's AdRank, the position where an ad will appear in the results. A high quality score means you can rank higher even if you pay less than others. And not participating in the new spidering system can hurt your AdRank.
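To see how a better quality score can beat a bigger bid, here is a toy Python calculation. It assumes AdRank is simply the bid multiplied by the quality score; that formula, the advertiser names and the numbers are all assumptions for illustration, since the exact way the two factors are combined isn't spelled out here.

```python
def ad_rank(max_bid, quality_score):
    """Toy AdRank: assumes rank is simply bid times quality score."""
    return max_bid * quality_score


def rank_ads(ads):
    """Sort (name, max_bid, quality_score) tuples, best AdRank first."""
    return sorted(ads, key=lambda ad: ad_rank(ad[1], ad[2]), reverse=True)


# Hypothetical advertisers: A bids more, B has the better quality score.
ads = [
    ("Advertiser A", 1.00, 4),
    ("Advertiser B", 0.60, 8),
]

for name, bid, quality in rank_ads(ads):
    print(name, bid, quality)
```

Under that assumption, Advertiser B takes the top spot despite the lower bid, because its higher quality score more than makes up the difference.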

What's the deal? Didn't Google already spider landing pages as part of the announcement back in December that landing page content would be assessed? To my understanding from Google, only if the AdSense spider had seen the page for ad content placement purposes or if regular Googlebot had already indexed the page for inclusion in the web search index. If the page wasn't already visible to these or perhaps some other Google spiders, or had been specifically blocked from spidering, then AdWords couldn't assess it.

Sometime in the coming weeks, a new AdsBot crawler will be grabbing all landing pages independently of AdSense, Googlebot or other Google spiders. Can you still block being spidered? Yes. But if you do so, Google AdWords will consider you a "non-participating advertiser" in the review process. As a result, you'll take a ding on your overall AdWords quality score.

From new information about the change:

While you can exclude your site from review, this will provide us with little information about your landing page's quality and relevance. Therefore, if you restrict AdWords from visiting your landing pages, you will experience a drop in Quality Scores for your related keywords. (This will cause higher minimum bid requirements for any landing page for which you've restricted access.)

That page also explains how to block AdsBot from getting your pages, how the visits won't cost you money even though AdsBot is following your ad links and how blocking or allowing AdsBot to your site will have no impact on what Googlebot thinks about it in terms of ranking it for free, organic results.

For Search Engine Watch members, the longer version of this article covers more on the change from my talk with Google during a visit there last week, such as how it is designed to improve relevancy and ease concerns that users (rather than advertisers) might be harmed by search arbitrage.

Want to comment or discuss? Visit our Search Engine Watch Forum thread, AdWords To Begin Crawling Landing Pages & Analyzing For AdRank.

Posted by Danny Sullivan at 7:21 AM | Permalink

April 4, 2006

Stupid Newspaper Publishers, Search Engines Are Your Friends

I left newspaper reporting about ten years ago because it was clear the industry had no idea how to transition to an online world, and I didn't want to be stuck behind. Today's Chicago Tribune article, Papers, Web sites in scrape on stories, just tells me things don't appear to have improved much. Search engines, including Google, get a fresh dose of being leeches for using content. Except, publishers, they don't reprint your content. They reprint summaries and link to your articles. And if you'd get a clue, you'd understand that brings you traffic, which should make you money.

Don't like it? Then slap up a robots.txt file to ban the news search engines and leave the traffic for the rest of us. The story's not all bad news. Some publishers are waking up to search and figuring out how to deal with it. For another example of the search engines as menace to newspapers concept, see World Association Of Newspapers Dislikes Search Engine Exploitation, Clueless About Robots.txt Banning from February.

Posted by Danny Sullivan at 7:56 AM | Permalink

March 10, 2006

Didn't Get That Wedding Gift? Blame The Search Engines!

Search engines appear to be indexing gift registries and ranking them well enough to be found by unsuspecting online consumers. This causes the registry to show that people have ordered an item or two that the bride and groom have never received. I wrote up the long-winded story here but let me explain in short.

  1. I placed a link to my wedding registries from my wedding site.
  2. Search engine found the link and indexed my gift registry pages.
  3. Unsuspecting consumer searches for product ABC in a search engine.
  4. Searcher is presented with several options including my registry.
  5. They click over to my registry and find the product.
  6. They order the product and ship it to their home.

The unsuspecting consumer has no idea that they ordered it from my registry; they just know they ordered it from XYZ Store.com. However, the registry now shows that the items that the unsuspecting customers purchased have been fulfilled and the "wedding party" no longer needs them. But we do need them! Andy Beal says that these are not "fraudulent" orders, as I called them here but rather "unfortunate" orders. GrayWolf adds that he is aware of several wedding registry scams that we should be aware of.

Posted by Barry Schwartz at 8:27 AM | Permalink

February 8, 2006

IAB Releases Updated Spider List

Spider/Bot Chase Goes Trans-Continental at ClickZ and IAB Updates Bot and Spider List at MediaPost both cover how the Interactive Advertising Bureau has released a new list of spiders, part of its long-standing effort to help advertisers filter out non-human visits. The list is to be maintained going forward in conjunction with ABC Electronic in the UK. It covers spiders known to visit sites in the US and UK. More in this IAB press release. Want the list? You have to be an IAB subscriber. More info on the list is here.

Posted by Danny Sullivan at 11:05 AM | Permalink

February 6, 2006

Google Launches Robots.txt File Checker; Now We Need Robots.txt Standardization

Very nice. Wondering how a search engine will process your robots.txt file? Google now provides a way to check on that through the Google Sitemaps program. More stats and analysis of robots.txt files from the official Inside Google Sitemaps blog explains more.

For Search Engine Watch members, the longer version of this article gives a real life example of how nice the checker is in action.

Overall, I'm thrilled with the new tool. I'd like to see the other search engines add similar ones. Even better, I'd like to see them all come together on creating an enhanced and more standardized robots.txt standard. Consider:

Postscript: Matt Cutts from Google has some good comments over here, pointing out Google also has an allow command (I've updated my list above) and further in comments to the post, explaining why they don't support crawl-delay yet because of concerns it might be set too low by mistake by some webmasters.
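If you want a rough feel for what such a checker does, Python's standard urllib.robotparser can read a robots.txt file and report what it makes of the allow and crawl-delay commands. This is a sketch only: the sample file, the examplebot name and the check_robots helper are invented, and Python's parser applies the first rule that matches, which is not necessarily how Google or any other search engine resolves allow against disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using the allow and crawl-delay extensions.
# Allow comes first because Python's parser uses the first matching rule.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Allow: /private/press-release.html
Disallow: /private/
"""


def check_robots(robots_txt, user_agent, urls):
    """Report the crawl delay and per-URL access for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "crawl_delay": parser.crawl_delay(user_agent),
        "allowed": {url: parser.can_fetch(user_agent, url) for url in urls},
    }


report = check_robots(
    SAMPLE_ROBOTS_TXT,
    "examplebot",
    [
        "http://www.example.com/index.html",
        "http://www.example.com/private/press-release.html",
        "http://www.example.com/private/secret.html",
    ],
)
for url, allowed in report["allowed"].items():
    print("allowed" if allowed else "blocked", url)
```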

Posted by Danny Sullivan at 8:08 PM | Permalink

February 2, 2006

Oops, Specs for Dell Computers Found in Google Cache

The News.com story: How to evade Google search, reports that once again a company, in this case Dell, has learned the hard way that what's put on a public web server is open to crawling, caching, and discovery.

Specifications for future Dell notebooks were accessible via Google's search site before the content was pulled from a Dell file transfer protocol site and from Google's cache.

It's very likely, almost a given, that most of you know about keeping content from being crawled and/or cached using robots.txt or one of many other methods. If you don't or need a quick review, one of my favorite info compilations about robots.txt comes via SearchTools.com.

It's very possible that this article will reach many people who have little to no idea about how crawlers operate and how to keep content out of Google.

The article would have been more useful if it stressed that this is a webmaster and web-wide issue and not a Google issue. Every webmaster who places content on publicly accessible servers should have a basic understanding of how web crawlers work and that many large engines (and even some verticals) cache content.

Google is the most widely used web engine but the webmaster who only focuses their attention on Google might not realize that the searcher who knows about cached content, and then goes looking for it, will know about many other web caches.

In other words, keeping content only out of Google doesn't mean it's not accessible elsewhere and off the web. SEOs know this to be true but I often wonder about others.

Postscript: I noticed that this News.com article about what the Dell notebook specs contained does point out (at the very end) that the material was also cached by Yahoo.

Posted by Gary Price at 1:16 AM | Permalink

February 1, 2006

World Association Of Newspapers Dislikes Search Engine Exploitation, Clueless About Robots.txt Banning

Newspapers want search engines to pay over at News.com covers the World Association Of Newspapers planning to challenge the "exploitation of content" by search engines. Apparently search engines are taking newspaper content for free and repackaging it within things like Google News and Yahoo News. A task force to study the issue is being formed, DMNews reports in Newspaper Group Questions Aggregation of News Content. Reuters also has coverage here.

Hey WAN. Don't like being in search engines? Tell your members to put up a robots.txt file to block the search engines, and they'll be happy to drop them. When they do, then blogs and other news sources can have the traffic the search engines were previously sending to your members.

FYI, I'm trying to finish a rundown on what the New York Times has been doing recently to gain search engine traffic. Watch for that soon. In the meantime, see this past post about what Marshall Simmonds did for About.com and is now doing for the NYT.

Posted by Danny Sullivan at 9:46 AM | Permalink

January 26, 2006

Google & Search Engine Cached Pages Legal, If They Offer Opt-Out

Via Boing Boing and News.com, interesting news that a case saying the Google cache violates copyright has been ruled in Google's favor. Since Google makes it possible to prevent it from showing a cached page, the court ruled the publisher should have used that. In short, if you don't block caching, Google and other search engines have an "implied license" to reproduce your material. More in the court documents on the EFF site here (PDF format), and an EFF write up is here.

Postscript: Caching Made Legal - Do You Agree? I Don't! at the Search Engine Watch Forums has more analysis of this by me and some of the major concerns it raises. Read more or comment yourself over there. There's also excellent discussion at WebmasterWorld here.

Posted by Danny Sullivan at 7:57 AM | Permalink

January 13, 2006

Show Me the Content: Web Search, Verticals, and Metasearch

Putting the Screws to Google, by Jon Fine from BusinessWeek offers a look at how, "old media could take back its share of search's ad bounty." So, in a sense it's not only putting it to Google but to Yahoo, Ask and other general purpose web engines. Of course, the word Google in a headline gets people to look.

It's an interesting read. How would these "old media" players do it? Fine offers an example of Walt Disney, News Corp., NBC Universal, and The New York Times, joining together to form a "Content Consortium" that offers a search engine containing content that, "no outside search engines can access."

Of course, Google is well aware of proprietary content issues that Fine raises. If you look at the "Risks Related to Our Business and Industry" section of many of Google's SEC filings (including their IPO filing) you'll read:

Proprietary document formats may limit the effectiveness of our search technology by preventing our technology from accessing the content of documents in such formats which could limit the effectiveness of our products and services. A large amount of information on the Internet is provided in proprietary document formats such as Microsoft Word. The providers of the software application used to create these documents could engineer the document format to prevent or interfere with our ability to access the document contents with our search technology. This would mean that the document contents would not be included in our search results even if the contents were directly relevant to a search. These types of activities could assist our competitors or diminish the value of our search results. The software providers may also seek to require us to pay them royalties in exchange for giving us the ability to search documents in their format. If the software provider also competes with us in the search business, they may give their search technology a preferential ability to search documents in their proprietary format. Any of these results could harm our brand and our operating results.

From the BusinessWeek article: "For the life of me, I can't imagine why they haven't done it," says Tom Curley, CEO of Associated Press. Here's one reason: Doing it would require spinal implants for intimidated media barons. But the notion that some pushback is pending is not far-fetched. Curley says he is talking with potential partners about setting up subject-specific Web packages -- say, for travel or basketball -- that will include content from multiple media. Once partners are on board and packages are finalized, search engines will be invited to bid for that traffic.

So the AP might be getting into the vertical search business, interesting.

For a long time I've said verticals will continue to grow in popularity and importance, as metasearch tools (which are getting better all the time) allow various database and content publishers to offer material (free or fee) to end users, who will select these databases at the time of their search based on their information need. Of course, database selection tools that incorporate personalization, social networks, etc. will also be available to assist users in making these decisions.

The metasearch tool could be sponsored and/or have contextually based advertising included as a part of it.

Fee-based content could be made available for free if, for example, the user would view a certain number of ads over a given period of time. Marketers could also sponsor access to databases with fee-based content. For example, Kayak or Expedia might sponsor access to a database containing digitized travel books and videos.

Smaller but focused databases can potentially offer more precise results (higher precision, lower recall). Don't forget that for many web searchers, the Invisible or Deep Web is everything beyond the first six or seven results. Advanced searchers might also benefit from a unified interface versus numerous interfaces and syntaxes. Training sure would be easier.

In many respects, what I'm talking about (in concept not content) has been around for years with services like Dialog and LexisNexis. For example, Dialog offers access to over 1000 databases with many coming from various database producers. I often describe it as a supermarket of databases with a common syntax. Users select various databases depending on their information need.

Another example. I've written numerous times about the many full-text databases (available for free, without going to the library, for personal use). Well, the San Francisco Public Library offers searchable access to many of these databases using a single interface. They call it a cross-database search. Instead of having to go to 20 databases and then search each one, you can pick and choose databases depending on what you're looking for. Articles? Reference answers? Images? Directory info? Business? Local?

The SF Public Library is hardly the only organization offering this type of service. Cross-database searching (aka federated searching or metasearching) is a hot topic these days. In fact, NISO, the National Information Standards Organization, has a large initiative in developing metasearch standards.

Postscript: Cold North Wind is another company involved in large newspaper digitization projects. Their PaperofRecord.com site is their public database where you can actually see what they have digitized to this point.

Posted by Gary Price at 2:03 PM | Permalink

January 9, 2006

Search Engines As Leeches, The Difference Between Paid & Free Listings & Keyword Price Rises

Jakob Nielsen's just posted a Search Engines as Leeches on the Web article that makes a good point: don't be too search engine dependent. However, he muddles his point by confusing the issue of paid search advertising and free "organic" listings. A closer look at that, plus how "super conversion trackers" and "brand idiots" are likely to keep pressure on keyword prices.

As a reminder, the major search engines give you two main types of listings when you do a search. There are the "organic" or "free" or "natural" listings that they gather from crawling the web. They don't charge for these listings (though Yahoo's paid inclusion program kind of clouds the water over there). These listings are like the editorial content you get at newspapers.

Search engines also carry paid ads. Pay, and you can get listed for terms you want without hoping that it just happens naturally.

Jakob says:

I worry that search engines are sucking out too much of the Web's value, acting as leeches on companies that create the very source materials the search engines index.

That will resonate with many who have long voiced similar concerns that search engines are making tons of money by gathering "content" from sites from across the web to make their listings.

If suddenly every site on the web were to block Google from indexing them, Google would have a crisis in short order. Its main "content" would have gone away, and the ads alone aren't going to keep attracting searchers.

Web site owners have not done that, however. That's because by and large, they've found that search engines drive more traffic to them than they cost in terms of bandwidth of being indexed.

WebmasterWorld has become a classic case study of this. Google and other search engines were banned in November along with "rogue" spiders, because somewhat similar to Jakob's "leech" metaphor, they were seen to have been sucking down more bandwidth than it was worth supporting.

WebmasterWorld founder Brett Tabke was often quoted saying he had the best sleep in months after blocking the spiders. His sleep may have improved, but what to do about the major spiders didn't go away. By the end of December, Brett had done a 180 degree turn and let the major spiders back in.

Until now, WebmasterWorld's been about the only major site I can think of that has tried to block spiders. Craigslist was rumored to have done so, but that wasn't true.

I do believe concerns over spidering are growing, especially as we have more spidering from both the major search engines and from rogues that are out there. Back in October, The lie of distribution--search engines return very little value to news/blog sites yet hog bandwidth and increase server loads from Tom Foremski was an example of this.

As I commented on his blog, it's fair to say that despite grumbles, the vast majority of site owners do not consider search engines leeches. If they did, they would deleech themselves by blocking spiders. It's not hard to do. A simple change to the robots.txt file will block all the major search spiders. But no one does this, because they want the traffic. Even Jakob's own file isn't blocking Google and gang.
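To give a sense of how small that robots.txt change is, here is a minimal sketch using Python's standard `urllib.robotparser` module. The two-line file contents and the example.com URL are assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every compliant spider to stay out entirely.
BLOCK_ALL_LINES = [
    "User-agent: *",
    "Disallow: /",
]

def is_allowed(robots_lines, user_agent, url):
    """Return True if a spider honoring robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

# A polite spider such as Googlebot would skip every page on the site.
blocked = not is_allowed(BLOCK_ALL_LINES, "Googlebot", "http://www.example.com/page.html")
```

Only spiders that choose to honor the file are affected, which is why a block like this costs a site its search listings without touching anyone who ignores the rules.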

But back to Jakob's point, it turns out he really isn't talking about the "source material" being leeched but instead about the high cost of advertising. Again to his opening statement, with the key part in bold.

I worry that search engines are sucking out too much of the Web's value, acting as leeches on companies that create the very source materials the search engines index.

And the evidence of this?

Paid search confiscates too much of a website's value.

What? Paid search "confiscates" a site's value? Since when did search engines suddenly show up at a web site and demand the owner sign up for advertising? We've long had rumors that a site that doesn't advertise might find itself banned with various major search engines. We've even had reports of "monetization targeting" where site owners have found that doing an ad or paid inclusion buy might clear up a spam banning problem. But by and large, there are plenty of web sites that spend nil with search engines on advertising and get plenty of traffic.

In fact, the exhausting, annoying, tiring, boring, you name it regular updates to Google generate plenty of forum fodder showing that people aren't spending yet are still getting traffic from search engines. If they weren't relying on that free traffic, they wouldn't be freaking out any time Google undergoes a major algorithm change that sends rankings dancing and, for some, traffic plunging. They wouldn't worry, because they'd have a balance of paid and natural search listings to help them ride out the rough times if there was an issue on the natural side.

Instead, the October 2005 "Jagger" update showed plenty of site owners are still dependent on getting traffic from search engines for free. The Nov. 2003 Google Florida update should have taught many not to be free listing dependent, but clearly they remain this way. And the lessons not to be dependent were in place even before this.

So overall, the issue doesn't involve free listings. Jakob's really concerned about the rising cost in search advertising. Over time, as he's worked with client sites, they've been able to pay more by pushing up conversion rates. But at some point, others catch up and the limit of what his clients can pay is reached. He says:

If your search bid stays the same, your ad will sink off the page as more and more competing sites improve their design enough to afford higher bids. Our site therefore has no choice but to increase its own bid to $7.99 per click if it wants to stay in business.

This simply isn't true unless Jakob's clients are making the mistake of depending solely upon paid search ads to gain customers. If that's the case, yep, you should be looking to diversify. More on this in a moment.

Jakob also says (and the bolding is his):

This is great news for search engines: they can double their income by doing nothing. Just sit and wait for all other websites to improve -- then skim off the increased earnings.

In other words, the search engines get to make more by doing nothing because the advertisers are learning they can afford to pay more. And they sound pretty evil. But rewrite it this way:

This is great news for the Super Bowl: the cost of buying a commercial keeps going up even though they do nothing. Just wait for advertisers to be willing to pay more -- then skim off the increased earnings.

Honestly, perspective like the above is very much in order. Consider:

  • Search advertising has long been undervalued. People still pay less than a dollar for some paid search ads and they obtain clients that will be with them for life. But few advertisers are calculating the lifetime value of search. Jakob doesn't appear to be considering this. The examples he gives cover only the purchases made directly from a click on an ad. Do his B2B customers come back to the site directly the next time and buy? If so, the only reason they did so was because they found the site through search in the first place. If you don't factor in that lifetime value, then you fail to understand how much you really can afford to pay.  
  • Advertisers are getting smarter. Those who better measure conversion know they can pay more, and they are. The search engines are to blame for this? They're simply seeing that their undervalued advertising medium is finally growing to what it really is worth. The real person to be upset with is that other advertiser. And the solution is to get smarter.  
  • It's an open marketplace. Is Apple a leech on consumers for overcharging for MP3 players that you can buy from others that work as well if not better but lack the Apple logo? The search engines aren't forcing people to buy ads. If advertisers can't afford to continue paying high prices, they won't. And when they don't, the prices will fall. The exceptions are if the search engines conspire in some way to force purchases (say they really do link buying ads to getting other types of listings) or when they set artificially high minimum bids (Google tried doing this but had to drop many because people weren't willing to pay).
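The lifetime value point in the first bullet can be made concrete with some back-of-the-envelope arithmetic. This is only a sketch: the conversion rate, profit per sale and number of repeat purchases are made-up figures, and the break-even formula assumes every later purchase traces back to the original search click.

```python
def max_affordable_cpc(conversion_rate, profit_per_sale, purchases_per_customer=1):
    """Break-even cost per click: expected profit earned from one click.

    All inputs are hypothetical figures supplied by the advertiser.
    """
    return conversion_rate * profit_per_sale * purchases_per_customer

# Hypothetical advertiser: 2% of clicks convert, $40 profit per sale.
first_sale_only = max_affordable_cpc(0.02, 40.0)  # counts the initial purchase only
with_lifetime = max_affordable_cpc(0.02, 40.0, purchases_per_customer=5)  # customer buys five times

print(f"Break-even bid, first sale only: ${first_sale_only:.2f}")
print(f"Break-even bid, lifetime value:  ${with_lifetime:.2f}")
```

Under these assumed numbers, the advertiser who counts repeat purchases can bid several times what the one counting only the first sale can, which is the gap the bullet describes.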

So rather than "despite search engines, websites can make money," as Jakob says, the reality remains that because of search engines, plenty of web sites are making money without spending a dime, pence, euro, yen or whatever on search advertising.

I completely agree that anyone running a web site should heed Jakob's "search engine liberation" advice of alternative ways to promote a web site, such as considering RSS, email newsletters and so on. But this isn't suddenly new advice. Any long-time internet marketer would tell you not to depend just on search engines. Thinking "beyond search engines" has been the core of my basic tips since I put them up back in 1997:

Search engines are a primary way people look for web sites, but they are not the only way. People also find sites through word-of-mouth, traditional advertising, the traditional media, newsgroup postings, web directories and links from other sites. Many times, these alternative forms are far more effective draws than are search engines. The audience you want may be visiting a site that you can partner with or reading a magazine that you've never informed of your site. Do the simple things to best make your site relevant to search engines, then concentrate on the other areas.

It's all a matter of balance. Don't obsess over search engine listings, but don't ignore them, either. Do a variety of online marketing activities -- and do a variety of offline ones, as well. Search -- both paid and free -- is a component of any campaign. But it isn't something you should depend on, any more than you should depend on all television advertising, all print ads, all RSS ads or a strategy of no advertising at all. If you are not diversified, you'll have a weakness that might hit when you least expect it.

To conclude, no one should put all their eggs into any basket, search or otherwise. It's absolutely true that search engines are not the end all be all and that sites can thrive and survive without them. But many sites also can thrive and survive better by incorporating them into a diversified publicity campaign.

Search engines definitely can do more to support those on the organic side of things, which is especially needed since webmasters do indeed provide the content that the search engines depend on. The good news is that last year, we saw more changes and developments to give webmasters new tools than ever before.

Finally, ad prices will likely continue to rise, and different advertisers will react in various ways. John Battelle recently pointed at a blog suggesting that FTD might be nearing all it can afford to spend. But just today, we reported on a survey showing four out of five advertisers saying they can still afford to pay more, though the question of whether a plateau is being reached is raised. Then the latest Fathom report on keyword prices saw a continuing "downward spiral," as MediaPost put it. I haven't looked closely at the latest numbers, but the sample is so small (500 terms) that I'm generally wary of depending on it as a foolproof predictor.

From my part, I see two main issues with keyword prices going forward: Super Conversion Trackers & Brand Idiots.

Super Conversion Trackers are those who will indeed track a lifetime value of someone who comes to them from search. They'll understand that the initial purchase may lead to more and more purchases over time and feel comfortable paying multiples above competitors to gain a lead. That will push some out of the bidding. See Most Conversions Happen Offline; You Need To Measure These! for some further thoughts on this.

Brand Idiots are what some marketers derisively call others who jump into bidding without linking it to a direct ROI target. They can screw up bidding on what seems to be a "logical" or "fair" amount that most in the marketplace may assume. But brand idiots are part of that marketplace, and you can expect to see more of them.

Automakers Buy Up 80% Of Ad Space On Car Sites For 2006 from AdAge is a good example of this. It explains how automakers are going more and more online to extend their brands. Edmunds, a car research site, expects to take those brand dollars and buy more search as well as display ads to fuel that desire. That's big brand money that's going to be fueling those buys and putting pressure on others trying to compete.

Big Guys Crowd Out Little Guys in SEM Arena; Some Branding Focused Advertisers Willing to Spend "Whatever" It Takes and C'mon In Brand Owners, The Search Water's Fine have more on these types of moves.

Stuck in a bidding war? How To Get Out Of Bid Wars A Winner? over at our Search Engine Watch Forums may have some helpful advice if you're already tracking and improving conversions as much as possible and getting some brand idiot money is not an option.

Want to comment or discuss? Visit our SEW Forums thread, Search Engine Leeches, Dependency & Losing Perspective.

Posted by Danny Sullivan at 1:27 PM | Permalink

January 4, 2006

Craigslist Not Blocking Major Crawlers

Craigslist Delists Millions of Pages from Search Engine Indexes over at the Search Engine Roundtable Forums gives the impression that Craigslist has embarked upon a new policy of blocking search engine spiders, but talking with Craigslist along with some further poking at the situation shows that's not the case. A summary of the situation below, and if you're a Search Engine Watch member, be sure to read the more detailed longer version of this post.

Avi Wilensky, who posted at the forums, assumed some new change must be in place when he couldn't find a real estate listing from Craigslist via a Google search that had brought up that listing only a few days before. Checking the Craigslist robots.txt file, he noticed that sections with listings about community, housing, for sale, services, gigs and jobs items seemed to be blocked.

At a quick glance, I could see why someone might assume that entire swaths of listings were being blocked. However, the listings themselves are not contained within these sections.

For example, here's the home page of the "blocked" housing area at Craigslist. The URL takes this form:

http://www.craigslist.org/hhh/

See the part in bold? Anything that begins with /hhh after the domain name is restricted by the Craigslist robots.txt file and not open to crawling by Google, Yahoo or others. So clearly all housing listings wouldn't be accessible! Wrong. That's because the listings within the housing section actually don't begin with the path of /hhh.

For example, here are the URLs for the first three listings shown on that housing area home page:

None of them begin with /hhh, as I've shown in bold, so all of them are fully open to being spidered.
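Here is a small sketch of the prefix matching at work, using Python's standard `urllib.robotparser`. The single `Disallow: /hhh` rule is inferred from the description above rather than copied from the Craigslist file, and the listing path is a made-up stand-in, since the actual listing URLs aren't reproduced here.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule, based on the description above; not the full Craigslist file.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /hhh",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

# The housing section home page begins with /hhh, so it is off limits.
section_allowed = parser.can_fetch("Googlebot", "http://www.craigslist.org/hhh/")

# Hypothetical listing path that does not begin with /hhh, so it stays crawlable.
listing_allowed = parser.can_fetch("Googlebot", "http://www.craigslist.org/example-listing/12345.html")
```

A disallow rule only matches paths that start with the given string, so blocking a section's table of contents says nothing about listings that live under a different path.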

Why block those specific table of content pages plus any pages below those particular sections? Craigslist chief executive Jim Buckmaster told me via email:

The URLs in question are sectional header links, which from a crawler standpoint represent a duplicate pathway to our listings, one which I understand from our tech team is disproportionately load-intensive when hit by crawlers.

Am I off the mark, and are millions of pages with Craigslist listings now gone? Not from a few checks. At Google, site:craigslist.org shows nearly 12 million pages are indexed from various Craigslist sites, such as sandiego.craigslist.com and charlotte.craigslist.com.

Here are 631 listings for rooms in the North Bay area of San Francisco, for example. Aside from anyone being able to check on this, Buckmaster himself wasn't aware of any reason that content should have gone missing from the major crawlers.

As noted earlier, if you're a Search Engine Watch member, there's a longer version of this post available to you. It goes into more depth of explaining what's in the current robots.txt file, how it has changed plus how while Craigslist does prohibit crawling by classified ad search engines through its terms of use, it still allows general purpose search engines such as Google and Yahoo to crawl freely.

Posted by Danny Sullivan at 8:55 AM | Permalink

December 22, 2005

WebmasterWorld Back Among The Spiderable

WebmasterWorld, which banned Google, Yahoo, MSN, Ask Jeeves and other search spiders last month, is now allowing them back in and has thus returned to the land of the living, in terms of being listed with search engines.

WebmasterWorld chief Brett Tabke gives his rundown on the situation in the site's robots.txt file, which he's now using as a blog. C'mon Brett -- you're posting good stuff in there beyond the whole robots thing. Put the material into proper web pages, if not an actual blog, so we can link to individual items.

Look close at that file, and you'll see that it seems to still ban all the robots. Now look here at what the robots.txt file tells you is the "real" robots.txt file. That's made real to the major search spiders through this code, which checks to see if a spider is reporting a useragent from any major search engines. If so, then a cloaked robots.txt file is sent to them.

Cloaked! Cloaked! You mean Google and gang are all anti-cloaking but they don't mind this cloaking? Apparently so, and not that surprising. The robots.txt file really isn't designed to be read by humans, though they can. So while technically this is another example of search engines allowing cloaking, it's more a footnote than a big exception as with things like Google Scholar.
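For the curious, the kind of useragent check described above might look something like the sketch below. This is not Brett's actual code: the function name, the two robots.txt bodies and the list of spider name fragments are all assumptions for illustration.

```python
# Hypothetical robots.txt bodies; the real WebmasterWorld files are not reproduced here.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Assumed name fragments reported by the major spiders (Google, Yahoo, MSN, Ask Jeeves).
MAJOR_SPIDER_TOKENS = ("googlebot", "slurp", "msnbot", "teoma")

def robots_txt_for(user_agent: str) -> str:
    """Pick which robots.txt body to serve, based only on the reported useragent."""
    reported = user_agent.lower()
    if any(token in reported for token in MAJOR_SPIDER_TOKENS):
        return ALLOW_ALL  # major spiders get the permissive file
    return BLOCK_ALL      # everyone else sees the ban-all file
```

Because the decision rests entirely on a string the visitor supplies, anyone who reports a major spider's useragent gets the permissive file too.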

Ah, but what about people who might visit WebmasterWorld while pretending to be one of the major spiders? How could you do that? Here, Greg Boser points you at one of many tools that let you do this.

Greg's pointing at that because last week, he found himself blocked from WebmasterWorld after surfing in there as if he was from Google. He wasn't alone in being caught by some detection stuff Brett's set up, and now he and others are back with access, as Greg explains. Found yourself in the same situation? Brett explains here to send a sticky mail to an admin to have access restored. I'm told from a good source that a number of Google folks found themselves locked out as well, because many of them use browsers that report the Google useragent.

What about the entire rogue spider thing? They were ignoring robots.txt in the first place. That's why, as I covered earlier, WebmasterWorld also set up required logins to block the spiders. My understanding is that the major search spiders are being excluded from this requirement, plus referring data is also being used to help prevent some people clicking from the search engines from getting a login request for the first two or three clicks.

WebmasterWorld Back In Google Index? has discussion at WebmasterWorld, WMW - the bots are back has discussion over at Threadwatch and WebmasterWorld Off Of Google & Others Due To Banning Spiders at our Search Engine Watch Forums has older discussion and is also a place where you can comment or discuss the latest developments.

Posted by Danny Sullivan at 11:34 AM | Permalink

November 28, 2005

WebmasterWorld's Brett Tabke Speaks On Rogue Spidering Woes, Plus The Need For Expanded Feeds

Brett Tabke from WebmasterWorld dropped me a note about a new thread where he's answering many questions about WebmasterWorld banning all spiders, while Barry over at Search Engine Roundtable also has an interview with him. In both places, you'll learn of spiders being an increasing burden to the site, though I still am very, very wary that others should follow the route that Brett's taken.

Attack of the Robots, Spiders, Crawlers.etc at WebmasterWorld picks up from the Lets try this for a month or three thread where Brett announced last week that WebmasterWorld was banning all spiders by excluding them via robots.txt and through other measures such as required logins.

WebmasterWorld Bans Spiders From Crawling and WebmasterWorld Out Of Google & MSN from the SEW Blog cover more about the fallout from the move, with WebmasterWorld no longer being visible in two major search engines.

In his latest posts, Brett explains:

  • The flat file nature of WebmasterWorld makes it apparently more vulnerable to spiders.  
  • Spider fighting has been taking a considerable and increasing amount of time.  
  • A ton of effort has gone into stopping spiders, but cookie-based login is still seen as necessary.  
  • Major search engines other than Google (Ask Jeeves, MSN and Yahoo) were all banned for more than 60 days before this latest move.

Brett Tabke Interviewed on Bot Banning from Search Engine Roundtable takes the interview approach, where it is much easier to see what Brett's thinking and reacting to than wandering through the forum posts. Beyond the points above, he addresses not wanting to make use of non-standard extensions to robots.txt that Google, MSN and some other search engines have added precisely because they aren't standard.

Overall, I can appreciate much of what Brett's going through, but there still have to be better ways for this to be addressed. His solution is simply not one that the vast majority of sites will want to try, because it will simply wipe out the valuable search traffic they gain.

To be clear, I'm NOT saying that any site should be entirely dependent on search traffic. But neither should you cut yourself off from it. It's a matter of balance and moderation. To quote from what I posted in our forum thread on the WebmasterWorld situation:

People would often ask how much of their traffic they get from search engines. There is no right answer, but I'd often said that if you were looking at 60, 70, 80 percent or higher, you might have a search engine dependency problem. You want to have a variety of sources sending you traffic, so no one single thing wipes you out.

But to suggest that a site is so successful that it doesn't need search traffic at all? That's foolishness. I have absolutely no doubt that WMW will survive. It's a healthy community with plenty of alternative traffic. But people seeking answers to things it has answers to give are no longer going to be finding it.

Hmm, well, maybe those people aren't good members, just generate noise and so on. Yeah, maybe. But that also assumes that every single quality person must be there already. That's just not so. You always have good new people coming onto the web.

Search engines are a way you build up loyal users. People often discover you for the first time through search, then they keep coming back. It's not a dependency to have a small amount of your traffic bringing in new people this way. But it is, in my view, a marketing screw-up to cut yourself off from that potential audience.

Geez, it's like the basic rule of SEO/SEM. Ensure your site is accessible to search engines. If they can't get in, you stand no chance of getting traffic at all from them. And when people are paying by the click for search traffic, why don't you want that free publicity? Why wouldn't you seek other ways of retaining it but also restricting the bad bandwidth you don't want?

Overall, WMW obviously can and will do what it wants, and perhaps there's some magical master plan that down the line will make us all say "Genius!" Maybe. But this is a very, very bad model for any site to be considering, if they're having the same spidering problems that are the stated reason for why WMW is doing this. It's like saying you're getting too many phone calls to your business, so you're going to pull out the phone entirely!

So what is a site owner to do, if they are suffering from rogue spiders? I'll share a bit from our own experience, plus point at what maybe the search engines should be doing.

We've encountered rogue spiders. It was one reason why our own Search Engine Watch Forums were down briefly last month, coincidentally the same time WebmasterWorld and Threadwatch went offline for different reasons. Rogue spiders aren't just something unique to Brett's set-up. They can and do indeed cause problems even for less "flat file" sites and URL structures. In fact, want to have some fun? Check this out. That shows you all the people on our SEW Forums at the moment you click on the link, up to 200 visitors. Scan the IP Address column, and you'll see how Yahoo's Slurp spider is in many, many different threads all at once. That's a burden on our server, though since we're getting well indexed as a result, it's a burden we live with.

Our own solution has been for our developers to throttle or ban spiders at the IP level that seem to be hitting us hard, in particular spiders that aren't identifying themselves as to their purpose. Good spiders often leave behind a URL string in your logs so you know they are from Google, Yahoo or whatever. For example, Yahoo points you here. Google points you here. No good identification? Then we don't worry that banning you is going to harm us seriously in some way.
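To sketch what IP-level throttling along those lines could look like (this is not our developers' actual setup), the class below counts hits per IP address in a sliding time window. Visitors that go over the limit are refused; those that also fail to identify themselves with a URL in the useragent are banned outright. The class name, limits and the "contains a URL" test are all assumptions for illustration.

```python
import time
from collections import defaultdict, deque

def identifies_itself(user_agent: str) -> bool:
    """Crude assumption: a well-behaved spider includes a URL about itself."""
    return "http://" in user_agent or "https://" in user_agent

class SpiderThrottle:
    def __init__(self, max_hits=30, window_seconds=60.0):
        self.max_hits = max_hits
        self.window = window_seconds
        self.hits = defaultdict(deque)  # IP address -> timestamps of recent requests
        self.banned = set()

    def allow(self, ip, user_agent, now=None):
        """Return True if this request should be served."""
        now = time.monotonic() if now is None else now
        if ip in self.banned:
            return False
        recent = self.hits[ip]
        recent.append(now)
        # Forget requests that have fallen out of the time window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        if len(recent) <= self.max_hits:
            return True
        # Over the limit: ban unidentified visitors, merely throttle identified ones.
        if not identifies_itself(user_agent):
            self.banned.add(ip)
        return False
```

An identified spider that trips the limit is simply refused until its request rate drops, while an anonymous one stays banned, which mirrors the reasoning that banning an unidentified spider is unlikely to hurt.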

What about improving the robots.txt system? Unfortunately, that's not a solution for rogue spiders. Brett's right when he points out the real story is moving to required logins. Rogue spiders aren't paying attention to robots.txt. Put in a ban against them, and they'll ignore it. Robots.txt only works with "polite" spiders.

Because robots.txt isn't a solution, it also means that wishing that the major search engines would come together to endorse new improved "standards" for the protocol also isn't a solution. Since rogue spiders are ignoring robots.txt, it doesn't then matter for there to be some type of universal agreement to have a "crawl delay" feature or more wildcard support, for example.
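For reference, here is roughly what honoring a "crawl delay" looks like from the side of a polite spider, sketched with Python's standard `urllib.robotparser` (its `crawl_delay` method is available in Python 3.6 and later). The file contents and the ExampleBot name are made up, and Crawl-delay is one of the non-standard extensions, so not every spider reads it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt using the non-standard Crawl-delay extension.
ROBOTS_LINES = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

# Seconds a polite spider should wait between requests, or None if unspecified.
delay = parser.crawl_delay("ExampleBot")

# The same spider also checks each URL before fetching it.
may_fetch_private = parser.can_fetch("ExampleBot", "http://www.example.com/private/page.html")
```

Everything here depends on the spider choosing to ask, which is the whole problem with rogues: they never make these calls in the first place.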

Still, while improving robots.txt isn't a solution to rogue spiders, there are things it could do if improved, and I'm right with Brett in wishing that the major search engines wouldn't unilaterally make their own improvements, as I've written before (and here).

So if we can't depend on robots.txt, what is the solution? If more and more sites face heavy spidering, we'll likely have to see a shift toward feeding content to search engines.

Feeding content isn't a new idea. Yahoo's paid inclusion program is pretty well known as a way for site owners to feed not URLs into the search engine but actual page content. Yahoo also has partnerships with some sites to take in content on a non-paid basis. Google also takes in feeds of content through things like Google Scholar or Google's Froogle shopping feeds program.

To be absolutely clear, these types of program aren't situations where you feed URLs, as with Google Sitemaps or Yahoo's bulk submit. These are programs where you feed actual page content. The spider doesn't come to you and hunt and guess at what you've got. You tell the spider what you've got.

Expanding feed programs to everyone would be a much more efficient way of gathering content, with one exception. You can expect that some sites will abuse feeds to send misleading content. Heck, it's bad enough how ping servers are already abused being wide open this way, as I wrote about on Matt Mullenweg's blog last month, when the future of ping servers was raised:

Whether we have an "independent" ping service almost seems beyond the point when both Dave and Matt are talking about the ping spam problem they have experienced. I'm actually surprised any of the open ping servers are surviving. If they are open to anyone to ping, a small number of people will abusively ping for marketing gains.

We've had 10 years of history knowing this with web search. Web search engines could long ago have had instant add facilities. Indeed, Infoseek and AltaVista even did for a short period of time. They found that without barriers, a small number of people would flood them with garbage. That's why they don't take content in rapidly. It's not that they aren't smart enough to take pings or let website owners flow content in. Instead, it is that they've learned you can't leave a wide door open like that without being abused.

There's absolutely no reason for anyone to have assumed that RSS/blog/feed search services were going to be immune to the same problem. If the ping outlook is bleak, it's not because Verisign or Yahoo has purchased some service. It's because you simply can't leave doors open on the web like this for search, not for any search that's going to attract significant traffic. Blog search is gaining that traffic, and you can expect the spam problem will simply get worse and worse until some barriers are put into place. You also cannot expect that you'll simply come up with some algorithmic way to stop ping spam. Again, 10 years of web search engines diligently trying to stop spam has simply found it's a never ending arms race.

I don't know what the solution is. I suspect that for the major search players, the Googles & Yahoos, they'll eventually move to a combination of rapid crawling, trusted pings and open pings as a backup. Remember, they get news content very fast. If they have a set of trusted sites, they can spider and hammer those hard. They'll know to keep checking Boing Boing, Scripting and maybe 1,000 other major blogs that really, really matter -- and that when you check them, you quickly discover other links from blogs you may want to fetch quickly.

So throwing feeds wide open to everyone without vetting isn't the solution. But certainly we're overdue for feeds to be available to more people without requiring payment, through some type of trusted mechanism.

WebmasterWorld is a perfect poster child for this. People want the content there, and the search engines should want the content to be found via their sites as well. Allowing the site to feed its content gets around the barriers erected to stop rogue spiders very nicely.

But WebmasterWorld isn't the only candidate in this class. Many others, including myself, want the ability to feed actual content to the search engines. Let's see them move ahead with a way to make this more a reality, to establish real "trusted feeds" that aren't based on payment or whether your site falls within an area that the business development teams think need more support. Google Base may become Google's means of doing this, but at the moment, that's not feeding into web search.

Want to comment or discuss? Visit our Search Engine Watch Forum thread, WebmasterWorld Off Of Google & Others Due To Banning Spiders.

Posted by Danny Sullivan at 4:16 PM | Permalink

November 23, 2005

WebmasterWorld Out Of Google & MSN

Well, that didn't take long. I wrote on Monday about how WebmasterWorld head Brett Tabke decided to ban all search spiders including those from the major search engines in an effort to combat bandwidth loss and server sluggishness due to rogue spiders. Brett figured he had about 60 days until he'd see pages get dropped. It took two.

As of this moment, site:webmasterworld.com at Google shows NO pages being listed from the site. Prior to the ban, about 2 million pages were listed. Oddly, Google's not even returning the site's home page using the listing out of the Open Directory.

In other words, on Monday, as I recall, a search for webmasterworld brought up the WebmasterWorld home page with a title and description like this:

Webmaster World Brett Tabke hosts professional webmaster and search engine promotion discussions.

That title and description was being pulled from the Open Directory listing as you'll find over here.

A search today for the same thing doesn't bring up the site at all. Yes, WebmasterWorld banned Google from spidering it. However, that doesn't prevent Google from listing at least the home page by making use of the Open Directory information, which doesn't require spidering the WebmasterWorld web site.

Interestingly, checking the Google Directory -- which is powered by the Open Directory -- there is no listing for WebmasterWorld in the same category where you'll see it at the Open Directory. It suggests that the robots.txt ban had the effect of pulling WebmasterWorld not only out of the Google web search results but out of the Google Directory listings as well. That would be entirely new; I don't recall hearing of it happening before.

I checked with Dave Naylor, who's been watching the situation, and he suspects that this is indicative of Google manually pulling everything about the site from Google.

Over at MSN, site:webmasterworld.com brings back one match, but since it lacks a title and description, this looks to be a listing of the WebmasterWorld home page based on the fact MSN sees links to it, rather than having crawled it. Google can and does do a similar thing, calling these "partially indexed" URLs. It's not doing that for WebmasterWorld, however.

To understand more about the entire situation of how a page that bans spiders might still appear, check out my The US White House & Blocking Search Engines page.

At Yahoo, site:webmasterworld.com shows 83,300 matches for me, which is steady from what I saw earlier this week.

Should the pages have dropped so quickly? With Google, things might have been helped along by the fact it has an automatic page removal system. Don't panic! It only works if a site has specifically put up a robots.txt file blocking Google. People can't just come along and remove your pages unless you yourself have installed such a robots.txt file.

Todd Freisen describes the system more in Blink And It's Gone, and he's at least one person who submitted the new WebmasterWorld robots.txt file to speed up the removal process. Todd's also been tracking page counts for the site in various search engines: WebmasterWorld Index Watch 3, WebmasterWorld Index Watch 2 and WebmasterWorld Index Watch.

Even if this hadn't happened (submission to the automated Google page removal tool), I still thought it was way overly optimistic to assume a popular site like WebmasterWorld would be allowed to retain pages after expressly banning spiders. MSN certainly has no automated page removal system, and it matched Google in dropping pages.

Want to discuss or comment? Members at WebmasterWorld are talking in this thread, and we also have discussion starting at our own Search Engine Watch Forums in WebmasterWorld Off Of Google & Others Due To Banning Spiders.

Posted by Danny Sullivan at 9:39 AM | Permalink

November 21, 2005

WebmasterWorld Bans Spiders From Crawling

Wow. Brett Tabke drops the nuclear bomb of banning all spiders from WebmasterWorld. He explains here that heavy rogue spidering is the reason behind the move. Members worry in the thread that as pages drop out of search engines, it will become difficult to impossible to find anything at WebmasterWorld, which self-admittedly lacks good site search.

Brett figures he's got 60 days until pages drop from places like Google to get an alternative search solution in place. That seems optimistic to me. WebmasterWorld is a prominent site and should be getting revisited on a sub-daily basis. If search engines are hitting that robots.txt ban repeatedly, they ought to be dropping those pages in short order, or they aren't very good search engines. I mean, can you imagine the irony of Google and Yahoo getting pilloried on WebmasterWorld for taking so long to drop pages after the ban told them to do so?
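For anyone unfamiliar with the mechanics, here's a minimal Python sketch of what a ban like this means to a well-behaved crawler. It assumes the ban takes the common blanket form (a wildcard user-agent with everything disallowed); it is not a copy of WebmasterWorld's actual file. The `can_crawl` helper and the example URL are my own illustrative names, and the parsing is done by `urllib.robotparser` from Python's standard library.

```python
from urllib.robotparser import RobotFileParser

# Assumed form of a "ban everyone" robots.txt (illustrative, not the real file).
BAN_ALL = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    home_page = "http://www.webmasterworld.com/"
    for bot in ("Googlebot", "msnbot", "Slurp"):
        print(bot, can_crawl(BAN_ALL, bot, home_page))
```

A polite spider runs a check like this before each fetch, which is why a frequently revisited site should have its ban noticed quickly. Rogue spiders simply skip the check, so the file by itself does nothing to stop them.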

A separate issue is the potential loss of search traffic. We have had the odd site from time to time declare that it might ban Google or Microsoft because of opposition to those companies, and we've certainly had companies ban all spiders for other reasons. But in one bold move, WebmasterWorld is suddenly about to become a giant test case for what happens to a site if it cuts itself off from the oxygen of search results -- an incredible irony when so many come to the site looking specifically for ways to gain more search traffic.

Realistically, any established site should be able to ride out having no search traffic at all. WebmasterWorld has plenty of people who will seek it out directly, plus referral links from other sites will keep traffic going and perhaps even growing. But search has been estimated to drive anywhere from 7 to 13 percent of new visitors to a web site, visitors who continue to come back after they arrive. I wouldn't want to roll the dice against losing them.

It'll be interesting to see if WebmasterWorld really sticks with this ban or seeks other ways of getting its content into the major search engines without spidering, such as via Google Base or Yahoo's paid inclusion programs, for example.

Posted by Danny Sullivan at 10:51 AM | Permalink

November 23, 2004

Keeping Secure When Using Robots.txt

Many know that the robots.txt file is a way to prevent search engines from spidering content. However, using it may also cause security issues for those who don't realize that anyone can see the file. You'll find some tips on keeping safe in our recent forum thread, Robots.txt & Security Issues.
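To make the risk concrete, here's a rough Python sketch of how little effort it takes to read the paths a robots.txt file advertises. The `disallowed_paths` helper and the sample file (including its directory names) are hypothetical, made up purely for illustration.

```python
# Hypothetical robots.txt; the directory names are invented for this example.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /private-reports/  # not linked from anywhere
Disallow:
"""


def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow value, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value".
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    for path in disallowed_paths(SAMPLE):
        print("Advertised to anyone who looks:", path)
```

The takeaway is that a Disallow line keeps polite spiders out but also tells curious humans exactly where to look, so anything genuinely sensitive needs real access control rather than a line in this file.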

Posted by Danny Sullivan at 12:41 PM | Permalink

November 16, 2004

New Robots.txt Tool

The robots.txt file allows you to prevent search engines from indexing documents on your web server, assuming they respect the convention. Most major search engines do. Ian McCanerin, one of our forum moderators, has posted a nice new tool to help you generate a robots.txt file. You can also discuss it in this forum thread.
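If you're curious what a generator like this does under the hood, the idea can be sketched in a few lines of Python. This is my own minimal illustration, not the code behind the tool mentioned above; `build_robots_txt` and the example paths are hypothetical.

```python
def build_robots_txt(rules: dict[str, list[str]]) -> str:
    """Build robots.txt text from a mapping of user-agent to disallowed paths.

    An empty path list produces an empty "Disallow:" line, which by
    convention means that agent may crawl everything.
    """
    blocks = []
    for user_agent, paths in rules.items():
        lines = [f"User-agent: {user_agent}"]
        if paths:
            lines.extend(f"Disallow: {path}" for path in paths)
        else:
            lines.append("Disallow:")
        blocks.append("\n".join(lines))
    # Records are separated by a blank line; the file ends with a newline.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical rules: shut out one named bot, fence off two folders for the rest.
    print(build_robots_txt({"BadBot": ["/"], "*": ["/cgi-bin/", "/tmp/"]}))
```

The output is plain text that you would save as robots.txt in the root of your site. As with any such file, it only works for spiders that respect the convention.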

Posted by Danny Sullivan at 7:10 AM | Permalink

October 17, 2004

Search Engine Watch Forum's 101 Threads

Last week, one of our most energetic forum moderators, Nacho Hernandez, started a thread called Search Engine Marketing 101. In it, he leads off with a variety of resources useful for those getting started with search engine marketing. Comments and further contributions follow.

Nacho also kicked off a theme. Orion, one of our newest moderators, followed up with Block Analysis 101. That looks at the concept of search engines breaking up a page into "blocks," to better understand which particular content or links within that content should be given greater or less weight.

Member Nick W's now dived in to look at the often controversial issue of cloaking: Cloaking 101 - Questions and Answers. Some previous good threads and debate on this topic include The Great Doorway Debate and How Do I Spot Cloaked Sites? You might also look over an article I did last year, Ending The Debate Over Cloaking.

Returning to Nacho, he's compiled a great list of Google Sandbox 101-style resources in Sandbox - IN or OUT? The sandbox concept relates to the idea that new pages, new links or new sites might not be allowed to do well in Google until a certain period of time has passed. The Filthy Linking Rich thread touches on this, as well.

Posted by Danny Sullivan at 11:24 AM | Permalink | Comments (0)
