Revisiting Hijacking & Redirects: Moving To A Solution

Cast your mind back to May, when I wrote about the Google AdSense page getting "hijacked" by another site for searches on adsense and google adsense. Why hadn't Google yet solved this problem, I asked, especially when it was something that competitor Yahoo seemed to have cracked?

I talked with Google about the issue in more depth a couple of days later. In short, they're aware that the issue needs to be solved, but they're looking for the best way forward and would like your help.

Moreover, it's something we're going to explore at the Indexing Summit 2 at the SES San Jose show next week, to see if there's an across-the-board solution for all the major search engines. So any advice and suggestions you have on the topic are welcome. More on how to contribute is below.

Redirection - From Old To New

To understand the issue, let's revisit the basics. What are redirects, and how are they supposed to operate?

It would be great if pages always had the same URL forever and ever, but that's often not the case. People get new domain names, reorganize their web sites or use a new content delivery system. Those are just some of many reasons that URLs can change.

Change your URL, and people -- or search engines -- can't find the page when they look in the old location. Redirection is the solution. With redirection, a site owner can automatically send requests for the page at the old location to the new one. Think of it as mail forwarding or call forwarding for web pages.

Redirection can be done through a meta refresh tag, but I'm not going to get into that here. Just understand that meta tag redirection is a page-based method of pointing from one place to another. In other words, it's something you put on the actual web page.

Instead, I'm going to focus on server-side redirection, where your web server does all the work of pointing from one place to another.

When your server does a redirection (or responds to anything), it sends out a little status code about the request it's processed. You know how when you try to reach a web site that's gone, and you sometimes get a "404 Not Found" error? That's a type of status code a server sends, which in turn may trigger the appearance of that "Not Found" page.
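To make the server-side mechanics concrete, here is a minimal sketch using Python's standard http.server module. The MOVED table, the lookup_redirect helper, the example.com URLs and the port number are all made up for illustration; a real site would normally configure redirects in its web server rather than write a handler like this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical table of old paths and the new URLs they moved to.
MOVED = {"/old-page.html": "http://www.example.com/new-page.html"}


def lookup_redirect(path):
    """Return (status_code, new_url) for a moved path, or None if unknown."""
    new_url = MOVED.get(path)
    return (301, new_url) if new_url else None


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        hit = lookup_redirect(self.path)
        if hit is None:
            # Nothing at this address and no forwarding entry: "404 Not Found".
            self.send_error(404)
            return
        status, new_url = hit
        # The status code says what kind of redirect this is;
        # the Location header says where the page now lives.
        self.send_response(status)
        self.send_header("Location", new_url)
        self.end_headers()


def serve(port=8000):
    """Start the sketch server (blocks until interrupted)."""
    HTTPServer(("", port), RedirectHandler).serve_forever()
```

Calling serve() and requesting /old-page.html should produce a 301 response whose Location header carries the new address, while any other path gets a 404.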

With redirection, there are two main codes that go out, corresponding to the type of redirection happening. These are 301 and 302, numbers you've probably heard if you've been reading about redirection and hijacking issues. Let's look further at what they represent and how things are supposed to go.
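For reference, the standard names attached to these codes can be looked up with Python's built-in http module. This is only a quick sketch, and reason_phrase is a helper name invented here. Note that the official phrases are not quite the casual labels people use for them.

```python
from http import HTTPStatus


def reason_phrase(code):
    """Return the standard reason phrase for an HTTP status code."""
    return HTTPStatus(code).phrase


for code in (301, 302, 404):
    print(code, reason_phrase(code))
# prints:
# 301 Moved Permanently
# 302 Found
# 404 Not Found
```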

301 Permanent Redirect

The W3C guidelines for a 301 "permanent redirect" say that this is for use when a page has been permanently moved and you want people to record the new address in place of the old one.

In other words, say you change domains from superdupersite.com to reallycoolhotsite.com. You want people and search engines to know that the new domain should be used in place of the old one. You'd do a 301 permanent redirect like this:

superdupersite.com --- 301 Permanent Redirect ---> reallycoolhotsite.com

Do that, and it's the URL pointed at (reallycoolhotsite.com in this example) that should be retained for use in, say, search listings.

302 Temporary Redirect

The W3C guidelines for a 302 "temporary redirect," as it's commonly though inaccurately referred to, say that this is for use when a page has been temporarily moved to a new location and you want people to KEEP the old address rather than use the new one.

In other words, say one day superdupersite.com gets Slashdotted or receives heavy traffic, too much to keep serving up things as normal. You decide to temporarily send people off to a mirror site, mirror.superdupersite.com. You want people and search engines to reach the new location but not record the address as a permanent change. You'd do a 302 temporary redirect like this:

superdupersite.com --- 302 Temporary Redirect ---> mirror.superdupersite.com

Do that, and it's the site doing the pointing whose URL should be retained for use in search listings.
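The two "official" rules described so far can be condensed into a few lines. This is a sketch only: the function name url_to_keep is invented for illustration, and it assumes a crawler that sees nothing but a source URL, a target URL and a 301 or 302 status.

```python
def url_to_keep(source_url, target_url, status):
    """Pick the URL a search engine should list, following the rules as written."""
    if status == 301:
        # Permanent move: record the new address in place of the old one.
        return target_url
    if status == 302:
        # Temporary move: keep the old address, just fetch from the new one.
        return source_url
    raise ValueError(f"not a redirect code handled by this sketch: {status}")


print(url_to_keep("superdupersite.com", "reallycoolhotsite.com", 301))
# prints: reallycoolhotsite.com
print(url_to_keep("superdupersite.com", "mirror.superdupersite.com", 302))
# prints: superdupersite.com
```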

What Yahoo Does

Those are the official "rules," to some degree, on how redirects are supposed to be handled. However, following them has caused problems with the search engines.

In reaction, Yahoo made changes last year in how it handles redirects, as this slide (PDF file) illustrates. Exactly what Yahoo records will differ from what the W3C suggests, depending on the situation. I'll break that down below.

Yahoo: Redirects Between Domains

The first situation is for redirects between two DIFFERENT web sites or domains. Redirects work like this:

  • 301 Permanent Redirect
    source-domain --> target-domain = target-domain URL kept and used for listings
  • 302 Temporary Redirect
    source-domain --> target-domain = target-domain URL kept and used for listings

In the examples above, things work exactly the same. If the "source-domain" (the site doing the redirection) redirects in any way to "target-domain" (the site getting the traffic), the target-domain URL is kept.

This solves any hijacking problem. You can't point at someone else and possibly, as with Google (and likely MSN and Ask Jeeves), somehow get your own URL listed rather than that of the site you're pointing at. Or more important, someone can't redirect to you and manage to get their own URL listed instead of yours.

Here's what Yahoo's Eric Baldeschwieler, director of software development, emailed me about the change:

We decided to handle all cross domain redirects as permanent redirects to remove possibilities for abuse. We were able to avoid the "hijacking" problem. Also, the webmaster community was vocal in its desire to have good permanent redirect support and we have received very positive feedback on this change.

Yahoo: Redirects Within A Domain, Root Pages

Things are more complicated at Yahoo when you redirect within your own web site. First the situation with 301 permanent redirects and root pages or "home" pages:

  • root-page --> deep-page = root-page kept and used for listings

What's the logic here? Baldeschwieler said:

When a user searches for a domain, we would like to return a domain root page. This motivates our exceptional handling of domain root pages. We put this in place because it reduces the number of user complaints.

Let's use Amazon as an example. Say you go to Yahoo and search for Amazon by name. You want to reach the Amazon home page, in all likelihood. Type in Amazon.com into your browser to see what the home page URL looks like. It should be something similar to this:

http://www.amazon.com/exec/obidos/subst/home/home.html/LOTSOFNUMBERS

Why isn't the address of the Amazon home page just www.amazon.com? Technically, it is. Enter that address, just the domain name, and the "root" page loads up, the default page the server sends out if you don't give it a specific page. But Amazon redirects requests for that page deep within its site and appends a number for tracking purposes.

Now URLs that show in search results are important to users. Studies have shown that they rely on them for making choices. If you want to reach the Amazon home page, it makes a lot of sense to show a nice, short URL rather than the redirection URL that Amazon uses. That's what Yahoo does with this change. Do a search for Amazon, and you'll see the URL shown is:

www.amazon.com

Yahoo does this by technically breaking the rules, but to me, it's a good reason to break them (and hopefully the "rules" will officially change for search engines).

In contrast, Google follows the rules, so the URL it lists for a search on Amazon looks like this (FYI, the tracking numbers aren't shown because, as Google can't be cookied, I believe Amazon doesn't generate a unique code for its spider):

www.amazon.com/exec/obidos/subst/home/home.html

You can see the difference. There are some rare occasions when someone might redirect from their home page and want the "deep" page URL to be used. Yahoo admits this and says perhaps other mechanisms will be found to help solve that. But for the most part, I think this "rule breaking" is a good idea.

Yahoo: Redirects Within A Domain, Deep Pages

Now clear your mind of home pages. What if you want to redirect from one "deep" page within a web site to another deep page? Yahoo does this if you do a 301 permanent redirect:

  • deep-page --> other-deep-page = the page directed to is used for listings

That makes sense (and follows the rules). If you are redirecting from deep pages within a domain, chances are you really do want the "new" address used in listings.

What if you don't want the new address recorded? Easy. Just use a 302 temporary redirect and it follows this logic:

  • deep-page --> other-deep-page = the redirect is NOT used for listings

As this is happening WITHIN your domain, Yahoo feels it can follow the rules in this case without risk that someone else is trying to "hijack" your listing, as happens with cross-domain redirection.
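Putting the three Yahoo cases together, the policy as described above might be sketched like this. Everything here is an approximation for illustration: the function names are invented, "same domain" is simplified to "same hostname," a root page is taken to be a URL with an empty or "/" path, and the handling of a 302 from a root page is my assumption, since only the 301 case was spelled out. It is not Yahoo's actual code.

```python
from urllib.parse import urlsplit


def same_site(url_a, url_b):
    """Simplifying assumption: two URLs are on the same site if hostnames match."""
    return urlsplit(url_a).hostname == urlsplit(url_b).hostname


def is_root_page(url):
    """Treat a URL with no path (or just "/") and no query as the root page."""
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def yahoo_style_url_to_keep(source_url, target_url, status):
    """Pick the URL to list, following the Yahoo behavior described in this article."""
    if status not in (301, 302):
        raise ValueError(f"not a redirect code handled by this sketch: {status}")
    if not same_site(source_url, target_url):
        # Cross-domain: every redirect is treated as permanent, so the
        # site doing the pointing can never take over the listing.
        return target_url
    if is_root_page(source_url):
        # Within a domain, the short root-page URL wins. The 302 case is
        # an assumption; the article only describes 301s from root pages.
        return source_url
    # Deep page to deep page within a domain: follow the standard rules.
    return target_url if status == 301 else source_url
```

With these assumptions, a 302 from another site to your page still lists your URL, while a 301 from http://www.amazon.com to its long home page address keeps the short one.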

What Google Does

How does Google handle redirects? With 301 permanent redirects, it uses the URL of whatever is pointed at. There are no potential hijacking problems with this.

With 302 temporary redirects, technically a search engine should keep the URL of the page doing the pointing as explained. If Google actually did this, the hijacking situation would be far worse than it is now. Anyone could point at anyone else and potentially hijack listings. And what if you encounter multiple sites all temporarily redirecting to another site? Which of the "pointing" domains should be kept?

To date, Google has primarily relied on PageRank to help sort the situation out. It generally assumes that the page with the highest PageRank score is the URL that should be used in search listings. Fair to say, this generally works out to be the case.

With Google's much publicized situation in May, there were a number of very unusual glitches that caused its AdSense page to end up having a lower PageRank score than the person pointing at it. Nevertheless, with so many sites out there, such unusual situations can add up.
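The "highest score wins" idea, and the way it can go wrong, can be shown with a toy sketch. The function name, the URLs and the scores are all hypothetical, and Google's real selection logic is certainly more involved than a single comparison.

```python
def pick_listing_url(scores):
    """Given {url: score} for the pages in a 302 chain, keep the top-scoring URL."""
    if not scores:
        raise ValueError("no candidate URLs")
    return max(scores, key=scores.get)


# Normal case: the page being pointed at outranks the page doing the pointing.
normal = {"http://pointer.example/go": 3.0, "http://target.example/": 8.0}
print(pick_listing_url(normal))
# prints: http://target.example/

# Glitch case: the target's score drops below the pointer's, so the
# pointing page's URL ends up in the listings instead.
glitch = {"http://pointer.example/go": 3.0, "http://target.example/": 2.0}
print(pick_listing_url(glitch))
# prints: http://pointer.example/go
```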

Solving The Problem

Back to my original question. Why not do what Yahoo does? Perhaps that's what Google will do. Google said when I spoke with it after the May incident that it wanted to spend the next few months getting feedback and experimenting with what seems to be the best solution. Indeed, it has been doing this.

What's so hard? Consider this situation:

  • Someone registers a domain name -- myname.com
  • Someone temporarily redirects from that domain to the home page of their blog, such as radio.weblogs.com/30383482/

Ideally, you'd want to keep the actual domain name used for listings, at least for the home page, rather than pointing at that big giant URL. Following the "rules" for a 302 temporary redirect allows this. And the Yahoo solution doesn't work because this is a cross-domain situation, where Yahoo always keeps the URL being pointed at.

Dave Naylor, one of our SEW Forum moderators, recently summed it up even better with the case of someone who was redirecting between two domains:

A long long time ago in a distant valley a young search engineer decides that the Google surfers would be better off seeing www.johnnybladeproductions.com rather than seeing johnnybladeproductions.home.att.net, and for a long time everyone was happy.. the webmaster was happy because Google showed his www. and not the free host. Then people realized that pointing a domain with a 302 would replace other people's URLs...and the whole 302 hijack problems started.

The "everyone was happy" days may be over, but unusual situations still come up that require some thought about how redirects should be handled. Google wants to explore these more before settling on changes.

Ideally, the other search engines would look at them and all come together with a standard. Indeed, while Yahoo has made changes, those could change again. Emailed Yahoo's Baldeschwieler:

Setting redirect handling policy requires balancing the needs of our users and webmasters. The first priority is to answer our users' queries. We also strive to give webmasters as much control as possible of their contents' representation in our index. Hence we are as open, transparent and standards compliant as possible.

I'd like to emphasize that redirect policy handling is not something we have set in stone. We continue to tune our approach as our systems improve and we receive feedback on our current policies. When we last reviewed our policy, we set out a set of principles which have guided our recent redirect handling decisions. Our current implementation choices were driven by the following principles.

  • Fix the Problems NOW
  • Don't wait for the perfect solution, work on creating one and remain flexible.
  • Pay attention to community feedback and user experience, not just standards when designing solutions.
  • Webmasters and tool vendors often ignore or reinterpret existing standards

Where possible, conform to standards and give control to webmasters. In cases where the above principles don't suggest a divergence from the standard, we conform to it. In all cases, keep webmasters informed of how we are handling redirects.

More Background & Your Feedback

For more on redirection issues, I highly recommend Claus Schmidt's Page Hijack: The 302 Exploit, Redirects and Google. He has lots of information there and has played a major role in helping people understand the problem that redirections can cause. You can also see:

Those are blog posts that reference past forum discussions and information from across the web on the topic, which has just continued to grow over the past year and a half.

In this article, I've focused on the situation with Google and Yahoo. I haven't yet had the chance to talk with Ask Jeeves and Microsoft on the topic, but I hope this will be a good starting point to go forward.

In particular, I hope you can help. Please provide your comments, suggestions and unusual situations you think need to be considered over in this forum thread: Indexing Summit 2: Give Your Feedback On Handling Redirects. Google especially is looking for that feedback. Perhaps a content-location header solution makes sense. Perhaps there are other solutions people have out there.

I'll be mining that thread for possible solutions that we'll explore at Indexing Summit 2, which will be happening at the next Search Engine Strategies show in San Jose this August. Redirection is one of the two key subjects of that. So please speak up now in the thread, so I can bring your voice out along with the others at the actual summit next month.