Friday, March 11, 2005

302 redirects and Googlejacking

There are currently a lot of webmasters up in arms about the way G is handling 302 redirects.

To put it as simply as possible: the effect of this problem is that one can damage another site's rankings in Google simply by "linking to it in a special way", i.e., redirecting to it using an HTTP 302 redirect.

It seems that it really is almost that easy. Many sites use 302s to link out to other sites for various legitimate reasons; even Yahoo and Alexa use such a linking scheme. For some reason G seems to know that these big players are merely linking out, and doesn't create a new index entry pairing the content of the target page with the URL of the redirect script, but on some smaller directories I have noticed this happening. I'm still not sure whether it's something in the way the 302 is executed that causes the double indexing, or whether Google just recognizes Alexa and Yahoo. I have found a couple of directory sites that I exchange reciprocal links with coming up in searches for my own content. I have asked them to cancel the links, of course. I recommend only exchanging links with sites that use proper "a href" hard links.
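
To make the distinction concrete, here is a rough sketch in Python (with made-up names and URLs, not any particular directory's actual script) of what a "go.php?someurl"-style redirect endpoint does. Instead of the page carrying a plain "a href" hard link to your site, it links to its own script, and the script answers with a "302 Found" pointing at you:

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs, urlparse

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. a request for /go?url=http://www.your-domain.tld/
            target = parse_qs(urlparse(self.path).query).get("url", ["/"])[0]
            # "302 Found" plus a Location header: the client is told the
            # content lives at the target address, but only temporarily.
            self.send_response(302)
            self.send_header("Location", target)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), RedirectHandler).serve_forever()

A proper "a href" hard link, by contrast, points straight at your URL with no script in between, which is why it does not cause this problem.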

Here is a great post from a Webmaster World forum thread that explains the details of the problem much more concisely than I could:


The full story of Google and 302s


You can't ban 302 referrers as such


Why? Because your server will never know that a 302 is used for reaching it. This information is never passed to your server, so you can't instruct your server to react to it.

You can't ban a "go.php?someurl" redirect script


Why? Because your server will never know that a "go.php?someurl" redirect script is used for reaching it. This information is never passed to your server, so you can't instruct your server to react to it.

Even if you could, it would have no effect with Google


Why? Because Googlebot does not carry a referrer with it when it spiders, so you don't know where it has been before it visited you. As already mentioned, Googlebot could have seen a link to your page in a lot of places, so it can't "just pick one". Visits by Googlebot have no referrers, so you can't tell Googlebot that one link pointing to your site is good while another is bad.
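
To illustrate why there is nothing to filter on, here is a minimal sketch (hypothetical, using Python's standard WSGI machinery) of everything your own server gets to see when a visitor or a spider arrives via such a redirect: a request for your own path plus whatever headers the client chose to send. Neither the 302 status nor the "go.php?someurl" URL appears anywhere in it, and in Googlebot's case there is no Referer header at all:

    def app(environ, start_response):
        # The request carries no trace of the redirect that led here.
        referer = environ.get("HTTP_REFERER")        # Googlebot: None
        user_agent = environ.get("HTTP_USER_AGENT", "")
        print("path:", environ.get("PATH_INFO"),
              "| referer:", referer,
              "| user-agent:", user_agent)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"your page content"]

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("localhost", 8080, app).serve_forever()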

You CAN ban clickthrough from the page holding the 302 script - but it's no good


Yes, you can - but this will only hit legitimate traffic, meaning that surfers clicking through from the redirect URL will not be able to view your page. It also means that you will have to maintain an ever-increasing list of individual pages linking to your site.

For Googlebot (and any other SE spider) those links will still work, as they pass on no referrer.
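
For completeness, this is roughly what such a referrer ban would look like: a hypothetical sketch wrapping the kind of WSGI app shown above, with a made-up, hand-maintained list of offending pages. It can only ever fire for human visitors whose browsers happen to send a Referer header; Googlebot sends none, so the spider sails straight through:

    # Hypothetical list of pages known to reach this site via a 302 script.
    BANNED_REFERRERS = ("directory.example/go.php",)

    def referer_ban(app):
        def wrapped(environ, start_response):
            referer = environ.get("HTTP_REFERER", "")
            if any(bad in referer for bad in BANNED_REFERRERS):
                # Only browsers that send a Referer ever trip this rule,
                # so it blocks legitimate surfers and nothing else.
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return wrapped

    def page(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"your page content"]

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("localhost", 8080, referer_ban(page)).serve_forever()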

This is what really happens when Gbot meets 302:


Here's the full lowdown. It's the first time I post it all. It's extremely simplified to benefit the non-tech readers among us, and hence not 100% accurate in the finer details, but even though I really have tried to keep it simple you may want to read it twice:

  1. Googlebot visits a page holding, e.g., a redirect script

  2. Googlebot indexes the content and makes a note of the links

  3. Links are sent to a database for storage until another Googlebot is ready to spider them. At this point the connection breaks between your site and the site with the redirect script, so you (as webmaster) can do nothing about the following:

  4. Some other Googlebot tries one of these links

  5. It receives a "302 Found" status code and goes "yummy, here's a nice new page for me"

  6. It then receives a "Location: www.your-domain.tld" header and hurries to that address to get the content for the new page (there is a small sketch of this exchange just after this list).

  7. It deliberately chooses to keep the redirect URL, as the redirect script has just told it that the new location (That is: your URL) is just a temporary location for the content. That's what 302 means: Temporary location for content.

  8. It heads straight to your page without telling your server on what page it found the link it used to get there (as, obviously, it doesn't know - another Googlebot fetched it)

  9. It has the URL (which is the link it was given, not the page that link was on), so now it indexes your content as belonging to that URL.

  10. Bingo, a brand new page is created (never mind that it does not exist IRL; to Googlebot it does)

  11. PR for the new page is assigned later in the process. My best bet: this is an initial calculation done something like the PR of the page holding the link, less one.

  12. Some other Googlebot finds your page at your right URL and indexes it.

  13. When both pages arrive at the reception of the "index" they are spotted by the "duplicate filter" as it is discovered that they are identical.

  14. The "duplicate filter" doesn't know that one of these pages is not a page but just a link. It has two URLs and identical content, so this is a piece of cake: Let the best page win. The other disappears.
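
To see steps 5 and 6 concretely, here is a small sketch that fetches a redirect URL the way a spider would and prints what comes back. It assumes the toy redirect script from earlier in this post is running on localhost:8000; http.client never follows redirects on its own, so the raw "302 Found" status and Location header are visible:

    import http.client

    conn = http.client.HTTPConnection("localhost", 8000)
    conn.request("GET", "/go?url=http://www.your-domain.tld/")
    resp = conn.getresponse()
    print(resp.status, resp.reason)      # 302 Found
    print(resp.getheader("Location"))    # http://www.your-domain.tld/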


So, essentially, by doing the right thing (interpreting a 302 as per the RFC), Google allows another webmaster to convince its bot that your website is nothing but a temporary holding place for content.

Further, this leads to the creation of pages in the index that are not real pages. And you can do nothing about it.
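
Finally, here is a toy model (purely illustrative, nothing like Google's real pipeline) of steps 9 through 14: two index entries carrying identical content, one keyed by your real URL and one keyed by the redirect script's URL, and a naive duplicate filter that keeps only one of them. Whichever URL it keeps is the one searchers will see:

    # The URLs below are made up for illustration.
    index = {
        "http://www.your-domain.tld/page.html": "your page content",
        "http://directory.example/go.php?url=http://www.your-domain.tld/page.html": "your page content",
    }

    def naive_duplicate_filter(index):
        # Keep one URL per distinct piece of content. Which URL "wins"
        # is entirely up to the filter; the webmaster has no say in it.
        winners = {}
        for url, content in index.items():
            winners.setdefault(content, url)
        return {url: content for content, url in winners.items()}

    print(naive_duplicate_filter(index))
    # If the redirect URL wins, a search for your content returns the
    # directory's go.php URL instead of your own page.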



I'm sure this will be rectified to some extent soon, as there is an explosion of publicity about it on the web right now, and Yahoo has had the problem fixed for a long time. MSN seems to be susceptible to it also, but not as much as G.
