Video Exclusive: Has Google Given Twitter a Cloaking Penalty?

It seems Google has given Twitter some kind of penalty, possibly for cloaking.

Watch this video for the details.


To check the cached Google page you will have to switch off JavaScript, using something like the Web Developer toolbar for Firefox.

Google is not being blocked as there is a cached page, but the cached page is not what is given to users, even when not logged in.

The pages are significantly different in structure.
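As a rough illustration of what "significantly different in structure" means here, this sketch (my own illustration, not code from the video) pulls the anchor targets out of two HTML snapshots, such as Google's cached copy versus the page served to users, and measures how much the two link graphs overlap. The regex extraction and the toy snippets are simplifications, not a real HTML parser.

```javascript
// Extract the href targets of all <a> tags from an HTML string.
// Regex-based extraction is a simplification for illustration only.
function extractHrefs(html) {
  var hrefs = [];
  var re = /<a\s+[^>]*href="([^"]+)"/gi;
  var m;
  while ((m = re.exec(html)) !== null) hrefs.push(m[1]);
  return hrefs;
}

// Fraction of the first snapshot's links that also appear in the second.
function overlap(a, b) {
  var setB = new Set(b);
  var shared = a.filter(function (h) { return setB.has(h); });
  return a.length ? shared.length / a.length : 1;
}

// Toy snapshots: the "cached" page links out, the "live" page does not.
var cached = '<a href="/andybeard">profile</a> <a href="/about">about</a>';
var live = '<a href="/login">login</a>';
console.log(overlap(extractHrefs(cached), extractHrefs(live))); // → 0
```

A low overlap between what Google cached and what users receive is exactly the kind of structural difference the video points at.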

Google have quite extensive webmaster guidelines, including a special one on cloaking and sneaky JavaScript.

Just to qualify the value of this video – some people, rightly or wrongly, look on me as an SEO expert – I don’t proactively consult, partially because I don’t believe my own opinions half the time, and in the past I have spent a great deal of time trying to prove myself wrong.

Ultimately even Google engineers directly involved with webspam or working directly on Google’s algorithms don’t know every aspect of how Google’s algorithms might work in every eventuality, and some results or side effects may be unintended.

It could well be that Twitter left old Twitter in place because Google was having problems crawling the new, primarily JavaScript Twitter. From my perspective the mistake was doing a funky JavaScript redirect and having a significantly different link graph, but there could easily be something I have overlooked.


Comments

  1. says

    If that’s what it looks like (and I agree, nothing else but a penalty makes sense, since it’s caching) it’ll be interesting to hear the inevitable chatter. Great catch, Andy!

    • says

      You would think, with the number of former Google engineers they have, that they would get things right, but maybe Google is better at hanging onto the search and webspam guys.

  2. pageoneresults says

    “It seems Google has given Twitter some kind of penalty, possibly for cloaking.”

    When you say penalty, in what way is Twitter being penalized? I see the chain of events that takes place there. If you don’t support JavaScript, you’re going to be redirected to a version that is non-JS-friendly. There’s a meta refresh of “0” that is executing the redirect.

    I would imagine Twitter’s new platform is a crawling nightmare. To avoid those challenges, Twitter is redirecting all non-JS requests. Personally, I think Twitter should just set NoArchive across the entire domain to avoid misinterpretations of this.
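A markup pattern like the one described would look roughly like this (a hypothetical sketch, not Twitter's actual source), combining the zero-second refresh for non-JS clients with the NoArchive directive suggested above:

```html
<!-- Hypothetical markup, not Twitter's exact source -->
<head>
  <!-- suggested: stop Google showing a cached copy of the page -->
  <meta name="robots" content="noarchive">
  <noscript>
    <!-- zero-second refresh sends non-JS clients to the old version -->
    <meta http-equiv="refresh" content="0; url=http://twitter.com/andybeard">
  </noscript>
</head>
```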

    site:twitter.com – About 3,550,000,000 results < That's a pretty healthy document count for being penalized. ;)

    • says

      It sure is a healthy document count… it used to be the case that a large number of Twitter search results would get indexed even when blocked by robots.txt.

      Thus that isn’t the issue… and penalties can be based upon single pages, with the intent that those single pages don’t pass PageRank.

      To see the cache results I had in the video, I had to switch JavaScript off:

      cache:twitter.com/andybeard

      Google isn’t really going to bat an eye at stuff like this, right?

      <script type="text/javascript">
      //<![CDATA[
      window.location.replace('/#!/andybeard');
      //]]>
      </script>
      <script type="text/javascript">
      //<![CDATA[
      (function(g){var c=g.location.href.split("#!");if(c[1]){g.location.replace(g.HBR = (c[0].replace(/\/*$/, "") + "/" + c[1].replace(/^\/*/, "")));}else return true})(window);
      //]]>
      </script>
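For clarity, the URL rewrite in the second quoted snippet boils down to the following, restated here as a standalone function (my restatement for illustration, not Twitter's actual code):

```javascript
// Turn a hash-bang URL into its plain equivalent, the same way the
// quoted snippet does: split on "#!", trim slashes, rejoin with "/".
function unbangUrl(href) {
  var c = href.split("#!");
  if (c[1]) {
    // strip trailing slashes from the base, leading slashes from the
    // fragment, then join them with a single "/"
    return c[0].replace(/\/*$/, "") + "/" + c[1].replace(/^\/*/, "");
  }
  return href; // no hash-bang: leave the URL alone
}

console.log(unbangUrl("http://twitter.com/#!/andybeard"));
// → "http://twitter.com/andybeard"
```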

      There is a meta description on the page being crawled, but it has no snippet at all:

      <meta content="Andy Beard (AndyBeard) is on Twitter. Sign up for Twitter to follow Andy Beard (AndyBeard) and get their latest updates" name="description" />
      

      It won’t be the first time Google has ignored a title tag, but I don’t see a logical reason for it to do so suddenly:

       <title id="page_title">Andy Beard (AndyBeard) on Twitter</title>
      

      I must admit I can’t see any logic to giving an id to a title in the header, but you are the standards guy.

      My first recording was 20 minutes, and I knew I had to get the video down to a more manageable length – forgive me for not including every piece of evidence that led me to my conclusion. I also had to consider upload time for the video, which is why it might not be as clear in parts as I would have liked.

    • says

      I should also mention that, from Google’s perspective, the important distinction is that a user clicking a search result should see what Google sees.
      I didn’t even examine what happens if I visit a /#!/ URL, as that doesn’t really matter.

  3. pageoneresults says

    Per our discussion on Twitter, this article gives a very in-depth review of what may be happening with Twitter and hash-bang URIs…

    Hash URIs – Jeni Tennison

    The above article is an excellent read and provides detailed insight into this whole hash-bang thing, which is, to say the least, akin to frackin’ brain surgery. You need to be on your p’s and q’s to fully understand what may be happening here.

    Based on watching your video a few more times and then reading Jeni’s article on Hash URIs, the word Brittleness sticks in my mind. Twitter may be experiencing challenges in getting content indexed properly under the new AJAX platform.

    • says

      Also per Twitter (and yes that is a great article)

      Google has a page indexed… the old Twitter page which contains both a title and description.

      But they have chosen for some reason not to display that content.

      Also for this to be just a brittleness issue, in the past Google would have had to display a hash-bang URL in the SERPs and to my recollection they have never done this for Twitter.

      They have always displayed the http://twitter.com/andybeard URL

      The hash-bang URL has also always had a different link profile – the difference is that until fairly recently it wasn’t the default view.

  4. says

    I also had a look at how they implemented the hash bang (#!) page for Google.

    #! is a signal that the page implements “Google’s Proposed Ajax Solution” and that Google should go to the following page to see the true content:

    twitter.com/?_escaped_fragment_=/andybeard
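That mapping can be sketched as follows (my reading of Google's published AJAX crawling scheme, simplified; real crawlers also percent-escape certain characters in the fragment):

```javascript
// Map a "#!" (hash-bang) URL to the "_escaped_fragment_" URL that a
// crawler supporting the AJAX crawling scheme would fetch instead.
// Simplified sketch: fragment escaping is omitted for clarity.
function escapedFragmentUrl(href) {
  var parts = href.split("#!");
  if (!parts[1]) return href; // no hash-bang, nothing to map
  return parts[0] + "?_escaped_fragment_=" + parts[1];
}

console.log(escapedFragmentUrl("http://twitter.com/#!/andybeard"));
// → "http://twitter.com/?_escaped_fragment_=/andybeard"
```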

    And the fun begins again. This page has a 301 redirect to the old twitter page.

    So the new #! pages are also telling Google to index the old page for content.

    It looks like the other #! pages of a profile map to their old equivalents, so can fall into the login trap.

    It’s a very strange way to do things. Why did they go with the #! switch if the result is that they tell Google to ignore it?

    I’m guessing it’s a case of fancy Ajax development done before considering the SEO implications… then hack.

    • says

      In many ways it is a bolt-on solution to bolt-on solutions, and I am sure a lot of decisions made at Twitter are just to cope with rapid expansion and insane load.

      That wouldn’t, however, preclude them spending an afternoon of one engineer’s time to make the pages show the same information.

  5. says

    I’m not a techie so I don’t totally follow some of the technical points, but I doubt the value in Twitter is related to its Google rank. Surely it is to do with its users and their reach and influence. Or are you just using this example to illustrate what can happen when Google’s technical webmaster guidelines are breached?