Dear @twittercomms – A Basic Search Query For Your Engineers

 

I am not sure why Twitter engineers are struggling with this, and potentially misleading tech journalists (and even some well-respected SEOs) about the SEO issues hampering the indexation of Twitter.com.

This is all pretty basic technical SEO.

Here is a very simple, cleaned-up Google search query that will bring up some interesting results:

https://www.google.com/search?q=site:twitter.com+inurl:andybeard&filter=0
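For anyone who wants to reproduce or tweak that query, here is a quick Python sketch that assembles it. The helper function is my own invention for illustration, not any Google API:

```python
from urllib.parse import urlencode

def google_site_query(site: str, extra: str = "", filter_dupes: bool = False) -> str:
    """Build a Google site: debugging query as a URL (illustrative helper)."""
    q = f"site:{site}"
    if extra:
        q += " " + extra          # e.g. an inurl: operator
    params = {"q": q}
    if not filter_dupes:
        params["filter"] = 0      # ask Google to show results it would normally omit
    return "https://www.google.com/search?" + urlencode(params)

print(google_site_query("twitter.com", "inurl:andybeard"))
# → https://www.google.com/search?q=site%3Atwitter.com+inurl%3Aandybeard&filter=0
```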

It doesn’t exclude subdomains such as www.twitter.com, fr.twitter.com or api.twitter.com.

Twitter splits its domain authority between lots of different subdomains which have no business being indexed.

It also doesn’t distinguish between http and https – typically those should also be canonicalized (think of Highlander: “There can be only one!”)
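As a rough illustration of that kind of canonicalization, here is a minimal sketch – this is my own code, not Twitter’s, and the choice of https plus the bare domain as the canonical version is an assumption:

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: the bare domain over https is the one true version.
CANONICAL_HOST = "twitter.com"

def canonicalize(url: str) -> str:
    """Collapse scheme and subdomain variants down to one canonical URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Fold www. / fr. / api. etc. down to the canonical host
    if host.endswith("." + CANONICAL_HOST):
        host = CANONICAL_HOST
    # "There can be only one" scheme, too
    return urlunsplit(("https", host, parts.path, parts.query, parts.fragment))

print(canonicalize("http://www.twitter.com/andybeard"))
# → https://twitter.com/andybeard
```

In practice a site would enforce this with 301 redirects or rel=canonical rather than a helper function, but the mapping is the same.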

The &filter=0 tells Google not to omit URLs it might otherwise filter out due to thin content, low PageRank etc. It is especially useful for picking up URLs which are blocked by robots.txt

Like these

The first red arrow is what Google sees when it decides not to follow the funky JavaScript redirects that Twitter uses. It is possible Google sees that a lot – it is like having a door slammed in your face.

The second is the bouncers at the door of the nightclub… you get to see all the cool stuff entering Twitter’s archives, but they won’t let Google through, as Google isn’t willing to tip the bouncers enough money, or hasn’t got the right friends.

That barrier prevents Google crawling deeper into your content, so while Google may have seen a piece of content once, it may eventually drop out of the index unless other sites in the Twitter ecosystem maintain links directly to that content that Google can follow.

So, for instance, if sites like Topsy and Tweetmeme are crawled by Google and link directly to a tweet, it is possible for Google to find that content… but that is far from perfect.

What controls Google in this way is Twitter’s robots.txt file https://twitter.com/robots.txt

The line in the Googlebot section of that file that is causing a lot of the issues is this one:
Disallow: /*?

Effectively, Google is not allowed to look at any URL on the whole of Twitter that contains a “?” – in other words, any URL with query parameters.

Here are two more queries for you:

https://www.google.com/search?q=site:andybeard.tweetglide.com/blog&filter=0&start=991

https://www.google.com/search?q=site%3Atwitter.com%2FAndyBeard&filter=0&start=991

Those queries start at result 991… Google will only list up to 1000 items for a search, so that kind of query will show the end of the search results. If used on huge sites you might have to refine things down to a subset of pages where possible, but in this case my Twitter account only has 5100 tweets and Twitter should easily be able to get all of those indexed. I am restricting the second search to the folder /AndyBeard.

I haven’t linked directly to either my Twitter account or my archived copy on Tweetglide for some time, so here they both get a link, along with a screenshot of current indexation covering 2 years… not so many tweets over that time due to a long hiatus.

http://andybeard.tweetglide.com/blog

http://twitter.com/andybeard

341 vs 321

The winner here seems to be Tweetglide, and it seems fairly close until you examine all the URLs for Twitter that Google crawls needlessly as duplicate content on different subdomains, and remove all the junk pages (and some good stuff) that are blocked by robots.txt.

Such as this

You also have to understand that in the last 2 years, due to my hiatus, I have only created 280 tweets as archived by Tweetglide (there may be a few early ones missing); the additional tweets in that deep search result are the archived tweets of the people I have conversations with.

If we remove that filter parameter, things are drastically different:

https://www.google.com/search?q=site:andybeard.tweetglide.com/blog&start=991

https://www.google.com/search?q=site%3Atwitter.com%2FAndyBeard&start=991

341 vs 253

That is when Google filters out lots of the duplicate junk from Twitter, and none from Tweetglide.
That filtering isn’t perfect… there are still some duplicates – if Twitter retains half the indexation of Tweetglide I will be amazed.

Is there a crawl limit?

It vastly depends on juice… my Twitter profile at one time had enough juice that, if Google had been allowed to crawl, it probably would have picked up 25-50% of the 5100 tweets. But Twitter doesn’t allow you to paginate that far into its archive (even if it weren’t blocked), and even the API is limited and can only pick up around 3000 historical tweets.
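That API ceiling is easy to demonstrate with a toy model of the old user_timeline pagination. fetch_page below is a stand-in I wrote, not the real API call, and I use 3200 as the cap, the commonly cited figure consistent with the “around 3000” above:

```python
# Assumption: the cap on statuses/user_timeline is 3,200 tweets, paged
# backwards 200 at a time via a max_id cursor.
TIMELINE_CAP = 3200
PAGE_SIZE = 200

def fetch_page(all_tweets, max_id=None):
    """Simulated API call: up to PAGE_SIZE tweets with id < max_id,
    never reaching past the newest TIMELINE_CAP tweets."""
    visible = all_tweets[:TIMELINE_CAP]      # the API hides anything older
    if max_id is not None:
        visible = [t for t in visible if t < max_id]
    return visible[:PAGE_SIZE]

def archive_timeline(all_tweets):
    """Page backwards through the timeline until the API runs dry."""
    collected, max_id = [], None
    while True:
        page = fetch_page(all_tweets, max_id)
        if not page:
            return collected
        collected += page
        max_id = page[-1]                    # continue from the oldest id seen

# A user with 5,100 tweets (ids newest-first) only ever gets 3,200 back:
tweets = list(range(5100, 0, -1))
print(len(archive_timeline(tweets)))         # 3200
```

However diligently an archiving service pages, the oldest ~1,900 of those 5,100 tweets are simply unreachable through the API.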

My good online friend Vlad Zablotskyy has more tweets than me archived on Tweetglide.

https://www.google.com/search?q=site:vladzablotskyy.tweetglide.com/blog&filter=0&start=990

He actually has fewer pages indexed than me, possibly because of juice, so let’s give him some and see if we can crack the 500 barrier.

This has been a small introduction to Twitter’s SEO woes, hopefully demonstrating to laymen that all is not well with Twitter, and that any claims that it can be indexed normally are false. Any competent SEO could have found all of these issues with the site, and fixing them would reduce the load Google places on Twitter’s servers and maybe allow Google to index more content.
I have avoided additional complications with rel=”nofollow”, potential cloaking issues with their implementation of #! hashbang URLs and funny JavaScript redirects, and haven’t touched on some additional nuances in the way they feed juice to list pages.

Disclosure: When Tweetglide launched I offered some SEO tips (pro bono) over a few days to the owner and one of his engineers, and had 100% “buy in” on my recommendations for the site’s internal linking structure. I would possibly change the structure of the pagination links at the bottom, but otherwise I think the site is doing great from an SEO perspective… damn… that there is no difference between with and without filter=0 actually amazes me.

 
