I am not sure why Twitter engineers are struggling with this, and potentially misleading tech journalists (and even some well-respected SEOs) about the SEO issues hampering the indexation of Twitter.com.
This is all pretty basic technical SEO.
Here is a very simple, cleaned-up Google search query that will bring up some interesting results.
It doesn’t exclude subdomains such as www.twitter.com, fr.twitter.com or api.twitter.com.
Twitter splits their domain authority between lots of different subdomains which have no business being indexed.
It also doesn’t distinguish between http and https – typically those should also be canonicalized to a single version (think of Highlander: “There can be only one!”).
The &filter=0 parameter tells Google not to drop URLs it would otherwise filter out for having no content, low PageRank and so on. It is especially useful for picking up URLs which are blocked by robots.txt.
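As a rough illustration of how such a query URL is put together, here is a minimal Python sketch. The query string `site:twitter.com` and the helper name `google_search_url` are my own assumptions for the example; only the `filter=0` parameter comes from the text above.

```python
from urllib.parse import urlencode


def google_search_url(query, unfiltered=True):
    """Build a Google search URL, optionally with filter=0 appended.

    `query` is whatever you would type into the search box; the
    site: query used below is only an illustrative assumption.
    """
    params = {"q": query}
    if unfiltered:
        # filter=0 asks Google not to collapse results it would
        # otherwise hide as duplicate or low-value.
        params["filter"] = "0"
    return "https://www.google.com/search?" + urlencode(params)


print(google_search_url("site:twitter.com"))
```

Running the same query with `unfiltered=False` gives the normal, filtered result count to compare against.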
The second is the bouncers at the door to the nightclub… you get to see all the cool stuff entering Twitter’s archives, but they won’t let Google through, as Google isn’t willing to tip the bouncers enough money or hasn’t got the right friends.
That barrier prevents Google from crawling deeper into the content. So even if Google was very observant and saw a piece of content once, it may eventually drop out of the index unless other sites in the Twitter ecosystem maintain direct links to that content that Google can follow.
So, for instance, if sites like Topsy and Tweetmeme are crawled by Google and link directly to a tweet, it is possible for Google to find the content… but that is far from perfect.
What controls Google in this way is Twitter’s robots.txt file https://twitter.com/robots.txt
The line of that file in the Google section that is causing a lot of the issues is this one
Effectively, Google is not allowed to look at any URL on the whole of Twitter that contains a “?”, i.e. any URL with a query parameter.
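To show how a wildcard rule of that kind behaves, here is a small Python sketch. The pattern `/*?` is a hypothetical stand-in chosen to match the behaviour described above, not a quote from Twitter’s file, and the matcher is a simplified take on Googlebot-style `*` and `$` wildcards (Python’s built-in `urllib.robotparser` does not handle these wildcards).

```python
import re
from urllib.parse import urlsplit


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with * and $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each * matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(url, disallow_pattern):
    """Return True if the URL's path (plus query) matches the pattern."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # Rules match from the start of the path, hence match() not search().
    return robots_pattern_to_regex(disallow_pattern).match(target) is not None


# Hypothetical rule: block anything containing a "?".
print(is_blocked("https://twitter.com/andybeard?page=2", "/*?"))
print(is_blocked("https://twitter.com/andybeard", "/*?"))
```

Under that assumed rule, the paginated URL is blocked while the plain profile URL is not, which is why deeper archive pages never get crawled.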
Here are 2 more queries for you
Those are 991 searches… Google will only list up to 1000 items for a search, so that kind of query will show the end of the search results. If used on huge sites, you might have to refine things down to a subset of pages where possible, but in this case my Twitter account only has 5100 tweets, and Twitter should easily be able to get all of those indexed. I am restricting the search to a folder, /andybeard.
I haven’t linked directly to either my Twitter account or my archived copy on Tweetglide for some time, so here they both get a link, with a screenshot of current indexation over 2 years… not so many tweets over that time due to a long hiatus.
341 vs 321
The winner here seems to be Tweetglide, and it seems fairly close until you examine all the URLs that Google needlessly crawls for Twitter as duplicate content on different subdomains, and remove all the junk pages (and some good stuff) that are blocked by robots.txt.
Such as this
You also have to understand that in the last 2 years, due to my hiatus, I have only created 280 tweets as archived by Tweetglide (there may be a few early ones missing), and the additional tweets in that deep search result are the archived tweets of the people I have conversations with.
If we remove that filter parameter, things are drastically different.
341 vs 253
That is when Google filters out lots of the duplicate junk from Twitter, and none from Tweetglide.
That filter removal isn’t perfect; there are still some duplicates. If Twitter retains half the indexation of Tweetglide, I would be amazed.
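If you want to discount those duplicates yourself, one way is to collapse scheme and subdomain variants of each result URL before counting. The sketch below is only an illustration: the set of hosts treated as duplicates of twitter.com is my assumption, and the sample URLs are made up.

```python
from urllib.parse import urlsplit

# Assumption: these hosts serve the same pages as the bare domain.
DUPLICATE_HOSTS = {"www.twitter.com", "fr.twitter.com"}
CANONICAL_HOST = "twitter.com"


def canonical_key(url):
    """Reduce a URL to host + path, ignoring scheme and trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in DUPLICATE_HOSTS:
        host = CANONICAL_HOST
    path = parts.path.rstrip("/") or "/"
    return host + path


def unique_pages(urls):
    """Return the set of distinct pages once duplicates are collapsed."""
    return {canonical_key(u) for u in urls}


sample = [
    "http://twitter.com/andybeard",
    "https://twitter.com/andybeard",
    "http://www.twitter.com/andybeard/",
    "http://fr.twitter.com/andybeard",
]
print(len(sample), "URLs,", len(unique_pages(sample)), "unique page(s)")
```

A site that canonicalizes properly would show the same count before and after this step.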
Is there a crawl limit?
It depends heavily on juice. My Twitter profile at one time had enough juice that, if Google had been allowed to crawl, it probably would have picked up 25-50% of the 5100 tweets. But Twitter doesn’t allow you to paginate that far into its archive (even if it wasn’t blocked), and even the API is still limited and can only pick up around 3000 historical tweets.
My good online friend Vlad Zablotskyy has more tweets than me archived on Tweetglide.
He actually has fewer pages indexed than me, possibly because of juice, so let’s give him some, and see if we can crack the 500 barrier.
This has been a small introduction to Twitter’s SEO woes, hopefully demonstrating to laymen that all is not well with Twitter, and that any claims that it can be indexed normally are false. Any competent SEO could have found all of these issues with the site, and fixing them would reduce the load on Twitter’s servers caused by Google, and maybe allow Google to index more content.
Disclosure: When Tweetglide launched I offered some SEO tips (pro bono) over a few days to the owner and one of his engineers, and had 100% “buy in” to follow my recommendations for internal linking structure on the site. I would possibly change the structure of the pagination links at the bottom but otherwise I think the site from an SEO perspective is doing great… damn… no difference between with and without filter=0 actually amazes me.