SEOmoz LDA Tool – Just 3 Points

 

I am not an algorithm and patent geek to the extent of many of my friends & peers.

Dana last week wrote about the new SEOmoz experimental tools for LDA (Latent Dirichlet Allocation).

Rand has now followed up with a clearer description of what their tool is.

I played around with the tool a little and there was some discussion on Sphinn in relation to another post about it.

At the time I expressed some concern to friends over the randomness of the results, but let that slide. I was a little more concerned that the tool was announced as available at the same time my inbox was filling with notifications that now was the last chance to upgrade to a SEOmoz Pro subscription.
In the Sphinn conversation Rand asked people not to review the LDA tool based on the information currently available, and to wait until Tuesday, when he would have a more detailed post up.

So on one hand we have price scarcity, a viable marketing strategy; on the other, a request/suggestion not to review the new-fangled third-generation keyword tool (keyword density, then latent semantic indexing (LSI), now LDA (Latent Dirichlet Allocation)) that was causing a lot of positive buzz.

I love the passion in the article, but I’d ask that we have until our public release on Tuesday to explain what it does, how it calculates, the models, math, etc. I won’t try to address criticism until then.

So that is point 1 – suppressing objective reviews based upon available information whilst using price scarcity sucks.
I did delay this post and wait, but only because I wanted to see the official claims before this next point.

So now we look at the tool itself and see if it has some kind of useful application. I mentioned random results.
I thought about using lots of screenshots of every test, but then I decided the exact test I performed doesn’t really matter.

55%, 49%, 59%, 52%, 56%, 57%, 55%, 57%, 67%, 57% = correlation?

Dana mentioned the “random Monte Carlo algorithm” in her post.

Rand was a little more specific:-

Scores Change Slightly with Each Run
This is because, like a pollster interviewing 100 voters in a city to get a sense of the local electorate, we check a sample of the topics a content+query combo could fit with (checking every possibility would take an exceptionally long time). You can, therefore, expect the percentage output to flux 1-5% each time you check a page/content block against a query.
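
Rand’s pollster analogy is easy to simulate. Here is a minimal sketch in plain Python, purely illustrative and not SEOmoz’s model, showing how estimating a proportion from a sample of 100 drifts from run to run:

# Minimal sketch of the pollster analogy: estimating a proportion from a
# sample of 100 "voters" gives a slightly different answer on every run,
# just as checking only a sample of possible topics would. Purely illustrative.
import random

TRUE_SUPPORT = 0.56  # assumed "true" value, picked for the illustration

for run in range(5):
    sample = [random.random() < TRUE_SUPPORT for _ in range(100)]
    print(f"run {run + 1}: estimate = {sum(sample)}%")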

I would need to run a lot more tests, but on my “swingometer” that small sample alone shows a spread of 18 percentage points. The mid-point of the sample sits in the mid 50s, which might suggest I hadn’t yet seen the worst of it and that a spread of at least 20 points is possible.
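
For anyone who wants to check that arithmetic, here is a minimal sketch using the ten scores quoted above:

# Summary statistics for the ten LDA scores quoted above: the same
# content+query combination run ten times through the tool.
from statistics import mean, stdev

scores = [55, 49, 59, 52, 56, 57, 55, 57, 67, 57]  # percentages from repeated runs

print(f"mean:   {mean(scores):.1f}%")           # ~56.4%
print(f"spread: {max(scores) - min(scores)}%")  # 67 - 49 = 18 percentage points
print(f"stdev:  {stdev(scores):.1f}%")          # sample standard deviation, ~4.7%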

Rand seems to be claiming some kind of correlation in his blog post title.

Latent Dirichlet Allocation (LDA) and Google’s Rankings are Remarkably Well Correlated

To claim correlation he would have had to run each page through his LDA tool maybe 100 times to get some kind of reasonable average.
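
As a rough illustration of why (this is an assumption about how you would tame the per-run noise, not a description of how SEOmoz ran their correlation study), the standard error of an average shrinks with the square root of the number of runs:

# Rough illustration, not SEOmoz's method: if a single run has a standard
# deviation of ~4.7 points (estimated from the ten scores above), the
# standard error of the mean of n runs is roughly stdev / sqrt(n).
from math import sqrt

single_run_stdev = 4.7  # estimated from the small sample above

for n in (1, 10, 25, 100):
    print(f"{n:>3} runs -> standard error ~ {single_run_stdev / sqrt(n):.2f} points")
# 1 run gives ~4.70 points of wobble on the average; 100 runs gives ~0.47.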

So point 2 is that the results currently being portrayed as some kind of correlation could well be bullshit based on the extremely unreliable results from the LDA tool.

The final point is a little about data.

SEOmoz have a history, possibly a little unfounded, of making huge claims about things and then being picked apart. An interesting example was the claims about the source of their data for Linkscape. Sebastian covered it, as did Michael (actually quite a bit).

Linkscape data is used by services like Open Site Explorer and many 3rd parties via their API. It is useful… the point is that data has to come from somewhere.

“The Rising Tide Lifts All Boats” – as long as it lifts you higher than your competitors

In the case of using the SEOmoz LDA tool effectively, one option is to enter a website address (that is what I was testing) – then you might run 10, 50, maybe 100 views of that page to get reliable data.
Then you would do the same with 10 of your competitors.
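
A minimal sketch of that workflow, where fetch_lda_score is a hypothetical stand-in for however you drive the tool (it is not a real SEOmoz API call):

# Sketch of the workflow described above: average repeated tool runs per URL,
# then compare your page against competitors. fetch_lda_score() is a
# hypothetical stand-in; SEOmoz's actual interface may differ.
from statistics import mean
from typing import Callable

def average_score(url: str, keyword: str,
                  fetch_lda_score: Callable[[str, str], float],
                  runs: int = 25) -> float:
    """Average `runs` repeated scores to smooth out the per-run randomness."""
    return mean(fetch_lda_score(url, keyword) for _ in range(runs))

def compare(my_url: str, competitor_urls: list[str], keyword: str,
            fetch_lda_score: Callable[[str, str], float]) -> dict[str, float]:
    """Return an averaged score for my page and each competitor for one keyword."""
    urls = [my_url] + competitor_urls
    return {url: average_score(url, keyword, fetch_lda_score) for url in urls}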

All the time you are showing the keyword that a particular page, possibly one of your own, should rank for, and also the pages you feel it should compete against.

As far as business data is concerned, that is pretty useful. I know that SEOmoz will certainly claim that any data collected is only used for improvements in the tool.
In fact SEOmoz do include a statement on their main website regarding tools data.

What Tool Usage Information Does SEOmoz Collect?

SEOmoz offers a variety of online tools and software. These include, but are not limited to, our free SEO tools, our paid SEO tools, our API, and our tools on OpenSiteExplorer.org. These tools require you to enter a variety of information, such as URLs, domains, keywords, or other items relevant to Internet marketing and link research. We associate this information with your account in order to provide useful features, identify and terminate accounts that violate our Terms of Service, to improve our products, and to provide customer service. We never use this information for the provision of SEO consulting services so you do not need to worry that entering your information will be used against you or your clients by SEOmoz.

We take appropriate physical, electronic, and other security measures to help safeguard personal information from unauthorized access, alteration, or disclosure.

Some people, however, have tin foil hats. The ones with the biggest tin foil hats are not the blackhats, but those who work with major corporations and have signed contracts that prevent them from using tools that might in any way share data with 3rd parties.

Small update:-

When I wrote this part I was actually thinking about whether someone like Steve Plunkett would use a 3rd party tool like this, and he conveniently tweeted.

http://bit.ly/9VPf9y - TOOLS... yes SEO haz them, but should YOU use them? (with all the client data?)
@steveplunkett

Steve works in the corporate SEO world – data security is sacrosanct for his INC500 clients.

So the 3rd point is: be aware of who you are sharing data with about your own sites or those of your clients.

Ultimately a real LDA/Google information retrieval geek (my mate Dave) is fairly positive that the conversation and experimentation are taking place.

That doesn’t mean Google are actually using LDA, or suggest any correlation with the current SERPs.

One thing I am sure of though: if you have paid for some kind of SEO miracle plugin/software/service for bloggers based upon keyword density, you really need to read and try to understand Rand’s post, as it highlights how much junk you have been fed by your Tribe.
LDA > LSI > Keyword Density, but that doesn’t mean any of them are used by Google for search engine results.

Update

You know how, once you release technology, people either copy it very quickly or release their own competing product… “showing their cards” so to speak?

Well it looks like Matthew has an Advanced LDA tool with a very simple interface that seems to give very similar correlation data to what I discovered with the SEOmoz LDA tool.

 


Comments

  1. asbar says

    In this digital age, Google AdWords have the magic to open cash fountains for businesses, much like Ali Baba’s magic line, “open sesame.” Google AdWords however, do not reveal a cave of gold treasures; instead it gives businesses so much more: consistent cash flow through ingenious online marketing, with the entire world as the audience and target market.

  2. Cody says

    Interesting writeup, I like the research that you’ve done.

    To add to your section about LDA > LSI > Keyword Density … none of them are what Google uses to rank. I think that’s the most important consideration, what you’ve pointed out here. While LDA is a good tool, and of course Google is going to be using some form of topical modeling, it doesn’t mean that it ranks using LDA. I would like to do some research, but assuming SEOmoz’s tool is correct, I would imagine that there are results stuck on page 2 or 3 that actually have better LDA content ratings than what’s ranking #1.

    Why? Because while Google may be using query expansion, topical modeling, and relevance feedback … it’s ultimately built upon, and added to, PageRank.

  3. says

    Rand is in business. Of course he’s going to spin things to promote his products. People need to understand that, yet enough people will fail to do so. So much the better for those of us who see the bigger picture.

    Personally, I use OSE – not for 100% accuracy, but for trending. Just like the GKT. Even though every tool can be wildly useless when taken at face value, some of them are either good for trending, or to get me to think more. And that’s why I like when new tools come out. They get me thinking more.

    Would I suggest their new LDA tool for anyone at all? Only to get them to think more.

    • James says

      I’m not too sure I’d recommend their LDA tool to anybody… it looks like a glamourised version of a keyword density checker! However, I did look more into LDA, LSI and vector space theory, so I suppose it did get me thinking more; whether it was ultimately useful, who knows?

    • says

      Even keyword density tools at least get people to think; combined with encouraging them to link out and improve their titles, Google finally understands what their documents contain, they suddenly start ranking, and they holler that whatever they use is a gift from god… or the SEO equivalent.

      The random results are random results.

      The claims being made at the same time as the price scarcity were a lot more exaggerated than those in Rand’s post.

      The data implications are tangible, and I included that tweet from Steve because things like that matter to him.

      I joked with him earlier that he has the thickest tin foil hat I know, thicker than that of most blackhats, because he can’t share data.

  4. Recyclage Electroniq says

    Google AdWords really is excellent for making a huge amount of money. It’s a very simple process, at least better than all other outsourcing. On the other hand, it gives the business a different flow towards success.

    Thanks Andy for the great article.

  5. says

    LDA doesn’t care about word proximity or word order, so it’s not relevant to what Google is trying to accomplish in its search results.

    And, as usual, the SEOmoz “correlation data” is anything BUT correlative.

    So now we’re in for another two years of nonsense until people get the message: LDA is irrelevant.

    • says

      I just avoided the topic of whether LDA is really relevant to Google and bow to superior knowledge.

      I can’t see it being possible to claim correlation with the current data output… which might be a dumbed down version of the output they used in their tests. Initial testing might have had access to a wider data set and thus been more accurate, at the expense of more processing – but providing that at scale might not have been viable.

      If that was the case, it would make the public tool crippled and not the same as what was used in testing.

      2 years of nonsense? I hope not, but par for the territory.

  6. says

    I’ve been looking at semantics and IR only half as much as I should in recent years. I don’t think there is some secret sauce, but we do know that Google cares about relevant content. From reading Michael’s comment, it sounds like he sees more focus on word proximity or word order than context in general. Gosh, Google, why don’t you just tell us which combination of models and words it is? (rhetorical question)

    I did want to share that the hype around the tool came as a result of tweets and members of the audience at the SEOmoz seminar, not from the stage. I’m not defending but clarifying. You raise some valid points for consideration, Andy.

    A slide in Ben’s LDA presentation read:

    “One of us needs to implement it so we can:
    1) See how it applies to pages
    2) See if it helps explain SERPs
    One-two-three-not-it.”

    Time and testing will tell if LDA is 1, 2 or 3 not it.

    The positive by-product is maybe people will pay more attention to IR and to writing some kick-butt content!

    • says

      I did get the impression that the “hype” was more the tweets. Your own article, which I made sure I linked to first, did mention the random factors; I even researched the random factors within hours of your posting and was all prepared to write a calming article.

      It wouldn’t have been based upon as much detail as is currently available but it (along with any other commentary such as a much earlier post by Dave) might have had a negative effect on the last day of sales at the $79 price point.

      Rand asking people to hold off, but running that promotion certainly tainted things for me.

      On Twitter I expected “social bites” and positive insight to appear, as it always does, but the person who seemed to be leading it was @seomom – nothing wrong with Jillian doing that, because any new technology in the space is something to be encouraged. I wouldn’t even fault her for interspersing mentions of the $79 deal going away. It wouldn’t be the first product marketed with a new feature that is untested but showing promise – but there shouldn’t then be requests asking for restraint in offering opinions.

      You could well get correlation even if Google use something totally different, and adding related terms in a natural manner can meet the needs of many different techniques to determine relevance.

      • says

        Agreed, Andy. I had not considered (or paid attention to) the marketing aspect, the timing or how that would be perceived. I had seen encouragement (a couple of weeks ago) to some folks on Twitter who were interested in the new Web App/Campaign Manager to consider joining sooner vs. later due to a planned price increase. The Pro membership has additional offerings from that rollout, and the LDA tool was not part of it. I didn’t see this LDA tool as being pivotal. However, perception can be everything.

        Does LDA help explain the SERPs to some extent? To quote you,

        “…adding related terms in a natural manner can meet the needs of many different techniques to determine relevance.”

        My next thought is how do other on-page factors such as page segmentation fit into this? I’ll ask that over at SEOmoz once the conversation dies down. :-) With over 200 factors, no wonder people get excited to think one of them has a significant correlation. (Pavlov’s Bell)

        SEOmoz’s LDA tool provides an opportunity to see if the LDA viewpoint is valid or not. Hopefully, in a bigger picture, it will lead to people focusing on better contextual content. Heck, I sure would like to see people write content that stays on-topic and not meander. What’s good for me as a reader is certainly good for Google. It all comes back to content as being the real secret.