I am not an algorithm and patent geek to the extent of many of my friends & peers.
Dana last week wrote about the new SEOmoz experimental tools for LDA (Latent Dirichlet Allocation).
Rand has now followed up with a clearer description of what their tool is.
At that time I expressed some concern to friends over the randomness of the results, but let that slide. I was a little more concerned that the tool was announced as being available while, at the same time, my email box was receiving notifications that now was the last chance to upgrade to a SEOmoz Pro subscription.
Rand in the Sphinn conversation was asking for people not to review the LDA tool based on information currently available, and to wait until Tuesday when he would have a more detailed post available.
So on one hand we have price scarcity… a viable marketing strategy, but on the other a request/suggestion not to review the newfangled 3rd generation keyword density, latent semantic indexing (LSI), LDA (Latent Dirichlet Allocation) tool that was causing a lot of positive buzz.
I love the passion in the article, but I’d ask that we have until our public release on Tuesday to explain what it does, how it calculates, the models, math, etc. I won’t try to address criticism until then.
So that is point 1 – suppressing objective reviews based upon available information whilst using price scarcity sucks.
I did delay this post and wait, but only because I wanted to see the official claims before this next point.
So now we look at the tool itself and see if it has some kind of useful application. I mentioned random results.
I thought about using lots of screenshots of every test, but then I decided the exact test I performed doesn’t really matter.
55%, 49%, 59%, 52%, 56%, 57%, 55%, 57%, 67%, 57% = correlation?
Dana mentioned in her post the “random Monte Carlo algorithm”
Rand was a little more specific:-
Scores Change Slightly with Each Run
This is because, like a pollster interviewing 100 voters in a city to get a sense of the local electorate, we check a sample of the topics a content+query combo could fit with (checking every possibility would take an exceptionally long time). You can, therefore, expect the percentage output to flux 1-5% each time you check a page/content block against a query.
I would need to run a lot more tests, but on my “swingometer” I have a spread of 18 percentage points (49% to 67%) just from that small sample. If I could define a mid point of that small sample, it would be in the mid 50s, which might suggest I hadn’t seen the worst of it and that at least 20% variation is possible.
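For anyone who wants to check the arithmetic, here is a minimal sketch that summarises the ten scores listed above. It only uses the numbers from my own test runs and Python's standard library; it says nothing about how the SEOmoz tool works internally.

```python
import statistics

# The ten scores quoted above, in percent.
scores = [55, 49, 59, 52, 56, 57, 55, 57, 67, 57]

low, high = min(scores), max(scores)
spread = high - low                  # distance between the worst and best run
mean = statistics.mean(scores)       # the "mid point" of the sample
stdev = statistics.stdev(scores)     # sample standard deviation

print(f"lowest {low}%, highest {high}%, spread {spread} points")
print(f"mean {mean:.1f}%, standard deviation {stdev:.1f} points")
```

The spread is what you see if you happen to catch the two extreme runs, while the standard deviation gives a feel for how far a typical single run sits from the average.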
Rand seems to be claiming some kind of correlation in his blog post title.
Latent Dirichlet Allocation (LDA) and Google’s Rankings are Remarkably Well Correlated
To claim correlation he would have had to have run each page through his LDA tool maybe 100 times to get some kind of reasonable average.
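As a rough illustration of why the number of runs matters, the sketch below estimates how much an averaged score could still wobble after a given number of runs. It rests on assumptions of mine, not anything SEOmoz has published: that each run is independent, and that the ten scores from my own test are representative of the tool's noise. The plus or minus two standard errors figure is just the usual rule of thumb, and `standard_error` is a helper name I made up for this example.

```python
import math
import statistics

# Ten scores from one page/query test, in percent (my own small sample).
scores = [55, 49, 59, 52, 56, 57, 55, 57, 67, 57]
stdev = statistics.stdev(scores)

def standard_error(stdev, runs):
    """Standard error of an average over `runs` independent runs."""
    return stdev / math.sqrt(runs)

for runs in (1, 10, 50, 100):
    se = standard_error(stdev, runs)
    # Roughly 95% of averages fall within about two standard errors.
    print(f"{runs:>3} runs: average good to roughly +/- {2 * se:.1f} points")
```

The noise in the average only shrinks with the square root of the number of runs, so going from 10 runs to 100 makes the average about three times steadier, not ten times.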
So point 2 is that the results currently being portrayed as some kind of correlation could well be bullshit based on the extremely unreliable results from the LDA tool.
The final point is a little about data.
SEOmoz have a history, possibly a little unfounded, of making huge claims about things and then being picked apart. An interesting example was the claims about the source of their data for Linkscape. Sebastian covered it, and so did Michael (actually quite a bit).
Linkscape data is used by services like Open Site Explorer and many 3rd parties via their API. It is useful… the point is that data has to come from somewhere.
“The Rising Tide Lifts All Boats” – as long as it lifts you higher than your competitors
In the case of using the SEOmoz LDA tool effectively, one option is to enter a website address (that is what I was testing) – then you might run 10, 50, maybe 100 views of that page to get reliable data.
Then you would do the same with 10 of your competitors.
All the time you are showing the keyword that a particular page should rank for, possibly one of your own, and also comparing it to the pages you feel it should compete against.
As far as business data is concerned, that is pretty useful. I know that SEOmoz will certainly claim that any data collected is only used for improvements in the tool.
In fact SEOmoz do include a statement on their main website regarding tools data.
What Tool Usage Information Does SEOmoz Collect?
SEOmoz offers a variety of online tools and software. These include, but are not limited to, our free SEO tools, our paid SEO tools, our API, and our tools on OpenSiteExplorer.org. These tools require you to enter a variety of information, such as URLs, domains, keywords, or other items relevant to Internet marketing and link research. We associate this information with your account in order to provide useful features, identify and terminate accounts that violate our Terms of Service, to improve our products, and to provide customer service. We never use this information for the provision of SEO consulting services so you do not need to worry that entering your information will be used against you or your clients by SEOmoz.
We take appropriate physical, electronic, and other security measures to help safeguard personal information from unauthorized access, alteration, or disclosure.
Some people however have tin foil hats. The ones with the biggest tin foil hats are not the blackhats, but those who work with major corporations who have signed contracts that prevent them using tools that might in any way share data with 3rd parties.
When I wrote this part I was actually thinking about whether someone like Steve Plunkett would use a 3rd party tool like this, and he conveniently tweeted.
Steve works in the corporate SEO world – data security is sacrosanct for his INC500 clients.
So the 3rd point is to be aware of who you are sharing data with, whether it is about your own sites or those of clients.
Ultimately a real LDA Google information retrieval geek (my mate Dave) is fairly positive that the conversation and experimentation is taking place.
That doesn’t mean Google are actually using LDA, or suggest any correlation with the current SERPs.
One thing I am sure of though: if you have paid for some kind of SEO miracle plugin/software/service for bloggers based upon keyword density, you really need to read and try to understand Rand’s post, as it highlights how much junk you have been fed by your Tribe.
LDA > LSI > Keyword Density but that doesn’t mean they are used by Google for search engine results.
You know how once you release technology people either copy it very quickly or release their own competing product.. “showing their cards” so to speak?
Well it looks like Matthew has an Advanced LDA tool with a very simple interface that seems to give very similar correlation data to what I discovered with the SEOmoz LDA tool.