What do LDA, LSI and a few other geeky acronyms have to do with Search Rank Predictability?

The folks over at SEOmoz knew they were on to something…

There is a growing body of evidence in SEO to support the idea that “topic modeling”, Latent Semantic Indexing (LSI) and its hip new cousin, Latent Dirichlet Allocation (LDA), represent the missing link in explaining why some pages consistently rank higher than others when all other things (the number and quality of backlinks, for example) are equal.

In preparation for the release of my latest software product, ClickBump SEO!, I’ve been intently studying software applications of LSI in general and, more specifically, the science of topic and vector space modeling.

So it was with great interest that I approached a post on SEOmoz about an exciting relative of LSI called Latent Dirichlet Allocation. LDA, a really geeky-sounding term if there ever was one, can legitimately be described as the single hottest development in search engine optimization that you’ve never heard of.

Yet…

The folks at SEOmoz are about to change that, and in a major way.

They have built a tool that scores search relevance by context, using a topic modeling algorithm called Latent Dirichlet Allocation, or LDA for short (thank goodness for abbreviations!). What’s absolutely groundbreaking, though, is that their research has shown a strong and unmistakable correlation between LDA scores and search rankings:


(the vertical blue bars indicate standard error in the diagram, which is relatively low thanks to the large sample set)

Using the same process we did for our release of Google vs. Bing correlation/ranking data at SMX Advanced (we posted much more detail on the process here), we’ve shown the Spearman correlations for a set of metrics familiar to most SEOs against some of the LDA results, including:

  • TF*IDF – the classic term weighting formula, TF*IDF measures keyword usage in a more accurate way than a more primitive metric like keyword density. In this case, we just took the TF*IDF score of the page content that appeared in Google’s rankings.
  • Followed IPs – this is our highest correlated single link-based metric, and shows the number of unique IP addresses hosting a website that contains a followed link to the URL. As we’ve shown in the past, with metrics like Page Authority (which uses machine learning to build more complex ranking models) we can do even better, but it’s valuable in this context to just think and compare raw link numbers.
  • LDA Cosine – this is the score produced from the new LDA labs tool. It measures the cosine similarity of topics between a given page or content block and the topics produced by the query.
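
To make the first and last of those metrics a little more concrete, here is a minimal Python sketch of my own (it is not SEOmoz’s code). It assumes a smoothed IDF formula, a three-document toy corpus and made-up topic weights. A real LDA model would learn the topic weights from a large corpus, so treat every number below as purely illustrative.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF*IDF of `term` in one document, relative to a small corpus."""
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    # Smoothed IDF (an assumption): rarer terms get a larger weight,
    # and a term found in no document cannot cause a division by zero.
    idf = math.log((1 + len(corpus)) / (1 + docs_with_term)) + 1
    return tf * idf

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy corpus, tokenized by whitespace (illustrative only).
corpus = [
    "seo topic modeling improves search relevance".split(),
    "link building for seo campaigns".split(),
    "chocolate cake recipe with frosting".split(),
]

# Hypothetical topic weights; a real LDA model would infer these.
query_topics = [0.7, 0.2, 0.1]
page_on_topic = [0.6, 0.3, 0.1]
page_off_topic = [0.05, 0.15, 0.8]

print(tf_idf("topic", corpus[0], corpus))
print(cosine_similarity(query_topics, page_on_topic))
print(cosine_similarity(query_topics, page_off_topic))
```

In this toy data, “topic” scores higher than “seo” for the first document because it appears in fewer documents overall, and the on-topic page sits much closer to the query than the off-topic one.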

The correlation of the LDA scores with rankings is uncanny. Certainly, it’s not a perfect correlation, but that shouldn’t be expected given the supposed complexity of Google’s ranking algorithm and the many factors therein. But seeing LDA scores show this dramatic result made us seriously question whether there was causation at work here (and we hope to do additional research via our ranking models to attempt to show that impact). Perhaps good links are more likely to point to pages that are more “relevant” via a topic model, or some other aspect of Google’s algorithm that we don’t yet understand naturally biases towards these.
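
For readers who haven’t met a Spearman correlation before, here is a minimal Python sketch of how one could be computed for a single set of results. This is my own illustration, not SEOmoz’s process, and the positions and scores are made up. Spearman simply ranks both lists and then takes the ordinary (Pearson) correlation of those ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: result positions and a relevance score per page.
positions = [1, 2, 3, 4, 5]
scores = [0.91, 0.85, 0.88, 0.60, 0.55]

# Negative here because position 1 is the best result, so a higher
# score goes with a lower position number.
print(spearman(positions, scores))
```

A value near -1 or +1 means the two orderings agree closely (in one direction or the other), and a value near 0 means they barely agree at all.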

However, given that many SEO best practices (e.g. keywords in title tags and static URLs) have dramatically lower correlations and the same difficulties proving causation, we suspect a lot of SEO professionals will be deeply interested in trying this approach.

I was particularly struck by this point:

“LDA Cosine – this is the score produced from the new LDA labs tool. It measures the cosine similarity of topics between a given page or content block and the topics produced by the query.”

It was especially gratifying to see these results, because this is the same model I’ve implemented in the topic matching algorithm for ClickBump SEO!, and it confirms what I’ve found in my own experience with topic modeling and with predicting content authority and rank from the top results on Google for a given search query.

Without boring you too much with the details of what the ClickBump SEO! keyword matching algorithm is doing at a mathematical level, its primary function is to help maximize content rank by assembling, in real time, the best collection of topic-modeled, LSI-matching keyphrases from the top-ranked pages for a given keyphrase. Essentially, it is the secret sauce: the collection of terms found most often across the top 50 ranked pages for a given query, weighted by result rank and frequency of occurrence.
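
To give a rough feel for the idea of weighting by rank and frequency, here is a simplified Python sketch. It is not the actual ClickBump SEO! algorithm. The function name `weighted_terms`, the tiny stopword list and the 1/position weighting are all assumptions made purely for illustration.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real one).
STOPWORDS = {"the", "a", "and", "of", "for", "to", "in", "is", "with"}

def weighted_terms(ranked_pages, top_n=5):
    """Return the top_n terms across pages ordered best-ranked first.

    Each occurrence of a term is weighted by 1/position, so a word on
    the #1 page counts more than the same word on the #10 page.
    """
    scores = Counter()
    for position, text in enumerate(ranked_pages, start=1):
        rank_weight = 1.0 / position  # assumed weighting scheme
        counts = Counter(
            word for word in text.lower().split() if word not in STOPWORDS
        )
        for term, count in counts.items():
            scores[term] += rank_weight * count
    return [term for term, _ in scores.most_common(top_n)]

# Hypothetical page texts standing in for the top results of a query.
pages = [
    "topic modeling for seo",
    "seo topic tips",
    "cake recipe",
]
print(weighted_terms(pages, top_n=3))
```

Terms that show up on several high-ranking pages float to the top, while one-off words on lower-ranked pages fall away.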

To see this new LSI matching algorithm in action, let’s take a look at ClickBump SEO! for WordPress.

2 Responses to “What do LDA, LSI and a few other geeky acronyms have to do with Search Rank Predictability?”

  1. Mark says:

    Hi, this is totally one of your most awesome blog posts, my friend. Please make sure that you keep up this great work!

  2. Patrick says:

    Scott,

    Just had to say I’m stunned! I hadn’t realised about the LSI element you’d added to your SEO plugin. The SEO plugin was already fairly impressive, it easily points out where one’s page is lacking in SEO factors, saves a lot of time.

    The LSI is lightning fast, still can’t believe it. Virtually the second I clicked the ‘get LSI’ button almost 30 words/phrases popped up in the box. That would take me ages of going through relevant articles to ascertain and even then it wouldn’t be as good.

    I’ve just checked an existing site of mine and there are quite a few words that I haven’t used, that I can now see as being relevant and worth integrating. This is a fabulous plugin :-0 … can you tell I’m happy!!!

    Congratulations!
