SVM – The Secret Sauce inside Google’s Panda

In this article, I’d like to share a bit of what I’ve learned in preparation for a book I’m writing on building high-ranking websites that Wow users.

To set things up, let’s take a look at a very enlightening conversation that Wired magazine had with two of the engineers responsible for search quality at Google, Matt Cutts and Amit Singhal.

Wired.com: What’s the code name of this update? Danny Sullivan of Search Engine Land has been calling it “Farmer” because its apparent target is content farms.

Amit Singhal: Well, we named it internally after an engineer, and his name is Panda. So internally we called it big Panda. He was one of the key guys. He basically came up with the breakthrough a few months back that made it possible.

Wired.com: What was the purpose?

Singhal: So we did Caffeine [a major update that improved Google’s indexing process] in late 2009.  Our index grew so quickly, and we were just crawling at a much faster speed. When that happened, we basically got a lot of good fresh content, and some not so good. The problem had shifted from random gibberish, which the spam team had nicely taken care of, into somewhat more like written prose. But the content was shallow.

Matt Cutts: It was like, “What’s the bare minimum that I can do that’s not spam?”  It sort of fell between our respective groups. And then we decided, okay, we’ve got to come together and figure out how to address this.

Wired.com: How do you recognize a shallow-content site? Do you have to wind up defining low quality content?

Singhal: That’s a very, very hard problem that we haven’t solved, and it’s an ongoing evolution how to solve that problem. We wanted to keep it strictly scientific, so we used our standard evaluation system that we’ve developed, where we basically sent out documents to outside testers. Then we asked the raters questions like: “Would you be comfortable giving this site your credit card? Would you be comfortable giving medicine prescribed by this site to your kids?”

Cutts: There was an engineer who came up with a rigorous set of questions, everything from, “Do you consider this site to be authoritative? Would it be okay if this was in a magazine? Does this site have excessive ads?” Questions along those lines.

Singhal: And based on that, we basically formed some definition of what could be considered low quality. In addition, we launched the Chrome Site Blocker [allowing users to specify sites they wanted blocked from their search results] earlier, and we didn’t use that data in this change. However, we compared and it was 84 percent overlap [between sites blocked by the Chrome blocker and downgraded by the update]. So that said that we were in the right direction.

Wired.com: But how do you implement that algorithmically?

Cutts: I think you look for signals that recreate that same intuition, that same experience that you have as an engineer and that users have. Whenever we look at the most blocked sites, it did match our intuition and experience, but the key is, you also have your experience of the sorts of sites that are going to be adding value for users versus not adding value for users. And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons …

Singhal: You can imagine in a hyperspace a bunch of points, some points are red, some points are green, and in others there’s some mixture. Your job is to find a plane which says that most things on this side of the plane are red, and most of the things on that side of the plane are the opposite of red.

Now, as you may or may not know, on February 24, 2011, Google rolled out its most significant algorithm change since it first launched in the late ’90s. The update noticeably affected roughly 12% of US search queries. A huge impact if your site was among those hit.

This update has been code-named “Panda” after Navneet Panda, an Indian-born software engineer at Google. The important thing to note here is that Navneet is the person primarily responsible for the “breakthrough” that Singhal describes above, hence the name.

Through my research, and as supported by Singhal’s last comment above, I’ve concluded that this breakthrough is based on SVM, or Support Vector Machine, classification. So, in essence, Panda = SVM.

Want more evidence? Here’s a snippet from Navneet’s current resume at the UCSB website under “Projects > Machine Learning” (highlights added for emphasis):

• Development of indexing structures for support vector machines to enable relevant instance search in high-dimensional datasets
• Speeding up SVM training in multi-category large dataset scenarios
• Speeding up approximate SVM classification of data-streams
• Design of a real time web page classifier for text and image data

So, you can see many references to SVM and machine learning concepts, all pretty wonky stuff, but the basic concepts are outlined by Matt and Amit above in layman’s terms:

Google has basically created a dividing line, or “hyperplane,” representation of URLs for any given search term, in which on one side are the quality sites and on the other are the bad ones.
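That dividing-line idea is easy to sketch in code. The snippet below is my own toy example in Python with scikit-learn; the two signals are invented and have nothing to do with Google’s real feature set. It simply shows a linear SVM learning a hyperplane that separates a handful of “good” pages from “bad” ones:

```python
# Toy sketch of the hyperplane idea; the two signals are invented for
# illustration and are NOT Google's actual features.
import numpy as np
from sklearn.svm import SVC

# Each row is one URL: [content_depth_score, ad_density]
X = np.array([
    [0.9, 0.1],   # deep content, few ads
    [0.8, 0.2],
    [0.7, 0.1],
    [0.2, 0.8],   # thin content, ad-heavy
    [0.3, 0.7],
    [0.1, 0.9],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = "good" side, 0 = "bad" side

clf = SVC(kernel="linear")
clf.fit(X, y)

# The learned hyperplane is w . x + b = 0
print("w =", clf.coef_[0], "b =", clf.intercept_[0])

# A new, unseen page is scored by which side of the plane it falls on
print(clf.predict([[0.6, 0.3]]))  # falls on the "good" side of this toy plane
```

Everything else in this article is really about where those example points come from and what the axes actually measure.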

The way it does this is extremely clever. It all starts with Google’s vast army of human “quality raters”. These are contracted workers, normal people like you and me …

Aside: a ClickBump owner has indicated to me that he’s just been accepted into the Google rating program. I had already obtained a copy of the latest Google quality raters’ guide (March 2011), and I’m looking forward to getting some actual rater feedback to share with you.

These “raters”, as Cutts discusses above, are contract workers tasked with evaluating a list of website URLs from Google’s index for a given search query. They are asked to rate each URL in terms of overall user experience, quality, and how well the page answers the target query: all things that are very hard for a machine or algorithm to discern, unless you can teach it to behave and respond like a rater.

Contrary to popular belief, the raters do not directly impact the rankings of the individual sites they are rating. Rather, Google uses their evaluations to create a model of the common traits (aka signals or vectors) of websites that real people view favorably. It also creates a model of the common traits of websites that people find not useful at best, or “spam” at worst.
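Here is a rough sketch of how rater judgments could be turned into training data. The field names and verdict labels are entirely my own guesses, not anything Google has published; the point is only the shape of the data:

```python
# Hypothetical sketch: rater verdicts become labels, measurable page
# signals become features. The raters never touch rankings directly.
rated_pages = [
    # (signals measured for the URL, verdict from a human rater)
    ({"unique_content": 0.92, "ad_ratio": 0.05, "has_logo": 1}, "useful"),
    ({"unique_content": 0.15, "ad_ratio": 0.60, "has_logo": 0}, "spam"),
    ({"unique_content": 0.40, "ad_ratio": 0.35, "has_logo": 1}, "not useful"),
]

feature_names = ["unique_content", "ad_ratio", "has_logo"]
X = [[signals[name] for name in feature_names] for signals, verdict in rated_pages]
y = [1 if verdict == "useful" else 0 for signals, verdict in rated_pages]
# X and y are now in exactly the shape a classifier like the one above expects.
```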

As best I can tell at this point, this “model” is based on non-visual indicators (apart from the human raters’ visual perceptions). However, Navneet Panda has done work on using SVM for visual image analysis, as evidenced by his online resume at UCSB (“Design of a real time web page classifier for text and image data”). Since facial recognition software exists and has advanced to a high degree of accuracy, one could expect the same technology to be used for rapid comparative analysis of a target URL screenshot against a known set of “good” visual indicator cues.

What this means is that it’s entirely conceivable that Google’s algorithm uses a screenshot of the page as a factor in its evaluation.

Google then uses all of this data to “train” its algorithm to instantly classify URLs as good or bad for any given query. In other words, if the URL exhibits an abundance of the signals common in those websites that fall on the “good” side of the hyperplane and very few of the traits of the “bad” sites, then it can make an extremely accurate prediction of the quality and user satisfaction of the URL. And most important of all, due to the massive size of the index, it can do it without ever visiting any website it passes judgment upon.
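A minimal sketch of that train-once, judge-anything idea follows (signals invented again; the point to notice is that classifying a URL only needs its stored signal vector, not a fresh visit to the page):

```python
# Sketch only: signals are invented and precomputed at crawl/index time,
# so scoring a URL later never requires re-fetching the page.
from sklearn.svm import LinearSVC

# [unique_content, ad_ratio, social_score] for URLs the raters labeled
X_train = [
    [0.95, 0.05, 0.80],   # labeled "useful"
    [0.90, 0.10, 0.60],
    [0.20, 0.70, 0.05],   # labeled "not useful" / "spam"
    [0.10, 0.55, 0.10],
]
y_train = [1, 1, 0, 0]

model = LinearSVC()
model.fit(X_train, y_train)

# An unrelated URL crawled months ago: only its stored vector is needed.
print(model.predict([[0.85, 0.15, 0.40]]))  # lands on the "good" side here
```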

That last point, classifying a URL without ever revisiting it, is very important. Given the size and scope of indexing all of the pages on the web, an algorithm that only needs a glimpse of that index to return the top matches for any given search would be the holy grail of search. That is the breakthrough Amit is talking about in the first and last paragraphs of the Wired interview excerpt above, and it can best be represented graphically like so:

[Illustration: a hyperplane model separating good results from bad. The solid and empty dots can be correctly classified by any number of linear classifiers. H1 (blue) classifies them correctly, as does H2 (red). H2 could be considered “better” in the sense that it is also furthest from both groups. H3 (green) fails to correctly classify the dots.]
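To make the “furthest from both groups” point concrete, here is a tiny self-contained example with toy numbers showing that a linear SVM recovers the widest-margin separator, the H2 of the illustration:

```python
# Toy demonstration of the maximum-margin property behind the figure above.
import numpy as np
from sklearn.svm import SVC

solid = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.2]])   # one class of dots
empty = np.array([[4.0, 4.0], [4.5, 3.5], [5.0, 4.2]])   # the other class
X = np.vstack([solid, empty])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear").fit(X, y)

w = svm.coef_[0]
print("separating hyperplane normal:", w)
print("margin width:", 2.0 / np.linalg.norm(w))    # maximal by construction
print("support vectors:\n", svm.support_vectors_)  # the dots that pin the margin
```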

I’m particularly intrigued by Navneet’s last bullet point above, “Design of a real time web page classifier for text and image data.” I’ll talk more about this later, but it would seem that this statement is much more powerful than its order in the sequence would suggest.

Back to the Wired interview:

Cutts: If someone has a specific question about, for example, why a site dropped, I think it’s fair and justifiable and defensible to tell them why that site dropped. But for example, our most recent algorithm does contain signals that can be gamed. If that one were 100 percent transparent, the bad guys would know how to optimize their way back into the rankings.

Singhal: There is absolutely no algorithm out there which, when published, would not be gamed.

So, once site owners have done everything they can to produce content that is highly useful and relevant to the subject matter, the task of the SEO is to determine the most favorable “signals” or “vectors” that place the site as far away from the bad side of the index as possible. One might speculate that, given a large enough selection of top-ranked websites, you could identify the most common traits (aka signals or vectors) that they share. In the same manner, you could profile the traits shared by sites that were downgraded.

In either case, I believe there will certainly be some common predictors of sites on both sides of the index and that given a set of URLs for a given search query, we should be able to more accurately predict potential search rank based on these factors.
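One speculative way to act on that: build your own feature vectors for pages that rank well and pages that dropped, fit a linear SVM of your own, and read its weights as a hint of which signals matter most. The features below are invented placeholders, and this is strictly an outsider’s experiment, not a reconstruction of Panda:

```python
# Speculative sketch: rank invented signals by their learned weight.
import numpy as np
from sklearn.svm import LinearSVC

feature_names = ["content_depth", "ad_density", "social_shares", "bounce_rate"]
X = np.array([
    [0.9, 0.1, 0.8, 0.2],   # pages that rank well
    [0.8, 0.2, 0.6, 0.3],
    [0.2, 0.7, 0.1, 0.8],   # pages that dropped
    [0.3, 0.6, 0.2, 0.7],
])
y = np.array([1, 1, 0, 0])

clf = LinearSVC().fit(X, y)

# Positive weights pull a page toward the "good" side, negative toward "bad".
for name, weight in sorted(zip(feature_names, clf.coef_[0]), key=lambda t: -abs(t[1])):
    print(f"{name:15s} {weight:+.3f}")
```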

I believe those predictors are the following (after the list, I’ll sketch how they might be combined into a single feature vector):

1. Design and user experience: Choose site design templates that are clean, uncluttered, and provide easy-to-understand navigation and categorization of content. A site logo is a trust signal. A header graphic, while not as strong as a logo, is also a signal of quality.

2. Content quality: It’s no longer enough just to provide unique content. Your content needs to “Wow” users into wanting to share your page with others. It should be rich and add value with photos, videos, illustrations, graphs, comparisons, and well-researched data and conclusions. You should strive to tell both sides of the story, the good and the bad, so as to remain objective. All of these traits are found in the pages that people like and respond to the best. They provide a strong signal of quality, and these are the types of pages that Google wants to push to the top.

3. Encourage Google +1s, Facebook likes and Twitter tweets: A social profile of a given URL should be easily indexable and a strong indicator of quality and relevance. As such, you should pay special attention to improving your page’s social profile and actively engaging with your target audience through social media.

4. Track user behavior and bounce rates closely: Google engineers have released several studies showing that bounce rate is a very accurate predictor of quality, so you should expect your page’s bounce rate to be of critical importance to your ranking. You want low bounce rates and long user sessions. This is a strong indicator of quality and very easy to evaluate algorithmically, so you can bet it’s a central part of the Panda formula.
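For what it’s worth, here is one way those four predictors might be rolled up into a single feature vector per page. Every field name and scale is my own guess, meant only to show the shape of the thing:

```python
# Hypothetical encoding of the four predictors; names and scales are guesses.
page_signals = {
    "clean_design":    1,      # 1: logo present, uncluttered layout, clear nav
    "content_quality": 0.85,   # 2: depth, media richness, objectivity (0-1)
    "social_score":    0.40,   # 3: +1s, likes, tweets, normalized to 0-1
    "bounce_rate":     0.35,   # 4: share of visits bouncing back to search
    "avg_session_sec": 190,    # 4: longer sessions suggest satisfied visitors
}

# The classifier only ever sees an ordered numeric vector:
feature_vector = [
    page_signals["clean_design"],
    page_signals["content_quality"],
    page_signals["social_score"],
    1.0 - page_signals["bounce_rate"],                # invert: higher = better
    min(page_signals["avg_session_sec"] / 300, 1.0),  # cap and normalize
]
print(feature_vector)
```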

At this point in the discussion, we are only scratching the surface of the inner workings of Google Panda. As I continue to dive into SVM and Navneet’s research, I’ll be presenting more information on what we can learn from his work in order to build better websites.

The core takeaway at this point, and fodder ripe for discussion, is that based on the core work product of Navneet Panda, along with Amit Singhal’s comments to Wired.com above, we can be reasonably certain that the algorithm deciding whether your website lands on the good side or the bad side of the index is rooted in SVM theory.

If you want to do some reading on your own, here are a few of Navneet’s papers that delve deeper into the topic and give us a glimpse of what makes Google Panda purr:

Efficient Top-k Hyperplane Query Processing for Multimedia Information Retrieval
Navneet Panda and Edward Y. Chang
ACM International Conference on Multimedia (MM), October 2006

Concept Boundary Detection for Speeding up SVMs
Navneet Panda, Edward Y. Chang and Gang Wu
International Conference on Machine Learning (ICML), 2006

KDX: An Indexer for Support Vector Machines
Navneet Panda and Edward Y. Chang
IEEE Transactions on Knowledge and Data Engineering, June 2006

Exploiting Geometry for Support Vector Machine Indexing
Navneet Panda and Edward Y. Chang
SIAM International Conference on Data Mining, 2005

Hypersphere Indexer
Navneet Panda, Edward Y. Chang and Arun Qamra
Database and Expert Systems Applications (DEXA), 2006

Active Learning in Very Large Databases
Navneet Panda, Kingshy Goh and Edward Y. Chang
Journal of Multimedia Tools and Applications, Special Issue on Computer Vision Meets Databases (invited submission)

Formulating Context-dependent Similarity
Gang Wu, Navneet Panda and Edward Y. Chang
ACM International Conference on Multimedia (MM), Singapore, November 2005

Formulating Distance Functions via the Kernel Trick
Gang Wu, Navneet Panda and Edward Y. Chang
ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2005

4 Responses to “SVM – The Secret Sauce inside Google’s Panda”

  1. Scott, I only came across this article of yours after someone linked to it on some Black Hat forum :).

    I am pretty sure you were totally on the money with most of this – I have also been banging on about this since the original Panda. Yet I can’t believe that most of the SEO community still don’t recognise these factors you are talking about! They are STILL scratching their heads wondering why their crappy monochrome Xfactor-template sites aren’t ranking.

    You know what, I have banged on about it enough, I will be keeping quiet from now on :)

  2. Tech says:

    Good post. Bounce rate is pointless though, at least until your site hits the first page. Also, in some niches such as health it is arbitrarily high, so no quality signal can really be determined (at least I think). For example, fitness.com had an above-80% bounce rate, the same as my fitness blog.

  3. Jeff says:

    Scott,

    Good read. Thanks. I love your products. You keep it simple. I imagine that is going to be the end result of your research into this. You’ll distill all the complexities into simple, effective process, products, themes.

    I am looking forward to following you on this.

    Thanks,

    Jeff

    • Scott says:

      @Jeff, thanks for your comments. I’ll definitely be trying to simplify what this means. There is definitely some highly complex calculus involved in all of this; however, we pretty much know the baseline “inputs” into the system are (I’ve sketched them as a single record after the list):

      - site meta (includes the title, URL, and meta description, and possibly an overall page theme, as the key indicators of page scope)
      - bounce rate (how many clicks result in a click back to search on the same search phrase? user dissatisfaction)
      - ads on page
      - semantic structure of the page (well formed? validates? heading tags? bold and emphasis used on which keywords?)
      - content relevancy (an index number, say 1-100, that reflects the overall analysis of the page’s raw content, without tags, based on proprietary formulas that ascertain the core theme or themes of the page’s content)
      - content diversity (an index number that represents the mixture of elements on the page, including images and video assets)
      - inbound links profile (an index number, say from 1-100, that reflects the quantity and quality of inbound links)
      - social profile (an index number, say from 1-100, that reflects the quantity and quality of inbound social links, Twitter tweets, Facebook likes, Google +1s, etc.)
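
      Purely as a sketch, with every name and scale being my own guess rather than anything Google has published, those inputs might boil down to one record per URL along these lines:

      ```python
      # Guessed field names and scales; just illustrating the shape of the inputs.
      url_inputs = {
          "site_meta_score":    72,    # title / URL / description / page theme match
          "bounce_rate":        0.38,  # share of clicks returning to the same search
          "ads_on_page":        3,
          "semantic_score":     88,    # validates, heading tags, keyword emphasis
          "content_relevancy":  81,    # 1-100, core theme analysis of raw content
          "content_diversity":  64,    # 1-100, mix of text, images and video
          "inbound_link_score": 55,    # 1-100, quantity and quality of inbound links
          "social_score":       47,    # 1-100, tweets, likes, +1s
      }
      ```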
