In this article, I’d like to share a bit of what I’ve learned in preparation for a book I’m writing on building high-ranking websites that wow users.
Wired.com: What’s the code name of this update? Danny Sullivan of Search Engine Land has been calling it “Farmer” because its apparent target is content farms.
Amit Singhal: Well, we named it internally after an engineer, and his name is Panda. So internally we called it “big Panda.” He was one of the key guys. He basically came up with the breakthrough a few months back that made it possible.
Wired.com: What was the purpose?
Singhal: So we did Caffeine [a major update that improved Google’s indexing process] in late 2009. Our index grew so quickly, and we were just crawling at a much faster speed. When that happened, we basically got a lot of good fresh content, and some not so good. The problem had shifted from random gibberish, which the spam team had nicely taken care of, into somewhat more like written prose. But the content was shallow.
Matt Cutts: It was like, “What’s the bare minimum that I can do that’s not spam?” It sort of fell between our respective groups. And then we decided, okay, we’ve got to come together and figure out how to address this.
Wired.com: How do you recognize a shallow-content site? Do you have to wind up defining low quality content?
Singhal: That’s a very, very hard problem that we haven’t solved, and it’s an ongoing evolution how to solve that problem. We wanted to keep it strictly scientific, so we used our standard evaluation system that we’ve developed, where we basically sent out documents to outside testers. Then we asked the raters questions like: “Would you be comfortable giving this site your credit card? Would you be comfortable giving medicine prescribed by this site to your kids?”
Cutts: There was an engineer who came up with a rigorous set of questions, everything from “Do you consider this site to be authoritative? Would it be okay if this was in a magazine? Does this site have excessive ads?” Questions along those lines.
Singhal: And based on that, we basically formed some definition of what could be considered low quality. In addition, we launched the Chrome Site Blocker [allowing users to specify sites they wanted blocked from their search results] earlier, and we didn’t use that data in this change. However, we compared and it was 84 percent overlap [between sites blocked by Chrome users and sites downgraded by the update]. So that said that we were in the right direction.
Wired.com: But how do you implement that algorithmically?
Cutts: I think you look for signals that recreate that same intuition, that same experience that you have as an engineer and that users have. Whenever we look at the most blocked sites, it did match our intuition and experience, but the key is, you also have your experience of the sorts of sites that are going to be adding value for users versus not adding value for users. And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons …
Singhal: You can imagine in a hyperspace a bunch of points, some points are red, some points are green, and in others there’s some mixture. Your job is to find a plane which says that most things on this side of the plane are red, and most of the things on that side of the plane are the opposite of red.
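Singhal’s red-and-green analogy is simply a description of a linear classifier. As a toy sketch of the idea (in no way Google’s actual code), here is a simple perceptron that learns a separating line for two clusters of 2-D points; a real SVM would additionally pick the line with the largest margin:

```python
# Toy illustration of Singhal's analogy: find a line (a hyperplane in 2-D)
# separating "red" points from "green" points using a simple perceptron.
# This is a sketch of the general idea, not Google's actual algorithm.

def train_perceptron(points, labels, epochs=100, lr=0.1):
    """Learn weights w and bias b so that sign(w.x + b) matches the labels."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            # y is +1 ("green") or -1 ("red")
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def classify(w, b, point):
    """Which side of the learned hyperplane does this point fall on?"""
    return 1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1

# A "red" cluster near (1, 1) and a "green" cluster near (4, 4).
points = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4)]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_perceptron(points, labels)
print(classify(w, b, (1.5, 1.5)))  # a point amid the "red" cluster
print(classify(w, b, (4.5, 4.5)))  # a point amid the "green" cluster
```

Because the two clusters are linearly separable, the perceptron is guaranteed to converge on some separating line; the SVM refinement, discussed below, is choosing the best such line.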
Now, as you may or may not know, on February 24, 2011, Google rolled out its most significant algorithm change since the engine first launched in the late ’90s. The update noticeably affected the rankings behind nearly 12 percent of U.S. search queries. A huge impact if your site was among those demoted.
This update has been code-named “Panda” after Navneet Panda, an Indian-born software engineer at Google. The important thing to note here is that Navneet is the person primarily responsible for the “breakthrough” that Singhal describes above; hence the name.
Through my research, and supported by Singhal’s last comment above, I’ve concluded that this breakthrough is based on SVM, or Support Vector Machine, computational analysis. So, in essence, Panda = SVM.
Want more evidence? Here’s a snippet from Navneet’s current resume on the UCSB website, under “Projects > Machine Learning”:
• Development of indexing structures for support vector machines to enable relevant instance search in high-dimensional datasets
• Speeding up SVM training in multi-category large-dataset scenarios
• Speeding up approximate SVM classification of data streams
• Design of a real time web page classifier for text and image data
So, you can see many references to SVM and machine learning concepts. It’s all pretty wonky stuff, but the basic concepts are outlined by Matt and Amit above in layman’s terms:
Google has basically created a dividing line, or “hyperplane,” through a representation of the URLs for any given search term: on one side are the quality sites, and on the other are the bad ones.
The way it does this is extremely clever. It all starts with Google’s vast army of human “quality raters”. These are contract workers, normal people like you and me …
Aside: a ClickBump owner has indicated to me that he’s just been accepted into the Google rating program. I had already obtained a copy of the latest Google quality rater’s guide (March 2011), and I’m looking forward to getting some actual rater feedback to share with you.
These “raters”, as Cutts discusses above, are contract workers tasked with evaluating a list of website URLs from Google’s index for a given search query. They are then asked to rate each URL in terms of overall user experience, quality, and how well the page answers the target search query: all the things that are very hard for a machine or algorithm to discern, unless you can teach it to behave and respond like a rater.
Contrary to popular belief, the raters do not directly impact the rankings of the individual sites they rate. Rather, Google uses their evaluations to build a model of the common traits (aka signals, or vectors) of websites that real people view favorably. It also builds a model of the common traits of websites found not useful at best, or “spam” at worst.
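To make that concrete, here is a hypothetical sketch of how rater verdicts might become training data: each rated URL is reduced to a feature vector plus a good/bad label, which is exactly the input a classifier such as an SVM needs. The feature names below are my invention, not confirmed Google signals:

```python
# Hypothetical sketch: turning rater evaluations into labeled training data.
# The feature names here are illustrative guesses, not confirmed Google signals.

def extract_features(page):
    """Reduce a rated page to a numeric feature vector (a point in 'hyperspace')."""
    return [
        page["word_count"] / 1000.0,        # content depth
        page["ad_blocks"],                  # excessive ads?
        1.0 if page["has_logo"] else 0.0,   # trust/branding cue
    ]

# Each entry: page attributes plus the human rater's verdict.
rated_pages = [
    {"word_count": 2400, "ad_blocks": 1, "has_logo": True,  "rater_verdict": "good"},
    {"word_count": 150,  "ad_blocks": 9, "has_logo": False, "rater_verdict": "bad"},
]

# Build the (features, label) pairs a classifier would be trained on.
training_set = [
    (extract_features(p), 1 if p["rater_verdict"] == "good" else -1)
    for p in rated_pages
]
print(training_set)
```

The crucial point is that the human judgment lives only in the label; everything else is machine-measurable, so once trained, the model no longer needs the humans.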
As best I can tell at this point, this “model” is based on non-visual indicators (apart from the human raters’ visual perceptions). However, Navneet Panda has done work on using SVMs for visual image analysis, as evidenced by the line in his online resume at UCSB: “Design of a real time web page classifier for text and image data”. Since we know that facial recognition software exists and has advanced to a high degree of accuracy, one could expect the same technology to be used for rapid comparative analysis of a target URL’s screenshot against a known set of “good” visual indicator cues.
What this means is that it’s conceivable and entirely possible that Google’s algorithm uses a screenshot of the page as a factor in its evaluation.
Google then uses all of this data to “train” its algorithm to instantly classify URLs as good or bad for any given query. In other words, if the URL exhibits an abundance of the signals common in those websites that fall on the “good” side of the hyperplane and very few of the traits of the “bad” sites, then it can make an extremely accurate prediction of the quality and user satisfaction of the URL. And most important of all, due to the massive size of the index, it can do it without ever visiting any website it passes judgment upon.
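That “without ever visiting any website” property follows from how linear classifiers work: once the weights are learned, judging a URL is just a dot product over feature values already stored in the index. A minimal sketch, with the weights and feature values invented for illustration:

```python
# Once a linear model is trained, classifying a URL needs only its stored
# feature vector and a dot product -- no live visit to the site.
# The weights and feature values here are invented for illustration.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Learned weights: positive weights reward a signal, negative ones penalize it.
weights = [0.8, -0.5, 0.6]   # [content_depth, ad_density, trust_cues]
bias = -0.2

def predict_quality(features):
    """Return 'good' if the page falls on the positive side of the hyperplane."""
    return "good" if dot(weights, features) + bias > 0 else "bad"

print(predict_quality([2.0, 0.5, 1.0]))  # deep content, few ads
print(predict_quality([0.2, 3.0, 0.0]))  # thin content, heavy ads
```

Evaluating one URL costs a handful of multiplications, which is what makes applying the model across an index of billions of pages feasible.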
That last sentence is very important. As you might imagine, given the size and scope of indexing all of the pages on the web, an algorithm that needs only a glimpse of that index in order to return the top matches for any given search would certainly be considered the holy grail of search. That’s the breakthrough Amit is talking about in the first and last paragraphs of the Wired interview excerpt above, and it can best be represented graphically like so:
Illustration of hyperplane model separating good results from bad: In this case, the solid and empty dots can be correctly classified by any number of linear classifiers. H1 (blue) classifies them correctly, as does H2 (red). H2 could be considered “better” in the sense that it is also furthest from both groups. H3 (green) fails to correctly classify the dots.
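The sense in which H2 is “better” is precisely what an SVM formalizes: among all separating hyperplanes, choose the one with the largest margin, i.e. the greatest distance to the nearest point of either class. The distance from a point x to the hyperplane w·x + b = 0 is |w·x + b| / ||w||, which a few lines of code can compute. The candidate lines and points below are my own examples, not taken from the figure:

```python
import math

# Distance from a point to the hyperplane w.x + b = 0 is |w.x + b| / ||w||.
# An SVM selects the separating hyperplane whose smallest such distance
# (the margin) is largest. The numbers here are illustrative.

def distance_to_hyperplane(w, b, x):
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / math.hypot(*w)

def margin(w, b, points):
    """Smallest distance from any training point to the hyperplane."""
    return min(distance_to_hyperplane(w, b, p) for p in points)

points = [(1, 1), (2, 1), (4, 4), (4, 5)]

# Two candidate separating lines for the same data:
h1 = ([1.0, 1.0], -4.5)   # hugs one cluster, like H1 in the figure
h2 = ([1.0, 1.0], -5.5)   # roughly midway between the clusters, like H2

print(margin(*h1, points))
print(margin(*h2, points))  # larger margin: the SVM's preferred choice
```

A larger margin means small measurement noise in a page’s signals is less likely to flip it across the line, which is why the maximum-margin hyperplane generalizes better to pages the model has never seen.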
I’m particularly intrigued by Navneet’s last bullet point above, “Design of a real time web page classifier for text and image data”. I’ll talk more about this later, but it would seem that this statement is much more powerful than its position in the sequence would suggest.
Cutts: If someone has a specific question about, for example, why a site dropped, I think it’s fair and justifiable and defensible to tell them why that site dropped. But for example, our most recent algorithm does contain signals that can be gamed. If that one were 100 percent transparent, the bad guys would know how to optimize their way back into the rankings.
Singhal: There is absolutely no algorithm out there which, when published, would not be gamed.
So, once the site owner has done everything they can to produce content that is highly useful and relevant to the subject matter, the task of the SEO is to determine the most favorable “signals” or “vectors” that place the site as far away from the bad side of the index as possible. One might speculate that, given a large enough selection of top-ranked websites, you could identify the most common traits (aka signals or vectors) that they share, and, in the same manner, the traits common to the demoted sites.
In either case, I believe there will certainly be some common predictors of sites on both sides of the index and that given a set of URLs for a given search query, we should be able to more accurately predict potential search rank based on these factors.
I believe those predictors are:
1. Design and user experience: Choose site design templates that are clean, uncluttered and provide easy to understand navigation and categorization of content. A site logo is a trust signal. A header graphic, while not as strong as a logo, is also a signal of quality.
2. Content quality: It’s no longer enough just to provide unique content. Your content needs to “wow” users into wanting to share your page with others. It should be rich and add value with photos, videos, illustrations, graphs, comparisons and well-researched data and conclusions. You should strive to tell both sides of the story, the good and the bad, so as to remain objective. All of these traits are found in the pages that people like and respond to best. They provide a strong signal of quality, and these are the types of pages that Google wants to push to the top.
3. Encourage Google +1’s, Facebook likes and Twitter tweets: A given URL’s social profile should be easily indexable and a strong indicator of quality and relevance. As such, you should pay special attention to improving your page’s social profile and actively engaging with your target audience through social media.
4. Track user behavior and bounce rates closely: Google engineers have published research suggesting that bounce rate is a very accurate predictor of quality. As such, you should expect your page’s bounce rate to be of critical importance to your ranking. You want low bounce rates and long user sessions. This is a strong indicator of quality and very easy for an algorithm to evaluate. You can bet that it’s a central part of the Panda formula.
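If the Panda classifier really is a linear model over signals like the four above, a site audit could score pages the same way. The sketch below is pure speculation on my part; the signal names and weights are placeholders, not known Panda factors:

```python
# Speculative sketch: scoring a page against the four predictors above.
# Signal names and weights are placeholders, not known Panda factors.

SIGNAL_WEIGHTS = {
    "clean_design":  0.25,  # 1. design and user experience
    "content_depth": 0.35,  # 2. content quality (the "wow" factor)
    "social_shares": 0.20,  # 3. +1s, likes, tweets (normalized 0-1)
    "engagement":    0.20,  # 4. low bounce rate / long sessions (0-1)
}

def quality_score(signals):
    """Weighted sum of normalized signals; higher suggests the 'good' side."""
    return sum(SIGNAL_WEIGHTS[name] * value for name, value in signals.items())

page = {
    "clean_design": 0.9,
    "content_depth": 0.8,
    "social_shares": 0.4,
    "engagement": 0.7,
}
print(round(quality_score(page), 3))  # -> 0.725
```

Even as a back-of-the-envelope audit tool, a score like this makes the trade-offs visible: in this (made-up) weighting, thin content costs more than a weak social profile.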
At this point in the discussion, we are only scratching the surface of the inner workings of Google Panda. And as I continue to dive into SVM and Navneet’s research, I’ll be presenting more information on what we can learn from his work in order to build better websites.
The core takeaway at this point, and fodder ripe for discussion, is that, based on the core work product of Navneet Panda along with Amit Singhal’s comments to Wired.com above, we can be reasonably certain that the algorithm determining whether your website lands on the good side or the bad side of the index is grounded in SVM theory.
If you want to do some reading on your own, here are a few of Navneet’s papers that delve deeper into the topic and give us a glimpse of what makes Google Panda purr: