In 2011 Google rolled out an algorithm called “Panda” which was designed to identify low-quality or “thin content” sites. Sites that got hit by it lost roughly 80% of their traffic, so it was pretty devastating. At the time, lots of people analyzed the update, and the common denominator was that many of the affected sites were machine-generated (i.e. generated from a database of information, usually with formula-driven content added to beef up the site). Formula-driven content can take the form of constructing sentences programmatically, or simply of running text through “article spinning” tools. An article spinner essentially analyzes a document and turns it into a formula, complete with synonyms. You can “spin” content at the paragraph, sentence, or word level, and once you’ve “spun” a document (i.e. created the formula), you can run it thousands of times and get different articles that say essentially the same thing in different ways.
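Just to make the spinning idea concrete, here’s a tiny sketch of how a spintax template works. This isn’t any particular commercial tool — the template text and the spin() helper are made up for illustration — and real spinners also work at the sentence and paragraph level, but the mechanics are the same:

```python
import random
import re

# A "spun" article is stored as a template ("spintax") where each {a|b|c}
# group lists interchangeable synonyms. Re-running the template produces
# articles that say the same thing in slightly different words.
SPINTAX = ("{Quick|Fast|Rapid} weight loss is {possible|achievable} if you "
           "{follow|stick to} a {sensible|healthy} {diet|eating plan}.")

def spin(template: str) -> str:
    """Replace every {a|b|c} group with one randomly chosen option."""
    pattern = re.compile(r"\{([^{}]+)\}")
    while pattern.search(template):
        template = pattern.sub(
            lambda m: random.choice(m.group(1).split("|")), template)
    return template

if __name__ == "__main__":
    for _ in range(3):
        print(spin(SPINTAX))
```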
Since 2011, Google has folded its Panda detection capability into an ongoing process rather than something that gets run periodically. People in the SEO industry have speculated about Panda ever since, and have offered (IMO) all kinds of cockamamie, hand-waving explanations that sound interesting but aren’t very specific.
Back in 2011, I spent countless days analyzing patents and reading up on various algorithms. I was, and still am, convinced that Google has a very, very simple process for identifying these types of sites.
Before I explain what that process is, I suggest you first go read this old post of mine on ENTROPY. The entropy of a document is a measurement of how much information is in it. The simple example I give in the post is that a document consisting of “all work and no play makes jack a dull boy” typed 30 times does not have much information in it. In fact, the whole document can be described as “all work and no play makes jack a dull boy repeat 30 times” – just 13 words.
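If you want to see that intuition in numbers, here’s a minimal sketch that scores a document by the Shannon entropy of its word distribution. It ignores word order, so it’s only a rough first approximation of document entropy (the papers linked in that post do this properly), but it’s enough to show that repeating the sentence 30 times adds nothing; the example strings are mine:

```python
import math
from collections import Counter

def word_entropy(text: str) -> float:
    """Shannon entropy of the word distribution in `text`, in bits per word."""
    words = text.lower().split()
    n = len(words)
    counts = Counter(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

single = "all work and no play makes jack a dull boy"
dull = (single + " ") * 30
varied = ("entropy measures how unpredictable a message is so text that keeps "
          "introducing new words and ideas carries more information than text "
          "that just repeats the same sentence over and over again")

# Repeating the sentence 30 times doesn't raise the per-word entropy at all,
# while ordinary varied prose scores noticeably higher.
print(f"single sentence:     {word_entropy(single):.2f} bits/word")
print(f"same sentence x 30:  {word_entropy(dull):.2f} bits/word")
print(f"varied prose:        {word_entropy(varied):.2f} bits/word")
```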
So, if you can measure the entropy of a document, why not measure the entropy of a website? But, how can one easily do so?
Well, if you look at my 13-word description and think about it, what is that really? I’ll tell you what: it’s a COMPRESSED version of the original document. Intuitively, the smaller a document compresses, the less real information it must have had in it. We’ve all used WinZip many times, and sometimes you notice that a photo with gradients everywhere and lots of color changes doesn’t compress much, while one with a plain white background compresses a lot. That’s because the one with all the colors and gradients is more INTERESTING… it has more actual information in it. With text, it’s much the same.
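You can see the same effect with text using any off-the-shelf compressor. This little sketch uses Python’s zlib (standing in for WinZip) to compare how well a repetitive document compresses against ordinary prose; the example strings are just mine:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size / original size. Lower = more redundant = less information."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

dull = "all work and no play makes jack a dull boy " * 30
prose = ("A document that keeps introducing new facts, new phrasing, and new "
         "vocabulary gives the compressor very little repetition to exploit, "
         "so it stays close to its original size. A document that says the "
         "same thing over and over collapses to almost nothing.")

print(f"repeated sentence x 30: {compression_ratio(dull):.2f}")
print(f"ordinary prose:         {compression_ratio(prose):.2f}")
```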
So, what I believe Google is doing is very simple. They are (well, conceptually anyway) compressing your entire website, then comparing the compressed size to the uncompressed size. If that ratio (compressed ÷ uncompressed) is too low, the site doesn’t add a whole lot of value.
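To be clear, this is my guess at the concept, not Google’s actual pipeline. But if you wanted to approximate it yourself, a rough sketch might look like the following. I’m assuming you’ve already extracted the visible text of each page, and the two example “sites” are made up:

```python
import zlib

def site_compression_ratio(pages: list[str]) -> float:
    """Concatenate every page's text, compress the blob, and return
    compressed size / uncompressed size. A very low ratio means the pages
    mostly repeat the same material -- i.e. thin, templated content."""
    blob = "\n".join(pages).encode("utf-8")
    return len(zlib.compress(blob, level=9)) / len(blob)

# Hypothetical templated site: the same page with one word swapped per city.
templated_site = [
    f"Best plumbers in {city}. Call today for the best plumbers in {city}."
    for city in ("Austin", "Boston", "Chicago", "Denver", "El Paso")
]
# Hypothetical site whose pages actually cover different topics.
varied_site = [
    "How to sweat a copper joint without scorching the surrounding framing.",
    "Choosing between PEX and copper for a cold-climate remodel.",
    "What a sewer scope inspection tells you before you buy an older house.",
    "Troubleshooting a water heater that keeps tripping its high-limit switch.",
    "Why water hammer arrestors lose their charge and how to fix it.",
]

print(f"templated site: {site_compression_ratio(templated_site):.2f}")
print(f"varied site:    {site_compression_ratio(varied_site):.2f}")
```

Run that and you should find the templated site’s ratio comes out noticeably lower than the varied site’s — which is exactly the “doesn’t add a whole lot of value” signal.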
Can I prove this? No. If you’re an intelligent person and you think about this for just a few minutes, can you deduce that it’s probably correct? Yes. Sometimes in SEO that’s the best we can do, Google’s not going to make it easy by announcing everything, sorry!
If you want to dig into this further, check out the links to papers inside the old blog post I linked above; then search for terms like “document entropy” and “Kolmogorov complexity”.
Concepts like document entropy and Kolmogorov complexity aren’t just applicable to this one problem; they are, IMO, very likely useful for helping detect AI-generated content as well, in various ways. So they’re well worth reading up on and thinking about at this point.