How big is the Web? One way to look at it is this: There are roughly 4,348 pages out there for each of the 6.9 billion people on the planet, or, to put it another way, 30 trillion pages. And there will be a lot more tomorrow. Given the immensity of the Web, it seems nothing short of magic that Google's search, imperfect as it is, indexes it as well as it does.
The Google search blog is often very interesting, and a recent post gives us some insight into how the engine works and how it identifies and kills spam.
As you may know, Google sends out "robots," little programs that crawl across the Web, following links from page to page, sorting pages by content and other factors, and adding information to an index. That index is immense, taking up over 100 million gigabytes. Even so, not every page on the Web is indexed. When the robots get to a site, they look for a file called robots.txt, which can tell the engine which pages to stay away from. If it's there, and contains instructions placed by an authorized Web master, the Google robot will skip the pages it names.
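To make the robots.txt check concrete, here is a minimal sketch in Python using the standard library's robots.txt parser. It is not Google's crawler code. The file contents, the "ExampleBot" name and the example.com addresses are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an imaginary site (not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an invented crawler name; the URLs are placeholders.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/notes.html")

print("index.html allowed:", allowed_public)
print("private/notes.html allowed:", allowed_private)
```

A well-behaved robot would run a check like this before requesting each page and skip anything the file rules out.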
When you type something in a search box, formulas called algorithms evaluate your query and pull relevant pages from the index. Exactly how those pages are ranked is a closely guarded secret, but Google does say that it uses over 200 factors to do so. Results are typically served up in one-eighth of a second.
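Google's ranking formulas are secret, so the following is only a toy sketch of the general idea of pulling pages from an index, written in Python. The three sample pages, the function names and the match-every-word rule are assumptions made for illustration. The sketch does no ranking at all.

```python
from collections import defaultdict

# Hypothetical mini-Web of three pages (invented sample text).
PAGES = {
    "page1": "google crawls the web with robots",
    "page2": "spam pages use hidden text",
    "page3": "robots follow links across the web",
}


def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages containing every word in the query, sorted by id."""
    words = query.lower().split()
    if not words:
        return []
    # Intersect the page sets for each query word; unknown words match nothing.
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


INDEX = build_index(PAGES)
print(search(INDEX, "robots web"))
print(search(INDEX, "hidden text"))
```

Because the index is built ahead of time, answering a query is a matter of looking up a few words rather than rereading every page, which is one reason results can come back so quickly.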
Humans, of course, do not enter the picture when an individual search is answered, but Google uses a corps of trained people to evaluate the accuracy of searches by testing. In a typical year, the company says, it will run over 40,000 evaluations.
Most spam removal is automatic, but some questionable pages are examined by hand. Google looks for quite a few factors that indicate spam. Hidden text and "keyword stuffing" are clues that a page is bogus, as is user-generated spam that appears on forum or guestbook pages or user profiles.
Last year, Google launched an update to its anti-spam algorithm, called Penguin, which decreases the rankings of sites that use what it calls Webspam tactics. When Google is going to take action against a site, it attempts to find and notify the owners and gives them a chance to fix the problem. The number of those notices varies quite a bit, but in one particularly busy month last year, more than 650,000 were sent out to Web sites.
As important as search results are to users, they can be life and death to a commercial Web site. They have a huge impact on how much traffic a site gets, and that, in turn, affects ad revenue. Anyone who runs a commercial site (including this one) spends a good deal of time trying to figure out ways to rank high in searches, or in the case of news sites (like this one) how to be included in results on Google News.