Related: Mathematical analysis proves that Google's counts are bogus


Google is dying

Death by a billion cuts

by Daniel Brandt, October 15, 2004
Updated June 4, 2005

On sites with more than a few thousand pages, Google is not indexing anywhere from ten percent to seventy percent of the pages it knows about. These pages show up in Google's main index as a listing of the URL, which means that the Googlebot is aware of the page. But they do not show up as an indexed page. When the page is listed but not indexed, the only way to find it in a search is if your search terms hit on words in the URL itself. Even if they do hit, these listed pages rank so poorly compared to indexed pages that they are almost invisible. This is true even though the listed pages still retain their usual PageRank.

I have been complaining about this since April 2003, and it became more visible in 2004. There is no method to Google's madness, which is another way of saying that this phenomenon is not characteristic of any particular type of site. It is happening across the entire landscape of large sites. I found it on www.johnkerry.com, on searchenginewatch.com, and dozens of other large sites I checked. Our own site, www.namebase.org, is a clean example of this, and I will use it to show how to do searches that expose this phenomenon.

You have to know what to look for and how to look for it. First of all, a listing consists of the URL in place of the title on Google's search results pages, in blue, and below this in a smaller font there appears a "Similar pages" link in blue. That's all. An indexed page has a real title, almost always has a snippet in black, shows the URL and the size of the page in green, and then has "Cached" and "Similar pages" links in blue. (On NameBase we disallow Google's cache copy, so the "Cached" link is legitimately missing on all of our pages.) These two types of links are very different and immediately obvious. However, you should set your Google preferences to 100 links per page, because the listed links are buried much deeper in the results.
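The visual test described above can be written down as a small rule. The following Python sketch is only an illustration: the `SearchResult` structure and its field names are hypothetical, and it assumes someone has transcribed each entry from a results page by hand. It is not a scraper and says nothing about how Google stores its results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchResult:
    """One hand-transcribed entry from a results page (hypothetical structure)."""
    title: str               # text shown in blue at the top of the entry
    url: str                 # URL the entry points to
    snippet: Optional[str]   # black excerpt text, or None if absent
    size: Optional[str]      # page size shown in green, or None if absent


def strip_scheme(url: str) -> str:
    """Drop a leading http:// or https:// so titles and URLs compare cleanly."""
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


def is_listing(result: SearchResult) -> bool:
    """True if the entry looks like a URL-only listing rather than an indexed page."""
    title_is_url = strip_scheme(result.title.strip()) == strip_scheme(result.url.strip())
    return title_is_url and result.snippet is None and result.size is None
```

An entry whose title is just its own URL, with no snippet and no size, is treated as a listing; anything with a real title is treated as indexed. The "Cached" link is deliberately not part of the rule, since a site can suppress it on indexed pages.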

Before I explain how to isolate the listed links from the indexed links, there are two cases I know of where a listing is normal for Google. These are exceptions to the phenomenon that interests me in this essay. Neither is relevant to NameBase, but I have to mention them in case you want to examine other sites. The first exception is when a site has certain directories disallowed in their robots.txt file. Google will habitually list the URLs in the disallowed directory but not index them. (This itself is an invasion of privacy, because filenames can be very revealing — but that's a rant for another day.)
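If you are examining another site, you can rule out this first exception by checking a URL against that site's robots.txt rules. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.org URLs are made up for illustration, and the `is_disallowed` helper is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that closes one directory to all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_disallowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """True if the given robots.txt text forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

A URL-only listing for a page where `is_disallowed` returns True is expected behavior, not a symptom of the problem discussed here.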

The second exception is when there are ID numbers at the end of the URL, particularly if these numbers follow a question mark in the URL. Google avoids any URL that looks like it might be a problem. Sometimes this number is a session ID number from a shopping cart site. If Google followed these links, the crawler might end up grabbing thousands of duplicate pages, distinguished only by the session ID.
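To set aside this second exception when looking at a list of URLs, a rough filter is enough. The sketch below is a guess at what "looks like it might be a problem" could mean, not Google's actual rule: the parameter names in `SESSION_PARAM_NAMES` and the five-digit threshold are assumptions chosen for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Parameter names that commonly carry session identifiers (an assumed list).
SESSION_PARAM_NAMES = {"sid", "sessionid", "session_id", "phpsessid", "jsessionid"}

# A run of five or more digits is treated as an ID number (assumed threshold).
ID_NUMBER = re.compile(r"\d{5,}")


def looks_like_id_url(url: str) -> bool:
    """True if the part after the question mark appears to carry an ID number."""
    query = urlsplit(url).query
    for name, value in parse_qsl(query, keep_blank_values=True):
        if name.lower() in SESSION_PARAM_NAMES:
            return True
        # A bare "?12345" parses as a name with an empty value, so check both.
        if ID_NUMBER.fullmatch(value) or ID_NUMBER.fullmatch(name):
            return True
    return False
```

URLs flagged by this filter can be excluded before counting how many of a site's pages are merely listed.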

Now that you know what I'm not talking about, here is how you can investigate a site. First you have to find a word on the site that is present on nearly every page of the site. On some of the sites we looked at, the word "reserved" from the copyright notice (as in "All rights reserved") worked fairly well. On NameBase, we have "book reviews" at the bottom. The "site:" command is used in conjunction with "book reviews":

        site:www.namebase.org "book reviews"

That search asks for all pages from www.namebase.org that include the phrase "book reviews." These will be indexed pages. If the page was merely listed, Google wouldn't be aware that this phrase is at the bottom of the page. Next you can request all pages that do not contain this phrase, by inserting a minus sign in front of the phrase:

        site:www.namebase.org -"book reviews"

The numbers that Google reports are strange. As of June 4, 2005 they are 203,000 and 279,000 for the two searches above. Considering that NameBase has never had more than 132,000 pages on its site, and our file-naming conventions have been stable now for several years, this means Google cannot count. If this isn't sufficiently bizarre, try just site:www.namebase.org, which shows a count of 2,330,000.

(On September 29, 2005, the above searches were checked again, and the numbers had jumped to 499,000, 980,000, and 6,570,000. At the same time, the referrals from Google for NameBase have been decreasing. Even assuming that Google includes every page from NameBase, this still leaves their count 50 times higher than it should be. If we assume that their stock price is similarly bloated, that means it's really worth $6.10 per share. Seems about right.)
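The arithmetic behind these complaints is easy to check. Every page either contains the phrase or does not, so the first two counts should add up to the plain site: count, and none of them should exceed the real page total. The Python sketch below uses only the figures quoted in the text; the function and variable names are illustrative.

```python
# Upper bound on the number of pages NameBase has ever had, per the text.
ACTUAL_PAGES = 132_000

# Counts quoted in the text, keyed by the date they were checked.
REPORTED = {
    "2005-06-04": {"with_phrase": 203_000, "without_phrase": 279_000, "site_only": 2_330_000},
    "2005-09-29": {"with_phrase": 499_000, "without_phrase": 980_000, "site_only": 6_570_000},
}


def inflation(count: int, actual: int = ACTUAL_PAGES) -> float:
    """How many times larger a reported count is than the real page total."""
    return count / actual


def summarize(counts: dict) -> dict:
    """Compare the two partial counts with the plain site: count."""
    partition_sum = counts["with_phrase"] + counts["without_phrase"]
    return {
        "partition_sum": partition_sum,
        "partition_matches_site": partition_sum == counts["site_only"],
        "site_inflation": inflation(counts["site_only"]),
    }


if __name__ == "__main__":
    for date, counts in REPORTED.items():
        s = summarize(counts)
        print(f"{date}: with + without = {s['partition_sum']:,}, "
              f"site: alone is {s['site_inflation']:.1f}x the real total")
```

On both dates the two partial counts fall far short of the site: count, and the September site: count comes out at roughly 50 times the real total, which is the figure used above.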

In November 2004, Google suddenly increased their count overnight from 4,285,199,774 pages to "searching 8,058,044,651 web pages." Every pundit, blogger, and journalist on earth took Google at face value, and thought all those PhDs have to be really smart to double their index that fast. There must have been howls of laughter over this at the Googleplex. Every number reported by Google over 1,000 — the maximum number that can be verified — is completely worthless.

In the case of NameBase, the URL-only listings became a problem that I first noticed in April 2003. That was the month when Google underwent a massive upheaval, which I describe in my "Google is broken" essay. When that essay was written two months after the upheaval, it would have been speculative to claim that the listed URL phenomenon was a symptom of the 4-byte docID problem described in the essay. It was too soon. But by now the URL-only listings are beginning to look very widespread and very suspicious. It's a major fault in Google's index, it is getting worse, and it is much more than a mere temporary glitch.
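For readers who have not seen that essay, the arithmetic behind a 4-byte ceiling is short. The sketch below assumes the docID is an unsigned 32-bit integer, which is an assumption made here for illustration; Google has not confirmed how its docIDs are stored. The page counts are the two figures quoted earlier in this essay.

```python
# Number of distinct values an unsigned 4-byte (32-bit) identifier can hold.
# Treating the docID as unsigned is an assumption for this illustration.
MAX_DOC_IDS = 2 ** 32

# Page counts quoted earlier in this essay.
COUNT_BEFORE = 4_285_199_774
COUNT_AFTER = 8_058_044_651

# How many unused identifiers would have remained at the earlier count.
headroom = MAX_DOC_IDS - COUNT_BEFORE

# Fraction of the 4-byte ID space the earlier count would have occupied.
fill_ratio = COUNT_BEFORE / MAX_DOC_IDS

if __name__ == "__main__":
    print(f"4-byte ceiling:  {MAX_DOC_IDS:,}")
    print(f"headroom left:   {headroom:,}")
    print(f"ID space in use: {fill_ratio:.1%}")
    print(f"later count fits in 4 bytes: {COUNT_AFTER <= MAX_DOC_IDS}")
```

Under that assumption the earlier count sits just under the ceiling, and the later count cannot fit in a single 4-byte ID space at all. That is consistent with the hypothesis, though it does not prove it.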

Another curiosity emerged in August 2003, two months after my "Google is broken" essay. Google started showing supplemental results from an entirely separate index. If you run out of regular results you will often see the label "Supplemental Result" in green on the last page of available links. At that time Google briefly stated on their site that they "augment results for difficult queries by searching a supplemental collection of web pages." A representative from Google had little to add to this, but did concede that it is an entirely separate index, and then threw out a few words of spin. It sounded like a cover story. I believe that this new index was started due to a capacity problem in the main index and the need to develop new software.

In November 2003, there was a major update that threw thousands of ecommerce sites out of the index. At first everyone thought that this was either an attempt to purge spam from the index, or an attempt to force webmasters to buy AdWords to stay in the game. It was called the filter fiasco. There was an outcry and the spam problem was worse, not better, so Google eased up on this filter a few weeks later. Then by mid-2004 it was clear that new nonprofit, educational, and government sites, as well as ecommerce sites, were all embargoed in what was known as the "sandbox," where they languished for months without getting ranked by Google. Now the entire pattern began to look less like an anti-spam effort and more like a broken index.

Google is dying. It broke two years ago and hasn't been fixed. It looks to me as if pages that have been noted by the crawler cannot be indexed until some other indexed page gives up its docID number. Now that Google is a public company, stockholders and analysts should require that Google give a full accounting of their indexing problems, and what they are doing to fix the situation. The SEC should get involved too, because this continuing decline in the quality of Google's main index is a significant risk factor that should have been mentioned in the prospectus.

The graphs below are based on page views at NameBase, our main site. Images and automated crawlers are excluded from these numbers. We started collecting traffic data for NameBase in May, 2003. The first graph shows daily totals, the second shows weekly totals, and the third shows monthly totals. The last day shown is always yesterday, and the last week shown is always the week that ended yesterday. The average that defines the 100 percent line consists of all of the data shown on each graph, and is specified in the upper right corner.


[Graphs not reproduced here: NameBase page views, shown as daily, weekly, and monthly totals]

Because NameBase is a large site with broad appeal, the traffic tends to be steady and predictable. We know this because NameBase has been on the Internet now for ten years. When the pattern changes, apart from the usual weekend dips, we start looking at what's happening with our referrals from search engines. Note the skinny blue line on the bottom. This is the number of referrals from Google (excluding Yahoo, AOL, Earthlink, and Netscape). Our site is "sticky," which means that anyone who lands on one of our pages from a search engine is likely to click around some more. Nevertheless, it is clear who is in the driver's seat when it comes to overall traffic trends: the little blue line on the bottom is directly driving the big line on top. The more Google loves you, the more the world loves you. Google rules. For the past few years they have not so much reflected popularity as created and perpetuated it through their near-monopoly.

Our biggest problem is that with 132,000 pages on our site, Google doesn't take the time to get a complete crawl of our data. Or if they do, they don't put it all into their index. The red line at the bottom is the number of referrals from Yahoo plus Microsoft. We began tracking Yahoo and Microsoft in April 2004, shortly after Yahoo dropped Google and began their own engine. It took six months before Yahoo, which also fed MSN until Microsoft switched to their new engine, had most of our pages indexed. In October 2004, the combination of Yahoo and Microsoft was better than Google for NameBase referrals, primarily because Google was merely listing most of our pages instead of indexing them.

In January 2005, Microsoft dropped Yahoo and launched their new engine. So far they have indexed only the top layer of pages at NameBase. Their referrals are negligible, and it's not even worth the effort yet to split off MSN from Yahoo on our graphs. Yahoo by itself has decreased substantially too, even when considered independently of MSN. Google has improved its coverage of NameBase since October, and is well ahead of the pack once again. It's still nowhere near where it was in early 2004.

In 2004, webmasters speculated that Google was working on a redesigned index that would show up any day now. By 2005, most conceded that either Google is deliberately rotating their algorithms to keep webmasters confused, or their engineers have made things so complex that the effects of algorithm changes are by now fundamentally unpredictable. It's just as likely that Google has lost interest in their main index altogether, and has decided that they don't need to trouble themselves with large sites. They're all millionaires by now, so who cares? Increasingly, it looks like Google's results are a melding of two or three indexes, due to some sort of capacity problem. The doubling of the count in November may simply have been a decision to add all the indexes together, without bothering to consider the overlaps. Google knew by then that their arrogance would win applause, just as it has for years.
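The effect of adding indexes together without considering overlaps can be shown with a toy example. This Python sketch only illustrates the arithmetic of the speculation above; the index contents are made up, and nothing here describes how Google actually computes its count.

```python
def naive_total(indexes):
    """Add up the sizes of all indexes, counting shared pages more than once."""
    return sum(len(index) for index in indexes)


def distinct_total(indexes):
    """Count each page once, no matter how many indexes contain it."""
    return len(set().union(*indexes))


if __name__ == "__main__":
    # Two made-up indexes that share one page.
    main_index = {"a.html", "b.html", "c.html"}
    supplemental_index = {"c.html", "d.html"}
    indexes = [main_index, supplemental_index]
    print("naive sum:     ", naive_total(indexes))
    print("distinct pages:", distinct_total(indexes))
```

The more the indexes overlap, the further the naive sum drifts above the number of distinct pages.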

It has been sixteen months now since Yahoo launched their own engine. Almost everyone agrees that Yahoo's engine is as good as Google's. On Scroogle we have a "Yahoogle" feature that compares the top 100 results from Yahoo with the top 100 from Google for a single search. After over a year of doing this and recording the numbers, the difference between the two engines has averaged 75 percent over that time. In other words, you will find that only 25 percent of the top 100 results link to the same pages on both Yahoo and Google. You normally have to click ten pages deep to see 100 results, yet most searchers rarely venture beyond page two.
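A comparison of this kind comes down to counting shared URLs. The following Python sketch shows one way to compute such a difference figure. It assumes two results link to the same page only when their URL strings match exactly, which may not be how the Yahoogle feature itself compares them, and the function name is hypothetical.

```python
def difference_percent(results_a, results_b):
    """Percentage of results that do not appear in both lists.

    Each argument is a list of result URLs from one engine. Two results
    count as the same page only if the URL strings are identical
    (an assumption made for this sketch).
    """
    urls_a, urls_b = set(results_a), set(results_b)
    size = max(len(urls_a), len(urls_b))
    if size == 0:
        return 0.0
    shared = len(urls_a & urls_b)
    return 100.0 * (size - shared) / size


if __name__ == "__main__":
    # Made-up top-100 lists that share 25 URLs.
    engine_a = [f"http://www.example.org/page{i}.html" for i in range(100)]
    engine_b = [f"http://www.example.org/page{i}.html" for i in range(75, 175)]
    print(difference_percent(engine_a, engine_b))
```

With 25 shared URLs out of 100, the function reports a 75 percent difference, matching the average described above.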

This means that the appearance of relevance is easy to achieve. Any engine, using any algorithm, can fool most of the people most of the time into thinking that it is delivering the best results. Once you have more detailed knowledge of how various engines perform, you know that this is not true. Most search engine algorithms are incredibly crude compared to human judgment, and Google's are currently worse than they've ever been. Casual analysts and media pundits don't know this. The result is that Google now finds itself sitting on top of a new Wall Street bubble.

Once upon a time Google's results made sense. Now they frequently display nothing more than Google's need to maintain a posture of relevance, while patching things together behind the scenes with ad hoc indexes. These extra indexes have one purpose, which is to keep the Google hype flying as high as their stock price.


Google Watch