Is Google broken?
It's behaving as if it has a Y2K+3 problemby Daniel Brandt
June 9, 2003
For the last 30 days, Google has been doing strange things. No webmaster who follows Google closely will deny this. There is no explanation from Google apart from some vague hints from "GoogleGuy," an anonymous poster at webmasterworld.com, whom the forum owner says is from Google. These hints claim that new algorithms are being put into place, and that this will take a couple of months.Another possible explanation, entirely speculative, is quite intriguing and fits the facts better than GoogleGuy's vague assertions of an algorithm change. This explanation was obliquely denied by GoogleGuy, at which point the pro-Google forum owner, Brett Tabke, moved the thread off of the active list. The next day this thread was locked against further postings. Since GoogleGuy has been known to make strong denials on other matters, this amounts to a nondenial.*
We'll call this intriguing explanation the "Y2K+3 problem." It's all the more intriguing because Mr. Tabke, an able programmer, made a misleading posting to the effect that adding another byte to the webpage ID number to fix an integer overflow problem was "child's play," and called the thread "bogus." The behavior of GoogleGuy and his crony Mr. Tabke look like a cover-up. Hopefully this Google Watch summary page will encourage more discussion of what's going on at Google, even if webmasterworld.com won't.**
First, an explanation of what's different at Google over the last 30 days. This sort of behavior is entirely unprecedented in the 2.5 years we've been following Google:
1. There were some strange results observed when the previous update kicked in on April 11, but webmasters became universally concerned in May, 2003. It took about two weeks for the May update to propagate to the nine data centers. Normally it takes around five days.
2. The May data showed half the number of backlinks that webmasters were accustomed to seeing on their sites.
3. The May data was from a previous update in February or March, and not from the usual deep crawl that occurred in mid-April. It was as if the April deep crawl had been thrown out. GoogleGuy essentially confirmed that this was the case. Pages that had been created since February or March were, as often as not, missing.
4. Slowly, the freshbot began to add more recent pages once the May data settled. But as is typical with freshbot data, these pages persisted in the index for two or three days, and then dropped out of the index. Then they'd pop up again a few days later. This yo-yo effect got the attention of a lot of webmasters who weren't used to this sort of instability. Freshbot normally handles fresh pages this way, but all of a sudden, the definition of "fresh" covered about a three-month span. Usually the freshbot has a dramatic effect only on pages that are new since the last deep crawl.
5. The PageRank of these "fresh" pages is indeterminate. It takes a deep crawl plus several days of calculation to compute the PageRank for the entire web. Without the April deep crawl data, the PageRanks that appeared were from February or March. Any "fresh" pages since then showed no PageRank at all. Normally the toolbar would approximate the PageRank for fresh pages, based on the PageRank of the home page, but it stopped doing this also. All it showed was an all-white or gray bar. This also attracted a lot of attention from webmasters. The more crawling the freshbot did, particularly on home pages for dynamic sites such as blogs, the more webmasters noticed that their PageRank was no longer showing on the toolbar. It is still unclear whether the real PageRank (as opposed to what the toolbar shows) is also indeterminate. The only way to judge this is by the rankings for selected keywords. Some webmasters reported massive fluctuations in rankings, suggesting that the PageRank component of the ranking algorithm was indeed broken. This was in addition to the fact that the pages would drop in and out of the index.
6. The deep bot has not appeared since the end of April. It is unprecedented for Google's deep bot to be AWOL for six weeks running.
The Y2K+3 theory
Let's speculate. Most of Google's core software was written in 1998-2000. It was written in C and C++ to run under Linux. As of July 2000, Google was claiming one billion web pages indexed. By November 2002, they were claiming 3 billion. At this rate of increase, they would now be at 3.5 billion, even though the count hasn't changed on their home page since November. If you search for the word "and" you get a count of 3.6 billion. It's unclear what role other languages would have, if any, in producing this count. Perhaps each language has it's own lexicon and it's own web page IDs. But any way you cut it, we're approaching 4 billion very soon, at least for English. With some numbers presumably set aside for the freshbot, it would appear that they are running out of available web page IDs. [Update: As of early 2004, Google claims to search 4,285,199,774 pages in their main index.]If you use an ID number to identify each new page on the web, there is a problem once you get to 4,294,967,296. Numbers higher than that require more processing power and different coding. Our speculation makes three major assumptions: a) Google uses standard functions for the C language in their core programming; b) when Google's programs were first developed four or more years ago, a unique ID was required for every web page; and c) it seemed reasonable and efficient at that time to use an unsigned long integer in ANSI C. In Linux, this variable is four bytes long, and has a maximum of 4.2 billion before it rolls over to zero. The next step up in numeric variables under Linux requires different standard functions in ANSI C, and more CPU cycles for processing. When the core programs were developed for Google several years ago, it's reasonable to assume that the 4.2 billion upper limit was not seen as a potential problem.
It's also possible that the freshbot, which began work in August 2001, was assigned its own block of unused ID numbers from the pool of 4.2 billion. This would allow for easier integration of the freshbot data with the deep crawl data, but would also suggest a certain amount of inefficiency in the use of available numbers. Indeed, freshbot behaves as if it can only hold a certain number of total pages, at which point it drops the oldest page in its round-robin queue and overwrites it with a newer page from a different site. This peculiar habit could be another symptom of the limited pool of ID numbers available.
Any way you cut it, if you assume that every unique web page needs a number from the 4.2 billion pool before it can be included in Google's index, then either Google has problems right now, or it is anticipating problems in the near future and is beginning to overhaul parts of their system. This could explain the recent behavior of Google.
Despite Brett Tabke's opinion, it is not child's play to add a byte and make it a five-byte integer instead of a four-byte integer. Using an extra byte basically means that you use an unsigned character as a multiplier for 4.2 billion, and add that onto the integer to get your final result. If the integer is A, and the multiplier is B, the new integer is A + ( B * 4.2 billion ). If B is zero, then you get A. If B is one, you get values from 4.2 to 8.4. The point is that you need an extra multiplier variable, and extra lines of code for conversion. (You could use an eight-byte number in Linux, but this might be less efficient computationally and obviously uses more bytes than you need.) There are no standard functions in programming languages that handle five-byte integers.
For something as basic as a web page ID number, this implies a massive overhaul of Google's software. Meanwhile, you have 15,000 Linux boxes to upgrade. Every Linux box at Google reportedly runs a standard set of software, so that it can be interchanged with other boxes once the data is duplicated. It is safe to assume that the web page ID number is ubiquitous on all of these boxes. The upgrade has to be done in phases, otherwise all of Google goes off-line for a time. Performance degradation is tolerable, given the circumstances, but going off-line would mean an instant loss of market share. Adjustments in data capacity for the various indexes have to be made also, due to the extra byte per web page ID.
As we stated at the beginning of this page, all of this is speculative. Since Google won't tell us what's going on, we're forced to encourage wild speculation in the hope that Google will see it as in their interests to be more forthcoming about what they're up to.
Google believes that it's none of our business. But they're wrong. It is our business. We didn't ask Google to become the world's information broker; it just happened that way.
_________________
* GoogleGuy's denial has become more interesting since this was written. On June 18, a friend of this site received an internal email on the WebmasterWorld forum from GoogleGuy, explaining the apparent contradiction between two of his posts. We quote it in full:"Hey, I should have been more clear in one of my posts. When I said 'it could be that we just ran out of space or time,' I was referring to space in our repository. That is, we wind down the crawl after fetching 2B+ URLs, and the URL in question might not have been in that set of documents. Regarding address space, we're not in danger of running out of docIDs for our documents. The story about the engineer claiming that he almost fell out of his chair laughing at the suggestion that we're running out of docID/address space is true. You can choose not to believe me, but that's your call. :)"If true, this amounts to a new and bigger scandal. It is the first we've heard about the crawl winding down after fetching two billion plus URLs. We already know that the crawl is ordered by PageRank, so this means that little guys with big databases stand a good chance of finding half of their pages not indexed in Google. That has been our experience with NameBase. If you believe GoogleGuy, it suggests that Google is too lazy to even approach the 4.2 billion ceiling, because that would mean they'd need some new software. We prefer the Y2K+3 theory, which is more generous toward Google in that it merely assumes a temporary technical problem.
** On June 11 Mr. Tabke reopened the thread, after changing the post quoted above. Here's his post that we quoted and here is the current air-brushed version. As he reopened the thread for new postings and linked to this page, Mr. Tabke accused this site, Google Watch, of being a conspiracy by some competitor of Google's, with a mission to spread conspiracy theories about Google in order to achieve market position. Now that is what we would call a conspiracy theory! This accusation isn't new. When Mr. Tabke locked out Everyman (Daniel Brandt) from the forum a year ago, the same theory was suggested in an angry email from him. For the record, no one at Public Information Research, Inc. has any connection with, or even knows anyone associated with, any web search engine whatsoever. Unfortunately, the same cannot be said of Mr. Tabke.
A search engine primer: What is an inverted index?
There is an amazing amount of denial from webmasters who aren't programmers, as well as some disinformation from those who know better. This is understandable. In the first instance, to say that Google needs to phase in an extra byte, and that this will be disruptive and inconvenient, is like saying that God didn't know about entropy and failed to plan for it. In the second instance, it's predictable that Google, Inc. wants to squelch rumors of a software overhaul. In the current competitive search engine environment, with Yahoo and Microsoft positioned to move forward, the news that Google is making repairs is enough to forestall any plans for an immediate IPO. The first reaction is based on programming ignorance and blind pro-Google cultism. The second is driven by big money.We cannot fight these extra forces at work, but we can explain what an inverted index is, and how it works. All full-text search engines work the same way, and all computers use ones and zeros. Anyone who claims that Google uses web page IDs that are plenty long has no idea of how an inverted index works. One poster looked at Google's URL for their cache copies, and concluded that the string of 12 alphanumeric characters, upper plus lower case, gave Google 62 to the 12th power for their web page ID, which leaves plenty of room for expansion.
This 12-byte string may contain a web page ID within it, but it may also contain locater information for the cache copy, as well as some flags. The web page ID variable we're talking about is called the docID in The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Page. Every new URL gets its own docID.
Google indexes in many languages, but let's pretend we have just one language, and in all of the 99 documents we want to index, there are only five different words we want to index. We set up an inverted index by arranging the words, and after each word, we point to all of the documents that contain that word:
A-word: 05 02 87 33
B-word: 41 33 64
C-word: 56
D-word: 21 54 87 96 82
E-word: 32 55The numbers above point to specific documents within our set of 99. Every word on every web page is indexed, and each of these words includes a list of pointers to every document that contains that word. For example, Google shows 6 million page hits for the word "britney." If they use four bytes for the docID, that means this single word requires 24 million bytes of memory or hard disk space to hold the pointers to those 6 million pages. That's for one word. Now multiply that by all the words Google indexes in each of several dozen languages.
You can see why the docID has to be as tiny as possible. Using four bytes, which means 32 bits consisting of either a one or a zero, you get a maximum of 4.2 billion possible numbers. It is so important to use as few bytes as possible for the docID, and the length of four bytes is so obvious for Linux in terms of efficiency, that we're certain that at some point, the length of the docID was set at four bytes. The only questions that remain are when did Google start implementing plans to expand to five bytes, and how disruptive will this conversion be if, as we presume, it hasn't already happened?
Here is all but the speculative last line of the anonymous post at WebmasterWorld. It is the first and only post by someone who called himself re5earcher :
_________________
12:04 pm on June 7, 2003_________________Google has reached its data indexing capacity of 4,294,967,296 (2^32) URLs. Now non-image URLs have an ID stored in 4 bytes, so Google is now running out of IDs for stored pages. When there will be no URLs returned "not found" and deleted from the index, total number of non-image files indexed will soon reach 4,294,967,296 including 3,083,324,652 html pages. After that Google will stop adding new URLs from indexed pages as well as new URLs added for indexing.
They are now considering reconstruction of the data tables which involves expanding ID fields to 5 bytes. This will result in additional 2 bytes per every word indexed throwing the total index size to be multiplied by 1.17. This procedure will require 1000 new page index servers and additional storage for temporary tables. They are hoping to make this change gradually server by server. The completion of the process will take up to one year after that the main URL index will be switched to use 5 bytes ID.
Why do we think that some of this information originated from an insider?For one thing, after reading up on Google's architecture, the numbers seem about right. Notice that the move from four to five bytes will result in two additional bytes per word indexed. This seems strange until one does a search for "fancy hits" plus "inverted index" (use the quotes in the search). Google uses two inverted indexes. The first is for anchor text, URLs, and titles, and the second, the "plain" one, is for all the words. They search the fancy index first. If they don't have enough hits by the end of this search, then they go to the plain index, which is much larger. Many single-word searches at Google can be satisfied by the fancy index alone.
The docID is also used in other places throughout the system. When you look at the total picture, it makes sense that on average, adding one byte to the docID will mean an increase of two bytes per word indexed. The increase of 1.17 in the "total index" suggests that right now, 11.76 bytes per word are used. (A move from 11.76 to 13.76 would amount to an increase of 1.17.) This might be counting the wordIDs, the docIDs, and at least two bytes of encoding alongside the docID for data such as word position on the page and font size. After determining the 10 to 100 best documents, the docID is looked up and translated into a cache copy location. The cache copy is decompressed and the snippet extracted so that the search terms can be displayed with a line or two of context. Most of this entire process is distributed to a number of machines operating in parallel, each contributing a little portion toward the final result.
Assuming that the two bytes of encoding are carried in the two inverted indexes right next to the 4-byte docID, one critic said that all they have to do is steal one bit and thereby double the 4.2 billion to 8.4 billion. This is true, but the software still has to be rewritten to read this bit, and in other places where the docID is used in the system, there may not be any bits to steal. It makes much more sense to add an entire byte (8 bits). You don't need them all for the docID, and the extra ones can be used to improve the encoding scheme. If you used three bits for the docID you would expand the total docIDs by a factor of eight. That would leave five bits to do something that Google sorely needs -- a breakdown of pages into commercial, nonprofit, blogging, and up to 29 other categories. Then they can offer tabs for the user to specify what sort of material they're trying to find.
It also makes sense that they will start working on expanding the field lengths in the various data tables first, and do the switchover at the end. Each of the 15,000 boxes has a standard set of software for ease of installation and expansion, and presumably a configuration file defining what portion of the huge system is currently assigned to that particular box. Perhaps now this configuration file will also tell the box whether it is using the old 4-byte docID, or the new software with the new 5-byte docID.
A year for the project does not sound unreasonable. It's a messy task to integrate the new docID into the system without going off-line. There are a few Google fans who insist that we're crazy, because the whole conversion project is the easiest thing in the world, and we obviously don't know as much as they do about programming because we think it's a big deal.
All we can say is that this anonymous posting made sense to us. The numbers hold up based on what we can discover about Google's system, and something has been screwy lately at Google. The theory fits the available evidence. Some of our critics need to find a new religion.
Update 2003-08-04: There is still a lot of disbelief generated by this page. Everyone seems to think that either Google's status as God means it is immune to mathematical limits, or if Google isn't God, then the issue of increasing the docID is so trivial that it's not worth considering. The most common reaction is that Google would use many more bytes than four as soon as they anticipated a problem with four bytes, and they would have anticipated this years ago. But what are the implications of using more than four bytes? Best estimates are that on average, each docID is used twice per word per page. That's because they have two inverted indexes -- one is "fancy" and the other is "plain." The fancy index includes only the most important words from each page, so it is much smaller than the plain index. But the docID is also used elsewhere in the system, which means our average of two docIDs per word per page is pretty close. The average number of words per web page is 300. Here are the space requirements for the docID if we assume 4 bytes, 12 bytes, and 20 bytes, for 4 billion web pages:
If you were designing a search engine, how many bytes would you choose for your docID? You'd use four bytes in 1998, and in 2003 you'd use five bytes. No question about it. The only questions are these: a) Did they look ahead in 1998? If not, when did they shift to five bytes? b) If they are in the process of shifting to five bytes right now, when will they finish? Are the current changes that everyone has observed in Google the result of this process? c) If they've been using more than four bytes for some time, making this page irrelevant, then why don't they just say so? If someone at Google goes on record on this topic, then we'll take down this page and that will be the end of it.
- 4 bytes: 300 * 4 billion * 8 = 9.6 to 12th power (10 terabytes)
- 12 bytes: 300 * 4 billion * 24 = 2.9 to 13th power (29 terabytes)
- 20 bytes: 300 * 4 billion * 40 = 4.8 to 13th power (48 terabytes)
Using the minimum number of bits needed for the docID does not mean that this ID is not represented elsewhere in the system in other forms. If you need to display or preserve the ID somewhere, like in a URL, you would use the four bytes to calculate the long version of the ID, using a conversion algo. You can insert this alphanumeric long version into the cache copy URL and use it to shift back to the four bytes later. This conversion process is cheap, because it doesn't have to be done all that often. It takes a few lines of code to convert from an optimized binary number to the same number in a different base. One possible base would use the upper and lower case letters, plus numbers, plus hyphen, underscore, slash, and period. These characters will survive intact on the web when appended to a URL, without extra encoding and decoding. The times you'd have to convert between the binary number and the other base, and back again, are trivial compared to the intensive amount of processing required on the inverted indexes. These inverted indexes are immediately consulted with every new query coming into the front end of Google.
Take a look at The Term Vector Database: Fast Access to Indexing Terms for Web Pages by Raymie Stata, Krishna Bharat, and Farzin Maghoul. Bharat is a member of the research staff at Google and has a Ph.D. in computer science. Here are some quotes from that paper:
"Rather than deal directly with URLs, the Connectivity Server uses a set of densely-packed integers to identify pages."
"Recall that page identifiers are a dense set of integers."
"To avoid wasting space, we pack vector records densely."
"Functions in the Connectivity Server convert between these integers and text URLs. In our work with the Connectivity Server, these identifiers have proven more convenient to handle in code than text URLs."
"Notice the use of integers to represent terms; as with page IDs in the Connectivity Server, we find these to be more convenient to manipulate than text strings."
"The page ID of the vector is stored in the first 4-bytes of the vector's record."It's clear that four-byte docIDs were used at one time. It's also clear that increasing beyond four bytes is not trivial. The real mystery is why Google is allowed to stay silent in the midst of the major shifts in indexing behavior that all Google watchers have noticed since April 2003. Everyone rushes to defend Google as being utterly infallible, and exempt from the mathematics that mere mortals are forced to use. The Pope, sitting in the Vatican, probably wishes he had it as easy as Google's public relations people.
![]()
![]()