Related link: Google Me Not, by David Whelan, Forbes Magazine, 16 August 2004


It's time for opt-in at search engines

By Daniel Brandt
August 8, 2004

One of the biggest challenges for a webmaster or system administrator is controlling automated crawlers. For us it started in September 1995, when AltaVista suddenly grabbed a couple dozen essays sitting in a directory on one of our sites. We didn't even have a web page yet; in 1995 our nonprofit database was accessed through Telnet.

A little research at the time revealed that this was some sort of automated "crawler" that indexed the pages it found and produced links to them in response to search queries. The service was called a "search engine." There was also a protocol called "robots.txt" that let you specify which directories crawlers should not index. Well-behaved crawlers were expected to consult robots.txt. If the file did not exist, then everything that the crawler could reach on your server was considered fair game. In other words, robots.txt was a voluntary, opt-out protocol.
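The opt-out default can be illustrated with a short sketch that uses Python's standard robots.txt parser. The directory names and the "ExampleBot" crawler name are hypothetical, chosen only for illustration; the point is that any path the file does not mention, and every path when the file is empty, is treated as permitted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the directory names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /essays/
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
# A listed directory is refused to a crawler that honors the file.
print(rules.can_fetch("ExampleBot", "/essays/one.html"))
# An unlisted path is allowed by default: silence means consent.
print(rules.can_fetch("ExampleBot", "/index.html"))

# With no rules at all, everything reachable is fair game.
empty = build_parser("")
print(empty.can_fetch("ExampleBot", "/essays/one.html"))
```

None of this is enforced by the server. The check runs inside the crawler, and only if the crawler's author chose to include it.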

Fast-forward nine years. Google and Yahoo are making tons of money off of ads placed on their search engines. Ecommerce webmasters have become expert at manipulating their rankings on the engines. Most queries at either of these engines produce thousands of links, but most searchers never look beyond the first 20 links or so. Ranking on the engines is the key to survival for anyone making money on the web.

The quality of search results is declining under the pressure of this manipulation, and queries that several years ago produced noncommercial results are getting buried. Examples of this are queries by people looking for the sort of nonprofit social services that are usually available in average communities. Unless they know enough about how to use search engines intelligently, by entering well-constructed, multi-word queries, they are likely to get spammed by parallel services that are mainly after big profits.

At the same time, the search engines are making so much money from competing advertisers that they have no incentive to improve their algorithms. How difficult would it be for Google or Yahoo to manually construct a nonprofit social-services directory keyed to location? A few employees could do it in a few months. Once it was constructed, the algorithm could be tweaked to rank these links higher for appropriate queries. Will it ever happen? No, not as long as Google is busy counting its millions from advertising.

Social disservice is what we have learned to expect from the search engines. Google in particular was very slow to curtail ads from rogue pharmacies that sell opiates without a prescription. Gambling ads, illegal in the U.S., are another problem. The law has trouble dealing with the Internet, often because of jurisdictional problems.


Automated crawlers have too much license

The robots.txt protocol has not improved much since 1995. Many crawlers are of questionable utility. Perhaps they are looking for email addresses to sell to spammers. A crawler could even be someone in a dorm room who turns on a personal robot to suck up an entire site, because he has nothing better to do. If the site has thousands of pages, and the crawler does not use a delay between fetches, then the bandwidth of almost any university can bring down an average server within minutes. Crawlers such as these don't even consult robots.txt.

The crawlers that do consult robots.txt are a mixed bag. Some are like the crawler from Alexa, which feeds Archive.org, a site that collects old web pages and keeps them forever. They even sell sets of web data. Why would you want this for your web pages? What possible benefit is this for a webmaster?

Other crawlers are downright spooky. One example is Metacarta.com, which contracts with intelligence agencies. They specialize in scraping documents for geolocation data, so that the spies in Washington can figure out whether anything interesting is happening at a particular point on the globe. How many webmasters are even aware that this goes on? Of those few, how many know exactly how to stop this particular crawler, assuming that this crawler even bothers with robots.txt? It may take 30 minutes of tracing and research to uncover the crawler from Metacarta. With dozens of crawlers hitting your site from anywhere between Europe and China, who has the time or skills to trace them all?

Crawlers and search engines are a constant problem, but anyone who is trying to publish on the web cannot afford to shut them out completely. Shutting out Google, Yahoo, and Microsoft would, in most cases, cut a webmaster's traffic to about 15 percent of its previous level. At the same time, it's tricky and unreliable to set up your robots.txt to allow only these three -- robots.txt is an exclusion protocol, not an inclusion protocol. There are also issues that robots.txt cannot handle. One example is the cache copy. If you don't want a search engine to make a cache copy of your pages available, you have to insert an instruction on every single page on your site. Even this is an opt-out instruction; if the crawler doesn't find this instruction on a particular page, it assumes that it has permission to offer a link to its cache copy of that page.
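A minimal sketch of the allow-only-three workaround follows. It assumes the crawlers identify themselves with the tokens "Googlebot", "Slurp", and "msnbot"; check each engine's own documentation for the tokens it actually uses. The file has to exclude everyone with a wildcard rule and then carve out an exception for each named crawler, since robots.txt has no way to say "only these." The result is checked with Python's standard parser.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; an empty Disallow line means "nothing is disallowed".
ALLOW_THREE = """\
User-agent: Googlebot
Disallow:

User-agent: Slurp
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""

allow_three = RobotFileParser()
allow_three.parse(ALLOW_THREE.splitlines())

# The three named crawlers are let through; any other name is turned away.
for agent in ("Googlebot", "Slurp", "msnbot", "SomeOtherBot"):
    print(agent, allow_three.can_fetch(agent, "/index.html"))
```

The unreliability is easy to see: a crawler that never reads the file, or one that announces itself under one of the three permitted names, is not stopped by any of this. The cache copy is a separate matter; the per-page instruction is commonly written as a robots meta tag with the value "noarchive", and it has to appear on every page.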

The pervasive opt-out philosophy of big web players, as opposed to a more reasonable opt-in philosophy, has infected other areas. Google, Yahoo, Microsoft, and Amazon, for example, make certain assumptions about your privacy rights. They do everything possible to track your preferences and web behavior, and target you for advertising, and retain everything they learn about you, unless you take very specific steps at a number of levels to prevent or minimize this. Where privacy has been an issue, the solution arrived at is always an opt-out solution, never an opt-in solution. This entire opt-out philosophy is also at the root of the security problems with Internet Explorer. It's basically a battle between the biggest Internet corporations and the consumer. Unless he is well-informed about how to browse the Internet safely, and has a fair amount of technical aptitude, the consumer is increasingly losing these battles.


Everything about the Internet is backwards

We need more regulation on the Internet. Specifically, we need laws that presume that opt-in for the citizen and consumer is preferable to opt-out. A good place to start would be to limit web crawling by changing robots.txt so that it becomes an opt-in protocol.

A webmaster should be able to designate which crawlers are allowed on his site, with either a user-agent list or a list of IP address blocks. The directories and/or files that are allowed should be specified. Other instructions, such as the cache copy, could be listed on an opt-in basis. The implicit message behind this new protocol is that no one else has permission to do automatic crawling.
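What such an opt-in check might look like is sketched below. No such protocol exists; the names CrawlerGrant, is_allowed, and may_cache, the crawler name "ExampleSearchBot", and the address blocks (taken from the ranges reserved for documentation) are all hypothetical. The sketch only shows the reversed default: a crawler with no matching grant gets nothing.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerGrant:
    """One hypothetical opt-in entry: who is allowed, where, and with what extras."""
    user_agents: tuple = ()    # user-agent tokens, matched case-insensitively
    ip_blocks: tuple = ()      # CIDR blocks such as "192.0.2.0/24"
    allowed_paths: tuple = ()  # path prefixes the crawler may fetch
    allow_cache: bool = False  # cache copies are also opt-in

    def matches(self, user_agent, ip):
        ua = user_agent.lower()
        if any(token.lower() in ua for token in self.user_agents):
            return True
        addr = ipaddress.ip_address(ip)
        return any(addr in ipaddress.ip_network(block) for block in self.ip_blocks)


def is_allowed(grants, user_agent, ip, path):
    """Deny by default; allow only if some matching grant covers the path."""
    return any(
        g.matches(user_agent, ip) and any(path.startswith(p) for p in g.allowed_paths)
        for g in grants
    )


def may_cache(grants, user_agent, ip):
    """A cache copy is permitted only if a matching grant says so explicitly."""
    return any(g.matches(user_agent, ip) and g.allow_cache for g in grants)


# Hypothetical grants: one by user-agent, one by IP address block.
GRANTS = [
    CrawlerGrant(user_agents=("ExampleSearchBot",), allowed_paths=("/essays/", "/index.html")),
    CrawlerGrant(ip_blocks=("192.0.2.0/24",), allowed_paths=("/",), allow_cache=True),
]

print(is_allowed(GRANTS, "ExampleSearchBot/1.0", "203.0.113.5", "/essays/a.html"))
print(is_allowed(GRANTS, "UnknownBot/0.1", "203.0.113.5", "/index.html"))
print(may_cache(GRANTS, "ExampleSearchBot/1.0", "203.0.113.5"))
```

Matching on user-agent alone is easy to spoof, which is one reason to offer the IP address block as an alternative. Either way, the default answer for an unlisted crawler is no.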

One effect that this would have is to encourage the development of new and diversified niche search engines. A major engine such as Google would have to appeal more to nonprofit groups in order to include those listings. Hopefully they would do this by showing these groups that getting listed in Google serves their nonprofit mission. If they can't show this, then other engines are in a better position to convince these groups that they can pick up the slack. The scale of your crawling capability would matter less, and the purity of your noncommercial ranking algorithms would matter more.

New regulations wouldn't affect crawlers from other jurisdictions, but those crawlers might be easier to identify, if only because the field wouldn't be so crowded. Currently as much as 80 to 90 percent of all traffic on some websites is from various crawlers. A small percentage of this crawler traffic -- probably close to 20 percent -- is from crawlers that will steer a real eyeball to some page on that site, at some point in the future. This is not a very efficient ratio. It's rather like "crawler spam" for webmasters, even though the problem is not often acknowledged. Most noncommercial sites are not aware of where their traffic comes from, and most commercial sites are so busy trying to rank well in the big engines that they're not inclined to complain about wasted crawling.

The biggest change that an opt-in philosophy might provide is a change in attitude. The commercialization of the Internet would be de-emphasized slightly, and the interests of the public would become an issue for the first time in recent Internet history. Pundits would stop counting all the prospective millionaires at Google, and start asking what Google has done lately for the general welfare.

Isn't Google comparable to a power company providing essential electricity? Access to information is a vital resource, and search engines have changed the nature of that resource dramatically over the last nine years. It's not premature to call for the regulation of that resource for the benefit of the public.

_________________

Daniel Brandt is founder and president of Public Information Research, Inc.

Google Watch