Net Fisher

Webwalkers, Spiders, Wanderers and Worms

The Robot Hunters and Gatherers of Cyberspace

They are out there, folks! Exceptionally focused and highly skilled, they are more persistently aggressive than fire ants and more relentlessly indefatigable than killer bees. They refuse to be contained by rivers, mountains, or even oceans, and most likely they have their sights firmly fixed on a computer in your neighborhood. In fact, it is very likely that they are already on the way!!

They proudly sport handles that could be right out of Star Wars or Terminator, answering to names like Spider, Wanderer, Worm, Crawler, and even Webwalker. They have a heavy agenda and are accustomed to cruising at hyper-speed in the fast lane of the Information Highway. Their numbers are multiplying as unregulated incubators and hatcheries spring into operation in cities around the globe. Cloning (mirroring) operations, some resulting in bizarre behavior problems, are taking place at an exponential rate. No doubt, as you read this report multiple briefings are being held, final code checks readied, and highly sophisticated retrieval engines warmed for the launch of yet another night sortie of furious globe-trotting activity.

Whether it is just a simple statistics-gathering or access-rights "reconnaissance" foray, or perhaps a more involved "search and recovery" operation, even Casper the Friendly Ghost would envy their capability to pass noiselessly and invisibly through locked gates and rooms into the inner sanctums of governments, corporations, and universities. There is no safety "offshore" on some Caribbean island. International borders offer no protection against these invaders as they conduct their often highly surreptitious deep mining operations within the computers of the world.

Are we suggesting that Al Gore begin shutting down the on-ramps to the Information Highway at dark? Should the beckoning lights of the White House Home Page be switched off at midnight?

Are these a bunch of sleepless, bleary-eyed Web surfers from Seattle so involved in developing their own site that they inadvertently up and morphed to the far side of the electronic veil? Perhaps they have been condemned, like nineties-version Charlies, to an eternity of riding the Internet MTA. Forever stuck browsing home pages from the backside while trying to come up with just a nickel's worth of E-cash? Just who or what are these globe-trotting rovers anyway, and what are they up to? Is there genuine cause for alarm?

Whoa there, Skywalker!! Hold your megahertz just a second and relax...a little! What we have been describing (and referring to by their trade names) are representatives of the sophisticated new class of powerful "robot" programs that have only recently been developed. These are creatures that have emerged along with the World Wide Web, now the most popular means of publishing information on the Internet. Over the last couple of years, as the Web quickly grew beyond a few sites and a small number of documents, it became clear that manual browsing through significant portions of the hypertext structure was quickly becoming unwieldy, if not almost impossible. Without a new tool set, serious information gathering and research, and the incredible resource the Web has become, would have been nipped in the bud. The problem prompted experiments worldwide with automated, Web-browsing "robots."

Today these "robots," packing very complicated search algorithms, traverse the far corners of the Internet. Their assignments range from automatically checking links for repair and inspecting pages for change, to gathering general resource information. Masses of useful information are bundled up and trucked back to their master (sites).

Yes, it is true that there are serious site maintenance problems associated with the unbridled development and visits of ever more powerful models. WebMasters are involved in local and international forums to discuss such issues as the proper ethical standards for Web agents, and robot "net etiquette" for site visits (like knocking before entering and not wearing ski masks, etc.). These subjects are becoming increasingly germane as more and more often these software "agents" carry hidden agendas. Working under cover of false identification or even total anonymity, and acting as though they are carrying "diplomatic pouches," some are taking advantage of their unsuspecting hosts by examining and carting off more than an accommodating WebMaster ever intended to make available.

However, for the moment it appears that these are, for the most part, the "friendly" operative programs of a new breed of computer program "Handler." These masters of information-gathering machinery are individuals and organizations that have, for the general good and, hopefully, "profit," accepted the almost impossible challenge of meeting the exploding nourishment requirements of another new species (or is it a condition?) of information-starved baby gorillas, which could come to be known generically as "Infomaniacus Websurferitus."

"Websurferitus,"-as you, dear reader, might be...shall we say, intimately aware-uses as a primary vehicle the Web search engines and Hot Sight lists for purposes of both business and pleasure. However, their "habits" are highly dependent on the Webwalkers, Spiders, Wanderers and Worms operating several levels down on the Internet subway. Speeding around, gathering and sorting through the masses of data in the thousands of Web servers of the world, they fetch back URLs (Uniform Resource Locators), site identifiers, key words, links, and small content samples. The mounds of collected data subsets and location maps are then consolidated to form giant databases. These databases in turn fuel the search engines that you and I know as Harvest, InfoSeek, Lycos, Global Network Navigator, CUI Index, etc.

The query engines thus allow us to quickly acquire the basic data discriminators for evaluation and then use HTML jump pointers to the locations of desired information. With a good engine, this takes seconds, compared to hours, days or even weeks of searching by other methods. The big search engines are usually supported by one or several robots, each with specific talents and tasks. The creators of the most successful robots can be justifiably proud parents and use anthropomorphic terms when describing their prodigious offspring. As the InfoSeek user guide states:

"The worm we use was written at InfoSeek. It is extremely high-performance and can collect over 10,000 pages per hour (sustained). Our robot User-agent is 'InfoSeek Robot 1.0' and we follow all the robot guidelines. We respect /robots.txt files, eliminate duplicates, and will never have more than 1 outstanding request on a site at any given time (it does not swamp sites; we always wait for a request to be completed before we send the next request). The worm code is InfoSeek proprietary and is not available at this time for general use. The average number of visits per site by the worm is about 20 per month. Some sites, such as NCSA and CERN, are explored more heavily due to the abundance of WWW documentation available at these sites. In general, the only links we follow are the links from your home page. When someone gives us a URL to retrieve, we retrieve that URL only and do not follow the links contained in those documents. That keeps the load on your server minimal and it results in a very high quality WWW pages database for our users."

Choosing which search engine to use is a personal decision somewhat dependent on the types and expected locations of data you are searching for. Comparing Web search engines is obviously a relatively new science. They certainly cannot be judged by the results of a single search since each search engine uses different algorithms. On any particular query, one search engine might do better than another. Industry experts suggest that it is best, if you would like to conduct your own test, to evaluate an engine over a range of queries.

Of course, the ultimate test of both the quality of the search database and the search engine is its success in finding the information you desire and presenting it appropriately ranked on its likely value based on the query constructed. In industry-speak, "Search engines are measured by their ability to consistently generate high precision-recall statistics." This means the engine does a good job returning and rank-ordering relevant articles. The great thing about the Web is there are no rules that say you can't have all of them in your bookmark "quiver."
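Precision and recall are easy numbers to compute once you have a set of pages judged relevant to compare against. A small illustrative sketch, with entirely hypothetical URLs:

```python
def precision_recall(returned, relevant):
    """Precision: what fraction of returned pages were relevant?
    Recall: what fraction of all relevant pages were returned?"""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    return len(hits) / len(returned), len(hits) / len(relevant)

# Hypothetical query results versus a judged set of relevant pages.
returned = ["url1", "url2", "url3", "url4"]
relevant = ["url2", "url3", "url5"]
p, r = precision_recall(returned, relevant)
print(p, r)  # precision 0.5 (2 of 4 returned), recall 2/3 (2 of 3 relevant)
```

An engine that returns everything scores perfect recall but dismal precision; one that returns a single relevant page scores perfect precision but poor recall. The good engines balance both, which is why experts suggest testing over a range of queries.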

There is an ocean of information out there! Now that you know "who" the swabbies are that keep this whole cybership under way and afloat...the next time you sit down to fire up your favorite high-profile search engine, the one that has been receiving all the great press and glory, stop for a moment of tribute. Reflect for a moment on the work and working conditions of those minion "robots" tirelessly slaving away night and day, far from the limelight, in the back rooms and DASD cabinets of servers everywhere. Give a mighty cheer, or at least a silent toast, for Web Wanderer, Web Worm and all their hardworking and dedicated colleagues! Then, just for the sheer amazement of it all, launch a search for an obscure byte of data and sit back and marvel at the wonder of it!

For those readers interested in getting to know a robot better, and in learning the locations of the biggest and most popular search engines, the following references are provided:

1. Guidelines for Robot Writers

This document contains some suggestions for people who are thinking about developing Web Wanderers (robots), or programs that traverse the Web.

2. Robots in the Web: threat or treat?

Robots have been operating in the World Wide Web for over a year. In that time they have performed useful tasks, but occasionally they have wreaked havoc on the networks. This paper investigates the advantages and disadvantages of robots, with an emphasis on robots used for resource discovery. New alternative resource discovery strategies are discussed and compared. It concludes that while current robots will be useful in the immediate future, they will become less effective and more problematic as the Web grows.

3. World Wide Web Robots, Wanderers, and Spiders

Spiders, and their uses and problems, are discussed.

4. Ethical Web Agents

As the Web continues to evolve, the programs that interact with it will also increase in sophistication. Web agents, programs acting autonomously on some task, are already present in the form of spiders. Agents offer substantial benefits and hazards; because of this, their development must involve attention not only to technical details, but also to the ethical concerns related to their resulting impact. These ethical concerns will differ for agents employed in the creation of a service, and for agents acting on behalf of a specific individual. An ethic is proposed that addresses both of these perspectives. The proposal is predicated on the assumption that agents are a reality on the Web, and that there are no reasonable means of preventing their proliferation.

Other sites:

5. What's On Internet Search Tools

The big list.

6. W3 Search Engines

This is one of the most interesting, well-organized and user-friendly sites. It is a collection of some of the most useful search engines available on the WWW.

7. The Lycos Home Page: Hunting WWW Information

8. Webcrawler Searching

(C) Copyright 1995 - WWWiz Magazine - All materials contained herein remain property of WWWiz Magazine.