
Are Web Spiders Beneficial?
Copyright © 1997 Qusay Mahmoud. All rights reserved.
Introduction
How Large Is the Web?
The first Web spider was the World Wide Web Wanderer. It was written in 1993 to measure how fast the Web was growing. Its home page includes very useful information about the growth and usage of the Web and the Internet. Once you visit that home page and see the kind of information available, you should be able to figure out that such information is very difficult, if not impossible, to gather manually. This is where Web spiders come into play.
What Is a Web Spider Anyway?
A Web spider, also known as a robot, a worm, or a wanderer, is a computer program that traverses the Web's hypertext structure by retrieving a document and then recursively retrieving all the documents that it references. This program retrieves information from remote sites using the standard Web protocol, the Hypertext Transfer Protocol (HTTP).
The term "spider" or "worm," however, may give the impression that this program actually moves from one site to another and multiplies itself as it moves. Those are characteristics of a virus, not of a spider, and that is why the term "robot" could be a better name for a program that traverses the Web to retrieve information. The terms "spider" and "robot" are used interchangeably in this article.
Search Engines
The term Web spider is not a synonym for a search engine. A search engine is another computer program, one that searches through the database built by the Web spider. Alta Vista is an example of a search engine. On the other hand, it is important to note that Yahoo is a different kind of search engine: Yahoo does not use a Web spider to build its database; the database is built by people.
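To make the division of labor concrete, here is a minimal Python sketch, assuming the spider has already stored the text of each page keyed by its URL. The `pages` dictionary, the example URLs, and the function names `build_index` and `search` are all made up for illustration; this is not a description of how Alta Vista or any real search engine works.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages that a spider might have collected.
pages = {
    "http://example.com/jazz.html": "jazz music is cool",
    "http://example.com/rock.html": "rock music is loud",
}
index = build_index(pages)
```

The spider's job ends once `pages` is filled in; the search engine only ever looks at the index built from it.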
How Does a Web Spider Work?
As a starting point, the spider is given a popular HTML document full of URLs. The spider parses out the URLs and visits each one recursively. That is, it visits the first URL, parses all the URLs contained in the document it finds there, visits each of those recursively, and so on. So you can imagine how many Web sites the spider will visit. In principle, it will eventually traverse every part of the Web that can be reached from the starting document.
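The traversal described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: it replaces recursion with an explicit queue (a breadth-first visit), it keeps a set of URLs already seen so that pages linking to each other do not cause an endless loop, and it reads pages from a made-up in-memory `SITE` dictionary instead of fetching them over HTTP. The names `LinkExtractor`, `crawl`, `fetch`, and `max_pages` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the HREF value of every <A> tag in a document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first from start_url; return URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # the document could not be retrieved
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical three-page site held in memory instead of fetched over HTTP.
SITE = {
    "http://example.com/": '<A HREF="a.html">A</A> <A HREF="b.html">B</A>',
    "http://example.com/a.html": '<A HREF="/">home</A> <A HREF="b.html">B</A>',
    "http://example.com/b.html": "no links here",
}
order = crawl("http://example.com/", SITE.get)
```

Here `order` lists the three pages exactly once each, even though the pages link back to one another. A real spider would pass in a `fetch` function that retrieves the document over HTTP.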
Are There Any Problems?
While it is certainly true that robots are ideal for constructing a searchable index of topics, there are some drawbacks. Imagine you have a Web site with 2,000 documents. The spider would parse each document and ultimately index your whole Web site. Should you be concerned? Well, it depends. One concern is how fast the spider fetches documents from your Web site. If the spider fetches multiple documents a second, it could overload your Web server, which may cause it to crash. Another concern might be that you do not want a spider to index your whole Web site, but only a portion of it. There are some proposed solutions to these problems; however, the proposed solutions are not enforced.
Robots Exclusion Standard
This standard is used to exclude robots from a server by creating the file "robots.txt" in the root HTML directory. Once the file is created, rules for robots can be specified in it. When an ethical, well-behaved Web spider visits a new Web site, it reads the file "robots.txt" and checks whether any rules are specified for that particular robot. Thus, the file "robots.txt" is a guide that tells the spider what it can and cannot index. Here's how it works:
If you do not care about spiders and you do not mind if they visit your site and index everything, then simply create an empty file named "robots.txt" in the root directory of your HTTP server. This will keep the server from logging lots of error messages about the file "robots.txt" not being found.
If you do not wish to be visited by spiders, then the best way to keep them away is to create the file "robots.txt" and add the following two lines to it:
User-agent: *
Disallow: /
The above two lines say that whatever the spider name (user-agent) is, do not allow it to access anything on my site. An ethical spider would obey the rules and would not traverse your HTML documents. However, there are some spiders that do not even care to read the file "robots.txt" for instructions.
In reality, however, you would certainly want to be visited by a Web spider, since it will make your site known to some search engines so that people will visit your site. The Robots Exclusion Standard offers an easy way to select which areas of the HTML hierarchy may be traversed. Let's examine the following lines, which are found in my "robots.txt" file:
# /robots.txt file for http://www.garfield.csd.unbsj.ca
# site maintainer: qusay@garfield.csd.unbsj.ca
User-agent: bird # birds are the best navigators
Disallow:

User-agent: *
Disallow: /icons # robots shouldn't read .gif, .jpg, ...etc.
Disallow: /cgi-bin/ # don't be insane
The first two lines, which start with "#", are comments. The next two lines specify that the spider known as "bird" is allowed to traverse the whole HTML hierarchy. The last three lines specify that all other spiders are not allowed to traverse the "icons" and "cgi-bin" directories. The order of the rules can matter to a spider that applies the first record matching its name: if the records for "bird" and "*" were swapped, such a spider would match "*" first, and "bird" would not be allowed to access the "cgi-bin" directory.
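If you want to check how a robot might read these rules, Python's standard `urllib.robotparser` module can parse them. The sketch below feeds it the file shown above, with a blank line separating the two records as the standard expects. The robot name "otherbot", the paths being tested, and the helper `allowed` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# /robots.txt file for http://www.garfield.csd.unbsj.ca
# site maintainer: qusay@garfield.csd.unbsj.ca
User-agent: bird # birds are the best navigators
Disallow:

User-agent: *
Disallow: /icons # robots shouldn't read .gif, .jpg, ...etc.
Disallow: /cgi-bin/ # don't be insane
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access

def allowed(agent, path):
    """Ask whether the named robot may fetch the given path."""
    return rules.can_fetch(agent, path)

print(allowed("bird", "/cgi-bin/search"))      # "bird" has an empty Disallow
print(allowed("otherbot", "/cgi-bin/search"))  # falls under the "*" record
```

Note that this particular parser consults the "*" record only when no named record matches, so with it the answer does not depend on the order of the records. Other robots may behave differently.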
Is This Standard Sufficient?
The standard described above cannot force spider writers to follow the rules by reading the file "robots.txt" before traversing a visited Web site. However, it is recommended that spider writers follow the rules of the Robots Exclusion Standard so that their spiders will be ethical spiders. There are also other things to take into account when writing a new spider. For example, make sure your spider does not fetch multiple documents every second; try to leave a delay of 10 seconds between successive accesses. Always identify yourself to the site. Also, make sure your spider knows how to handle graphics and PostScript documents that it may encounter.
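Two of these recommendations, the delay between accesses and identifying yourself, can be sketched in Python as follows. This is only one possible arrangement, not a prescribed implementation: the class name `PoliteFetcher`, the `USER_AGENT` string, and the contact address in it are hypothetical, and the `clock` and `sleep` parameters exist only so the timing logic can be tried out without really waiting.

```python
import time
import urllib.request

# Hypothetical identity; a real spider would give its own name and contact.
USER_AGENT = "ExampleSpider/0.1 (+mailto:webmaster@example.com)"
DELAY_SECONDS = 10  # the pause suggested in the text

class PoliteFetcher:
    """Fetch pages no faster than once every `delay` seconds."""

    def __init__(self, delay=DELAY_SECONDS, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock
        self.sleep = sleep
        self.last_access = None

    def wait_turn(self):
        """Sleep until `delay` seconds have passed since the previous access."""
        now = self.clock()
        if self.last_access is not None:
            remaining = self.delay - (now - self.last_access)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_access = now

    def build_request(self, url):
        """Create a request that identifies the spider to the server."""
        return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

    def fetch(self, url):
        """Wait for our turn, then retrieve the document at `url`."""
        self.wait_turn()
        with urllib.request.urlopen(self.build_request(url)) as response:
            return response.read()
```

Calling `PoliteFetcher().fetch(url)` repeatedly would then space the requests at least 10 seconds apart and send the spider's name with each one, so that a site administrator can see who is visiting.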
I Cannot Create "robots.txt"
If you are not the administrator of the Web server on which you have your HTML documents then you will not be able to create the "robots.txt" file to exclude robots from your HTML hierarchy. However, that does not mean you cannot exclude spiders. There is a new standard for using HTML META tags just for that purpose. For example, if you include the following tag:
<META NAME="ROBOTS" CONTENT="NOINDEX">
in an HTML document, then the corresponding document would not be indexed by a robot.
If you do not want robots to follow the links in a document, include the following tag:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
in that HTML document. A robot may still index the document itself, but it will not visit the URLs listed in it.
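From the spider's side, honoring these tags means looking for them before indexing a page or following its links. Here is a minimal Python sketch of that check, assuming only the two values shown above need to be recognized. The names `RobotsMetaParser` and `robots_policy` are hypothetical.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <META NAME="ROBOTS" CONTENT="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Several directives may be given as a comma-separated list.
            for item in content.split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())

def robots_policy(html):
    """Return (may_index, may_follow) for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow

page = '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOFOLLOW"></HEAD></HTML>'
may_index, may_follow = robots_policy(page)
```

For the sample `page`, the spider may index the document but should not follow its links. As with "robots.txt", nothing forces a spider to perform this check.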
Getting Listed In Search Engines
Most search engines provide services for users to add their HTML pages to the database. There are some services on the Web that will submit your site to many search engines for a couple of hundred bucks. But why not do it yourself and save the money? Here is how: Every major search engine has a page for adding your HTML pages to their index. You can go to those pages, fill out the requested information and submit the request. Below are some URL resources for adding your HTML pages:
Yahoo: http://add.yahoo.com/fast/add?
Alta Vista: http://altavista.digital.com/cgi-bin/query?pg=tmpl&v=addurl.html
Submit-It: http://www.submit-it.com/
Most search engines use the META tag to help index your page. Basically, META tags store information about the page itself; this information will not be displayed. You put <META> inside the <HEAD> section of your HTML page. For example, for a page about Jazz Music, you may want to do something like the following:
<HTML>
<HEAD>
<TITLE>Jazz Music is cool</TITLE>
<META NAME="keywords" CONTENT="Jazz Music">
</HEAD>
<BODY>
...
</BODY>
</HTML>
What Is Wrong With Web Spiders?
Web spiders are good tools for indexing static HTML documents, that is, documents that do not change very often. However, the idea of using a Web spider as an automated information discovery tool is starting to break down because these spiders cannot keep their massive databases up to date. Web sites such as CNN change on an hourly basis, so how could spiders keep up with that?
Web spiders have been useful up to now. However, there is a lot of information on the Web that cannot be found with a traditional Web spider.
Conclusion
With the explosive growth of the Web, we (humans) cannot cope with the amount of information available on it. Information discovery is an ideal application for Web spiders. Web spiders are well suited to indexing static HTML pages; however, there is a need for better tools for indexing dynamic HTML pages, that is, pages that change quite frequently. For more information on Web spiders and a list of some known spiders, please check out http://info.webcrawler.com/mak/projects/robots/faq.html
Qusay H. Mahmoud is a software designer at a major networking company in Ottawa, Canada. He holds a B.Sc. in Data Analysis and a Master's degree in Computer Science, both from The University of New Brunswick, Canada. He can be reached at: dejavu@acm.org
