CS 4241: Webware
Assignment 2
[16 points]
Due: April 5, 2000 (9 pm)
Please note that the greyed-out parts and the handin instructions changed on March 29th.

Motivation and General Instructions

Crawlers, also known as robots or spiders, are essential components of web search engines. Programming a crawler requires knowledge of several important web techniques. In this assignment, you will implement a simplified version of a crawler. The architecture of the crawler is depicted below.

Note that you are free to implement the assignment in any programming language in which you are proficient (e.g., Perl, C, C++, Java). However, support will be provided only for Perl, a language that is especially well suited to this assignment. Your program must meet the requirements of the assignment and must run on UNIX, either on the WPI (CCC) computer or on shiraz. If any of these requirements does not hold, we cannot grade your assignment, and you will get a zero for it.

References

Your task

Program

You should write a program that performs a breadth-first traversal of a web site starting at a given URL, starturl. The traversal should visit all the HTML pages within depth d of starturl, i.e., those reachable from starturl by following a chain of at most d links. Only pages whose URL has starturl as a prefix should be visited. The crawler should store the collection of visited pages as follows:

  1. A text file, called the term file, storing relevant terms extracted from all the pages. In the term file, the terms for each URL should be preceded by a line of the form:
    *URL actualurl

    e.g.,

    *URL http://www.cs.wpi.edu/

  2. Not needed: A text file called urlindex that maps the visited URLs to the associated term file. Each line of the file should have the format:
    url termfile
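
The traversal described above can be sketched as follows. This is a minimal illustration in Python, not a reference solution: the names `crawl_order`, `get_links`, and `SITE` are hypothetical, and `get_links` stands in for whatever code fetches a page and returns the absolute URLs it links to. Here it reads from a small in-memory table so that the sketch runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web site": URL -> URLs linked from that page.
SITE = {
    "http://x/": ["http://x/a", "http://x/b", "http://other/"],
    "http://x/a": ["http://x/c", "http://x/"],
    "http://x/b": ["http://x/c"],
    "http://x/c": ["http://x/d"],
}


def get_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return SITE.get(url, [])


def crawl_order(starturl, max_depth, get_links):
    """Return (url, depth) pairs in breadth-first order.

    Only URLs that have starturl as a prefix are visited, and no page
    more than max_depth links away from starturl is visited.
    """
    visited = {starturl}
    queue = deque([(starturl, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            # Skip off-site URLs and pages that are already queued.
            if link.startswith(starturl) and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Marking a URL as visited when it is queued, rather than when it is dequeued, keeps a page that is linked from several places from being queued more than once.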

Your program should run from the command line, with the following syntax:

crawl [-depth d] starturl dir

The default value for the depth should be 2. Original text, now replaced: The file urlindex and the term files should be stored in subdirectory dir of the current directory, which should be created by the program. Current text: The term file should be stored in subdirectory dir of the current directory, which should be created by the program.
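
One possible way to handle this command line is sketched below in Python, using the standard `argparse` module. The function names `parse_args` and `prepare_output_dir` are assumptions made for illustration; only the option `-depth`, its default of 2, and the arguments starturl and dir come from the assignment text.

```python
import argparse
import os


def parse_args(argv):
    """Parse: crawl [-depth d] starturl dir"""
    parser = argparse.ArgumentParser(prog="crawl")
    # argparse accepts a long option name with a single leading dash.
    parser.add_argument("-depth", type=int, default=2, metavar="d",
                        help="maximum number of links to follow (default: 2)")
    parser.add_argument("starturl", help="URL at which the traversal starts")
    parser.add_argument("dir", help="subdirectory for the output")
    return parser.parse_args(argv)


def prepare_output_dir(name):
    """Create subdirectory `name` of the current directory if it is missing."""
    path = os.path.join(os.getcwd(), name)
    os.makedirs(path, exist_ok=True)
    return path
```

In a complete program, `parse_args(sys.argv[1:])` would be called once at startup, followed by `prepare_output_dir(args.dir)` before any output is written.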

The relevant terms extracted from each page should include the following:

Feel free to extract additional relevant terms (e.g., all the words in the page, excluding the HTML tags).
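
As an illustration of the "all the words in the page" option, the following Python sketch strips the HTML tags with the standard `html.parser` module and formats the result as one entry of the term file. The names `TextExtractor`, `extract_terms`, and `term_record` are hypothetical. Lowercasing the words, skipping script and style content, and writing the terms on a single space-separated line are assumptions of this sketch; the assignment only fixes the `*URL` header line.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, ignoring tags, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def extract_terms(html):
    """Return the lowercase alphanumeric words found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())


def term_record(url, html):
    """Format one term-file entry: the *URL line, then the terms."""
    return "*URL " + url + "\n" + " ".join(extract_terms(html)) + "\n"
```

A crawler built along these lines would append the string returned by `term_record` to the term file once for each visited page.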

Reuse of code from standard libraries (e.g., the Perl CPAN archive) is encouraged. However, you should not try to find and copy a program written by someone else that solves the assignment (see the Academic Honesty policy for the course).

Readme File

In addition to your program, you should write a readme file (plain text or HTML) that describes:

Grading

  1. [45%] Functionality
  2. [30%] Design and efficiency
  3. [25%] Documentation
    1. [10%] Internal: Comments inside the code.
    2. [15%] External: Readme file.

Handin instructions