Crawlers, also known as robots or spiders, are essential components of web search engines. Programming a crawler requires knowledge of several important web techniques. In this assignment, you will implement a simplified version of a crawler. The architecture of the crawler is depicted below.
Note that you are free to implement the assignment in any programming language in which you are proficient (e.g., Perl, C, C++, Java). However, support will be provided only for Perl, a language that is especially well suited to this assignment. Your program should meet the requirements of the assignment and should run under UNIX on either the WPI CCC machines or on shiraz. If any of these requirements is not met, we cannot grade your assignment, and you will get a zero for it.
You should write a program that performs a breadth-first traversal of a web site starting at a given URL, starturl. The traversal should visit all the HTML pages within depth d of starturl, i.e., reachable from starturl by following a chain of at most d links. Only pages that have starturl as a prefix should be visited. The crawler should store the collection of visited pages as follows:
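The traversal described above is a standard breadth-first search over a queue of (URL, depth) pairs. The following Python sketch illustrates the idea (the assignment recommends Perl, but the logic is language-independent). The page-fetching step is passed in as a function so the depth bound and prefix filter are easy to see; the names `crawl` and `fetch` are illustrative, not prescribed by the assignment.

```python
from collections import deque
from urllib.parse import urljoin
import re

def crawl(starturl, fetch, max_depth=2):
    """Breadth-first traversal visiting pages within max_depth links
    of starturl; only URLs that have starturl as a prefix are followed.
    `fetch` maps a URL to its HTML text, so it can wrap urllib in a
    real crawler or be a stub during testing."""
    visited = []
    seen = {starturl}
    queue = deque([(starturl, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # skip unreachable pages
        visited.append(url)
        if depth == max_depth:
            continue  # depth bound reached: record the page, follow no links
        # Naive href extraction; a real crawler should use an HTML parser
        # (e.g., Perl's HTML::Parser or HTML::LinkExtor).
        for href in re.findall(r'href="([^"]+)"', html, re.IGNORECASE):
            link = urljoin(url, href)
            if link.startswith(starturl) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

The `seen` set guarantees each page is fetched at most once even when several pages link to it, and queueing (URL, depth + 1) pairs is what makes the traversal breadth-first rather than depth-first.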
Your program should run from the command line, with the following syntax:
crawl [-depth d] starturl dir
The default value for the depth should be 2. The term file should be stored in subdirectory dir of the current directory, which should be created by the program.
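One possible shape for the command-line handling, sketched in Python with `argparse` standing in for Perl's Getopt::Long. The option and argument names follow the syntax shown above; everything else in this sketch is an assumption.

```python
import argparse
import os

def parse_args(argv):
    """Parse `crawl [-depth d] starturl dir`; the depth defaults to 2."""
    parser = argparse.ArgumentParser(prog="crawl")
    parser.add_argument("-depth", type=int, default=2,
                        help="maximum link depth (default 2)")
    parser.add_argument("starturl", help="URL where the traversal starts")
    parser.add_argument("dir", help="output subdirectory of the current directory")
    return parser.parse_args(argv)

# Example invocation: crawl -depth 3 http://example.edu/ out
args = parse_args(["-depth", "3", "http://example.edu/", "out"])
os.makedirs(args.dir, exist_ok=True)  # the program itself must create dir
```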
The relevant terms extracted from each page should include the following:
Feel free to extract additional relevant terms (e.g., all the words in the page, excluding the HTML tags).
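As a rough illustration of the optional extraction mentioned above, here is a regex-based Python sketch that keeps all the words in a page while discarding HTML tags and character entities. The function name is illustrative, and tag stripping with regular expressions is a simplification; a proper HTML parser (e.g., Perl's HTML::Parser) is more robust.

```python
import re

def extract_terms(html):
    """Return all words in the page text, excluding HTML tags,
    lowercased for indexing."""
    text = re.sub(r"<[^>]*>", " ", html)              # drop tags
    text = re.sub(r"&[a-zA-Z]+;|&#\d+;", " ", text)   # drop character entities
    return [w.lower() for w in re.findall(r"[A-Za-z0-9']+", text)]
```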
Reuse of code in standard libraries (e.g., Perl CPAN archive) is encouraged. However, you should not try to find and copy a program written by someone else that solves the assignment (see the Academic Honesty policy for the course).
In addition to your program, you should write a readme file (plain text or HTML) that describes:
- design choices
- code structure
- instructions for use
- instructions for installation
- known bugs
- list of code reused from standard libraries.