Programming Assignment #5

Project 5 (60 points)
Assigned: Friday, February 19, 2010
Part 1 due: Wednesday February 24, 2010, 11:59 PM
Part 2 due: Wednesday, March 3, 2010, 11:59 PM

Programming Assignment #5 — Hash Tables

Abstract

Write and test a class that implements a hash table to store strings. Create a subclass to use in an application to determine the percentage of lines that are the same among a set of files.

Outcomes

After successfully completing this assignment, you should be able to:–

· Build a hash table class in C++ for storing and finding non-trivial data.

· Define a subclass one of the classes you already defined.

· Use that subclass, along with a set of linked lists, to figure out whether multiple files contain identical lines and, if so, how many.

· Put together a non-trivial program in C++.

Before Starting

Read Chapter 23 of Deitel & Deitel. Also, review hash tables in §6.6 of Kernighan and Ritchie or in the lecture notes of this course (.ppt, html).

In Visual Studio, create a Console Application Project (similar to that you created for Laboratory Assignment #5) called PA5 to hold the program and interface files for this project.

This Assignment

This assignment is in two parts. In Part 1, you must build and test a class to implement hash tables for strings. This is an individual part. This part will be due next Wednesday, February 24.

In Part 2, you will use your hash table to build an application to determine the similarity of multiple documents (such as programs, write-ups, etc.) For this part, you may optionally work in two-person teams. In particular, if you were not able to construct a fully working hash table class, you should partner with someone who did (or with someone with whom you can merge your hash table classes.)

Part 1: Hash Table class

Your hash table class must contain an array of size determined by an argument of the constructor. Each array entry (called a “bucket” in hash table terminology) points to a linked list of entries. Each entry contains a string (i.e., the object being hashed and stored) and a link to the next entry. Your hash table class must implement the following methods:–

· A constructor HashTable(const long tableSize). The number of elements in the hash table should be a prime number close in value to the parameter tableSize. For this problem, think in terms of 10,000 or more.

· A method long hash(const string &s), which creates a hash value for its string argument. You may use the hash algorithm on page 144 of Kernighan & Ritchie or any suitable hash algorithm that you find on the web or in your textbooks. Be sure to cite your source for your hash algorithm in your README file.

· A method node *lookup(const string &s), where node is the type of the elements of the linked list entries in your hash table. This method looks up value of string argument in the hash table and returns a pointer to its list entry (if found) or to NULL (if not found).

· A method node *addEntry(const string &s) that calls lookup to find the entry and, if it is not found, adds it. Either way, it returns a pointer to the entry for its string argument.

· A destructor method that deletes all of the entries of the hash table and then frees its array.

To test your class, you must to write a separate, small test program that inserts random strings into the hash table and then finds them again. You should test it with both short strings (a few characters) and long strings (tens or hundreds of characters). Show your test cases in your README file.

Submitting Part 1

To submit Part 1 of this assignment, clean your Visual Studio project, zip your class definition and test program together, and submit it and your README file using the following command:–

/cs/bin/turnin submit CS2303 PA5.1 <your zip file> README

Submit this by 11:59 PM on Wednesday, February 24. Submissions after 11:59 pm on due date (February 24) will be tagged as late, and will be subject to the late homework policy.

This part is worth 20 points of the total of 60 points of this project.

Part 2: Application Program

This program must accept an indeterminate number of arguments on the command line, the first of which specifies the size of the hash table and the remaining of which specify the input files. Thus, a user could invoke your program in a Windows Command Prompt by typing

./PA5 hashTableSize inputFile1 inputFile2 ...

The program needs to first create the hash table. Then it needs to read each file line-by-line and store each line in the hash table. After all of the input files have been read, it has to print out a summary that says for each other input file, what percentage of the lines of the given file also appear in that other input file. For example, the output might look like the following:–

inputFile1:
   6% of lines are also in inputFile2
   50% of lines are also in inputFile3
   0% of lines are also in inputFile4

inputFile2:
   9% of lines are also in inputFile1
   0% of lines are also in inputFile3
   10% of lines are also in inputFile4

inputFile3:
   45% of lines are also in inputFile1
   0% of lines are also in inputFile2
   1% of lines are also in inputFile4

inputFile4:
   0% of lines are also in inputFile1
   11% of lines are also in inputFile2
   2% of lines are also in inputFile3

Implementing Part 2

One way to implement this program is to create an array of size argc – 2. Each entry of this array would correspond to one input file and would be a pointer to the first element of linked list. The elements of the linked list would point to the entries of hash table containing the lines of the file. Hypothetically, it would be possible to print out the original file from this linked list.

This array of linked lists provides “forward” pointers to the lines of the files as they are stored in the hash table. To get the output, it is also necessary to have “backward” pointers from each line to all of the files that contain that line.

For this purpose, you need to create a subclass (i.e., a derived class) of the hash table nodes. The subclass adds one or more members to the nodes in order to contain the pointers back to the files containing the lines of those nodes.^{^[1]} Suppose, for example, that line

for (i = 0; i < max; i++)

appears in some number of the input files. Then, the hash table entry for this line would point to a linked list of six items, each item containing an index into the original file array.

This makes it possible to identify which lines of a particular file also appear in other files. For a given file, you simply follow the pointer to each of its lines and see what other files also contain that line.

To keep things simple, you should have a separate forward pointer for each line of the file, even if the line is repeated, and you should have a separate backward pointer for each time a line appears in a particular file, even if it appears more than once.

Submitting Part 2

To submit Part 2 of this assignment, clean your Visual Studio project, zip your class definition and application program together, and submit it and your README file using the following command:–

/cs/bin/turnin submit CS2303 PA5.2 <zip file> README <test-cases>

This part is worth 40 points of the total of 60 points of this project. This part is due at 11:59 PM on Wednesday, March 3. Because of grading deadlines at the end of the term, and because of the exam schedule, there is no room for extensions or late submittals.

Individual and Team Project

Part 1 is an individual project. You may consult classmates and other resources on the theory and algorithms for creating and managing hash tables, but you must implement them in your class on your own. Be sure to put your name on all files of the project. Be sure to cite your sources (including your friends) for any knowledge, ideas, brainstorming, algorithms, or other information you use in this assignment.

Part 2 is optionally a two-person team project. Please inform the TAs of your partnership, so that we can properly record grades for both of you. Be sure to put both of your names on all files of your project. Explain in your README file whose class implementation your used. Also cite all of your sources for knowledge, ideas, brainstorming, algorithms, etc., used in this project.

Grading

This assignment is worth sixty (60) points. Each part of your submission must compile without errors in Visual Studio 2008 in order to receive any credit. If you do your work on any other system, it is strongly suggested that you compile and test it on Visual Studio at least one day prior to submitting it, in order to flush out any differences in C++ implementation.

Part 1 – 20 points

· Correctly build in Visual Studio 2008 (Debug mode) without warnings – 4 points

· Correct definition and implementation of HashTable class and the methods above – 8 points

· Adequate test program to show that class and methods work – 4 points

· README file summarizing implementation of class and showing output of test program – 4 points

Part 2 – 40 points

· Correctly build in Visual Studio 2008 (Debug mode) without warnings – 5 points

· Subclass of nodes of hash table to contain pointers to files containing lines – 10 points

· Correct implementation of forward pointers from input file array to lines of each file – 5 points

· Correct implementation of backward pointers from hash table to files containing those lines – 5 points

· Satisfactory test cases comprising at least five input files, some of which share lines and some of which do not – 5 points

· Correct execution with graders’ test cases – 5 points

· Satisfactory README file summarizing all salient points of application program – 5 points

· Penalty of five points on either part if your project is not clean before creating the zip file or for submitting your programs in something other than a zip file.

^{^[1]} Of course, it is possible to do this assignment by simply redefining the nodes of the hash table class of Part 1. However, we want you to actually create a subclass and use it for this assignment.