Homework 5 — Binary Trees

Due: Sunday, December 14, 2008, at 11:59pm

Abstract

Write a program to scan one or more text files and count the number of occurrences of each word. Use a binary tree to keep track of all words. When all input files have been scanned, print out the results to a file. For extra credit, instrument your program to get information about how long it takes to search the binary tree as a function of the number of entries stored in it.

Outcomes

After successfully completing this assignment, you should be able to:–

·        Build a massive, recursive data structure of structs using malloc() and free().

·        Search for an item in that data structure and, if it is not found, add it to the structure.

·        Create a file for your output and write to it.

·        Put together a non-trivial program in C from algorithms and tools that are already available to you.

Before Starting

Re-read §6.5 regarding self-referential data structures and §7.5 regarding file open and creation. Also, review Lab 5 about make and makefiles.

This Assignment                                                             

The program you must create in this assignment is approximately the same size as Homework 4. It should comprise multiple files, and it should include a makefile that builds the entire program and all of its components. Unlike Homework 4, the partitioning of the problem is not given to you. Instead, you have to divide it up into component .c and .h files implementing small groups of related functions. The only requirement is that the executable program be named hw5.

Your program must accept an indeterminate number of arguments on the command line, the first of which specifies the output file and the remaining of which specify the input files. Thus, to invoke your program, the user would type

./hw5 outputFilePath inputFilePath1 inputFilePath2 ...

Under this command, the program would open and read each of the input files in turn, building up a binary tree of words and counts as it progresses. Once all files have been read and closed, it must create the output file and write out the words in the tree in alphabetical order, one word per line, along with the number of occurrences of that word. Your program should ignore the case of the words, so that “This” and “this” are considered the same. A sample output should look like the following:–

    166 a
     25 and
     11 as
      3 command
     15 each
      2 file
      4 files
    109 in
      4 input
     98 it
     99 of
      3 open
      6 program
     18 read
    152 the
     41 this
      3 under
     30 would
-------------
     17 Total number of different words

To allow for very long input files, the field width of the number of occurrences of each word should be at least six decimal digits. You should also total and print the number of distinct words.

As part of your project submission, you must provide a file called makefile that enables make to build the target hw5, and individual components, and the target clean, which removes the executable and all intermediate files. The graders will type the command

make

to build your program and the command

make clean

to delete it and the intermediate files.

For debugging, you may use the same text files as were provided in Homework 4 — i.e.,

·        Kennedy.txt

·        MartinLutherKing.txt

·        Homework4.txt

Algorithms for your Use

In §6.5, Kernighan and Ritchie describe a struct called tnode that defines a binary tree and two functions called addtree() and treeprint() to use and print that tree. Note both are recursive functions. You should base your program on these, but you may make changes if you feel it necessary to do so.

The example program of §6.5 also refers to a function getword(), which is described on page 136, and talloc(), which is described later. (Note that getword() is more general than you need; you may shorten it if you wish so that it does not “put back” the first character after each word.) For file I/O, you should use fopen(), as described in §7.5. See especially the top of page 161, which explains that if a nonexistent file is opened for writing, it is created if possible.

The main purpose of this assignment is not to discover or invent new things but rather to bring together tools, techniques, algorithms that are already available to you into a new program. Therefore, you should freely copy and adapt any functions you can find in Kernighan & Ritchie — or from other resources — provided that you cite them in your project write-up.

Deliverables

Your project submission should include:–

·        A write-up describing the principal files, functions, and data structures of your program, including citations of material you used or adapted;

·        The .c and .h files of your program;

·        At least one sample output file showing that it works; and

·        The makefile needed to build your program.

Submit your files using the following turnin command (all on one line):–

/cs/bin/turnin submit cs2301 hw5 README makefile *.c *.h output.txt

 Note that you can submit additional files at any time; for example, you can add additional samples of output, or you can modify a previously submitted file by simply resubmitting it again.

Be sure to put your name at the top of ALL files (except, obviously, the sample output files).

Submissions dated after 11:59pm on December 14 will be automatically tagged as late, and will be subject to the late homework policy.

Grading

This assignment will be graded on the following areas:–

·        documentation of all files, functions, and data structures, including citations;

·        correct makefile to build your program;

·        correctness of execution, particularly on the grader’s test cases;

·        correct use of malloc () and correctly freeing all malloc’ed data;

·        correctly building and managing dynamically allocated data structures;

·        partitioning the problem into multiple .c files and setting up appropriate .h files.

Programs must make successfully in order to receive any credit.

Extra Credit (25%)

After you have completed this assignment to your own satisfaction and submitted it, you may modify it to gather statistics about how long it takes to search a binary tree. In a separate copy with a new name hw5-extra, modify addtree() by adding two static variables, one to keep track of the number of entries in the tree and the other to keep track of the number of calls to addtree() that are needed to either find a word or add it to the binary tree. On the command line, add an optional argument of the form “-i filePath,” which, if specified, turns instrumentation on.

When instrumentation is on, your program writes two numbers to the instrumentation file for every word scanned — the size of the tree and the number of calls to addtree() needed for that word. After a run is completed, take this file into Excel or your favorite plotting program and plot number of calls versus size of the tree. The size of the tree should be plotted on a logarithmic scale. Submit the plot in PDF format, and comment on what you find in your writeup.