Homework 4 - D Term 2009

See course policy regarding late homework submissions

**Homework Instructions:**- Read Section 2.5 of the textbook in detail so learn more about the implementation of priority queues using heaps.
- Read Chapter 4 of the textbook in detail.
- See Useful logarithm and exponential facts.
- This is an individual homework. No collaboration is allowed.
- Submission Instructions:
- Code Submission:
To submit, login to a CCC unix machine. If the files that you plan to
submit are not already in a directory on your CCC Unix account,
transfer them there first.
Then use the following command to submit your files:
/cs/bin/turnin submit cs2223 hw4

replace*yourfile1 yourfile2 ... yourfilen**yourfile1 yourfile2 ... yourfilen*with the actual list of files you want to submit. Include a README.txt file providing information on what files are contained in your submission, what each one contains, and how to run your program.We suggest that you test the submission process AS SOON AS POSSIBLE by creating a text file named JustATest.txt and submitting it using turnin:

/cs/bin/turnin submit cs2223 hw4 JustATest.txt

See Using the turnin Program for more information and help. - Written Homework: Submit a hardcopy of your written homework solutions to Problem 1 by the beginning of class (i.e., 1:00 pm) on the day the homework is due.

- Code Submission:
To submit, login to a CCC unix machine. If the files that you plan to
submit are not already in a directory on your CCC Unix account,
transfer them there first.
Then use the following command to submit your files:
**Remember to show your work and explain your answers.**Correct answers without complete explanations won't receive full credit. Incorrect answers with explanations may receive partial credit.- See the course webpage for the late homework submission policy.

**Homework Problems:****(40 Points) Problem 1**. Suppose that your MP3 player has a disk capacity of M megabytes. Also assume that you have the following n songs available for download: S_{1}, S_{2}, S_{3}, ..., S_{n}. The i^{th}song S_{i}takes up m_{i}megabytes. The problem is that your MP3 player capacity, M, is less than the sum of all the m_{i}'s. That is, M < Σ^{n}_{i=1}m_{i}.**(20 Points)**Assume that you want to maximize the number of songs that you download onto your MP3 player. Prove or give a counterexample: A greedy algorithm that selects songs to download in increasing size order (from smallest to largest) will produce an optimal solution.**(20 Points)**Assume that you want to maximize the disk utilization of your MP3 player (that is, you want to use as many of your M megabytes as possible). Prove or give a counterexample: A greedy algorithm that selects songs to download in decreasing size order (from largest to smallest) will produce an optimal solution.

**(160 points) Huffman's Optimal Prefix Codes Algorithm**

Before you can solve this exercise, you need to read Sections 2.5 and 4.8 of the textbook in detail. In this problem you are asked to implement Huffman's Optimal Prefix Codes algorithm. Your implementation should satisfy**ALL**of the specifications below:- Your program should be written in Java, C, or C++. Choose one.
**(5 Points) Input specification**. Your program must receive one string as its only command line argument. This string is the name of a text file (e.g., mytest.txt, webpage.html, test) Your program must open this text file, read in the text in the file, and count the frequency of each of the 26 letters a, b, c, ..., z. Remember to divide the number of occurrences of each of these letters by the total number of the letters in the text file. For the purporse of this homework, ignore all symbols that are not letters of the alphabet (e.g., punctuation symbols, numbers, etc.) that appear in the text file. Do not include them in any of the counts (not even in the total number of letters in the text file). Also both capital and lower case letters contribute to the count (i.e., "a" and "A" both contribute to the count of letter "a").The following are some of the text files that we will use to test your program. We will also use other test files not included here.

**(40 Points) Algorithm**Your program should be a faithful implementation of the psedo-code of the Huffman's algorithm provided below.**Input:**- A string specifing the name of the input file, as described in detail in the "Input specification" section above.

**Output:**- See the "Output specification" section below.

**Algorithm:**Huffman() { Variables and Data Structures used in the code:- n: the number of symbols to consider. In this case 26.
- total_letters: the total number of letter occurrences in the input file.
- An array S of n positions where S[i] is the ith symbol in S, 1 ≤ i ≤ n.
- An array P of n positions where P[i] is the frequency of the symbol S[i].
- An array Code of n positions where Code[i] contains the encoding of the symbol S[i]. The encoding of a symbol is just a sequence of 0's and 1's
- A binary tree T Implement your own binary trees as follows. Define a node in a binary tree either as a public class (Java) or a struct (C, C++) treeNode with the following characteristics: A treeNode "tn" object has four fields (public instance variables in Java): tn.symbol : a string tn.freq : a real number tn.left : another treeNode object (or pointer to it in C) which is the left child of tn in the tree (or "nil"/"null" if tn doesn't have a left child) tn.right : another treeNode object (or pointer to it in C) which is the right child of tn in the tree (or "nil"/"null" if tn doesn't have a right child) Hence, a binary tree is simply given by its root node which is an object of class treeNode In the pseudo-code below the variables v, x, z, y, and T are treeNode objects (or in other words, variables of type treeNode)
- A priority queue PQ A priority queue is just a heap with added functionality (see Part 4 below). A heap is an array PQ[1...max] together with an integer value PQ.length where: - Each cell of the array, PQ[i], is a treeNode object. - PQ.length tells how many cells in PQ are being used (0 ≤ PQ.length ≤ max). Initially PQ.length should be initialized to 0, it should be incremented by one any time a treeNode object is inserted into PQ, and it should be decremented by one any time a treeNode object is extracted from PQ. - For the purpose of the pseudo-code below, max=26 would suffice As Figure 2.3 of the textbook (p. 60) shows, one can "visualize" the elements in the PQ array as being nodes of a binary tree. (The root of the PQ "tree" is PQ[1], and the two children of the element PQ[i] are PQ[2i] and P[2i+1].) In the case of the Huffman algorithm below, each node in this PQ "virtual tree structure" is itself a binary tree of type treeNode. That is, in this case, each element in the array PQ is a prefix code tree (three examples of prefix code trees are given in Figure 4.16 p. 168 of the textbook). In summary, the PQ in this pseudo code can be "visualized" as a tree in which each node is a tree itself. Complicated enough? :-)

**(65 Points) Required Data Structures: Heaps and Priority Queues**. Your program should use the following data structures:**(65 Points)**A priority queue PQ to handle the letters/strings under consideration by the Huffman's algorithm. The*key*of each letter/string in this priority queue is its frequency.Your program should implement Heaps and based on them, Priority Queues, satisfying all the specifications below.

**(10 Points)**The heaps should rely only on the plain array data structure in the programming language that you chose. The priority queues should rely only on the heap data structure you implemented.**Using heaps and/or priority queues provided by the programming laguage that you chose is not allowed. You need to write your own implementations.****(20 Points)**Your implementation of heaps should include the following functions faithfully implemented from their descriptions in Section 2.5 of the textbook.**(10 Points)**Heapify-up(H,i)**(10 Points)**Heapify-down(H,i)

**(35 Points)**Your implementation of priority queues should include the following functions faithfully implemented from their descriptions in Section 2.5 of the textbook.**(5 Points)**StartHeap(N)**(5 Points)**Insert(H,v)**(5 Points)**FindMin(H)**(5 Points)**Delete(H,i)**(5 Points)**ExtractMin(H)**(10 Points)**ChangeKey(H,v,alpha)

**(10 Points) Output specification**. Your program should output the following information:- For each letter of the alphabet:
- The letter, its frequency in the input text file, and the code assigned to this letter by your algorithm.
- The average number of bits per letter used by the encoding.

Letter Frequency Encoding a 0.002 110 b 0.008 10 c 0.001 1110 d 0.0003 11110 e 0.06 0 ... ... ... z 0.000001 11111111 Average number of bits per letter = 7.3

- A new file containing the encoding of the full input file (ignoring symbols different from the 26 letters of the alphabet) using the codes generated by the algorithm. The name of the output file should be the original name of the input file concatenated with the extension ".huffman". For instance, if the name of the original input file was "webpage.html" then your output file should be named "webpage.html.huffman".

- For each letter of the alphabet:
**(40 Points)**Testing. Your code will be tested with several different input files as described above in the Input Specification section.