### CS2223 Algorithms Homework 4 - D Term 2009

#### PROF. CAROLINA RUIZ

Due Date: April 14, 2009 at 12:30 (code) and 1:00 (written homework) PM
See course policy regarding late homework submissions

• Homework Instructions:
• Read Chapter 4 of the textbook in detail.
• See Useful logarithm and exponential facts.
• This is an individual homework. No collaboration is allowed.
• Submission Instructions:
1. Code Submission: To submit, login to a CCC unix machine. If the files that you plan to submit are not already in a directory on your CCC Unix account, transfer them there first. Then use the following command to submit your files:
/cs/bin/turnin submit cs2223 hw4 yourfile1 yourfile2 ... yourfilen
replace yourfile1 yourfile2 ... yourfilen with the actual list of files you want to submit. Include a README.txt file providing information on what files are contained in your submission, what each one contains, and how to run your program.

We suggest that you test the submission process AS SOON AS POSSIBLE by creating a text file named JustATest.txt and submitting it using turnin:

/cs/bin/turnin submit cs2223 hw4 JustATest.txt

2. Written Homework: Submit a hardcopy of your written homework solutions to Problem 1 by the beginning of class (i.e., 1:00 pm) on the day the homework is due.
• See the course webpage for the late homework submission policy.

• Homework Problems:

1. (40 Points) Problem 1. Suppose that your MP3 player has a disk capacity of M megabytes. Also assume that you have the following n songs available for download: S1, S2, S3, ..., Sn. The ith song Si takes up mi megabytes. The problem is that your MP3 player capacity, M, is less than the sum of all the mi's. That is, M < Σni=1 mi.
1. (20 Points) Assume that you want to maximize the number of songs that you download onto your MP3 player. Prove or give a counterexample: A greedy algorithm that selects songs to download in increasing size order (from smallest to largest) will produce an optimal solution.
2. (20 Points) Assume that you want to maximize the disk utilization of your MP3 player (that is, you want to use as many of your M megabytes as possible). Prove or give a counterexample: A greedy algorithm that selects songs to download in decreasing size order (from largest to smallest) will produce an optimal solution.

2. (160 points) Huffman's Optimal Prefix Codes Algorithm
Before you can solve this exercise, you need to read Sections 2.5 and 4.8 of the textbook in detail. In this problem you are asked to implement Huffman's Optimal Prefix Codes algorithm. Your implementation should satisfy ALL of the specifications below:

1. Your program should be written in Java, C, or C++. Choose one.

2. (5 Points) Input specification. Your program must receive one string as its only command line argument. This string is the name of a text file (e.g., mytest.txt, webpage.html, test) Your program must open this text file, read in the text in the file, and count the frequency of each of the 26 letters a, b, c, ..., z. Remember to divide the number of occurrences of each of these letters by the total number of the letters in the text file. For the purporse of this homework, ignore all symbols that are not letters of the alphabet (e.g., punctuation symbols, numbers, etc.) that appear in the text file. Do not include them in any of the counts (not even in the total number of letters in the text file). Also both capital and lower case letters contribute to the count (i.e., "a" and "A" both contribute to the count of letter "a").

The following are some of the text files that we will use to test your program. We will also use other test files not included here.

3. (40 Points) Algorithm Your program should be a faithful implementation of the psedo-code of the Huffman's algorithm provided below.
```Input:

A string specifing the name of the input file, as described in detail in the "Input specification" section above.

Output:

See the "Output specification" section below.

Algorithm:

Huffman() {

Variables and Data Structures used in the code:

n: the number of symbols to consider. In this case 26.
total_letters: the total number of letter occurrences in the input file.
An array S of n positions where S[i] is the ith symbol in S, 1 ≤ i ≤ n.
An array P of n positions where P[i] is the frequency of the symbol S[i].
An array Code of n positions where Code[i] contains the encoding of the symbol S[i].

The encoding of a symbol is just a sequence of 0's and 1's

A binary tree T

Implement your own binary trees as follows. Define a node in a binary tree
either as a public class (Java) or a struct (C, C++) treeNode with the following
characteristics:

A treeNode "tn" object has four fields (public instance variables in Java):
tn.symbol : a string
tn.freq   : a real number
tn.left   : another treeNode object (or pointer to it in C) which is the left child
of tn in the tree (or "nil"/"null" if tn doesn't have a left child)
tn.right  : another treeNode object (or pointer to it in C) which is the right child
of tn in the tree (or "nil"/"null" if tn doesn't have a right child)

Hence, a binary tree is simply given by its root node which is an object of class treeNode

In the pseudo-code below the variables v, x, z, y, and T are treeNode objects
(or in other words, variables of type treeNode)

A priority queue PQ

A priority queue is just a heap with added functionality (see Part 4 below).
A heap is an array PQ[1...max] together with an
integer value PQ.length where:
- Each cell of the array, PQ[i], is a treeNode object.
- PQ.length tells how many cells in PQ are being used (0 ≤ PQ.length ≤ max).
Initially PQ.length should be initialized to 0,
it should be incremented by one any time a treeNode object is inserted into PQ,
and it should be decremented by one any time a treeNode object is extracted from PQ.
- For the purpose of the pseudo-code below, max=26 would suffice

As Figure 2.3 of the textbook (p. 60) shows, one can "visualize" the elements
in the PQ array as being nodes of a binary tree. (The root of the PQ "tree" is PQ[1],
and the two children of the element PQ[i] are PQ[2i] and P[2i+1].)
In the case of the Huffman algorithm below, each node in this PQ "virtual tree
structure" is itself a binary tree of type treeNode. That is, in this case,
each element in the array PQ is a prefix code tree (three examples of prefix code
trees are given in Figure 4.16 p. 168 of the textbook).
In summary, the PQ in this pseudo code can be "visualized" as a tree in which
each node is a tree itself. Complicated enough? :-)

initialize the array S with the letters a, b, ...., z

initialize the array P and the total_letters with 0's (initial letter counts)

read the input text file and update the total_letters and P[.] counts
as you read in the text

// For each symbol s in S, store in P[s] the frequency of s in the text file: //
For i := 1 to n {
P[i] := P[i]
}

create an empty priority queue PQ

// Create a new treeNode object v for each symbol in S and insert them into PQ: //
For i := 1 to n {
create a new tree node v
v.symbol := S[i]
v.freq := P[i]
v.left := nil
v.right := nil
insert v with key value v.freq into PQ
}

// Construct a binary tree (prefix code tree) with the symbols in S stored in PQ as follows: //
For i := 1 to n-1 {
create a new tree node z
x := ExtractMin(PQ). That is, extract the tree x with the minimum frequency from PQ
z.right := x
y := ExtractMin(PQ). That is, extract the tree y with the minimum frequency from PQ
z.left := y
z.freq := x.freq + y.freq
insert z with key value z.freq into PQ
}

extract the minimum (and only) tree T remaining in PQ

generate the codes for each symbol in S by following the
tree T structure in a depth-1st manner assigning "0" to the
left child of the node and "1" to the right child of the node
until all the leaves are assigned a code
(as shown in Figure 4.16 of the textbook)
Once that a code is assigned to a symbol in S, say to the symbol S[i], store such
code in the corresponding array position Code[i].

Compute "ABL", that is the average number of bits per letter.

Output the code generated for each letter in S and also output the value "ABL"

Output the encoding the full input file (ignoring symbols different
from the 26 letters of the alphabet) in a new file with the same name of the
original input file and with the added extension ".huffman"

}

```

4. (65 Points) Required Data Structures: Heaps and Priority Queues. Your program should use the following data structures:

• (65 Points) A priority queue PQ to handle the letters/strings under consideration by the Huffman's algorithm. The key of each letter/string in this priority queue is its frequency.

Your program should implement Heaps and based on them, Priority Queues, satisfying all the specifications below.

• (10 Points) The heaps should rely only on the plain array data structure in the programming language that you chose. The priority queues should rely only on the heap data structure you implemented. Using heaps and/or priority queues provided by the programming laguage that you chose is not allowed. You need to write your own implementations.

• (20 Points) Your implementation of heaps should include the following functions faithfully implemented from their descriptions in Section 2.5 of the textbook.
• (10 Points) Heapify-up(H,i)
• (10 Points) Heapify-down(H,i)

• (35 Points) Your implementation of priority queues should include the following functions faithfully implemented from their descriptions in Section 2.5 of the textbook.
• (5 Points) StartHeap(N)
• (5 Points) Insert(H,v)
• (5 Points) FindMin(H)
• (5 Points) Delete(H,i)
• (5 Points) ExtractMin(H)
• (10 Points) ChangeKey(H,v,alpha)

5. (10 Points) Output specification. Your program should output the following information:
1. For each letter of the alphabet:
• The letter, its frequency in the input text file, and the code assigned to this letter by your algorithm.
• The average number of bits per letter used by the encoding.
A sample output formatting is given below (with made-up values):
```Letter   Frequency     Encoding
a        0.002         110
b        0.008         10
c        0.001         1110
d        0.0003        11110
e        0.06          0
...      ...           ...
z        0.000001      11111111

Average number of bits per letter = 7.3
```
2. A new file containing the encoding of the full input file (ignoring symbols different from the 26 letters of the alphabet) using the codes generated by the algorithm. The name of the output file should be the original name of the input file concatenated with the extension ".huffman". For instance, if the name of the original input file was "webpage.html" then your output file should be named "webpage.html.huffman".

6. (40 Points)Testing. Your code will be tested with several different input files as described above in the Input Specification section.