Heaps
So far, we’ve considered lists, binary-search trees, and AVL trees as representations of sets. Our comparisons have focused primarily on the worst-case performance of hasElt. Depending on your application, however, checking for membership might not be your primary concern.
What if instead you needed to frequently find the smallest (or largest) element in a collection of elements? Strictly speaking, once we are considering "smallest" elements we are no longer talking about sets. However, as we said in the case of lists, it is okay to implement an interface for an unordered datatype with a data structure that has order. We will therefore continue to discuss the standard set operations (addElt, etc.), even as we consider ordering.
Consider the data structures we’ve looked at so far. How do they compare on adding elements, finding elements, and retrieving the smallest element?
                   | addElt   | hasElt | minElt   |
-------------------+----------+--------+----------+
Lists              | constant | linear | linear   |
Sorted Lists       | linear   | linear | constant |
Binary Search Tree | linear   | linear | linear   |
AVL Tree           | log      | log    | log      |
AVL trees seem to have the best performance if all operations are equally important. If we expect to do more minimum element fetches than other operations, however, the constant access of sorted lists is quite appealing, but the cost on the other operations is much higher. How could we get both constant-time access to the least element and good behavior on insertion of new elements?
Heaps are binary trees (not binary search trees) with a simple invariant: the smallest element in the set is at the root of the tree (equivalently, the root could contain the largest element), and the left and right subtrees are also heaps. There are no constraints on how other elements fall across the left and right subtrees. This is a much weaker invariant than we’ve seen previously. To understand whether this invariant is useful, we need to figure out the worst case running time of key operations.
Before we do that though, let’s understand the invariant and how our key set operations have to behave to preserve it. Each of the following is a valid heap on the numbers 1 through 6:
1               1         1
 \             / \       / \
  2           2   4     3   2
   \         /   / \       / \
    3       3   6   5     4   5
     \                   /
      4                 6
       \
        5
         \
          6
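The invariant can be captured directly in code. Here is a minimal sketch, using a bare-bones Node class of our own (the names Node and isHeap are not from these notes), where null stands for an empty heap:

```java
// A minimal invariant checker for heaps, as a sketch.
class Node {
    int data;
    Node left, right;               // null represents an empty subtree

    Node(int data, Node left, Node right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }
}

class HeapCheck {
    // A tree is a heap if its root is no larger than the roots of both
    // subtrees, and both subtrees are themselves heaps. Note that no
    // ordering is required between the left and right subtrees.
    static boolean isHeap(Node n) {
        if (n == null) return true;                         // empty heap
        if (n.left != null && n.left.data < n.data) return false;
        if (n.right != null && n.right.data < n.data) return false;
        return isHeap(n.left) && isHeap(n.right);
    }
}
```

All three example heaps above pass this check, even though their shapes differ wildly.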
In practice, however, heaps are implemented to be "mostly balanced", in a way that is sufficient to make them have log-time performance on addElt.
This is as far as we went on the formal lecture on heaps. The rest of these notes provide details about how heaps work for those who are interested. The Summary section at the end gives some general principles about programming data structures that are useful for everyone. The sections between this and the summary are just for those who are interested in this material.
1 How the Heap Operations Work
We often implement data structures through algorithms with better performance guarantees than those required in the invariants. Informally, let’s add "mostly balanced" to our implementation goals. (The "mostly" is why this is an algorithmic issue rather than built into the invariant).
Which operations are responsible for "mostly balanced"? Those that modify the contents of the heap, so in this case, addElt and remElt. With heaps, we are more often interested in removing the minimum element than an arbitrary element, so let’s ignore remElt in favor of remMinElt. We’ll tackle remMinElt by way of an example.
Assume you want to remMinElt from the following heap:
    1
   / \
  3   2
     / \
    4   5
   / \
  6   8
 / \
12  10
The 1 at the root must come out. Removing it leaves us with two heaps: the left and right subtrees of the old root:

3     2
     / \
    4   5
   / \
  6   8
 / \
12  10
To combine these into one heap, the smaller of the two roots, 2, becomes the new root. Pulling the 2 up out of its heap leaves us with three heaps to fit under it: the 3 heap and the two subtrees of the 2:

3     4     5
     / \
    6   8
   / \
  12  10
Remember our goal: to keep the heap "mostly balanced". That means that we shouldn’t let the new subtrees get any farther apart in depth than they need to be. With that goal in mind, consider each of the three options:
merge the 3 and 4 subtrees and leave 5 alone
merge the 3 and 5 subtrees and leave 4 alone
merge the 4 and 5 subtrees and leave 3 alone
Which combination most closely preserves the overall depth of the heap? Merging creates a new heap that is at least as tall as the taller of the two input heaps (and possibly taller). For example, if we merge the 3 and 5 heaps, we would get a new heap with 3 as the root and the entire 5 subtree as a heap (we could consider re-distributing the values in the 5-heap, but that gets expensive).
If you think about this a bit, it becomes clear that a good way to control growth of the heap on merging is to leave the largest subtree alone and merge the two shorter subtrees. This is indeed the approach. In pseudocode (a sketch, not exact Java code):
Merge(H1, H2) {
  if H1 is empty, return H2
  else if H2 is empty, return H1
  else let newroot = min(root(H1), root(H2))
       if newroot == root(H1)
         let ST1 = H1.left
             ST2 = H1.right
             ST3 = H2
       else
         let ST1 = H2.left
             ST2 = H2.right
             ST3 = H1
       if ST1.height >= ST2.height && ST1.height >= ST3.height
         new Heap(newroot, ST1, Merge(ST2, ST3))
       else if ST2.height >= ST1.height && ST2.height >= ST3.height
         new Heap(newroot, ST2, Merge(ST1, ST3))
       else // ST3 is the tallest
         new Heap(newroot, ST3, Merge(ST1, ST2))
}
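The pseudocode translates almost directly into running Java. The sketch below uses a single nullable Heap class of our own (the notes' IHeap/MtHeap/DataHeap design later in these notes would work just as well), and recomputes heights on each call for simplicity rather than storing them:

```java
// A runnable sketch of the Merge pseudocode above.
class Heap {
    int data;
    Heap left, right;               // null represents the empty heap

    Heap(int data, Heap left, Heap right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    // Recomputed here for simplicity; a real implementation
    // would store the height in each node.
    static int height(Heap h) {
        if (h == null) return 0;
        return 1 + Math.max(height(h.left), height(h.right));
    }

    static Heap merge(Heap h1, Heap h2) {
        if (h1 == null) return h2;
        if (h2 == null) return h1;
        // the smaller of the two roots becomes the new root
        Heap smaller = (h1.data <= h2.data) ? h1 : h2;
        Heap other   = (h1.data <= h2.data) ? h2 : h1;
        Heap st1 = smaller.left, st2 = smaller.right, st3 = other;
        // keep the tallest of the three subtrees intact;
        // recursively merge the two shorter ones
        if (height(st1) >= height(st2) && height(st1) >= height(st3))
            return new Heap(smaller.data, st1, merge(st2, st3));
        else if (height(st2) >= height(st1) && height(st2) >= height(st3))
            return new Heap(smaller.data, st2, merge(st1, st3));
        else
            return new Heap(smaller.data, st3, merge(st1, st2));
    }

    // remMinElt drops the root and merges what remains
    static Heap remMinElt(Heap h) {
        return merge(h.left, h.right);
    }
}
```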
2 Another Example of Merging
Let’s try one more example of remMinElt and merge. What if you call remMinElt on the following heap?:
    5
   / \
  6   8
 / \
12  10
We remove the 5. The smaller root between the left and right subtrees is 6, which means 6 becomes the new root and we have to merge across the following three heaps:

12   10   8
Merge says "pick the shortest two". But all three have the same height! That means that any of the following three heaps (and more!) could be a correct answer:
  6          6          6
 / \        / \        / \
12  8      8   10     8  10
   /        \             \
  10         12            12
The merge code we sketched out above returns exactly one of these, but the point is that any of these should be considered a correct answer. If you were using someone else’s heap code, you couldn’t predict which tree you would get back. Not having a single guaranteed answer makes tasks like testing more challenging (we’ll return to that later).
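One way to cope is to test properties of the result rather than its exact shape: the answer must still satisfy the heap invariant, and it must contain exactly the right elements. Here is a sketch, again using a small nullable Node class of our own (not the notes' code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class Node {
    int data;
    Node left, right;               // null represents an empty heap

    Node(int data, Node left, Node right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }
}

class HeapProperties {
    // Property 1: the heap invariant holds at every node.
    static boolean isHeap(Node n) {
        if (n == null) return true;
        if (n.left != null && n.left.data < n.data) return false;
        if (n.right != null && n.right.data < n.data) return false;
        return isHeap(n.left) && isHeap(n.right);
    }

    // Property 2: the result contains exactly the expected elements.
    // Returning them in sorted order makes comparisons shape-independent.
    static List<Integer> elements(Node n) {
        List<Integer> out = new ArrayList<>();
        collect(n, out);
        Collections.sort(out);
        return out;
    }

    private static void collect(Node n, List<Integer> out) {
        if (n == null) return;
        out.add(n.data);
        collect(n.left, out);
        collect(n.right, out);
    }
}
```

Any of the three answer trees above passes both properties, so a test written this way accepts all correct answers.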
3 Performance
How close does "mostly balanced" get to our desired goal of log time on remMinElt? We can write a formula to capture how much time Merge takes relative to the number of items (n) in the heap:
T(n) <= T(2n/3) + c
This says that if the original pair of heaps have a total of n nodes combined, the recursive call to Merge considers at most 2n/3 nodes. Why? The algorithm keeps the tallest subtree intact. By definition, that subtree has at least 1/3 of the total nodes (otherwise it wouldn’t be the tallest). This leaves at most 2n/3 nodes to merge in the next call.
If you solve this equation (something you will cover if you take CS2223), you find that this recurrence yields the desired log n worst-case time.
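To see where the log comes from, we can unroll the recurrence (a sketch, with c the constant amount of work done outside the recursive call):

```latex
T(n) \le T\!\left(\tfrac{2}{3}n\right) + c
     \le T\!\left(\left(\tfrac{2}{3}\right)^{2}n\right) + 2c
     \le \cdots
     \le T\!\left(\left(\tfrac{2}{3}\right)^{k}n\right) + kc
```

The recursion bottoms out when (2/3)^k n is about 1, i.e., after k = log_{3/2} n steps, giving T(n) = O(log n).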
4 addElt
What about addElt? Imagine that we want to add the new element 3 (shown on the left) to the heap shown on the right:
3     2
     / \
    4   5
   / \
  6   8
 / \
12  10
As with remMinElt, the work is done by merge: treat the new element 3 as a one-element heap and merge it with the existing heap. Since merge runs in log time, so does addElt.
Heaps constructed by this "keep the tallest subtree intact" method are called maxiphobic heaps (the name is suggestive: avoid processing the large sub-heap when merging).
To wrap up, let’s add Maxiphobic Heaps to our running table:
                   | addElt   | hasElt | minElt   |
-------------------+----------+--------+----------+
Lists              | constant | linear | linear   |
Sorted Lists       | linear   | linear | constant |
Binary Search Tree | linear   | linear | linear   |
AVL Tree           | log      | log    | log      |
Maxiphobic Heap    | log      | linear | constant |
5 Implementing Heaps
What if we wanted to actually write code for heaps, rather than just sketch out the ideas as we have done here? This section sketches out how you would go about doing this, for those who aren’t sure how to proceed.
First, heaps are trees. So we should expect to need an interface and two classes (one for empty heaps and one for heaps with at least one number):
interface IHeap {
  // adds given element to the heap without removing other elements
  IHeap addElt(int e);
  // removes one occurrence of the smallest element from the heap
  IHeap remMinElt();
  // returns the size of the heap
  int size();
}

class MtHeap implements IHeap {
  MtHeap(){}
}

class DataHeap implements IHeap {
  int data;
  IHeap left;
  IHeap right;

  DataHeap(int data, IHeap left, IHeap right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }
}
Next, think about the methods. The size should be straightforward: an empty heap has size 0, while the size of a non-empty heap is the sum of the sizes of the left and right heaps, adding one more for the current node.
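Following that description, size could be filled in as below. This is a sketch that repeats the skeleton so the example stands alone, trimming the interface to just the size method:

```java
// A sketch of size: 0 for the empty heap, and
// 1 + size(left) + size(right) for a non-empty heap.
interface IHeap {
    // returns the size of the heap
    int size();
}

class MtHeap implements IHeap {
    MtHeap() {}

    public int size() {
        return 0;                               // an empty heap has no elements
    }
}

class DataHeap implements IHeap {
    int data;
    IHeap left;
    IHeap right;

    DataHeap(int data, IHeap left, IHeap right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    public int size() {
        // count this node, plus everything in both subtrees
        return 1 + left.size() + right.size();
    }
}
```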
The other methods rely on merge, as we have already discussed. We need to turn merge into a method, then use it to implement the other methods. For example, the code for addElt is as follows for the non-empty heap class:
public IHeap addElt(int e) {
  return this.merge(new DataHeap(e, new MtHeap(), new MtHeap()));
}
What is it for the empty heap class? See if you can fill it in.
Writing the heaps code is a good exercise for those who need practice working with trees in Java.
6 Misc
A reference on maxiphobic heaps: http://www.eecs.usma.edu/webs/people/okasaki/sigcse05.pdf. Not required, just here for reference.
7 Summary
We’ve looked at several possible implementations of sets, considering both algorithmic and program design issues. What have we seen?
Algorithmic: choose your algorithm based on the profile of operations you expect to run most/least often (each has different properties as summarized by the above table).
Program Design: Many data structures that you encounter are a combination of a data definition (for shape) and an invariant on the contents within that shape. The core code for functions over the data structure follows the data-definition template as we have always done (for lists or trees, in our sets examples). The invariant gets built into the implementations atop the core template code.
Algorithmic: Once you know the interface you are trying to implement, invariants often suggest the corresponding algorithms.
Program Design: Document how your code maintains the invariant.
What have we not done?
Algorithmic: Choosing a good data structure is much more subtle than looking at worst-case running times. CS 2223 (Algorithms) will continue your study of this topic in more detail.
Program Design: Given the common core of all these set implementations in basic binary trees, shouldn’t we be able to share some of the code across these implementations? Tune in later ...