
Heaps

So far, we’ve considered lists, binary-search trees, and AVL trees as representations of sets. Our comparisons have focused primarily on the worst-case performance of hasElt. Depending on your application, however, checking for membership might not be your primary concern.

What if instead you needed to frequently find the smallest (or largest) element in a collection of elements? Strictly speaking, once we are considering "smallest" elements we are no longer talking about sets. However, as we said in the case of lists, it is okay to implement an interface for an unordered datatype with a data structure that has order. We will therefore continue to discuss the standard set operations (addElt, etc.), even as we consider ordering.

Consider the data structures we’ve looked at so far. How do they compare on adding elements, finding elements, and retrieving the smallest element?

                   | addElt   | hasElt  | minElt
-----------------------------------------------
Lists              | constant | linear  | linear
Sorted Lists       | linear   | linear  | constant
Binary Search Tree | linear   | linear  | linear
AVL Tree           | log      | log     | log

AVL trees seem to have the best performance if all operations are equally important. If we expect to do more minimum element fetches than other operations, however, the constant access of sorted lists is quite appealing, but the cost on the other operations is much higher. How could we get both constant-time access to the least element and good behavior on insertion of new elements?

Heaps are binary trees (not binary search trees) with a simple invariant: the smallest element in the set is at the root of the tree (alternatively, a heap could keep the largest element at the root), and the left and right subtrees are also heaps. There are no constraints on how other elements fall across the left and right subtrees. This is a much weaker invariant than we’ve seen previously. To understand whether this invariant is useful, we need to figure out the worst case running time of key operations.

Before we do that though, let’s understand the invariant and how our key set operations have to behave to preserve it. Each of the following is a valid heap on the numbers 1 through 6:

1                           1                    1
 \                         / \                  / \
  2                       2   4                3   2
   \                     /   / \                  / \
    3                   3   6   5                4   5
     \                                          /
      4                                        6
       \
        5
         \
          6
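
To make the invariant concrete, here is a small sketch of a heap node in Java together with a check of the invariant. This is illustrative only; the names HeapNode and isHeap are assumptions for this example, not part of the course code.

  // Hypothetical node representation: a value plus left/right subtrees.
  // null represents the empty heap.
  class HeapNode {
    int value;
    HeapNode left, right;

    HeapNode(int value, HeapNode left, HeapNode right) {
      this.value = value;
      this.left = left;
      this.right = right;
    }

    // The heap invariant: the root is no larger than the root of either
    // subtree, and both subtrees are themselves heaps.  Nothing is said
    // about how the remaining elements are split between left and right.
    static boolean isHeap(HeapNode h) {
      if (h == null) return true;                        // the empty tree is a heap
      if (h.left != null && h.left.value < h.value) return false;
      if (h.right != null && h.right.value < h.value) return false;
      return isHeap(h.left) && isHeap(h.right);
    }
  }

All three trees above pass this check, even the completely unbalanced one on the left.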

Recall that AVL trees achieve small worst-case running times by requiring that the subtrees at each node be balanced (differ in height by at most 1). Clearly, the heap invariant doesn’t do that (else the leftmost tree wouldn’t be a valid heap).

We often implement data structures with algorithms that maintain stronger properties than the invariant itself requires. Informally, let’s add "mostly balanced" to our implementation goals. (The "mostly" is why this is an algorithmic issue rather than something built into the invariant.)

Which operations are responsible for "mostly balanced"? Those that modify the contents of the heap, so in this case, addElt and remElt. With heaps, we are more often interested in removing the minimum element than an arbitrary element, so let’s ignore remElt in favor of remMinElt. We’ll tackle remMinElt by way of an example.

Assume you want to remMinElt from the following heap:

  1
 / \
3   2
   / \
  4   5
 /     \
6       8
       / \
      12  10

Once we remove the 1, we are left with two heaps:

3          2
          / \
         4   5
        /     \
       6       8
              / \
             12  10

We have to merge these into one heap. Clearly, 2 is the smaller of the roots; it will become the root of the new heap. The question is what the two subtrees of the new root will be. Once we remove the 2, we are left with three subtrees to consider:

3        4   5
        /     \
       6       8
              / \
             12  10

To collapse three subtrees into two, we could leave one as is and merge the other two subtrees into one. Which two should we merge and why? Think about it before reading on. [Hint: what constraint are we trying to satisfy with remMinElt?]

Remember our goal: to keep the heap "mostly balanced". That means we shouldn’t let the new subtrees get any farther apart in depth than they need to be. With that goal in mind, consider each of the three options: merge 3 with the 4-subtree, merge 3 with the 5-subtree, or merge the 4-subtree with the 5-subtree.

Which combination most closely preserves the overall depth of the heap? Merging creates a new heap that is at least as tall as the taller of the two input heaps (and possibly taller). For example, if we merge the 3 and 5 heaps, we get a new heap with 3 as the root and the entire 5 subtree as one of its subtrees (we could consider re-distributing the values in the 5-heap, but that gets expensive).

If you think about this a bit, it becomes clear that a good way to control growth of the heap on merging is to leave the largest subtree alone and merge the two shorter subtrees. This is indeed the approach. In pseudocode:

  Merge (H1, H2) {
    if H1 is empty, return H2
    else if H2 is empty, return H1
    else let newroot = min(root(H1), root(H2))
      if newroot == root(H1)
        let ST1 = H1.left
            ST2 = H1.right
            ST3 = H2
      else
        let ST1 = H2.left
            ST2 = H2.right
            ST3 = H1
      if ST1.height >= ST2.height && ST1.height >= ST3.height
        new Heap (newroot, ST1, Merge (ST2, ST3))
      else if ST2.height >= ST1.height && ST2.height >= ST3.height
        new Heap (newroot, ST2, Merge (ST1, ST3))
      else // ST3 is largest
        new Heap (newroot, ST3, Merge (ST1, ST2))
  }
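
For concreteness, here is one way the pseudocode could look as runnable Java. This is a sketch under several assumptions made for the example, not the course’s official implementation: the class name MaxiphobicHeap, storing ints, representing the empty heap as null, and caching each node’s height are all choices of the sketch.

  // Sketch: an immutable heap node holding ints; null represents the empty heap.
  class MaxiphobicHeap {
    final int root;                       // smallest element in this heap
    final MaxiphobicHeap left, right;
    final int height;                     // cached so merge can compare heights cheaply

    MaxiphobicHeap(int root, MaxiphobicHeap left, MaxiphobicHeap right) {
      this.root = root;
      this.left = left;
      this.right = right;
      this.height = 1 + Math.max(height(left), height(right));
    }

    static int height(MaxiphobicHeap h) { return h == null ? 0 : h.height; }

    // The smaller root becomes the new root; the tallest of the three
    // remaining subtrees is kept intact and the two shorter ones are merged.
    static MaxiphobicHeap merge(MaxiphobicHeap h1, MaxiphobicHeap h2) {
      if (h1 == null) return h2;
      if (h2 == null) return h1;
      MaxiphobicHeap smaller = (h1.root <= h2.root) ? h1 : h2;
      MaxiphobicHeap other   = (smaller == h1) ? h2 : h1;
      MaxiphobicHeap st1 = smaller.left, st2 = smaller.right, st3 = other;
      if (height(st1) >= height(st2) && height(st1) >= height(st3))
        return new MaxiphobicHeap(smaller.root, st1, merge(st2, st3));
      else if (height(st2) >= height(st1) && height(st2) >= height(st3))
        return new MaxiphobicHeap(smaller.root, st2, merge(st1, st3));
      else
        return new MaxiphobicHeap(smaller.root, st3, merge(st1, st2));
    }
  }

Because the nodes in this sketch are immutable, merge builds new nodes along the merge path rather than mutating either input heap; that is a choice of the sketch, not something the algorithm requires.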

How close does "mostly balanced" get to our desired goal of log time on remMinElt? We can write a formula to capture how much time Merge takes relative to the number of items (n) in the heap:

T(n) <= T(2n/3) + c

This says that if the original pair of heaps contain n nodes in total, the recursive call to Merge considers at most 2n/3 of them. Why? The algorithm keeps the largest subtree intact, and that subtree accounts for at least one third of the nodes (otherwise it wouldn’t be the largest), which leaves at most 2n/3 nodes for the recursive Merge.

If you solve this recurrence (something you will cover if you take CS2223), you find that it yields the desired log n worst-case time.
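
For intuition (the formal technique is CS2223 material), we can unroll the recurrence by hand; the step count below is a back-of-the-envelope sketch, not a proof:

  T(n) <= T(2n/3) + c
       <= T((2/3)^2 * n) + 2c
       ...
       <= T(1) + c * log_{3/2}(n)    after about log_{3/2}(n) steps, since (2/3)^k * n shrinks to 1

Since log_{3/2}(n) is just a constant factor times log n, Merge (and hence remMinElt) takes time proportional to log n in the worst case.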

What about addElt? Imagine that I want to add the new element 3 (shown below as a one-element heap on the left) to the heap shown on the right:

3          2
          / \
         4   5
        /     \
       6       8
              / \
             12  10

Wait, isn’t this pair of pictures familiar from remMinElt? Yes. Adding an element is equivalent to merging a single-element heap with the existing heap. So addElt runs in log n time as well.
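
Continuing the hypothetical MaxiphobicHeap sketch from earlier (the method names here are assumptions, not the course’s), all three operations in our table reduce to merge or a root lookup:

  // These would live inside the MaxiphobicHeap class sketched above.
  static MaxiphobicHeap addElt(MaxiphobicHeap h, int elt) {
    // adding an element = merging a one-element heap with the existing heap
    return merge(h, new MaxiphobicHeap(elt, null, null));
  }

  static int minElt(MaxiphobicHeap h) {
    return h.root;            // constant time: the minimum is always at the root (assumes h is non-empty)
  }

  static MaxiphobicHeap remMinElt(MaxiphobicHeap h) {
    // removing the root leaves two heaps; merging them rebuilds a single heap (assumes h is non-empty)
    return merge(h.left, h.right);
  }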

Heaps constructed by this "keep the largest subtree intact" method are called Maxiphobic heaps (the name is suggestive: when merging, avoid processing the largest sub-heap).

To wrap up, let’s add Maxiphobic Heaps to our running table:

                   | addElt   | hasElt  | minElt
-----------------------------------------------
Lists              | constant | linear  | linear
Sorted Lists       | linear   | linear  | constant
Binary Search Tree | linear   | linear  | linear
AVL Tree           | log      | log     | log
Maxiphobic Heap    | log      | linear  | constant

1 Summary

We’ve looked at several possible implementations of sets, considering both algorithmic and program design issues. What have we seen?

What have we not done?

2 Misc

A reference on maxiphobic heaps: http://www.eecs.usma.edu/webs/people/okasaki/sigcse05.pdf. Not required, just here for reference.