Data Structures for Sets
Starting with this lecture, we will discuss three concrete data structures that you can use to implement sets. Our goal is for you to understand what these data structures look like, how they provide each of our core set operations, and when each is a good choice in a problem. We will not emphasize how to implement these in Java (though doing so is a good exercise for those who wish to practice Java programming).
1 Implementing Sets: Why Not Use Lists?
One obvious way to implement sets is to use lists. Even though we haven’t yet covered lists in Java, you know about lists from some programming language you have used before. Roughly speaking, how would you implement sets with lists? Let’s consider each operation:
addElt would add the element to the list (e.g., cons in Racket, add or addFirst in Java)
remElt would require a method that traverses the list and removes the given element
size would return the length of the list
hasElt would traverse the list to check for the element
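As a rough sketch of this approach in Java (the class and method bodies here are our own illustration, not code from the lecture), a list-backed set might look like the following. Note that addElt checks for duplicates so that size reports the number of distinct elements:

```java
import java.util.LinkedList;

// A sketch of a list-backed set of integers. The name ListSet is ours.
class ListSet {
    private final LinkedList<Integer> elts = new LinkedList<>();

    // add to the front of the list (like cons), unless already present
    public void addElt(int e) {
        if (!hasElt(e)) elts.addFirst(e);
    }

    // traverse the list and remove the given element, if present
    public void remElt(int e) {
        elts.remove(Integer.valueOf(e)); // remove by value, not by index
    }

    // size is just the length of the underlying list
    public int size() {
        return elts.size();
    }

    // linear search: may have to examine every element
    public boolean hasElt(int e) {
        return elts.contains(e);
    }
}
```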
Are lists a good choice for implementing sets? It’s an intuitive approach, but we should think about how long it takes to perform our key operations. Consider hasElt: to determine whether an element is in a list, we have to (potentially) traverse the entire list. The time to perform hasElt is therefore proportional to the length of the list. Formally, we say that the time required for hasElt is linear in the size of the list.
For large sets, the linear time required to search can be prohibitive. We therefore would like some other data structure in which hasElt runs faster. How might we do this? Think about why hasElt is slow: as you traverse the list, you compare the current list element to the one you are looking for. If you did not find it, you throw away that item and go on to the next. On each comparison, you throw away at most one element, so you potentially have to check all of them. If you want better performance on hasElt, you need an approach that throws away more than one element after each (failed) comparison.
This sounds like a job for trees. If the data is placed in a tree in a useful way, we should be able to discard some portion of the tree (such as the left or right subtree) after each search. There are several tree-based data structures with different placements of data. We will contrast three in detail, using them to illustrate how we begin to account for efficiency concerns in program design, along with some details on how to implement them in Java.
2 Implementing Sets with Binary Search Trees
To simplify the presentation, we will switch to sets of numbers, rather than sets of Strings (though what we present will work just fine for Strings, comparing them alphabetically rather than numerically).
A binary search tree (BST) is a binary tree in which every node has the following property:
Every element in the left subtree is smaller than the element at the root, every element in the right subtree is larger than the element at the root, and both the left and right subtrees are BSTs.
Constraints on all instances of a data structure are called invariants. Invariants are an essential part of program design. Often, we define a new data structure via an invariant on an existing data structure. When we discuss implementations of data structures, we’ll talk about how invariants reflect in your code.
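To make the invariant concrete, here is a sketch in Java of how one might check it (the Node class and names are our own, not from the lecture). The key idea is to thread lower and upper bounds down the tree, so that each element is checked against all of its ancestors, not just its parent:

```java
// A hypothetical node type for a binary tree of ints; null represents
// the empty tree. isBST verifies the BST invariant.
class BSTCheck {
    static class Node {
        int data; Node left, right;
        Node(int data, Node left, Node right) {
            this.data = data; this.left = left; this.right = right;
        }
    }

    static boolean isBST(Node t) {
        // long bounds avoid edge cases at Integer.MIN_VALUE/MAX_VALUE
        return isBST(t, Long.MIN_VALUE, Long.MAX_VALUE);
    }

    // every element in t must lie strictly between lo and hi
    private static boolean isBST(Node t, long lo, long hi) {
        if (t == null) return true;          // the empty tree is a BST
        if (t.data <= lo || t.data >= hi) return false;
        return isBST(t.left, lo, t.data)     // left: everything below the root
            && isBST(t.right, t.data, hi);   // right: everything above it
    }
}
```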
2.1 Understanding BSTs
Consider the following BST (convince yourself that it meets the BST property):
            10
           /  \
          4    12
        /   \
       2     7
      / \   / \
     1   3 6   8
          /
         5
With this example in hand, let’s describe how each of the set operations has to behave to maintain or exploit the BST invariant:
size behaves as in a plain binary tree.
hasElt improves on hasElt for a plain binary tree: if the element you’re looking for is not at the root, the search recurs on only one subtree: the left (if the element to find is smaller than the one at the root) or the right (if it is larger).
addElt always inserts new elements at a leaf in the tree. It starts from the root, moving to the left or right subtree as needed to maintain the invariant. When it hits an empty tree, addElt replaces it with a new node containing the data to add and two empty subtrees.
remElt traverses the BST until it finds the node with the element to remove at the root. If the node has no children, remElt returns the empty tree. If the node has only one child, remElt replaces it with its child node. If the node has two children, remElt replaces the value in the node with either the largest element in the left subtree or the smallest element in the right subtree; remElt then removes the moved node value from its subtree.
The Wikipedia entry on binary search trees also has diagrams illustrating this operation.
This description illustrates that any operation that modifies the data structure (here, addElt and remElt) must maintain the invariant. Operations that merely inspect the data structure (such as hasElt) are free to exploit the invariant, though the invariant does not necessarily affect all operations (such as size).
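The three BST operations described above can be sketched in Java as follows (again, the class and helper names are ours; null represents the empty tree). remElt follows the three cases from the text, using the largest element of the left subtree when the node has two children:

```java
// A sketch of the set operations on a BST of ints.
class BST {
    int data; BST left, right;
    BST(int data, BST left, BST right) {
        this.data = data; this.left = left; this.right = right;
    }

    static boolean hasElt(BST t, int e) {
        if (t == null) return false;
        if (e == t.data) return true;
        // discard one entire subtree on each failed comparison
        return (e < t.data) ? hasElt(t.left, e) : hasElt(t.right, e);
    }

    static BST addElt(BST t, int e) {
        if (t == null) return new BST(e, null, null); // insert at a leaf
        if (e < t.data) t.left = addElt(t.left, e);
        else if (e > t.data) t.right = addElt(t.right, e);
        return t; // if e is already present, nothing changes
    }

    static BST remElt(BST t, int e) {
        if (t == null) return null;
        if (e < t.data) { t.left = remElt(t.left, e); return t; }
        if (e > t.data) { t.right = remElt(t.right, e); return t; }
        // found the node to remove: the three cases from the text
        if (t.left == null) return t.right;  // no children, or only a right child
        if (t.right == null) return t.left;  // only a left child
        int m = max(t.left);                 // largest element in the left subtree
        t.data = m;
        t.left = remElt(t.left, m);          // remove the moved value from below
        return t;
    }

    // the largest element is the rightmost node
    private static int max(BST t) {
        while (t.right != null) t = t.right;
        return t.data;
    }
}
```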
2.2 Run-time Performance of Set Operations via BSTs
Our discussion of hasElt on BSTs suggests that we’ve made progress on our performance problems with hasElt: on each comparison, we throw away one subtree, which is more than the single element we got to throw away on each comparison within lists.
Stop and think: do we always get to throw away more than one element in hasElt on a BST?
Consider the BST resulting from the following sequence of calls:
addElt(5), addElt(4), addElt(3), addElt(2), addElt(1).
Draw the BST – what does it look like? It looks like a list. So we didn’t actually gain any performance from hasElt in this case.
This illustrates one of the subtleties of arguing about run-time performance: we have to distinguish what happens in the best case, worst case, and average case. With lists, the performance of hasElt is linear in each of the best, worst, and average cases. With BSTs, the best case performance of hasElt is logarithmic (the mathematical term for "we throw away half each time"), but the worst case is still linear. Ideally, we would like a data structure in which the worst case performance of hasElt is also logarithmic.
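We can confirm the worst case concretely with a small self-contained check (our own code; Node, insert, and height are hypothetical helpers): inserting 5, 4, 3, 2, 1 in order produces a BST whose height equals the number of elements, i.e., a list-shaped tree with no branching at all.

```java
// Demonstrates the degenerate, list-shaped BST from the text.
class ChainDemo {
    static class Node { int d; Node l, r; Node(int d) { this.d = d; } }

    static Node insert(Node t, int e) {
        if (t == null) return new Node(e);
        if (e < t.d) t.l = insert(t.l, e);
        else if (e > t.d) t.r = insert(t.r, e);
        return t;
    }

    // here, height counts the nodes on the longest root-to-leaf path
    static int height(Node t) {
        return t == null ? 0 : 1 + Math.max(height(t.l), height(t.r));
    }

    static int chainHeight() {
        Node t = null;
        for (int e : new int[]{5, 4, 3, 2, 1}) t = insert(t, e);
        return height(t); // 5 nodes, height 5: every node has one child
    }
}
```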
3 Implementing Sets with AVL Trees
BSTs have poor worst-case performance on hasElt because we cannot guarantee that half of the elements are in each subtree. If we had that guarantee, hasElt would have logarithmic worst-case performance. In other words, we need to constrain our trees to be balanced. There are several data structures for balanced BSTs: we will look at one of them.
AVL trees are a form of balanced binary search trees (BBST). BBSTs augment the binary search tree invariant to require that the heights of the left and right subtrees at every node differ by at most one ("height" is the length of the longest path from the root to a leaf). For the two trees that follow, the one on the left is a BBST, but the one on the right is not (it is just a BST):
      6             6
     / \           / \
    3   8         3   9
                     / \
                    7   12
                        /
                       10
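The balance condition itself is easy to state in code. Here is a sketch (our own names; height follows the text's definition, the length of the longest root-to-leaf path, with -1 for the empty tree):

```java
// Checks the AVL balance invariant: at every node, the heights of the
// two subtrees differ by at most one.
class BalanceCheck {
    static class Node {
        int d; Node l, r;
        Node(int d, Node l, Node r) { this.d = d; this.l = l; this.r = r; }
    }

    // height = number of edges on the longest root-to-leaf path
    static int height(Node t) {
        return t == null ? -1 : 1 + Math.max(height(t.l), height(t.r));
    }

    static boolean isBalanced(Node t) {
        if (t == null) return true;
        return Math.abs(height(t.l) - height(t.r)) <= 1
            && isBalanced(t.l) && isBalanced(t.r);
    }
}
```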
We already saw how to maintain the BST invariants, so we really just need to understand how to maintain balance. We’ll look at this by way of examples. Consider the following BST. It is not balanced, since the left subtree has height 2 and the right has height 0:
      4
     /
    2
   / \
  1   3

To restore balance, we rotate the tree clockwise: 2 becomes the new root, the old root 4 becomes the right child of 2, and 3 (previously the right child of 2) moves over to become the left child of 4. The BST invariant is preserved, since 3 lies between 2 and 4:

      2
     / \
    1   4
       /
      3
Even though the new tree has the same height as the original, the number of nodes in the two subtrees is closer together. This makes it more likely that we can add elements into either side of the tree without rebalancing later.
A similar rotation counterclockwise handles the case when the right subtree is taller than the left.
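The two rotations can be sketched in Java as follows (node fields and method names are ours). Each rotation is a constant-time pointer rearrangement that preserves the BST invariant:

```java
// Single rotations on a binary search tree node.
class Rotations {
    static class Node {
        int d; Node l, r;
        Node(int d, Node l, Node r) { this.d = d; this.l = l; this.r = r; }
    }

    // clockwise: lift the left child to the root
    static Node rotateRight(Node t) {
        Node newRoot = t.l;
        t.l = newRoot.r;   // old root adopts the left child's right subtree
        newRoot.r = t;     // old root becomes the right child
        return newRoot;
    }

    // counterclockwise: the mirror image, for a too-tall right subtree
    static Node rotateLeft(Node t) {
        Node newRoot = t.r;
        t.r = newRoot.l;
        newRoot.l = t;
        return newRoot;
    }
}
```

For example, rotating the unbalanced tree from above (4 with left subtree 2/(1,3)) clockwise yields 2 at the root, with 1 on the left and 4 (now holding 3 as its left child) on the right.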
One more example:
      7
     / \
    5   10
       /  \
      8    15
       \
        9
A single counterclockwise rotation at the root does not help here; it produces the following tree, whose left subtree (rooted at 7) is now too tall:

        10
       /  \
      7    15
     / \
    5   8
         \
          9
The solution in this case is to first rotate the tree rooted at 10 (in the original tree) clockwise, then rotate the resulting whole tree (starting at 7) counterclockwise:
      7
     / \
    5   8
         \
          10
         /  \
        9    15

then, after the counterclockwise rotation at the root:

        8
       / \
      7   10
     /   /  \
    5   9    15
An astute reader might notice that the original tree in this example was not balanced, so perhaps this case doesn’t arise in practice. Not so fast: the original tree without the 9 is balanced. So we could start with the original tree sans 9, then addElt(9), and end up with a tree that needs a double rotation. You should convince yourself, however, that we never need more than two rotations if the tree was balanced before adding or removing an element.
We tie the rotation algorithm into a BBST implementation by using it to rebalance the tree after every addElt or remElt call. The rebalance calls should be built into the addElt or remElt implementations.
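Putting the pieces together, here is a sketch (our own code, not the lecture's) of addElt with rebalancing built in. The rebalance step inspects the heights on each side and chooses a single or double rotation accordingly; recomputing heights on every call is wasteful, and a real AVL implementation would cache a height field in each node, but this keeps the sketch short:

```java
// A sketch of an AVL-style addElt that rebalances on the way back up.
class AVL {
    static class Node { int d; Node l, r; Node(int d) { this.d = d; } }

    // height = edges on the longest root-to-leaf path; -1 for empty
    static int height(Node t) {
        return t == null ? -1 : 1 + Math.max(height(t.l), height(t.r));
    }

    static Node rotateRight(Node t) { Node n = t.l; t.l = n.r; n.r = t; return n; }
    static Node rotateLeft(Node t)  { Node n = t.r; t.r = n.l; n.l = t; return n; }

    // called at every node on the path back to the root
    static Node rebalance(Node t) {
        int diff = height(t.l) - height(t.r);
        if (diff > 1) {                    // left side too tall
            if (height(t.l.l) < height(t.l.r))
                t.l = rotateLeft(t.l);     // double rotation: fix the kink first
            return rotateRight(t);
        }
        if (diff < -1) {                   // right side too tall (mirror image)
            if (height(t.r.r) < height(t.r.l))
                t.r = rotateRight(t.r);
            return rotateLeft(t);
        }
        return t;                          // already balanced
    }

    static Node addElt(Node t, int e) {
        if (t == null) return new Node(e);
        if (e < t.d) t.l = addElt(t.l, e);
        else if (e > t.d) t.r = addElt(t.r, e);
        return rebalance(t);               // restore balance on the way up
    }
}
```

With this in place, the degenerate insertion order 5, 4, 3, 2, 1 from earlier no longer produces a list-shaped tree: the result is balanced with 4 at the root.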
Wikipedia has a good description of AVL trees. Refer to that for details on the rotation algorithm.