Tree-based Implementations of Sets
Lists provide a plausible data structure for implementing sets, but their run-time performance can be slow on some operations. Lists are, by definition, linear (meaning that there is a straight-line order in which elements get accessed). When we check whether an item is in a list, we may end up traversing the entire list. Put differently, every time we check one element of the list against the item we are searching for, we throw away only one element from further consideration. Using a sorted list (rather than an unsorted one) doesn’t help (we might be looking for the last element). To improve on the efficiency of our set implementation, we need a different implementation that discards more elements with each check.
Binary trees seem a natural way to throw away more elements. If the data is placed in the tree in a useful way, we should be able to discard some portion of the tree (such as the left or right subtree) after each search. There are several tree-based data structures with different placements of data. We will contrast three in detail, using them to illustrate how we start to account for efficiency concerns in program design, as well as how to implement these in Java..
To simply the discussion, we will focus on trees containing only integers (rather than trees of people or tournament contestants).
1 Binary Search Trees
In a binary search tree (BST), every node has the following property:
Every element in the left subtree is smaller than the element at the root, every element in the right subtree is larger than the element at the root, and both the left and right subtrees are BSTs.
Constraints on all instances of a data structure are called invariants. Invariants are an essential part of program design. Often, we define a new data structure via an invariant on an existing data structure. You need to learn how to identify, state, and implement invariants.
1.1 Understanding BSTs
Consider the following BST (convince yourself that it meets the BST property):
10 |
/ \ |
4 12 |
/ \ |
2 7 |
/ \ / \ |
1 3 6 8 |
/ |
5 |
|
With your examples in hand, let’s describe how each of the set operations has to behave to maintain or exploit the BST invariant:
size behaves as in a plain binary tree.
hasElt optimizes on hasElt on a plain binary tree: if the element you’re looking for is not in the root, the search recurs on only one of the left subtree (if the element to find is smaller than that in the root) or the right subtree (if the element to find is larger than that in the root).
addElt always inserts new elements at a leaf in the tree. It starts from the root, moving to the left or right subtree as needed to maintain the invariant. When it hits a empty tree, addElt replaces it with a new node with the data to add and two empty subtrees.
remElt traverses the BST until it finds the node with the element to remove at the root. If the node has no children, remElt returns the empty tree. If the node has only one child, remElt replaces it with its child node. If the node has two children, remElt replaces the value in the node with either the largest element in the left subtree or the smallest element in the right subtree; remElt then removes the moved node value from its subtree.
The wikipedia entry also has diagrams illustrating this operation.
This description illustrates that any operation that modifies the data structure (here, addElt and remElt) must maintain the invariant. Operations that merely inspect the data structure (such as hasElt) are free to exploit the invariant, though the invariant does not necessarily affect all operations (such as size).
An implementation of BSTs following our in-class recipe is in a separate file. Several aspects of this implementation are worth highlighting:
The implementation starts from the definition of a plain binary tree of numbers that neither enforces the invariant nor satisfies the Iset interface:
interface IBST {}
class MtBST implements IBST {
MtBST() {}
}
class DataBST implements IBST {
int data;
IBST left;
IBST right;
DataBST(int data, IBST left, IBST right) {
this.data = data;
this.left = left;
this.right = right;
}
}
To indicate that our BSTs will implement the Iset interface, we edit the IBST definition as follows:
interface IBST extends Iset { ... }
Really, we want to say that IBST implements Iset, but Java doesn’t allow interfaces to implement one another. The extends keyword between interfaces achieves this goal.
Methods common to IBST classes (such as largestElt) go into the IBST interface, not the Iset interface.
The remElt method requires a largestElt method on DataBSTs. Our program never invokes largestElt on an MtBST (convince yourself of this by looking at where it gets used). Unfortunately, the Java type checker isn’t smart enough to determine this, so it requires that all variants of IBST have a largestElt method. This requirement forces largestElt into the IBST interface, and thus into the MtBST class (try removing it and see what error you get). The behavior of largestElt isn’t well-defined on MtBSTs. Therefore, the implementation of largestElt on IBST simply raises an error if it is ever invoked. We will cover error handling more explicitly in a couple of weeks.
Within addElt, you’ll see a new construct used when we create the DataBST to return:
return new DataBST (this.data,
(IBST) this.left.addElt(elt),
this.right);
Note the (IBST) before the recursive call to addElt on the left subtree. Currently, addElt is declared in the Iset interface. The return value of addElt is an Iset, which is not necessarily an IBST (as DataBST requires). There are two ways to handle this, both shown in the sample code:
Tell Java to view the result of addElt as an IBST. The code fragment above does this by putting (IBST) in front of the result of addElt. This technique is called casting. Casting lets a programmer override Java’s type system, though Java will raise an error at runtime if the cast is not valid.
Refine the type of addElt within the IBST interface. The following interface would achieve this:
interface IBST extends Iset {
IBST addElt (int elt);
...}
Casting is the more common solution, but there are advantages to the interface-based solution. In particular, Java would report an error if the addElt implementation within either BST class returned some Iset other than a BST. Without addElt in the interface, addElt implementations are free to return other Iset implementations, even though those would be nonsensical elsewhere in the program.
We now have two broad criteria to exercise in our test cases: that BSTs provide a proper implementation of Iset and that our implementation satisfies the BST invariant. The incomplete set of tests in the Examples class illustrate these points: test1 checks size as a standalone function; test2 and test3 check that size and addElt interact properly to satisfy the no-duplicates requirement of sets; test4 through test6 begin to check that remElt preserves the BST invariant. We say "begin" here because these tests are low-level examples of the results of the invariant rather than a convincing expression of the invariant itself. We’ll be returning to that point in a couple of days.
A full remElt implementation would choose between upgrading the largest element of the left subtree or smallest element of the right. For simplicity, this code has only provided the former. This decision vastly (and unrealistically) simplified the last three tests in the Examples class.
The tests in the Examples class check both the behavior of the operations on BSTs, but also the axioms on ISet.
2 Formalizing Performance
We set out on this topic in hoping of getting a more efficient implementation of set operations than lists would provide. In particular, we considered the efficiency of checking whether an element is in the set. Let’s contrast the efficiency of hasElt on a list implementation to that on a BST implementation.
Recall that the list implementation of hasElt eliminated at one element from consideration on each iteration. Let T be a function that consumes the number of elements in a list and produces the worst-case time required to check whether that element is in the list:
T(n) = T(n-1) + c
Read this formula as saying "computing hasElt on a list with n elements takes the time to compute it on n-1 elements plus a constant". The T(n-1) comes from calling hasElt on the rest of the list. The constant c captures the time for the other operations that link hasElt on the rest to hasElt on the whole list (conditional, or, etc). We use a constant here since none of these extra operations depend on characteristics of the particular datum in the first of the list.
Technically, recursive formulas such as this are called recurrences.
Let’s write a similar formula for hasElt on a BST. At worst, how many elements still need to be considered after we rule out the root of the tree? A BST could have all the remaining nodes in only one child. If this situation applied to all nodes, the BST is really just a list, so we get the same recurrence:
T(n) = T(n-1) + c
Put differently, in the worst case, a BST is no more efficient than a list. In practice, the BST could be better, since BSTs are not list-shaped as a general rule. However, we get no guarantee of good performance from a BST-implementation of sets. If we want guaranteed performance improvements, we need a data structure with a stronger invariant.
When you get to Algorithms (CS2223), you will discuss recurrences and related issues in much more detail, include looking at average-case running times and space efficiency. Recurrences are just one high-level way of expressing the running-time of an algorithm. We’ll return to them periodically during the course.
3 AVL Trees
AVL trees are a form of balanced binary search trees (BBST). BBSTs augment the binary search tree invariant to require that the heights of the left and right subtrees at every node differ by at most one ("height" is the length of the longest path from the root to a leaf). For the two trees that follow, the one on the left is a BBST, but the one on the right is not (it is just a BST):
6 6 |
/ \ / \ |
3 8 3 9 |
/ \ |
7 12 |
/ |
10 |
We already saw how to maintain the BST invariants, so we really just need to understand how to maintain balance. We’ll look at this by way of examples. Consider the following BST. It is not balanced, since the left subtree has height 2 and the right has height 0:
4 |
/ |
2 |
/ \ |
1 3 |
2 |
/ \ |
? 4 |
1 .. 3 |
2 |
/ \ |
1 4 |
/ |
3 |
Even though the new tree has the same height as the original, the number of nodes in the two subtrees is closer together. This makes it more likely that we can add elements into either side of the tree without rebalancing later.
A similar rotation counterclockwise handles the case when the right subtree is taller than the left.
One more example:
7 |
/ \ |
5 10 |
/ \ |
8 15 |
\ |
9 |
10 |
/ \ |
7 15 |
/ \ |
5 8 |
\ |
9 |
The solution in this case is to first rotate the tree rooted at 10 (in the original tree) clockwise, then rotate the resulting whole tree (starting at 7) counterclockwise:
7 |
/ \ |
5 8 |
\ |
10 |
/ \ |
9 15 |
|
----------------- |
|
8 |
/ \ |
7 10 |
/ / \ |
5 9 15 |
An astute reader would notice that the original tree in this example was not balanced, so maybe this case doesn’t arise in practice. Not so fast. The original tree without the 9 is balanced. So we could start with the original tree sans 9, then addElt(9), and end up with a tree that needs a double rotation. You should convince yourself, however, that we never need more than two rotations if we had a balanced tree before adding or removing an element.
We tie the rotation algorithm into a BBST implementation by using it to rebalance the tree after every addElt or remElt call. The rebalance calls should be built into the addElt or remElt implementations.
Wikipedia has a good description of AVL trees. Refer to that for details on the rotation algorithm.
3.1 AVL-tree Performance
What is the recurrence for the time performance of hasElt on AVL trees? Since the tree is balanced, we are guaranteed to throw away half of the elements on each iteration. Therefore,
T(n) = T(n/2) + c
Any recurrence of this form produces a solution in log(n) iterations. This is a clear performance improvement over BSTs or Lists, though at the cost of a more complicated algorithm.
4 Summary
An invariant is a constraint on every valid instance of a data structure. If a data structure has an invariant, all implementations are required to maintain that invariant.
Binary search trees and AVL trees are binary trees with different invariants. Each of these invariants yields a different run-time performance guarantee.
A data structure consists of a known data structure and an (optional) invariant. Follow the design recipe (template, etc) for the known data structure. Build the invariant into the methods that you write over that core template.
Different programming languages explicitly capture different invariants. Types are a common invariant that are captured in code. More complex invariants, such as those for binary search trees or AVL trees, need to be documented in Java classes. We can write separate methods to check for invariants, but Java does not provide mechanisms to check invariants at compile-time as it does for types.