Tree-based Implementations of Sets

Lists provide a plausible data structure for implementing sets, but their run-time performance can be slow on some operations. Lists are, by definition, linear (meaning that there is a straight-line order in which elements get accessed). When we check whether an item is in a list, we may end up traversing the entire list. Put differently, every time we check one element of the list against the item we are searching for, we throw away only one element from further consideration. Using a sorted list (rather than an unsorted one) doesn’t help (we might be looking for the last element). To improve on the efficiency of our set implementation, we need a different implementation that discards more elements with each check.

Binary trees seem a natural way to throw away more elements. If the data is placed in the tree in a useful way, we should be able to discard some portion of the tree (such as the left or right subtree) after each check. There are several tree-based data structures with different placements of data. We will contrast three in detail, using them to illustrate how we start to account for efficiency concerns in program design, as well as how to implement these in Java.

To simplify the discussion, we will focus on trees containing only integers (rather than trees of people or tournament contestants).

1 Binary Search Trees

In a binary search tree (BST), every node has the following property:

Every element in the left subtree is smaller than the element at the root, every element in the right subtree is larger than the element at the root, and both the left and right subtrees are BSTs.

(This statement assumes no duplicates, which is fine since we are modeling sets.) Wikipedia’s entry on binary search trees has diagrams of trees with this property.

Constraints on all instances of a data structure are called invariants. Invariants are an essential part of program design. Often, we define a new data structure via an invariant on an existing data structure. You need to learn how to identify, state, and implement invariants.
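One way to make an invariant concrete is to write a method that checks it. The following is a minimal sketch, not part of the course code: it borrows the class names MtBST and DataBST used later in these notes, while the method withinBounds and the class InvariantDemo are hypothetical names introduced only for this illustration. Each subtree is checked against bounds inherited from its ancestors, because comparing a node only against its immediate children is not enough.

```java
interface IBST {
  // True if every element lies strictly between lo and hi and every
  // subtree is itself a BST. (Hypothetical helper, not in the notes.)
  boolean withinBounds(long lo, long hi);
}

class MtBST implements IBST {
  public boolean withinBounds(long lo, long hi) {
    return true; // the empty tree trivially satisfies the invariant
  }
}

class DataBST implements IBST {
  int data;
  IBST left;
  IBST right;

  DataBST(int data, IBST left, IBST right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }

  public boolean withinBounds(long lo, long hi) {
    // The left subtree must stay below this.data, the right subtree above it.
    return lo < this.data && this.data < hi
        && this.left.withinBounds(lo, this.data)
        && this.right.withinBounds(this.data, hi);
  }
}

class InvariantDemo {
  // long bounds let the initial limits sit outside the usual range of data.
  static boolean isBST(IBST t) {
    return t.withinBounds(Long.MIN_VALUE, Long.MAX_VALUE);
  }

  public static void main(String[] args) {
    IBST mt = new MtBST();
    IBST good = new DataBST(10, new DataBST(4, mt, new DataBST(7, mt, mt)),
                            new DataBST(12, mt, mt));
    // 11 is larger than its parent 4 but sits in the left subtree of 10.
    IBST bad = new DataBST(10, new DataBST(4, mt, new DataBST(11, mt, mt)),
                           new DataBST(12, mt, mt));
    System.out.println(isBST(good)); // true
    System.out.println(isBST(bad));  // false
  }
}
```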

1.1 Understanding BSTs

Consider the following BST (convince yourself that it meets the BST property):

        10
       /  \
      4    12
     / \
    2   7
   / \ / \
  1  3 6  8
      /
     5

As a way to get familiar with BSTs and to figure out how the addElt and other set operations work on BSTs, work out (by hand) the trees that should result from each of the following operations on the original tree: addElt(9), remElt(6), remElt(3), remElt(4).

With your examples in hand, let’s describe how each of the set operations has to behave to maintain or exploit the BST invariant:

size behaves as in a plain binary tree.
hasElt optimizes on hasElt on a plain binary tree: if the element you’re looking for is not in the root, the search recurs on only one of the left subtree (if the element to find is smaller than that in the root) or the right subtree (if the element to find is larger than that in the root).
addElt always inserts new elements at a leaf in the tree. It starts from the root, moving to the left or right subtree as needed to maintain the invariant. When it hits an empty tree, addElt replaces it with a new node with the data to add and two empty subtrees.
remElt traverses the BST until it finds the node with the element to remove at the root. If the node has no children, remElt returns the empty tree. If the node has only one child, remElt replaces it with its child node. If the node has two children, remElt replaces the value in the node with either the largest element in the left subtree or the smallest element in the right subtree; remElt then removes the moved node value from its subtree.
The Wikipedia entry also has diagrams illustrating this operation.
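The three simpler operations described above can be sketched as follows. This is an illustrative sketch rather than the course's final implementation: for brevity it puts every method directly in a single IBST interface (the notes later split these between Iset and IBST), and the method bodies are assumptions based on the descriptions above.

```java
interface IBST {
  int size();
  boolean hasElt(int elt);
  IBST addElt(int elt);
}

class MtBST implements IBST {
  public int size() { return 0; }
  public boolean hasElt(int elt) { return false; }

  // Adding to an empty tree yields a node with two empty subtrees.
  public IBST addElt(int elt) {
    return new DataBST(elt, new MtBST(), new MtBST());
  }
}

class DataBST implements IBST {
  int data;
  IBST left;
  IBST right;

  DataBST(int data, IBST left, IBST right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }

  // size ignores the invariant: it visits every node.
  public int size() {
    return 1 + this.left.size() + this.right.size();
  }

  // hasElt exploits the invariant: it searches only one subtree.
  public boolean hasElt(int elt) {
    if (elt == this.data) {
      return true;
    } else if (elt < this.data) {
      return this.left.hasElt(elt);
    } else {
      return this.right.hasElt(elt);
    }
  }

  // addElt maintains the invariant: it rebuilds only the path to the
  // empty tree where the new element belongs.
  public IBST addElt(int elt) {
    if (elt == this.data) {
      return this; // already present; sets hold no duplicates
    } else if (elt < this.data) {
      return new DataBST(this.data, this.left.addElt(elt), this.right);
    } else {
      return new DataBST(this.data, this.left, this.right.addElt(elt));
    }
  }
}
```

Starting from new MtBST() and adding 10, 4, 12, and then 4 again should give a tree of size 3, since the second 4 is a duplicate.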

This description illustrates that any operation that modifies the data structure (here, addElt and remElt) must maintain the invariant. Operations that merely inspect the data structure (such as hasElt) are free to exploit the invariant, though the invariant does not necessarily affect all operations (such as size).

The implementations of size, hasElt, and addElt are straightforward. The implementation of remElt, however, has some interesting implications in Java. The rest of these notes will focus on remElt. The final bst implementation in Java shows the details of all four operations.

1.2 Implementing remElt with BSTs

First, let’s turn the general description of the remElt algorithm into Java code. For simplicity as we look at the subtleties of Java, we will always grab the largest element in the left child when we need to remove the root of a tree with two populated subtrees. The parts that raise interesting Java points are written in all capital letters between angle brackets (these are not valid Java code).

  public IBST remElt (int elt) {
    if (elt == this.data) {
      if <BOTH CHILDREN ARE MtBSTs> {
        return new MtBST();
      } else if <LEFT IS AN MtBST> {
        return this.right;
      } else if <RIGHT IS AN MtBST> {
        return this.left;
      } else { // both children are DataBSTs
        return new DataBST(this.left.largestElt(),
                           this.left.remElt(this.left.largestElt()),
                           this.right);
      }
    } else if (elt < this.data) {
      return new DataBST(this.data,
                         this.left.remElt(elt),
                         this.right);
    } else { // elt > this.data
      return new DataBST(this.data,
                         this.left,
                         this.right.remElt(elt)) ;
    }
  }

Before you go on: make sure you see that this code implements the BST remElt algorithm. You should be able to articulate why this code preserves the BST invariant.

Now we need to capture the all-caps test questions in Java. To write these tests, we need a way to determine whether each child tree is an MtBST or a DataBST. Understanding how to do this properly is the point of this section of the presentation.

If you have had Java before, you may have been taught that you can check whether an object was created from a given class using an operator called instanceof. Using instanceof, we would fill in the holes as follows:

  if (this.left instanceof MtBST && this.right instanceof MtBST) {
    return new MtBST();
  } else if (this.left instanceof MtBST) {
    return this.right;
  } else if (this.right instanceof MtBST) {
    return this.left;
  } else { ... }

Back when we showed how to migrate Racket programs over mixed data to Java, however, we discussed that good OO programs should not check the type of objects explicitly. Remember that one of the key points of OO languages is that they handle finding the right method based on the type of an object automatically (this is called dispatch). So while instanceof works here, it isn’t a proper solution in an OO language.

1.3 Rewriting Code to Eliminate instanceof

A proper OO solution requires that we capture the effect of the instanceof uses in methods; these methods will have different implementations in the MtBST class and the DataBST class that together achieve the effect of the original instanceof tests. Our goal then is to design a method that can dispatch on the children to perform the appropriate computation.

To help with that, let’s reorganize the conditional tests around the types of the children. We start with a conditional based on the type of the left child:

  if (this.left instanceof MtBST) {
    if (this.right instanceof MtBST) {
       return new MtBST();
    } else {
       return this.right;
    }
  } else { // left is a DataBST
    if (this.right instanceof MtBST) {
       return this.left;
    } else { ... }
  }

Convince yourself that this version is indeed equivalent to the first version we sketched out. Now, note that in the case that the left is an MtBST, we return the right child in either case. So we can further simplify this to:

  if (this.left instanceof MtBST) {
    return this.right;
  } else { // left is a DataBST
    if (this.right instanceof MtBST) {
       return this.left;
    } else { ... }
  }

Next, we turn this into a method on IBST that we will call on the left child: the answer for the if will be the body of the method in the MtBST class, and the answer for the else will be the body of the method in the DataBST class (just as we did when writing methods on animals in week 1). Let’s call the method remParent:

  // goes into the MtBST class
  IBST remParent(IBST rightSibling) {
    return rightSibling;
  }

  // goes into the DataBST class.  "this" is the left sibling
  IBST remParent(IBST rightSibling) {
    if (rightSibling instanceof MtBST) {
       return this;
    } else { ... }
  }

We would call this method from within remElt in the DataBST class, as follows:

  // remElt in the DataBST class
  public IBST remElt (int elt) {
    if (elt == this.data)
      return this.left.remParent(this.right);
    else if (elt < this.data)
      ... //code is the same after here
  }

We still have a use of instanceof in the body of remParent, but we can use the same technique. The conditional already branches on the type of a single object, so we simply introduce a new method to handle the dispatch. We will call the new method mergeToRemoveParent. It will be called on the right sibling, taking the left sibling as an argument:

  // goes into the MtBST class
  // "this" is the right sibling; leftSibling is a DataBST
  IBST mergeToRemoveParent(IBST leftSibling) {
    return leftSibling;
  }

  // goes into the DataBST class.
  // "this" is the right sibling; leftsibling is a DataBST
  IBST mergeToRemoveParent(IBST leftSibling) {
    // this is where we choose largest-in-left or smallest-in-right,
    //   branching accordingly.  Only showing largest-in-left here
    int newRoot = leftSibling.largestElt();
    return new DataBST(newRoot,
                       leftSibling.remElt(newRoot),
                       this);
  }

In the implementation of mergeToRemoveParent for the DataBST class, we have filled in the ellipses that we have carried through the example, refining the terms to match the variable names in the method.

To make this code compile, we also need to add both remParent and mergeToRemoveParent to the IBST interface, and accordingly mark all implementations of these methods as public. The full solution shows all of these details.

NOTE: There are other approaches you might take to eliminating instanceof in this code. If you want to explore them, work on the advanced option in lab this week.

1.4 Casting: Making the Types Work Out

Unfortunately, the code as we have it still won’t compile due to one last subtle issue. Remember that we are using BSTs to implement sets. We now have the following interfaces for Iset and IBST, and the following concrete types in DataBST:

  interface Iset {
    Iset addElt (int elt);
    Iset remElt (int elt);
    int size ();
    boolean hasElt (int elt);
  }

  interface IBST extends Iset {
    int largestElt();
    IBST remParent(IBST sibling);
    IBST mergeToRemoveParent(IBST sibling);
  }

  class DataBST implements IBST  {
    ...
    DataBST(int data, IBST left, IBST right) {
      this.data = data;
      this.left = left;
      this.right = right;
    }
  }

Now, look closely at the types of objects we are passing to the DataBST constructor within remElt in the DataBST class:

  public IBST remElt (int elt) {
    if (elt == this.data)
      return this.left.remParent(this.right);
    else if (elt < this.data)
      return new DataBST(this.data,
                         this.left.remElt(elt),
                         this.right);
    ...
  }

The second argument to DataBST here is the result of remElt. The interfaces indicate that remElt returns an object of type Iset. But the DataBST constructor expects the second input to be of type IBST. The Java compiler will reject this code on a type mismatch.

But wait – we know that we are implementing Iset through IBST in this program. The actual remElt method we are calling returns an IBST. Aren’t we then guaranteed that the types are fine when we run the code?

Yes, we are. However, the Java type system cannot confirm this automatically (designing type systems in the presence of inheritance is very tricky, precisely for cases such as this). The Java compiler has no choice but to reject this code. The Java language, then, needs to provide programmers with a way to take responsibility for this code executing properly in practice.

Java programmers do this by claiming what type the result of remElt will have at run time. This claim is called a cast. It is written as follows:

  public IBST remElt (int elt) {
    if (elt == this.data)
      return this.left.remParent(this.right);
    else if (elt < this.data)
      return new DataBST(this.data,
                         (IBST) this.left.remElt(elt),
                         this.right);
    ...
  }

The IBST before the result of remElt tells the compiler "assume this object is an IBST when you compile". The run-time system, in turn, will check this claim when the program is actually running. If the actual object does not implement IBST, an error will be reported as the program runs.
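To see the run-time check in action, here is a small sketch. The class ListSet is hypothetical and exists only to provide an Iset that is not an IBST; the interfaces are trimmed to one method, and tryCast and CastDemo are illustrative names. Both casts compile, but only the first succeeds when the program runs.

```java
interface Iset {
  Iset addElt(int elt);
}

interface IBST extends Iset { }

// Hypothetical non-tree set, present only so that the cast can fail.
class ListSet implements Iset {
  public Iset addElt(int elt) { return this; } // placeholder body
}

class MtBST implements IBST {
  public Iset addElt(int elt) { return this; } // placeholder body
}

class CastDemo {
  static String tryCast(Iset s) {
    try {
      IBST t = (IBST) s; // accepted by the compiler, checked at run time
      return "cast ok";
    } catch (ClassCastException e) {
      return "ClassCastException";
    }
  }

  public static void main(String[] args) {
    System.out.println(tryCast(new MtBST()));   // cast ok
    System.out.println(tryCast(new ListSet())); // ClassCastException
  }
}
```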

Once you have hierarchies of classes and interfaces, casts are sometimes necessary to make code compile. They slightly hurt the performance of running programs (since the types are checked at run-time rather than compile time). As a Java programmer, you should be careful to only use a cast when you are confident that the objects you are casting can actually be of the indicated type.
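Putting the pieces of sections 1.2 through 1.4 together, a compilable sketch might look like the following. It is not the full solution that these notes refer to: the bodies of addElt, size, hasElt, and largestElt, the helper largestOr, and the choice of UnsupportedOperationException are all assumptions made for illustration. The remElt, remParent, and mergeToRemoveParent methods follow the code developed above.

```java
interface Iset {
  Iset addElt(int elt);
  Iset remElt(int elt);
  int size();
  boolean hasElt(int elt);
}

interface IBST extends Iset {
  int largestElt();
  int largestOr(int fallback); // hypothetical helper, not in the notes
  IBST remParent(IBST rightSibling);
  IBST mergeToRemoveParent(IBST leftSibling);
}

class MtBST implements IBST {
  public Iset addElt(int elt) { return new DataBST(elt, new MtBST(), new MtBST()); }
  public Iset remElt(int elt) { return this; } // nothing to remove
  public int size() { return 0; }
  public boolean hasElt(int elt) { return false; }
  public int largestElt() {
    throw new UnsupportedOperationException("largestElt on an empty tree");
  }
  public int largestOr(int fallback) { return fallback; }
  // "this" is an empty left sibling, so the right sibling replaces the parent
  public IBST remParent(IBST rightSibling) { return rightSibling; }
  // "this" is an empty right sibling, so the left sibling replaces the parent
  public IBST mergeToRemoveParent(IBST leftSibling) { return leftSibling; }
}

class DataBST implements IBST {
  int data;
  IBST left;
  IBST right;

  DataBST(int data, IBST left, IBST right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }

  public IBST addElt(int elt) {
    if (elt == this.data) {
      return this; // sets hold no duplicates
    } else if (elt < this.data) {
      return new DataBST(this.data, (IBST) this.left.addElt(elt), this.right);
    } else {
      return new DataBST(this.data, this.left, (IBST) this.right.addElt(elt));
    }
  }

  public IBST remElt(int elt) {
    if (elt == this.data) {
      return this.left.remParent(this.right);
    } else if (elt < this.data) {
      return new DataBST(this.data, (IBST) this.left.remElt(elt), this.right);
    } else {
      return new DataBST(this.data, this.left, (IBST) this.right.remElt(elt));
    }
  }

  public int size() { return 1 + this.left.size() + this.right.size(); }

  public boolean hasElt(int elt) {
    if (elt == this.data) { return true; }
    return elt < this.data ? this.left.hasElt(elt) : this.right.hasElt(elt);
  }

  // The largest element is the rightmost one; largestOr avoids ever
  // calling largestElt on an empty tree.
  public int largestElt() { return this.right.largestOr(this.data); }
  public int largestOr(int fallback) { return this.largestElt(); }

  // "this" is a non-empty left sibling
  public IBST remParent(IBST rightSibling) {
    return rightSibling.mergeToRemoveParent(this);
  }

  // "this" is a non-empty right sibling; leftSibling is a DataBST
  public IBST mergeToRemoveParent(IBST leftSibling) {
    int newRoot = leftSibling.largestElt();
    return new DataBST(newRoot, (IBST) leftSibling.remElt(newRoot), this);
  }
}
```

For example, after adding 5, 3, 8, 2, and 4 to an empty tree, remElt(5) removes a root with two populated subtrees, so the largest element of the left subtree (4) becomes the new root.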

1.5 Other Notables in the Java BST Implementation

Several other details are embedded in the full BST implementation. In particular:

Methods common to IBST classes (such as largestElt) go into the IBST interface, not the Iset interface.
The remElt method requires a largestElt method on DataBSTs. Our program never invokes largestElt on an MtBST (convince yourself of this by looking at where it gets used). Unfortunately, the Java type checker isn’t smart enough to determine this, so it requires that all variants of IBST have a largestElt method. This requirement forces largestElt into the IBST interface, and thus into the MtBST class (try removing it and see what error you get). The behavior of largestElt isn’t well-defined on MtBSTs. Therefore, the implementation of largestElt in MtBST simply raises an error if it is ever invoked. We will cover error handling more explicitly in a couple of weeks.
The addElt code also uses casts. The code actually shows another way to handle the type mismatch. We could refine the type of addElt within the IBST interface. The following interface would achieve this:
  interface IBST extends Iset {
    IBST addElt (int elt);
    ...}
Casting is the more common solution, but there are advantages to the interface-based solution. In particular, Java would report an error if the addElt implementation within either BST class returned some Iset other than a BST. Without addElt in the interface, addElt implementations are free to return other Iset implementations, even though those would be nonsensical elsewhere in the program.
We now have two broad criteria to exercise in our test cases: that BSTs provide a proper implementation of Iset and that our implementation satisfies the BST invariant. The incomplete set of tests in the Examples class illustrates these points: test1 checks size as a standalone function; test2 and test3 check that size and addElt interact properly to satisfy the no-duplicates requirement of sets; test4 through test6 begin to check that remElt preserves the BST invariant. We say "begin" here because these tests are low-level examples of the results of the invariant rather than a convincing expression of the invariant itself. We’ll be returning to that point in a couple of days.
A full remElt implementation would choose between upgrading the largest element of the left subtree or smallest element of the right. For simplicity, this code has only provided the former. This decision vastly (and unrealistically) simplified the last three tests in the Examples class.
The tests in the Examples class check both the behavior of the operations on BSTs and the axioms on Iset.
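Two of the points above, largestElt raising an error on MtBST and refining the return type of addElt in IBST, can be sketched together. This is a trimmed illustration under stated assumptions: Iset is cut down to two methods, UnsupportedOperationException is one possible choice of error, and the size check inside largestElt is a simple shortcut for this sketch rather than the approach of the full implementation.

```java
interface Iset {
  Iset addElt(int elt);
  int size();
}

interface IBST extends Iset {
  IBST addElt(int elt); // refines the return type declared in Iset
  int largestElt();
}

class MtBST implements IBST {
  public IBST addElt(int elt) {
    return new DataBST(elt, new MtBST(), new MtBST());
  }

  public int size() { return 0; }

  // Required by IBST even though the program should never call it here.
  public int largestElt() {
    throw new UnsupportedOperationException("largestElt on an empty tree");
  }
}

class DataBST implements IBST {
  int data;
  IBST left;
  IBST right;

  DataBST(int data, IBST left, IBST right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }

  // Because IBST declares addElt as returning IBST, no casts are needed.
  public IBST addElt(int elt) {
    if (elt == this.data) {
      return this;
    } else if (elt < this.data) {
      return new DataBST(this.data, this.left.addElt(elt), this.right);
    } else {
      return new DataBST(this.data, this.left, this.right.addElt(elt));
    }
  }

  public int size() { return 1 + this.left.size() + this.right.size(); }

  // Recur to the right only when that subtree is non-empty, so that
  // largestElt is never invoked on an MtBST.
  public int largestElt() {
    return this.right.size() == 0 ? this.data : this.right.largestElt();
  }
}
```

With this interface, an addElt implementation in either BST class that returned some other kind of Iset would be rejected by the compiler instead of failing on a cast at run time.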

2 AVL Trees

AVL trees are a form of balanced binary search tree (BBST). BBSTs augment the binary search tree invariant to require that the heights of the left and right subtrees at every node differ by at most one ("height" is the number of nodes on the longest path from the root to a leaf, so an empty tree has height 0). For the two trees that follow, the one on the left is a BBST, but the one on the right is not (it is just a BST):

  6             6
 / \           / \
3   8         3   9
   /               \
  7                 12
                   /
                  10
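The balance requirement can be checked with two short methods. The sketch below is illustrative only: the method names height and isBalanced are assumptions, and height is counted in nodes, so an empty tree has height 0. It checks balance only; the ordering part of the BST invariant is assumed to hold already.

```java
interface IBST {
  int height();
  boolean isBalanced();
}

class MtBST implements IBST {
  public int height() { return 0; }
  public boolean isBalanced() { return true; }
}

class DataBST implements IBST {
  int data;
  IBST left;
  IBST right;

  DataBST(int data, IBST left, IBST right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }

  // Number of nodes on the longest path from this node down to a leaf.
  public int height() {
    return 1 + Math.max(this.left.height(), this.right.height());
  }

  // The heights of the two subtrees may differ by at most one,
  // and the same must hold at every node below.
  public boolean isBalanced() {
    return Math.abs(this.left.height() - this.right.height()) <= 1
        && this.left.isBalanced()
        && this.right.isBalanced();
  }
}
```

On the two trees drawn above, isBalanced should return true for the left tree and false for the right one, where the node holding 9 has an empty left subtree and a right subtree of height 2.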

We already saw how to maintain the BST invariants, so we really just need to understand how to maintain balance. We’ll look at this by way of examples. Consider the following BST. It is not balanced, since the left subtree has height 2 and the right has height 0:

      4
     /
    2
   / \
  1   3

How do we balance this tree efficiently? Efficiency will come from not structurally changing more of the tree than necessary. Note that this tree is heavy on the left. This suggests that we could "rotate" the tree to the right, by making 2 the new root:

      2
     / \
    ?   4

  1 .. 3

When we do this, 4 rotates around to the right subtree of 2. But on the left, we have the trees rooted at 1 and 3 that both used to connect to 2. The 3 tree is on the wrong side of a BST rooted at 2. So we move the 3-tree to hang off of 4 and leave 1 as the left subtree of 2:

      2
     / \
    1   4
       /
      3

The 4 node must have an empty child, since it only keeps one of its subtrees after rotation. We can always hang the former right subtree of the new root (2) off the old root (4).

Even though the new tree has the same height as the original, the number of nodes in the two subtrees is closer together. This makes it more likely that we can add elements into either side of the tree without rebalancing later.

A similar rotation counterclockwise handles the case when the right subtree is taller than the left.
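The two single rotations can be sketched without instanceof by reusing the dispatch technique from section 1.3. Everything here is a hedged illustration: rotateRight, rotateLeft, the helpers liftFromLeft and liftFromRight, and the show method (which prints a tree as "(left data right)" with "." for the empty tree) are hypothetical names, not part of the course code.

```java
interface IBST {
  IBST rotateRight(); // clockwise: the left child becomes the root
  IBST rotateLeft();  // counterclockwise: the right child becomes the root
  // Helpers: "this" is the child that moves up to replace the old root.
  IBST liftFromLeft(int oldRoot, IBST oldRight);
  IBST liftFromRight(int oldRoot, IBST oldLeft);
  String show();
}

class MtBST implements IBST {
  public IBST rotateRight() { return this; }
  public IBST rotateLeft() { return this; }
  // An empty child cannot become the root, so rebuild the tree unchanged.
  public IBST liftFromLeft(int oldRoot, IBST oldRight) {
    return new DataBST(oldRoot, this, oldRight);
  }
  public IBST liftFromRight(int oldRoot, IBST oldLeft) {
    return new DataBST(oldRoot, oldLeft, this);
  }
  public String show() { return "."; }
}

class DataBST implements IBST {
  int data;
  IBST left;
  IBST right;

  DataBST(int data, IBST left, IBST right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }

  public IBST rotateRight() {
    return this.left.liftFromLeft(this.data, this.right);
  }

  public IBST rotateLeft() {
    return this.right.liftFromRight(this.data, this.left);
  }

  // Our former right subtree hangs off the left of the old root.
  public IBST liftFromLeft(int oldRoot, IBST oldRight) {
    return new DataBST(this.data, this.left,
                       new DataBST(oldRoot, this.right, oldRight));
  }

  // Our former left subtree hangs off the right of the old root.
  public IBST liftFromRight(int oldRoot, IBST oldLeft) {
    return new DataBST(this.data,
                       new DataBST(oldRoot, oldLeft, this.left), this.right);
  }

  public String show() {
    return "(" + this.left.show() + " " + this.data + " " + this.right.show() + ")";
  }
}
```

Building the tree with 4 at the root, 2 as its left child, and 1 and 3 below 2, then calling rotateRight, gives a tree whose show() is "((. 1 .) 2 ((. 3 .) 4 .))": 2 is the new root and 3 now hangs off 4. Calling rotateLeft on that result gives back the original shape.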

One more example:

     7
    / \
   5   10
      /  \
     8    15
      \
       9

This tree is heavier on the right, so we rotate counterclockwise. This yields

       10
      /  \
     7    15
    / \
   5   8
        \
         9

which is no better than what we started with. Rotating clockwise would give us back the original tree. The problem here is that when we rotate counterclockwise, the left child of the right subtree moves to the left subtree. If that left child is taller than its sibling on the right, the tree can end up unbalanced.

The solution in this case is to first rotate the tree rooted at 10 (in the original tree) clockwise, then rotate the resulting whole tree (starting at 7) counterclockwise:

     7
    / \
   5   8
        \
         10
        /  \
       9    15

-----------------

     8
    / \
   7   10
  /   /  \
 5   9    15

An astute reader would notice that the original tree in this example was not balanced, so maybe this case doesn’t arise in practice. Not so fast. The original tree without the 9 is balanced. So we could start with the original tree sans 9, then addElt(9), and end up with a tree that needs a double rotation. You should convince yourself, however, that we never need more than two rotations if we had a balanced tree before adding or removing an element.

We tie the rotation algorithm into a BBST implementation by using it to rebalance the tree after every addElt or remElt call. The rebalance calls should be built into the addElt or remElt implementations.
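As a rough sketch of how rebalancing can be built into addElt, consider the following. It is an illustration under several assumptions, not the algorithm as the course or Wikipedia presents it: the names tilt, rebalance, liftFromLeft, liftFromRight, and show are hypothetical, only addElt is covered (remElt would rebalance in the same way), and height is recomputed on every call for simplicity, which a serious implementation would avoid by storing heights in the nodes.

```java
interface IBST {
  IBST addElt(int elt);
  int height();
  int tilt(); // left height minus right height
  IBST rotateRight();
  IBST rotateLeft();
  IBST liftFromLeft(int oldRoot, IBST oldRight);
  IBST liftFromRight(int oldRoot, IBST oldLeft);
  String show();
}

class MtBST implements IBST {
  public IBST addElt(int elt) { return new DataBST(elt, this, this); }
  public int height() { return 0; }
  public int tilt() { return 0; }
  public IBST rotateRight() { return this; }
  public IBST rotateLeft() { return this; }
  public IBST liftFromLeft(int oldRoot, IBST oldRight) {
    return new DataBST(oldRoot, this, oldRight);
  }
  public IBST liftFromRight(int oldRoot, IBST oldLeft) {
    return new DataBST(oldRoot, oldLeft, this);
  }
  public String show() { return "."; }
}

class DataBST implements IBST {
  final int data;
  final IBST left;
  final IBST right;

  DataBST(int data, IBST left, IBST right) {
    this.data = data;
    this.left = left;
    this.right = right;
  }

  // Ordinary BST insertion, rebalancing each rebuilt node on the way up.
  public IBST addElt(int elt) {
    if (elt == data) {
      return this;
    } else if (elt < data) {
      return new DataBST(data, left.addElt(elt), right).rebalance();
    } else {
      return new DataBST(data, left, right.addElt(elt)).rebalance();
    }
  }

  public int height() { return 1 + Math.max(left.height(), right.height()); }
  public int tilt() { return left.height() - right.height(); }

  IBST rebalance() {
    if (tilt() > 1) { // too heavy on the left
      // If the left child leans right, rotate it first (double rotation).
      IBST l = left.tilt() < 0 ? left.rotateLeft() : left;
      return new DataBST(data, l, right).rotateRight();
    } else if (tilt() < -1) { // too heavy on the right
      IBST r = right.tilt() > 0 ? right.rotateRight() : right;
      return new DataBST(data, left, r).rotateLeft();
    }
    return this;
  }

  public IBST rotateRight() { return left.liftFromLeft(data, right); }
  public IBST rotateLeft() { return right.liftFromRight(data, left); }

  public IBST liftFromLeft(int oldRoot, IBST oldRight) {
    return new DataBST(data, left, new DataBST(oldRoot, right, oldRight));
  }
  public IBST liftFromRight(int oldRoot, IBST oldLeft) {
    return new DataBST(data, new DataBST(oldRoot, oldLeft, left), right);
  }

  public String show() {
    return "(" + left.show() + " " + data + " " + right.show() + ")";
  }
}
```

Adding 7, 5, 10, 8, 15, and then 9 to an empty tree reproduces the double-rotation example above: the final addElt(9) triggers a clockwise rotation at 10 followed by a counterclockwise rotation at 7, leaving 8 at the root with 7 and 10 as its children.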

Wikipedia has a good description of AVL trees. Refer to that for details on the rotation algorithm.

3 Take Aways for Different Groups of Students

We’ve covered a lot of ground here: invariants, binary search trees, AVL trees, algorithms, instanceof, and casting (oh my!). What should you take away from all this?

If you are here to learn programming, you need to understand invariants on data structures and the concepts of both binary search trees and balanced binary search trees (AVL trees).

If you are here to learn Java in particular, you need to understand casting and how to program without instanceof.

If you are going on to be a CS major, you should understand the algorithms underlying both binary search trees and AVL trees.

In terms of overall course outcomes, I expect that you understand what invariants are and how to program towards them. I expect that you know what binary search trees and AVL trees are (and when they are useful). I do not expect that you know the algorithms in detail (in other words, you will not be tested on details of these algorithms). I expect that you can write code without instanceof and can use casts when necessary to make your code compile.

Summarizing the key details that you should take away from this:

An invariant is a constraint on every valid instance of a data structure. If a data structure has an invariant, all implementations are required to maintain that invariant.
Binary search trees and AVL trees are binary trees with different invariants. Each of these invariants yields a different run-time performance guarantee.
A data structure consists of a known data structure and an (optional) invariant. Follow the design recipe (template, etc) for the known data structure. Build the invariant into the methods that you write over that core template.
Different programming languages explicitly capture different invariants. Types are a common kind of invariant that is captured in code. More complex invariants, such as those for binary search trees or AVL trees, need to be documented in Java classes. We can write separate methods to check for invariants, but Java does not provide mechanisms to check invariants at compile time as it does for types.
For proper OO solutions, rework your code so that it does not use instanceof. You can do this by creating methods with different implementations on different types of data.
Casts are a way to tell the Java compiler that an object of one general type will have another, more specific, type at run time. Java will trust cast annotations at compile time, but will also check the more specific type when the program runs.
