Tree-based Implementations of Sets
Lists provide a plausible data structure for implementing sets, but their runtime performance can be slow on some operations. Lists are, by definition, linear (meaning that there is a straight-line order in which elements get accessed). When we check whether an item is in a list, we may end up traversing the entire list. Put differently, every time we check one element of the list against the item we are searching for, we throw away only one element from further consideration. Using a sorted list (rather than an unsorted one) doesn’t help (we might be looking for the last element). To improve on the efficiency of our set implementation, we need a different implementation that discards more elements with each check.
Binary trees seem a natural way to throw away more elements. If the data is placed in the tree in a useful way, we should be able to discard some portion of the tree (such as the left or right subtree) after each search. There are several tree-based data structures with different placements of data. We will contrast three in detail, using them to illustrate how we start to account for efficiency concerns in program design, as well as how to implement these in Java.
To simplify the discussion, we will focus on trees containing only integers (rather than trees of people or tournament contestants).
1 Binary Search Trees
In a binary search tree (BST), every node has the following property:
Every element in the left subtree is smaller than the element at the root, every element in the right subtree is larger than the element at the root, and both the left and right subtrees are BSTs.
Constraints on all instances of a data structure are called invariants. Invariants are an essential part of program design. Often, we define a new data structure via an invariant on an existing data structure. You need to learn how to identify, state, and implement invariants.
1.1 Understanding BSTs
Consider the following BST (convince yourself that it meets the BST property):
          10
         /  \
        4    12
       / \
      2   7
     / \ / \
    1  3 6  8
         /
        5

With this example in hand, let’s describe how each of the set operations has to behave to maintain or exploit the BST invariant:
size behaves as in a plain binary tree.
hasElt improves on hasElt for a plain binary tree: if the element you’re looking for is not in the root, the search recurs on only one subtree: the left subtree (if the element to find is smaller than the one in the root) or the right subtree (if the element to find is larger than the one in the root).
addElt always inserts new elements at a leaf in the tree. It starts from the root, moving to the left or right subtree as needed to maintain the invariant. When it hits an empty tree, addElt replaces it with a new node containing the data to add and two empty subtrees.
remElt traverses the BST until it finds the node with the element to remove at the root. If the node has no children, remElt returns the empty tree. If the node has only one child, remElt replaces it with its child node. If the node has two children, remElt replaces the value in the node with either the largest element in the left subtree or the smallest element in the right subtree; remElt then removes the moved node value from its subtree.
The wikipedia entry also has diagrams illustrating this operation.
This description illustrates that any operation that modifies the data structure (here, addElt and remElt) must maintain the invariant. Operations that merely inspect the data structure (such as hasElt) are free to exploit the invariant, though the invariant does not necessarily affect all operations (such as size).
An implementation of BSTs following our in-class recipe is in a separate file. Several aspects of this implementation are worth highlighting:
The implementation starts from the definition of a plain binary tree of numbers that neither enforces the invariant nor satisfies the Iset interface:
interface IBST {}
class MtBST implements IBST {
MtBST() {}
}
class DataBST implements IBST {
int data;
IBST left;
IBST right;
DataBST(int data, IBST left, IBST right) {
this.data = data;
this.left = left;
this.right = right;
}
}
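To make the hasElt behavior described earlier concrete, here is a minimal sketch against this skeleton. It is not the sample code’s actual implementation; for brevity it declares hasElt directly in IBST rather than inheriting it from Iset:

```java
interface IBST {
    boolean hasElt(int elt);  // declared here for brevity
}

class MtBST implements IBST {
    MtBST() {}

    // an empty tree contains no elements
    public boolean hasElt(int elt) { return false; }
}

class DataBST implements IBST {
    int data;
    IBST left;
    IBST right;

    DataBST(int data, IBST left, IBST right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    // exploit the BST invariant: recur into only one subtree
    public boolean hasElt(int elt) {
        if (elt == this.data) { return true; }
        else if (elt < this.data) { return this.left.hasElt(elt); }
        else { return this.right.hasElt(elt); }
    }
}
```

Note how each comparison against the root discards an entire subtree from further consideration.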
To indicate that our BSTs will implement the Iset interface, we edit the IBST definition as follows:
interface IBST extends Iset { ... }
Really, we want to say that IBST implements Iset, but Java doesn’t allow interfaces to implement one another. The extends keyword between interfaces achieves this goal.
Methods common to IBST classes (such as largestElt) go into the IBST interface, not the Iset interface.
The remElt method requires a largestElt method on DataBSTs. Our program never invokes largestElt on an MtBST (convince yourself of this by looking at where it gets used). Unfortunately, the Java type checker isn’t smart enough to determine this, so it requires that all variants of IBST have a largestElt method. This requirement forces largestElt into the IBST interface, and thus into the MtBST class (try removing it and see what error you get). The behavior of largestElt isn’t well-defined on MtBSTs. Therefore, the implementation of largestElt on MtBST simply raises an error if it is ever invoked. We will cover error handling more explicitly in a couple of weeks.
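One plausible implementation of largestElt, written as a hedged sketch against the skeleton classes (the sample code may differ in detail): MtBST raises an error, while DataBST walks the right spine.

```java
interface IBST {
    int largestElt();
}

class MtBST implements IBST {
    MtBST() {}

    // largestElt is not well-defined on an empty tree,
    // so raise an error if it is ever invoked
    public int largestElt() {
        throw new RuntimeException("largestElt called on an empty BST");
    }
}

class DataBST implements IBST {
    int data;
    IBST left;
    IBST right;

    DataBST(int data, IBST left, IBST right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    // by the BST invariant, the largest element lies on the right spine
    public int largestElt() {
        if (this.right instanceof MtBST) { return this.data; }
        else { return this.right.largestElt(); }
    }
}
```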
Within addElt, you’ll see a new construct used when we create the DataBST to return:
return new DataBST (this.data,
(IBST) this.left.addElt(elt),
this.right);
Note the (IBST) before the recursive call to addElt on the left subtree. Currently, addElt is declared in the Iset interface. The return value of addElt is an Iset, which is not necessarily an IBST (as DataBST requires). There are two ways to handle this, both shown in the sample code:
Tell Java to view the result of addElt as an IBST. The code fragment above does this by putting (IBST) in front of the result of addElt. This technique is called casting. Casting lets a programmer override Java’s type system, though Java will raise an error at runtime if the cast is not valid.
Refine the type of addElt within the IBST interface. The following interface would achieve this:
interface IBST extends Iset {
IBST addElt (int elt);
...}
Casting is the more common solution, but there are advantages to the interface-based solution. In particular, Java would report an error if the addElt implementation within either BST class returned some Iset other than a BST. Without addElt in the interface, addElt implementations are free to return other Iset implementations, even though those would be nonsensical elsewhere in the program.
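Under the interface-based solution, addElt needs no cast at all. The following sketch shows what such an implementation might look like; it assumes (as sets require) that duplicates are ignored:

```java
interface IBST {
    IBST addElt(int elt);  // refined return type: IBST, not Iset
}

class MtBST implements IBST {
    MtBST() {}

    // adding to an empty tree creates a leaf node
    public IBST addElt(int elt) {
        return new DataBST(elt, new MtBST(), new MtBST());
    }
}

class DataBST implements IBST {
    int data;
    IBST left;
    IBST right;

    DataBST(int data, IBST left, IBST right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    // descend into one subtree to maintain the invariant;
    // return this unchanged on a duplicate, preserving set semantics
    public IBST addElt(int elt) {
        if (elt == this.data) { return this; }
        else if (elt < this.data) {
            return new DataBST(this.data, this.left.addElt(elt), this.right);
        } else {
            return new DataBST(this.data, this.left, this.right.addElt(elt));
        }
    }
}
```

Because the recursive calls are already typed as IBST, the constructor call type-checks without any (IBST) cast.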
We now have two broad criteria to exercise in our test cases: that BSTs provide a proper implementation of Iset and that our implementation satisfies the BST invariant. The incomplete set of tests in the Examples class illustrates these points: test1 checks size as a standalone function; test2 and test3 check that size and addElt interact properly to satisfy the no-duplicates requirement of sets; test4 through test6 begin to check that remElt preserves the BST invariant. We say "begin" here because these tests are low-level examples of the results of the invariant rather than a convincing expression of the invariant itself. We’ll be returning to that point in a couple of days.
A full remElt implementation would choose between promoting the largest element of the left subtree or the smallest element of the right. For simplicity, this code provides only the former. This decision vastly (and unrealistically) simplified the last three tests in the Examples class.
The tests in the Examples class check both the behavior of the operations on BSTs and the axioms on Iset.
2 Formalizing Performance
We set out on this topic hoping to get a more efficient implementation of set operations than lists would provide. In particular, we considered the efficiency of checking whether an element is in the set. Let’s contrast the efficiency of hasElt on a list implementation to that on a BST implementation.
Recall that the list implementation of hasElt eliminates only one element from consideration on each iteration. Let T be a function that consumes the number of elements in a list and produces the worst-case time required to check whether an element is in the list:
T(n) = T(n-1) + c
Read this formula as saying "computing hasElt on a list with n elements takes the time to compute it on n-1 elements plus a constant". The T(n-1) comes from calling hasElt on the rest of the list. The constant c captures the time for the other operations that link hasElt on the rest to hasElt on the whole list (the conditional, or, etc). We use a constant here since none of these extra operations depend on characteristics of the particular datum in the first of the list.
Technically, recursive formulas such as this are called recurrences.
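Unrolling the recurrence makes the total cost visible: each step removes one element from consideration and pays a constant, so after n steps we reach the empty list:

T(n) = T(n-1) + c
     = T(n-2) + 2c
     = ...
     = T(0) + n·c

The worst-case time for hasElt on a list therefore grows linearly in n.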
Let’s write a similar formula for hasElt on a BST. At worst, how many elements still need to be considered after we rule out the root of the tree? A BST could have all the remaining nodes in only one child. If this situation applied to all nodes, the BST is really just a list, so we get the same recurrence:
T(n) = T(n-1) + c
Put differently, in the worst case, a BST is no more efficient than a list. In practice, the BST could be better, since BSTs are not list-shaped as a general rule. However, we get no guarantee of good performance from a BST implementation of sets. If we want guaranteed performance improvements, we need a data structure with a stronger invariant.
When you get to Algorithms (CS2223), you will discuss recurrences and related issues in much more detail, including looking at average-case running times and space efficiency. Recurrences are just one high-level way of expressing the running time of an algorithm. We’ll return to them periodically during the course.
3 AVL Trees
AVL trees are a form of balanced binary search trees (BBST). BBSTs augment the binary search tree invariant to require that the heights of the left and right subtrees at every node differ by at most one ("height" is the length of the longest path from the root to a leaf). For the two trees that follow, the one on the left is a BBST, but the one on the right is not (it is just a BST):
      6               6
     / \             / \
    3   8           3   9
                       / \
                      7   12
                          /
                        10
We already saw how to maintain the BST invariants, so we really just need to understand how to maintain balance. We’ll look at this by way of examples. Consider the following BST. It is not balanced, since the left subtree has height 2 and the right has height 0:
    4
   /
  2
 / \
1   3

To rebalance, we rotate the tree clockwise: the left child (2) becomes the new root, the old root (4) becomes its right child, and 2’s former right subtree (here, just 3) moves across to become the left subtree of 4:

  2
 / \
1   4
   /
  3
Even though the new tree has the same height as the original, the number of nodes in the two subtrees is closer together. This makes it more likely that we can add elements into either side of the tree without rebalancing later.
A similar rotation counterclockwise handles the case when the right subtree is taller than the left.
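A clockwise rotation can be sketched in Java as follows. The Node class and rotateRight are hypothetical names, not from the sample code; null stands in for an empty subtree to keep the sketch short:

```java
class Node {
    int data;
    Node left;   // null represents an empty subtree
    Node right;

    Node(int data, Node left, Node right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    // clockwise rotation: the left child is promoted to the root,
    // and its former right subtree becomes the old root's left subtree
    static Node rotateRight(Node root) {
        return new Node(root.left.data,
                        root.left.left,
                        new Node(root.data, root.left.right, root.right));
    }
}
```

The counterclockwise rotation is the mirror image, promoting the right child instead.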
One more example:
    7
   / \
  5   10
     /  \
    8    15
     \
      9

Rotating the whole tree counterclockwise (since the right subtree is taller) does not restore balance; the chain through 8 and 9 just moves to the left side:

     10
    /  \
   7    15
  / \
 5   8
      \
       9
The solution in this case is to first rotate the tree rooted at 10 (in the original tree) clockwise, then rotate the resulting whole tree (starting at 7) counterclockwise:
    7
   / \
  5   8
       \
        10
       /  \
      9    15

and then, after the counterclockwise rotation of the whole tree:

      8
     / \
    7   10
   /   /  \
  5   9    15
An astute reader would notice that the original tree in this example was not balanced, so maybe this case doesn’t arise in practice. Not so fast. The original tree without the 9 is balanced. So we could start with the original tree sans 9, then addElt(9), and end up with a tree that needs a double rotation. You should convince yourself, however, that we never need more than two rotations if we had a balanced tree before adding or removing an element.
We tie the rotation algorithm into a BBST implementation by using it to rebalance the tree after every addElt or remElt call. The rebalance calls should be built into the addElt or remElt implementations.
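The rebalancing step needs to detect imbalance in the first place. Here is a hedged sketch of that check, again with a hypothetical null-based node class rather than the sample code’s classes:

```java
class AVLNode {
    int data;
    AVLNode left;   // null represents an empty subtree
    AVLNode right;

    AVLNode(int data, AVLNode left, AVLNode right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    // height is the length of the longest path from the root to a leaf;
    // an empty tree has height -1 so that a single node has height 0
    static int height(AVLNode t) {
        if (t == null) { return -1; }
        return 1 + Math.max(height(t.left), height(t.right));
    }

    // the balance condition at a node: subtree heights differ by at most one
    static boolean balancedAt(AVLNode t) {
        return Math.abs(height(t.left) - height(t.right)) <= 1;
    }
}
```

Recomputing heights from scratch like this is slow; real AVL implementations cache the height (or a balance factor) in each node and update it during addElt and remElt.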
Wikipedia has a good description of AVL trees. Refer to that for details on the rotation algorithm.
3.1 AVL-tree Performance
What is the recurrence for the time performance of hasElt on AVL trees? Since the tree is balanced, we are guaranteed to throw away roughly half of the remaining elements on each iteration. Therefore,
T(n) = T(n/2) + c
Any recurrence of this form produces a solution in log(n) iterations. This is a clear performance improvement over BSTs or lists, though at the cost of a more complicated algorithm.
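Unrolling this recurrence shows where the log comes from: each step halves the remaining input and pays a constant, so the input is exhausted after about log2(n) halvings:

T(n) = T(n/2) + c
     = T(n/4) + 2c
     = ...
     = T(1) + c·log2(n)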
4 Summary
An invariant is a constraint on every valid instance of a data structure. If a data structure has an invariant, all implementations are required to maintain that invariant.
Binary search trees and AVL trees are binary trees with different invariants. Each of these invariants yields a different runtime performance guarantee.
A new data structure often consists of an existing data structure and an (optional) invariant. Follow the design recipe (template, etc.) for the existing data structure, and build the invariant into the methods that you write over that core template.
Different programming languages explicitly capture different invariants. Types are a common form of invariant that is captured in code. More complex invariants, such as those for binary search trees or AVL trees, need to be documented in Java classes. We can write separate methods to check for invariants, but Java does not provide mechanisms to check invariants at compile-time as it does for types.
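For example, here is one way to write such an invariant-checking method for the BST invariant. This is a sketch with a hypothetical null-based node class; the bounds are exclusive, so it does not handle elements equal to Integer.MIN_VALUE or Integer.MAX_VALUE:

```java
class Node {
    int data;
    Node left;   // null represents an empty subtree
    Node right;

    Node(int data, Node left, Node right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    // check the BST invariant: every element must fall strictly within
    // the bounds imposed by its ancestors
    static boolean isBST(Node t) {
        return isBST(t, Integer.MIN_VALUE, Integer.MAX_VALUE);
    }

    static boolean isBST(Node t, int lo, int hi) {
        if (t == null) { return true; }
        return lo < t.data && t.data < hi
            && isBST(t.left, lo, t.data)
            && isBST(t.right, t.data, hi);
    }
}
```

Such a checker is only ever run in tests (or assertions); the invariant itself is enforced by the discipline of the addElt and remElt implementations.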