Introduction to Data Structures and Binary Search Trees
1 Three Motivating Problems
1.1 Operations on Sets
2 The Efficiency of Lists for Sets
3 Binary Search Trees
3.1 Understanding BSTs
4 Implementing BSTs
4.1 Run-time Performance of Set Operations via BSTs

Introduction to Data Structures and Binary Search Trees

Today, we shift gears away from Java Programming and start a new topic: Data Structures.

Up until now, the CS1101/CS2102 sequence has emphasized helping you figure out what shape of data a problem requires. We’ve mainly studied three general shapes of data.

Until now, our goal has been to help you see the relationships and structure within data so that you could pick some shape to capture your information. Today we shift gears, and start thinking about other criteria for choosing how to represent data. Specifically, we are going to think about which operations are most important in a specific application, and choose data organizations which allow those operations to perform as fast as possible.

This general topic is called data structures – different ways of organizing the same data and providing the same operations, but with different characteristics of how the operations behave. We’ll focus this week on efficiency, but data structures also pull in other issues such as security, maintainability, and the like.

1 Three Motivating Problems

Consider the following problems:

Each of these problems involves working with a bunch of strings that cannot contain duplicates. This suggests that all three problems could be handled with a list of strings. But if we look more closely, each problem cares more about certain operations on the collection of strings:

There are many ways to organize a bunch of strings as data (whether as Java classes, Racket structs, or constructs in other languages). Different organizations make different operations easier or harder to perform. Today, we begin to look at several different data structures that you can use to capture a bunch of strings and the tradeoffs between them in terms of which operations they best support.

1.1 Operations on Sets

To be more precise, we will be studying data structures for sets: collections of elements without duplicates. We’ll return to the question of duplicates a bit later.

There are many standard operations on sets. Here, we will be concerned with supporting the following operations: addElt (add an element to the set), remElt (remove an element), hasElt (determine whether an element is in the set), and size (report how many elements the set contains).
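
As a sketch in Java, these operations could be captured in an interface like the following; the name ISet, the use of int elements (matching the trees later in these notes), and the exact signatures are illustrative choices, not a fixed design:

  // Hypothetical interface for a set of integers; the name and the
  // signatures are assumptions for illustration only.
  interface ISet {
    ISet addElt(int elt);     // produce a set that also contains elt
    ISet remElt(int elt);     // produce a set without elt
    boolean hasElt(int elt);  // is elt in the set?
    int size();               // how many elements does the set contain?
  }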

2 The Efficiency of Lists for Sets

Lists provide a plausible data structure for implementing sets, but their run-time performance can be slow on some operations. Lists are, by definition, linear (meaning that there is a straight-line order in which elements get accessed). When we check whether an item is in a list, we may end up traversing the entire list. Put differently, every time we check one element of the list against the item we are searching for, we throw away only one element from further consideration. Using a sorted list (rather than an unsorted one) doesn’t help (we might be looking for the last element). To improve on the efficiency of our set implementation, we need a different implementation that discards more elements with each check.
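
To see the cost concretely, here is a minimal sketch of hasElt over a hand-rolled linked list of integers (the names ILoI, MtLoI, and ConsLoI are made up for illustration). Each comparison rules out a single element, so in the worst case the search walks the entire list.

  // A minimal cons-list of ints; each hasElt comparison rules out only
  // one element, so the search is linear in the worst case.
  interface ILoI {
    boolean hasElt(int elt);
  }

  class MtLoI implements ILoI {
    public boolean hasElt(int elt) { return false; }  // nothing left to check
  }

  class ConsLoI implements ILoI {
    int first;
    ILoI rest;

    ConsLoI(int first, ILoI rest) {
      this.first = first;
      this.rest = rest;
    }

    public boolean hasElt(int elt) {
      return this.first == elt || this.rest.hasElt(elt);
    }
  }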

Trees seem a natural way to throw away more elements. If the data is placed in the tree in a useful way, we should be able to discard some portion of the tree (such as the left or right subtree) with each comparison. There are several tree-based data structures with different placements of data. We will contrast three in detail, using them to illustrate how we start to account for efficiency concerns in program design. We’ll also look at how to implement these in Java.

To simplify the discussion, we will focus on trees containing only integers (rather than trees of people or tournament contestants).

3 Binary Search Trees

In a binary search tree (BST), every node has the following property:

Every element in the left subtree is smaller than the element at the root, every element in the right subtree is larger than the element at the root, and both the left and right subtrees are BSTs.

(This statement assumes no duplicates, but that is okay since we are modeling sets.) Wikipedia’s entry on binary search trees has diagrams of trees with this property.

Constraints on all instances of a data structure are called invariants. Invariants are an essential part of program design. Often, we define a new data structure via an invariant on an existing data structure. You need to learn how to identify, state, and implement invariants.

3.1 Understanding BSTs

Consider the following BST (convince yourself that it meets the BST property):

         10
       /    \
      4      12
    /   \
   2     7
 /  \   /  \
1   3  6    8
      /
     5

As a way to get familiar with BSTs and to figure out how the addElt and other set operations work on BSTs, work out (by hand) the trees that should result from each of the following operations on the original tree: addElt(9), remElt(6), remElt(3), remElt(4).

The interesting case is remElt(4). Here, the solution that requires the least change in the tree replaces the 4 with either the 3 (largest element to the left of 4) or the 5 (smallest element on the right of 4) – you should convince yourself that replacing 4 with either of these numbers still leaves a BST.
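
For instance, replacing the 4 with the 3 yields the following tree, which still satisfies the BST property:

         10
       /    \
      3      12
    /   \
   2     7
  /     / \
 1     6   8
      /
     5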

With your examples in hand, let’s describe how each of the set operations has to behave to maintain or exploit the BST invariant. hasElt compares the item to the element at the root: if they match, the item is in the set; if the item is smaller, only the left subtree needs to be searched; if it is larger, only the right subtree. addElt makes the same comparisons to descend to the spot where the new element belongs and adds it there, so that the resulting tree still satisfies the invariant. remElt locates the element and then repairs the tree; when the removed element has two non-empty subtrees, it gets replaced by either the largest element of its left subtree or the smallest element of its right subtree, as in the remElt(4) example above. size simply counts the elements and neither exploits nor threatens the invariant.

This description illustrates that any operation that modifies the data structure (here, addElt and remElt) must maintain the invariant. Operations that merely inspect the data structure (such as hasElt) are free to exploit the invariant, though the invariant does not necessarily affect all operations (such as size).

4 Implementing BSTs

We haven’t yet implemented tree programs in 2102, so let’s look at the actual code for BSTs. What classes do you need?

Each node has an integer, a left subtree, and a right subtree. You might think you need one class (say, Node), in which each of the left and right subtrees are nodes. How do we build a BST with only one element this way, though? We would have to do something like:

  Node fiveTree = new Node(5, new Node(???), new Node(???))

Since we can’t build nodes with no data, we clearly need some other value to represent the empty tree. In fact, any time you create a tree you need two classes (empty tree and non-empty tree), as well as an interface to tie them together:

  interface IBST {}

  class EmptyBST implements IBST {
    EmptyBST() {}
  }

  class DataBST implements IBST {
    int data;
    IBST left;
    IBST right;

    DataBST(int data, IBST left, IBST right) {
      this.data = data;
      this.left = left;
      this.right = right;
    }
  }

(If you know about null, you may want to use it here, but that isn’t good OO practice. null means "we know nothing", which is different from "we know we have a tree with no contents".)
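
With these two classes, the one-element tree that we could not build before is easy to construct (assuming a DataBST constructor that takes the data and the two subtrees, as in the sketch above):

  IBST fiveTree = new DataBST(5, new EmptyBST(), new EmptyBST());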

Once you have these classes, the implementations of size, hasElt, and addElt are fairly straightforward.
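
Here is one possible sketch of those methods filled into the classes from above. The method bodies are an illustration of a reasonable implementation (written in a style where addElt returns the resulting tree), not necessarily the exact course code.

  interface IBST {
    int size();               // how many elements are in this tree?
    boolean hasElt(int elt);  // is elt in this tree?
    IBST addElt(int elt);     // a tree like this one that also contains elt
  }

  class EmptyBST implements IBST {
    EmptyBST() {}

    public int size() { return 0; }

    public boolean hasElt(int elt) { return false; }

    // Adding to an empty tree produces a one-element tree.
    public IBST addElt(int elt) {
      return new DataBST(elt, new EmptyBST(), new EmptyBST());
    }
  }

  class DataBST implements IBST {
    int data;
    IBST left;
    IBST right;

    DataBST(int data, IBST left, IBST right) {
      this.data = data;
      this.left = left;
      this.right = right;
    }

    public int size() {
      return 1 + this.left.size() + this.right.size();
    }

    // Exploit the invariant: look only in the subtree that could hold elt.
    public boolean hasElt(int elt) {
      if (elt == this.data) { return true; }
      else if (elt < this.data) { return this.left.hasElt(elt); }
      else { return this.right.hasElt(elt); }
    }

    // Maintain the invariant: insert elt into the subtree where it belongs;
    // ignore duplicates, since we are modeling sets.
    public IBST addElt(int elt) {
      if (elt == this.data) { return this; }
      else if (elt < this.data) {
        return new DataBST(this.data, this.left.addElt(elt), this.right);
      } else {
        return new DataBST(this.data, this.left, this.right.addElt(elt));
      }
    }
  }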

The implementation of remElt, however, has some interesting implications in Java. Here are some separate (optional) notes on that – these show the solution to this week’s advanced lab assignment.

4.1 Run-time Performance of Set Operations via BSTs

Our description of hasElt on BSTs suggests that we’ve made progress on our performance problems with hasElt: on each comparison, we throw away one subtree, which is more than the single element we got to throw away on each comparison within lists.

Stop and think: do we always get to throw away more than one element in hasElt on a BST?

Consider the BST resulting from the following sequence of calls:

addElt(5), addElt(4), addElt(3), addElt(2), addElt(1).

Draw the BST – what does it look like? It looks like a list. So we didn’t actually gain any performance from hasElt in this case.
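
Concretely, the tree ends up with every element hanging off a left branch:

          5
         /
        4
       /
      3
     /
    2
   /
  1
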

This illustrates one of the subtleties of arguing about run-time performance: we have to distinguish what happens in the best case, worst case, and average case. With lists, the performance of hasElt is linear in each of the best, worst, and average cases. With BSTs, the best case performance of hasElt is logarithmic (the mathematical term for "we throw away half each time"), but the worst case is still linear. Ideally, we would like a data structure in which the worst case performance of hasElt is also logarithmic.