Introduction to Data Structures and Binary Search Trees
Today, we shift gears away from Java Programming and start a new topic: Data Structures.
Up until now, the CS1101/CS2102 sequence has emphasized helping you figure out what shape of data a problem requires. We’ve mainly studied three general shapes:
Data with a fixed number of components/parts: We capture this with a define-struct in Racket and a new class in Java.
Data with an arbitrary number of elements (of the same type): Lists (using cons in Racket or LinkedList in Java).
Hierarchical data (such as family trees): In Racket, you defined trees using define-struct; in Java, you would define classes to capture trees.
Until now, our goal has been to help you see the relationships and structure within data so that you could pick some shape to capture your information. Today we shift gears, and start thinking about other criteria for choosing how to represent data. Specifically, we are going to think about which operations are most important in a specific application, and choose data organizations which allow those operations to perform as fast as possible.
This general topic is called data structures – different ways of organizing the same data and providing the same operations, but with different characteristics of how the operations behave. We’ll focus this week on efficiency, but data structures also pull in other issues such as security, maintainability, and the like.
1 Three Motivating Problems
Consider the following problems:
Store all of the URLs visited from a web browser, with the ability to check whether a specific URL has been visited.
Gather all the words that someone generated during a word game (such as "write down all the words you can make from the letters d e h l l l o o r w"), with the ability to add words, find out how many words the player generated, and easily access the longest words for purposes of scoring.
Maintain information on who is coming to a dinner party. In addition to adding and removing names, you need to be able to compare the current number of attendees to the number of chairs in your apartment.
Each of these problems involves working with a bunch of strings that cannot have duplicates. This suggests that all three problems could be done as a list of strings. But if we look more closely, each problem is more concerned with certain operations on the collection of strings:
The web browser problem needs to check membership (whether a particular string is in the bunch).
The word game needs to access the strings in order, in this case based on the length of the string.
The dinner party RSVP needs to check the size of the bunch and make it easy to remove guests who initially expect to come but then have a change of plans.
There are many ways to organize a bunch of strings as data (whether as Java classes, Racket structs, or constructs in other languages). Different organizations make different operations easier or harder to perform. Today, we begin to look at several different data structures that you can use to capture a bunch of strings and the tradeoffs between them in terms of which operations they best support.
1.1 Operations on Sets
Getting more precise, we will be studying data structures for sets, collections of elements without duplicates. We’ll return to the question of duplicates a bit later.
There are many standard operations on sets. Here, we will be concerned with supporting the following operations:
addElt: Adds an item to the set
remElt: Removes an item from the set
size: Reports how many items are in the set
hasElt: Checks whether a specific item is in the set
2 The Efficiency of Lists for Sets
Lists provide a plausible data structure for implementing sets, but their run-time performance can be slow on some operations. Lists are, by definition, linear (meaning that there is a straight-line order in which elements get accessed). When we check whether an item is in a list, we may end up traversing the entire list. Put differently, every time we check one element of the list against the item we are searching for, we throw away only one element from further consideration. Using a sorted list (rather than an unsorted one) doesn’t help (we might be looking for the last element). To improve on the efficiency of our set implementation, we need a different implementation that discards more elements with each check.
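To make the linearity concrete, here is one possible list-backed implementation of the set operations (the class name ListSet and the exact signatures are illustrative, not from these notes):

```java
// A sketch of a list-backed set; names and details are illustrative.
import java.util.LinkedList;

class ListSet {
    private final LinkedList<String> elts = new LinkedList<>();

    void addElt(String s) {
        if (!hasElt(s)) elts.add(s);   // sets have no duplicates
    }

    void remElt(String s) {
        elts.remove(s);
    }

    int size() {
        return elts.size();
    }

    // Linear search: each comparison rules out only one element,
    // so in the worst case we traverse the entire list.
    boolean hasElt(String s) {
        for (String e : elts) {
            if (e.equals(s)) return true;
        }
        return false;
    }
}
```

Note that hasElt visits elements one at a time; sorting the list would not change this worst case, which is exactly the problem trees are meant to address.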
Trees seem a natural way to throw away more elements. If the data is placed in the tree in a useful way, we should be able to discard some portion of the tree (such as the left or right subtree) after each search. There are several tree-based data structures with different placements of data. We will contrast three in detail, using them to illustrate how we start to account for efficiency concerns in program design. We’ll also look at how to implement these in Java.
To simplify the discussion, we will focus on trees containing only integers (rather than trees of people or tournament contestants).
3 Binary Search Trees
In a binary search tree (BST), every node has the following property:
Every element in the left subtree is smaller than the element at the root, every element in the right subtree is larger than the element at the root, and both the left and right subtrees are BSTs.
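As a concrete illustration, the invariant can be checked recursively by passing down the bounds each subtree must respect. This sketch uses a minimal Node class of my own (with null standing in for the empty tree, purely for brevity here); the classes actually developed later in these notes differ:

```java
// Illustrative BST-invariant check; Node is a stand-in class,
// and null is used for the empty subtree only to keep this short.
class Node {
    final int data;
    final Node left, right;
    Node(int data, Node left, Node right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }
}

class BSTCheck {
    // Every element in t must lie strictly between lo and hi.
    static boolean isBST(Node t, int lo, int hi) {
        if (t == null) return true;                  // an empty tree is a BST
        if (t.data <= lo || t.data >= hi) return false;
        return isBST(t.left, lo, t.data)             // left: all smaller than root
            && isBST(t.right, t.data, hi);           // right: all larger than root
    }

    static boolean isBST(Node t) {
        return isBST(t, Integer.MIN_VALUE, Integer.MAX_VALUE);
    }
}
```

Passing bounds down captures the "every element" part of the invariant: it is not enough to compare each node only with its immediate children.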
Constraints on all instances of a data structure are called invariants. Invariants are an essential part of program design. Often, we define a new data structure via an invariant on an existing data structure. You need to learn how to identify, state, and implement invariants.
3.1 Understanding BSTs
Consider the following BST (convince yourself that it meets the BST property):
          10
         /  \
        4    12
       / \
      2   7
     / \ / \
    1  3 6  8
        /
       5
Try working through examples of the set operations on this tree. The interesting case is remElt(4). Here, the solution that requires the least change in the tree replaces the 4 with either the 3 (the largest element to the left of 4) or the 5 (the smallest element to the right of 4) – you should convince yourself that replacing 4 with either of these numbers still leaves a BST.
With your examples in hand, let’s describe how each of the set operations has to behave to maintain or exploit the BST invariant:
size behaves as in a plain binary tree.
hasElt improves on hasElt for a plain binary tree: if the element you’re looking for is not at the root, the search recurs on only one of the subtrees – the left subtree (if the element to find is smaller than the one at the root) or the right subtree (if the element to find is larger than the one at the root).
addElt always inserts new elements at a leaf in the tree. It starts from the root, moving to the left or right subtree as needed to maintain the invariant. When it hits an empty tree, addElt replaces it with a new node with the data to add and two empty subtrees.
remElt traverses the BST until it finds the node with the element to remove at the root. If the node has no children, remElt returns the empty tree. If the node has only one child, remElt replaces it with its child node. If the node has two children, remElt replaces the value in the node with either the largest element in the left subtree or the smallest element in the right subtree; remElt then removes the moved node value from its subtree.
The Wikipedia entry on binary search trees also has diagrams illustrating this operation.
This description illustrates that any operation that modifies the data structure (here, addElt and remElt) must maintain the invariant. Operations that merely inspect the data structure (such as hasElt) are free to exploit the invariant, though the invariant does not necessarily affect all operations (such as size).
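The description of remElt above can be sketched as follows. This is a hedged sketch on another minimal null-based node type of my own (the real, null-free classes come in the next section); the two-children case replaces the removed value with the largest element in the left subtree, then removes that element from where it came:

```java
// Illustrative remElt; RNode is a stand-in class, and null stands
// for the empty subtree only to keep this sketch short.
class RNode {
    int data;
    RNode left, right;
    RNode(int data, RNode left, RNode right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }
}

class BSTRemove {
    static RNode remElt(RNode t, int elt) {
        if (t == null) return null;                          // element not found
        if (elt < t.data) { t.left = remElt(t.left, elt); return t; }
        if (elt > t.data) { t.right = remElt(t.right, elt); return t; }
        // Found the node holding elt.
        if (t.left == null) return t.right;                  // zero or one child
        if (t.right == null) return t.left;
        // Two children: replace with the largest element in the left
        // subtree, then remove that element from the left subtree.
        int max = maxElt(t.left);
        t.data = max;
        t.left = remElt(t.left, max);
        return t;
    }

    // The largest element in a BST is found by walking right.
    static int maxElt(RNode t) {
        return (t.right == null) ? t.data : maxElt(t.right);
    }
}
```

Using the smallest element of the right subtree instead would work symmetrically, as the example with remElt(4) above suggests.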
4 Implementing BSTs
We haven’t yet implemented tree programs in 2102, so let’s look at the actual code for BSTs. What classes do you need?
Each node has an integer, a left subtree, and a right subtree. You might think you need one class (say, Node), in which the left and right subtrees are themselves Nodes. How do we build a BST with only one element this way, though? We would have to do something like:
Node fiveTree = new Node(5, new Node(???), new Node(???));
Since we can’t build nodes with no data, we clearly need some other value to represent the empty tree. In fact, any time you create a tree you need two classes (empty tree and non-empty tree), as well as an interface to tie them together:
interface IBST {}

class EmptyBST implements IBST {
    EmptyBST() {}
}

class DataBST implements IBST {
    int data;
    IBST left;
    IBST right;

    DataBST(int data, IBST left, IBST right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }
}
(If you know null, you may want to use it here, but that isn’t good OO practice. null means "we know nothing", which is different from "we know we have a tree with no contents".)
Once you have these classes, the implementations of size, hasElt, and addElt are fairly straightforward. Here is the code.
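Since the linked code is not reproduced in these notes, here is one possible sketch of those three methods on the two classes (the method names follow the notes; everything else is my assumption):

```java
// Sketch of size, hasElt, and addElt for the BST classes;
// details beyond the method names in the notes are illustrative.
interface IBST {
    int size();
    boolean hasElt(int elt);
    IBST addElt(int elt);
}

class EmptyBST implements IBST {
    EmptyBST() {}

    public int size() { return 0; }

    public boolean hasElt(int elt) { return false; }

    // Adding to an empty tree yields a node with two empty subtrees.
    public IBST addElt(int elt) {
        return new DataBST(elt, new EmptyBST(), new EmptyBST());
    }
}

class DataBST implements IBST {
    int data;
    IBST left;
    IBST right;

    DataBST(int data, IBST left, IBST right) {
        this.data = data;
        this.left = left;
        this.right = right;
    }

    public int size() {
        return 1 + left.size() + right.size();
    }

    // Exploit the invariant: search only the side that could hold elt.
    public boolean hasElt(int elt) {
        if (elt == data) return true;
        return (elt < data) ? left.hasElt(elt) : right.hasElt(elt);
    }

    // Maintain the invariant: insert on the side where elt belongs.
    public IBST addElt(int elt) {
        if (elt == data) return this;   // no duplicates in a set
        if (elt < data) return new DataBST(data, left.addElt(elt), right);
        return new DataBST(data, left, right.addElt(elt));
    }
}
```

Notice that hasElt exploits the invariant, addElt maintains it, and size ignores it, exactly as described in the previous section.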
The implementation of remElt, however, has some interesting implications in Java. Here are some separate (optional) notes on that – these show the solution to this week’s advanced lab assignment.
4.1 Run-time Performance of Set Operations via BSTs
Our description of hasElt on BSTs suggests that we’ve made progress on our performance problems with hasElt: on each comparison, we throw away one subtree, which is more than the single element we got to throw away on each comparison within lists.
Stop and think: do we always get to throw away more than one element in hasElt on a BST?
Consider the BST resulting from the following sequence of calls:
addElt(5), addElt(4), addElt(3), addElt(2), addElt(1).
Draw the BST – what does it look like? It looks like a list. So we didn’t actually gain any performance from hasElt in this case.
This illustrates one of the subtleties of arguing about run-time performance: we have to distinguish what happens in the best case, worst case, and average case. With lists, the performance of hasElt is linear in each of the best, worst, and average cases. With BSTs, the best case performance of hasElt is logarithmic (the mathematical term for "we throw away half each time"), but the worst case is still linear. Ideally, we would like a data structure in which the worst case performance of hasElt is also logarithmic.
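We can see the worst case directly by measuring tree height after sequential inserts. This small sketch (the class SkewNode and its height method are my own illustration, again using null for the empty subtree for brevity) shows that inserting sorted data yields a tree as deep as it has elements, while a shuffled insertion order stays shallower:

```java
// Illustrates the degenerate, list-shaped BST; names are illustrative.
class SkewNode {
    final int data;
    SkewNode left, right;   // null = empty subtree, for brevity

    SkewNode(int data) { this.data = data; }

    void addElt(int elt) {
        if (elt < data) {
            if (left == null) left = new SkewNode(elt);
            else left.addElt(elt);
        } else if (elt > data) {
            if (right == null) right = new SkewNode(elt);
            else right.addElt(elt);
        }
        // elt == data: already present, nothing to do
    }

    // Height = length of the longest root-to-leaf path (counting nodes).
    int height() {
        int l = (left == null) ? 0 : left.height();
        int r = (right == null) ? 0 : right.height();
        return 1 + Math.max(l, r);
    }
}
```

Inserting 5, 4, 3, 2, 1 in that order produces a left-leaning chain of height 5, so hasElt degenerates to the linear search we had with lists; the same five values in a better order give height 3.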