Homework 5: Maps, Memoization and Graphs

1 Problem 1: Levenshtein Distance

Have you ever used a software application (such as a text-messaging tool) that tries to automatically correct or complete your typing? For example, if I typed "Jaca" in a smartphone, the editor might list "Java" as an alternative. Such applications find alternatives using a measure of the "closeness" of two words. One standard metric for closeness is called Levenshtein distance. For this problem, you will write a function to compute the Levenshtein distance between two strings.

Levenshtein distance computes the minimum number of operations needed to convert one string into another. Valid operations here include inserting a character, deleting a character or substituting one single character with another. For example, we could convert "baking" to "masking" by (1) changing "b" to "m" and (2) inserting an "s". (Other sequences of operations could also do the conversion–such as deleting all the letters of one word and adding all the letters of the second–but our goal is to compute the fewest number of operations required.)

Levenshtein distance can be computed with a straightforward recursive algorithm that works on prefixes of the initial strings. Let’s illustrate first with some examples (we’re calling the method LevDist here):

LevDist(hi","hi") = 0
LevDist("","hi") = 2
If one string is empty, we need one operation (insertion or deletion) for each character in the non-empty string.
LevDist("my","me") = 1
Intuitively, we see that the answer here should be 1, since we should be able to replace "y" with "e". To compute this though, the algorithm tries three options:
1. Compute the distance between the first word and all-but-the-last character of the second word, then add 1 (to account for "deleting" the last character).
  LevDist("my", "m") + 1
2. Compute the distance between all-but-the-last character of the first word and the second word, then add 1 (to account for "inserting" the last character).
  LevDist("m", "me") + 1
3. Compute the distance between the all-but-last character of the first word and all-but-last character of the second word, then add 1 (to account for "replacing" one letter with another).
  LevDist("m", "m") + 1
Whichever of these three numbers is smallest is our answer.

Summarizing, the core algorithm is as follows (where str-last means the substring of str that drops the last character, and lastChar(str) is the last character in str):

updated: Thurs Dec 1, 7pm

int LevDist(str1,str2) {

if str1 == str2

return 0

else if str1 == ""

return length(str2)

else if str2 == ""

return length(str1)

else if (lastChar(str1) == lastChar(str2))

return LevDist(str1-last, str2-last)

else

return 1 + min (LevDist(str1-last, str2),

LevDist(str1, str2-last),

LevDist(str1-last, str2-last))

}

Note that here we are interested only in the number of operations, NOT the sequence of operations to perform.

1.1 Understand the Algorithm

Before you go further, make sure you understand how the algorithm works. Work out a couple of smaller (strings fewer than 4 characters) examples by hand.

1.2 Implement the Algorithm

First, import java.lang.Math into your file.

Create a class called WordFinder. This class should implement a public method called LevDist as described above. This method will take two Strings as inputs and return an integer.

In writing this, you may want to use the following methods on Strings. These methods work with indices into a string: the first character in a string is at index 0, and the last character is at the index of one less than the length of the string.

substring, which takes two integers (both valid indices into the string) and returns the substring between those indices, inclusive. See the substring documentation if you need examples or details.
length, which returns the length of the string as an integer.
charAt, which takes an integer and returns the character at that index into the string.

Your implementation should follow the recursive structure shown above. Your code should not iterate over the indices into the string (a warning mainly for those of you with prior Java experience). You should not convert the strings to another data structure, or add any other data structure for this basic implementation.

Warning: Descriptions of the Levenshtein algorithm online are likely to confuse you more than help you (they are largely based on auxiliary data structures and iteration over indices). If you are confused about the algorithm, we strongly recommend the discussion board, office hours, or discussions with classmates.

1.3 Testing

Write a set of test cases for LevDist (in your Examples class). We will be checking for whether you are testing the various paths through the code, as well as interesting or potentially challenging cases. Diversity in the kinds of inputs you test is more important than the sheer number of tests (in other words, we want quality over quantity).

1.4 Optimize the Performance with a HashMap

The naive version of the algorithm you have implemented can be expensive to run on longer strings, because it might do the same computation multiple times. Each time you call LevDist on a pair of non-empty strings, it spawns three more calls. These in turn can each spawn three more calls. If you write out an example, you can find multiple calls to LevDist with the same inputs. Since LevDist always returns the same answer on the same inputs, these additional calls are wasteful.

To improve the performance, we would ideally like to remember the inputs and corresponding answers from previous calls to the LevDist method. LevDist can then check to see whether the answer on its inputs has been computed before: if it has, it should just return the previously-computed answer; if not, it can compute the answer and store it for future use.

A HashMap is a good data structure for remembering the previous inputs and outputs: the keys are a data structure containing the inputs; the corresponding result of LevDist is the value stored for that key. Storing previously-computed answers to a function is called memoization. It can be a useful technique for improving performance of tree-shaped computations in which many paths lead to the same core calculations. For this problem, you will memoize LevDist.

Concretely, you should:

Determine a single representation for the pair of strings given as inputs to LevDist,
Add a HashMap to WordFinder that uses your representation as the key and the results of LevDist as the value,
Edit your LevDist function to use your HashMap to store and retrieve previously-computed answers as described above.
Answer the following question in a comment after your existing test cases in the Examples class: how would you test that your optimized version is correct? Describe what you would test, and the rough mechanics you would use to do so. You do not need to implement your proposal – just describe it with technical precision.
Mark which of your test cases should benefit from the optimization (do this is a brief //optimize test comment before each relevant test case). If you don’t already have concrete test cases that earn this mark, add some.

2 Problem 2: Graphs

In class, we looked at graphs with cities as the core content of the nodes, and we wrote a hasRoute method to determine whether there exists a route from one city to another. For this assignment, we want a different twist on a similar problem: we’d like to gather a list of all the cities that one can get to from a given starting city. In other words, we want to add a method reachableFrom to graphs, as captured by the following (partial) interface:

interface IGraph { ...

// produce a list of all nodes reachable from given starting node
LinkedList<Node> reachableFrom (Node from);
}

For the graph example we’ve been using in lecture, the list of nodes reachable from the node for Providence would contain both Providence and Hartford. The list of nodes reachable from the node for Hartford would just be Hartford.

Concretely, solve the following problems, starting from the initial graph implementation from Thursday’s lecture:

Add a reachableFrom method to the Graph class.
Include a good set of tests for reachableFrom. Ideally, your tests should not presume a specific order for nodes in the output of reachableFrom. (Good tests that assume a specific order will earn points towards test cases; tests that are independent of the order of output can earn points towards testing methods with multiple correct answers).
In a clearly-labeled comment, argue why your reachableFrom method will terminate. A good answer will state an invariant on the code and how that contributes to termination.
Your Graph class could (in theory) capture friend relationships in a social network just as well as it captures routes between cities. For a social network graph, however, the content at each node isn’t just a string, but a user profile. The following class defines a simple user profile:
  class Profile {
    String name;
    String college;
  }
1. Use generics to enable you to create and find routes in either social networks or city graphs using the same interface and classes.
2. Create at least one city graph and at least one social network in your Examples class.
3. Provide test cases for hasRoute using each of graphs over cities and social network graphs over profiles.

}

3 Grading Criteria

Clean Code points will be checking for good OO practice, including appropriate encapsulation, appropriate use of public/private designations on fields and methods, avoiding bad style (such as type checks or instanceOf in your code), creating interfaces and abstract classes when appropriate, and proper use of checked versus runtime exceptions (if applicable). We will also check for documentation on your methods and interfaces.

4 What to Turn In

Submit a separate file for each class and interface created for this assignment. You may use either a separate Examples class per problem, or a combined Examples class (your choice).

1	Problem 1: Levenshtein Distance
2	Problem 2: Graphs
3	Grading Criteria
4	What to Turn In