1 Problem 1: Levenshtein Distance
1.1 Understand the Algorithm
1.2 Implement the Algorithm
1.3 Testing
1.4 Optimize the Performance with a Hash Map
2 Problem 2: Graphs
3 Grading Criteria
4 What to Turn In

Homework 5: Maps, Memoization and Graphs

1 Problem 1: Levenshtein Distance

Have you ever used a software application (such as a text-messaging tool) that tries to automatically correct or complete your typing? For example, if I typed "Jaca" in a smartphone, the editor might list "Java" as an alternative. Such applications find alternatives using a measure of the "closeness" of two words. One standard metric for closeness is called Levenshtein distance. For this problem, you will write a function to compute the Levenshtein distance between two strings.

Levenshtein distance computes the minimum number of operations needed to convert one string into another. Valid operations here include inserting a character, deleting a character or substituting one single character with another. For example, we could convert "baking" to "masking" by (1) changing "b" to "m" and (2) inserting an "s". (Other sequences of operations could also do the conversion–such as deleting all the letters of one word and adding all the letters of the second–but our goal is to compute the fewest number of operations required.)

Levenshtein distance can be computed with a straightforward recursive algorithm that works on prefixes of the initial strings. Let’s illustrate first with some examples (we’re calling the method LevDist here):

Summarizing, the core algorithm is as follows (where str-last means the substring of str that drops the last character, and lastChar(str) is the last character in str):

updated: Thurs Dec 1, 7pm

  int LevDist(str1,str2) {

    if str1 == str2

      return 0

    else if str1 == ""

      return length(str2)

    else if str2 == ""

      return length(str1)

    else if (lastChar(str1) == lastChar(str2))

      return LevDist(str1-last, str2-last)

    else

      return 1 + min (LevDist(str1-last, str2),

                      LevDist(str1, str2-last),

                      LevDist(str1-last, str2-last))

  }

Note that here we are interested only in the number of operations, NOT the sequence of operations to perform.

1.1 Understand the Algorithm

Before you go further, make sure you understand how the algorithm works. Work out a couple of smaller (strings fewer than 4 characters) examples by hand.

1.2 Implement the Algorithm

First, import java.lang.Math into your file.

Create a class called WordFinder. This class should implement a public method called LevDist as described above. This method will take two Strings as inputs and return an integer.

In writing this, you may want to use the following methods on Strings. These methods work with indices into a string: the first character in a string is at index 0, and the last character is at the index of one less than the length of the string.

Your implementation should follow the recursive structure shown above. Your code should not iterate over the indices into the string (a warning mainly for those of you with prior Java experience). You should not convert the strings to another data structure, or add any other data structure for this basic implementation.

Warning: Descriptions of the Levenshtein algorithm online are likely to confuse you more than help you (they are largely based on auxiliary data structures and iteration over indices). If you are confused about the algorithm, we strongly recommend the discussion board, office hours, or discussions with classmates.

1.3 Testing

Write a set of test cases for LevDist (in your Examples class). We will be checking for whether you are testing the various paths through the code, as well as interesting or potentially challenging cases. Diversity in the kinds of inputs you test is more important than the sheer number of tests (in other words, we want quality over quantity).

1.4 Optimize the Performance with a HashMap

The naive version of the algorithm you have implemented can be expensive to run on longer strings, because it might do the same computation multiple times. Each time you call LevDist on a pair of non-empty strings, it spawns three more calls. These in turn can each spawn three more calls. If you write out an example, you can find multiple calls to LevDist with the same inputs. Since LevDist always returns the same answer on the same inputs, these additional calls are wasteful.

To improve the performance, we would ideally like to remember the inputs and corresponding answers from previous calls to the LevDist method. LevDist can then check to see whether the answer on its inputs has been computed before: if it has, it should just return the previously-computed answer; if not, it can compute the answer and store it for future use.

A HashMap is a good data structure for remembering the previous inputs and outputs: the keys are a data structure containing the inputs; the corresponding result of LevDist is the value stored for that key. Storing previously-computed answers to a function is called memoization. It can be a useful technique for improving performance of tree-shaped computations in which many paths lead to the same core calculations. For this problem, you will memoize LevDist.

Concretely, you should:

2 Problem 2: Graphs

In class, we looked at graphs with cities as the core content of the nodes, and we wrote a hasRoute method to determine whether there exists a route from one city to another. For this assignment, we want a different twist on a similar problem: we’d like to gather a list of all the cities that one can get to from a given starting city. In other words, we want to add a method reachableFrom to graphs, as captured by the following (partial) interface:

interface IGraph { ...

  // produce a list of all nodes reachable from given starting node
  LinkedList<Node> reachableFrom (Node from);
}

For the graph example we’ve been using in lecture, the list of nodes reachable from the node for Providence would contain both Providence and Hartford. The list of nodes reachable from the node for Hartford would just be Hartford.

Concretely, solve the following problems, starting from the initial graph implementation from Thursday’s lecture:

  1. Add a reachableFrom method to the Graph class.

  2. Include a good set of tests for reachableFrom. Ideally, your tests should not presume a specific order for nodes in the output of reachableFrom. (Good tests that assume a specific order will earn points towards test cases; tests that are independent of the order of output can earn points towards testing methods with multiple correct answers).

  3. In a clearly-labeled comment, argue why your reachableFrom method will terminate. A good answer will state an invariant on the code and how that contributes to termination.

  4. Your Graph class could (in theory) capture friend relationships in a social network just as well as it captures routes between cities. For a social network graph, however, the content at each node isn’t just a string, but a user profile. The following class defines a simple user profile:

      class Profile {

        String name;

        String college;

      }

    1. Use generics to enable you to create and find routes in either social networks or city graphs using the same interface and classes.

    2. Create at least one city graph and at least one social network in your Examples class.

    3. Provide test cases for hasRoute using each of graphs over cities and social network graphs over profiles.

}

3 Grading Criteria

Clean Code points will be checking for good OO practice, including appropriate encapsulation, appropriate use of public/private designations on fields and methods, avoiding bad style (such as type checks or instanceOf in your code), creating interfaces and abstract classes when appropriate, and proper use of checked versus runtime exceptions (if applicable). We will also check for documentation on your methods and interfaces.

4 What to Turn In

Submit a separate file for each class and interface created for this assignment. You may use either a separate Examples class per problem, or a combined Examples class (your choice).