CS 2102 Bterm 09 - Accelerated Project #3

Computer Science Department

CS 2102 B Term 2009
Object-Oriented Design Concepts

Project #3 - The Netflix Challenge Continues

For the third and last project of this accelerated section, you will continue fleshing out your solution to the Netflix Challenge. You will implement a standard solution to the community filtering problem, albeit a simple one, called nearest neighbor filtering. Then you will implement the formula used in the contest to score the quality of the recommendations. It is a measure called root mean square error. The last part of the project will be open-ended. One of the great aspects of the community filtering problem is, no matter how good your recommendations are there always seems to be a way to do better. How much better can you make it?

Throughout this project, keep in mind that Netflix data is rather large. Make sure to use binary search trees whenever you can to avoid searches across long lists.

Part 1

In the official contest, the goal was something slightly different than generating good recommendations. The goal was to accurately predict the ratings the clients will give to movies sometime in the future. If you were given the last six months worth of ratings (say from January to June), how well would you be able to predict the values of the ratings that will be given during the next six-month (from July to December)? One way to measure the quality of the predictions is to wait until December and compare the set of predicted ratings with the set of actual ratings, and then see how much they differ by using the root mean square error formula.

The root mean square error (or root mean squared deviation) is computed as follow. Take x_0,i to be the ratings given as an input, x_1,i to be the predicted ratings (computed using the x_0,i's), and x_2,i to be future ratings we are measuring against. The index i ranges over the ratings ids, from 1 to n, where n is the number of ratings. Then, the formula is

In other words, to compute the root mean square error, take the square root of the mean of the squares of the differences between the predicted ratings and the future ratings.

Write a function called RMSE() that computes the root mean square error of a list of predicted ratings, given a list of future ratings.

Part 2

The Netflix data comes as a large list of ratings. In order to be able to use last week's code to predict ratings on that data, you need to transform that flat representation into the explicitly cyclic data structure.

To load the Netflix data you should use the NetflixIO class I posted in the discussion board for Project #3. The NetflixIO class has one public constructor and one public field:

new NetflixIO(String directoryName)
Construct a new NetflixIO object by loading the ratings from the collection of files in the given directory.
NetflixIO.data
A field of type LoLoSR (which stands for list of list of stored ratings.) Each sub list contains the ratings for a different movie, represented using the StoredRating class.

The constructor will load all the files with a name of the form mv_0000001.txt, where the number is the id number of the movie. It also looks for a file called movie_titles.txt, from which it loads the movie titles. The whole Netflix data set contains a lot of ratings. At the beginning, you will have only a few data files to work with. Once you are confident your code is working, e-mail me and I will post some extra files on the course bulletin board.

The support code uses the classes MtLoR, ConsLoR, and the interface LoSR to represent list of ratings, and it uses MtLoLoR, ConsLoLoR, and the interface LoLoSR to represent the list of lists. You will need to modify these classes in order to process the result of loading the files.

I will not provide you with the source code (.java file) for NetflixIO, because I do not want you to concern yourself with its implementation. Instead, you will obtain it as a compiled .jar file. Install this file into your project the same way you installed tester library (project/properties/java build path/add external jar.)

Write a function called constructCyclicData() that consumes a LoLoRS as an input and returns a list of Movies objects that are fully populated with the ratings, as an explicitly cyclic data structure.
Reuse last week's code to implement the following two methods.

Movie.predictUsingAverage()
Given a movie, find all people who have rated that movie, and return the average rating.

LoLoSR.predictAllUsingAverages(ListOfMovie movies)
Given a list of lists of stored ratings, loaded from the set of the future ratings, and a list of Movie objects, loaded from the set of past ratings and constructed with the constructCyclicData() method, returns predictions for all the future ratings. Of course, you should not use the future rating value when making your prediction, only the past one. The contest challenge is, after all, to predict these future ratings as if they had not happened yet. You should only use the future ratings to know which prediction to make.
Load the two data sets, the past and future one, then evaluate the quality of your recommendations using the RMSE() method you wrote in Part 1.

Part 3

The recommendation algorithm you coded last week is decent, but surely it can be improved. The root mean square score you received from Part 2 should give you an idea of how much space there is for improvement. Netflix's original algorithm scored a RMSE of 0.9525. The winner of the official contest scored 0.8553.

Design and implement an improvement on the simple recommendation algorithm from Part 2. Your program should improve on the RMSE score of last week's program by some small amount. Don't worry about making large improvements; improving the score by small amounts is already quite hard, as the contest demonstrated.
Submit a short report describing your approach. State the original RMSE and the RMSE after your changes. Discuss the changes you made to last week's algorithm, and explain why it is better at making predictions.

One avenue for improvement is to find people of similar taste. Specifically, when trying to make a prediction for a particular person, begin by gathering a small set of people who have similar tastes as the given person, then use their rating to make the prediction.

Another way of saying the same thing is, to make a prediction for the rating a person p will give on movie m, find all the people q_i who have already rated movie m. From all the q_i, select the k people who are the most similar to p, then use their ratings about movie m to make a prediction for p.

There are many ways to measure whether two people have similar tastes of movies, such as how many movies they have in common, and how often they have rated these common movies in the same way.

If you decide to code this improvement, you will be coding the classic kNN solution to the collaborative filtering problem. The name is short for k nearest neighbors. And, like all classic knn implementation, you get to choose k, you get to choose how you want to measure similarity, and you get to choose how you want to make the prediction from the q_i's you have selected, either with an average or with something else. So, if you decide to code kNN, document those choices in your report.

Once you are done

Once you are done, make a zip file containing all of your project's file and submit it. To make the zip, in Eclipse, select File, choose Export. Expand the General folder and choose Archive File. Eclipse will ask you for the file name of the zip file.

Turnin is at http://turnin.cs.wpi.edu:8087/servlets/turnin.ss

The deadline is Tuesday, November 17, 2009 at 11:59 PM.

[CS2102]