![]()

Objectives | Options | Staff&Contact Information | Where&When | Textbook | Grading | Policies | Schedule&Assignments | Accelerated Track
Project #3 - The Netflix Challenge Continues
For the third and last project of this accelerated section, you will continue fleshing out your solution to the Netflix Challenge. You will implement a standard solution to the community filtering problem, albeit a simple one, called nearest neighbor filtering. Then you will implement the formula used in the contest to score the quality of the recommendations. It is a measure called root mean square error. The last part of the project will be open-ended. One of the great aspects of the community filtering problem is, no matter how good your recommendations are there always seems to be a way to do better. How much better can you make it?
Throughout this project, keep in mind that Netflix data is rather large. Make sure to use binary search trees whenever you can to avoid searches across long lists.
In the official contest, the goal was something slightly different than generating good recommendations. The goal was to accurately predict the ratings the clients will give to movies sometime in the future. If you were given the last six months worth of ratings (say from January to June), how well would you be able to predict the values of the ratings that will be given during the next six-month (from July to December)? One way to measure the quality of the predictions is to wait until December and compare the set of predicted ratings with the set of actual ratings, and then see how much they differ by using the root mean square error formula.
The root mean square error (or root mean squared deviation) is computed as follow. Take x0,i to be the ratings given as an input, x1,i to be the predicted ratings (computed using the x0,i's), and x2,i to be future ratings we are measuring against. The index i ranges over the ratings ids, from 1 to n, where n is the number of ratings. Then, the formula is

In other words, to compute the root mean square error, take the square root of the mean of the squares of the differences between the predicted ratings and the future ratings.
The Netflix data comes as a large list of ratings. In order to be able to use last week's code to predict ratings on that data, you need to transform that flat representation into the explicitly cyclic data structure.
To load the Netflix data you should use the NetflixIO class I posted in the discussion board for Project #3. The NetflixIO class has one public constructor and one public field:
The constructor will load all the files with a name of the form mv_0000001.txt, where the number is the id number of the movie. It also looks for a file called movie_titles.txt, from which it loads the movie titles. The whole Netflix data set contains a lot of ratings. At the beginning, you will have only a few data files to work with. Once you are confident your code is working, e-mail me and I will post some extra files on the course bulletin board.
The support code uses the classes MtLoR, ConsLoR, and the interface LoSR to represent list of ratings, and it uses MtLoLoR, ConsLoLoR, and the interface LoLoSR to represent the list of lists. You will need to modify these classes in order to process the result of loading the files.
I will not provide you with the source code (.java file) for NetflixIO, because I do not want you to concern yourself with its implementation. Instead, you will obtain it as a compiled .jar file. Install this file into your project the same way you installed tester library (project/properties/java build path/add external jar.)
The recommendation algorithm you coded last week is decent, but surely it can be improved. The root mean square score you received from Part 2 should give you an idea of how much space there is for improvement. Netflix's original algorithm scored a RMSE of 0.9525. The winner of the official contest scored 0.8553.
One avenue for improvement is to find people of similar taste. Specifically, when trying to make a prediction for a particular person, begin by gathering a small set of people who have similar tastes as the given person, then use their rating to make the prediction.
Another way of saying the same thing is, to make a prediction for the rating a person p will give on movie m, find all the people qi who have already rated movie m. From all the qi, select the k people who are the most similar to p, then use their ratings about movie m to make a prediction for p.
There are many ways to measure whether two people have similar tastes of movies, such as how many movies they have in common, and how often they have rated these common movies in the same way.
If you decide to code this improvement, you will be coding the classic kNN solution to the collaborative filtering problem. The name is short for k nearest neighbors. And, like all classic knn implementation, you get to choose k, you get to choose how you want to measure similarity, and you get to choose how you want to make the prediction from the qi's you have selected, either with an average or with something else. So, if you decide to code kNN, document those choices in your report.
Once you are done, make a zip file containing all of your project's file and submit it. To make the zip, in Eclipse, select File, choose Export. Expand the General folder and choose Archive File. Eclipse will ask you for the file name of the zip file.
Turnin is at http://turnin.cs.wpi.edu:8087/servlets/turnin.ss
The deadline is Tuesday, November 17, 2009 at 11:59 PM.
