CS 525D Spring 2004

Computer Science Department

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING
PROJECT 6 - Web Mining. Spring 2004

PROF. CAROLINA RUIZ

Due Date: April 20th 2004 at 3:00 pm

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

The purpose of this project is to find patterns in web access data, with the goal of predicting which pages of a website a typical user will visit based on the other pages of the same website that the person has visited. For this project, the Microsoft Anonymous Web Data will be used. This dataset is available at the UCI KDD Repository.

PROJECT ASSIGNMENT

The precise prediction task is described in the "classification/collaborative filtering task" that accompanies the dataset. For this prediction task, you are allowed to employ any of the data mining techniques that we have studied during the semester, or (better yet!) a combination of them. As usual, the more ideas you explore and the more robust your experimentation is, the better your grade on the project will be.
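To make the flavor of the prediction task concrete, here is a minimal sketch of a co-occurrence baseline: given the pages a user is known to have visited, it ranks the remaining pages by how often they appear together with the known ones in the training data. This is only an illustration, not the method of Breese, Heckerman, and Kadie and not a required approach. The function name rank_pages, the toy sessions, and the page IDs are all hypothetical.

```python
from collections import Counter


def rank_pages(train_sessions, observed, top_n=3):
    """Rank unseen pages by how often they co-occur with the observed ones.

    train_sessions: iterable of sets of page IDs, one set per user (assumed layout).
    observed: page IDs already known to be visited by the test user.
    """
    observed = set(observed)
    scores = Counter()
    for visited in train_sessions:
        overlap = len(observed & visited)
        if overlap == 0:
            continue  # this training user shares no page with the test user
        for page in visited - observed:
            scores[page] += overlap
    # Sort by descending score, then by page ID so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:top_n]]


# Toy data, invented for illustration only.
train = [{1000, 1001, 1002}, {1000, 1002}, {1001, 1003}, {1000, 1002, 1003}]
print(rank_pages(train, observed={1000}))  # prints [1002, 1001, 1003]
```

A baseline of this kind is mainly useful as a yardstick: any technique studied during the semester should be compared against something this simple before its results are called meaningful.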

The dataset is also accompanied by references to Breese, Heckerman, and Kadie's work on this dataset, and you're encouraged to read their paper and/or the Microsoft Technical Report that is available on the dataset's webpage.

Students are free to work individually on this project or in groups of two. If you decide to work with another student in the class on this project, please let me know by email by Friday, April 16th (midnight).

The following are guidelines for the analysis of the data:

Code: You can use the data mining methods implemented in the Weka system. You are encouraged to run experiments with several of the methods (or combinations of them) that we studied in class and that are available in Weka and to compare the results from them.
Data Instances:
You may restrict your experiments to a subset of the dataset IF Weka cannot handle your whole dataset (this is unlikely though).
As usual, a main part of the project is the PREPROCESSING of the dataset. You should consider applying relevant concept hierarchies and generalizations (e.g. using the results of previous mining tasks) to your dataset. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data in order to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).
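One preprocessing step you may need to write yourself is converting the raw data file into a format Weka can load. The sketch below assumes the file uses comma-separated records tagged A (page attribute), C (case, i.e. user) and V (visit), as described in the dataset's documentation; check the documentation before relying on that layout. The function names parse_msweb and to_arff, the page_<id> attribute naming, and the sample lines are hypothetical choices made for illustration.

```python
import csv


def parse_msweb(lines):
    """Parse A/C/V records into (pages, sessions).

    Assumed record layout:
      A,<page id>,<ignored>,<title>,<url>
      C,<quoted case id>,<case id>
      V,<page id>,<ignored>   (belongs to the most recent C record)
    """
    pages = {}      # page id -> title
    sessions = {}   # case id -> set of visited page ids
    current = None
    for row in csv.reader(lines):
        if not row:
            continue
        tag = row[0]
        if tag == "A":
            pages[int(row[1])] = row[3]
        elif tag == "C":
            current = int(row[2])
            sessions[current] = set()
        elif tag == "V" and current is not None:
            sessions[current].add(int(row[1]))
    return pages, sessions


def to_arff(pages, sessions, relation="msweb"):
    """Emit ARFF text with one binary {0,1} attribute per page."""
    ids = sorted(pages)
    out = [f"@relation {relation}", ""]
    for pid in ids:
        out.append(f"@attribute page_{pid} {{0,1}}")
    out += ["", "@data"]
    for case in sorted(sessions):
        visited = sessions[case]
        out.append(",".join("1" if pid in visited else "0" for pid in ids))
    return "\n".join(out)


# Illustrative sample lines in the assumed layout.
sample = '''A,1287,1,"International AutoRoute","/autoroute"
A,1288,1,"library","/library"
C,"10001",10001
V,1287,1
C,"10002",10002
V,1287,1
V,1288,1
'''.splitlines()

pages, sessions = parse_msweb(sample)
print(to_arff(pages, sessions))
```

A dense binary matrix is the simplest representation, but most users visit only a few pages, so Weka's sparse ARFF format may be a better fit for the full dataset.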

REPORT AND DUE DATE

Written Report. Your written report is due at 3:00 pm. Please hand it in at the beginning of class. Your report should contain the following sections with the corresponding discussions:
1. Code Description: Explain in some detail the algorithms underlying each of the data mining methods used. Describe each method in terms of the input it receives and the output it produces, and the steps it follows to produce this output.
2. Data: Describe the dataset that you analyzed in terms of the attributes present in the data, the number of instances, missing values, and other relevant characteristics.
  Provide a detailed description of the preprocessing of your data. Justify the preprocessing you applied and explain why the resulting data is appropriate for mining.
3. Experiments: For each experiment you ran describe:
  - Data used for the experiment.
  - Results and DETAILED analysis of the results.
4. Summary of Results
  - What were the most meaningful results you obtained?
  - Strengths and weaknesses of your system.
Oral Report. We will discuss the results from the individual projects during the class on April 20th 2004. Be ready to show your results and to discuss your project in 7 minutes during class. PREPARE SLIDES SHOWING YOUR WORK.
