For each of the datasets in the project:
In your written report, describe the dataset in terms of the attributes
present in the data, the number of instances, missing values, and
other relevant characteristics.
- For each machine learning technique used in the project:
[100 points (+5 extra points) per machine learning technique, per dataset]
- Study the corresponding chapters/sections of the textbook
specified in the Course Schedule.
- [15 points] Algorithms and Code:
Read the Weka code that implements the technique.
In your written report, describe the algorithm underlying the code
IN YOUR OWN WORDS.
Explain the algorithm in terms of the inputs it receives
and the outputs it produces, AND the main steps it follows to construct
the model. Make sure to describe the correspondence between the algorithm
you describe and the parts of the code that implement it.
- [5 points] Objectives of the Data Mining Experiments:
Before you start running experiments, look at the raw data in detail.
Figure out 3 to 5 specific, interesting questions about the domain
that you want to answer with your machine learning experiments.
Try to choose questions that are about the domain (not about the
machine learning method or the experimental parameters!)
that would particularly benefit from using the machine learning method
under study.
These questions may be phrased as conjectures that you want to
confirm/refute with your experimental results, or as plain questions.
- [2 points] Performance metric(s):
Explain what performance metric(s) will be used to evaluate the models
you construct (e.g., accuracy, error rate, ...) and why.
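As a reference point, accuracy and error rate can be computed directly from predicted versus actual class labels. A minimal sketch in pure Python, using hypothetical labels (Weka reports these same figures in its evaluation output):

```python
# Minimal sketch: accuracy and error rate from predicted vs. actual
# class labels. The label lists below are hypothetical.
def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes"]

acc = accuracy(actual, predicted)
err = 1 - acc  # error rate is the complement of accuracy
print(f"accuracy = {acc:.2f}, error rate = {err:.2f}")  # accuracy = 0.80, error rate = 0.20
```

Whatever metric you choose, state it before running experiments so that all models for a dataset are compared on the same footing.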
- [10 points] Preprocessing of the Data:
You should apply relevant filters to your dataset as needed
before doing the mining and/or using the results of previous mining tasks.
For instance, you may decide to remove apparently irrelevant attributes,
replace missing values if any, discretize attributes in a different way, etc.
Your report should contain a detailed description of the preprocessing of
your dataset and a justification of the steps you followed.
If Weka does not provide the functionality that you need to preprocess your
data to obtain useful patterns, preprocess the data yourself either
by writing the necessary filters (you can incorporate them in Weka if you
wish), or by using a separate tool.
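To make the kind of filtering involved concrete, here is a sketch of two common preprocessing steps, mean replacement of missing values and equal-width discretization, written in pure Python over a hypothetical numeric attribute (Weka's ReplaceMissingValues and Discretize filters provide these and more):

```python
# Sketch of two common preprocessing filters, assuming a numeric
# attribute stored as a list with None marking missing values.
# The column below is hypothetical.

def replace_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def discretize_equal_width(values, bins=3):
    """Map each value to a bin index using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), bins - 1) for v in values]

col = [1.0, None, 3.0, 5.0]
filled = replace_missing_with_mean(col)   # [1.0, 3.0, 3.0, 5.0]
binned = discretize_equal_width(filled)   # [0, 1, 1, 2]
```

Whichever tool performs the filtering, record the exact parameters (bin counts, replacement strategy, removed attributes) in your report so the preprocessing is reproducible.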
- [3 points] Training and Testing Instances:
Use 10-fold cross-validation, unless the data is insufficient,
the combined execution time is prohibitive,
cross-validation is not consistent with the performance metric(s) chosen,
or the particular project description asks you to use a different
experimental protocol. Justify your choice if different from that
requested by the project description.
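For intuition about what 10-fold cross-validation does, the procedure can be sketched in pure Python: shuffle the data, partition it into k folds, and average the score obtained by training on k-1 folds and testing on the held-out one. The `evaluate` function here is a hypothetical stand-in for building and scoring a model; Weka performs all of this internally when you select "Cross-validation":

```python
# Sketch of k-fold cross-validation (k=10 by default), assuming the
# dataset is a list of instances and `evaluate` is a caller-supplied,
# hypothetical function that builds a model on `train` and returns
# its score on `test`.
import random

def cross_validate(instances, evaluate, k=10, seed=1):
    """Shuffle, split into k folds, and average the per-fold scores."""
    data = list(instances)
    random.Random(seed).shuffle(data)          # fixed seed for reproducibility
    folds = [data[i::k] for i in range(k)]     # k roughly equal partitions
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(train, test))
    return sum(scores) / k
```

Note that every instance is used for testing exactly once, which is why the averaged estimate is less sensitive to a lucky or unlucky single split.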
- [60 points] Experiments:
- For each experiment you ran, describe:
- Objectives: Which of your 3-5 specific questions/conjectures
about the dataset domain you aim to answer/validate with
this experiment. Describe also any additional objectives for this
experiment that might have been motivated by your previous
experiments.
- Data: What data did you use to construct and test your model?
- Parameters and Settings:
Describe what parameter values and other settings you used
and why.
- Additional Pre- or Post-Processing:
Describe any additional pre- or post-processing applied to the data
or the model in order to improve the model's performance,
as measured by the performance metric(s) chosen.
[Parts 1-4 combined: 10 points]
- Resulting model:
Describe the resulting model (e.g., size of the model, readability).
If the model is readable summarize in your own words what the model
says, focusing on the most interesting/relevant patterns.
Elaborate on whether and how the model answers the objectives of this
experiment.
- Performance of the resulting model:
- State what the performance of the model is. If applicable,
elaborate on the confusion matrix and/or other relevant
performance indicators.
- How long did it take Weka to construct this model?
- Compare the performance of this model with that of other
models constructed in this project for this dataset.
[Parts 5-6 combined: 10 points for presentation of results and
30 points for in-depth discussion of results]
[10 points for sufficient and coherent set of experiments]
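When elaborating on the confusion matrix, it helps to show how the standard indicators are derived from its cells. A minimal sketch for a two-class matrix, using hypothetical counts:

```python
# Deriving indicators from a 2x2 confusion matrix (hypothetical counts).
# Rows are actual classes, columns are predicted classes:
#               predicted yes   predicted no
# actual yes         tp              fn
# actual no          fp              tn
tp, fn, fp, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fn + fp + tn)   # 0.85
precision = tp / (tp + fp)                    # ~0.89: how trustworthy a "yes" prediction is
recall    = tp / (tp + fn)                    # 0.80: how many actual "yes" instances are found
print(accuracy, precision, recall)
```

Discussing precision and recall alongside accuracy is especially useful when the class distribution is skewed, since a model can reach high accuracy by always predicting the majority class.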
- [10 points] Summary of Results
- Which model(s) achieved the highest performance?
- Discuss how this performance compares with that of your
best performing results from previous projects on the same
dataset.
- Elaborate on whether or not your experiments helped you
answer your initial 3-5 objectives, and what answers you
obtained for these guiding questions.
- Discuss how well this particular machine learning method worked
on this dataset. What combination of parameters yielded
particularly good results?
- Overall project conclusion:
Discuss the strengths and the weaknesses of your project.