Guidelines for Projects and for Written Reports
Each of the projects in this course deals with one or more specific
machine learning techniques. The guidelines below are intended to help you
structure the experimental work you are expected to do for each project,
as well as your written and oral reports.
An important aspect of both your written and oral reports is the
"story-telling" aspect. Try to tell the story of what experiments
you ran and why, how each experiment shed light on what experiment(s)
to run next, and what you learned from them.
Guidelines for the Experiments
- For each of the datasets in the project:
Before you start running experiments, make sure to understand the raw
data very well, to learn
as much as you can about the domain, and to research approaches used by
others on this dataset to the extent possible.
- Dataset Description:
In your written report, describe the dataset in terms of the attributes
present in the data, the number of instances, missing values, and
other relevant characteristics.
Describe briefly what pre-processing (if any) was used before any
experiments were run.
- For each machine learning technique used in the project:
- Study the corresponding chapters/sections of the textbook
specified in the Course Schedule.
- Algorithms and Code:
State in your report which Weka functions and which R functions
you use in your experiments.
For R, state if these functions are part of an existing package
(if so, say which one), or if you wrote the R code.
Read the Weka code and the R code (if available) and documentation
that implements and describes the technique.
In your written report, describe the algorithm underlying the code
in your own words.
Explain the algorithm in terms of the inputs it receives
and the outputs it produces, AND the main steps it follows to construct
the model, using high-level pseudo-code. Make sure to describe the correspondence between the algorithm
you describe and the parts of the code that implement it
(which are not necessarily the same as the ones described in the textbook or in class).
- Guiding Questions - Objectives of the Machine Learning Experiments:
Before you start running experiments, look at the raw data in detail.
Figure out 3 specific, interesting questions about the dataset domain
(e.g., diabetes, weather, labor contracts, ...)
that you want to answer with your machine learning experiments.
Choose questions that are about the domain (not about the
machine learning method or the experimental parameters!)
that would particularly benefit from using the learning method
under study.
These questions may be phrased as conjectures that you want to
confirm or refute with your experimental results, or as plain questions.
- Performance metric(s):
Use the performance metrics specified in the project description.
If these are not given,
explain what performance metric(s) you will use to evaluate the models
you construct (e.g., accuracy, error rate, size of the model,
readability of the model, ...) and why.
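For concreteness, here is a minimal plain-Python sketch (not part of the required Weka/R workflow; the labels are made up) of how accuracy, error rate, and a confusion matrix are computed from actual vs. predicted labels:

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_matrix(actual, predicted, labels):
    """counts[(a, p)] = number of instances of actual class a predicted as p."""
    counts = Counter(zip(actual, predicted))
    return {(a, p): counts[(a, p)] for a in labels for p in labels}

# Made-up labels, purely for illustration.
actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
acc = accuracy(actual, predicted)   # 3 of 5 correct -> 0.6
err = 1 - acc                       # error rate -> 0.4
cm = confusion_matrix(actual, predicted, ["yes", "no"])
```

Weka reports these same quantities in its classifier output; the point of the sketch is only to make the definitions unambiguous.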
- Preprocessing of the Data:
Keep the initial pre-processing to a minimum.
Apply only the necessary filters to your dataset
before you start mining the data, and then introduce additional pre-processing
as needed based on the results of the experiments you run.
Your report should contain a detailed description of the preprocessing of
your dataset and a justification of the steps you followed.
If Weka and R do not provide the functionality that you need
to preprocess your
data to obtain useful patterns, preprocess the data either
by writing your own scripts and filters, or by using a separate tool.
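As an illustration of such a hand-written filter, the sketch below (plain Python, hypothetical data with "?" marking missing values) replaces missing values in a numeric column with the column mean, similar in spirit to Weka's ReplaceMissingValues filter:

```python
import csv, io

def impute_missing(rows, col):
    """Replace '?' in a numeric column with the mean of the observed values."""
    observed = [float(r[col]) for r in rows if r[col] != "?"]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[col] == "?":
            r[col] = str(mean)
    return rows

# Hypothetical CSV data; in practice you would read from a file.
data = io.StringIO("age\n40\n?\n60\n")
rows = list(csv.DictReader(data))
rows = impute_missing(rows, "age")   # the '?' becomes the mean, 50.0
```

Whatever the filter does, document it in the report so the experiments remain reproducible.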
- Training and Testing Instances:
Use 10-fold cross-validation, unless the data is insufficient,
the combined execution time is prohibitive,
cross-validation is not consistent with the performance metric(s) chosen,
or the particular project description asks you to use a different
experimental protocol. Justify your choice if different from that
requested by the project description.
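To make the protocol concrete, here is a minimal sketch (plain Python, made-up labels, with a trivial majority-class learner standing in for the real one) of how k-fold cross-validation rotates the folds; real runs would normally also shuffle and stratify the data:

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Partition indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(labels, k=10):
    """Estimate accuracy of a majority-class predictor via k-fold CV:
    each fold serves once as the test set, the rest as training data."""
    folds = k_fold_indices(len(labels), k)
    correct = 0
    for test in folds:
        test_set = set(test)
        train = [labels[i] for i in range(len(labels)) if i not in test_set]
        majority = Counter(train).most_common(1)[0][0]   # "train" the model
        correct += sum(labels[i] == majority for i in test)
    return correct / len(labels)

cv_acc = cross_validate(["yes"] * 7 + ["no"] * 3, k=10)   # -> 0.7
```

Weka's Explorer and R's modeling packages perform this rotation for you; the sketch only shows what the reported cross-validated accuracy means.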
- Experiments:
You must run a sufficiently large and coherent set of experiments.
Start with a basic experiment with default parameters (if possible),
and design new experiments varying the settings
(i.e., pre-processing, parameters, and/or post-processing, ideally
varying one setting at a time)
based on the results that you obtain in your experiments.
Each experiment should be motivated by a previous experiment,
and by the guiding questions.
- For each experiment you run, describe:
- Objectives: Which of your 3 specific questions/conjectures
about the dataset domain you aim to answer/validate with
this experiment. Describe also any additional objectives for this
experiment that might have been motivated by your previous
experiments.
- Data: What data did you use to construct and test your model?
- Parameters and Settings:
Describe what parameter values and other settings you used
and why.
- Additional Pre- or Post-Processing:
Any additional pre- or post-processing done to the data or the
model in order to improve the model's performance,
as measured by the performance metric(s) chosen.
- Analysis of the constructed model:
- Describe the constructed model
(e.g., size of the model, readability).
If the model is readable, summarize in your own words what the model
says, focusing on the most interesting/relevant patterns.
Elaborate on whether and how the model answers the objectives of this
experiment.
- State what the performance of the model is, using the performance
metrics provided in the project description. If applicable,
elaborate on the confusion matrix and/or other relevant
performance indicators.
- How long did it take Weka/R to construct this model?
- Compare the performance of this model with that of other
models constructed in this project for this dataset.
- Summary of Results
- Provide insightful observations and comments on the results of
the experiments.
- Use appropriate visualizations and graphs when possible to
convey your observations and results.
- What general observations can you draw regarding the
quality/performance of the models as you varied the
settings (i.e., pre-processing, parameters, post-processing)
of the experiments?
- Which model(s) achieved the highest performance?
Consider both quantitative
(e.g., model accuracy, time taken to build the model, ...) as well as
qualitative (e.g., model size, readability, ...)
evaluation criteria.
- Discuss how this performance compares with that of your
best performing results from previous projects (if any) on the same
dataset, including ZeroR and OneR.
- Include the model (or at least a representative part of the model
if the full model is too large) in your report.
- Elaborate on whether your experiments helped you
answer your 3 guiding questions, and what answers you
obtained for them.
- Discuss how well this particular machine learning method worked
on this dataset. What combination of parameters yielded
particularly good results?
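For reference when comparing against the baselines above: ZeroR always predicts the majority class, and OneR picks the single attribute whose one-attribute rule makes the fewest training errors. A minimal sketch on made-up weather-style data:

```python
from collections import Counter, defaultdict

def zero_r(labels):
    """ZeroR: always predict the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

def one_r(instances, labels):
    """OneR: choose the attribute whose value -> majority-class rule
    misclassifies the fewest training instances."""
    best = None
    for attr in instances[0]:
        by_value = defaultdict(list)
        for inst, lab in zip(instances, labels):
            by_value[inst[attr]].append(lab)
        rule = {v: Counter(labs).most_common(1)[0][0]
                for v, labs in by_value.items()}
        errors = sum(lab != rule[inst[attr]]
                     for inst, lab in zip(instances, labels))
        if best is None or errors < best[1]:
            best = (attr, errors, rule)
    return best  # (attribute, training errors, value -> class rule)

# Hypothetical weather-style data, purely for illustration.
instances = [
    {"outlook": "sunny",    "windy": "yes"},
    {"outlook": "sunny",    "windy": "no"},
    {"outlook": "rainy",    "windy": "yes"},
    {"outlook": "rainy",    "windy": "no"},
    {"outlook": "overcast", "windy": "no"},
]
labels = ["no", "no", "no", "yes", "yes"]
baseline = zero_r(labels)                 # majority class: "no"
attr, errs, rule = one_r(instances, labels)
```

Any real model you report should beat (or at least be compared honestly against) these two trivial baselines, which Weka provides as the ZeroR and OneR classifiers.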
- Overall project conclusion:
Discuss the strengths and the weaknesses of your project.
Structure for Written Reports
Your written report must follow the structure below.
Only the required sections within the given space limits will
be read and graded.