CS 548 Fall 2018 - Project Guidelines

Computer Science Department

CS548 Knowledge Discovery and Data Mining
Project Guidelines - Fall 2018

PROF. CAROLINA RUIZ

WARNING: Changes to these guidelines may be made during the course of the semester.

Guidelines for Projects and for Written Reports
Guidelines for Oral Reports and Slides
Submission and Due Dates
Grading Criteria

Guidelines for Projects and for Written Reports

Each of the projects in this course deals with one or more specific data mining techniques. The guidelines below are intended to help you structure the experimental work you are expected to do for each project, as well as your written and oral reports.

An important aspect of both your written and oral reports is the "story-telling" aspect. Try to tell the story of what experiments you ran and why, how each experiment shed light on what experiment(s) to run next, and what you learned from them.

Guidelines for the Experiments

For each of the datasets in the project:

Before you start running experiments, make sure to understand the raw data very well, to learn as much as you can about the domain, and to research approaches used by others on this dataset to the extent possible.

Dataset Description:
In your written report, describe the dataset in terms of the attributes present in the data, the number of instances, whether there are missing values, the distribution of the target attribute, and other relevant characteristics. Briefly describe what pre-processing (if any) was applied before any experiments were run.
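As a starting point for the dataset description, the characteristics listed above can be collected with a few lines of Python. The following is only a minimal sketch using the standard library: the toy weather data, the `describe_dataset` helper, and the use of "?" as the missing-value marker are assumptions made for illustration, not part of any project dataset or required tooling.

```python
import csv
import io
from collections import Counter

# Hypothetical toy data standing in for a project dataset.
# "?" is assumed here to mark a missing value.
SAMPLE_CSV = """outlook,temperature,humidity,play
sunny,85,85,no
overcast,83,?,yes
rainy,70,96,yes
rainy,?,80,yes
sunny,72,95,no
"""


def describe_dataset(csv_text, target, missing_token="?"):
    """Summarize instance count, attributes, missing values, and target distribution."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    attributes = list(rows[0].keys()) if rows else []
    # Count, per attribute, how many instances carry the missing-value marker.
    missing = {a: sum(1 for r in rows if r[a] == missing_token) for a in attributes}
    # Distribution of the target attribute over all instances.
    distribution = Counter(r[target] for r in rows)
    return {
        "n_instances": len(rows),
        "attributes": attributes,
        "missing": missing,
        "target_distribution": dict(distribution),
    }


summary = describe_dataset(SAMPLE_CSV, target="play")
for key, value in summary.items():
    print(f"{key}: {value}")
```

A summary like this is only the raw material for the report; the written description should still explain what the attributes mean in the domain.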
For each data mining technique used in the project:
1. Study the corresponding chapters/sections of the textbook specified in the Course Schedule.
2. Algorithms and Code: State in your report which Weka functions and which Python functions you use in your experiments. For Python, state whether these functions are part of an existing package (if so, say which one) or whether you wrote the Python code yourself. Read the Weka code and the Python code (if available), as well as the documentation that describes the technique. In your written report, describe the algorithm underlying the code in your own words. Explain the algorithm in terms of the inputs it receives and the outputs it produces, AND the main steps it follows to construct the model, using high-level pseudo-code. Make sure to describe the correspondence between the algorithm you describe and the parts of the code that implement it (which are not necessarily the same as the ones described in the textbook or in class). Note: It is not sufficient to describe in your report the data mining method in general as presented in class and/or a textbook. You need to show that you have read and understood the code you are using, and describe it in detail in the report.
3. Guiding Questions - Objectives of the Data Mining Experiments: Before you start running experiments, look at the raw data in detail. Figure out 3 specific, interesting questions about the dataset domain (e.g., diabetes, weather, labor contracts, ...) that you want to answer with your data mining experiments. Choose questions that are about the domain (not about the data mining method or the experimental parameters!) and that would particularly benefit from using the learning method under study. These questions may be phrased as conjectures that you want to confirm or refute with your experimental results, or as plain questions.
4. Performance metric(s): Use the performance metrics specified in the project description. If these are not given, explain what performance metric(s) you will use to evaluate the models you construct (e.g., accuracy, error rate, size of the model, readability of the model, ...) and why. Do not focus only on numeric metrics of goodness. Analyze also the readability of the models constructed, and the significance of the mined patterns in the application domain.
5. Preprocessing of the Data: Try to keep the initial pre-processing to a minimum at first. You should apply only necessary filters to your dataset before you start mining the data, and then introduce additional pre-processing as needed based on the results of experiments you run. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka and Python do not provide the functionality that you need to preprocess your data to obtain useful patterns, preprocess the data either by writing your own scripts and filters, or by using a separate tool.
6. Training and Testing Instances: Use 10-fold cross-validation, unless the data is insufficient, the combined execution time is prohibitive, cross-validation is not consistent with the performance metric(s) chosen, or the particular project description asks you to use a different experimental protocol. Justify your choice if it differs from the one requested by the project description.
7. Experiments:
  You must run a sufficiently large and coherent set of experiments. Start with a basic experiment with default parameters (if possible), and design new experiments varying the settings (i.e., pre-processing, parameters, and/or post-processing, ideally varying one setting at a time) based on the results that you obtain in your experiments. Each experiment should be motivated by a previous experiment, and by the guiding questions.
  Also, unless otherwise stated, you need to do each aspect of the project both in Weka and in Python (separately), with more emphasis on Python than on Weka. This includes k-fold cross-validation and every other part of the project. Functionality needed for the project that is not readily available in any Python package needs to be implemented in Python by you.
  - For each experiment you ran, describe:
    1. Objectives: Which of your 3 specific questions/conjectures about the dataset domain you aim to answer/validate with this experiment. Describe also any additional objectives for this experiment that might have been motivated by your previous experiments.
    2. Data: What data did you use to construct and test your model?
    3. Parameters and Settings: Describe what parameter values and other settings you used and why.
    4. Additional Pre or Post Processing: Any additional pre or post processing done to the data or the model in order to improve the model's performance, as measured by the performance metric(s) chosen.
    5. Analysis of the constructed model:
      - Describe the constructed model (e.g., size of the model, readability). If the model is readable, summarize in your own words what the model says, focusing on the most interesting/relevant patterns. Elaborate on whether and how the model answers the objectives of this experiment.
      - State what the performance of the model is, using the performance metrics provided in the project description. If applicable, elaborate on the confusion matrix and/or other relevant performance indicators.
      - How long did it take Weka/Python to construct this model?
      - Compare the performance of this model with that of other models constructed in this project for this dataset.
8. Summary of Results
  - Provide insightful observations and comments on the results of the experiments.
  - Use appropriate visualizations and graphs when possible to convey your observations and results.
  - What general observations can you draw regarding the quality/performance of the models as you varied the settings (i.e., pre-processing, parameters, post-processing) of the experiments?
  - What was/were the model/models with the highest performance? Consider both quantitative (e.g., model accuracy, time taken to build the model, ...) and qualitative (e.g., model size, readability, ...) evaluation criteria.
    - Discuss how this performance compares with that of your best performing results from previous projects (if any) on the same dataset, including ZeroR and OneR.
    - Include the model (or at least a representative part of the model if the full model is too large) in your report.
  - Elaborate on whether or not your experiments helped you answer your initial 3 objectives, and what answers you obtained for these guiding questions.
  - Discuss how well this particular data mining method worked on this dataset. What combination of parameters yielded particularly good results?
  - Overall project conclusion: Discuss the strengths and the weaknesses of your project.
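The experimental protocol described in the items above (10-fold cross-validation, a confusion matrix, model construction time, and a ZeroR baseline for comparison) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Weka implementation or a required design: the helper names (`k_fold_indices`, `zero_r_fit`, `cross_validate_zero_r`) and the 70/30 label list are hypothetical, and the folds are produced by a simple shuffle without stratification.

```python
import random
import time
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def zero_r_fit(labels):
    """ZeroR baseline: ignore all attributes and predict the majority class."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate_zero_r(labels, k=10, seed=0):
    """Run k-fold cross-validation for ZeroR.

    Returns (accuracy, confusion counts keyed by (actual, predicted), seconds).
    """
    folds = k_fold_indices(len(labels), k, seed)
    confusion = Counter()
    start = time.perf_counter()
    for test_fold in folds:
        held_out = set(test_fold)
        # Train only on the instances outside the current test fold.
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = zero_r_fit(train_labels)
        for i in test_fold:
            confusion[(labels[i], prediction)] += 1
    elapsed = time.perf_counter() - start
    correct = sum(c for (actual, predicted), c in confusion.items() if actual == predicted)
    return correct / len(labels), confusion, elapsed


# Hypothetical target column: 70 "no" instances and 30 "yes" instances.
labels = ["no"] * 70 + ["yes"] * 30
accuracy, confusion, elapsed = cross_validate_zero_r(labels, k=10)
print(f"accuracy={accuracy:.2f}, time={elapsed:.4f}s")
for (actual, predicted), count in sorted(confusion.items()):
    print(f"actual={actual} predicted={predicted}: {count}")
```

The same loop structure applies to any other learner: replace `zero_r_fit` with the model construction step and predict each held-out instance from its attributes. Keeping the fold assignment fixed (same `seed`) across experiments makes comparisons between models on the same dataset easier to interpret.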

Structure for Written Reports

A project report template will be provided for each project on the project's webpage. You must use the template provided for your written report. Only the required sections within the given space limits will be read and graded.
The font size must be at least 11pts.
Your written report (including all graphs, figures, and appendices) must fit within the space limits specified in the template. Exceeding page limits will lower your project grade.
You are expected to run a large number of experiments so that you can become very familiar with the data mining technique and with how it performs on the dataset, and so that you can draw general conclusions to include in your summary of results. But due to page constraints, you should include in your report's tables only the most relevant/salient experiments.
The entirety of your written report must be your own work, written in your own words. Any plagiarism or copying will be penalized and reported in accordance with the WPI Academic Honesty Policy.

Guidelines for Oral Reports and Slides

We will discuss the results of each project in class. Your oral report should summarize the most important parts of your written report and should elaborate only on the most significant or unique parts of your work. Each group will have about 5 minutes to present their project in class. Try to summarize results using tables, visualizations, and graphical depictions when possible. Once again, an important aspect of both your written and oral reports is the "story-telling" aspect. Try to tell the story of what experiments you ran and why, how each experiment shed light on what experiment(s) to run next, and what you learned from them. Given the time constraints, focus your presentation on the most relevant, unique, or creative parts of your project. Be prepared and use your presentation time wisely!

Submission and Due Dates

Written Report. Please hand in a hardcopy of your report at the beginning of class when the project is due.
Oral Report. Please submit a PowerPoint or a PDF file containing your presentation slides via myWPI (submission name: ProjxSlides, where x is the project number) by the deadline stated on each project's webpage. Only one of the team members needs to submit the slides.

Grading Criteria

Each of the five Project-Test Combinations in this course will count for 17.5% of the course grade, and typically this percentage will be split as follows: 11% test, 6% project report and 0.5% presentation.