For each of the datasets in the project:
In your written report, describe the dataset in terms of the attributes
present in the data, the number of instances, missing values, and
other relevant characteristics.
- For each machine learning technique used in the project:
[100 points (+5 extra points) per machine learning technique, per dataset]
- Study the corresponding chapters/sections of the textbook
specified in the Course Schedule.
- [15 points] Algorithms and Code:
Read the Weka code that implements the technique.
In your written report, describe the algorithm underlying the code
IN YOUR OWN WORDS.
Explain the algorithm in terms of the inputs it receives
and the outputs it produces, AND the main steps it follows to construct
the model. Make sure to describe the correspondence between the algorithm
you describe and the parts of the code that implement it.
- [5 points] Objectives of the Data Mining Experiments:
Before you start running experiments, look at the raw data in detail.
Figure out 3 to 5 specific, interesting questions about the domain
that you want to answer with your machine learning experiments.
Try to choose questions that are about the domain (not about the
machine learning method or the experimental parameters!)
that would particularly benefit from using the machine learning method
under study.
These questions may be phrased as conjectures that you want to
confirm/refute with your experimental results, or as plain questions.
- [2 points] Performance metric(s):
Explain what performance metric(s) will be used to evaluate the models
you construct (e.g., accuracy, error rate, ...) and why.
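As a reference point, accuracy and error rate can be computed directly from predicted versus actual class labels. A minimal sketch in pure Python, using hypothetical labels (Weka reports these same figures in its evaluation output):

```python
# Minimal sketch: accuracy and error rate from predicted vs. actual
# class labels. The label lists below are hypothetical.
def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "no", "yes"]

acc = accuracy(actual, predicted)
err = 1 - acc  # error rate is the complement of accuracy
print(f"accuracy = {acc:.2f}, error rate = {err:.2f}")  # accuracy = 0.80, error rate = 0.20
```

Whatever metric you choose, state it before running experiments so that all models for a dataset are compared on the same footing.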
- [10 points] Preprocessing of the Data:
You should apply relevant filters to your dataset as needed
before doing the mining and/or using the results of previous mining tasks.
For instance, you may decide to remove apparently irrelevant attributes,
replace missing values if any, discretize attributes in a different way, etc.
Your report should contain a detailed description of the preprocessing of
your dataset and a justification of the steps you followed.
If Weka does not provide the functionality that you need to preprocess your
data to obtain useful patterns, preprocess the data yourself either
by writing the necessary filters (you can incorporate them in Weka if you
wish), or by using a separate tool.
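To make the kind of filtering involved concrete, here is a sketch of two common preprocessing steps, mean replacement of missing values and equal-width discretization, written in pure Python over a hypothetical numeric attribute (Weka's ReplaceMissingValues and Discretize filters provide these and more):

```python
# Sketch of two common preprocessing filters, assuming a numeric
# attribute stored as a list with None marking missing values.
# The column below is hypothetical.

def replace_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def discretize_equal_width(values, bins=3):
    """Map each value to a bin index using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), bins - 1) for v in values]

col = [1.0, None, 3.0, 5.0]
filled = replace_missing_with_mean(col)   # [1.0, 3.0, 3.0, 5.0]
binned = discretize_equal_width(filled)   # [0, 1, 1, 2]
```

Whichever tool performs the filtering, record the exact parameters (bin counts, replacement strategy, removed attributes) in your report so the preprocessing is reproducible.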
- [3 points] Training and Testing Instances:
Use 10-fold cross-validation, unless the data is insufficient,
the combined execution time is prohibitive,
cross-validation is not consistent with the performance metric(s) chosen,
or the particular project description asks you to use a different
experimental protocol. Justify your choice if different from that
requested by the project description.
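For intuition about what 10-fold cross-validation does, the procedure can be sketched in pure Python: shuffle the data, partition it into k folds, and average the score obtained by training on k-1 folds and testing on the held-out one. The `evaluate` function here is a hypothetical stand-in for building and scoring a model; Weka performs all of this internally when you select "Cross-validation":

```python
# Sketch of k-fold cross-validation (k=10 by default), assuming the
# dataset is a list of instances and `evaluate` is a caller-supplied,
# hypothetical function that builds a model on `train` and returns
# its score on `test`.
import random

def cross_validate(instances, evaluate, k=10, seed=1):
    """Shuffle, split into k folds, and average the per-fold scores."""
    data = list(instances)
    random.Random(seed).shuffle(data)          # fixed seed for reproducibility
    folds = [data[i::k] for i in range(k)]     # k roughly equal partitions
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(train, test))
    return sum(scores) / k
```

Note that every instance is used for testing exactly once, which is why the averaged estimate is less sensitive to a lucky or unlucky single split.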
- [60 points] Experiments:
- For each experiment you ran, describe:
- Objectives: Which of your 3-5 specific questions/conjectures
about the dataset domain you aim to answer/validate with
this experiment. Describe also any additional objectives for this
experiment that might have been motivated by your previous
experiments.
- Data: What data did you use to construct and test your model?
- Parameters and Settings:
Describe what parameter values and other settings you used
and why.
- Additional Pre- or Post-Processing:
Describe any additional pre- or post-processing applied to the data
or the model in order to improve the model's performance,
as measured by the performance metric(s) chosen.
[Parts 1-4 combined: 10 points]
- Resulting model:
Describe the resulting model (e.g., size of the model, readability).
If the model is readable summarize in your own words what the model
says, focusing on the most interesting/relevant patterns.
Elaborate on whether and how the model answers the objectives of this
experiment.
- Performance of the resulting model:
- State what the performance of the model is. If applicable,
elaborate on the confusion matrix and/or other relevant
performance indicators.
- How long did it take Weka to construct this model?
- Compare the performance of this model with that of other
models constructed in this project for this dataset.
[Parts 5-6 combined: 10 points for presentation of results and
30 points for in-depth discussion of results]
[10 points for sufficient and coherent set of experiments]
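When elaborating on the confusion matrix, it helps to show how the standard indicators are derived from its cells. A minimal sketch for a two-class matrix, using hypothetical counts:

```python
# Deriving indicators from a 2x2 confusion matrix (hypothetical counts).
# Rows are actual classes, columns are predicted classes:
#               predicted yes   predicted no
# actual yes         tp              fn
# actual no          fp              tn
tp, fn, fp, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fn + fp + tn)   # 0.85
precision = tp / (tp + fp)                    # ~0.89: how trustworthy a "yes" prediction is
recall    = tp / (tp + fn)                    # 0.80: how many actual "yes" instances are found
print(accuracy, precision, recall)
```

Discussing precision and recall alongside accuracy is especially useful when the class distribution is skewed, since a model can reach high accuracy by always predicting the majority class.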
- [10 points] Summary of Results
- Which model(s) achieved the highest performance?
- Discuss how this performance compares with that of your
best performing results from previous projects on the same
dataset.
- Elaborate on whether or not your experiments helped you
answer your initial 3-5 objectives, and what answers you
obtained for these guiding questions.
- Discuss how well this particular machine learning method worked
on this dataset. What combination of parameters yielded
particularly good results?
- Overall project conclusion:
Discuss the strengths and the weaknesses of your project.