Guidelines for Projects and for Written Reports
Each of the projects in this course deals with one or more specific
machine learning techniques. The guidelines below are intended to help you
structure the experimental work you are expected to do for each project,
as well as your written and oral reports.
An important aspect of both your written and oral reports is the
"story-telling" aspect. Try to tell the story of what experiments
you ran and why, how each experiment shed light on what experiment(s)
to run next, and what you learned from them.
Guidelines for the Experiments
- For each of the datasets in the project:
Before you start running experiments, make sure to understand the raw
data very well, to learn
as much as you can about the domain, and to research approaches used by
others on this dataset to the extent possible.
- Dataset Description:
In your written report, describe the dataset in terms of the attributes
present in the data, the number of instances, missing values, and
other relevant characteristics.
Describe briefly what pre-processing (if any) was used before any
experiments were run.
- For each machine learning technique used in the project:
- Study the corresponding chapters/sections of the textbook
specified in the Course Schedule.
- Algorithms and Code:
State in your report which Weka functions and which R functions
you use in your experiments.
For R, state if these functions are part of an existing package
(if so, say which one), or if you wrote the R code.
Read the Weka code and the R code (if available) and documentation
that implements and describes the technique.
In your written report, describe the algorithm underlying the code
in your own words.
Explain the algorithm in terms of the inputs it receives
and the outputs it produces, AND the main steps it follows to construct
the model, using high-level pseudo-code. Make sure to describe the correspondence between the algorithm
you describe and the parts of the code that implement it
(which are not necessarily the same as the ones described in the textbook or in class).
- Guiding Questions - Objectives of the Machine Learning Experiments:
Before you start running experiments, look at the raw data in detail.
Figure out 3 specific, interesting questions about the dataset domain
(e.g., diabetes, weather, labor contracts, ...)
that you want to answer with your machine learning experiments.
Choose questions that are about the domain (not about the
machine learning method or the experimental parameters!)
that would particularly benefit from using the learning method
under study.
These questions may be phrased as conjectures that you want to
confirm or refute with your experimental results, or as plain questions.
- Performance metric(s):
Use the performance metrics specified in the project description.
If these are not given,
explain what performance metric(s) you will use to evaluate the models
you construct (e.g., accuracy, error rate, size of the model,
readability of the model, ...) and why.
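For concreteness, here is a minimal plain-Python sketch (not part of the required Weka/R workflow; the labels are made up) of how accuracy, error rate, and a confusion matrix are computed from actual vs. predicted labels:

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def confusion_matrix(actual, predicted, labels):
    """counts[(a, p)] = number of instances of actual class a predicted as p."""
    counts = Counter(zip(actual, predicted))
    return {(a, p): counts[(a, p)] for a in labels for p in labels}

# Made-up labels, purely for illustration.
actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
acc = accuracy(actual, predicted)   # 3 of 5 correct -> 0.6
err = 1 - acc                       # error rate -> 0.4
cm = confusion_matrix(actual, predicted, ["yes", "no"])
```

Weka reports these same quantities in its classifier output; the point of the sketch is only to make the definitions unambiguous.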
- Preprocessing of the Data:
Keep the initial pre-processing to a minimum.
Apply only the necessary filters to your dataset
before you start mining the data, and then introduce additional pre-processing
as needed based on the results of the experiments you run.
Your report should contain a detailed description of the preprocessing of
your dataset and a justification of the steps you followed.
If Weka and R do not provide the functionality that you need
to preprocess your
data to obtain useful patterns, preprocess the data either
by writing your own scripts and filters, or by using a separate tool.
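As an illustration of such a hand-written filter, the sketch below (plain Python, hypothetical data with "?" marking missing values) replaces missing values in a numeric column with the column mean, similar in spirit to Weka's ReplaceMissingValues filter:

```python
import csv, io

def impute_missing(rows, col):
    """Replace '?' in a numeric column with the mean of the observed values."""
    observed = [float(r[col]) for r in rows if r[col] != "?"]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[col] == "?":
            r[col] = str(mean)
    return rows

# Hypothetical CSV data; in practice you would read from a file.
data = io.StringIO("age\n40\n?\n60\n")
rows = list(csv.DictReader(data))
rows = impute_missing(rows, "age")   # the '?' becomes the mean, 50.0
```

Whatever the filter does, document it in the report so the experiments remain reproducible.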
- Training and Testing Instances:
Use 10-fold cross-validation, unless the data is insufficient,
the combined execution time is prohibitive,
cross-validation is not consistent with the performance metric(s) chosen,
or the particular project description asks you to use a different
experimental protocol. Justify your choice if different from that
requested by the project description.
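To make the protocol concrete, here is a minimal sketch (plain Python, made-up labels, with a trivial majority-class learner standing in for the real one) of how k-fold cross-validation rotates the folds; real runs would normally also shuffle and stratify the data:

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Partition indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(labels, k=10):
    """Estimate accuracy of a majority-class predictor via k-fold CV:
    each fold serves once as the test set, the rest as training data."""
    folds = k_fold_indices(len(labels), k)
    correct = 0
    for test in folds:
        test_set = set(test)
        train = [labels[i] for i in range(len(labels)) if i not in test_set]
        majority = Counter(train).most_common(1)[0][0]   # "train" the model
        correct += sum(labels[i] == majority for i in test)
    return correct / len(labels)

cv_acc = cross_validate(["yes"] * 7 + ["no"] * 3, k=10)   # -> 0.7
```

Weka's Explorer and R's modeling packages perform this rotation for you; the sketch only shows what the reported cross-validated accuracy means.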
- Experiments:
You must run a sufficiently large and coherent set of experiments.
Start with a basic experiment with default parameters (if possible),
and design new experiments varying the settings
(i.e., pre-processing, parameters, and/or post-processing, ideally
varying one setting at a time)
based on the results that you obtain in your experiments.
Each experiment should be motivated by a previous experiment,
and by the guiding questions.
- For each experiment you run, describe:
- Objectives: Which of your 3 specific questions/conjectures
about the dataset domain you aim to answer/validate with
this experiment. Describe also any additional objectives for this
experiment that might have been motivated by your previous
experiments.
- Data: What data did you use to construct and test your model?
- Parameters and Settings:
Describe what parameter values and other settings you used
and why.
- Additional Pre- or Post-Processing:
Any additional pre- or post-processing done to the data or the
model in order to improve the model's performance,
as measured by the performance metric(s) chosen.
- Analysis of the constructed model:
- Describe the constructed model
(e.g., size of the model, readability).
If the model is readable, summarize in your own words what the model
says, focusing on the most interesting/relevant patterns.
Elaborate on whether and how the model answers the objectives of this
experiment.
- State what the performance of the model is, using the performance
metrics provided in the project description. If applicable,
elaborate on the confusion matrix and/or other relevant
performance indicators.
- How long did it take Weka/R to construct this model?
- Compare the performance of this model with that of other
models constructed in this project for this dataset.
- Summary of Results
- Provide insightful observations and comments on the results of
the experiments.
- Use appropriate visualizations and graphs when possible to
convey your observations and results.
- What general observations can you draw regarding the
quality/performance of the models as you varied the
settings (i.e., pre-processing, parameters, post-processing)
of the experiments?
- Which model(s) achieved the highest performance?
Consider both quantitative
(e.g., model accuracy, time taken to build the model, ...) as well as
qualitative (e.g., model size, readability, ...)
evaluation criteria.
- Discuss how this performance compares with that of your
best performing results from previous projects (if any) on the same
dataset, including ZeroR and OneR.
- Include the model (or at least a representative part of the model
if the full model is too large) in your report.
- Elaborate on whether your experiments helped you
answer your 3 guiding questions, and what answers you
obtained for them.
- Discuss how well this particular machine learning method worked
on this dataset. What combination of parameters yielded
particularly good results?
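For reference when comparing against the baselines above: ZeroR always predicts the majority class, and OneR picks the single attribute whose one-attribute rule makes the fewest training errors. A minimal sketch on made-up weather-style data:

```python
from collections import Counter, defaultdict

def zero_r(labels):
    """ZeroR: always predict the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

def one_r(instances, labels):
    """OneR: choose the attribute whose value -> majority-class rule
    misclassifies the fewest training instances."""
    best = None
    for attr in instances[0]:
        by_value = defaultdict(list)
        for inst, lab in zip(instances, labels):
            by_value[inst[attr]].append(lab)
        rule = {v: Counter(labs).most_common(1)[0][0]
                for v, labs in by_value.items()}
        errors = sum(lab != rule[inst[attr]]
                     for inst, lab in zip(instances, labels))
        if best is None or errors < best[1]:
            best = (attr, errors, rule)
    return best  # (attribute, training errors, value -> class rule)

# Hypothetical weather-style data, purely for illustration.
instances = [
    {"outlook": "sunny",    "windy": "yes"},
    {"outlook": "sunny",    "windy": "no"},
    {"outlook": "rainy",    "windy": "yes"},
    {"outlook": "rainy",    "windy": "no"},
    {"outlook": "overcast", "windy": "no"},
]
labels = ["no", "no", "no", "yes", "yes"]
baseline = zero_r(labels)                 # majority class: "no"
attr, errs, rule = one_r(instances, labels)
```

Any real model you report should beat (or at least be compared honestly against) these two trivial baselines, which Weka provides as the ZeroR and OneR classifiers.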
- Overall project conclusion:
Discuss the strengths and the weaknesses of your project.
Structure for Written Reports
Your written report must follow the structure below.
Only the required sections within the given space limits will
be read and graded.