THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, how to prepare your written summary, and how to study for the test.
You must follow the 5-page written report format described in the
PROJECT GUIDELINES.
In particular for this project:
Page 2 should contain a table summarizing the classification experiments run with
decision trees.
Page 3 should contain a table summarizing the regression experiments run with
linear regression, model trees, and regression trees.
*** You must use the
Project 2 Template provided for your written report. ***
(If you prefer not to use Word, you may copy and paste this format into a
different editor as long as you respect the stated page structure and
page limit.)
- Data Mining Technique(s):
Run experiments in Weka, Matlab, SPSS (linear regression only), and RapidMiner
using the following techniques:
- Pre-processing Techniques:
Feature selection, feature creation, dimensionality reduction, noise reduction, attribute discretization, ...
- Classification Techniques:
- Zero-R (majority class)
- One-R
- Decision trees:
Using Weka (J48),
Matlab (see
Matlab decision tree demo),
and RapidMiner.
Since these decision tree implementations can handle
numeric attributes and missing values directly, make sure to run
some experiments with no pre-processing and
some experiments with pre-processing (discretizing continuous attributes and
replacing missing values beforehand), and compare the results.
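The with/without pre-processing comparison above can be sketched in Python with scikit-learn (a stand-in for the Weka/Matlab/RapidMiner tools the assignment actually uses; the synthetic data and bin count are illustrative assumptions, not part of the assignment):

```python
# Illustrative sketch, NOT the assignment tooling: compare a decision tree
# trained on raw numeric attributes against one trained on attributes
# discretized beforehand, as the project asks you to do in Weka/RapidMiner.
import numpy as np
from sklearn.datasets import make_classification  # synthetic stand-in data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (a) no pre-processing: the tree splits on numeric attributes directly
raw_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
raw_acc = raw_tree.score(X_te, y_te)

# (b) pre-processing: discretize continuous attributes first (5 bins is
# an arbitrary illustrative choice)
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_tr_d = disc.fit_transform(X_tr)
X_te_d = disc.transform(X_te)
disc_tree = DecisionTreeClassifier(random_state=0).fit(X_tr_d, y_tr)
disc_acc = disc_tree.score(X_te_d, y_te)

print(f"raw accuracy:         {raw_acc:.3f}")
print(f"discretized accuracy: {disc_acc:.3f}")
```

In your report you would record both accuracies (from the actual tools) in the Page 2 table and comment on which pre-processing helped.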
- Regression Techniques:
- Linear Regression: Weka (under "functions"), Matlab, SPSS.
- Regression Trees: Weka (M5P under "trees"), Matlab.
- Model Trees: Weka (M5P under "trees"), Matlab.
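To make the three regression families concrete, here is a minimal Python sketch (the assignment itself uses Weka, Matlab, and SPSS; the toy data is made up). A linear regression fits one global line; a regression tree predicts a constant (the mean) in each leaf; a model tree such as Weka's M5P fits a linear model in each leaf instead:

```python
# Illustrative sketch of two of the three regression families.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)  # known true line

# Linear regression: least-squares fit of slope and intercept
A = np.column_stack([X[:, 0], np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef

# Regression tree: piecewise-constant predictions (each leaf stores a mean)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
tree_pred = tree.predict(X)

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
# A model tree (e.g. Weka's M5P) would instead fit a linear model per leaf,
# combining the two ideas above; scikit-learn has no built-in model tree.
```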
- Dataset(s):
In this project, we will use the
Bike Sharing Dataset
(use the day.csv data file) available at the
UCI Machine Learning Repository.
Run experiments with and without discretizing the predicting attributes;
with and without removing attributes that are too closely related to the target
(e.g., casual and registered when predicting cnt) or that make the trees too long;
and with any other pre-processing and post-processing that produce useful and
meaningful models.
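The "remove attributes too related to the target" step matters because, in this dataset, casual + registered sum exactly to cnt, so leaving them in lets a model predict cnt trivially. A Python/pandas sketch of that pre-processing (the three inline rows are a tiny made-up stand-in for day.csv, using a subset of its real column names):

```python
# Sketch of dropping leaky attributes before predicting cnt.
# The inline CSV is a made-up miniature stand-in for day.csv.
import io
import pandas as pd

csv = io.StringIO(
    "season,temp,casual,registered,cnt\n"
    "1,0.34,331,654,985\n"
    "1,0.36,131,670,801\n"
    "2,0.60,745,1680,2425\n"
)
df = pd.read_csv(csv)

# casual and registered sum to cnt, so keeping them makes the target
# trivially recoverable; drop them before mining for meaningful models.
features = df.drop(columns=["casual", "registered", "cnt"])
target = df["cnt"]
print(list(features.columns))
```

The same drop can be done in Weka's Preprocess tab (Remove filter) or with Matlab/RapidMiner attribute selection.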
- Performance Metric(s):
- Use the following metrics or evaluation methods:
- For classification tasks:
use classification accuracy and confusion matrices.
- For regression tasks:
use any subset of the following error metrics
that you find appropriate: mean-squared error, root mean-squared error,
mean absolute error, relative squared error, root relative squared error,
relative absolute error, and correlation coefficient. An important part
of the data mining evaluation in this project is to try to make sense
of these performance metrics and to become familiar with them.
- Also report, as separate measures to evaluate the "goodness" of your models:
- the size of the tree,
- the readability of the tree, and
- the time it took to construct the tree.
- Compare each accuracy/error you obtained against those of benchmarking techniques
such as ZeroR and OneR over the same (sub-)set of data instances you used in
the corresponding experiment.
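One way to make sense of the relative metrics: relative absolute error and relative squared error divide your model's errors by the errors of a ZeroR-style baseline that always predicts the mean, so values below 1 mean you beat that benchmark. A NumPy sketch with made-up numbers (the formulas follow the standard definitions; verify against your tool's output):

```python
# Sketch: compute the listed regression error metrics by hand and compare
# against a ZeroR-style baseline (always predict the training mean).
# The y values are made-up illustrative data.
import numpy as np

y_true = np.array([10.0, 12.0, 15.0, 11.0, 18.0])
y_pred = np.array([11.0, 12.5, 14.0, 10.0, 17.0])
baseline = np.full_like(y_true, y_true.mean())  # ZeroR for regression

def metrics(y, p, base):
    err, base_err = p - y, base - y
    return {
        "MSE":  np.mean(err**2),
        "RMSE": np.sqrt(np.mean(err**2)),
        "MAE":  np.mean(np.abs(err)),
        "RAE":  np.sum(np.abs(err)) / np.sum(np.abs(base_err)),
        "RSE":  np.sum(err**2) / np.sum(base_err**2),
        "RRSE": np.sqrt(np.sum(err**2) / np.sum(base_err**2)),
        "corr": np.corrcoef(y, p)[0, 1],
    }

m = metrics(y_true, y_pred, baseline)
for name, value in m.items():
    print(f"{name}: {value:.3f}")  # RAE/RSE/RRSE < 1 means beating ZeroR
```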
- Remember to experiment with pruning of your tree:
apply pre- and/or post-pruning in order to increase the classification
accuracy, reduce the prediction error, and/or reduce the size of the tree.
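The pruning trade-off can be previewed in Python with scikit-learn's cost-complexity post-pruning (a stand-in for J48's pruning options in Weka; the noisy synthetic data and the ccp_alpha value are illustrative assumptions):

```python
# Sketch of post-pruning: cost-complexity pruning (ccp_alpha) shrinks the
# tree; on noisy data this often helps held-out accuracy as well.
# Pre-pruning would instead cap growth up front (e.g. max_depth=...).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.1 injects 10% label noise so the unpruned tree overfits
X, y = make_classification(n_samples=500, n_features=10,
                           flip_y=0.1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

unpruned = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X_tr, y_tr)

print("unpruned nodes:", unpruned.tree_.node_count,
      "acc:", round(unpruned.score(X_te, y_te), 3))
print("pruned   nodes:", pruned.tree_.node_count,
      "acc:", round(pruned.score(X_te, y_te), 3))
```

Reporting the node counts alongside accuracy is exactly the size-vs-accuracy comparison the guidelines ask for.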