CS 444X Data Mining and Knowledge Discovery in Databases - D Term 2004
This project is due on Wednesday, March 31 2004 at 12 NOON.
Project 1: Data Pre-processing, Mining, and Evaluation of Decision Trees
The purpose of this project is multi-fold:
- To gain experience "pre-processing" datasets
to clean, normalize, and discretize data attributes,
and, when needed, reduce the dimensionality of the data.
- To gain experience with the construction of decision trees.
- To gain experience with the evaluation of the models/patterns
constructed with a data mining technique.
- To gain familiarity with
the Weka system, its GUI, its code, and its input data format (arff).
For this and other course projects, we will use the Weka system.
Weka is an excellent machine-learning/data-mining environment.
It provides a large collection of Java-based mining algorithms,
data preprocessing filters, and experimentation capabilities.
Weka is open source software issued under the GNU General Public License.
For more information on the Weka system, to download the system, and
to get its documentation, visit the Weka website.
You should download and use the latest stable GUI version of the system.
Study the tutorial (Chapter 8 of your textbook) provided with the Weka system.
Note that the tutorial uses Weka's command line to illustrate
how to run the system, but you can actually use the GUI provided
with the system to execute the same commands.
- Datasets: Consider the following sets of data:
- The Mushroom Data Set.
The classification target is the "edible/poisonous" attribute.
- 1995 Data Analysis Exposition.
This dataset contains college data taken from the U.S. News & World Report's Guide to
America's Best Colleges. The necessary files are:
Let's make "private/public" the classification target. Note that even though the values
of this attribute are 0s and 1s, this is a nominal (not a numeric!) attribute.
For each of the above datasets,
use the "Explorer" option of the Weka system to perform the following tasks:
- Load the data. Note that you need to translate the dataset into the arff format first.
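To make the required translation concrete, here is a minimal sketch (not part of the handout) of what the ARFF syntax looks like, written as a small Python function. The relation and attribute names below are invented for illustration; your actual datasets will have their own.

```python
# Sketch: emit a tiny ARFF file from in-memory rows.
# Relation/attribute names here are illustrative only.
def to_arff(relation, attributes, rows):
    """attributes: list of (name, type) where type is the string
    'numeric' or a list of nominal values."""
    lines = ["@relation %s" % relation, ""]
    for name, typ in attributes:
        if typ == "numeric":
            lines.append("@attribute %s numeric" % name)
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(typ)))
    lines += ["", "@data"]
    for row in rows:
        # ARFF represents missing values with the character '?'.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines)

print(to_arff("colleges",
              [("tuition", "numeric"), ("private", ["0", "1"])],
              [(12000, "1"), (None, "0")]))
```

Note how the "private" attribute is declared with an explicit nominal value set {0,1} rather than as numeric, matching the remark about the college dataset above.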
- Preprocessing of the Data:
A main part of this project is the PREPROCESSING of your dataset. You
should apply relevant filters to your dataset
before doing the mining and/or using the results of previous mining tasks.
For instance, you may decide to remove apparently irrelevant attributes,
replace missing values if any, discretize attributes in a different way, etc.
Your report should contain a detailed description of the preprocessing of
your dataset and justifications of the steps you followed.
If Weka does not provide the functionality you need to preprocess your data
to obtain useful patterns, preprocess the data yourself
by writing the necessary filters (you can incorporate them into Weka if you wish).
- explore different ways of discretizing continuous attributes.
That is, convert numeric attributes into "nominal" ones
by binning numeric values into intervals - See the
weka.filter.DiscretizeFilter in Weka.
Play with the filter and read the Java code implementing it.
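As a concrete illustration (not part of the handout), here is a sketch of equal-width binning, one common discretization strategy that such a filter can apply; the bin count is a parameter you would experiment with.

```python
# Sketch: equal-width discretization of a numeric attribute.
# Each value is mapped to a nominal label "bin0" .. "bin{n-1}".
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append("bin%d" % idx)
    return labels

print(equal_width_bins([1, 2, 3, 10], 2))  # → ['bin0', 'bin0', 'bin0', 'bin1']
```

Note the drawback visible even in this toy run: with a skewed distribution, equal-width bins can be very unbalanced, which is one reason to also try equal-frequency or other binning strategies.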
- explore different ways of removing missing values.
Missing values in arff files are represented with the character "?".
See the weka.filter.ReplaceMissingValuesFilter in Weka.
Play with the filter and read the Java code implementing it.
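To our understanding, this filter replaces each missing value with the column mean (for numeric attributes) or the modal value (for nominal ones). The following sketch (not part of the handout) illustrates that idea on plain Python lists:

```python
from collections import Counter

# Sketch: fill '?' entries with the column mean (numeric attribute)
# or the most frequent value (nominal attribute).
def fill_missing(column, numeric):
    present = [v for v in column if v != "?"]
    if numeric:
        fill = sum(float(v) for v in present) / len(present)
        return [fill if v == "?" else float(v) for v in column]
    fill = Counter(present).most_common(1)[0][0]
    return [fill if v == "?" else v for v in column]

print(fill_missing(["1", "?", "3"], numeric=True))   # → [1.0, 2.0, 3.0]
print(fill_missing(["a", "a", "?"], numeric=False))  # → ['a', 'a', 'a']
```

Whether mean/mode replacement is appropriate for your dataset is exactly the kind of decision your report should justify; sometimes dropping the instance or the attribute is the better choice.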
To the extent possible/necessary, modify the attribute names and the nominal
value names so that the resulting decision trees are easy to read.
- Mining of Patterns:
Use the "ZeroR" classifier under the "Classify" tab.
This provides a baseline classification accuracy against which to compare
the accuracy of your decision trees below.
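ZeroR is trivially simple: for a nominal class it predicts the majority class of the training data, ignoring every other attribute. A minimal sketch (not part of the handout) of that behavior:

```python
from collections import Counter

# Sketch: ZeroR for a nominal class attribute. Training finds the
# majority class; the returned classifier predicts it for any instance.
def zero_r(train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority

classify = zero_r(["edible", "edible", "poisonous"])
print(classify({"odor": "foul"}))  # → edible
```

Its accuracy therefore equals the relative frequency of the majority class, which is why it makes a useful benchmark: any decision tree worth reporting should beat it.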
- Decision Trees:
The following are guidelines for the construction of your decision tree:
- Evaluation and Testing:
Use different ways of testing your results for each of the mining techniques employed
(i.e. ZeroR, ID3, J4.8).
- Supply input data and mine and evaluate your model over this same input data.
- Supply separate training and testing data to Weka.
- Supply input
data to Weka and experiment with several split ratios
for training and testing data.
- Supply input
data to Weka and
use n-fold crossvalidation to test your results.
Experiment with different values for the number of folds.
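The two holdout schemes above can be sketched in a few lines (not part of the handout; note that Weka's cross-validation additionally stratifies the folds by class, which this toy version skips):

```python
import random

# Sketch: percentage split and n-fold cross-validation partitions.
def percentage_split(data, train_ratio, seed=1):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

def n_folds(data, n):
    # Fold i holds out every n-th instance starting at index i;
    # the remaining instances form the training set.
    for i in range(n):
        test = data[i::n]
        train = [x for j, x in enumerate(data) if j % n != i]
        yield train, test

train, test = percentage_split(list(range(10)), 0.66)
print(len(train), len(test))  # → 6 4
for tr, te in n_folds(list(range(6)), 3):
    print(len(tr), len(te))   # → 4 2 (three times)
```

With n-fold cross-validation every instance is tested exactly once, which generally gives a less variable accuracy estimate than a single split; experimenting with the number of folds, as the handout asks, shows this trade-off directly.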
- Pruning of your decision tree:
Experiment with Weka's J4.8 classifier to see how it performs
pre- and/or post-pruning of the decision tree in order
to increase the classification accuracy and/or to reduce the size of the tree.
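Since the report asks you to explain the algorithms underlying ID3 and J4.8, it helps to recall the split criterion both build on. The sketch below (not from the handout; the toy rows are invented) computes the information gain ID3 uses to pick the attribute to split on:

```python
from collections import Counter
from math import log2

# Sketch: ID3's information-gain split criterion.
# 'rows' are dicts mapping attribute names to values;
# 'target' is the name of the class attribute.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [{"odor": "foul", "class": "p"}, {"odor": "foul", "class": "p"},
        {"odor": "none", "class": "e"}, {"odor": "none", "class": "e"}]
print(info_gain(rows, "odor", "class"))  # → 1.0
```

ID3 picks the attribute with the highest gain at each node; J4.8 (Weka's C4.5 implementation) refines this with gain ratio, handles numeric attributes and missing values, and prunes the resulting tree.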
REPORTS AND DUE DATE
- Written Report.
Your report should contain the following sections with the corresponding discussions:
- Group members:
Name of the group members and a description of what precisely each group member
contributed to the project.
- Code Description:
Briefly describe the Weka code of the classifiers and filters that you used in this project.
More precisely, explain the algorithm underlying the code in terms of the input
it receives, the output it produces, and the main steps it follows to produce this output.
Describe the dataset that you selected in terms of the attributes
present in the data, the number of instances, missing values, and
other relevant characteristics.
Provide a detailed description of the preprocessing of your data.
Justify the preprocessing you apply and why the resulting data
is the appropriate one for mining decision trees from it.
For each experiment you ran, describe:
- Instances: What data did you use for the experiment?
That is, did you use the entire dataset or just a subset of it?
- Any pre-processing done to the data. That is, did you remove
any attributes? Did you discretize any continuous attribute?
If so, what strategy did you use to bin the values?
Did you replace missing values?
If so, what strategy did you use to select a replacement of
the missing values?
- Your system parameters.
- For the ZeroR, ID3, and J4.8 classifiers:
- Results and detailed ANALYSIS of results of the experiments you ran
using different ways of testing the classifier (crossvalidation, etc.).
- Accuracy of the resulting models
- Comparison of the classification accuracies of the models
obtained with the ZeroR, ID3, and J4.8 classifiers.
- Summary of Results
- What was the accuracy of the most accurate decision tree constructed in your experiments?
- Include the most accurate tree (after pruning, to save space) that
you obtained in your report.
- Strengths and weaknesses of your project.
- Oral Report.
We will discuss the results from the individual projects during the class
on April 1st. Your oral report should summarize the different sections of
your written report as described above.
Each group will have about 4 minutes to explain your results and to
discuss your project in class. Be prepared!
- Submission and Due Date.
Please submit the following files using the
turnin system by
12:00 NOON on Wed, March 31 2004.
For your turnin submission, THE NAME OF THE PROJECT IS "project1".
PLEASE MAKE JUST ONE PROJECT SUBMISSION PER GROUP.
Submissions received on Wed, March 31
between 12:01 pm and 4:00 pm will be penalized with 30% off the grade and
submissions after March 31 4:00 pm won't be accepted.
- A file containing your written report in PDF.
For instance, my file would be named
(note the use of lower case letters only):
- ruiz_smith_proj1_report.pdf if I worked with Joe Smith on this project.
- ruiz_proj1_report.pdf if I worked alone (only in the case that I'm
taking this course for grad. credit).
Turnin complains about file names that are too long.
If the name of your file is too long, feel free to shorten it
as necessary, but please keep the _proj1_report.pdf part
intact for easy identification.
- A file containing your slides for your oral report.
This file should be either a PDF file (ext=pdf)
or a PowerPoint file (ext=ppt).
TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY
(TOTAL: 20 points) PRE-PROCESSING OF THE DATASET:
(05 points) Translating both input datasets into .arff
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
(i.e. using nominal values instead of numeric
when appropriate, using as many of them
as possible, etc.)
(up to 5 extra credit points)
Trying to do "fancier" things with attributes
(i.e. combining two attributes highly correlated
into one, using background knowledge, etc.)
(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
(04 points) Description of the algorithm underlying the Weka filters used
(02 points) Description of the algorithm underlying Weka's ZeroR code
(04 points) Description of the algorithm underlying Weka's ID3 code
(05 points) Description of the algorithm underlying Weka's J4.8 code
(TOTAL: 60 points) EXPERIMENTS
(TOTAL: 28 points each dataset) FOR EACH DATASET:
(02 points) ran at least a reasonable number of experiments
to get familiar with ZeroR
(TOTAL: 26 points) For each decision tree method required
ID3 and J4.8 (13 points each):
(05 points) ran at least a reasonable number of experiments
to get familiar with the decision tree method and
different evaluation methods (%split, cross-validation,...)
(03 points) good description of the experiment setting and the results
(05 points) good analysis of the results of the experiments
(up to 4 extra credit points)
excellent analysis of the results
(04 points) comparison of the results obtained with ZeroR,
ID3, and J4.8 and summary of the project
(TOTAL: 5 points) SLIDES - how well do they summarize concisely
the results of the project?