A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data so as to obtain useful patterns, preprocess the data yourself, e.g. by writing the necessary filters (you can incorporate them into Weka if you wish).
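Recall that Weka reads datasets in the ARFF format, so preprocessing typically begins by writing your data as an .arff file: a header declaring each attribute and its type, followed by the data rows. A minimal sketch (the relation and attribute names here are hypothetical, not part of your assigned datasets):

```text
% Hypothetical .arff sketch: nominal values are enumerated in braces,
% numeric attributes are declared as "numeric".
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny, 85, no
overcast, 83, yes
% a '?' in a data row marks a missing value
rainy, ?, yes
```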
To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting decision trees are easy to read.
You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision trees, the better.
Experiment with Weka's J4.8 classifier to see how it performs pre- and/or post-pruning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree.
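Before J4.8 prunes anything, it grows the tree with the same greedy, entropy-based splitting that ID3 uses (J4.8/C4.5 refines plain information gain into gain ratio). The core criterion can be sketched in a few lines of Python; the toy dataset below is hypothetical:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / n * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

# Toy example: one nominal attribute that perfectly predicts the class,
# so the split removes all uncertainty.
rows = [("sunny",), ("sunny",), ("rainy",), ("rainy",)]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # 1.0
```

ID3 picks the attribute maximizing this gain at every node; this is why attributes with cleaner class separation end up near the root of the trees Weka shows you.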
Your report should contain the following sections with the corresponding discussions:
Provide a detailed description of the preprocessing of your data. Justify the preprocessing you apply and explain why the resulting data is appropriate for mining decision trees.
Please submit the following files using the turnin system by 10:00 am on Wed, March 26 2003. For your turnin submission, THE NAME OF THE PROJECT IS "project1". PLEASE MAKE JUST ONE PROJECT SUBMISSION PER GROUP. Submissions received on Wed, March 26 between 10:01 am and 12 noon will be penalized with 30% off the grade, and submissions after noon on March 26 won't be accepted.
TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(TOTAL: 20 points) PRE-PROCESSING OF THE DATASET:
- (05 points) Translating both input datasets into .arff
- (05 points) Discretizing attributes as needed
- (05 points) Dealing with missing values appropriately
- (05 points) Dealing with attributes appropriately (i.e. using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
- (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e. combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE:
- (04 points) Description of the algorithm underlying the Weka filters used
- (02 points) Description of the algorithm underlying Weka's ZeroR code
- (04 points) Description of the algorithm underlying Weka's ID3 code
- (05 points) Description of the algorithm underlying Weka's J4.8 code

(TOTAL: 60 points) EXPERIMENTS
(TOTAL: 28 points each dataset) FOR EACH DATASET:
- (02 points) Ran at least a reasonable number of experiments to get familiar with ZeroR
- (TOTAL: 26 points) For each required decision tree method, ID3 and J4.8 (13 points each):
  - (05 points) Ran at least a reasonable number of experiments to get familiar with the decision tree method and different evaluation methods (%split, cross-validation, ...)
  - (03 points) Good description of the experiment setting and the results
  - (05 points) Good analysis of the results of the experiments
  - (up to 4 extra credit points) Excellent analysis of the results
- (04 points) Comparison of the results obtained with ZeroR, ID3, and J4.8, and summary of the project

(TOTAL: 5 points) SLIDES - how well do they summarize concisely the results of the project?
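The evaluation methods named in the experiments section (%split and cross-validation) both estimate accuracy on instances held out from training. A minimal sketch of k-fold cross-validation in Python, using ZeroR (always predict the majority class of the training fold) as the classifier; the class distribution below is hypothetical:

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: always predict the most frequent class seen in training."""
    return Counter(train_labels).most_common(1)[0][0]

def cross_validated_accuracy(labels, k=10):
    """k-fold cross-validation of ZeroR (no attributes are needed)."""
    n = len(labels)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))   # every k-th instance held out
        train = [lab for i, lab in enumerate(labels) if i not in test_idx]
        prediction = zero_r(train)          # train on the remaining folds
        correct += sum(1 for i in test_idx if labels[i] == prediction)
    return correct / n

# Hypothetical class distribution: 70 "yes", 30 "no"
labels = ["yes"] * 70 + ["no"] * 30
print(cross_validated_accuracy(labels, k=10))  # 0.7
```

This also shows why ZeroR is the baseline for your comparisons: its cross-validated accuracy is just the majority-class frequency, and ID3 or J4.8 is only doing useful work when it beats that number.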