Use the ticdata2000.txt data for training and the ticeval2000.txt data for testing. You may restrict your experiments to a subset of the instances IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision tree is, the better.
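As a starting point, the raw whitespace-separated .txt files can be converted into ARFF files that Weka loads directly. The following is only a minimal conversion sketch: it assumes every column is an integer, that the class is the last column of the file, and that the attribute and class names used here (attr1, ..., CARAVAN and the values {0,1}) are placeholders you should replace with the names from the data dictionary.

    // Sketch: convert a whitespace-separated TIC file into an ARFF file for Weka.
    // Assumes all columns are integers and the last column is the class label.
    import java.io.*;
    import java.util.*;

    public class TicToArff {
        public static void main(String[] args) throws IOException {
            String inFile = args[0];      // e.g. ticdata2000.txt
            String outFile = args[1];     // e.g. ticdata2000.arff
            BufferedReader in = new BufferedReader(new FileReader(inFile));
            PrintWriter out = new PrintWriter(new FileWriter(outFile));

            // Read all rows first so the number of columns is known.
            List<String[]> rows = new ArrayList<String[]>();
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) rows.add(line.split("\\s+"));
            }
            in.close();
            int numAtts = rows.get(0).length;

            // Minimal ARFF header: numeric attributes, nominal class last.
            out.println("@relation tic");
            for (int i = 0; i < numAtts - 1; i++) {
                out.println("@attribute attr" + (i + 1) + " numeric");
            }
            out.println("@attribute CARAVAN {0,1}");  // placeholder class name/values
            out.println("@data");
            for (String[] row : rows) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < row.length; i++) {
                    if (i > 0) sb.append(',');
                    sb.append(row[i]);
                }
                out.println(sb.toString());
            }
            out.close();
        }
    }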
A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data in order to obtain useful patterns, preprocess the data yourself, for instance by writing the necessary filters (you can incorporate them into Weka if you wish).
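As an illustration, the sketch below chains three common Weka filters: removing attributes, replacing missing values, and discretizing numeric attributes. The class and package names assume a Weka 3.x API, and the attribute indices and number of bins are placeholders, not a recommendation for this particular dataset.

    // Sketch of a preprocessing pipeline using Weka's filter classes.
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;
    import weka.filters.unsupervised.attribute.Discretize;
    import java.io.BufferedReader;
    import java.io.FileReader;

    public class Preprocess {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(
                new BufferedReader(new FileReader("ticdata2000.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // 1. Drop attributes judged irrelevant (placeholder indices).
            Remove remove = new Remove();
            remove.setAttributeIndices("1,2,5");
            remove.setInputFormat(data);
            data = Filter.useFilter(data, remove);

            // 2. Replace any missing values with attribute means/modes.
            ReplaceMissingValues rmv = new ReplaceMissingValues();
            rmv.setInputFormat(data);
            data = Filter.useFilter(data, rmv);

            // 3. Discretize the numeric attributes into equal-width bins
            //    (a nominal class attribute is left untouched).
            Discretize disc = new Discretize();
            disc.setBins(5);
            disc.setInputFormat(data);
            data = Filter.useFilter(data, disc);

            System.out.println(data.toSummaryString());
        }
    }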
To the extent possible, modify the attribute names and the value names so that the resulting decision trees are easier to read.
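One way to rename attributes and values programmatically is sketched below; it assumes the renameAttribute/renameAttributeValue methods of a Weka 3.x Instances class, and the descriptive names shown are hypothetical. Editing the @attribute lines of the ARFF header by hand achieves the same effect.

    // Sketch: give attributes and class values readable names before mining.
    import weka.core.Instances;
    import java.io.BufferedReader;
    import java.io.FileReader;

    public class RenameAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(
                new BufferedReader(new FileReader("ticdata2000.arff")));
            data.renameAttribute(0, "CustomerSubtype");           // hypothetical name
            data.renameAttributeValue(data.numAttributes() - 1,   // class attribute
                                      1, "BuysCaravanPolicy");    // hypothetical label
            System.out.println(data.toString());  // toString() prints the ARFF text
        }
    }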
Determine (by reading Weka's ID3 code) whether or not Weka performs any pre- or post-pruning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree. If so, experiment with this functionality. Modify the code if needed to allow for pre- and/or post-pruning of the tree. Also, experiment with Weka's J48 classifier.
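A possible pruning experiment with J48 on a held-out test set is sketched below. The package names and option setters assume a Weka 3.x release, and the particular parameter values (confidence factor, minimum objects per leaf) are only examples to vary in your own runs.

    // Sketch: compare an unpruned and a pruned J48 tree on a separate test set.
    import weka.core.Instances;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import java.io.BufferedReader;
    import java.io.FileReader;

    public class PruningExperiment {
        public static void main(String[] args) throws Exception {
            Instances train = new Instances(
                new BufferedReader(new FileReader("ticdata2000.arff")));
            Instances test = new Instances(
                new BufferedReader(new FileReader("ticeval2000.arff")));
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Unpruned tree.
            J48 unpruned = new J48();
            unpruned.setUnpruned(true);
            unpruned.buildClassifier(train);

            // Pruned tree with a more aggressive confidence factor than the default 0.25.
            J48 pruned = new J48();
            pruned.setConfidenceFactor(0.1f);
            pruned.setMinNumObj(5);
            pruned.buildClassifier(train);

            for (J48 tree : new J48[] { unpruned, pruned }) {
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(tree, test);
                System.out.println("Tree size: " + tree.measureTreeSize());
                System.out.println(eval.toSummaryString());
            }
        }
    }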
Your report should contain the following sections with the corresponding discussions:
Provide a detailed description of the preprocessing of your data. Justify the preprocessing you applied and explain why the resulting data is appropriate for mining decision trees.
Please submit the following files by email to ruiz@cs.wpi.edu by 8:00 am on Monday, January 27, 2003. Submissions received on Monday, Jan. 27 between 8:01 am and 10:00 am will be penalized with 30% off the grade, and submissions after 10:00 am won't be accepted.