This project consists of two parts:
Consider the following subset of the Mushroom dataset.
@relation sample-mushroom

@attribute cap-surface {fibrous,grooves,scaly,smooth}
@attribute bruises? {bruises,no}
@attribute gill-size {broad,narrow}
@attribute habitat {grasses,leaves,meadows,paths,urban,waste,woods}
@attribute poisonousness {edible,poisonous}

@data
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
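If you want to sanity-check your counts programmatically, the following is a minimal, hand-rolled Python parser for this nominal ARFF listing. It assumes the listing above is saved verbatim in a file; the filename "sample-mushroom.arff" is a hypothetical placeholder.

attributes, rows = [], []
with open("sample-mushroom.arff") as f:
    in_data = False
    for line in f:
        line = line.strip()
        if not line or line.startswith("%"):
            continue                                # skip blanks and ARFF comments
        if line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])      # keep the attribute name only
        elif line.lower().startswith("@data"):
            in_data = True                          # everything after @data is an instance
        elif in_data:
            rows.append(line.split(","))

print(attributes)   # ['cap-surface', 'bruises?', 'gill-size', 'habitat', 'poisonousness']
labels = [r[-1] for r in rows]
print(labels.count("edible"), labels.count("poisonous"))   # 10 10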
To learn more about the Mushroom Data Set, see Part II of this assignment.
Show all the steps of your calculations. For your convenience, the base-2 logarithms of selected values are provided below. For any log_2 values you need that are not listed here, make sure you compute them correctly, e.g. as log_2(x) = log_10(x) / log_10(2) = ln(x) / ln(2), since some calculators do not have a log_2 key.
x       | 1/2 | 1/3  | 1/4 | 3/4  | 1/5  | 2/5  | 3/5  | 1/6  | 5/6  | 1/7  | 2/7  | 3/7  | 4/7  | 1
log2(x) | -1  | -1.6 | -2  | -0.4 | -2.3 | -1.3 | -0.7 | -2.6 | -0.3 | -2.8 | -1.8 | -1.2 | -0.8 | 0
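To check your hand calculations, here is a minimal Python sketch of the entropy and information-gain arithmetic that ID3 performs when choosing the root attribute. This is illustrative code only, not Weka's ID3 implementation.

from math import log2
from collections import Counter

data = [
    ("scaly","bruises","broad","waste","edible"),
    ("smooth","no","narrow","woods","poisonous"),
    ("fibrous","no","broad","grasses","edible"),
    ("scaly","bruises","broad","woods","edible"),
    ("scaly","no","narrow","leaves","poisonous"),
    ("scaly","bruises","broad","paths","edible"),
    ("smooth","no","broad","leaves","edible"),
    ("scaly","no","broad","woods","poisonous"),
    ("scaly","no","narrow","woods","poisonous"),
    ("smooth","no","broad","leaves","edible"),
    ("fibrous","no","broad","paths","poisonous"),
    ("fibrous","bruises","broad","woods","edible"),
    ("smooth","bruises","narrow","grasses","poisonous"),
    ("fibrous","no","broad","paths","poisonous"),
    ("smooth","bruises","narrow","grasses","poisonous"),
    ("scaly","no","narrow","leaves","poisonous"),
    ("scaly","no","narrow","woods","poisonous"),
    ("fibrous","no","broad","grasses","edible"),
    ("scaly","bruises","broad","woods","edible"),
    ("fibrous","no","broad","grasses","edible"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    n = len(rows)
    # partition the class labels by the attribute's value
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attr], []).append(r[-1])
    # weighted entropy of the partitions ("remainder" in the ID3 formula)
    remainder = sum(len(ls) / n * entropy(ls) for ls in by_value.values())
    return entropy([r[-1] for r in rows]) - remainder

# the class is split 10 edible / 10 poisonous, so the initial entropy is exactly 1 bit
print(entropy([r[-1] for r in data]))

for i, name in enumerate(["cap-surface", "bruises?", "gill-size", "habitat"]):
    print(f"gain({name}) = {info_gain(data, i):.3f}")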
fibrous,no,broad,grasses,poisonous    YOUR DECISION TREE PREDICTS: __________
scaly,bruises,broad,grasses,edible    YOUR DECISION TREE PREDICTS: __________
scaly,no,broad,grasses,poisonous      YOUR DECISION TREE PREDICTS: __________
scaly,no,broad,paths,poisonous        YOUR DECISION TREE PREDICTS: __________
smooth,bruises,broad,grasses,edible   YOUR DECISION TREE PREDICTS: __________
smooth,bruises,broad,waste,edible     YOUR DECISION TREE PREDICTS: __________
smooth,no,broad,grasses,edible        YOUR DECISION TREE PREDICTS: __________
smooth,no,broad,leaves,edible         YOUR DECISION TREE PREDICTS: __________
smooth,no,narrow,leaves,poisonous     YOUR DECISION TREE PREDICTS: __________
smooth,no,narrow,paths,poisonous      YOUR DECISION TREE PREDICTS: __________

The accuracy of your decision tree on this test data is: ________________
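Once you have filled in the blanks, the accuracy is simply the fraction of the 10 test instances whose true label matches your tree's prediction. A minimal Python check follows; the predictions list is a placeholder for your own tree's outputs, not an answer key.

# true labels of the 10 test instances, in the order listed above
true_labels = ["poisonous","edible","poisonous","poisonous","edible",
               "edible","edible","edible","poisonous","poisonous"]
predictions = ["?"] * 10   # replace with the labels your decision tree predicts

correct = sum(t == p for t, p in zip(true_labels, predictions))
print(f"accuracy = {correct}/{len(true_labels)} = {correct / len(true_labels):.0%}")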
Each student in the class should complete the following steps on his/her own:
A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values (if any), discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data so as to obtain useful patterns, preprocess the data yourself, for instance by writing the necessary filters (you can incorporate them into Weka if you wish).
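As one illustration (not a prescribed recipe), the following Python/pandas sketch shows typical preprocessing steps performed outside Weka. The file and column names here ("mushroom.csv", "veil-type", "odor", "stalk-length") are hypothetical placeholders, not necessarily attributes of your dataset.

import pandas as pd

df = pd.read_csv("mushroom.csv")   # hypothetical input file

# 1. Remove an apparently irrelevant attribute (e.g., one with a single value).
df = df.drop(columns=["veil-type"])

# 2. Replace missing nominal values (encoded as "?") with the attribute's mode.
df = df.replace("?", pd.NA)
df["odor"] = df["odor"].fillna(df["odor"].mode()[0])

# 3. Discretize a numeric attribute into equal-width bins with nominal labels.
df["stalk-length"] = pd.cut(df["stalk-length"], bins=3,
                            labels=["short", "medium", "long"])

df.to_csv("mushroom-clean.csv", index=False)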
In particular, keep the following points in mind:
To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting decision trees are easy to read.
You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision trees, the better.
Experiment with Weka's J4.8 classifier to see how it performs pre- and/or post-pruning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree. (A simplified sketch of post-pruning follows this list.)
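For intuition about post-pruning, here is a simplified, self-contained Python sketch of reduced-error pruning. Note that this is NOT J4.8's actual procedure (J4.8 uses C4.5-style pessimistic error estimates computed from the training data); it only illustrates the general idea of replacing a subtree with a leaf when doing so does not hurt accuracy on a held-out pruning set.

from collections import Counter

class Node:
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                 # index of the attribute tested at this node
        self.children = children or {}   # attribute value -> subtree
        self.label = label               # class label if this node is a leaf

    @property
    def is_leaf(self):
        return self.label is not None

def classify(node, inst):
    while not node.is_leaf:
        node = node.children[inst[node.attr]]   # assumes all values were seen in training
    return node.label

def errors(node, insts):
    return sum(classify(node, i) != i[-1] for i in insts)

def prune(node, insts):
    if node.is_leaf or not insts:
        return node
    # prune bottom-up: simplify children before deciding about this node
    for val, child in list(node.children.items()):
        node.children[val] = prune(child, [i for i in insts if i[node.attr] == val])
    # candidate leaf predicts the majority class of the pruning subset
    majority = Counter(i[-1] for i in insts).most_common(1)[0][0]
    leaf = Node(label=majority)
    # on a tie, prefer the simpler tree and prune
    return leaf if errors(leaf, insts) <= errors(node, insts) else node

# toy usage: a root testing attribute 0, with two leaf children
root = Node(attr=0, children={"narrow": Node(label="poisonous"),
                              "broad": Node(label="edible")})
pruning_set = [("narrow", "poisonous"), ("broad", "edible"), ("broad", "poisonous")]
pruned = prune(root, pruning_set)
print(pruned.is_leaf, pruned.label)   # the tie in error counts collapses the tree

The bottom-up order matters: children must be simplified first, since pruning a child can change whether the parent is worth keeping as an internal node.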
Submit an individual report of the work you have done on your own as described above. See the details of the submission below. Your report should contain the following sections with the corresponding discussions:
Provide a detailed description of the preprocessing of your data. Justify the preprocessing you applied and explain why the resulting data is appropriate for mining decision trees.
Once you have completed Part II.1 of the project on your own, work with your project partner to analyze the experiments and results that each of you obtained. This joint analysis of the results should include:
Submit a joint report of the work you have done together as described above. Only one of you needs to submit the joint report. See the details of the submission below. Your joint report should contain the following sections with the corresponding discussions:
Given the short time allowed for presentations, you should use at most 4 to 6 slides. Describe your experiments and results using tables. For instance, you could use a table with the pre-processing variants of the dataset as rows and the mining technique and system parameters as columns, with the size and accuracy of the resulting tree in the cells. Any other good way of summarizing your results is fine as well. DURING YOUR PRESENTATION, TRY TO FOCUS ON THE MOST INTERESTING RESULTS YOU OBTAINED AND/OR THE MOST INTERESTING/UNUSUAL IDEAS THAT YOU TRIED.
Please submit the following files using the myWpi digital drop box:
If you are taking this course for graduate credit, state this fact at the beginning of your report. In that case, submit only an individual report and not a joint report, since you are working on the projects by yourself.
TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

ALGORITHMIC DESCRIPTION OF THE CODE (TOTAL: 20 points)
- (04 points) Description of the algorithm underlying the Weka filters used
- (02 points) Description of the algorithm underlying Weka's ZeroR code
- (04 points) Description of the algorithm underlying Weka's ID3 code
- (05 points) Description of the algorithm underlying Weka's J4.8 code

PRE-PROCESSING OF THE DATASET (TOTAL: 40 points: 20 on the individual part and 20 on the joint part)
- (05 points) Translating both input datasets into .arff
- (05 points) Discretizing attributes as needed
- (05 points) Dealing with missing values appropriately
- (05 points) Dealing with attributes appropriately (i.e., using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
- (up to 10 extra credit points) Trying "fancier" things with attributes (e.g., combining two highly correlated attributes into one, using background knowledge, etc.)

EXPERIMENTS (TOTAL: 120 points: 60 for the individual part and 60 for the joint part; 30 points per dataset)
FOR EACH DATASET:
- (02 points) Ran at least a reasonable number of experiments to get familiar with ZeroR
- (TOTAL: 26 points) For each required decision tree method, ID3 and J4.8 (13 points each):
  - (05 points) Ran at least a reasonable number of experiments to get familiar with the decision tree method and different evaluation methods (%split, cross-validation, ...)
  - (03 points) Good description of the motivation and purpose of each experiment, of the experimental setting, and of the results
  - (05 points) Good analysis of the results of the experiments
  - (up to 4 extra credit points) Excellent analysis of the results
- (02 points) Comparison of the results obtained with ZeroR, ID3, and J4.8, and summary of the project

SLIDES (TOTAL: 5 points) How well do they concisely summarize the results of the project?

CLASS PRESENTATION (TOTAL: 15 points) How well your oral presentation concisely summarized the results of the project, and how focused it was on the most creative/interesting/useful of your experiments and results. This grade is given individually to each team member.