This project consists of two parts:
See solutions by Chiying Wang.
Consider the following dataset.
    @relation simple-weather

    @attribute outlook {sunny,overcast,rainy}
    @attribute humidity numeric
    @attribute windy {TRUE,FALSE}
    @attribute play {yes,no}

    @data
    sunny, 80, FALSE, no
    sunny, 90, TRUE, no
    overcast, 80, FALSE, yes
    rainy, 96, FALSE, yes
    rainy, 80, FALSE, yes
    rainy, 72, TRUE, no
    overcast, 72, TRUE, yes
    sunny, 96, FALSE, no
    sunny, 72, FALSE, yes
    rainy, 80, FALSE, yes
    sunny, 72, TRUE, yes
    overcast, 90, TRUE, yes
    overcast, 80, TRUE, yes
    rainy, 96, TRUE, no

where the play attribute is the classification target.
In particular, consider classifying the test instance outlook = ?, humidity = 80, and windy = FALSE.
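As a concrete illustration, the sketch below trains a tree on this dataset and classifies the instance programmatically through the Weka Java API. The file name simple-weather.arff and the choice of J4.8 are assumptions for illustration only; the exercise may instead expect you to trace the prediction by hand.

    import weka.classifiers.trees.J48;
    import weka.core.DenseInstance;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SimpleWeatherDemo {
        public static void main(String[] args) throws Exception {
            // Assumption: the ARFF above is saved as simple-weather.arff.
            Instances data = DataSource.read("simple-weather.arff");
            data.setClassIndex(data.numAttributes() - 1);  // play is the target

            J48 tree = new J48();  // J4.8 with default parameters
            tree.buildClassifier(data);

            // Test instance: outlook = ?, humidity = 80, windy = FALSE.
            Instance test = new DenseInstance(data.numAttributes());
            test.setDataset(data);
            test.setMissing(data.attribute("outlook"));      // outlook is unknown
            test.setValue(data.attribute("humidity"), 80);
            test.setValue(data.attribute("windy"), "FALSE");

            double p = tree.classifyInstance(test);
            System.out.println("predicted play = " + data.classAttribute().value((int) p));
        }
    }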
Begin with the dataset without any additional pre-processing (other than removing fnlwgt). Create a preliminary decision tree model using Weka's implementation of J4.8 with the default parameters. Use salary as the classification target.
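If you prefer to script this step rather than use the Explorer GUI, a minimal sketch follows. The file name adult.arff and the attribute names fnlwgt and salary match the standard UCI Adult dataset; adjust them to your copy.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class AdultBaseline {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed file name
            data.setClassIndex(data.numAttributes() - 1);    // salary is the class

            // Drop fnlwgt before modelling (Remove uses 1-based indices).
            Remove remove = new Remove();
            remove.setAttributeIndices(String.valueOf(data.attribute("fnlwgt").index() + 1));
            remove.setInputFormat(data);
            Instances filtered = Filter.useFilter(data, remove);

            J48 tree = new J48();  // default parameters
            tree.buildClassifier(filtered);
            System.out.println(tree);
        }
    }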
Now apply supervised discretization to all the numeric attributes. Create a new decision tree model using Weka's implementation of J4.8, again with salary as the target attribute. Examine your two models. Compare and contrast them. Use 10-fold cross-validation to perform an analysis of the classification accuracy. Answer the following questions in your description of this experiment:
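One way to script this evaluation, sketched below under the same adult.arff assumption (with fnlwgt already removed), wraps the supervised Discretize filter in a FilteredClassifier so the discretization cut points are re-learned on each training fold and never see the held-out fold. Applying the filter once in the Preprocess panel, as the assignment describes, is the simpler GUI alternative.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.supervised.attribute.Discretize;

    public class DiscretizedJ48 {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            // Supervised (entropy-based) discretization applied inside each fold.
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(new Discretize());
            fc.setClassifier(new J48());

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(fc, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }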
Use preprocessing and postprocessing techniques to generate a J4.8
decision tree that predicts salary as accurately as possible,
but with 40 or fewer leaf nodes. Include an image of the tree in your
report.
In searching for this model, you must experiment with:
Preprocessing: Experiment with and without each of the following:
(1) attribute discretization; (2) replacing missing values;
(3) attribute selection (Correlation-based Feature Selection);
(4) feature reduction: For this, replace the numeric attributes in the dataset with the components resulting from applying PCA to just these numeric attributes. Keep the nominal attributes intact. (A sketch of this split-transform-merge step appears after this list.)
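The PCA step in item (4) is the least direct to set up in Weka, because the PrincipalComponents filter transforms the whole dataset. One way to restrict it to the numeric attributes, sketched below, is to split the data into numeric-only and nominal-only views, apply PCA to the numeric view, and merge the results back; this split-and-merge approach is an assumption about how to satisfy the requirement, not a prescribed method, and the file and class names are assumed as before.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.PrincipalComponents;
    import weka.filters.unsupervised.attribute.RemoveType;

    public class NumericPca {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed file name

            // Split into numeric-only and nominal-only views of the same rows.
            RemoveType dropNominal = new RemoveType();
            dropNominal.setOptions(new String[] {"-T", "nominal"});
            dropNominal.setInputFormat(data);
            Instances numericOnly = Filter.useFilter(data, dropNominal);

            RemoveType dropNumeric = new RemoveType();
            dropNumeric.setOptions(new String[] {"-T", "numeric"});
            dropNumeric.setInputFormat(data);
            Instances nominalOnly = Filter.useFilter(data, dropNumeric);

            // PCA over the numeric attributes only.
            PrincipalComponents pca = new PrincipalComponents();
            pca.setInputFormat(numericOnly);
            Instances components = Filter.useFilter(numericOnly, pca);

            // Recombine: original nominal attributes + principal components.
            Instances merged = Instances.mergeInstances(nominalOnly, components);
            merged.setClassIndex(merged.attribute("salary").index());  // class name assumed
            System.out.println(merged.toSummaryString());
        }
    }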
Parameter Values and Postprocessing:
Vary the values of the J4.8 parameters: binarySplit, confidenceFactor, minNumObj, reducedErrorPruning, subtreeRaising, and unpruned.
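All six parameters are exposed as setters on the J48 class, so a sweep is easy to script. The sketch below shows one arbitrary configuration, not a recommended one, and uses measureNumLeaves() to check the 40-or-fewer-leaves constraint; the file name is assumed as before.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48ParameterSweep {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed, fnlwgt removed
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            // One example configuration; the assignment asks you to vary all six.
            tree.setBinarySplits(true);
            tree.setConfidenceFactor(0.1f);      // stronger pruning than the 0.25 default
            tree.setMinNumObj(50);               // larger leaves -> fewer leaf nodes
            tree.setReducedErrorPruning(false);  // if true, confidenceFactor is ignored
            tree.setSubtreeRaising(true);
            tree.setUnpruned(false);
            tree.buildClassifier(data);

            // Check the 40-or-fewer-leaves constraint.
            System.out.println("leaves: " + tree.measureNumLeaves());
        }
    }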
Examine the model. Compare and contrast this model against a ZeroR model, a OneR model, and models generated in the Easy Level challenge above. Answer the following questions in your description of this experiment:
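For the baseline comparison, the sketch below runs 10-fold cross-validation over ZeroR, OneR, and a default J4.8 on the same data; a full comparison would also evaluate each preprocessed variant from the experiments above. File name assumed as before.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BaselineComparison {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("adult.arff");  // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] models = { new ZeroR(), new OneR(), new J48() };
            for (Classifier model : models) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new Random(1));
                System.out.printf("%-6s accuracy: %.2f%%%n",
                        model.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }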