Part I. INDIVIDUAL HOMEWORK ASSIGNMENT
See solutions by Ken Loomis.
Consider the following dataset.
@relation movie-preferences
@attribute genre {comedy, drama, action}
@attribute critics-reviews {thumbs-up, neutral, thumbs-down}
@attribute rating {R, PG-13}
@attribute IMAX {true, false}
@attribute likes {yes, no}
@data
( 1) comedy, thumbs-up, R, false, no
( 2) comedy, thumbs-up, R, true, no
( 3) comedy, neutral, R, false, no
( 4) comedy, thumbs-down, PG-13, false, yes
( 5) comedy, neutral, PG-13, true, yes
( 6) drama, thumbs-up, R, false, yes
( 7) drama, thumbs-down, PG-13, true, yes
( 8) drama, neutral, R, true, yes
( 9) drama, thumbs-up, PG-13, false, yes
(10) action, neutral, R, false, yes
(11) action, thumbs-down, PG-13, false, yes
(12) action, thumbs-down, PG-13, true, no
(13) action, neutral, PG-13, false, yes
(14) action, neutral, R, true, no
where the likes attribute is the classification target.
- (30 points) Construct the full ID3 decision tree
using entropy to rank
the predictive attributes (genre, critics-reviews, rating, IMAX)
with respect to the target/classification attribute (likes).
Show all the steps of the calculations.
Make sure you compute logarithms in the appropriate base b correctly, as
some calculators do not provide a log_b primitive for every b; recall that
log_b(x) = ln(x) / ln(b).
Also, state explicitly for each node of your tree exactly which instances
belong to it, using the line numbers provided next to the data instances
in the dataset above. (A small sketch for cross-checking the entropy
calculations appears after this list of items.)
- (5 points)
Propose approaches to using your decision tree above to classify instances
that contain missing values. Use the following instance to illustrate your
ideas.
genre = action, critics-reviews = ?, rating = R, IMAX = ?
- Study how J4.8 performs post-pruning by reading in detail:
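For the first item above, it can help to cross-check your hand calculations before writing them up. Below is a minimal sketch in plain Java (no Weka needed) that computes the entropy of the likes attribute and the information gain of each predictive attribute over the full dataset, using the change of base log_2(p) = ln(p) / ln(2). The class name Id3Check and the hard-coded table are only illustrative, and the sketch does not replace showing your calculation steps.

import java.util.*;

public class Id3Check {
    // The 14 instances from the dataset above: genre, critics-reviews, rating, IMAX, likes.
    static final String[][] DATA = {
        {"comedy", "thumbs-up",   "R",     "false", "no"},    // ( 1)
        {"comedy", "thumbs-up",   "R",     "true",  "no"},    // ( 2)
        {"comedy", "neutral",     "R",     "false", "no"},    // ( 3)
        {"comedy", "thumbs-down", "PG-13", "false", "yes"},   // ( 4)
        {"comedy", "neutral",     "PG-13", "true",  "yes"},   // ( 5)
        {"drama",  "thumbs-up",   "R",     "false", "yes"},   // ( 6)
        {"drama",  "thumbs-down", "PG-13", "true",  "yes"},   // ( 7)
        {"drama",  "neutral",     "R",     "true",  "yes"},   // ( 8)
        {"drama",  "thumbs-up",   "PG-13", "false", "yes"},   // ( 9)
        {"action", "neutral",     "R",     "false", "yes"},   // (10)
        {"action", "thumbs-down", "PG-13", "false", "yes"},   // (11)
        {"action", "thumbs-down", "PG-13", "true",  "no"},    // (12)
        {"action", "neutral",     "PG-13", "false", "yes"},   // (13)
        {"action", "neutral",     "R",     "true",  "no"},    // (14)
    };
    static final String[] ATTRS = {"genre", "critics-reviews", "rating", "IMAX"};

    // Entropy (in bits) of the likes labels of the given rows; log_2(p) = ln(p) / ln(2).
    static double entropy(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) counts.merge(r[4], 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain of splitting the given rows on the attribute at index a.
    static double infoGain(List<String[]> rows, int a) {
        Map<String, List<String[]>> parts = new HashMap<>();
        for (String[] r : rows) parts.computeIfAbsent(r[a], k -> new ArrayList<>()).add(r);
        double remainder = 0.0;
        for (List<String[]> part : parts.values())
            remainder += (double) part.size() / rows.size() * entropy(part);
        return entropy(rows) - remainder;
    }

    public static void main(String[] args) {
        List<String[]> all = Arrays.asList(DATA);
        System.out.printf("H(likes) = %.4f bits%n", entropy(all));
        for (int a = 0; a < ATTRS.length; a++)
            System.out.printf("Gain(likes, %s) = %.4f bits%n", ATTRS[a], infoGain(all, a));
    }
}

The same two routines can be rerun on the subset of instances at any node (for example, only the rows with a particular genre value) to rank the remaining attributes at that node.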
Part II. GROUP PROJECT ASSIGNMENT
- Project Instructions:
THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
- Data Mining Technique(s):
We will run experiments using the following techniques:
- Pre-processing Techniques:
- Feature selection, feature creation, dimensionality reduction,
noise reduction, attribute discretization, ...
- Data Mining Techniques:
- Zero-R
- One-R
- Decision Trees: J4.8.
Since J4.8 can handle numeric attributes and missing values directly,
make sure to run some experiments with no pre-processing and
some experiments with pre-processing, and compare your results.
Also experiment with pre- and post-pruning of the J4.8 decision tree
to see whether they increase the classification accuracy.
(A sketch of such a comparison appears after this list of techniques.)
- Advanced Techniques:
- You can consider using advanced techniques to improve the accuracy
of your predictions. For instance, you can try
ensemble methods (see Section 5.6 of your textbook),
ways to deal with imbalanced classification targets
(see Section 5.7 of your textbook), etc.
However, in terms of data mining techniques, this project is restricted to Zero-R,
One-R, and decision trees.
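As a starting point for these experiments, here is a minimal sketch assuming the Weka 3.x Java API: it compares Zero-R, One-R, and J4.8 with and without post-pruning under 10-fold cross-validation. The file name pssa.arff is a placeholder for the project ARFF file, the class index must be changed to that of your target attribute, and the J4.8 parameter values shown are Weka's defaults rather than recommendations. The same comparisons can of course be run from the Weka Explorer instead.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineComparison {
    public static void main(String[] args) throws Exception {
        // Placeholder path; replace with the project dataset.
        Instances data = new DataSource("pssa.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // adjust to the index of the target attribute

        J48 pruned = new J48();                 // J4.8 with its default post-pruning
        pruned.setConfidenceFactor(0.25f);      // lower values prune more aggressively
        pruned.setMinNumObj(2);                 // minimum instances per leaf (a pre-pruning knob)

        J48 unpruned = new J48();
        unpruned.setUnpruned(true);             // same as the -U command-line option

        Classifier[] models = { new ZeroR(), new OneR(), pruned, unpruned };
        String[] names = { "Zero-R", "One-R", "J4.8 (pruned)", "J4.8 (unpruned)" };

        for (int i = 0; i < models.length; i++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(models[i], data, 10, new Random(1));
            System.out.printf("%-16s %6.2f%% correctly classified%n", names[i], eval.pctCorrect());
        }
    }
}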
- Dataset:
We will work with the same dataset used in project 1.
The following 2 files contain the dataset:
- Challenges:
In each of the following challenges provide a detailed description of the
preprocessing techniques used, the motivation for using these techniques,
and any hypothesis/intuition gained about the information represented
in the dataset. Answer the questions provided, and also provide the
information described in the
PROJECT GUIDELINES.
- Easy Level:
This is meant to be a simple guided experiment, so only a brief description
of the preprocessing techniques is needed.
Begin with the Pennsylvania PSSA dataset; remove any textual or identifying attributes. Create a preliminary decision tree model with Weka's implementation of J4.8, using the default parameters. Use AYP2012 as the target attribute.
Now perform missing value replacement for the target attribute. Replace the missing values with a new nominal value called "Missing". Create a new decision tree model using Weka's implementation of J4.8 again with AYP2012 as the target attribute.
Examine your two models. Compare and contrast them. Use 10-fold cross-validation to analyze the classification accuracy. Answer the following questions in your description of this experiment:
- What attribute and values were at the root node of each decision tree? Are they the same?
- Can you justify why the attributes and values might or might not be the same? Describe data and/or visualizations that support your justification.
- How does the overall structure of the two decision trees differ? How are they similar? How do the sizes of the trees compare? Which attributes appear near the root node?
- Which of the two models performed more accurately in correctly classifying the target attribute?
- Could a similar decision tree be used to predict AYP results in 2013? If so, which of the two models would you recommend be used?
- Moderate Level:
This is a bit more of a challenge (be sure to leave yourself time for
the WPI challenge below).
Continue working with the dataset from above in which missing
values in the target attribute have been replaced with "Missing".
Save this dataset so that you can revert to this state if some
preprocessing techniques do not work out as expected.
Provide descriptions of any additional preprocessing that you performed,
and of the parameters used
to develop your model. One should be able to repeat the experiment from
your description.
Use preprocessing and postprocessing techniques to generate a J4.8
decision tree that predicts AYP2012 as accurately as possible,
but with 25 or fewer leaf nodes. Include an image of the tree in your
report.
Examine the model. Compare and contrast this model against a ZeroR model,
a OneR model, and models generated in the Easy Level challenge above.
Answer the following questions in your description of this experiment:
- Is this model a better model than the other models?
If so, why? If not, why not?
- (a.) In preprocessing this dataset, did you alter or remove any attributes
that might otherwise have resulted in higher classification accuracy?
(b.) If you answered no above: are there any attributes that, if
altered or removed, would reduce the classification accuracy?
Justify why one might consider doing this even if you did not.
- What challenge(s) did you encounter while developing this model?
Give a more detailed explanation of how you used preprocessing,
postprocessing, or some other technique to overcome a specific challenge.
- WPI Level:
You should spend the most time on this challenge.
Use preprocessing and postprocessing to generate a decision tree that
classifies the percentage of students with advanced science scores
(PctAdvancedScience) as accurately as possible. For this, you need to find a
good discretization of this new target attribute (a sketch of one possible
starting point appears after the questions below).
In particular, your decision tree should accurately classify the highest
bin of this attribute: the best of the best in science.
Provide descriptions of the preprocessing, postprocessing, and
parameters used to develop your model. One should be able to repeat
the experiment from your description.
Examine the model. Describe the model and its performance. Identify the
attributes in the tree that are used to classify the aforementioned
highest bin. Answer the following questions about this experiment:
- Are there any limitations of the dataset that make this a more
challenging experiment? Explain.
- What challenge(s) did you encounter while developing this
model? Give a more detailed explanation of how you used preprocessing,
postprocessing, or some other technique to overcome a specific challenge.
- Which attributes (and their values) appear to be "good classifiers"
of the highest bin of advanced science scores? Are these the attributes
you expected to find?
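For the discretization step mentioned above, one possible starting point is sketched below, again assuming the Weka Java API. It bins PctAdvancedScience with Weka's unsupervised Discretize filter before the class index is set, so the filter is free to modify that attribute; the file name, the number of bins, and the choice of equal-frequency binning are assumptions to experiment with, not required settings (supervised discretization or hand-picked cut points are equally valid things to try).

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class TargetDiscretization {
    public static void main(String[] args) throws Exception {
        // Placeholder path; replace with the project dataset.
        Instances data = new DataSource("pssa.arff").getDataSet();
        int target = data.attribute("PctAdvancedScience").index();

        Discretize disc = new Discretize();
        disc.setAttributeIndices(String.valueOf(target + 1));  // Weka attribute ranges are 1-based
        disc.setBins(4);                                       // e.g. quartile-like bins; tune this
        disc.setUseEqualFrequency(true);                       // false gives equal-width bins instead
        disc.setInputFormat(data);

        Instances binned = Filter.useFilter(data, disc);
        binned.setClassIndex(target);                  // the binned attribute is now the target
        System.out.println(binned.attribute(target));  // prints the bin boundaries that were chosen
    }
}

Whichever binning you settle on, report the resulting bin boundaries and class distribution, since performance on the highest bin depends heavily on how many instances fall into it.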
Grading sheet for this project