Solutions - Problem I. Decision Trees
Consider the following subset of Weka's contact-lenses dataset.
@relation contact-lenses-weka.filters.AttributeFilter-R2
@attribute age {young,pre-presbyopic,presbyopic}
@attribute astigmatism {no,yes}
@attribute tear-prod-rate {reduced,normal}
@attribute contact-lenses {soft,hard,none}
@data
young, no, normal, soft
young, yes, reduced, none
young, yes, normal, hard
pre-presbyopic, no, reduced, none
pre-presbyopic, no, normal, soft
pre-presbyopic, yes, normal, hard
pre-presbyopic, yes, normal, none
pre-presbyopic, yes, normal, none
presbyopic, no, reduced, none
presbyopic, no, normal, none
presbyopic, yes, reduced, none
presbyopic, yes, normal, hard
- (25 points) Construct the full ID3 decision tree using entropy to rank the
predictive attributes (age, astigmatism, tear-prod-rate) with respect to the
target/classification attribute (contact-lenses). Show all the steps of the
calculations. For your convenience, the base-2 logarithms of selected values are provided.
| x       | 1/2 | 1/3  | 1/4 | 3/4  | 1/5  | 2/5  | 3/5  | 1/6  | 5/6  | 1/7  | 2/7  | 3/7  | 4/7  | 1 |
| log2(x) | -1  | -1.5 | -2  | -0.4 | -2.3 | -1.3 | -0.7 | -2.5 | -0.2 | -2.8 | -1.8 | -1.2 | -0.8 | 0 |
First we have to choose the root of the tree. The root will be the attribute with the lowest weighted entropy,
i.e. the smallest value of sum_v (|S_v|/|S|) * H(S_v), where H(S_v) = -sum_c p_c * log2(p_c) is the entropy of the
class within the subset S_v of examples taking value v; this is equivalent to choosing the attribute with the
highest information gain.
(In each bracketed sum below, the three terms correspond to the CONTACT-LENSES classes SOFT, HARD and NONE, in that order.)
AGE: young (3/12)*[ -(1/3)*log2(1/3) - (1/3)*log2(1/3) - (1/3)*log2(1/3) ] = (3/12)*1.5847
pre-presb (5/12)*[ -(1/5)*log2(1/5) - (1/5)*log2(1/5) - (3/5)*log2(3/5) ] = (5/12)*1.34
presb (4/12)*[ -(0/4)*log2(0/4) - (1/4)*log2(1/4) - (3/4)*log2(3/4) ] = (4/12)*0.8
-------------
= 1.222
ASTIGM: yes (7/12)*[ -(0/7)*log2(0/7) - (3/7)*log2(3/7) - (4/7)*log2(4/7) ] = (7/12)*0.9704
no (5/12)*[ -(2/5)*log2(2/5) - (0/5)*log2(0/5) - (3/5)*log2(3/5) ] = (5/12)*0.94
--------------
= 0.957
TEAR-P: normal (8/12)*[ -(2/8)*log2(2/8) - (3/8)*log2(3/8) - (3/8)*log2(3/8) ] = (8/12)*1.56
reduced (4/12)*[ -(0/4)*log2(0/4) - (0/4)*log2(0/4) - (4/4)*log2(4/4) ] = (4/12)*0
------------
= 1.04
Since the attribute astigmatism has the lowest entropy, the root of the ID3 tree will be astigmatism:
ASTIGMATISM
  = yes: ...
  = no:  ...
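As a cross-check, the same weighted entropies can be computed programmatically. Below is a minimal Python sketch
(the row encoding and the helper names entropy / split_entropy are mine, not part of the assignment or of Weka);
it uses exact logarithms, so the numbers differ slightly from the hand calculations above, which rely on the
rounded log table, but the ranking of the attributes is the same.

    from collections import Counter
    from math import log2

    # The 12 training rows as (age, astigmatism, tear-prod-rate, contact-lenses).
    rows = [
        ("young", "no", "normal", "soft"),
        ("young", "yes", "reduced", "none"),
        ("young", "yes", "normal", "hard"),
        ("pre-presbyopic", "no", "reduced", "none"),
        ("pre-presbyopic", "no", "normal", "soft"),
        ("pre-presbyopic", "yes", "normal", "hard"),
        ("pre-presbyopic", "yes", "normal", "none"),
        ("pre-presbyopic", "yes", "normal", "none"),
        ("presbyopic", "no", "reduced", "none"),
        ("presbyopic", "no", "normal", "none"),
        ("presbyopic", "yes", "reduced", "none"),
        ("presbyopic", "yes", "normal", "hard"),
    ]

    def entropy(labels):
        # H(S) = -sum_c p_c * log2(p_c)
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def split_entropy(rows, attr):
        # Weighted class entropy after splitting on attribute index attr.
        n = len(rows)
        groups = Counter(r[attr] for r in rows)
        return sum(cnt / n * entropy([r[-1] for r in rows if r[attr] == val])
                   for val, cnt in groups.items())

    for name, idx in [("age", 0), ("astigmatism", 1), ("tear-prod-rate", 2)]:
        print(name, round(split_entropy(rows, idx), 3))
    # age 1.238, astigmatism 0.979, tear-prod-rate 1.041 -> astigmatism is lowest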
Now, for astigmatism = yes we measure the entropy of the remaining two attributes, age and tear-prod-rate,
over the 7 training examples in this branch:
AGE: young (2/7)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.286
pre-presb (3/7)*[ 0 - (1/3)*log2(1/3) - (2/3)*log2(2/3) ] = 0.394
presb (2/7)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.286
-------------
= 0.966
TEAR-P: normal (5/7)*[ 0-(3/5)*log2(3/5) - (2/5)*log2(2/5)] = 0.694
reduced (2/7)*[ 0-0 - (1)*log2(1) ] = 0
------------
= 0.694
So the next node on this branch will be tear-prod-rate, and the last will be age.
So far the tree looks like:
ASTIGMATISM
  = yes:
      TEAR-PROD-RATE
        = normal:  ...
        = reduced: ...
  = no:  ...
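The same kind of programmatic check works on this branch alone; a small sketch (the entropy helpers are restated
here so the snippet runs on its own, and the row encoding is again mine):

    from collections import Counter
    from math import log2

    # The 7 training rows with astigmatism = yes, as (age, tear-prod-rate, class).
    yes_rows = [
        ("young", "reduced", "none"),
        ("young", "normal", "hard"),
        ("pre-presbyopic", "normal", "hard"),
        ("pre-presbyopic", "normal", "none"),
        ("pre-presbyopic", "normal", "none"),
        ("presbyopic", "reduced", "none"),
        ("presbyopic", "normal", "hard"),
    ]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def split_entropy(rows, attr):
        n = len(rows)
        return sum(cnt / n * entropy([r[-1] for r in rows if r[attr] == v])
                   for v, cnt in Counter(r[attr] for r in rows).items())

    print(round(split_entropy(yes_rows, 0), 3))  # age            -> 0.965
    print(round(split_entropy(yes_rows, 1), 3))  # tear-prod-rate -> 0.694

Running the same two calls on the 5 rows with astigmatism = no gives about 0.4 for age and 0.551 for
tear-prod-rate, consistent with the hand calculation for that branch further below.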
Following ID3, for the branch tear-prod-rate = reduced all examples have class none, so this branch becomes a
leaf none. The branch tear-prod-rate = normal does not give a single class, so we add the last remaining
attribute, age, to this branch. For age = young and age = presbyopic all remaining examples are hard; for
age = pre-presbyopic the examples are still mixed (1 hard, 2 none) and no attributes are left, so that leaf is
labelled with the majority class, none.
ASTIGMATISM
  = yes:
      TEAR-PROD-RATE
        = normal:
            AGE
              = young:          hard
              = pre-presbyopic: none
              = presbyopic:     hard
        = reduced: none
  = no:  ...
Now, for astigmatism = no we measure the entropy of the remaining two attributes, age and tear-prod-rate,
over the 5 training examples in this branch:
AGE: young (1/5)*[ -(1/1)*log2(1/1) - 0 - 0 ] = 0
pre-presb (2/5)*[ -(1/2)*log2(1/2) - 0 - (1/2)*log2(1/2) ] = 0.4
presb (2/5)*[0 - 0 - (2/2)*log2(2/2)] = 0
-------------
= 0.4
TEAR-P: normal (3/5)*[ -(2/3)*log2(2/3) - 0 - (1/3)*log2(1/3) ] = 0.55
reduced (2/5)*[ 0-0 - (2/2)*log2(2/2) ] = 0
------------
= 0.55
Since the attribute age has the smaller entropy (i.e. the higher information gain), we add this attribute to
the tree:
ASTIGMATISM
  = yes:
      TEAR-PROD-RATE
        = normal:
            AGE
              = young:          hard
              = pre-presbyopic: none
              = presbyopic:     hard
        = reduced: none
  = no:
      AGE
        = young:          ...
        = pre-presbyopic: ...
        = presbyopic:     ...
For the branch (astigmatism = no, age = young) all examples have class soft, so this branch gets a leaf soft.
For the branch (astigmatism = no, age = presbyopic) all examples have class none, so this branch gets a leaf none.
The branch (astigmatism = no, age = pre-presbyopic) does not give a single class, so we have to add the last
remaining attribute, tear-prod-rate, to this branch. This gives the final tree:
ASTIGMATISM
  = yes:
      TEAR-PROD-RATE
        = normal:
            AGE
              = young:          hard
              = pre-presbyopic: none
              = presbyopic:     hard
        = reduced: none
  = no:
      AGE
        = young:          soft
        = pre-presbyopic:
            TEAR-PROD-RATE
              = reduced: none
              = normal:  soft
        = presbyopic:     none
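For completeness, a compact recursive ID3 sketch is shown below (the encoding and the id3/show helper names are
mine, not Weka's). Because no two attributes are tied on entropy at any node of this dataset, it reproduces
exactly the tree constructed by hand above, including the majority-class leaf none for the exhausted
pre-presbyopic branch.

    from collections import Counter
    from math import log2

    ATTRS = ["age", "astigmatism", "tear-prod-rate"]
    ROWS = [  # ((age, astigmatism, tear-prod-rate), class) -- the 12 training rows
        (("young", "no", "normal"), "soft"),
        (("young", "yes", "reduced"), "none"),
        (("young", "yes", "normal"), "hard"),
        (("pre-presbyopic", "no", "reduced"), "none"),
        (("pre-presbyopic", "no", "normal"), "soft"),
        (("pre-presbyopic", "yes", "normal"), "hard"),
        (("pre-presbyopic", "yes", "normal"), "none"),
        (("pre-presbyopic", "yes", "normal"), "none"),
        (("presbyopic", "no", "reduced"), "none"),
        (("presbyopic", "no", "normal"), "none"),
        (("presbyopic", "yes", "reduced"), "none"),
        (("presbyopic", "yes", "normal"), "hard"),
    ]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def split_entropy(rows, i):
        # Weighted class entropy after splitting on attribute index i.
        n = len(rows)
        groups = Counter(x[i] for x, _ in rows)
        return sum(cnt / n * entropy([y for x, y in rows if x[i] == v])
                   for v, cnt in groups.items())

    def id3(rows, attrs):
        labels = [y for _, y in rows]
        if len(set(labels)) == 1:                 # pure node -> leaf
            return labels[0]
        if not attrs:                             # attributes exhausted -> majority class
            return Counter(labels).most_common(1)[0][0]
        best = min(attrs, key=lambda i: split_entropy(rows, i))
        rest = [i for i in attrs if i != best]
        children = {v: id3([(x, y) for x, y in rows if x[best] == v], rest)
                    for v in sorted({x[best] for x, _ in rows})}
        return (ATTRS[best], children)

    def show(node, indent=""):
        if isinstance(node, str):                 # leaf
            print(indent + "-> " + node)
        else:
            attr, children = node
            for value, child in children.items():
                print(f"{indent}{attr} = {value}:")
                show(child, indent + "    ")

    show(id3(ROWS, [0, 1, 2]))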
- (10 points) Compute the accuracy of the decision tree you constructed on
the following test data instances:
young, no, reduced, none             our decision tree predicts: soft (wrong: the test instance's class is none)
pre-presbyopic, yes, reduced, none   our decision tree predicts: none (correct: the test instance's class is none)
presbyopic, no, normal, soft         our decision tree predicts: none (wrong: the test instance's class is soft)
presbyopic, yes, normal, none        our decision tree predicts: hard (wrong: the test instance's class is none)
So the accuracy of this decision tree on this test dataset is 1/4 = 25%.
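The same count can be reproduced mechanically. In the sketch below the hand-built tree is written as nested
dicts (attribute -> value -> subtree); this encoding and the classify helper are mine, for illustration only:

    TREE = {"astigmatism": {
        "yes": {"tear-prod-rate": {
            "reduced": "none",
            "normal": {"age": {"young": "hard",
                               "pre-presbyopic": "none",
                               "presbyopic": "hard"}}}},
        "no": {"age": {"young": "soft",
                       "presbyopic": "none",
                       "pre-presbyopic": {"tear-prod-rate": {"reduced": "none",
                                                             "normal": "soft"}}}},
    }}

    TEST = [  # (features, true class) -- the four test instances
        ({"age": "young", "astigmatism": "no", "tear-prod-rate": "reduced"}, "none"),
        ({"age": "pre-presbyopic", "astigmatism": "yes", "tear-prod-rate": "reduced"}, "none"),
        ({"age": "presbyopic", "astigmatism": "no", "tear-prod-rate": "normal"}, "soft"),
        ({"age": "presbyopic", "astigmatism": "yes", "tear-prod-rate": "normal"}, "none"),
    ]

    def classify(tree, x):
        # Walk down the nested dicts until a leaf (a class string) is reached.
        while isinstance(tree, dict):
            attr, branches = next(iter(tree.items()))
            tree = branches[x[attr]]
        return tree

    correct = sum(classify(TREE, x) == y for x, y in TEST)
    print(f"accuracy = {correct}/{len(TEST)} = {correct / len(TEST):.0%}")  # accuracy = 1/4 = 25%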
- (10 points) Consider the subtree replacement pruning technique to make
your decision tree more accurate over this test data.
- (2 points) What node in your decision tree above would be considered
first for subtree replacement pruning? Why?
Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node,
and assigning it the most common classification of the training examples affiliated with that node.
Nodes are removed only if the resulting pruned tree performs no worse than the original over the
test set. Candidates are considered bottom-up, i.e. internal nodes all of whose children are leaves,
so the first candidate for pruning is the age node on the astigmatism = yes branch (under tear-prod-rate = normal).
- (8 points) Show in detail how this pruning technique works on that one
node in your previous answer. Will the node be pruned or not by the
technique? Explain your answer.
Applying this pruning technique to the first candidate replaces the age subtree with a leaf labelled with the
majority class of the training examples at that node (3 hard vs. 2 none, i.e. hard), which gives the following tree:
ASTIGMATISM
  = yes:
      TEAR-PROD-RATE
        = normal:  hard
        = reduced: none
  = no:
      AGE
        = young:          soft
        = pre-presbyopic:
            TEAR-PROD-RATE
              = reduced: none
              = normal:  soft
        = presbyopic:     none
The node will be pruned because the obtained tree performs no worse than the original tree on the
test data: its predictions on the four test instances are unchanged, so the accuracy of the pruned
decision tree is 25% as well.
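The same evaluation confirms this. Below is a sketch of the subtree-replacement check; the pruned-tree encoding,
the classify helper and the test rows are restated so the snippet runs on its own, and the replacement leaf hard
is the majority class (3 hard vs. 2 none) of the training examples that reach the pruned node:

    PRUNED = {"astigmatism": {
        "yes": {"tear-prod-rate": {"reduced": "none", "normal": "hard"}},
        "no": {"age": {"young": "soft",
                       "presbyopic": "none",
                       "pre-presbyopic": {"tear-prod-rate": {"reduced": "none",
                                                             "normal": "soft"}}}},
    }}

    TEST = [  # (features, true class) -- the four test instances
        ({"age": "young", "astigmatism": "no", "tear-prod-rate": "reduced"}, "none"),
        ({"age": "pre-presbyopic", "astigmatism": "yes", "tear-prod-rate": "reduced"}, "none"),
        ({"age": "presbyopic", "astigmatism": "no", "tear-prod-rate": "normal"}, "soft"),
        ({"age": "presbyopic", "astigmatism": "yes", "tear-prod-rate": "normal"}, "none"),
    ]

    def classify(tree, x):
        # Walk down the nested dicts until a leaf (a class string) is reached.
        while isinstance(tree, dict):
            attr, branches = next(iter(tree.items()))
            tree = branches[x[attr]]
        return tree

    correct = sum(classify(PRUNED, x) == y for x, y in TEST)
    print(f"pruned-tree accuracy = {correct / len(TEST):.0%}")  # 25%, the same as the unpruned tree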