### Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute

#### Solutions - Problem I. Decision Trees

Consider the following subset of Weka's contact lenses dataset.
```
@relation contact-lenses-weka.filters.AttributeFilter-R2

@attribute age {young,pre-presbyopic,presbyopic}
@attribute astigmatism {no,yes}
@attribute tear-prod-rate {reduced,normal}
@attribute contact-lenses {soft,hard,none}

@data

young,          no,  normal,  soft
young,          yes, reduced, none
young,          yes, normal,  hard
pre-presbyopic, no,  reduced, none
pre-presbyopic, no,  normal,  soft
pre-presbyopic, yes, normal,  hard
pre-presbyopic, yes, normal,  none
pre-presbyopic, yes, normal,  none
presbyopic,     no,  reduced, none
presbyopic,     no,  normal,  none
presbyopic,     yes, reduced, none
presbyopic,     yes, normal,  hard
```

1. (25 points) Construct the full ID3 decision tree using entropy to rank the predictive attributes (age, astigmatism, tear-prod-rate) with respect to the target/classification attribute (contact-lenses). Show all the steps of the calculations. For your convenience, the logarithms base 2 of selected values are provided below.

| x       | 1/2 | 1/3  | 1/4 | 3/4  | 1/5  | 2/5  | 3/5  | 1/6  | 5/6  | 1/7  | 2/7  | 3/7  | 4/7  | 1 |
|---------|-----|------|-----|------|------|------|------|------|------|------|------|------|------|---|
| log2(x) | -1  | -1.5 | -2  | -0.4 | -2.3 | -1.3 | -0.7 | -2.5 | -0.2 | -2.8 | -1.8 | -1.2 | -0.8 | 0 |
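As a cross-check on the hand calculation that follows, here is a minimal Python sketch (not part of the original solution; the dataset is hard-coded from the listing above) of the weighted average entropy that ID3 uses to rank attributes:

```
from collections import Counter
from math import log2

# Training data from the listing above:
# (age, astigmatism, tear-prod-rate, contact-lenses)
DATA = [
    ("young",          "no",  "normal",  "soft"),
    ("young",          "yes", "reduced", "none"),
    ("young",          "yes", "normal",  "hard"),
    ("pre-presbyopic", "no",  "reduced", "none"),
    ("pre-presbyopic", "no",  "normal",  "soft"),
    ("pre-presbyopic", "yes", "normal",  "hard"),
    ("pre-presbyopic", "yes", "normal",  "none"),
    ("pre-presbyopic", "yes", "normal",  "none"),
    ("presbyopic",     "no",  "reduced", "none"),
    ("presbyopic",     "no",  "normal",  "none"),
    ("presbyopic",     "yes", "reduced", "none"),
    ("presbyopic",     "yes", "normal",  "hard"),
]
ATTRS = {"age": 0, "astigmatism": 1, "tear-prod-rate": 2}

def entropy(rows):
    """Entropy of the class distribution (last column) of a list of rows."""
    counts = Counter(row[-1] for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def weighted_entropy(rows, attr):
    """Weighted average entropy of the subsets produced by splitting on attr."""
    i = ATTRS[attr]
    total = 0.0
    for value in {row[i] for row in rows}:
        subset = [row for row in rows if row[i] == value]
        total += len(subset) / len(rows) * entropy(subset)
    return total

for attr in ATTRS:
    print(f"{attr}: {weighted_entropy(DATA, attr):.3f}")
# Prints exact values; the hand calculation below uses the rounded logs from
# the table, so its numbers differ slightly, but the ranking agrees:
#   age: 1.238   astigmatism: 0.979   tear-prod-rate: 1.041
```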
```
First we have to choose the root of the tree. The root will be the attribute whose split yields the lowest weighted average entropy (equivalently, the highest information gain).

CONTACT LENSES                        SOFT               HARD               NONE

AGE:    young      (3/12)*[ -(1/3)*log2(1/3) - (1/3)*log2(1/3) - (1/3)*log2(1/3) ] = (3/12)*1.5847
        pre-presb  (5/12)*[ -(1/5)*log2(1/5) - (1/5)*log2(1/5) - (3/5)*log2(3/5) ] = (5/12)*1.34
        presb      (4/12)*[ -(0/4)*log2(0/4) - (1/4)*log2(1/4) - (3/4)*log2(3/4) ] = (4/12)*0.8
                                                                                    --------------
                                                                                        = 1.222

ASTIGM: yes        (7/12)*[ -(0/7)*log2(0/7) - (3/7)*log2(3/7) - (4/7)*log2(4/7) ] = (7/12)*0.9704
        no         (5/12)*[ -(2/5)*log2(2/5) - (0/5)*log2(0/5) - (3/5)*log2(3/5) ] = (5/12)*0.94
                                                                                    --------------
                                                                                        = 0.957

TEAR-P: normal     (8/12)*[ -(2/8)*log2(2/8) - (3/8)*log2(3/8) - (3/8)*log2(3/8) ] = (8/12)*1.56
        reduced    (4/12)*[ -(0/4)*log2(0/4) - (0/4)*log2(0/4) - (4/4)*log2(4/4) ] = (4/12)*0
                                                                                    --------------
                                                                                        = 1.04

Since the attribute astigmatism has the lowest weighted entropy, the root of the ID3 tree will be astigmatism:

          ASTIGMATISM
          /         \
    yes  /           \  no
        /             \

Now, for astigmatism = yes, we measure the entropy of the remaining two attributes: age and tear-prod-rate.

AGE:    young      (2/7)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.286
        pre-presb  (3/7)*[ 0 - (1/3)*log2(1/3) - (2/3)*log2(2/3) ] = 0.394
        presb      (2/7)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.286
                                                                   --------
                                                                   = 0.966

TEAR-P: normal     (5/7)*[ 0 - (3/5)*log2(3/5) - (2/5)*log2(2/5) ] = 0.694
        reduced    (2/7)*[ 0 - 0 - (1)*log2(1) ]                   = 0
                                                                   --------
                                                                   = 0.694

Since tear-prod-rate has the lower entropy, the next node on the astigmatism = yes branch will be tear-prod-rate, with age last.
So far the tree looks like:

             ASTIGMATISM
             /         \
       yes  /           \  no
           /             \
       TP-RATE
        /    \
normal /      \ reduced
      /        \

According to ID3, for the branch tear-prod-rate = reduced all examples have class none, so the tree gets a leaf none there. The branch tear-prod-rate = normal doesn't yield a single class, so we add the last remaining attribute, age, to it. (On its age = pre-presbyopic sub-branch the examples are still mixed, 1 hard and 2 none, but no attributes remain, so that leaf takes the majority class, none.)

             ASTIGMATISM
             /         \
       yes  /           \  no
           /             \
       TP-RATE
        /    \
normal /      \ reduced
      /        \
    AGE        none
   / | \
young/PP|  \ P
    /   |   \
 hard  none  hard

Now, for astigmatism = no, we measure the entropy of the remaining two attributes: age and tear-prod-rate.

AGE:    young      (1/5)*[ 0 - 0 - (1)*log2(1) ]                   = 0
        pre-presb  (2/5)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.4
        presb      (2/5)*[ 0 - 0 - (2/2)*log2(2/2) ]               = 0
                                                                   --------
                                                                   = 0.4

TEAR-P: normal     (3/5)*[ 0 - (2/3)*log2(2/3) - (1/3)*log2(1/3) ] = 0.55
        reduced    (2/5)*[ 0 - 0 - (2/2)*log2(2/2) ]               = 0
                                                                   --------
                                                                   = 0.55

Since age has the smaller entropy, we add this attribute to the tree:

              ASTIGMATISM
             /           \
       yes  /             \  no
           /               \
       TP-RATE             AGE
        /    \            / | \
normal /      \ reduced  /  |   \
      /        \   young/  P|    \ PP
    AGE        none    /    |      \
   / | \
young/PP|  \ P
    /   |   \
 hard  none  hard

Now, for the branch (astigmatism = no, age = young), all examples have class soft, so this branch gets a leaf soft.
For the branch (astigmatism = no, age = presbyopic), all examples have class none, so this branch gets a leaf none.

The branch (astigmatism = no, age = pre-presbyopic) doesn't yield a single class, so we add the last attribute, tear-prod-rate, to this branch, obtaining the final tree:

              ASTIGMATISM
             /           \
       yes  /             \  no
           /               \
       TP-RATE             AGE
        /    \            / | \
normal /      \ reduced  /  |   \
      /        \   young/  P|    \ PP
    AGE        none    /    |      \
   / | \            soft  none   TP-RATE
young/PP|  \ P                    /    \
    /   |   \            reduced /      \ normal
 hard  none  hard               /        \
                             none       soft

```
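Continuing the sketch above (still hypothetical; DATA, ATTRS, and weighted_entropy are the definitions from the earlier block), a compact recursive ID3 reproduces the tree derived by hand:

```
from collections import Counter

def id3(rows, attrs):
    """Return a leaf label, or a nested dict {attr: {value: subtree}}."""
    classes = [row[-1] for row in rows]
    if len(set(classes)) == 1:            # pure node: make a leaf
        return classes[0]
    if not attrs:                         # no attributes left: majority class
        return Counter(classes).most_common(1)[0][0]
    best = min(attrs, key=lambda a: weighted_entropy(rows, a))
    i = ATTRS[best]
    return {best: {value: id3([r for r in rows if r[i] == value],
                              [a for a in attrs if a != best])
                   for value in sorted({r[i] for r in rows})}}

TREE = id3(DATA, list(ATTRS))
# TREE mirrors the drawing above; e.g. TREE["astigmatism"]["yes"] is
# {'tear-prod-rate': {'normal': {'age': {'pre-presbyopic': 'none',
#                                        'presbyopic': 'hard',
#                                        'young': 'hard'}},
#                     'reduced': 'none'}}
```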

2. (10 points) Compute the accuracy of the decision tree you constructed on the following test data instances:
```
young,          no,  reduced, none     our decision tree predicts: soft  (wrong: test instance's class is none)
pre-presbyopic, yes, reduced, none     our decision tree predicts: none  (correct: test instance's class is none)
presbyopic,     no,  normal,  soft     our decision tree predicts: none  (wrong: test instance's class is soft)
presbyopic,     yes, normal,  none     our decision tree predicts: hard  (wrong: test instance's class is none)

So the accuracy of this decision tree on this test dataset is: 1/4 = 25%

```
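The 25% figure can be confirmed with a small classifier over the nested-dict TREE from the previous sketch (the helper below is illustrative, not part of the original solution):

```
def classify(tree, instance):
    """Walk the nested-dict tree; instance maps attribute name -> value."""
    while isinstance(tree, dict):
        attr = next(iter(tree))           # the attribute tested at this node
        tree = tree[attr][instance[attr]]
    return tree                           # a leaf: the predicted class

TEST = [  # (instance, true class), from the test set above
    ({"age": "young",          "astigmatism": "no",  "tear-prod-rate": "reduced"}, "none"),
    ({"age": "pre-presbyopic", "astigmatism": "yes", "tear-prod-rate": "reduced"}, "none"),
    ({"age": "presbyopic",     "astigmatism": "no",  "tear-prod-rate": "normal"},  "soft"),
    ({"age": "presbyopic",     "astigmatism": "yes", "tear-prod-rate": "normal"},  "none"),
]

correct = sum(classify(TREE, x) == y for x, y in TEST)
print(f"accuracy = {correct}/{len(TEST)}")    # accuracy = 1/4, i.e. 25%
```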

3. (10 points) Consider the subtree replacement pruning technique to make your decision tree more accurate over this test data.
• (2 points) What node in your decision tree above would be considered first for subtree replacement pruning? Why?
```
Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node,
and assigning it the most common classification of the training examples affiliated with that node.
Nodes are removed only if the resulting pruned tree performs no worse than the original over the
testing set. Candidates are considered bottom-up, starting from decision nodes whose children are
all leaves; here, the first candidate is the "age" node on the
(astigmatism = yes, tear-prod-rate = normal) branch.

```

• (8 points) Show in detail how this pruning technique works on that one node in your previous answer. Will the node be pruned or not by the technique? Explain your answer.
```
Applying this pruning technique to the first candidate replaces the "age" subtree with a leaf labeled with the majority class of the 5 training examples at that node (3 hard, 2 none), i.e. hard, giving the following tree:

              ASTIGMATISM
             /           \
       yes  /             \  no
           /               \
       TP-RATE             AGE
        /    \            / | \
normal /      \ reduced  /  |   \
      /        \   young/  P|    \ PP
   hard       none     /    |      \
                    soft  none   TP-RATE
                                  /    \
                         reduced /      \ normal
                                /        \
                             none       soft

The node will be pruned because the resulting tree performs no worse than the original tree on the
testing data: the pruned tree still classifies only the second test instance correctly, so its
accuracy is 25% as well.

```
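The same comparison can be scripted (again reusing TREE, TEST, and classify from the earlier sketches; the hard-coded path below is specific to this tree):

```
import copy

pruned = copy.deepcopy(TREE)
# Replace the age subtree under (astigmatism=yes, tear-prod-rate=normal) with
# the majority class of the 5 training examples there (3 hard, 2 none): hard.
pruned["astigmatism"]["yes"]["tear-prod-rate"]["normal"] = "hard"

def accuracy(tree, test):
    return sum(classify(tree, x) == y for x, y in test) / len(test)

print(accuracy(TREE, TEST), accuracy(pruned, TEST))   # 0.25 0.25
# Equal accuracy: the pruned tree performs no worse, so the node is pruned.
```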