## CS444X Data Mining and Knowledge Discovery in Databases (D-2003). Solutions to Exam 1 - April 4, 2003

### Solutions by Prof. Carolina Ruiz, Alek Icev, and Senthil Palanisamy. Department of Computer Science, Worcester Polytechnic Institute

#### Instructions

• Ask in case of doubt

#### Problem I. Decision Trees (45 points)

Consider the following subset of Weka's contact-lenses dataset.
```@relation contact-lenses-weka.filters.AttributeFilter-R2

@attribute age {young,pre-presbyopic,presbyopic}
@attribute astigmatism {no,yes}
@attribute tear-prod-rate {reduced,normal}
@attribute contact-lenses {soft,hard,none}

@data

young,          no,  normal,  soft
young,          yes, reduced, none
young,          yes, normal,  hard
pre-presbyopic, no,  reduced, none
pre-presbyopic, no,  normal,  soft
pre-presbyopic, yes, normal,  hard
pre-presbyopic, yes, normal,  none
pre-presbyopic, yes, normal,  none
presbyopic,     no,  reduced, none
presbyopic,     no,  normal,  none
presbyopic,     yes, reduced, none
presbyopic,     yes, normal,  hard
```

1. (25 points) Construct the full ID3 decision tree using entropy to rank the predictive attributes (age, astigmatism, tear-prod-rate) with respect to the target/classification attribute (contact-lenses). Show all the steps of the calculations. For your convenience, the base-2 logarithms of selected values are provided below.

| x       | 1/2 | 1/3  | 1/4 | 3/4  | 1/5  | 2/5  | 3/5  | 1/6  | 5/6  | 1/7  | 2/7  | 3/7  | 4/7  | 1 |
|---------|-----|------|-----|------|------|------|------|------|------|------|------|------|------|---|
| log2(x) | -1  | -1.5 | -2  | -0.4 | -2.3 | -1.3 | -0.7 | -2.5 | -0.2 | -2.8 | -1.8 | -1.2 | -0.8 | 0 |
```
First we choose the root of the tree: the attribute whose split yields the lowest weighted average entropy (equivalently, the highest information gain).

The class proportions below are listed in the order SOFT, HARD, NONE.
(Terms of the form (0/n)*log2(0/n) are taken to be 0.)

AGE:    young      (3/12)*[ -(1/3)*log2(1/3) - (1/3)*log2(1/3) - (1/3)*log2(1/3) ] = (3/12)*1.5847
        pre-presb  (5/12)*[ -(1/5)*log2(1/5) - (1/5)*log2(1/5) - (3/5)*log2(3/5) ] = (5/12)*1.34
        presb      (4/12)*[ -(0/4)*log2(0/4) - (1/4)*log2(1/4) - (3/4)*log2(3/4) ] = (4/12)*0.8
                                                                              -------------
                                                                              total = 1.222

ASTIGM: yes        (7/12)*[ -(0/7)*log2(0/7) - (3/7)*log2(3/7) - (4/7)*log2(4/7) ] = (7/12)*0.9704
        no         (5/12)*[ -(2/5)*log2(2/5) - (0/5)*log2(0/5) - (3/5)*log2(3/5) ] = (5/12)*0.94
                                                                              -------------
                                                                              total = 0.957

TEAR-P: normal     (8/12)*[ -(2/8)*log2(2/8) - (3/8)*log2(3/8) - (3/8)*log2(3/8) ] = (8/12)*1.56
        reduced    (4/12)*[ -(0/4)*log2(0/4) - (0/4)*log2(0/4) - (4/4)*log2(4/4) ] = (4/12)*0
                                                                              -------------
                                                                              total = 1.04

Since the attribute astigmatism has the lowest entropy, the root of the ID3 tree will be astigmatism:

       ASTIGMATISM
        /       \
   yes /         \ no
      /           \

Now for astigmatism=yes we measure the entropy of the two remaining attributes: age and tear-prod-rate.

AGE:    young      (2/7)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.286
        pre-presb  (3/7)*[ 0 - (1/3)*log2(1/3) - (2/3)*log2(2/3) ] = 0.394
        presb      (2/7)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.286
                                                             -------------
                                                             total = 0.966

TEAR-P: normal     (5/7)*[ 0 - (3/5)*log2(3/5) - (2/5)*log2(2/5) ] = 0.694
        reduced    (2/7)*[ 0 - 0 - (1)*log2(1) ]                   = 0
                                                             -------------
                                                             total = 0.694

So the next node on this branch will be tear-prod-rate, with age last.
So far the tree looks like:

       ASTIGMATISM
        /       \
   yes /         \ no
      /           \
  TP-RATE
   /     \
  /normal \reduced
 /         \

Following ID3, all examples on the reduced branch have class none, so that
branch gets leaf none. The normal branch does not give a single class, so we
add the last remaining attribute, age, to this branch:

       ASTIGMATISM
        /       \
   yes /         \ no
      /           \
  TP-RATE
   /     \
  /normal \reduced
 /         \
     AGE        none
    / | \
   /  |  \
young PP  P
  |   |   |
hard none hard

(The pre-presbyopic branch holds 1 hard and 2 none examples and no attributes
remain, so it becomes a leaf labeled with the majority class, none.)

Now for astigmatism=no we measure the entropy of the two remaining attributes: age and tear-prod-rate.

AGE:    young      (1/5)*[ 0 - 0 - (1)*log2(1) ]                   = 0
        pre-presb  (2/5)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.4
        presb      (2/5)*[ 0 - 0 - (2/2)*log2(2/2) ]               = 0
                                                             -------------
                                                             total = 0.4

TEAR-P: normal     (3/5)*[ 0 - (2/3)*log2(2/3) - (1/3)*log2(1/3) ] = 0.55
        reduced    (2/5)*[ 0 - 0 - (2/2)*log2(2/2) ]               = 0
                                                             -------------
                                                             total = 0.55

Since age has the smaller entropy, we add this attribute to the tree:

         ASTIGMATISM
          /       \
     yes /         \ no
        /           \
    TP-RATE         AGE
     /     \       / | \
    /normal \red. /  |  \
   /         \ young P   PP
  AGE       none
 / | \
/  |  \
young PP  P
 |    |   |
hard none hard

For the branch (astigmatism=no, age=young) all examples have class soft, so
this branch gets leaf soft.
For the branch (astigmatism=no, age=P) all examples have class none, so this
branch gets leaf none.

The branch (astigmatism=no, age=PP) does not give a single class, so we add
the last remaining attribute, tear-prod-rate, to this branch. This yields the
final tree:

         ASTIGMATISM
          /       \
     yes /         \ no
        /           \
    TP-RATE         AGE
     /     \       / | \
    /normal \red. /  |  \
   /         \ young P   PP
  AGE       none  |  |    \
 / | \          soft none TP-RATE
/  |  \                    /    \
young PP  P        reduced/      \normal
 |    |   |              /        \
hard none hard         none       soft

```
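The arithmetic above is easy to double-check mechanically. Below is a minimal Python sketch (an illustration, not part of the exam solution) that recomputes the weighted entropy of each candidate split on this dataset. Attribute and class names follow the ARFF file; exact logarithms are used, so the totals differ slightly from the hand calculation done with the rounded table values.

```
from math import log2
from collections import Counter

# The 12 training instances, in ARFF column order:
# (age, astigmatism, tear-prod-rate, contact-lenses)
data = [
    ("young", "no", "normal", "soft"),
    ("young", "yes", "reduced", "none"),
    ("young", "yes", "normal", "hard"),
    ("pre-presbyopic", "no", "reduced", "none"),
    ("pre-presbyopic", "no", "normal", "soft"),
    ("pre-presbyopic", "yes", "normal", "hard"),
    ("pre-presbyopic", "yes", "normal", "none"),
    ("pre-presbyopic", "yes", "normal", "none"),
    ("presbyopic", "no", "reduced", "none"),
    ("presbyopic", "no", "normal", "none"),
    ("presbyopic", "yes", "reduced", "none"),
    ("presbyopic", "yes", "normal", "hard"),
]
attrs = {"age": 0, "astigmatism": 1, "tear-prod-rate": 2}  # column indices

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_entropy(rows, col):
    """Weighted average class entropy after splitting rows on column `col`."""
    n = len(rows)
    total = 0.0
    for v in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == v]
        total += (len(subset) / n) * entropy([r[-1] for r in subset])
    return total

for name, col in attrs.items():
    print(name, round(weighted_entropy(data, col), 3))
# -> age 1.238, astigmatism 0.979, tear-prod-rate 1.041
# astigmatism is lowest, so it becomes the root (the hand calculation gets
# 1.222 / 0.957 / 1.04 because it uses the rounded logs from the table).
```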

2. (10 points) Compute the accuracy of the decision tree you constructed on the following test data instances:
```young,          no,  reduced, none     our decision tree predicts: soft  (wrong: test instance's class is none)
pre-presbyopic, yes, reduced, none     our decision tree predicts: none  (correct: test instance's class is none)
presbyopic,     no,  normal,  soft     our decision tree predicts: none  (wrong: test instance's class is soft)
presbyopic,     yes, normal,  none     our decision tree predicts: hard  (wrong: test instance's class is none)

So the accuracy of this decision tree on this test dataset is: 1/4 = 25%

```
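For reference, here is a small Python sketch (again an illustration, not part of the exam solution) that encodes the final tree above as nested dictionaries and reproduces the 25% test accuracy. The encoding (an inner node maps an attribute name to a dictionary of branches; a leaf is a plain class string) is just one convenient choice.

```
# The final ID3 tree from part 1, as nested dicts.
tree = {"astigmatism": {
    "yes": {"tear-prod-rate": {
        "reduced": "none",
        "normal": {"age": {"young": "hard",
                           "pre-presbyopic": "none",
                           "presbyopic": "hard"}}}},
    "no": {"age": {
        "young": "soft",
        "presbyopic": "none",
        "pre-presbyopic": {"tear-prod-rate": {"reduced": "none",
                                              "normal": "soft"}}}},
}}

def predict(tree, instance):
    """Follow the branches matching the instance until a leaf is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[instance[attr]]
    return tree

test = [  # (instance, actual class)
    ({"age": "young", "astigmatism": "no", "tear-prod-rate": "reduced"}, "none"),
    ({"age": "pre-presbyopic", "astigmatism": "yes", "tear-prod-rate": "reduced"}, "none"),
    ({"age": "presbyopic", "astigmatism": "no", "tear-prod-rate": "normal"}, "soft"),
    ({"age": "presbyopic", "astigmatism": "yes", "tear-prod-rate": "normal"}, "none"),
]
print(sum(predict(tree, x) == y for x, y in test) / len(test))  # -> 0.25
```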

3. (10 points) Consider the subtree replacement pruning technique to make your decision tree more accurate over this test data.
• (2 points) What node in your decision tree above would be considered first for subtree replacement pruning? Why?
```
Pruning a decision node consists of removing the subtree rooted at that node,
making it a leaf node, and assigning it the most common classification of the
training examples affiliated with that node. Nodes are removed only if the
resulting pruned tree performs no worse than the original over the testing
set. So the leftmost "age" node is the first candidate for pruning.
(Note that it is important to start with nodes at the deepest level of the
tree; whether you start with the leftmost such node ("age" in this case) or
the rightmost one ("TP-rate") is not important.)

```

• (8 points) Show in detail how this pruning technique works on that one node in your previous answer. Will the node be pruned or not by the technique? Explain your answer.
```
Applying this pruning technique to the first candidate means replacing the
leftmost "age" subtree with a leaf. The training examples at that node
(astigmatism=yes, tear-prod-rate=normal) are 3 hard and 2 none, so the
replacement leaf is hard, giving the following tree:

         ASTIGMATISM
          /       \
     yes /         \ no
        /           \
    TP-RATE         AGE
     /     \       / | \
    /normal \red. /  |  \
   /         \ young P   PP
 hard       none  |  |    \
                soft none TP-RATE
                           /    \
                   reduced/      \normal
                         /        \
                       none       soft

The node will be pruned, because the resulting tree performs no worse than the
original tree on the testing data (the accuracy of the pruned decision tree is
25% as well).

```
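Continuing the previous sketch, the subtree-replacement check itself takes only a few lines: copy the tree, replace the candidate subtree with its majority-class leaf, and compare test accuracies. This reuses `tree`, `predict`, and `test` from the snippet above.

```
import copy

# The candidate node sits at astigmatism=yes, tear-prod-rate=normal; its 5
# training instances are 3 hard and 2 none, so the replacement leaf is "hard".
pruned = copy.deepcopy(tree)
pruned["astigmatism"]["yes"]["tear-prod-rate"]["normal"] = "hard"

def accuracy(t):
    return sum(predict(t, x) == y for x, y in test) / len(test)

print(accuracy(tree), accuracy(pruned))  # -> 0.25 0.25
# The pruned tree performs no worse on the test data, so the prune is kept.
```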

#### Problem II. Classification Rules (45 points)

Construct classification rules using the basic PRISM sequential covering algorithm from the 12 instances in the dataset below.
```@relation contact-lenses-weka.filters.AttributeFilter-R2

@attribute age {young,pre-presbyopic,presbyopic}
@attribute astigmatism {no,yes}
@attribute tear-prod-rate {reduced,normal}
@attribute contact-lenses {soft,hard,none}

@data

young,          no,  normal,  soft
young,          yes, reduced, none
young,          yes, normal,  hard
pre-presbyopic, no,  reduced, none
pre-presbyopic, no,  normal,  soft
pre-presbyopic, yes, normal,  hard
pre-presbyopic, yes, normal,  none
pre-presbyopic, yes, normal,  none
presbyopic,     no,  reduced, none
presbyopic,     no,  normal,  none
presbyopic,     yes, reduced, none
presbyopic,     yes, normal,  hard
```

1. (15 points) Construct perfect classification rules for the class label contact-lenses = soft until all instances with this class label are covered. Use the p/t measure to select conditions to add to the rule. Show all your work!
```

If ?  ==> soft

our choices

age = young                 1/3
age = pre-presbyopic        1/5
age = presbyopic            0/4

astigmatism = no            2/5
astigmatism = yes           0/7

tear-prod-rate = reduced    0/4
tear-prod-rate = normal     2/8

Pick the attribute-value pair with the largest fraction; that is
astigmatism=no. Add this condition to the rule:

if astigmatism=no ==> soft

This rule covers 5 instances and gets 2 right.

pick another attribute-value to improve accuracy

consider those instances where astigmatism = no

age = young                 1/1
age = pre-presbyopic        1/2
age = presbyopic            0/2

tear-prod-rate = reduced    0/2
tear-prod-rate = normal     2/3

select age=young (highest p/t)

if astigmatism=no and age=young ==> soft

covers 1 instance and classifies it correctly

We can add this to our rule set

Rule 1: if astigmatism=no and age=young ==> soft

so this rule covers 1 out of the 2 instances with class label equal to soft (we're left with 1 more instance with class label = soft).

let's go for another rule

? ==> soft

age = young  		0/2
age = pre-presbyopic    1/5
age = presbyopic 		0/4

astigmatism = no		1/4
astigmatism = yes       0/7

tear-prod-rate = reduced	0/4
tear-prod-rate = normal		1/7

select astigmatism=no to add to the rule

astigmatism=no ==> soft

covers 4 and predicts 1 correctly

age = young                 0/0
age = pre-presbyopic        1/2
age = presbyopic            0/2

tear-prod-rate = reduced    0/2
tear-prod-rate = normal     1/2

select age=pre-presbyopic to add to the rule

astigmatism=no && age=pre-presbyopic ==> soft

covers 2 and predicts 1 correctly

tear-prod-rate = reduced    0/1
tear-prod-rate = normal     1/1

select tear-prod-rate = normal

astigmatism=no && age=pre-presbyopic && tear-prod-rate=normal ==> soft
this covers 1 and classifies it correctly

we're done with all instances with class = soft ( 2 instances )

Rule Set:

Rule 1: if astigmatism=no and age=young ==> soft
Rule 2: if astigmatism=no and age=pre-presbyopic and tear-prod-rate=normal ==> soft

```
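The greedy growth of a single PRISM rule can be sketched in Python as follows, reusing `data` and `attrs` from the entropy snippet in Problem I. Breaking p/t ties in favor of larger coverage p is the usual PRISM convention and reproduces the choices made above.

```
def grow_rule(rows, target):
    """Grow one PRISM rule for class `target` by greedily adding the
    (attribute, value) condition with the highest p/t, breaking ties in
    favor of larger coverage p. Returns (conditions, covered rows)."""
    conds, avail = [], dict(attrs)          # attrs from the entropy sketch
    while avail and any(r[-1] != target for r in rows):
        best, best_score = None, (-1.0, -1)
        for a, c in avail.items():
            for v in {r[c] for r in rows}:
                t = sum(r[c] == v for r in rows)                     # covered
                p = sum(r[c] == v and r[-1] == target for r in rows) # correct
                if (p / t, p) > best_score:
                    best, best_score = (a, c, v), (p / t, p)
        a, c, v = best
        conds.append((a, v))
        del avail[a]
        rows = [r for r in rows if r[c] == v]
    return conds, rows

print(grow_rule(data, "soft")[0])
# -> [('astigmatism', 'no'), ('age', 'young')]   (Rule 1 above)
```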

2. (15 points) Construct perfect classification rules for the class label contact-lenses = none until all instances with this class label are covered. Use the p/t measure to select conditions to add to the rule. Show all your work!
```

? ==> none

age = young                 1/3
age = pre-presbyopic        3/5
age = presbyopic            3/4

astigmatism = no            3/5
astigmatism = yes           4/7

tear-prod-rate = reduced    4/4
tear-prod-rate = normal     3/8

Select tear-prod-rate=reduced to add to the rule

tear-prod-rate=reduced ==> none
This rule covers 4 instances and classifies all of them correctly, so we can add it to the rule set.

R1: tear-prod-rate=reduced ==> none

we're left with 3 instances that have none for class label

let's create another rule

? ==> none

age = young                 0/2
age = pre-presbyopic        2/4
age = presbyopic            1/2

astigmatism = no            1/3
astigmatism = yes           2/5

tear-prod-rate = reduced    0/0
tear-prod-rate = normal     3/8

select age = pre-presbyopic

age = pre-presbyopic ==> none

this rule covers 4 but classifies only 2 correctly...

let's try improving this rule

astigmatism = no            0/1
astigmatism = yes           2/3

tear-prod-rate = reduced    0/0
tear-prod-rate = normal     2/4

we'll select astigmatism = yes

age=pre-presbyopic && astigmatism=yes ==> none

covers 3 but only classifies 2 correctly

Let's try adding one more condition. The only attribute left is
tear-prod-rate, and among the 3 covered instances tear-prod-rate=normal
scores 2/3, so we add it:

age=pre-presbyopic && astigmatism=yes && tear-prod-rate=normal ==> none

This rule still covers 3 instances and classifies only 2 correctly: the
dataset contains identical instances (pre-presbyopic, yes, normal) with
different class labels, so no perfect rule is possible here. Since no
conditions remain to add, we add the rule to the rule set as is and remove
the instances it covers.

R1: tear-prod-rate=reduced ==> none
R2: age=pre-presbyopic && astigmatism=yes && tear-prod-rate=normal ==> none

we're left with 5 instances, only 1 of which has class label none

? ==> none

age = young                 0/2
age = pre-presbyopic        0/1
age = presbyopic            1/2

astigmatism = no            1/3
astigmatism = yes           0/2

tear-prod-rate = reduced    0/0
tear-prod-rate = normal     1/5

select age = presbyopic

age=presbyopic ==> none

covers 2 and classifies 1 correctly

let's look at the 2 covered instances

astigmatism = no            1/1
astigmatism = yes           0/1

tear-prod-rate = reduced    0/0
tear-prod-rate = normal     1/2

select astigmatism = no

age=presbyopic && astigmatism = no ==> none

covers 1 instance and classifies it correctly, so we can add this rule to the rule set

all instances are covered for class label = none

Rule Set:

R1: tear-prod-rate=reduced ==> none
R2: age=pre-presbyopic && astigmatism=yes && tear-prod-rate=normal ==> none
R3: age=presbyopic && astigmatism=no ==> none
```
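The outer sequential-covering loop is then a few more lines, reusing `grow_rule` from the previous sketch: grow a rule, record it, remove the instances it covers, and repeat until no instance of the target class remains. Run on this dataset with target none, it reproduces R1-R3.

```
def prism(rows, target):
    """Sequential covering: grow a rule, remove what it covers, repeat
    until no instance with class `target` remains."""
    rules = []
    while any(r[-1] == target for r in rows):
        conds, covered = grow_rule(rows, target)
        rules.append(conds)
        rows = [r for r in rows if r not in covered]
    return rules

for conds in prism(data, "none"):
    print("if " + " and ".join(f"{a}={v}" for a, v in conds) + " ==> none")
# -> if tear-prod-rate=reduced ==> none
#    if age=pre-presbyopic and astigmatism=yes and tear-prod-rate=normal ==> none
#    if age=presbyopic and astigmatism=no ==> none
```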

3. (EXTRA 10 points - attempt only if you have extra time) Construct perfect classification rules for the class label contact-lenses = hard until all instances with this class label are covered. Use the p/t measure to select conditions to add to the rule. Show all your work!
```Rules from Weka's Prism (the three rules concluding "hard" answer this question):

Prism rules
----------
If astigmatism = no
and age = young then soft
If astigmatism = no
and age = pre-presbyopic
and tear-prod-rate = normal then soft
If astigmatism = yes
and tear-prod-rate = normal
and age = young then hard
If astigmatism = yes
and tear-prod-rate = normal
and age = presbyopic then hard
If age = pre-presbyopic
and astigmatism = yes
and tear-prod-rate = normal then hard
If tear-prod-rate = reduced then none
If age = pre-presbyopic
and astigmatism = yes
and tear-prod-rate = normal then none
If age = presbyopic
and astigmatism = no then none

```

4. (5 points) Compute the accuracy of the classification rules that you constructed on the following test data instances:
```@relation test-contact-lenses-weka.filters.AttributeFilter-R2

@attribute age {young,pre-presbyopic,presbyopic}
@attribute astigmatism {no,yes}
@attribute tear-prod-rate {reduced,normal}
@attribute contact-lenses {soft,hard,none}

@data

young,          no,  reduced, none        your rules prediction: soft
pre-presbyopic, yes, reduced, none        your rules prediction: none
presbyopic,     no,  normal,  soft        your rules prediction: none
presbyopic,     yes, normal,  none        your rules prediction: cannot be classified

```
The accuracy of your classification rules on this test data is: 1/3 ≈ 33% (1 of the 3 classified instances is correct; the fourth test instance is not covered by any rule, since no rules were constructed for class hard).
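A short sketch of how these predictions arise: apply the soft rules and then the none rules in order, letting the first matching rule classify the instance and leaving uncovered instances unclassified. (This first-match semantics is an assumption about how the two rule sets are combined.)

```
# Rule 1 and Rule 2 for soft, then R1-R3 for none, in order.
rules = [
    ([("astigmatism", "no"), ("age", "young")], "soft"),
    ([("astigmatism", "no"), ("age", "pre-presbyopic"),
      ("tear-prod-rate", "normal")], "soft"),
    ([("tear-prod-rate", "reduced")], "none"),
    ([("age", "pre-presbyopic"), ("astigmatism", "yes"),
      ("tear-prod-rate", "normal")], "none"),
    ([("age", "presbyopic"), ("astigmatism", "no")], "none"),
]

def classify(x, rules):
    for conds, cls in rules:
        if all(x[a] == v for a, v in conds):
            return cls
    return None  # not covered by any rule

tests = [  # (instance, actual class)
    ({"age": "young", "astigmatism": "no", "tear-prod-rate": "reduced"}, "none"),
    ({"age": "pre-presbyopic", "astigmatism": "yes", "tear-prod-rate": "reduced"}, "none"),
    ({"age": "presbyopic", "astigmatism": "no", "tear-prod-rate": "normal"}, "soft"),
    ({"age": "presbyopic", "astigmatism": "yes", "tear-prod-rate": "normal"}, "none"),
]
for x, y in tests:
    print(classify(x, rules), "| actual:", y)
# -> soft|none, none|none, none|soft, None|none: 1 of the 3 classified
# instances is correct (33%); counting the uncovered instance as an error
# would give 1/4 = 25%.
```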

5. (10 points) Assume you are given a function m defined on rules that measures the probability that a rule occurs by chance. More precisely, if R is a rule that classifies for say, contact-lenses=none, then m(R) is the probability that the improvement of accuracy by using this rule with respect to 7/12 (number of instances with contact-lenses=none over total number of instances) occurs just by chance.

Explain in detail how you would use this function m to post-prune your classification rules.

```
In post-pruning, we consider all rules that have more than one condition on
their left-hand sides. Suppose that for a rule R we compute the value m(R).
To prune R by removing its rightmost condition, forming a new rule S, we must
show that m(S) < m(R). m(S) < m(R) means that the probability that the
improvement in accuracy of S over the default rule (IF "no condition" THEN
right-hand side of R) occurs by chance is smaller than the corresponding
probability for R. The process is then repeated on the rightmost condition
of S, and so on, stopping as soon as removing a condition would increase the
value of m, that is, as soon as the removal of a condition makes the rule
worse in terms of m.
```
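A compact sketch of this loop, assuming m is available as a Python callable on rules (the signature below is hypothetical; the exam only specifies m abstractly):

```
def post_prune(conds, cls, m):
    """Repeatedly drop the rightmost condition while that lowers m, i.e.
    while the shorter rule is less likely to be a chance improvement."""
    while len(conds) > 1:
        shorter = conds[:-1]
        if m(shorter, cls) < m(conds, cls):
            conds = shorter          # improvement: keep pruning
        else:
            break                    # removal makes the rule worse: stop
    return conds
```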

#### Problem III. Data Preprocessing and Evaluation of Patterns (20 points)

• (10 points) Select 3 different ways of dealing with missing values during pre-processing of the data before mining and explain each one of them in detail.
```
(Any three of the following:)

- Remove attribute(s) containing missing values.
- Remove instance(s) containing missing values.
- Replace missing values with the mean of the values present in the attribute
  (for a numeric attribute) or with the mode, i.e. the most frequent of the
  values present in the attribute (for a nominal attribute).
- Keep the missing value in the dataset by introducing a new value "missing"
  for the attribute.
- Replace the missing values of an attribute in a stratified manner, that is,
  substitute the missing values with existing values of the attribute in such
  a way that the distribution of the existing values is preserved.
- Replace each missing value with the set of non-missing values of the
  attribute, each one weighted by the frequency of that value in the dataset.

```
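As an illustration of the mean/mode option, here is a small Python sketch of imputation for a single attribute column, with None standing in for a missing value (this representation is an assumption made for the example):

```
from collections import Counter

def impute(column):
    """Fill None entries with the mean (numeric column) or the mode, i.e.
    the most frequent value (nominal column), of the present values."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = sum(present) / len(present)            # mean
    else:
        fill = Counter(present).most_common(1)[0][0]  # mode
    return [fill if v is None else v for v in column]

print(impute([20, 25, None, 35]))         # -> [20, 25, 26.666..., 35]
print(impute(["no", None, "yes", "no"]))  # -> ['no', 'no', 'yes', 'no']
```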
• (10 points) Explain in detail the following 3 methods for evaluating patterns/models:

• Testing over training data
```Test the model with the same data used to build the model. This tends to give an overly optimistic estimate of performance, since the model is evaluated on instances it has already seen.

```
• Splitting input data into training and testing data using stratification
```Split the input data at random into two sets, one for training and one for
testing, in such a way that the original distribution of the (values of the)
class attribute in the input data is preserved in both sets.

```
• Using n-fold cross-validation
```Split the input data into n folds (of approximately the same size) at random
using stratification. Denote those folds F1, F2, ..., Fn.
Then perform the following process:

For i := 1 to n do
  - construct model Mi using as training data the union of all folds except
    Fi, that is, the union of F1, ..., F(i-1), F(i+1), ..., Fn
  - test model Mi on fold Fi
Average the performance of all the models Mi.
```
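The stratified split and the cross-validation loop above can be sketched together in Python. Here `build_model` and `accuracy` are assumed to be supplied by the caller (hypothetical names); instances are dealt round-robin within each class so every fold approximately preserves the class distribution.

```
import random
from collections import defaultdict

def stratified_folds(rows, n, seed=0):
    """Deal instances into n folds class by class, so each fold roughly
    preserves the class distribution (class label in the last column)."""
    by_class = defaultdict(list)
    for r in rows:
        by_class[r[-1]].append(r)
    folds, rng, i = [[] for _ in range(n)], random.Random(seed), 0
    for group in by_class.values():
        rng.shuffle(group)
        for r in group:              # round-robin across folds
            folds[i % n].append(r)
            i += 1
    return folds

def cross_validate(rows, n, build_model, accuracy):
    """Train n models, each tested on the one fold it did not see; average."""
    folds = stratified_folds(rows, n)
    scores = []
    for i in range(n):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(accuracy(build_model(train), folds[i]))
    return sum(scores) / n
```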