Solutions Exam 1 - April 4, 2003

Department of Computer Science

Worcester Polytechnic Institute

- Show your work
**Justify**your answers- Use the space provided to write your answers
- Ask in case of doubt

@relation contact-lenses-weka.filters.AttributeFilter-R2 @attribute age {young,pre-presbyopic,presbyopic} @attribute astigmatism {no,yes} @attribute tear-prod-rate {reduced,normal} @attribute contact-lenses {soft,hard,none} @data young, no, normal, soft young, yes, reduced, none young, yes, normal, hard pre-presbyopic, no, reduced, none pre-presbyopic, no, normal, soft pre-presbyopic, yes, normal, hard pre-presbyopic, yes, normal, none pre-presbyopic, yes, normal, none presbyopic, no, reduced, none presbyopic, no, normal, none presbyopic, yes, reduced, none presbyopic, yes, normal, hard

- (25 points) Construct the full ID3 decision tree using entropy to rank the
predictive attributes (age, astigmatism, tear-prod-rate) with respect to the
target/classification attribute (contact-lenses).
**Show all the steps of the calculations.**For your inconvenience, the logarithm in base 2 of selected values are provided.**x**1/2 1/3 1/4 3/4 1/5 2/5 3/5 1/6 5/6 1/7 2/7 3/7 4/7 1 **log2(x)**-1 -1.5 -2 -0.4 -2.3 -1.3 -0.7 -2.5 -0.2 -2.8 -1.8 -1.2 -0.8 0 First we have to choose the root of the tree. The root will be the attribute with lowest entropy. CONTACT LENSES SOFT HARD NONE AGE: young (3/12)*[ -(1/3)*log2(1/3) - (1/3)*log2(1/3) - (1/3)*log2(1/3) ] = (3/12)*1.5847 pre-presb (5/12)*[ -(1/5)*log2(1/5) - (1/5)*log2(1/5) - (3/5)*log2(3/5) ] = (5/12)*1.34 presb (4/12)*[ -(0/4)*log2(0/4) - (1/4)*log2(1/4) - (3/4)*log2(3/4) ] = (4/12)*0.8 ------------- = 1.222 ASTIGM: yes (7/12)*[ -(0/7)*log2(0/7) - (3/7)*log2(3/7) - (4/7)*log2(4/7) ] = (7/12)*0.9704 no (5/12)*[ -(2/5)*log2(2/5) - (0/5)*log2(0/5) - (3/5)*log2(3/5) ] = (5/12)*0.94 -------------- = 0.957 TEAR-P: normal (8/12)*[ -(2/8)*log2(2/8) - (3/8)*log2(3/8) - (3/8)*log2(3/8) ] = (8/12)*1.56 reduced (4/12)*[ -(0/4)*log2(0/4) - (0/4)*log2(0/4) - (4/4)*log2(4/4) ] = (4/12)*0 ------------ = 1.04 Since the attribute astigmatism has the lowest entropy, the root of the ID3 tree will be astigmatism: ASTIGMATISM / \ yes / \ no / \ Now for astigmatism=yes we measure the entropy for the rest of the two attributes:age, tears-pr-rate AGE: young (2/7)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.286 pre-presb (3/7)*[ 0 - (1/3)*log2(1/3) - (2/3)*log2(2/3) ] = 0.394 presb (2/12)*[0 - (1/2)*log2(1/2) - (1/2)*log2(1/2)] = 0.286 ------------- = 0.966 TEAR-P: normal (5/7)*[ 0-(3/5)*log2(3/5) - (2/5)*log2(2/5)] = 0.694 reduced (2/7)*[ 0-0 - (1)*log2(1) ] = 0 ------------ = 0.694 So the next node in the tree will be tears-pr-rate, and the last age. So far the tree look like ASTIGMATISM / \ yes / \ no / \ Tp-rate / \ normal / \ reduced / \ Now according to the ID3 for the branch reduced all examples have class none and the tree will have leaf none. Branch normal doesn't give single class so we have to add the last attribute age to this branch. ASTIGMATISM / \ yes / \ no / \ Tp-rate / \ normal / \ reduced / \ Age none / | \ young /PP| \ P / | \ hard none hard Now for astigmatism=no we measure the entropy for the rest of the two attributes:age, tears-pr-rate AGE: young (1/5)*[ 0 - 0 - 1*log2(1) ] = 0 pre-presb (2/5)*[ 0 - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.4 presb (2/5)*[0 - 0 - (2/2)*log2(2/2)] = 0 ------------- = 0.4 TEAR-P: normal (3/5)*[ 0-(2/3)*log2(2/3) - (1/3)*log2(1/3)] = 0.55 reduced (2/5)*[ 0-0 - (2/2)*log2(2/2) ] = 0 ------------ = 0.55 Since the gain of the attribute age is with smaller entropy, we add this atttribute to the tree: ASTIGMATISM / \ yes / \ no / \ TP-rate AGE / \ | \ \ normal / red. \ young| P| \PP / \ | | \ AGE none \ / | \ young /PP| \ P / | \ hard none hard Now for branch(astigmastism=no, age =yes) all examples have class soft and for this branch the tree will have leaf soft. Now for branch(astigmastism=no, age =P) all examples have class none and for this branch the tree will have leaf none. The Branch (astigmastism=no, age =PP) doesn't give single class so we have to add the last attribute TP-rate to this branch. So we will obtain the final tree: ASTIGMATISM / \ yes / \ no / \ TP-rate AGE / \ | \ \ normal / red. \ young| P| \PP / \ | | \ AGE none soft none \ / | \ TP-rate young /PP| \ P / \ / | \ reduced / \ normal hard none hard / \ none soft

- (10 points) Compute the accuracy of the decision tree you constructed on
the following test data instances:
young, no, reduced, none our decision tree predicts: soft (wrong:test instance's class is none) pre-pre, yes, reduced, none our decision tree predicts: none (correct:test instance's class is none) presbyopic, no, normal, soft our decision tree predicts: none (wrong:test instance's class is none) presbyopic, yes, normal, none our decision tree predicts: hard (wrong:test instance's class is none) So the accuracy of this decision tree on this test dataset is: 25%

- (10 points) Consider the subtree replacement pruning technique to make
your decision tree more accurate over the this test data.
- (2 points) What node in your decision tree above would be considered
first for subtree replacement pruning? Why?
Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node. Nodes are removed only if the resulting pruned tree performs no worse than the original over the testing set. So the the leftmost "age" node is the first candidat for pruning. (Note that it's important to start with nodes in the "deepest" level of the tree, whether you start with the leftmost node ("age" in this case) or the rightmost node ("tr-rate") is not important.)

- (8 points) Show in detail how this pruning technique works on that one
node in your previous answer. Will the node be pruned or not by the
technique. Explain your answer.
Implementing this prunning technique and pruning the first candidate will give us the following tree: Astigmatism / \ yes no / \ Tp-rate age / \ / | \ normal reduced young P PP / \ / | \ hard none soft none Tp-rate / \ reduced normal / \ none soft The node will be pruned because the obtaiened tree performs no worse then the original tree on the testing data (i.e. the accuracy of the pruned decision tree is 25% as well).

- (2 points) What node in your decision tree above would be considered
first for subtree replacement pruning? Why?

@relation contact-lenses-weka.filters.AttributeFilter-R2 @attribute age {young,pre-presbyopic,presbyopic} @attribute astigmatism {no,yes} @attribute tear-prod-rate {reduced,normal} @attribute contact-lenses {soft,hard,none} @data young, no, normal, soft young, yes, reduced, none young, yes, normal, hard pre-presbyopic, no, reduced, none pre-presbyopic, no, normal, soft pre-presbyopic, yes, normal, hard pre-presbyopic, yes, normal, none pre-presbyopic, yes, normal, none presbyopic, no, reduced, none presbyopic, no, normal, none presbyopic, yes, reduced, none presbyopic, yes, normal, hardFollow these guidelines in your construction of the rules:

- (15 points) Construct perfect classification rules for the class label
contact-lenses = soft until all instances with this class label are covered.
Use the p/t measure to select conditions to add to the rule.
**Show all your work!**If ? ==> soft our choices age = young 1/3 age = pre-presbyopic 1/5 age = presbyopic 0/4 astigmatism = no 2/5 astigmatism = yes 0/7 tear-prod-rate = reduced 0/4 tear-prod-rate = normal 2/8 pick the attribute-value with the largest fraction and that happens to be astigmatism = no add this term to the rule. if astigmatism=no ==> soft covers 5 instances and gets 2 right pick another attribute-value to improve accuracy consider those instances where astigmatism = no age = young 1/1 age = pre-presbyopic 1/2 age = presbyopic 0/2 tear-prod-rate = reduced 0/2 tear-prod-rate = normal 2/3 select age=young (highest p/t) add this to the rule if astigmatism=no and age=young ==> soft covers 1 and classifies it right We can add this to our rule set Rule 1: if astigmatism=no and age=young ==> soft so this rule covers 1 out of the 2 instances with class label equal to soft (we're left with 1 more instance with class label = soft). let's go for another rule ? ==> soft age = young 0/2 age = pre-presbyopic 1/5 age = presbyopic 0/4 astigmatism = no 1/4 astigmatism = yes 0/7 tear-prod-rate = reduced 0/4 tear-prod-rate = normal 1/7 select astigmatism=no to add to the rule astigmatism=no ==> soft covers 4 and predicts 1 correctly let's try adding another attribute age = young 0/0 age = pre-presbyopic 1/2 age = presbyopic 0/2 tear-prod-rate = reduced 0/2 tear-prod-rate = normal 1/2 select age=pre-presbyopic to add to the rule astigmatism=no && age=pre-presbyopic ==> soft covers 2 and predicts 1 correctly let's try adding another attribute tear-prod-rate = reduced 0/1 tear-prod-rate = normal 1/1 select tear-prod-rate = normal add this to the rule astigmatism=no && age=pre-presbyopic && tear-prod-rate=normal ==> soft this covers 1 and classifies it correctly we're done with all instances with class = soft ( 2 instances ) Rule Set: Rule 1: if astigmatism=no and age=young ==> soft Rule 2: astigmatism = no && age=pre-presbyopic && tear-prod-rate=normal ==> soft

- (15 points) Construct perfect classification rules for the class label
contact-lenses = none until all instances with this class label are covered.
Use the p/t measure to select conditions to add to the rule.
**Show all your work!**? ==> none age = young 1/3 age = pre-presbyopic 3/5 age = presbyopic 3/4 astigmatism = no 3/5 astigmatism = yes 4/7 tear-prod-rate = reduced 4/4 tear-prod-rate = normal 3/8 Select tear-prod-rate=reduced to add to the rule tear-prod-rate=reduced ==> none This rule covers 4 and classifies all correctly, therefore we can add to the rule set. R1: tear-prod-rate=reduced ==> none we're left with 3 instances that have none for class label let's create another rule ? ==> none age = young 0/2 age = pre-presbyopic 2/4 age = presbyopic 1/2 astigmatism = no 1/3 astigmatism = yes 2/5 tear-prod-rate = reduced 0/0 tear-prod-rate = normal 3/8 select age = pre-presbyopic age = pre-presbyopic ==> none this rule covers 4 but classifies only 2 correctly... let's try improving this rule astigmatism = no 0/1 astigmatism = yes 2/3 tear-prod-rate = reduced 0/0 tear-prod-rate = normal 2/4 we'll select astigmatism = yes age=pre-presbyopic && astigmatism=yes ==> none covers 3 but only classifies 2 correctly let's try adding one more attribute we don't have a choice but add tear-prod-rate age=pre-presbyopic && astigmatis=yes && tear-prod-rate = normal ==> none add this rule to the rule set and remove those instances covered by this rule R1: tear-prod-rate=reduced ==> none R2: age=pre-presbyopic && astigmatis=yes && tear-prod-rate = normal ==> none we're left with 1 instance ? ==> none age = young 0/1 age = pre-presbyopic 0/1 age = presbyopic 1/2 astigmatism = no 1/2 astigmatism = yes 0/2 tear-prod-rate = reduced 0/0 tear-prod-rate = normal 1/4 select age = presbyopic age=presbyopic ==> none covers 2 and classifies 1 correctly let's look at 2 covered instances astigmatism = no 1/1 astigmatism = yes 0/1 tear-prod-rate = reduced 0/0 tear-prod-rate = normal 1/2 select astigmatism = no age=presbyopic && astigmatism = no ==> none covers 1 and classifies 1, therefore we can add this to the rule set all instances are covered for class label = none Rule Set: R1: tear-prod-rate=reduced ==> none R2: age=pre-presbyopic && astigmatis=yes && tear-prod-rate = normal ==> none R3: age=presbyopic && astigmatism = no ==> none

- (EXTRA 10 points - attempt only if you have extra time) Construct perfect
classification rules for the class label contact-lenses = hard until all
instances with this class label are covered. Use the p/t measure to select
conditions to add to the rule.
**Show all your work!**Rules from Weka's Prism Prism rules ---------- If astigmatism = no and age = young then soft If astigmatism = no and age = pre-presbyopic and tear-prod-rate = normal then soft If astigmatism = yes and tear-prod-rate = normal and age = young then hard If astigmatism = yes and tear-prod-rate = normal and age = presbyopic then hard If age = pre-presbyopic and astigmatism = yes and tear-prod-rate = normal then hard If tear-prod-rate = reduced then none If age = pre-presbyopic and astigmatism = yes and tear-prod-rate = normal then none If age = presbyopic and astigmatism = no then none

- (5 points) Compute the accuracy of the classification rules
that you constructed on the following test data instances:
@relation test-contact-lenses-weka.filters.AttributeFilter-R2 @attribute age {young,pre-presbyopic,presbyopic} @attribute astigmatism {no,yes} @attribute tear-prod-rate {reduced,normal} @attribute contact-lenses {soft,hard,none} @data young, no, reduced, none your rules prediction: soft pre-presbyopic, yes, reduces, none your rules prediction: none presbyopic, no, normal, soft your rules prediction: none presbyopic, yes, normal, none your rules prediction: cannot be classified

The accuracy of your classification rules on this test data is: __1/3 = 33____________ - (10 points) Assume you are given a function
*m*defined on rules that measures the probability that a rule occurs by chance. More precisely, if R is a rule that classifies for say, contact-lenses=none, then*m*(R) is the probability that the improvement of accuracy by using this rule with respect to 7/12 (number of instances with contact-lenses=none over total number of instances) occurs just by chance.Explain in detail how you would use this function

*m*to post-prune your classification rules.In post-pruning, we will consider all rules that have more than condition on their Left-hand sides. Suppose for a rule R, we find the value m(R) by applying the function m. To be able to prune this rule by removing the rightmost condition in R, we must show that when this condition is removed from this rule to form a new rule S, m(S) is smaller than m(R). m(S) < m(R) means that the probability that the improvement of accuracy by using rule S with respect to default rule (IF "no condition" THEN right-hand side of R) occurs by chance is smaller than the corresponding probability for Rule R. The process is repeated for the last (rightmost) condition in S, until a removal of a condition increases the value of the function m for that rule. In other words, the process of removing conditions is repeated until the removal of the last term makes the rule worse in terms of m.

**(10 points)**Select 3 different ways of dealing with missing values during pre-processing of the data before mining and explain each one of them in detail.- Remove attribute(s) containing missing values
- Remove instance(s) containing missing values
- Replace missing values with mean of the values present in the attribute (in case of a numeric attribute) or with the mode, the most frequent of the values present in the attribute(in case of a nominal attribute).
- Keep the missing value in the dataset by introducing a new value "missing" for the attribute.
- Replace the missing values of an attribute in a stratified manner, that is subtitute the missing values with existing values of the attribute in such a way that the distribution of the existing values is preserved.
- Replace the missing values with the set of non-missing values of the attribute, each one weighted by the frequency of that value in the dataset.

**(10 points)**Explain in detail the following 3 methods for evaluating patterns/models:- Testing over training data
Test the model with the same data used to build the model

- Splitting input data into training and testing data using stratification
Split the input data at random into two sets, one for training and one for testing, in such a way that the original distribution of the (values of the) class attribute in the input data is preserved in both sets.

- Using n-fold cross-validation
Split the input data into n folds (approximately of the same size) at random using stratification. Let's denote those folds F1,F2,..., Fn. Now, perform the following process: For i := 1 to n do - construct model Mi using as training data the union of all folds except for Fi. That is, the uning of F1, ..., F(i-1), F(1+1), ..., Fn - test model Mi on fold Fi Average the performance of all the models Mi.

- Testing over training data