Solutions Exam 1 - November 20, 2006

Department of Computer Science

Worcester Polytechnic Institute

```
@relation balloons
@attribute size {large, small}
@attribute act {stretch, dip}
@attribute age {adult, child}
@attribute inflated {T, F}
@data
small, stretch, adult, T
small, stretch, child, F
small, dip, adult, F
small, dip, child, F
large, stretch, adult, T
large, stretch, child, F
large, dip, adult, F
large, dip, child, F
```

Construct the FULL decision tree for this dataset USING THE ID3
ALGORITHM. **Show all the steps of the entropy calculations.**

For your convenience, the base-2 logarithms of selected values are provided below.

| x | 1/2 | 1/3 | 2/3 | 1/4 | 3/4 | 1/5 | 2/5 | 3/5 | 1/6 | 5/6 | 1/7 | 2/7 | 3/7 | 4/7 | 1 |
|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|---|
| log2(x) | -1 | -1.5 | -0.6 | -2 | -0.4 | -2.3 | -1.3 | -0.7 | -2.5 | -0.2 | -2.8 | -1.8 | -1.2 | -0.8 | 0 |

**SOLUTIONS:** Let's compute the entropy of the 3 predicting attributes with respect to the target attribute ("inflated"), taking 0*log2(0) = 0 by convention:

```
SIZE  small   (4/8)*[-(1/4)*log2(1/4) - (3/4)*log2(3/4)] = 0.4
      large   (4/8)*[-(1/4)*log2(1/4) - (3/4)*log2(3/4)] = 0.4
                                                           ---
                                                           0.8

ACT   stretch (4/8)*[-(2/4)*log2(2/4) - (2/4)*log2(2/4)] = 0.5
      dip     (4/8)*[-(0/4)*log2(0/4) - (4/4)*log2(4/4)] = 0
                                                           ---
                                                           0.5

AGE   adult   (4/8)*[-(2/4)*log2(2/4) - (2/4)*log2(2/4)] = 0.5
      child   (4/8)*[-(0/4)*log2(0/4) - (4/4)*log2(4/4)] = 0
                                                           ---
                                                           0.5
```

Act and Age have the same (lowest) entropy. We can choose either one as the root of our decision tree. I'll choose Act:

```
              ACT
    stretch  /   \  dip
            /     \
    2 T's, 2 F's   4 F's
```

The node at the end of the ACT=dip branch contains only F's, so we convert that node into a leaf that predicts F. Since the node at the end of the ACT=stretch branch contains both T's and F's, we need to select an attribute to split it further. Here are the 4 instances under consideration:

```
small, stretch, adult, T
small, stretch, child, F
large, stretch, adult, T
large, stretch, child, F
```

We compute the entropies of SIZE and AGE with respect to INFLATED in this smaller dataset:

```
SIZE  small   (2/4)*[-(1/2)*log2(1/2) - (1/2)*log2(1/2)] = 0.5
      large   (2/4)*[-(1/2)*log2(1/2) - (1/2)*log2(1/2)] = 0.5
                                                           ---
                                                           1.0

AGE   adult   (2/4)*[-(2/2)*log2(2/2) - (0/2)*log2(0/2)] = 0
      child   (2/4)*[-(0/2)*log2(0/2) - (2/2)*log2(2/2)] = 0
                                                           ---
                                                           0
```

The attribute with the lowest entropy is AGE, so it is used to split the node under consideration:

```
              ACT
    stretch  /   \  dip
            /     \
          AGE      F
    adult /   \  child
         /     \
        T       F
```

This completes the construction of the decision tree.
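For readers who want to verify these numbers programmatically, here is a minimal Python sketch of ID3's split-selection step on this dataset. The names `DATA`, `ATTRS`, `entropy`, and `split_entropy` are illustrative choices, not part of the exam.

```python
# Minimal sketch of ID3 split selection on the balloons dataset.
from collections import Counter
from math import log2

DATA = [
    ("small", "stretch", "adult", "T"), ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),     ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"), ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),     ("large", "dip", "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}  # column index of each predicting attribute

def entropy(rows):
    """Entropy of the target attribute ('inflated', the last column)."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def split_entropy(rows, attr):
    """Weighted average entropy of the subsets induced by splitting on attr."""
    col, total, acc = ATTRS[attr], len(rows), 0.0
    for value in {row[col] for row in rows}:
        subset = [row for row in rows if row[col] == value]
        acc += len(subset) / total * entropy(subset)
    return acc

for attr in ATTRS:
    print(attr, round(split_entropy(DATA, attr), 3))
# size 0.811 (the exam's 0.8 comes from the rounded log table), act 0.5, age 0.5;
# ACT and AGE tie for the minimum, so either can be chosen as the root.
```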

```
@relation balloons
@attribute size {large, small}
@attribute act {stretch, dip}
@attribute age {adult, child}
@attribute inflated {T, F}
@data
small, stretch, adult, T
small, stretch, child, F
small, dip, adult, F
small, dip, child, F
large, stretch, adult, T
large, stretch, child, F
large, dip, adult, F
large, dip, child, F
```

Assume that we want to construct classification rules for this dataset.

- (15 points)
Follow the Prism sequential covering algorithm to construct classification
rules for the target
**inflated=T**. Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.

**Solutions:** We begin with the rule:

```
IF ? THEN inflated=T
```

This rule is not perfect, so we look for the best attribute-value pair to add to the antecedent of the rule:

```
p/t   CANDIDATE CONDITIONS
1/4   size=small
1/4   size=large
2/4   act=stretch
0/4   act=dip
2/4   age=adult
0/4   age=child
```

Since both act=stretch and age=adult have the same, maximum p/t ratio among the conditions, we can select either one, say act=stretch. The resulting rule:

```
IF act=stretch THEN inflated=T
```

The rule is not perfect, so we look for a second condition to add to its antecedent:

```
p/t   CANDIDATE CONDITIONS
1/2   act=stretch and size=small
1/2   act=stretch and size=large
2/2   act=stretch and age=adult
0/2   act=stretch and age=child
```

The best condition to add to act=stretch is age=adult, resulting in the rule:

```
IF act=stretch and age=adult THEN inflated=T
```

The rule is now perfect, as its accuracy over the training data is 100%, so we are done constructing this rule. We now remove the dataset instances covered by it. Since no instances with inflated=T remain in the dataset, we are done constructing rules that predict inflated=T. The resulting set of rules consists of the single rule:

```
IF act=stretch and age=adult THEN inflated=T
```
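As a cross-check, the p/t selection and rule growing can be sketched in a few lines of Python, reusing the `DATA` list from the ID3 sketch above. `best_condition` and `grow_rule` are illustrative names, not part of the Prism specification.

```python
# Sketch of Prism's rule-growing step (reuses DATA from the ID3 sketch above).
def best_condition(rows, target, used_cols):
    """Return the (column, value) pair with the highest p/t ratio."""
    best, best_ratio = None, -1.0
    for col in range(3):                  # the three predicting attributes
        if col in used_cols:
            continue
        for value in {row[col] for row in rows}:
            covered = [row for row in rows if row[col] == value]
            p = sum(1 for row in covered if row[-1] == target)  # positives covered
            if p / len(covered) > best_ratio:
                best, best_ratio = (col, value), p / len(covered)
    return best

def grow_rule(rows, target):
    """Add conditions until the rule covers only instances of the target class."""
    conditions, covered = [], rows
    while any(row[-1] != target for row in covered):
        col, value = best_condition(covered, target, {c for c, _ in conditions})
        conditions.append((col, value))
        covered = [row for row in covered if row[col] == value]
    return conditions, covered

print(grow_rule(DATA, "T")[0])
# [(1, 'stretch'), (2, 'adult')] -> IF act=stretch AND age=adult THEN inflated=T
```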

- (15 points)
Assume that the perfect rule

**IF act=dip THEN inflated=F**

has already been constructed. Follow the Prism sequential covering algorithm to construct the remaining classification rules for the target **inflated=F**. Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.

**Solutions:** After the rule IF act=dip THEN inflated=F is constructed, all the instances correctly classified by this rule are removed from consideration. The remaining set of instances is:

```
small, stretch, adult, T
small, stretch, child, F
large, stretch, adult, T
large, stretch, child, F
```

Now we start with a new rule:

```
IF ? THEN inflated=F
```

This rule is not perfect, so we look for the best attribute-value pair to add to its antecedent:

```
p/t   CANDIDATE CONDITIONS
1/2   size=small
1/2   size=large
2/4   act=stretch
0/2   age=adult
2/2   age=child
```

The condition with the best p/t ratio is age=child:

```
IF age=child THEN inflated=F
```

The rule is now perfect, as its accuracy over the training data is 100%, so we are done constructing this rule. We now remove the dataset instances covered by it. Since no instances with inflated=F remain in the dataset, we are done constructing rules that predict inflated=F. The resulting set of rules is:

```
IF act=dip THEN inflated=F
IF age=child THEN inflated=F
```
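The full covering loop that produced these two rules can be sketched on top of `grow_rule` from the previous snippet (again, `prism` is an illustrative name):

```python
# Sequential covering sketch: grow a perfect rule, remove the instances it
# covers, and repeat while instances of the target class remain.
def prism(rows, target):
    rules, remaining = [], list(rows)
    while any(row[-1] == target for row in remaining):
        conditions, covered = grow_rule(remaining, target)
        rules.append(conditions)
        remaining = [row for row in remaining if row not in covered]
    return rules

print(prism(DATA, "F"))
# [[(1, 'dip')], [(2, 'child')]] -> IF act=dip THEN F; IF age=child THEN F
```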

Assume that we want to mine association rules, with minimum support 2/8 (25%, the threshold used in the solutions below), in this dataset:

```
@relation balloons
@attribute size {large, small}
@attribute act {stretch, dip}
@attribute age {adult, child}
@attribute inflated {T, F}
@data
small, stretch, adult, T
small, stretch, child, F
small, dip, adult, F
small, dip, child, F
large, stretch, adult, T
large, stretch, child, F
large, dip, adult, F
large, dip, child, F
```

**(25 Points)** Use the Apriori algorithm to construct all the **frequent itemsets** in this dataset. The first two levels of itemsets, with their supports, are provided below.

**LEVEL 1**

```
SUPPORT  ITEMSETS
4/8      {size=small}
4/8      {size=large}
4/8      {act=stretch}
4/8      {act=dip}
4/8      {age=adult}
4/8      {age=child}
2/8      {inflated=T}
6/8      {inflated=F}
```

**LEVEL 2**

```
SUPPORT  ITEMSETS
2/8      {size=small, act=stretch}
2/8      {size=small, act=dip}
2/8      {size=small, age=adult}
2/8      {size=small, age=child}
1/8      {size=small, inflated=T}
3/8      {size=small, inflated=F}
2/8      {size=large, act=stretch}
2/8      {size=large, act=dip}
2/8      {size=large, age=adult}
2/8      {size=large, age=child}
1/8      {size=large, inflated=T}
3/8      {size=large, inflated=F}
2/8      {act=stretch, age=adult}
2/8      {act=stretch, age=child}
2/8      {act=stretch, inflated=T}
2/8      {act=stretch, inflated=F}
2/8      {act=dip, age=adult}
2/8      {act=dip, age=child}
0/8      {act=dip, inflated=T}
4/8      {act=dip, inflated=F}
2/8      {age=adult, inflated=T}
2/8      {age=adult, inflated=F}
0/8      {age=child, inflated=T}
4/8      {age=child, inflated=F}
```

**LEVEL 3**

Compute all the candidate and frequent itemsets for level 3. Use both the **join** and the **subset pruning** criteria to make the process more efficient.

**SOLUTIONS:**

```
SUPPORT  ITEMSETS
1/8      {size=small, act=stretch, age=adult}
1/8      {size=small, act=stretch, age=child}
1/8      {size=small, act=stretch, inflated=F}
1/8      {size=small, act=dip, age=adult}
1/8      {size=small, act=dip, age=child}
2/8      {size=small, act=dip, inflated=F}
1/8      {size=small, age=adult, inflated=F}
2/8      {size=small, age=child, inflated=F}
1/8      {size=large, act=stretch, age=adult}
1/8      {size=large, act=stretch, age=child}
1/8      {size=large, act=stretch, inflated=F}
1/8      {size=large, act=dip, age=adult}
1/8      {size=large, act=dip, age=child}
2/8      {size=large, act=dip, inflated=F}
1/8      {size=large, age=adult, inflated=F}
2/8      {size=large, age=child, inflated=F}
2/8      {act=stretch, age=adult, inflated=T}
0/8      {act=stretch, age=adult, inflated=F}
XXX      {act=stretch, age=child, inflated=T}
         --> no need to check the support of this itemset: it is removed by
             subset pruning because {age=child, inflated=T} is not frequent
2/8      {act=stretch, age=child, inflated=F}
2/8      {act=dip, age=adult, inflated=F}
2/8      {act=dip, age=child, inflated=F}
```

Hence, the frequent 3-itemsets are:

```
SUPPORT  ITEMSETS
2/8      {size=small, act=dip, inflated=F}
2/8      {size=small, age=child, inflated=F}
2/8      {size=large, act=dip, inflated=F}
2/8      {size=large, age=child, inflated=F}
2/8      {act=stretch, age=adult, inflated=T}
2/8      {act=stretch, age=child, inflated=F}
2/8      {act=dip, age=adult, inflated=F}
2/8      {act=dip, age=child, inflated=F}
```

**LEVEL 4**

Compute all the candidate and frequent itemsets for level 4. Use both the **join** and the **subset pruning** criteria to make the process more efficient.

**Solutions:** Note that no pair of frequent 3-itemsets satisfies the join condition. Hence there are no candidate itemsets, and no frequent itemsets, in Level 4.
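The join and subset-pruning step can be sketched in Python. Here `apriori_gen`, `ATTR_ORDER`, and the "attr=value" item encoding are illustrative choices; items are kept in attribute order (size, act, age, inflated) inside each itemset tuple, as in the solutions above.

```python
# Sketch of Apriori's candidate generation with join + subset pruning.
from itertools import combinations

ATTR_ORDER = {"size": 0, "act": 1, "age": 2, "inflated": 3}

def pos(item):
    """Position of an 'attr=value' item in the attribute ordering."""
    return ATTR_ORDER[item.split("=")[0]]

def apriori_gen(frequent_k):
    """Build (k+1)-candidates from a collection of frequent k-itemset tuples."""
    fk, candidates = set(frequent_k), set()
    for a in fk:
        for b in fk:
            # Join: identical except for the last item, a's last preceding b's.
            if a[:-1] == b[:-1] and pos(a[-1]) < pos(b[-1]):
                cand = a + (b[-1],)
                # Subset prune: every k-subset of the candidate must be frequent.
                if all(sub in fk for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
    return candidates

FREQ3 = [
    ("size=small", "act=dip", "inflated=F"),   ("size=small", "age=child", "inflated=F"),
    ("size=large", "act=dip", "inflated=F"),   ("size=large", "age=child", "inflated=F"),
    ("act=stretch", "age=adult", "inflated=T"), ("act=stretch", "age=child", "inflated=F"),
    ("act=dip", "age=adult", "inflated=F"),    ("act=dip", "age=child", "inflated=F"),
]
print(apriori_gen(FREQ3))  # set(): no pair joins, so Level 4 has no candidates
```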

- (5 points) Consider the association rule:
act=stretch -> age=adult, inflated=T

Compute the **confidence** of this rule. Show the steps of your calculations.

**Solution:**

```
confidence(act=stretch -> age=adult, inflated=T)
    = support(act=stretch, age=adult, inflated=T) / support(act=stretch)
    = (2/8) / (4/8) = 2/4 = 0.5 (or 50%)
```
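These numbers can be checked with a small support helper, reusing the `DATA` list from the ID3 sketch; `support` and `COLS` are illustrative names.

```python
# Support/confidence check for the rule above (reuses DATA from the ID3 sketch).
COLS = {"size": 0, "act": 1, "age": 2, "inflated": 3}

def support(items, rows=DATA):
    """Fraction of rows matching every 'attr=value' item in the list."""
    def matches(row):
        return all(row[COLS[a]] == v for a, v in (i.split("=") for i in items))
    return sum(1 for row in rows if matches(row)) / len(rows)

conf = support(["act=stretch", "age=adult", "inflated=T"]) / support(["act=stretch"])
print(conf)  # 0.5, i.e. (2/8) / (4/8)
```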

- (10 points) Assume that we want to estimate the accuracy of a data mining technique over a given dataset. Explain in detail the procedure followed by *n*-fold cross-validation for this purpose.

**Solution:** For *n*-fold cross-validation, the dataset instances are divided into *n* subsets of roughly the same size. Let D_1, ..., D_n denote those *n* subsets. Then the following procedure is followed:

```
For k := 1 to n do
    Construct a model using the union of D_1, ..., D_{k-1}, D_{k+1}, ..., D_n
    as the training set.
    Calculate the accuracy of the model using D_k as the test set;
    call that accuracy A_k.
end-For
Return the average of A_1, ..., A_n as the accuracy of the data mining
technique over the given dataset.
```
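The procedure translates directly into code. Below is a minimal sketch, where `build_model` and `accuracy` are placeholders for any data mining technique and its evaluation function; in practice the dataset is usually shuffled before being split into folds.

```python
# Minimal n-fold cross-validation sketch.
def cross_validate(dataset, n, build_model, accuracy):
    folds = [dataset[k::n] for k in range(n)]   # n roughly equal-sized subsets D_1..D_n
    scores = []
    for k in range(n):
        train = [row for j in range(n) if j != k for row in folds[j]]
        model = build_model(train)                # train on the union of the other folds
        scores.append(accuracy(model, folds[k]))  # test on the held-out fold D_k
    return sum(scores) / n                        # average of A_1..A_n
```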

- (5 points) Assume that we construct a model by applying the data mining technique
to the full dataset
*D*. Let *T* be the accuracy obtained by testing this model on a given, separate test dataset. On the other hand, let *A* be the accuracy obtained by using *n*-fold cross-validation on the dataset *D*. How do *A* and *T* compare with each other? Is *A* always higher than *T*? Is *A* always the same as *T*? Is *A* always lower than *T*? Or none of the above? Explain your answer in detail.

**Solution:** None of the above. *n*-fold cross-validation won't always return a higher (similarly, lower, or equal) classification accuracy than testing over a separate test dataset; it simply produces a more reliable estimate of the accuracy of the data mining technique over the given dataset.