@relation balloons
@attribute size {large, small}
@attribute act {stretch, dip}
@attribute age {adult, child}
@attribute inflated {T, F}
@data
small, stretch, adult, T
small, stretch, child, F
small, dip, adult, F
small, dip, child, F
large, stretch, adult, T
large, stretch, child, F
large, dip, adult, F
large, dip, child, F
Construct the FULL decision tree for this dataset USING THE ID3 ALGORITHM. Show all the steps of the entropy calculations.
For your convenience, the base-2 logarithms of selected values are provided (approximate values).
x        1/2   1/3   2/3   1/4   3/4   1/5   2/5   3/5   1/6   5/6   1/7   2/7   3/7   4/7   1
log2(x)  -1    -1.5  -0.6  -2    -0.4  -2.3  -1.3  -0.7  -2.5  -0.2  -2.8  -1.8  -1.2  -0.8  0
SOLUTIONS:

Let's compute the entropy of the 3 predicting attributes with respect to the target attribute ("inflated"). By convention, 0*log2(0) is taken to be 0.

SIZE:
  small:   (4/8)*[ - (1/4)*log2(1/4) - (3/4)*log2(3/4) ] = 0.4
  large:   (4/8)*[ - (1/4)*log2(1/4) - (3/4)*log2(3/4) ] = 0.4
                                                   total = 0.8
ACT:
  stretch: (4/8)*[ - (2/4)*log2(2/4) - (2/4)*log2(2/4) ] = 0.5
  dip:     (4/8)*[ - (0/4)*log2(0/4) - (4/4)*log2(4/4) ] = 0
                                                   total = 0.5
AGE:
  adult:   (4/8)*[ - (2/4)*log2(2/4) - (2/4)*log2(2/4) ] = 0.5
  child:   (4/8)*[ - (0/4)*log2(0/4) - (4/4)*log2(4/4) ] = 0
                                                   total = 0.5

ACT and AGE have the same (lowest) entropy, so we can choose either one as the root of our decision tree. I'll choose ACT:

            ACT
           /   \
  stretch /     \ dip
         /       \
   2 T's, 2 F's  4 F's

The node at the end of the ACT=dip branch contains only F's, so we convert that node into a leaf that predicts F. Since the node at the end of the ACT=stretch branch contains both T's and F's, we need to select an attribute to split it further. Here are the 4 instances under consideration:

small, stretch, adult, T
small, stretch, child, F
large, stretch, adult, T
large, stretch, child, F

We compute the entropies of SIZE and AGE with respect to INFLATED on this smaller dataset:

SIZE:
  small: (2/4)*[ - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.5
  large: (2/4)*[ - (1/2)*log2(1/2) - (1/2)*log2(1/2) ] = 0.5
                                                 total = 1.0
AGE:
  adult: (2/4)*[ - (2/2)*log2(2/2) - (0/2)*log2(0/2) ] = 0
  child: (2/4)*[ - (0/2)*log2(0/2) - (2/2)*log2(2/2) ] = 0
                                                 total = 0

The attribute with the lowest entropy is AGE, so it is used to split the node under consideration:

            ACT
           /   \
  stretch /     \ dip
         /       \
       AGE        F
      /   \
adult/     \ child
    /       \
   T         F

This completes the construction of the decision tree.
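The entropy calculations above can be checked with a short Python sketch (illustrative only; the variable and function names are my own). It computes the weighted average entropy of each candidate split, exactly as in the hand calculation, using the 0*log2(0) = 0 convention:

```python
from math import log2

# The balloons dataset from the exercise: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}  # target "inflated" is index 3

def entropy(rows):
    """Entropy of the target labels, with the 0*log2(0) = 0 convention."""
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[3]] = counts.get(r[3], 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

def split_entropy(rows, attr):
    """Weighted average entropy after splitting on `attr` (lower is better)."""
    i = ATTRS[attr]
    total = 0.0
    for value in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == value]
        total += len(subset) / len(rows) * entropy(subset)
    return total

for a in ATTRS:
    # act and age tie at 0.5; size is ~0.81 (0.8 with the truncated log table)
    print(a, round(split_entropy(DATA, a), 2))
```

Note that exact logarithms give 0.81 for SIZE where the provided (truncated) table gives 0.8; the ranking of the attributes is the same either way.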
@relation balloons
@attribute size {large, small}
@attribute act {stretch, dip}
@attribute age {adult, child}
@attribute inflated {T, F}
@data
small, stretch, adult, T
small, stretch, child, F
small, dip, adult, F
small, dip, child, F
large, stretch, adult, T
large, stretch, child, F
large, dip, adult, F
large, dip, child, F
Assume that we want to construct classification rules for this dataset.
Solutions:

We begin with the rule:

IF ? THEN inflated=T

This rule is not perfect, so we look for the best attribute-value pair to add to its antecedent:

p/t   CANDIDATE CONDITIONS:
1/4   size=small
1/4   size=large
2/4   act=stretch
0/4   act=dip
2/4   age=adult
0/4   age=child

Since act=stretch and age=adult share the maximum p/t ratio among the conditions, we can select either one, say act=stretch. The resulting rule:

IF act=stretch THEN inflated=T

The rule is not perfect, so we look for a second condition to add to its antecedent:

p/t   CANDIDATE CONDITIONS:
1/2   act=stretch and size=small
1/2   act=stretch and size=large
2/2   act=stretch and age=adult
0/2   act=stretch and age=child

The best condition to add to act=stretch is age=adult, resulting in the rule:

IF act=stretch and age=adult THEN inflated=T

The rule is now perfect, as its accuracy over the training data is 100%, so we are done constructing this rule. We now remove the dataset instances covered by it. Since no instances with inflated=T remain in the dataset, we are done constructing rules that predict inflated=T. The resulting set of rules consists of the single rule:

IF act=stretch and age=adult THEN inflated=T
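The p/t scoring of candidate conditions can be sketched in Python (a minimal illustration; the helper names are my own). For each condition attr=value, t counts the instances it covers and p counts the covered instances with the target class:

```python
# One Prism scoring step on the balloons dataset: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}  # target "inflated" is index 3

def pt_scores(rows, target):
    """Map each candidate condition (attr, value) to its (p, t) counts."""
    scores = {}
    for attr, i in ATTRS.items():
        for value in {r[i] for r in rows}:
            covered = [r for r in rows if r[i] == value]
            p = sum(1 for r in covered if r[3] == target)
            scores[(attr, value)] = (p, len(covered))
    return scores

for (attr, value), (p, t) in sorted(pt_scores(DATA, "T").items()):
    print(f"{p}/{t}  {attr}={value}")
```

Running this reproduces the candidate table above: act=stretch and age=adult tie at 2/4, and all other conditions score lower.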
Assume that the perfect rule

IF act=dip THEN inflated=F

has just been constructed. Follow the Prism sequential covering algorithm to construct the remaining classification rules for the target inflated=F. Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.
Solutions:

After the rule IF act=dip THEN inflated=F is constructed, all the instances correctly classified by it are removed from consideration. The remaining instances are:

small, stretch, adult, T
small, stretch, child, F
large, stretch, adult, T
large, stretch, child, F

Now we start a new rule:

IF ? THEN inflated=F

This rule is not perfect, so we look for the best attribute-value pair to add to its antecedent:

p/t   CANDIDATE CONDITIONS:
1/2   size=small
1/2   size=large
2/4   act=stretch
0/2   age=adult
2/2   age=child

The condition with the best p/t ratio is age=child:

IF age=child THEN inflated=F

The rule is now perfect, as its accuracy over the training data is 100%, so we are done constructing this rule. We now remove the dataset instances covered by it. Since no instances with inflated=F remain in the dataset, we are done constructing rules that predict inflated=F. The resulting set of rules is:

IF act=dip THEN inflated=F
IF age=child THEN inflated=F
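The full sequential covering loop can be sketched as follows (an illustrative sketch under my own naming; ties in p/t are broken by whichever condition is examined first, which on this dataset matches the choices made above):

```python
# Prism-style sequential covering on the balloons dataset.
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}  # target "inflated" is index 3

def learn_rules(rows, target):
    """Grow each rule condition by condition using the p/t measure until it
    is perfect, then remove the instances it covers; repeat while instances
    of the target class remain."""
    rules = []
    remaining = list(rows)
    while any(r[3] == target for r in remaining):
        conds, covered = [], remaining
        while any(r[3] != target for r in covered):    # rule not yet perfect
            best = None
            for attr, i in ATTRS.items():
                if any(a == attr for a, _ in conds):
                    continue                           # attribute already used
                for value in {r[i] for r in covered}:
                    sub = [r for r in covered if r[i] == value]
                    p = sum(1 for r in sub if r[3] == target)
                    key = (p / len(sub), p)            # p/t ratio, then coverage
                    if best is None or key > best[0]:
                        best = (key, (attr, value), sub)
            conds.append(best[1])
            covered = best[2]
        rules.append(conds)
        remaining = [r for r in remaining
                     if not all(r[ATTRS[a]] == v for a, v in conds)]
    return rules

for conds in learn_rules(DATA, "F"):
    print("IF", " and ".join(f"{a}={v}" for a, v in conds), "THEN inflated=F")
```

On this dataset the loop produces exactly the two rules derived above: IF act=dip THEN inflated=F and IF age=child THEN inflated=F.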
Assume that we want to mine association rules with minimum support 0.25 (that is, an itemset has to be present in at least 2 data instances).

@relation balloons
@attribute size {large, small}
@attribute act {stretch, dip}
@attribute age {adult, child}
@attribute inflated {T, F}
@data
small, stretch, adult, T
small, stretch, child, F
small, dip, adult, F
small, dip, child, F
large, stretch, adult, T
large, stretch, child, F
large, dip, adult, F
large, dip, child, F
LEVEL 1

SUPPORT  ITEMSETS
4/8      {size=small}
4/8      {size=large}
4/8      {act=stretch}
4/8      {act=dip}
4/8      {age=adult}
4/8      {age=child}
2/8      {inflated=T}
6/8      {inflated=F}

LEVEL 2

SUPPORT  ITEMSETS
2/8      {size=small, act=stretch}
2/8      {size=small, act=dip}
2/8      {size=small, age=adult}
2/8      {size=small, age=child}
1/8      {size=small, inflated=T}
3/8      {size=small, inflated=F}
2/8      {size=large, act=stretch}
2/8      {size=large, act=dip}
2/8      {size=large, age=adult}
2/8      {size=large, age=child}
1/8      {size=large, inflated=T}
3/8      {size=large, inflated=F}
2/8      {act=stretch, age=adult}
2/8      {act=stretch, age=child}
2/8      {act=stretch, inflated=T}
2/8      {act=stretch, inflated=F}
2/8      {act=dip, age=adult}
2/8      {act=dip, age=child}
0/8      {act=dip, inflated=T}
4/8      {act=dip, inflated=F}
2/8      {age=adult, inflated=T}
2/8      {age=adult, inflated=F}
0/8      {age=child, inflated=T}
4/8      {age=child, inflated=F}

LEVEL 3

Compute all the candidate and frequent itemsets for level 3. Use both the join and the subset pruning criteria to make the process more efficient.
SOLUTIONS:

SUPPORT  ITEMSETS
1/8      {size=small, act=stretch, age=adult}
1/8      {size=small, act=stretch, age=child}
1/8      {size=small, act=stretch, inflated=F}
1/8      {size=small, act=dip, age=adult}
1/8      {size=small, act=dip, age=child}
2/8      {size=small, act=dip, inflated=F}
1/8      {size=small, age=adult, inflated=F}
2/8      {size=small, age=child, inflated=F}
1/8      {size=large, act=stretch, age=adult}
1/8      {size=large, act=stretch, age=child}
1/8      {size=large, act=stretch, inflated=F}
1/8      {size=large, act=dip, age=adult}
1/8      {size=large, act=dip, age=child}
2/8      {size=large, act=dip, inflated=F}
1/8      {size=large, age=adult, inflated=F}
2/8      {size=large, age=child, inflated=F}
2/8      {act=stretch, age=adult, inflated=T}
0/8      {act=stretch, age=adult, inflated=F}
XXX      {act=stretch, age=child, inflated=T}  --> There is no need to check the support of this itemset, as it is removed by subset pruning: its subset {age=child, inflated=T} is not frequent.
2/8      {act=stretch, age=child, inflated=F}
2/8      {act=dip, age=adult, inflated=F}
2/8      {act=dip, age=child, inflated=F}

Hence, the frequent 3-itemsets are:

2/8      {size=small, act=dip, inflated=F}
2/8      {size=small, age=child, inflated=F}
2/8      {size=large, act=dip, inflated=F}
2/8      {size=large, age=child, inflated=F}
2/8      {act=stretch, age=adult, inflated=T}
2/8      {act=stretch, age=child, inflated=F}
2/8      {act=dip, age=adult, inflated=F}
2/8      {act=dip, age=child, inflated=F}

LEVEL 4

Compute all the candidate and frequent itemsets for level 4. Use both the join and the subset pruning criteria to make the process more efficient.
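The join and subset-prune steps can be sketched in Python (an illustrative sketch; the function and variable names are my own). Items are kept in the attribute order used in the tables above (size < act < age < inflated), which is what makes the join condition well defined:

```python
from itertools import combinations

# Canonical order of the items, matching the attribute order of the dataset.
ITEM_ORDER = ["size=small", "size=large", "act=stretch", "act=dip",
              "age=adult", "age=child", "inflated=T", "inflated=F"]
RANK = {item: i for i, item in enumerate(ITEM_ORDER)}

def apriori_gen(frequent_prev):
    """Join frequent (k-1)-itemsets that share their first k-2 items, then
    prune candidates having any infrequent (k-1)-subset."""
    freq_set = {tuple(s) for s in frequent_prev}
    candidates = []
    for a, b in combinations(frequent_prev, 2):
        if a[:-1] == b[:-1]:                         # join condition
            tail = sorted((a[-1], b[-1]), key=RANK.get)
            cand = a[:-1] + tuple(tail)
            # subset prune: every (k-1)-subset of cand must itself be frequent
            if all(sub in freq_set for sub in combinations(cand, len(a))):
                candidates.append(cand)
    return candidates

# The eight frequent 3-itemsets found above, in canonical item order:
FREQ3 = [
    ("size=small", "act=dip", "inflated=F"),
    ("size=small", "age=child", "inflated=F"),
    ("size=large", "act=dip", "inflated=F"),
    ("size=large", "age=child", "inflated=F"),
    ("act=stretch", "age=adult", "inflated=T"),
    ("act=stretch", "age=child", "inflated=F"),
    ("act=dip", "age=adult", "inflated=F"),
    ("act=dip", "age=child", "inflated=F"),
]
print(apriori_gen(FREQ3))  # [] -- no pair satisfies the join condition
```

No two of the frequent 3-itemsets agree on their first two items, so the join step produces nothing and level 4 has no candidates, as stated in the next solution.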
Solutions: Note that no pair of frequent 3-itemsets satisfies the join condition (no two of them agree on their first two items). Hence there are no candidate itemsets, and therefore no frequent itemsets, at Level 4.
act=stretch -> age=adult, inflated=T

Compute the confidence of this rule. Show the steps of your calculations.
Solution:

confidence( act=stretch -> age=adult, inflated=T )
  = support(act=stretch, age=adult, inflated=T) / support(act=stretch)
  = (2/8) / (4/8)
  = 2/4 = 0.5 (or 50%)
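The calculation can be checked directly against the dataset (a small sketch; the helper names are my own):

```python
# Rule confidence from raw supports over the eight balloons instances,
# with items written as "attr=value" strings.
ROWS = [
    {"size=small", "act=stretch", "age=adult", "inflated=T"},
    {"size=small", "act=stretch", "age=child", "inflated=F"},
    {"size=small", "act=dip", "age=adult", "inflated=F"},
    {"size=small", "act=dip", "age=child", "inflated=F"},
    {"size=large", "act=stretch", "age=adult", "inflated=T"},
    {"size=large", "act=stretch", "age=child", "inflated=F"},
    {"size=large", "act=dip", "age=adult", "inflated=F"},
    {"size=large", "act=dip", "age=child", "inflated=F"},
]

def support(itemset):
    """Fraction of instances containing every item of the itemset."""
    return sum(1 for row in ROWS if itemset <= row) / len(ROWS)

def confidence(antecedent, consequent):
    """support(antecedent union consequent) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"act=stretch"}, {"age=adult", "inflated=T"}))  # 0.5
```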
Solution:

For n-fold cross-validation, the dataset instances are divided into n subsets of roughly the same size. Let D1,...,Dn denote those n subsets. Then the following procedure is followed:

For k := 1 to n do
    Construct a model using the union of D1,...,Dk-1,Dk+1,...,Dn as the training set
    Calculate the accuracy of the model using Dk as the test set; call that accuracy Ak
end-For

Return the average of A1,...,An as the accuracy of the data mining technique over the given dataset.
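The procedure can be sketched as follows (a minimal sketch; `train` and `evaluate` are hypothetical stand-ins for any learner and accuracy measure):

```python
import random

def cross_validate(instances, n, train, evaluate):
    """n-fold cross-validation: train on n-1 folds, test on the held-out
    fold, and return the average of the n accuracies."""
    data = list(instances)
    random.shuffle(data)                      # randomize fold assignment
    folds = [data[i::n] for i in range(n)]    # n subsets of roughly equal size
    accuracies = []
    for k in range(n):
        training = [row for j, fold in enumerate(folds) if j != k
                    for row in fold]
        model = train(training)               # model built without fold k
        accuracies.append(evaluate(model, folds[k]))
    return sum(accuracies) / n

# Toy check on the balloons data with a majority-class learner; with only
# 2 T's among 8 instances, every training set's majority label is "F".
rows = [
    ("small", "stretch", "adult", "T"), ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"), ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"), ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"), ("large", "dip", "child", "F"),
]
majority = lambda tr: max({r[-1] for r in tr}, key=[r[-1] for r in tr].count)
accuracy = lambda label, test: sum(r[-1] == label for r in test) / len(test)
print(cross_validate(rows, 4, majority, accuracy))  # 0.75 regardless of shuffle
```

With equal-sized folds, averaging the per-fold accuracies of the constant "F" predictor always gives 6/8 = 0.75 here, whatever the random fold assignment.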
Solution: n-fold cross-validation won't always return a higher (or lower, or equal) classification accuracy than testing on a separate test dataset does. It simply produces a more reliable estimate of the accuracy of a data mining technique over a given dataset.