@relation balloons
@attribute size {large, small}
@attribute act {stretch, dip}
@attribute age {adult, child}
@attribute inflated {T, F}
@data
small, stretch, adult, T
small, stretch, child, F
small, dip, adult, F
small, dip, child, F
large, stretch, adult, T
large, stretch, child, F
large, dip, adult, F
large, dip, child, F
Construct the FULL decision tree for this dataset USING THE ID3 ALGORITHM. Show all the steps of the entropy calculations.
For your convenience, the base-2 logarithms of selected values are provided.
x        1/2   1/3   2/3   1/4   3/4   1/5   2/5   3/5   1/6   5/6   1/7   2/7   3/7   4/7   1
log2(x)  -1    -1.5  -0.6  -2    -0.4  -2.3  -1.3  -0.7  -2.5  -0.2  -2.8  -1.8  -1.2  -0.8  0
SOLUTIONS:
Let's compute the entropy of the 3 predicting attributes with respect to
the target attribute ("inflated"):
INFLATE T F
SIZE
small (4/8)*[ - (1/4)*log2(1/4) - (3/4)*log2(3/4)] = 0.4
large (4/8)*[ - (1/4)*log2(1/4) - (3/4)*log2(3/4)] = 0.4
-------
0.8
ACT
stretch (4/8)*[ - (2/4)*log2(2/4) - (2/4)*log2(2/4)] = 0.5
dip (4/8)*[ - (0/4)*log2(0/4) - (4/4)*log2(4/4)] = 0
-------
0.5
AGE
adult (4/8)*[ - (2/4)*log2(2/4) - (2/4)*log2(2/4)] = 0.5
child (4/8)*[ - (0/4)*log2(0/4) - (4/4)*log2(4/4)] = 0
-------
0.5
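The weighted entropies above can be checked with a short script. This is
only a sketch: the names DATA, ATTRS, entropy and weighted_entropy are
illustrative choices, not part of the exam. It uses exact logarithms, so
SIZE comes out near 0.811 where the rounded log table gives 0.8.

```python
from collections import Counter
from math import log2

# The eight instances of the balloons dataset: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}  # column index of each predictor
TARGET = 3                               # column index of "inflated"


def entropy(labels):
    """Entropy in bits of a list of class labels.

    Only classes that actually occur are counted, so the 0*log2(0)
    terms of the hand calculation never arise (they are taken as 0).
    """
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def weighted_entropy(rows, col):
    """Sum over the values v of column col of (|rows_v| / |rows|) * entropy(rows_v)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[TARGET])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


for name, col in ATTRS.items():
    print(f"{name}: {weighted_entropy(DATA, col):.3f}")
```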
(Terms of the form 0*log2(0) are taken to be 0.)
Act and Age have the same, lowest entropy (0.5, versus 0.8 for Size).
We can choose either one as the root of our decision tree. I'll choose Act:
              ACT
             /   \
   stretch  /     \  dip
           /       \
 2 T's, 2 F's     4 F's
Since the node at the end of the ACT=dip branch contains only F's, we
convert that node into a leaf that predicts F.
Since the node at the end of the ACT=stretch branch contains both
F's and T's, we need to select an attribute to split it further:
Here are the 4 instances under consideration:
small, stretch, adult, T
small, stretch, child, F
large, stretch, adult, T
large, stretch, child, F
We compute the entropies of SIZE and AGE w.r.t. INFLATE in this
smaller dataset:
INFLATE T F
SIZE
small (2/4)*[ - (1/2)*log2(1/2) - (1/2)*log2(1/2)] = 0.5
large (2/4)*[ - (1/2)*log2(1/2) - (1/2)*log2(1/2)] = 0.5
-------
1.0
AGE
adult (2/4)*[ - (2/2)*log2(2/2) - (0/2)*log2(0/2)] = 0
child (2/4)*[ - (0/2)*log2(0/2) - (2/2)*log2(2/2)] = 0
-------
0
AGE is the attribute with the lowest entropy, so it is used
to split the node under consideration:
              ACT
             /   \
   stretch  /     \  dip
           /       \
         AGE        F
        /   \
 adult /     \ child
      /       \
     T         F
This completes the construction of the decision tree.
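The whole construction can be sketched as a small recursive procedure.
This is a minimal illustration under stated assumptions, not a reference
implementation: the function names are made up for this example, the tree
is returned as nested dictionaries, and ties are broken by taking the first
attribute in the order size, act, age (which happens to pick Act at the
root, as in the solution above).

```python
from collections import Counter
from math import log2

# The eight instances of the balloons dataset: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}  # column index of each predictor
TARGET = 3                               # column index of "inflated"


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def split(rows, col):
    """Group rows by their value in column col, keeping first-seen order."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    return groups


def weighted_entropy(rows, col):
    """Entropy of the target after splitting rows on column col."""
    return sum(
        len(g) / len(rows) * entropy([r[TARGET] for r in g])
        for g in split(rows, col).values()
    )


def id3(rows, attrs):
    """Return a leaf label, or {attribute: {value: subtree}}."""
    labels = [r[TARGET] for r in rows]
    if len(set(labels)) == 1:  # pure node: make it a leaf
        return labels[0]
    if not attrs:              # nothing left to split on: majority leaf
        return Counter(labels).most_common(1)[0][0]
    # min() keeps the first attribute among ties (assumed tie-breaking rule).
    best = min(attrs, key=lambda a: weighted_entropy(rows, ATTRS[a]))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3(g, rest) for v, g in split(rows, ATTRS[best]).items()}}


tree = id3(DATA, list(ATTRS))
print(tree)
```

With a different tie-breaking rule the root could be Age instead, giving a
tree of the same size with the roles of Act and Age swapped.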
Assume that we want to construct classification rules for this dataset.
Solutions:
We begin with the rule:
    IF ? THEN inflated=T
This rule is not perfect and so we look for the best attribute-value pair
to add to the antecedent of the rule:
    p/t   CANDIDATE CONDITIONS:
    1/4   size=small
    1/4   size=large
    2/4   act=stretch
    0/4   act=dip
    2/4   age=adult
    0/4   age=child
Since both act=stretch and age=adult have the same, maximum p/t ratio
among the conditions, we can select any one of them, say act=stretch.
The resulting rule:
    IF act=stretch THEN inflated=T
The rule is not perfect and hence we look for a second condition to add
to the antecedent of the rule.
    p/t   CANDIDATE CONDITIONS:
    1/2   act=stretch and size=small
    1/2   act=stretch and size=large
    2/2   act=stretch and age=adult
    0/2   act=stretch and age=child
The best condition to add to act=stretch is age=adult, resulting in the rule:
    IF act=stretch and age=adult THEN inflated=T
The rule is now perfect as its accuracy over the training data is 100%.
Hence, we are done with the construction of this rule.
We now remove the dataset instances covered by this rule. Since no
instances with inflated=T remain in the dataset, we are done with the
construction of rules predicting inflated=T.
The resulting set of rules consists of the rule:
    IF act=stretch and age=adult THEN inflated=T
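The two p/t tables for inflated=T can be reproduced with a few lines of
code. This is a sketch only; the helper name candidates and its parameters
are illustrative and not part of the Prism description. For each candidate
condition, t is the number of instances the condition covers and p is how
many of those have the target class.

```python
# The eight instances of the balloons dataset: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}  # column index of each predictor
TARGET = 3                               # column index of "inflated"


def candidates(rows, used, target):
    """Return {(attribute, value): (p, t)} for attributes not yet in the rule."""
    stats = {}
    for attr, col in ATTRS.items():
        if attr in used:
            continue
        # dict.fromkeys keeps the distinct values in first-seen order.
        for value in dict.fromkeys(r[col] for r in rows):
            covered = [r for r in rows if r[col] == value]
            p = sum(r[TARGET] == target for r in covered)
            stats[(attr, value)] = (p, len(covered))
    return stats


# First condition: all instances, empty antecedent.
first = candidates(DATA, used=set(), target="T")
# Second condition: only the instances covered by act=stretch.
stretch_rows = [r for r in DATA if r[ATTRS["act"]] == "stretch"]
second = candidates(stretch_rows, used={"act"}, target="T")

for table in (first, second):
    for (attr, value), (p, t) in table.items():
        print(f"{p}/{t}  {attr}={value}")
    print()
```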
Assume that the perfect rule
    IF act=dip THEN inflated=F
has just been constructed. Follow the Prism sequential covering algorithm
to construct the remaining classification rules for the target inflated=F.
Use the p/t measure to choose the best conditions for the rules.
SHOW ALL THE STEPS OF YOUR CALCULATIONS.
Solutions:
After the rule IF act=dip THEN inflated=F is constructed, all the instances
correctly classified by this rule are removed from consideration.
The remaining set of instances is:
    small, stretch, adult, T
    small, stretch, child, F
    large, stretch, adult, T
    large, stretch, child, F
Now we start with a new rule:
    IF ? THEN inflated=F
This rule is not perfect and so we look for the best attribute-value pair
to add to the antecedent of the rule (act=dip no longer occurs in the
remaining instances, so it is not a candidate):
    p/t   CANDIDATE CONDITIONS:
    1/2   size=small
    1/2   size=large
    2/4   act=stretch
    0/2   age=adult
    2/2   age=child
The condition with the best p/t ratio is age=child:
    IF age=child THEN inflated=F
The rule is now perfect as its accuracy over the training data is 100%.
Hence, we are done with the construction of this rule.
We now remove the dataset instances covered by this rule. Since no
instances with inflated=F remain in the dataset, we are done with the
construction of rules predicting inflated=F.
The resulting set of rules is:
    IF act=dip THEN inflated=F
    IF age=child THEN inflated=F
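The covering loop followed above can be sketched in code. This is an
illustration under assumptions, not a definitive Prism implementation:
the function names are invented for this example, and ties in p/t are
broken first by larger p and then by the first candidate found in the
attribute order size, act, age. With that tie-breaking, the sketch builds
IF act=dip first when run from scratch for inflated=F, which matches the
rule the question takes as given.

```python
# The eight instances of the balloons dataset: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}  # column index of each predictor
TARGET = 3                               # column index of "inflated"


def learn_one_rule(rows, target):
    """Greedily add conditions until the rule covers only `target` instances."""
    rule, covered = {}, rows
    while any(r[TARGET] != target for r in covered) and len(rule) < len(ATTRS):
        best, best_key = None, None
        for attr, col in ATTRS.items():
            if attr in rule:
                continue
            for value in dict.fromkeys(r[col] for r in covered):
                subset = [r for r in covered if r[col] == value]
                p = sum(r[TARGET] == target for r in subset)
                key = (p / len(subset), p)  # p/t first, then coverage p
                if best_key is None or key > best_key:
                    best, best_key = (attr, value), key
        rule[best[0]] = best[1]
        covered = [r for r in covered if r[ATTRS[best[0]]] == best[1]]
    return rule, covered


def prism(rows, target):
    """Build rules for `target` until no instance of that class remains."""
    rules, remaining = [], list(rows)
    while any(r[TARGET] == target for r in remaining):
        rule, covered = learn_one_rule(remaining, target)
        rules.append(rule)
        remaining = [r for r in remaining if r not in covered]
    return rules


rules_f = prism(DATA, "F")
rules_t = prism(DATA, "T")
for target, rules in (("F", rules_f), ("T", rules_t)):
    for rule in rules:
        conds = " and ".join(f"{a}={v}" for a, v in rule.items())
        print(f"IF {conds} THEN inflated={target}")
```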
Assume that we want to mine association rules with
minimum support: 0.25 (that is, the itemset has to be present
in at least 2 data instances).
LEVEL 1
SUPPORT ITEMSETS
4/8 {size=small}
4/8 {size=large}
4/8 {act=stretch}
4/8 {act=dip}
4/8 {age=adult}
4/8 {age=child}
2/8 {inflated=T}
6/8 {inflated=F}
LEVEL 2
SUPPORT ITEMSETS
2/8 {size=small, act=stretch}
2/8 {size=small, act=dip}
2/8 {size=small, age=adult}
2/8 {size=small, age=child}
1/8 {size=small, inflated=T}
3/8 {size=small, inflated=F}
2/8 {size=large, act=stretch}
2/8 {size=large, act=dip}
2/8 {size=large, age=adult}
2/8 {size=large, age=child}
1/8 {size=large, inflated=T}
3/8 {size=large, inflated=F}
2/8 {act=stretch, age=adult}
2/8 {act=stretch, age=child}
2/8 {act=stretch, inflated=T}
2/8 {act=stretch, inflated=F}
2/8 {act=dip, age=adult}
2/8 {act=dip, age=child}
0/8 {act=dip, inflated=T}
4/8 {act=dip, inflated=F}
2/8 {age=adult, inflated=T}
2/8 {age=adult, inflated=F}
0/8 {age=child, inflated=T}
4/8 {age=child, inflated=F}
LEVEL 3 Compute all the candidate and frequent itemsets
for level 3. Use both the join and the subset pruning
criteria to make the process more efficient.
SUPPORT ITEMSETS
SOLUTIONS:
1/8 {size=small, act=stretch, age=adult}
1/8 {size=small, act=stretch, age=child}
1/8 {size=small, act=stretch, inflated=F}
1/8 {size=small, act=dip, age=adult}
1/8 {size=small, act=dip, age=child}
2/8 {size=small, act=dip, inflated=F}
1/8 {size=small, age=adult, inflated=F}
2/8 {size=small, age=child, inflated=F}
1/8 {size=large, act=stretch, age=adult}
1/8 {size=large, act=stretch, age=child}
1/8 {size=large, act=stretch, inflated=F}
1/8 {size=large, act=dip, age=adult}
1/8 {size=large, act=dip, age=child}
2/8 {size=large, act=dip, inflated=F}
1/8 {size=large, age=adult, inflated=F}
2/8 {size=large, age=child, inflated=F}
2/8 {act=stretch, age=adult, inflated=T}
0/8 {act=stretch, age=adult, inflated=F}
XXX {act=stretch, age=child, inflated=T} --> There is no need
to check the support of this itemset as
it is removed by subset prune because
{age=child, inflated=T} is not frequent.
2/8 {act=stretch, age=child, inflated=F}
2/8 {act=dip, age=adult, inflated=F}
2/8 {act=dip, age=child, inflated=F}
Hence, the frequent 3-itemsets are:
2/8 {size=small, act=dip, inflated=F}
2/8 {size=small, age=child, inflated=F}
2/8 {size=large, act=dip, inflated=F}
2/8 {size=large, age=child, inflated=F}
2/8 {act=stretch, age=adult, inflated=T}
2/8 {act=stretch, age=child, inflated=F}
2/8 {act=dip, age=adult, inflated=F}
2/8 {act=dip, age=child, inflated=F}
LEVEL 4 Compute all the candidate and frequent itemsets
for level 4. Use both the join and the subset pruning
criteria to make the process more efficient.
SUPPORT ITEMSETS
SOLUTIONS:
Note that no pair of frequent 3-itemsets satisfies the join condition.
Hence there are no candidate and no frequent itemsets in Level 4.
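The level-wise search above can be sketched as follows. This is a minimal
illustration, assuming each item is encoded as an (attribute index, value)
pair and each itemset as a sorted tuple of items; the names TRANSACTIONS,
MIN_COUNT, support_count and apriori are choices made for this example.
The item order used here (values sorted alphabetically within an attribute)
differs from the tables above, which changes the order in which candidates
are generated but not the frequent itemsets found.

```python
from itertools import combinations

# The eight instances of the balloons dataset: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
NAMES = ("size", "act", "age", "inflated")
# Each instance becomes a set of (attribute index, value) items.
TRANSACTIONS = [frozenset(enumerate(row)) for row in DATA]
MIN_COUNT = 2  # minimum support 0.25 over 8 instances


def support_count(itemset):
    """Number of instances that contain every item of the itemset."""
    return sum(set(itemset) <= t for t in TRANSACTIONS)


def apriori():
    """Return the frequent itemsets, one list per level (1, 2, 3, ...)."""
    items = sorted({item for t in TRANSACTIONS for item in t})
    level = [(i,) for i in items if support_count((i,)) >= MIN_COUNT]
    frequent = []
    while level:
        frequent.append(level)
        known = set(level)
        cands = []
        for a, b in combinations(level, 2):
            # Join: the two itemsets must agree on all but their last item.
            if a[:-1] != b[:-1]:
                continue
            cand = a + (b[-1],)
            # Subset prune: every subset one item smaller must be frequent.
            if all(sub in known for sub in combinations(cand, len(a))):
                cands.append(cand)
        level = [c for c in cands if support_count(c) >= MIN_COUNT]
    return frequent


levels = apriori()
for k, itemsets in enumerate(levels, start=1):
    print(f"LEVEL {k}: {len(itemsets)} frequent itemsets")
for itemset in levels[-1]:
    names = ", ".join(f"{NAMES[i]}={v}" for i, v in itemset)
    print(f"{support_count(itemset)}/8 {{{names}}}")
```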
act=stretch -> age=adult, inflated=T
Compute the confidence of this rule. Show the steps of your
calculations.
Solution:
confidence( act=stretch -> age=adult, inflated=T )
= support (act=stretch, age=adult, inflated=T) / support (act=stretch)
= (2/8) / (4/8)
= 2/4
= 0.5 (or 50%)
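The same calculation can be written as a short check. This is a sketch;
the helpers support and confidence, and the encoding of an itemset as a
{column index: value} dictionary, are assumptions made for this example.

```python
from fractions import Fraction

# The eight instances of the balloons dataset: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]
SIZE, ACT, AGE, INFLATED = range(4)  # column indices


def support(items):
    """Fraction of instances matching every column: value pair in items."""
    hits = sum(all(row[c] == v for c, v in items.items()) for row in DATA)
    return Fraction(hits, len(DATA))


def confidence(antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    return support({**antecedent, **consequent}) / support(antecedent)


conf = confidence({ACT: "stretch"}, {AGE: "adult", INFLATED: "T"})
print(conf, float(conf))
```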
Solution:
For n-fold cross-validation, the dataset instances are divided into n
subsets, roughly of the same size. Let D1,...,Dn denote
those n subsets. Then the following procedure is followed:
For k := 1 to n do
Construct a model using the union of D1,...,Dk-1,Dk+1,...,Dn
as the training set
Calculate the accuracy of the model using Dk as the test set.
Let's call that accuracy Ak
end-For
Return the average of A1,...,An as the accuracy
of the data mining technique over the given dataset.
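The procedure above can be sketched in code. This is an illustration
under assumptions: majority_model is a hypothetical stand-in learner that
always predicts the most common training label, the folds are formed by
taking every n-th instance, and no shuffling is done so that the result is
reproducible (in practice the instances are usually shuffled first).

```python
from collections import Counter

# The eight instances of the balloons dataset: (size, act, age, inflated).
DATA = [
    ("small", "stretch", "adult", "T"),
    ("small", "stretch", "child", "F"),
    ("small", "dip", "adult", "F"),
    ("small", "dip", "child", "F"),
    ("large", "stretch", "adult", "T"),
    ("large", "stretch", "child", "F"),
    ("large", "dip", "adult", "F"),
    ("large", "dip", "child", "F"),
]


def majority_model(train):
    """Hypothetical learner: always predict the most common training label."""
    label = Counter(row[-1] for row in train).most_common(1)[0][0]
    return lambda row: label


def cross_validate(rows, n, build_model):
    """Average accuracy over n folds; the class label is the last column."""
    folds = [rows[k::n] for k in range(n)]  # D1..Dn, roughly the same size
    accuracies = []
    for k in range(n):
        # Train on the union of all folds except Dk.
        train = [r for j, fold in enumerate(folds) if j != k for r in fold]
        model = build_model(train)
        # Test on Dk and record the accuracy Ak.
        test = folds[k]
        accuracies.append(sum(model(r) == r[-1] for r in test) / len(test))
    return sum(accuracies) / n


accuracy = cross_validate(DATA, 4, majority_model)
print(accuracy)
```

Any function that takes a training set and returns a predictor can be
passed in place of majority_model.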
Solution: n-fold cross-validation will not always return a higher classification accuracy than testing over a separate test dataset, nor will it always return a lower or an equal one. It just produces a more reliable estimate of the accuracy of a data mining technique over a given dataset.