CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2006
Solutions Exam 1 - November 20, 2006

By Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute


Problem I. Decision Trees (30 points)

Consider the following toy dataset adapted from the Balloons Dataset of the UCI Machine Learning Repository. Assume that the classification target is the attribute inflated.
@relation balloons

@attribute size              {large, small}
@attribute act               {stretch, dip}
@attribute age               {adult, child}
@attribute inflated          {T, F}

@data

small,	stretch,	adult,	T

small,	stretch,	child,	F

small,	dip,		adult,	F

small,	dip,		child,	F

large,	stretch,	adult,	T

large,	stretch,	child,	F

large,	dip,		adult,	F

large,	dip,		child,	F

Construct the FULL decision tree for this dataset USING THE ID3 ALGORITHM. Show all the steps of the entropy calculations.

For your convenience, base-2 logarithms of selected values are provided
(rounded to one decimal place).

   x       1/2   1/3   2/3   1/4   3/4   1/5   2/5   3/5   1/6   5/6   1/7   2/7   3/7   4/7    1
   log2(x)  -1  -1.6  -0.6    -2  -0.4  -2.3  -1.3  -0.7  -2.6  -0.3  -2.8  -1.8  -1.2  -0.8    0

SOLUTIONS:

Let's compute, for each of the 3 predicting attributes, the weighted
entropy of the target attribute ("inflated") after splitting on that
attribute:


      INFLATED           T                  F

SIZE
   
  small   (4/8)*[ - (1/4)*log2(1/4)  - (3/4)*log2(3/4)] = 0.4
  large   (4/8)*[ - (1/4)*log2(1/4)  - (3/4)*log2(3/4)] = 0.4
                                                         -------
                                                          0.8
ACT
   
  stretch (4/8)*[ - (2/4)*log2(2/4)  - (2/4)*log2(2/4)] = 0.5 
  dip     (4/8)*[ - (0/4)*log2(0/4)  - (4/4)*log2(4/4)] = 0    (taking 0*log2(0) = 0)
                                                         -------
                                                          0.5

AGE
   
  adult   (4/8)*[ - (2/4)*log2(2/4)  - (2/4)*log2(2/4)] = 0.5
  child   (4/8)*[ - (0/4)*log2(0/4)  - (4/4)*log2(4/4)] = 0
                                                         -------
                                                          0.5
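
These weighted entropies can be double-checked mechanically. Here is a
minimal Python sketch (assuming the dataset is encoded as a list of
(size, act, age, inflated) tuples; an illustration, not part of the exam
solution):

    from math import log2
    from collections import Counter

    # (size, act, age, inflated)
    data = [
        ("small", "stretch", "adult", "T"), ("small", "stretch", "child", "F"),
        ("small", "dip", "adult", "F"),     ("small", "dip", "child", "F"),
        ("large", "stretch", "adult", "T"), ("large", "stretch", "child", "F"),
        ("large", "dip", "adult", "F"),     ("large", "dip", "child", "F"),
    ]

    def entropy(labels):
        # class entropy; absent classes contribute 0, i.e. 0*log2(0) = 0
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def weighted_entropy(rows, i):
        # weighted average class entropy after splitting on attribute i
        return sum(len(sub) / len(rows) * entropy([r[3] for r in sub])
                   for v in set(r[i] for r in rows)
                   for sub in [[r for r in rows if r[i] == v]])

    for i, name in enumerate(["size", "act", "age"]):
        print(name, round(weighted_entropy(data, i), 3))
    # size 0.811   (0.8 above, because the provided log table is rounded)
    # act  0.5
    # age  0.5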

Act and Age tie for the lowest weighted entropy (0.5 each), so we can
choose either one as the root of our decision tree. I'll choose Act:


                           ACT
                        /       \ 
             stretch   /         \  dip  
                      /           \
	    2 T's, 2 F's         4 F's

The node at the end of the ACT=dip branch contains only F's, so we
convert that node into a leaf that predicts F.
Since the node at the end of the ACT=stretch branch contains both
T's and F's, we need to select an attribute to split it further:

Here are the 4 instances under consideration:


small,	stretch,	adult,	T

small,	stretch,	child,	F

large,	stretch,	adult,	T

large,	stretch,	child,	F


We compute the weighted entropies of SIZE and AGE w.r.t. INFLATED in
this smaller dataset:


      INFLATED           T                  F

SIZE
   
  small   (2/4)*[ - (1/2)*log2(1/2)  - (1/2)*log2(1/2)] = 0.5
  large   (2/4)*[ - (1/2)*log2(1/2)  - (1/2)*log2(1/2)] = 0.5
                                                         -------
                                                          1.0
AGE
   
  adult   (2/4)*[ - (2/2)*log2(2/2)  - (0/2)*log2(0/2)] = 0 
  child   (2/4)*[ - (0/2)*log2(0/2)  - (2/2)*log2(2/2)] = 0
                                                         -------
                                                          0

The attribute with the lowest entropy is AGE, so it is used to split
the node under consideration:


                           ACT
                        /       \ 
             stretch   /         \  dip  
                      /           \
		    AGE            F  
                   /   \
            adult /     \ child
                 /       \
                T         F

This completes the construction of the decision tree.
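
The same greedy procedure can be written as a short recursive program.
Below is a minimal, self-contained Python sketch of the ID3 recursion used
above (an illustration only; ties are broken by attribute order, so it
selects Act before Age, just as done above):

    from math import log2
    from collections import Counter

    # (size, act, age, inflated); index 3 is the target
    data = [
        ("small", "stretch", "adult", "T"), ("small", "stretch", "child", "F"),
        ("small", "dip", "adult", "F"),     ("small", "dip", "child", "F"),
        ("large", "stretch", "adult", "T"), ("large", "stretch", "child", "F"),
        ("large", "dip", "adult", "F"),     ("large", "dip", "child", "F"),
    ]
    NAMES = ["size", "act", "age"]

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def split_entropy(rows, i):
        # weighted average class entropy after splitting on attribute i
        return sum(len(sub) / len(rows) * entropy([r[3] for r in sub])
                   for v in set(r[i] for r in rows)
                   for sub in [[r for r in rows if r[i] == v]])

    def id3(rows, attrs):
        labels = set(r[3] for r in rows)
        if len(labels) == 1:                       # pure node -> leaf
            return labels.pop()
        best = min(attrs, key=lambda i: split_entropy(rows, i))
        rest = [a for a in attrs if a != best]
        return (NAMES[best],
                {v: id3([r for r in rows if r[best] == v], rest)
                 for v in set(r[best] for r in rows)})

    print(id3(data, [0, 1, 2]))
    # ('act', {'stretch': ('age', {'adult': 'T', 'child': 'F'}), 'dip': 'F'})
    # (branch order may vary) -- the same tree as constructed above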


Problem II. Classification Rules (30 points)

Consider the following toy dataset adapted from the Balloons Dataset of the UCI Machine Learning Repository. Assume that the classification target is the attribute inflated.
@relation balloons

@attribute size              {large, small}
@attribute act               {stretch, dip}
@attribute age               {adult, child}
@attribute inflated          {T, F}

@data

small,	stretch,	adult,	T

small,	stretch,	child,	F

small,	dip,		adult,	F

small,	dip,		child,	F

large,	stretch,	adult,	T

large,	stretch,	child,	F

large,	dip,		adult,	F

large,	dip,		child,	F

Assume that we want to construct classification rules for this dataset.

  1. (15 points) Follow the Prism sequential covering algorithm to construct classification rules for the target inflated=T. Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.
    Solutions:
    
    
    We begin with the rule:
    
      IF ? THEN inflated=T
    
    This rule is not perfect, so we look for the best attribute-value pair
    to add to its antecedent:
    
     p/t    CANDIDATE CONDITIONS: 
    
     1/4	size=small
     1/4	size=large
     2/4	act=stretch
     0/4	act=dip
     2/4	age=adult
     0/4	age=child
    
    Since act=stretch and age=adult tie for the maximum p/t ratio among
    the candidate conditions, we can select either one of them,
    say act=stretch. The resulting rule is:
    
      IF act=stretch THEN inflated=T
    
    The rule is not perfect and hence we look for a second condition 
    to add to the antecedent of the rule. 
    
     p/t    CANDIDATE CONDITIONS: 
    
     1/2	act=stretch and size=small
     1/2	act=stretch and size=large
     2/2	act=stretch and age=adult
     0/2	act=stretch and age=child
    
    The best condition to add to act=stretch is age=adult, resulting
    in the rule:
    
      IF act=stretch and age=adult THEN inflated=T
    
    The rule is now perfect as its accuracy over the training data is 100%.
    Hence, we are done with the construction of this rule.
    
    We now remove the dataset instances covered by this rule. Since no
    instances with inflated=T remain in the dataset, we are done
    with the construction of rules predicting inflated=T.
    
    The resulting set of rules consists of the single rule:

      IF act=stretch and age=adult THEN inflated=T

    (A Python sketch of this p/t-based condition selection is given after
    part 2 below.)
    
    

  2. (15 points)

    Assume that the perfect rule

      IF act=dip THEN inflated=F
    
    has just been constructed. Follow the Prism sequential covering algorithm to construct the remaining classification rules for the target inflated=F. Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.
    Solutions:
    
    After the rule IF act=dip THEN inflated=F is constructed, all the instances
    correctly classified by this rule are removed from consideration.
    The remaining set of instances is:
    
    small,	stretch,	adult,	T
    
    small,	stretch,	child,	F
    
    large,	stretch,	adult,	T
    
    large,	stretch,	child,	F
    
    
    Now we start with a new rule:
    
      IF ? THEN inflated=F
    
    This rule is not perfect, so we look for the best attribute-value pair
    to add to its antecedent:
    
     p/t    CANDIDATE CONDITIONS: 
    
     1/2	size=small
     1/2	size=large
     2/4	act=stretch
     0/2	age=adult
     2/2	age=child

    (act=dip is not listed because it covers none of the remaining
    instances, so it cannot be a useful condition.)
    
    
    The condition with the best p/t ratio is age=child:
    
      IF age=child THEN inflated=F
    
    The rule is now perfect as its accuracy over the training data is 100%.
    Hence, we are done with the construction of this rule.
    
    We now remove the dataset instances covered by this rule. Since no
    instances with inflated=F remain in the dataset, we are done
    with the construction of rules predicting inflated=F.
    
    The resulting set of rules is:
    
      IF act=dip THEN inflated=F
      IF age=child THEN inflated=F
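
Both parts above run the same greedy loop: grow a rule one condition at a
time, always adding the condition with the highest p/t ratio, declare the
rule done once it is perfect, remove the instances it covers, and repeat.
Here is a minimal Python sketch of that loop (an illustration of the Prism
idea, not a reference implementation; ties are broken by the fixed
condition order below, which reproduces the choices made above):

    data = [
        {"size": "small", "act": "stretch", "age": "adult", "inflated": "T"},
        {"size": "small", "act": "stretch", "age": "child", "inflated": "F"},
        {"size": "small", "act": "dip", "age": "adult", "inflated": "F"},
        {"size": "small", "act": "dip", "age": "child", "inflated": "F"},
        {"size": "large", "act": "stretch", "age": "adult", "inflated": "T"},
        {"size": "large", "act": "stretch", "age": "child", "inflated": "F"},
        {"size": "large", "act": "dip", "age": "adult", "inflated": "F"},
        {"size": "large", "act": "dip", "age": "child", "inflated": "F"},
    ]
    CONDITIONS = [("size", "small"), ("size", "large"), ("act", "stretch"),
                  ("act", "dip"), ("age", "adult"), ("age", "child")]

    def prism(rows, target):
        """Sequential covering: perfect rules predicting inflated = target."""
        rules = []
        rows = list(rows)
        while any(r["inflated"] == target for r in rows):
            covered, conds = rows, []
            # grow one rule until it covers only `target` instances
            while any(r["inflated"] != target for r in covered):
                def ratio(cond):
                    a, v = cond
                    t = [r for r in covered if r[a] == v]
                    p = [r for r in t if r["inflated"] == target]
                    return len(p) / len(t) if t else -1.0   # the p/t measure
                best = max((c for c in CONDITIONS
                            if c[0] not in dict(conds)), key=ratio)
                conds.append(best)
                covered = [r for r in covered if r[best[0]] == best[1]]
            rules.append(conds)
            # remove the instances covered by the finished rule
            rows = [r for r in rows if not all(r[a] == v for a, v in conds)]
        return rules

    print(prism(data, "T"))  # [[('act', 'stretch'), ('age', 'adult')]]
    print(prism(data, "F"))  # [[('act', 'dip')], [('age', 'child')]]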
    
    

Problem III. Association Rules (30 points)

Consider the following toy dataset adapted from the Balloons Dataset of the UCI Machine Learning Repository.
@relation balloons

@attribute size              {large, small}
@attribute act               {stretch, dip}
@attribute age               {adult, child}
@attribute inflated          {T, F}

@data

small,	stretch,	adult,	T

small,	stretch,	child,	F

small,	dip,		adult,	F

small,	dip,		child,	F

large,	stretch,	adult,	T

large,	stretch,	child,	F

large,	dip,		adult,	F

large,	dip,		child,	F

Assume that we want to mine association rules with minimum support: 0.25 (that is, the itemset has to be present in at least 2 data instances).

  1. (25 Points) Use the Apriori algorithm to construct all the frequent itemsets in this dataset. The first two levels of frequent itemsets are provided below.
    LEVEL 1
    
    SUPPORT  ITEMSETS              
    
     4/8	{size=small}
     4/8	{size=large}
     4/8	{act=stretch}
     4/8	{act=dip}
     4/8	{age=adult}
     4/8	{age=child}
     2/8	{inflated=T}
     6/8	{inflated=F}
    
    LEVEL 2
    
    SUPPORT  ITEMSETS              
    
     2/8	{size=small, act=stretch}
     2/8	{size=small, act=dip}
     2/8	{size=small, age=adult}
     2/8	{size=small, age=child}
     1/8	{size=small, inflated=T}
     3/8	{size=small, inflated=F}
    
    
     2/8	{size=large, act=stretch}
     2/8	{size=large, act=dip}
     2/8	{size=large, age=adult}
     2/8	{size=large, age=child}
     1/8	{size=large, inflated=T}
     3/8	{size=large, inflated=F}
    
     2/8	{act=stretch, age=adult}
     2/8	{act=stretch, age=child}
     2/8	{act=stretch, inflated=T}
     2/8	{act=stretch, inflated=F}
    
     2/8	{act=dip, age=adult}
     2/8	{act=dip, age=child}
     0/8	{act=dip, inflated=T}
     4/8	{act=dip, inflated=F}
    
     2/8	{age=adult, inflated=T}
     2/8	{age=adult, inflated=F}
    
     0/8	{age=child, inflated=T}
     4/8	{age=child, inflated=F}
    
    LEVEL 3 Compute all the candidate and frequent itemsets for level 3. Use both the join and the subset pruning criteria to make the process more efficient.
    SUPPORT  ITEMSETS              
    
    
    SOLUTIONS:
    
     1/8	{size=small, act=stretch, age=adult}
     1/8	{size=small, act=stretch, age=child}
     1/8	{size=small, act=stretch, inflated=F}
     1/8	{size=small, act=dip, age=adult}
     1/8	{size=small, act=dip, age=child}
     2/8	{size=small, act=dip, inflated=F}
     1/8	{size=small, age=adult, inflated=F}
     2/8	{size=small, age=child, inflated=F}
    
     1/8	{size=large, act=stretch, age=adult}
     1/8	{size=large, act=stretch, age=child}
     1/8	{size=large, act=stretch, inflated=F}
     1/8	{size=large, act=dip, age=adult}
     1/8	{size=large, act=dip, age=child}
     2/8	{size=large, act=dip, inflated=F}
     1/8	{size=large, age=adult, inflated=F}
     2/8	{size=large, age=child, inflated=F}
    
     2/8	{act=stretch, age=adult, inflated=T}
     0/8	{act=stretch, age=adult, inflated=F}
     XXX	{act=stretch, age=child, inflated=T} --> There is no need 
                              to check the support of this itemset, as 
                              it is removed by the subset-pruning criterion: 
                              its subset {age=child, inflated=T} is not 
                              frequent.
     2/8	{act=stretch, age=child, inflated=F}
    
     2/8	{act=dip, age=adult, inflated=F}
     2/8	{act=dip, age=child, inflated=F}
    
    Hence, the frequent 3-itemsets are:
    
     2/8	{size=small, act=dip, inflated=F}
     2/8	{size=small, age=child, inflated=F}
    
     2/8	{size=large, act=dip, inflated=F}
     2/8	{size=large, age=child, inflated=F}
    
     2/8	{act=stretch, age=adult, inflated=T}
     2/8	{act=stretch, age=child, inflated=F}
    
     2/8	{act=dip, age=adult, inflated=F}
     2/8	{act=dip, age=child, inflated=F}
    
    
    
    LEVEL 4 Compute all the candidate and frequent itemsets for level 4. Use both the join and the subset pruning criteria to make the process more efficient.
    SUPPORT  ITEMSETS              
    
    Solutions:
    
      Note that no pair of frequent 3-itemsets satisfies the join condition
      (no two of them agree on their first two items). Hence there are no
      candidate itemsets, and therefore no frequent itemsets, at Level 4.
    
    

  2. (5 points) Consider the association rule:
          act=stretch -> age=adult, inflated=T
    
    Compute the confidence of this rule. Show the steps of your calculations.
     Solution:
    
      confidence( act=stretch -> age=adult, inflated=T )
        = support(act=stretch, age=adult, inflated=T) / support(act=stretch)
        = (2/8) / (4/8)
        = 2/4
        = 0.5 (or 50%)

      (The Apriori sketch below recomputes the frequent itemsets and this
      confidence value.)
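
The join and subset-pruning criteria used in Levels 3 and 4 work the same
way at every level. Here is a minimal Python sketch of Apriori candidate
generation on this dataset (an illustration; items are kept in a fixed
global order so the "first k-1 items agree" join test is meaningful):

    from itertools import combinations

    rows = [
        ("size=small", "act=stretch", "age=adult", "inflated=T"),
        ("size=small", "act=stretch", "age=child", "inflated=F"),
        ("size=small", "act=dip", "age=adult", "inflated=F"),
        ("size=small", "act=dip", "age=child", "inflated=F"),
        ("size=large", "act=stretch", "age=adult", "inflated=T"),
        ("size=large", "act=stretch", "age=child", "inflated=F"),
        ("size=large", "act=dip", "age=adult", "inflated=F"),
        ("size=large", "act=dip", "age=child", "inflated=F"),
    ]
    ORDER = ["size=small", "size=large", "act=stretch", "act=dip",
             "age=adult", "age=child", "inflated=T", "inflated=F"]
    MINSUP = 2                      # support count for minimum support 0.25

    def attr(item):                 # "act=dip" -> "act"
        return item.split("=")[0]

    def support(itemset):           # number of rows containing every item
        return sum(all(i in row for i in itemset) for row in rows)

    def apriori():
        level = [(i,) for i in ORDER if support((i,)) >= MINSUP]
        frequent_all = []
        while level:
            frequent_all += level
            frequent, candidates = set(level), []
            for a, b in combinations(level, 2):
                # join: same first k-1 items, different attributes at the end
                if a[:-1] == b[:-1] and attr(a[-1]) != attr(b[-1]):
                    cand = a + (b[-1],)
                    # subset prune: every k-subset must itself be frequent
                    if all(s in frequent
                           for s in combinations(cand, len(a))):
                        candidates.append(cand)
            level = [c for c in candidates if support(c) >= MINSUP]
        return frequent_all

    for itemset in apriori():
        print(support(itemset), "/ 8", itemset)

    # confidence(A -> B) = support(A u B) / support(A)
    print(support(("act=stretch", "age=adult", "inflated=T"))
          / support(("act=stretch",)))                        # 0.5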
    
    

Problem IV. Evaluation (15 points)

Given a data mining classification technique (e.g., decision trees, classification rules) and a given dataset D, consider the problem of testing the accuracy of the technique over the dataset D.
  1. (10 points) Explain in detail the procedure followed by n-fold cross-validation for the above purpose.
    Solution:
    For n-fold cross-validation, the dataset instances are divided into n
    disjoint subsets of roughly the same size. Let D1,...,Dn denote those
    n subsets. Then the following procedure is followed:
    
       For k := 1 to n do
         Construct a model using the union of D1,...,Dk-1,Dk+1,...,Dn
           as the training set
         Calculate the accuracy of the model using Dk as the test set.
            Let's call that accuracy Ak
       end-For
    
       Return the average of A1,...,An as the accuracy
       of the data mining technique over the given dataset.

    (A runnable Python version of this procedure is sketched at the end of
    this problem.)
    
    

  2. (5 points) Assume that we construct a model by applying the data mining technique to the full dataset D. Let T be the accuracy obtained by testing this model on a given, separate test dataset. On the other hand, let A be the accuracy obtained by using n-fold cross-validation on the dataset D. How do A and T compare with each other? Is A always higher than T ? Is A always the same as T ? Is A always lower than T ? Or none of the above? Explain your answer in detail.
    Solution:
    
    None of the above: A is not always higher than T, not always equal to
    T, and not always lower than T; how the two compare depends on the
    particular dataset, technique, and test set. n-fold cross-validation
    just produces a more reliable estimate of the accuracy of the data
    mining technique over the given dataset, since every instance is used
    for testing exactly once and the estimate does not depend on a single
    train/test split.
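
The pseudocode in part 1 translates directly into a short program. Here is
a minimal Python sketch (a generic illustration: the majority-class
"technique" below is a stand-in for whatever classifier is being evaluated,
not part of the exam solution):

    from collections import Counter

    def train(rows):
        # toy "technique": always predict the training set's majority class
        return Counter(label for _, label in rows).most_common(1)[0][0]

    def test_accuracy(model, rows):
        return sum(label == model for _, label in rows) / len(rows)

    def cross_validate(rows, n):
        folds = [rows[k::n] for k in range(n)]  # n disjoint, near-equal folds
        accuracies = []
        for k in range(n):
            training = [x for j, f in enumerate(folds) if j != k for x in f]
            model = train(training)             # train on D1..Dk-1,Dk+1..Dn
            accuracies.append(test_accuracy(model, folds[k]))  # Ak on fold Dk
        return sum(accuracies) / n              # average of A1,...,An

    data = [(i, "T" if i % 4 == 0 else "F") for i in range(8)]  # toy dataset
    print(cross_validate(data, n=4))            # -> 0.75

In real use the instances would be shuffled (and often stratified by class)
before being split into folds; the toy data above is deliberately tiny.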