### CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2006. Solutions to Exam 1 (November 20, 2006)

#### Problem I. Decision Trees (30 points)

Consider the following toy dataset, adapted from the Balloons Dataset of the UCI Machine Learning Repository. Assume that the classification target is the attribute inflated.
```
@relation balloons

@attribute size              {large, small}
@attribute act               {stretch, dip}
@attribute age               {adult, child}
@attribute inflated          {T, F}

@data
small,  stretch,  adult,  T
small,  stretch,  child,  F
small,  dip,      adult,  F
small,  dip,      child,  F
large,  stretch,  adult,  T
large,  stretch,  child,  F
large,  dip,      adult,  F
large,  dip,      child,  F
```

Construct the FULL decision tree for this dataset USING THE ID3 ALGORITHM. Show all the steps of the entropy calculations.

For your convenience, the base-2 logarithms of selected values are provided below.

```
x        1/2   1/3   2/3   1/4   3/4   1/5   2/5   3/5   1/6   5/6   1/7   2/7   3/7   4/7   1
log2(x)  -1.0  -1.6  -0.6  -2.0  -0.4  -2.3  -1.3  -0.7  -2.6  -0.3  -2.8  -1.8  -1.2  -0.8  0.0
```
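The table is easy to regenerate; here is a quick sketch (not part of the original exam) that recomputes it with Python's standard library:

```python
# Recompute the base-2 logarithm lookup table, rounded to one decimal.
from fractions import Fraction
from math import log2

for f in ["1/2", "1/3", "2/3", "1/4", "3/4", "1/5", "2/5", "3/5",
          "1/6", "5/6", "1/7", "2/7", "3/7", "4/7", "1"]:
    print(f, round(log2(Fraction(f)), 1))
```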
```
SOLUTIONS:

Let's compute the weighted entropy of each of the 3 predicting attributes
with respect to the target attribute ("inflated"):

INFLATED           T                   F

SIZE

small   (4/8)*[ - (1/4)*log2(1/4)  - (3/4)*log2(3/4)] = 0.4
large   (4/8)*[ - (1/4)*log2(1/4)  - (3/4)*log2(3/4)] = 0.4
                                                        -------
                                                          0.8

ACT

stretch (4/8)*[ - (2/4)*log2(2/4)  - (2/4)*log2(2/4)] = 0.5
dip     (4/8)*[ - (0/4)*log2(0/4)  - (4/4)*log2(4/4)] = 0
                                                        -------
                                                          0.5

(As usual, 0*log2(0) is taken to be 0.)

AGE

adult   (4/8)*[ - (2/4)*log2(2/4)  - (2/4)*log2(2/4)] = 0.5
child   (4/8)*[ - (0/4)*log2(0/4)  - (4/4)*log2(4/4)] = 0
                                                        -------
                                                          0.5

Act and Age have the same entropy. We can choose either one as the
root of our decision tree. I'll choose Act:

                ACT
               /   \
      stretch /     \ dip
             /       \
     2 T's, 2 F's   4 F's

The node at the end of the ACT=dip branch contains only F's, so we
convert that node into a leaf that predicts F.
Since the node at the end of the ACT=stretch branch contains both
T's and F's, we need to select an attribute to split it further.

Here are the 4 instances under consideration:

small,  stretch,  adult,  T
small,  stretch,  child,  F
large,  stretch,  adult,  T
large,  stretch,  child,  F

We compute the entropies of SIZE and AGE w.r.t. INFLATED in this
smaller dataset:

INFLATED           T                   F

SIZE

small   (2/4)*[ - (1/2)*log2(1/2)  - (1/2)*log2(1/2)] = 0.5
large   (2/4)*[ - (1/2)*log2(1/2)  - (1/2)*log2(1/2)] = 0.5
                                                        -------
                                                          1.0

AGE

adult   (2/4)*[ - (2/2)*log2(2/2)  - (0/2)*log2(0/2)] = 0
child   (2/4)*[ - (0/2)*log2(0/2)  - (2/2)*log2(2/2)] = 0
                                                        -------
                                                          0

The attribute with the lowest entropy is AGE, so it is used
to split the node under consideration:

                ACT
               /   \
      stretch /     \ dip
             /       \
           AGE        F
          /   \
   adult /     \ child
        /       \
       T         F

This completes the construction of the decision tree.

```
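For readers who want to verify the arithmetic, here is a small Python sketch (not part of the original solutions; the helper names are my own) that reproduces the weighted-entropy calculations ID3 uses to choose the root split, using the eight instances above:

```python
# Weighted entropy of the target ("inflated") after splitting on each
# predicting attribute; ID3 selects the attribute with the minimum value.
from collections import Counter
from math import log2

# The 8 instances: (size, act, age, inflated)
data = [
    ("small", "stretch", "adult", "T"), ("small", "stretch", "child", "F"),
    ("small", "dip",     "adult", "F"), ("small", "dip",     "child", "F"),
    ("large", "stretch", "adult", "T"), ("large", "stretch", "child", "F"),
    ("large", "dip",     "adult", "F"), ("large", "dip",     "child", "F"),
]
ATTRS = {"size": 0, "act": 1, "age": 2}

def entropy(labels):
    """Entropy of a list of class labels; 0*log2(0) is taken as 0."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_entropy(rows, attr):
    """Sum over attribute values of (subset weight) * (subset entropy)."""
    col = ATTRS[attr]
    total = len(rows)
    result = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[3] for r in rows if r[col] == value]
        result += (len(subset) / total) * entropy(subset)
    return result

for attr in ATTRS:
    print(attr, round(weighted_entropy(data, attr), 3))
# size 0.811, act 0.5, age 0.5 -- matching the 0.8, 0.5, 0.5 above
# (the solutions get 0.8 for size because the one-decimal log table is used)
```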

#### Problem II. Classification Rules (30 points)

Consider the following toy dataset, adapted from the Balloons Dataset of the UCI Machine Learning Repository. Assume that the classification target is the attribute inflated.
```
@relation balloons

@attribute size              {large, small}
@attribute act               {stretch, dip}
@attribute age               {adult, child}
@attribute inflated          {T, F}

@data
small,  stretch,  adult,  T
small,  stretch,  child,  F
small,  dip,      adult,  F
small,  dip,      child,  F
large,  stretch,  adult,  T
large,  stretch,  child,  F
large,  dip,      adult,  F
large,  dip,      child,  F
```

Assume that we want to construct classification rules for this dataset.

1. (15 points) Follow the Prism sequential covering algorithm to construct classification rules for the target inflated=T. Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.
```
Solutions:

We begin with the rule:

IF ? THEN inflated=T

This rule is not perfect, so we look for the best attribute-value pair
to add to its antecedent:

p/t    CANDIDATE CONDITIONS:

1/4	size=small
1/4	size=large
2/4	act=stretch
0/4	act=dip
2/4	age=adult
0/4	age=child

Since act=stretch and age=adult have the same maximum p/t ratio
among the conditions, we can select either one of them,
say act=stretch. The resulting rule is:

IF act=stretch THEN inflated=T

The rule is not perfect and hence we look for a second condition
to add to the antecedent of the rule.

p/t    CANDIDATE CONDITIONS:

1/2	act=stretch and size=small
1/2	act=stretch and size=large
2/2	act=stretch and age=adult
0/2	act=stretch and age=child

The condition with the best p/t ratio is age=adult, resulting
in the rule:

IF act=stretch and age=adult THEN inflated=T

The rule is now perfect as its accuracy over the training data is 100%.
Hence, we are done with the construction of this rule.

We now remove the dataset instances covered by this rule. Since no
instances with inflated=T remain in the dataset, we are done
with the construction of rules predicting inflated=T.

The resulting set of rules consists of the rule:

IF act=stretch and age=adult THEN inflated=T

```
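As a cross-check, here is a short Python sketch of a single Prism refinement step (not from the original solutions; `pt_scores` and the dict-based encoding are my own). It scores every candidate condition by its p/t ratio, reproducing the two candidate tables above:

```python
# One Prism refinement step: for a partially built rule, score each
# candidate condition attr=value by (p, t), where t is the number of
# instances covered by the extended rule and p those with the target class.
def pt_scores(rows, attrs, rule, target_class):
    covered = [r for r in rows if all(r[a] == v for a, v in rule.items())]
    scores = {}
    for attr in attrs:
        if attr in rule:            # each attribute is tested at most once
            continue
        for value in sorted({r[attr] for r in covered}):
            match = [r for r in covered if r[attr] == value]
            p = sum(1 for r in match if r["inflated"] == target_class)
            scores[attr + "=" + value] = (p, len(match))
    return scores

rows = [dict(zip(("size", "act", "age", "inflated"), r)) for r in [
    ("small", "stretch", "adult", "T"), ("small", "stretch", "child", "F"),
    ("small", "dip",     "adult", "F"), ("small", "dip",     "child", "F"),
    ("large", "stretch", "adult", "T"), ("large", "stretch", "child", "F"),
    ("large", "dip",     "adult", "F"), ("large", "dip",     "child", "F"),
]]
# Step 1 for "IF ? THEN inflated=T": act=stretch and age=adult tie at 2/4.
print(pt_scores(rows, ("size", "act", "age"), {}, "T"))
# Step 2 for "IF act=stretch and ? THEN inflated=T": age=adult wins at 2/2.
print(pt_scores(rows, ("size", "act", "age"), {"act": "stretch"}, "T"))
```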

2. (15 points)

Assume that the perfect rule

```
IF act=dip THEN inflated=F
```
has just been constructed. Follow the Prism sequential covering algorithm to construct the remaining classification rules for the target inflated=F. Use the p/t measure to choose the best conditions for the rules. SHOW ALL THE STEPS OF YOUR CALCULATIONS.
```
Solutions:

After the rule IF act=dip THEN inflated=F is constructed, all the
instances correctly classified by it are removed from consideration.
The remaining set of instances is:

small,  stretch,  adult,  T
small,  stretch,  child,  F
large,  stretch,  adult,  T
large,  stretch,  child,  F

We begin with the rule:

IF ? THEN inflated=F

This rule is not perfect, so we look for the best attribute-value pair
to add to its antecedent:

p/t    CANDIDATE CONDITIONS:

1/2	size=small
1/2	size=large
2/4	act=stretch
0/2	age=adult
2/2	age=child

The condition with the best p/t ratio is age=child, resulting in the rule:

IF age=child THEN inflated=F

The rule is now perfect as its accuracy over the training data is 100%.
Hence, we are done with the construction of this rule.

We now remove the dataset instances covered by this rule. Since no
instances with inflated=F remain in the dataset, we are done
with the construction of rules predicting inflated=F.

The resulting set of rules is:

IF act=dip THEN inflated=F
IF age=child THEN inflated=F

```

#### Problem III. Association Rules (30 points)

Consider the following toy dataset, adapted from the Balloons Dataset of the UCI Machine Learning Repository.
```
@relation balloons

@attribute size              {large, small}
@attribute act               {stretch, dip}
@attribute age               {adult, child}
@attribute inflated          {T, F}

@data
small,  stretch,  adult,  T
small,  stretch,  child,  F
small,  dip,      adult,  F
small,  dip,      child,  F
large,  stretch,  adult,  T
large,  stretch,  child,  F
large,  dip,      adult,  F
large,  dip,      child,  F
```
Assume that we want to mine association rules with minimum support: 0.25 (that is, the itemset has to be present in at least 2 data instances).

1. (25 points) Use the Apriori algorithm to construct all the frequent itemsets in this dataset. The first two levels of frequent itemsets are provided below.
```
LEVEL 1

SUPPORT  ITEMSETS

4/8	{size=small}
4/8	{size=large}
4/8	{act=stretch}
4/8	{act=dip}
4/8	{age=child}
2/8	{inflated=T}
6/8	{inflated=F}

LEVEL 2

SUPPORT  ITEMSETS

2/8	{size=small, act=stretch}
2/8	{size=small, act=dip}
2/8	{size=small, age=child}
1/8	{size=small, inflated=T}
3/8	{size=small, inflated=F}

2/8	{size=large, act=stretch}
2/8	{size=large, act=dip}
2/8	{size=large, age=child}
1/8	{size=large, inflated=T}
3/8	{size=large, inflated=F}

2/8	{act=stretch, age=child}
2/8	{act=stretch, inflated=T}
2/8	{act=stretch, inflated=F}

2/8	{act=dip, age=child}
0/8	{act=dip, inflated=T}
4/8	{act=dip, inflated=F}

0/8	{age=child, inflated=T}
4/8	{age=child, inflated=F}
```
LEVEL 3: Compute all the candidate and frequent itemsets for level 3. Use both the join and the subset pruning criteria to make the process more efficient.
```
SUPPORT  ITEMSETS

SOLUTIONS:

1/8	{size=small, act=stretch, age=child}
1/8	{size=small, act=stretch, inflated=F}
1/8	{size=small, act=dip, age=child}
2/8	{size=small, act=dip, inflated=F}
2/8	{size=small, age=child, inflated=F}

1/8	{size=large, act=stretch, age=child}
1/8	{size=large, act=stretch, inflated=F}
1/8	{size=large, act=dip, age=child}
2/8	{size=large, act=dip, inflated=F}
2/8	{size=large, age=child, inflated=F}

XXX	{act=stretch, age=child, inflated=T}
        --> There is no need to check the support of this itemset:
            it is removed by the subset prune because its subset
            {age=child, inflated=T} is not frequent.
2/8	{act=stretch, age=child, inflated=F}

2/8	{act=dip, age=child, inflated=F}

Hence, the frequent 3-itemsets are:

2/8	{size=small, act=dip, inflated=F}
2/8	{size=small, age=child, inflated=F}

2/8	{size=large, act=dip, inflated=F}
2/8	{size=large, age=child, inflated=F}

2/8	{act=stretch, age=child, inflated=F}

2/8	{act=dip, age=child, inflated=F}

```
LEVEL 4: Compute all the candidate and frequent itemsets for level 4. Use both the join and the subset pruning criteria to make the process more efficient.
```
SUPPORT  ITEMSETS

Solutions:

Note that no pair of frequent 3-itemsets satisfies the join condition
(no two of them share their first two items under the attribute
ordering). Hence there are no candidate and no frequent itemsets in
Level 4.

```
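The join and subset-pruning steps can be mechanized. Below is a compact Python sketch of Apriori candidate generation (not from the original solutions; `apriori_gen` is my own name, and it uses a simplified set-based join instead of the textbook lexicographic-prefix join, relying on the prune step and set deduplication for the same result). Fed the frequent 2-itemsets from LEVEL 2, it yields exactly the twelve Level-3 candidates whose supports are tabulated above and prunes {act=stretch, age=child, inflated=T}:

```python
# Apriori candidate generation: join pairs of frequent (k-1)-itemsets,
# then prune every candidate that has an infrequent (k-1)-subset.
from itertools import combinations

def apriori_gen(frequent, k):
    """frequent: frozensets of size k-1; returns the candidate k-itemsets."""
    items = sorted(frequent, key=sorted)
    candidates = set()
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            joined = a | b
            if len(joined) != k:   # join step: a and b must share k-2 items
                continue
            # prune step: every (k-1)-subset must itself be frequent
            if all(frozenset(s) in frequent
                   for s in combinations(joined, k - 1)):
                candidates.add(joined)
    return candidates

# The frequent 2-itemsets from LEVEL 2 above (support >= 2/8):
L2 = {frozenset(s) for s in [
    {"size=small", "act=stretch"}, {"size=small", "act=dip"},
    {"size=small", "age=child"},   {"size=small", "inflated=F"},
    {"size=large", "act=stretch"}, {"size=large", "act=dip"},
    {"size=large", "age=child"},   {"size=large", "inflated=F"},
    {"act=stretch", "age=child"},  {"act=stretch", "inflated=T"},
    {"act=stretch", "inflated=F"}, {"act=dip", "age=child"},
    {"act=dip", "inflated=F"},     {"age=child", "inflated=F"},
]}
for c in sorted(apriori_gen(L2, 3), key=sorted):
    print(sorted(c))   # 12 candidates; their supports must still be counted
```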

2. (5 points) Consider the association rule:
```
act=stretch -> age=adult, inflated=T
```
Compute the confidence of this rule. Show the steps of your calculations.
```
Solution:

confidence( act=stretch -> age=adult, inflated=T )
  = support(act=stretch, age=adult, inflated=T) / support(act=stretch)
  = (2/8) / (4/8)
  = 0.5 (or 50%)

```
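The same computation in a quick Python sketch (not from the original solutions), treating each of the eight instances as a transaction of attribute=value items:

```python
# Confidence of X -> Y is support(X union Y) / support(X).
transactions = [
    {"size=small", "act=stretch", "age=adult", "inflated=T"},
    {"size=small", "act=stretch", "age=child", "inflated=F"},
    {"size=small", "act=dip", "age=adult", "inflated=F"},
    {"size=small", "act=dip", "age=child", "inflated=F"},
    {"size=large", "act=stretch", "age=adult", "inflated=T"},
    {"size=large", "act=stretch", "age=child", "inflated=F"},
    {"size=large", "act=dip", "age=adult", "inflated=F"},
    {"size=large", "act=dip", "age=child", "inflated=F"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

antecedent = {"act=stretch"}
consequent = {"age=adult", "inflated=T"}
print(support(antecedent | consequent) / support(antecedent))   # 0.5
```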

#### Problem IV. Evaluation (15 points)

Given a data mining classification technique (e.g., decision trees, classification rules) and a dataset D, consider the problem of estimating the accuracy of the technique over D.
1. (10 points) Explain in detail the procedure followed by n-fold cross-validation for the above purpose.
```
Solution:

For n-fold cross-validation, the dataset instances are divided into n
subsets of roughly the same size. Let D1,...,Dn denote those n subsets.
Then the following procedure is followed:

For k := 1 to n do
    Construct a model using the union of D1,...,Dk-1,Dk+1,...,Dn
    as the training set.
    Calculate the accuracy of the model using Dk as the test set;
    call that accuracy Ak.
end-For

Return the average of A1,...,An as the accuracy of the data mining
technique over the given dataset.

```
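The procedure translates directly into code. Here is a minimal Python sketch (not from the original solutions), where `train` and `test` stand in for whatever technique is being evaluated, and the interleaved fold split is just one simple way to form n roughly equal subsets:

```python
# n-fold cross-validation: each fold serves as the test set exactly once;
# the model is rebuilt on the remaining n-1 folds each time.
def n_fold_cross_validation(dataset, n, train, test):
    folds = [dataset[k::n] for k in range(n)]       # n roughly equal folds
    accuracies = []
    for k in range(n):
        training = [row for j, fold in enumerate(folds)
                    if j != k for row in fold]      # union of the other folds
        model = train(training)                     # build the model
        accuracies.append(test(model, folds[k]))    # accuracy A_k on fold D_k
    return sum(accuracies) / n                      # the reported estimate A
```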

2. (5 points) Assume that we construct a model by applying the data mining technique to the full dataset D. Let T be the accuracy obtained by testing this model on a given, separate test dataset. On the other hand, let A be the accuracy obtained by using n-fold cross-validation on the dataset D. How do A and T compare with each other? Is A always higher than T ? Is A always the same as T ? Is A always lower than T ? Or none of the above? Explain your answer in detail.
```
Solution:

None of the above: A is not always higher than T, not always lower,
and not always equal; either one may come out larger for a particular
dataset and test set. n-fold cross-validation simply produces a more
reliable estimate of the accuracy of the data mining technique over
the given dataset, since every instance is used for testing exactly
once.

```