## CS444X Data Mining and Knowledge Discovery in Databases (D-2003). Exam 2 - April 29, 2003

### Prof. Carolina Ruiz. Department of Computer Science. Worcester Polytechnic Institute

Solutions by Carolina Ruiz.

#### Instructions

• Ask in case of doubt

#### Problem I. Data Mining Techniques (40 points)

For each of the following data mining applications, state what data mining task best matches the objectives described (possible tasks are classification, clustering, association, and regression). Also state what the instances are and what the attributes are in each case. Note that the training and test data are not described explicitly, so you will have to infer what these would be like from the statement.

1. (10 Points) Predicting the approximate number of Olympic medals won by athletes of a given country based on information about the country such as population, per capita income, type of government, and geographic location.
```
Solution:

This is regression (numeric prediction).

The objective is similar to classification, but there is a large number
of possible "classes", and the numerical distance between them plays a
role in evaluating prediction quality.

(2 points) Instances:

The instances are countries.

(2 points) Attributes

The attributes are population, per capita
income, type of government, geographic location, etc.

```

2. (10 Points) Discovering subspecies within an animal species based on characteristics of individual animals like size, speed, longevity, and fur color.
```
Solution:

This is clustering.

The subspecies are not known in advance, thus we have an unsupervised learning
problem. Instances are to be grouped into subsets within which similarity is
higher than it is between different subgroups.

(2 points) Instances:

The instances are individual animals.

(2 points) Attributes

The attributes are the characteristics listed (size, speed, ...)

```

3. (10 Points) Describing relationships among people's personality traits and/or their interests, for example "introverted people like action movies", or "outgoing people are analytical".
```
Solution:

Association.

There is no particular target attribute, and no grouping by
similarity is sought. Instead co-occurrences of attribute values
are the objective.

(2 points) Instances:

The instances are people.

(2 points) Attributes

The attributes are the personality traits and interests described
in the statement.

```

4. (10 Points) Predicting whether a person will buy T-shirts or not based on the person's answers to survey questions.
```
Solution:

Classification.

There is a clear class attribute (buys-T-shirts) that must be predicted.
There is no numeric prediction nor notion of distance between class values
that is used to penalize some class errors less than others (as would be
the case in regression).

(2 points) Instances:

The instances are people.

(2 points) Attributes

The attributes are (answers to) the survey questions.

```

#### Problem II. Association Rules (30 + 10 extra points)

This problem refers to the credit dataset discussed in class, which is shown below in ARFF format. The instances of this dataset may be interpreted as transactions as discussed in class. Each transaction is a list of items. Each item is an attribute-value pair of the form attribute=value.
```
@relation credit-data

@attribute credit_history {unknown, bad, good}
@attribute debt {minor, major}
@attribute collateral {none, adequate}
@attribute risk {low, moderate, high}

@data
%
% 14 instances (only 8 are shown in this listing)
%

unknown,  minor,  none,      moderate
unknown,  minor,  none,      low
unknown,  major,  none,      high
unknown,  major,  none,      high
good,     minor,  none,      low
good,     major,  none,      high
good,     major,  none,      moderate
good,     major,  none,      low
```

1. (10 Points) a) Compute the support and the confidence of the association rule credit-history=bad & collateral=none => risk=high relative to the dataset shown. You may leave your answers in the form of fractions if you wish.
```
Solution:

Support of a rule is the percentage of instances in that dataset that
contain all the items in the rule:

support = P(credit-history=bad & collateral=none & risk=high) = 2/14 = 1/7

Confidence of a rule is the percentage of instances that contain all the items
in the rule among those instances that contain all the items in the
left-hand side of the rule:

confidence = P(risk=high | credit-history=bad & collateral=none) = 2/3

```
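These two computations can be sketched in Python. The helper functions below are illustrative, and the toy transaction list is hypothetical (it is NOT the 14-instance credit dataset, which is only partially listed above); the example just exercises the two definitions:

```python
from fractions import Fraction

def support(transactions, items):
    """Fraction of transactions that contain all the given items."""
    hits = sum(1 for t in transactions if items <= t)
    return Fraction(hits, len(transactions))

def confidence(transactions, lhs, rhs):
    """Among transactions containing the left-hand side, the fraction that
    also contain the right-hand side: support(lhs | rhs) / support(lhs)."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

# Toy transactions (hypothetical; not the full credit dataset):
toy = [
    {"credit-history=bad", "collateral=none", "risk=high"},
    {"credit-history=bad", "collateral=none", "risk=high"},
    {"credit-history=bad", "collateral=none", "risk=moderate"},
    {"credit-history=good", "collateral=none", "risk=low"},
]
lhs = {"credit-history=bad", "collateral=none"}
rhs = {"risk=high"}
print(support(toy, lhs | rhs))    # 1/2
print(confidence(toy, lhs, rhs))  # 2/3
```

Using `Fraction` keeps the answers in the same fractional form the problem statement allows.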

2. (10 Points) How many *different* 1-itemsets are there for this dataset?
```
Solution:

As many as there are items of the form attribute=value, that is,
as many as the total number of different attribute values:

credit_history: 3  {unknown, bad, good}
debt:           2  {minor, major}
collateral:     2  {none, adequate}
risk:           3  {low, moderate, high}

The total number is 3 + 2 + 2 + 3 = 10.
```

3. (10 Points) Find all frequent 1-itemsets for a minimum support threshold of 0.40 (i.e. 40%). Express each itemset in the form attribute=value. In each case, write the number of instances in which the 1-itemset appears next to the itemset. Note the approximate values 5/14=.36, 6/14=.43.
```
Solution:

The minimum support value of 0.40 requires that a frequent 1-itemset
appear in 6 or more instances. Here are the counts and the winners:

Attribute-value pair          support   frequent?

credit_history=unknown       5/14    no
credit_history=bad           4/14    no
credit_history=good          5/14    no
debt=minor                   7/14    yes
debt=major                   7/14    yes
collateral=none             11/14    yes
collateral=adequate          3/14    no
risk=low                     5/14    no
risk=moderate                4/14    no
risk=high                    5/14    no

```
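This count-and-filter step can be sketched in Python. The mini transaction list here is hypothetical (not the 14 credit instances); it just shows the mechanics:

```python
from collections import Counter
from fractions import Fraction

def frequent_1_itemsets(transactions, min_support):
    """Return {item: support} for every 1-itemset meeting min_support."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    return {item: Fraction(c, n)
            for item, c in counts.items() if Fraction(c, n) >= min_support}

# Hypothetical mini-dataset of 5 transactions:
rows = [
    {"debt=minor", "risk=low"},
    {"debt=minor", "risk=high"},
    {"debt=major", "risk=high"},
    {"debt=minor", "risk=low"},
    {"debt=major", "risk=low"},
]
freq = frequent_1_itemsets(rows, Fraction(2, 5))
print(freq["debt=minor"])  # 3/5
```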

4. (10 extra points) Based on your answer for the number of frequent 1-itemsets, how many *candidate* 2-itemsets would the Apriori algorithm consider when seeking frequent 2-itemsets? You may assume that only 2-itemsets corresponding to two different attributes are considered here.
```
Solution:

The candidate 2-itemsets are pairs of frequent 1-itemsets.
There are 2 such pairs that combine 1-itemsets from different attributes:

{debt=minor, collateral=none}
{debt=major, collateral=none}

```
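A sketch of this candidate-generation step (the helper below is illustrative; it recovers an item's attribute from the attribute=value encoding so that same-attribute pairs can be skipped, as the problem statement allows):

```python
from itertools import combinations

def candidate_2_itemsets(frequent_1):
    """Pair up frequent 1-itemsets, keeping only pairs whose two items
    belong to different attributes (items have the form attribute=value)."""
    attr = lambda item: item.split("=", 1)[0]
    return [frozenset(pair) for pair in combinations(sorted(frequent_1), 2)
            if attr(pair[0]) != attr(pair[1])]

cands = candidate_2_itemsets(["debt=minor", "debt=major", "collateral=none"])
print(len(cands))  # 2
```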

#### Problem III. Numeric Predictions (35 points + 10 extra credit points)

Consider the following dataset consisting of two attributes age and weight (measured in pounds). Note that instances have been labeled with a number in parentheses in case you need to refer to them in your solutions.
```
@relation how-much-you-weigh

@attribute age {young, adult, elderly}
@attribute weight numeric

@data
(1) young,    110
(2) young,     90
(3) adult,    150
(4) adult,    170
(5) elderly,  130
(6) elderly,  160

```
The purpose of this problem is to predict the weight of a person based on his/her age.

1. (15 Points) Nominal to Binary: Translate the nominal attribute age into as many binary attributes as needed. Show your work.
```
Solution:

In order to convert the nominal attribute age into binary ones, we calculate the
average weight for each age value:

young:   average weight = 100 = (110 + 90)/2
adult:   average weight = 160 = (150 + 170)/2
elderly: average weight = 145 = (130 + 160)/2

Now, we sort those nominal values in decreasing order of average weight:

adult (160), elderly (145), young (100)

We introduce two binary attributes, one for each split in this ordering:

age=adult:          1 if age is adult, 0 otherwise
age=elderly,adult:  1 if age is elderly or adult, 0 otherwise

```
Show below the dataset that results from converting the age attribute into the binary attribute(s) (each of which will be treated as numeric).
```
@relation modified-how-much-you-weigh

@attribute age=adult numeric
@attribute age=elderly,adult numeric
@attribute weight numeric

@data
(1)  0,  0,  110
(2)  0,  0,   90
(3)  1,  1,  150
(4)  1,  1,  170
(5)  0,  1,  130
(6)  0,  1,  160

```
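The conversion above can be sketched as a small Python function. This is a sketch of the sort-by-decreasing-average-target construction, tailored to this dataset (the `age=` prefix is hardcoded); note it joins the value list in decreasing-average order, e.g. `age=adult,elderly`, whereas Weka prints the same attribute as `age=elderly,adult`:

```python
from statistics import mean

def nominal_to_binary(values, targets):
    """Sort the k nominal values by decreasing average target; the i-th of
    the k-1 binary attributes is 1 iff the instance's value is among the
    top i values in that ordering."""
    avg = {v: mean(t for v2, t in zip(values, targets) if v2 == v)
           for v in set(values)}
    order = sorted(avg, key=avg.get, reverse=True)  # adult, elderly, young
    names = ["age=" + ",".join(order[:i + 1]) for i in range(len(order) - 1)]
    rows = [[1 if v in order[:i + 1] else 0 for i in range(len(order) - 1)]
            for v in values]
    return names, rows

ages = ["young", "young", "adult", "adult", "elderly", "elderly"]
weights = [110, 90, 150, 170, 130, 160]
names, rows = nominal_to_binary(ages, weights)
print(names)  # ['age=adult', 'age=adult,elderly']
print(rows)   # [[0, 0], [0, 0], [1, 1], [1, 1], [0, 1], [0, 1]]
```

The resulting rows match the modified dataset above: first column age=adult, second column the elderly-or-adult indicator.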

2. (10 Points) Root of Model/Regression Tree:
Assume that we want to construct a model (or a regression) tree to predict weight in terms of the binary attributes you constructed above. Identify the split points that you need to consider and select from them the one for the root of the model/regression tree. Show your work.
```
Solution:

Note that there are two split points (one for each binary attribute)
that we need to consider.

Let's denote the first of them as p1 (i.e., p1 = "age=adult = 0.5") and the
second one of them as p2 (i.e., p2 = "age=elderly,adult = 0.5").

We select as the condition for the root node of our tree
the split point that maximizes the value of the following formula:

SDR = sd(weight over all instances)
      - [ (k1/n)*sd(weight of instances with attribute value below split point)
        + (k2/n)*sd(weight of instances with attribute value above split point) ]

where sd stands for standard deviation.
k1 is the number of instances with attribute value below split point.
k2 is the number of instances with attribute value above split point.
n is the number of instances.

Using the standard deviations (sd) calculated by Ethan (thanks Ethan!!),
we calculated the standard deviation reduction of each of these split points:

p1: SDR = sd(110,90,150,170,130,160) - [(4/6)*sd(110,90,130,160) + (2/6)*sd(150,170)]

        = 30.82207 - [(4/6)*29.86079 + (2/6)*14.14214]

        = 6.200833

p2: SDR = sd(110,90,150,170,130,160) - [(2/6)*sd(110,90) + (4/6)*sd(150,170,130,160)]

        = 30.82207 - [(2/6)*14.14214 + (4/6)*17.07825]

        = 14.72252

Hence, p2 is used for the root of the model/regression tree:

        ---------------------
        | age=elderly,adult |
        ---------------------
           /             \
     < 0.5/               \ > 0.5
         /                 \
```
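The two SDR values can be checked with a few lines of Python (`statistics.stdev` is the sample standard deviation, which is what the numbers above use; the function name is illustrative):

```python
from statistics import stdev

def sdr(all_targets, left, right):
    """Standard deviation reduction of a binary split."""
    n = len(all_targets)
    return stdev(all_targets) - (len(left) / n) * stdev(left) \
                              - (len(right) / n) * stdev(right)

weights = [110, 90, 150, 170, 130, 160]
# p1: split on age=adult (instances 3 and 4 fall on the "> 0.5" side)
p1 = sdr(weights, [110, 90, 130, 160], [150, 170])
# p2: split on age=elderly,adult (instances 3-6 fall on the "> 0.5" side)
p2 = sdr(weights, [110, 90], [150, 170, 130, 160])
print(round(p1, 4), round(p2, 4))  # 6.2008 14.7225 -> p2 wins the root
```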

3. (10 Points) Regression Tree:
Continue the construction of the tree to obtain a REGRESSION TREE to predict weight. Show what the remaining internal nodes are and show the precise function used to make predictions in each of the leaves.
```
Solution:

Let's construct the left and the right children of the root node.

Left child (age=elderly,adult < 0.5):

Note that instances (1) and (2) lie in this child. Both of them
are such that age=adult is < 0.5, so there is no need to add more
"internal" nodes (i.e. tests) to this branch.

So the left child of the root is a leaf node with predicting
function equal to the average weight of instances (1) and (2).

Predicting function = (90 + 110)/2 = 100

Right child (age=elderly,adult > 0.5):

Note that instances (3), (4), (5), and (6) lie in this child. The only
remaining split point is p1, so we use it:

        ---------------------
        | age=elderly,adult |
        ---------------------
           /             \
     < 0.5/               \ > 0.5
         /                 \
   ~~~~~~~~~          --------------
   |  100   |         |  age=adult |
   ~~~~~~~~~          --------------
instances 1, 2           /       \
                   < 0.5/         \ > 0.5
                       /           \
                ~~~~~~~~~       ~~~~~~~~~
                |  145   |      |  160   |
                ~~~~~~~~~       ~~~~~~~~~
             instances 5, 6  instances 3, 4

The predictive function at each of the last two leaves is the
average weight of the instances that belong to the leaf.
```
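Read as code, the finished regression tree is just nested tests with a leaf mean at each branch end (a sketch; the function name is illustrative):

```python
def predict_weight(age):
    """The regression tree above, written as nested conditionals."""
    if age == "young":       # age=elderly,adult < 0.5
        return 100.0         # mean weight of instances 1, 2
    if age == "elderly":     # age=elderly,adult > 0.5, age=adult < 0.5
        return 145.0         # mean weight of instances 5, 6
    return 160.0             # age=adult > 0.5: mean weight of instances 3, 4

print(predict_weight("elderly"))  # 145.0
```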

4. (10 extra credit Points) Model Tree:
Now, modify the leaves of your regression tree above to obtain a MODEL TREE to predict weight. Show the precise function used to make predictions in each of the leaves.
```
Solution:

The structure of the model tree is identical to that of the regression
tree constructed above. The only difference is the predictive function
in each of the leaves. These are linear regressions that predict the
target attribute (weight) in terms of the attributes age=adult and
age=elderly,adult:

        ---------------------
        | age=elderly,adult |
        ---------------------
           /             \
     < 0.5/               \ > 0.5
         /                 \
   ~~~~~~~~~          --------------
   |  L1    |         |  age=adult |
   ~~~~~~~~~          --------------
instances 1, 2           /       \
                   < 0.5/         \ > 0.5
                       /           \
                ~~~~~~~~~       ~~~~~~~~~
                |  L2    |      |  L3    |
                ~~~~~~~~~       ~~~~~~~~~
             instances 5, 6  instances 3, 4

Where:

L1 = w0 + w1*(age=adult) + w2*(age=elderly,adult),
with w0, w1, and w2 computed using linear regression over instances 1, 2.

L2 = w3 + w4*(age=adult) + w5*(age=elderly,adult),
with w3, w4, and w5 computed using linear regression over instances 5, 6.

L3 = w6 + w7*(age=adult) + w8*(age=elderly,adult),
with w6, w7, and w8 computed using linear regression over instances 3, 4.

-------------------

For comparison purposes, I include below Weka's output for this problem.

=== Run information ===

Scheme:       weka.classifiers.m5.M5Prime -O m -U -F 10.0 -V 5
Relation:     how-much-you-weigh
Instances:    6
Attributes:   2
age
weight
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Unpruned training model tree:

age=elderly,adult <= 0.5 : LM1 (2/40%)
age=elderly,adult >  0.5 :
|   age=adult <= 0.5 : LM2 (2/60%)
|   age=adult >  0.5 : LM3 (2/40%)

Models at the leaves:

LM1:  weight = 100
LM2:  weight = 145
LM3:  weight = 160

Pruned training model tree:

LM1 (6/53.5%)

Models at the leaves:

Unsmoothed (simple):

LM1:  weight = 100 + 52.5age=elderly,adult

Number of Rules : 1

Time taken to build model: 0 seconds

=== Evaluation on training set ===
=== Summary ===

Correlation coefficient                  0.8796
Mean absolute error                     11.6667
Root mean squared error                 13.3853
Relative absolute error                 46.6667 %
Root relative squared error             47.5727 %
Total Number of Instances                6

```
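One detail behind Weka's unpruned leaf models is worth spelling out, sketched below: within each leaf the two binary attributes are constant across the leaf's instances, so a linear regression fitted on a leaf collapses to the leaf's mean weight, which is exactly why the unpruned models are the constants 100, 145, and 160:

```python
# Within each leaf the predictors age=adult and age=elderly,adult take a
# single value, so a least-squares fit w0 + w1*x1 + w2*x2 on a leaf's
# instances reduces to a constant: the leaf's mean weight.
leaves = {
    "LM1": [110, 90],    # instances 1, 2
    "LM2": [130, 160],   # instances 5, 6
    "LM3": [150, 170],   # instances 3, 4
}
models = {name: sum(ws) / len(ws) for name, ws in leaves.items()}
print(models)  # {'LM1': 100.0, 'LM2': 145.0, 'LM3': 160.0}
```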