CS 4445 Data Mining and Knowledge Discovery in Databases  A Term 2008
Homework and Project 4: Instance Based Learning and Clustering
DUE DATE:
 The individual homework assignment is due on Tuesday, Oct. 7th 2008 at 1:00 pm
 The group project is due on Friday, Oct. 10th 2008 at 12:00 noon.
HOMEWORK AND PROJECT OBJECTIVES
The purpose of this project is to construct the most accurate
models of different aspects of the datasets under consideration
using the following data mining techniques:
 Instance Based Learning:
 IBk: k-nearest neighbors
 Locally Weighted Regression. This is implemented in
the Weka system. Look for LWL (Locally Weighted Learning) under
the "lazy" classifiers and use "Linear Regression" as the
"classifier" parameter for this LWL.
 Clustering:
 Simple k-means
 Hierarchical Clustering: COBWEB/CLASSIT
Also, to gain a close understanding of how these methods work,
this project includes tracing those methods by hand
on a toy dataset.
INDIVIDUAL HOMEWORK ASSIGNMENT
See Solutions by Piotr Mardziel and Amro Khasawneh.
Consider the dataset below.
This dataset is an adaptation of the
IQ Brain Size Dataset.
@relation small_twin_dataset
@attribute CCMIDSA numeric % Corpus Callosum Surface Area (cm2)
@attribute GENDER {male,female}
@attribute TOTVOL numeric % Total Brain Volume (cm3)
@attribute WEIGHT numeric % Body Weight (kg)
@attribute FIQ numeric % Full-Scale IQ
@data
6.08, female, 1005, 57.607, 96
5.73, female, 963, 58.968, 89
7.99, female, 1281, 63.958, 101
8.42, female, 1272, 61.69, 103
6.84, female, 1079, 107.503, 96
6.43, female, 1070, 83.009, 126
7.6, male, 1347, 97.524, 94
6.03, male, 1029, 81.648, 97
7.52, male, 1204, 79.38, 113
7.67, male, 1160, 72.576, 124
 (40 points + 10 extra credit points) Instance-Based Learning
Assume that we want to predict the FIQ attribute (prediction target)
from the other predicting attributes
CCMIDSA, GENDER, TOTVOL, and WEIGHT.
 (10 points) Compute the 4 nearest neighbors of each of the two test instances below.
For this part, use the Euclidean distance without
modifying the numeric predicting attributes or using attribute weights.
Also, use the nominal attribute GENDER as given:
two different values of this attribute (e.g., "female" and "male")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g., "female" and "female")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report.
CCMIDSA GENDER TOTVOL WEIGHT FIQ
6.48, female, 1034, 62.143, ?
6.59, male, 1100, 88.452, ?
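The distance rule just described can be sketched in code. The following is an illustrative sketch only (not the graded answer): it computes the plain Euclidean distance over the raw CCMIDSA, TOTVOL, and WEIGHT values, adds 1 for a GENDER mismatch and 0 for a match, and leaves the target FIQ out of the distance.

```python
# Illustrative sketch of the mixed Euclidean/0-1 distance described above.
# FIQ is the prediction target and does not enter the distance.
import math

# (CCMIDSA, GENDER, TOTVOL, WEIGHT, FIQ) for the 10 training instances
train = [
    (6.08, "female", 1005, 57.607, 96),
    (5.73, "female", 963, 58.968, 89),
    (7.99, "female", 1281, 63.958, 101),
    (8.42, "female", 1272, 61.69, 103),
    (6.84, "female", 1079, 107.503, 96),
    (6.43, "female", 1070, 83.009, 126),
    (7.6, "male", 1347, 97.524, 94),
    (6.03, "male", 1029, 81.648, 97),
    (7.52, "male", 1204, 79.38, 113),
    (7.67, "male", 1160, 72.576, 124),
]

def distance(a, b):
    """Euclidean distance over CCMIDSA/TOTVOL/WEIGHT plus a 0/1 GENDER term."""
    sq = sum((a[i] - b[i]) ** 2 for i in (0, 2, 3))  # numeric predicting attributes
    sq += 0 if a[1] == b[1] else 1                   # nominal GENDER contribution
    return math.sqrt(sq)

def k_nearest(test, k=4):
    """Return the k training instances closest to the given test instance."""
    return sorted(train, key=lambda t: distance(test, t))[:k]

neighbors = k_nearest((6.48, "female", 1034, 62.143, None))
```

Note that with unnormalized data the TOTVOL differences (hundreds of cm3) dominate the distance, which is exactly the effect the later normalization step is meant to expose.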
 (5 points) Predict the FIQ attribute of the test instances using their
4 nearest neighbors without any distance weighting. (Note that now the FIQ
value of each instance is provided so that you can compute error values).
Show your work in your report.
4NN PREDICTION
CCMIDSA GENDER TOTVOL WEIGHT FIQ
6.48, female, 1034, 62.143, 127 ___________
6.59, male, 1100, 88.452, 114 ___________
4NN ERROR
root mean-square error (see p. 178) __________
mean absolute error (see p. 178) __________
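The mechanics of this part can be sketched as follows. The neighbor FIQ list and the actual/predicted pairs used in the sketch are made-up examples, not the answers to the exercise: the unweighted 4-NN prediction is just the mean of the neighbors' FIQ values, and the two error measures are computed over the test instances.

```python
# Sketch of unweighted k-NN prediction plus the two error measures
# (root mean-square error and mean absolute error, p. 178).
import math

def predict_unweighted(neighbor_fiqs):
    """Unweighted k-NN prediction: the plain average of the neighbors' targets."""
    return sum(neighbor_fiqs) / len(neighbor_fiqs)

def rmse(actual, predicted):
    """Root mean-square error over the test instances."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error over the test instances."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```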
 (5 points) Predict the target attribute FIQ of each test instance
using its 4 nearest neighbors weighted by the inverse of the distance.
That is, if the FIQ values of the 4 nearestneighbors of a test instance
are m1, m2, m3, and m4 and their respective distances to the test
instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors
are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and
the predicted value for the test instance is:
(w1*m1) + (w2*m2) + (w3*m3) + (w4*m4)
_______________________________________
w1 + w2 + w3 + w4
Show your work in your report.
4NN PREDICTION
CCMIDSA GENDER TOTVOL WEIGHT FIQ
6.48, female, 1034, 62.143, 127 __________
6.59, male, 1100, 88.452, 114 __________
4NN ERROR
root mean-square error (see p. 178) __________
mean absolute error (see p. 178) __________
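The weighted-prediction formula above can be sketched directly; the neighbor values and distances in the test usage are illustrative examples, not the actual 4 nearest neighbors.

```python
# Sketch of inverse-distance-weighted k-NN prediction:
# prediction = (w1*m1 + ... + w4*m4) / (w1 + ... + w4), with wi = 1/di.
def predict_weighted(neighbor_fiqs, distances):
    """Weighted k-NN prediction with weights 1/d (assumes no distance is exactly 0)."""
    weights = [1.0 / d for d in distances]
    return sum(w * m for w, m in zip(weights, neighbor_fiqs)) / sum(weights)
```

With equal distances this reduces to the plain average, so the unweighted prediction is the special case where all neighbors are equally far away.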
 Now preprocess each of the predicting numeric attributes
(CCMIDSA, TOTVOL, and WEIGHT)
so that they range from 0 to 1. That is, for each numeric attribute (except for
FIQ), replace each value with (value - min value)/(max value - min value),
where min value and max value are the minimum and maximum values of that
attribute respectively.
 (5 points)
Show your work in your report and show the resulting dataset.
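The normalization step can be sketched for one column; the CCMIDSA column below comes straight from the dataset (minimum 5.73, maximum 8.42).

```python
# Sketch of the min-max normalization step: rescale each numeric
# attribute to [0, 1] via (value - min) / (max - min).
def min_max_normalize(column):
    """Rescale a list of numeric values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# CCMIDSA column from the dataset
ccmidsa = [6.08, 5.73, 7.99, 8.42, 6.84, 6.43, 7.6, 6.03, 7.52, 7.67]
normalized = min_max_normalize(ccmidsa)
```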
 (10 points) Compute the 4 nearest neighbors of each of the two test instances below
with respect to this normalized dataset.
For this part, use the Euclidean distance without attribute weights.
Also, use the nominal attribute GENDER as given:
two different values of this attribute (e.g., "female" and "male")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g., "female" and "female")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report.
CCMIDSA GENDER TOTVOL WEIGHT FIQ
6.48, female, 1034, 62.143, ?
6.59, male, 1100, 88.452, ?
 (5 points) Predict the FIQ attribute of the test instances using their
4 nearest neighbors without any distance weighting. (Note that now the FIQ
value of each instance is provided so that you can compute error values).
Show your work in your report.
4NN PREDICTION
CCMIDSA GENDER TOTVOL WEIGHT FIQ
6.48, female, 1034, 62.143, 127 __________
6.59, male, 1100, 88.452, 114 __________
4NN ERROR
root mean-square error (see p. 178) __________
mean absolute error (see p. 178) __________
 (5 points) Predict the target attribute FIQ of each test instance
using its 4 nearest neighbors weighted by the inverse of the distance.
That is, if the FIQ values of the 4 nearestneighbors of a test instance
are m1, m2, m3, and m4 and their respective distances to the test
instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors
are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and
the predicted value for the test instance is:
(w1*m1) + (w2*m2) + (w3*m3) + (w4*m4)
_______________________________________
w1 + w2 + w3 + w4
Show your work in your report.
4NN PREDICTION
CCMIDSA GENDER TOTVOL WEIGHT FIQ
6.48, female, 1034, 62.143, 127 __________
6.59, male, 1100, 88.452, 114 __________
4NN ERROR
root mean-square error (see p. 178) __________
mean absolute error (see p. 178) __________
 (5 points) Which of the methods above produced the best results? Explain.
 (40 points) Clustering - Simple k-means

Assume that we want to cluster the instances in the small_twin_dataset
above in 2 clusters.
Use the Simple k-means algorithm with Euclidean distance without
modifying the numeric attributes or using attribute weights.
Also, use the nominal attribute GENDER as given:
two different values of this attribute (e.g., "female" and "male")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g., "female" and "female")
will contribute 0 towards the Euclidean distance of two data instances.
Assume that the 2 randomly selected initial centroids are:
CCMIDSA GENDER TOTVOL WEIGHT FIQ
6.48, female, 1034, 62.143, 127
6.59, male, 1100, 88.452, 114
Show the first 2 iterations of the Simple k-means clustering algorithm.
That is:
 (5 points) Assign data instances to the clusters represented by the 2 centroids.
 (5 points) Compute new centroids for the 2 resulting clusters.
 (5 points) Assign data instances to the clusters represented by the 2 new centroids.
 (5 points) Compute new centroids for these 2 new resulting clusters.
Show your work in your report.
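The two requested iterations can be sketched in code. This is an illustrative sketch, not the graded trace: here all five attributes (including FIQ) enter the distance, GENDER contributes 0/1 as described, numeric attributes are averaged in the centroid update, and the nominal GENDER is updated to the cluster's majority value (the convention Weka's SimpleKMeans follows; the assignment text itself does not pin this down).

```python
# Sketch of two iterations of simple k-means on the raw dataset,
# starting from the two given initial centroids.
import math
from collections import Counter

# (CCMIDSA, GENDER, TOTVOL, WEIGHT, FIQ)
data = [
    (6.08, "female", 1005, 57.607, 96),
    (5.73, "female", 963, 58.968, 89),
    (7.99, "female", 1281, 63.958, 101),
    (8.42, "female", 1272, 61.69, 103),
    (6.84, "female", 1079, 107.503, 96),
    (6.43, "female", 1070, 83.009, 126),
    (7.6, "male", 1347, 97.524, 94),
    (6.03, "male", 1029, 81.648, 97),
    (7.52, "male", 1204, 79.38, 113),
    (7.67, "male", 1160, 72.576, 124),
]
centroids = [
    (6.48, "female", 1034, 62.143, 127),
    (6.59, "male", 1100, 88.452, 114),
]

def distance(a, b):
    sq = sum((a[i] - b[i]) ** 2 for i in (0, 2, 3, 4))  # all numeric attributes
    sq += 0 if a[1] == b[1] else 1                      # nominal GENDER contribution
    return math.sqrt(sq)

def assign(instances, cents):
    """Assign each instance to the cluster of its nearest centroid."""
    clusters = [[] for _ in cents]
    for inst in instances:
        j = min(range(len(cents)), key=lambda c: distance(inst, cents[c]))
        clusters[j].append(inst)
    return clusters

def new_centroid(cluster):
    """Mean of numeric attributes; majority (mode) value for GENDER."""
    gender = Counter(inst[1] for inst in cluster).most_common(1)[0][0]
    m = {i: sum(inst[i] for inst in cluster) / len(cluster) for i in (0, 2, 3, 4)}
    return (m[0], gender, m[2], m[3], m[4])

for _ in range(2):  # the two iterations asked for
    clusters = assign(data, centroids)
    centroids = [new_centroid(c) for c in clusters]
```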

Repeat the process above but now starting with a normalized dataset. That is,
assume that we want to cluster the instances in the small_twin_dataset
above in 2 clusters
using the Simple k-means algorithm with Euclidean distance without attribute weights,
but preprocessing each of the numeric attributes
so that they range from 0 to 1. That is, for each numeric attribute,
including FIQ,
replace each value with (value - min value)/(max value - min value),
where min value and max value are the minimum and maximum values of that
attribute respectively.
Use the nominal attribute GENDER as given:
two different values of this attribute (e.g., "female" and "male")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g., "female" and "female")
will contribute 0 towards the Euclidean distance of two data instances.
Assume that the 2 randomly selected initial centroids are:
CCMIDSA GENDER TOTVOL WEIGHT FIQ
6.48, female, 1034, 62.143, 127
6.59, male, 1100, 88.452, 114
Show the first 2 iterations of the Simple k-means clustering algorithm.
That is:
 (5 points) Assign data instances to the clusters represented by the 2 centroids.
 (5 points) Compute new centroids for the 2 resulting clusters.
 (5 points) Assign data instances to the clusters represented by the 2 new centroids.
 (5 points) Compute new centroids for these 2 new resulting clusters.
Show your work in your report.
 (20 points) Clustering - Hierarchical Clustering
Assume that we want to cluster a dataset using
the COBWEB/CLASSIT clustering algorithm.
Assume that we have followed the COBWEB/CLASSIT algorithm and so far we
have created a partial clustering containing just the first 4 instances.
The following tree shows the current partial clustering, where the
numbers in parentheses (1), (2), (3), (4) represent the 1st, 2nd, 3rd, and
4th data instances
respectively and "o" denotes just an internal node.
          o
         / \
        o   (2)
       / \
      o   (3)
     / \
   (1)  (4)
Your job is to describe all the alternatives that are considered by the
COBWEB/CLASSIT algorithm when adding instance (5) to this clustering tree.
 (2 points)
How many such alternatives are considered by the COBWEB/CLASSIT algorithm?
 (15 points)
List each one of those alternatives and show what the resulting
clustering tree would be for each of those alternatives.
 (3 points)
Explain how the COBWEB/CLASSIT algorithm would choose which of those
alternatives to select as the best one. Note: you don't need to
compute the category utility value of each of those alternatives. You
just need to explain how the selection of the best alternative is
made by the algorithm.
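For background, the score COBWEB uses to compare such alternatives is category utility. A common textbook formulation (for nominal attributes, with CLASSIT's numeric analogue below it) is the following; this is reference material, not a substitute for the explanation asked for above.

```latex
% Category utility of a partition C_1,...,C_k over attributes a_i with
% values v_{ij} (standard COBWEB formulation for nominal attributes):
CU(C_1,\dots,C_k) \;=\; \frac{1}{k} \sum_{\ell=1}^{k} P(C_\ell)
  \sum_{i}\sum_{j} \Bigl[ P(a_i = v_{ij} \mid C_\ell)^2 - P(a_i = v_{ij})^2 \Bigr]

% CLASSIT's numeric analogue replaces each inner sum over values by
% 1/(2\sqrt{\pi}\,\sigma), where \sigma_{i\ell} is the standard deviation
% of attribute a_i within cluster C_\ell and \sigma_i its standard
% deviation over the whole dataset:
CU(C_1,\dots,C_k) \;=\; \frac{1}{k} \sum_{\ell=1}^{k} P(C_\ell)\,
  \frac{1}{2\sqrt{\pi}} \sum_{i} \Bigl( \frac{1}{\sigma_{i\ell}} - \frac{1}{\sigma_i} \Bigr)
```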
PROJECT ASSIGNMENT
[800 points: 100 points per new data mining technique (instance based learning,
clustering); 80 points per previously studied data mining technique;
40 points for meaningful interpretation of the resulting clusters for
each of the clustering methods on each dataset.
See below for the distribution of these points.]