CS 4445 Data Mining and Knowledge Discovery in Databases  A Term 2004
Homework and Project 4: Instance Based Learning and Clustering
DUE DATE:
Part I (the individual homework assignment) is due on Friday, October 8 at 1:00 pm
(NO LATE HOMEWORK SUBMISSIONS WILL BE ACCEPTED) and
Parts II.1 and II.2 (the individual+group project) are due on Monday, October 11 2004
at 10 pm.
HOMEWORK AND PROJECT DESCRIPTION
The purpose of this project is to construct the most accurate
models of different aspects of the two datasets under consideration
using the following data mining techniques:
 Instance Based Learning:
 IBk: k-nearest neighbors
 Locally Weighted Regression. This is implemented in
the Weka system. Look for LWL (Locally Weighted Learning) under
the "lazy" classifiers and use "Linear Regression" as the
"classifier" parameter for this LWL.
 Clustering:
 Simple k-means
 Hierarchical Clustering: COBWEB/CLASSIT
 Clustering using Expectation Maximization: EM
Also, to gain a close understanding of how those methods work,
this project also includes tracing those methods by hand
on a toy dataset.
HOMEWORK ASSIGNMENT
See the solutions to this HW assignment by James Martineau.
Consider the dataset below.
This dataset is a subset of the
Auto Miles-per-gallon (MPG) dataset that is available at the
University of California Irvine (UCI) Data Repository.
@relation subsetautompg
@attribute mpg numeric
@attribute cylinders numeric
@attribute horsepower numeric
@attribute weight numeric
@attribute acceleration numeric
@attribute modelyear numeric
@attribute carname {chevrolet,toyota,volkswagen,ford}
@data
18, 8, 130, 3504, 12, 70, chevrolet
27, 4, 90, 2950, 17.3, 82, chevrolet
34, 4, 88, 2395, 18, 82, chevrolet
24, 4, 95, 2372, 15, 70, toyota
28, 4, 75, 2155, 16.4, 76, toyota
32, 4, 96, 2665, 13.9, 82, toyota
26, 4, 46, 1835, 20.5, 70, volkswagen
29, 4, 70, 1937, 14.2, 76, volkswagen
44, 4, 52, 2130, 24.6, 82, volkswagen
10, 8, 215, 4615, 14, 70, ford
28, 4, 79, 2625, 18.6, 82, ford
16, 8, 149, 4335, 14.5, 77, ford
 (40 points + 10 extra credits) Instance Based Learning
For this part, we want to predict the mpg attribute (prediction target)
from the other predicting attributes
cylinders,
horsepower,
weight,
acceleration,
modelyear, and
carname
 (10 points) Compute the 5 nearest neighbors of each of the four test instances below.
For this part, use the Euclidean distance without
modifying the numeric predicting attributes or using attribute weights.
Also, use the nominal attribute carname as given:
two different values of this attribute (e.g. "toyota" and "ford")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g. "toyota" and "toyota")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report.
?, 8, 150, 4464, 12, 73, chevrolet
?, 6, 122, 2807, 13.5, 73, toyota
?, 4, 60, 1834, 19, 71, volkswagen
?, 4, 66, 1800, 14.4, 78, ford
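As a sanity check for these hand computations, the distance measure described above can be sketched in Python. This sketch is an illustration, not Weka's implementation; `nominal_idx` is a hypothetical parameter giving the positions of the nominal attributes (here only carname).

```python
import math

def distance(a, b, nominal_idx):
    # Euclidean distance over the predicting attributes; a nominal
    # attribute contributes 1 if the two values differ, 0 if they match.
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in nominal_idx:
            total += 0.0 if x == y else 1.0
        else:
            total += (x - y) ** 2
    return math.sqrt(total)

# First training instance vs. the first test instance (mpg omitted):
train = [8, 130, 3504, 12, 70, "chevrolet"]
test = [8, 150, 4464, 12, 73, "chevrolet"]
print(distance(train, test, nominal_idx={5}))  # dominated by the weight attribute
```

Note that without normalization the weight attribute, with the largest numeric range, dominates the distance; this is exactly what the normalized version of the exercise below is meant to expose.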
 (5 points) Predict the mpg attribute of the test instances using their
5 nearest neighbors without any distance weighting. (Note that now the mpg
value of each instance is provided so that you can compute error values).
Show your work in your report.
5NN
PREDICTION
13, 8, 150, 4464, 12, 73, chevrolet __________
20, 6, 122, 2807, 13.5, 73, toyota __________
27, 4, 60, 1834, 19, 71, volkswagen __________
36.1, 4, 66, 1800, 14.4, 78, ford __________
5NN
ERROR
root mean-square error (see p. 148) __________
mean absolute error (see p. 148) __________
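For checking your numbers, the unweighted 5-NN prediction is the plain average of the neighbors' mpg values, and the two error measures (textbook p. 148) can be computed as sketched below. The values in the example call are placeholders, not the answer to this question.

```python
import math

def knn_predict(neighbor_mpgs):
    # Unweighted k-NN regression: the mean of the neighbors' target values.
    return sum(neighbor_mpgs) / len(neighbor_mpgs)

def rmse(actual, predicted):
    # Root mean-square error over the test instances.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    # Mean absolute error over the test instances.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(knn_predict([18, 20, 22, 24, 26]))  # prints 22.0
```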
 (5 points) Predict the target attribute mpg of each test instance
using its 5 nearest neighbors weighted by the inverse of the distance.
That is, if the mpg values of the 5 nearest neighbors of a test instance
are m1, m2, m3, m4, and m5 and their respective distances to the test
instance are d1, d2, d3, d4, and d5, then the weights of the 5 nearest neighbors
are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, w4 = 1/d4, and w5 = 1/d5, and
the predicted value for the test instance is:
(w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) + (w5*m5)
________________________________________________
w1 + w2 + w3 + w4 + w5
Show your work in your report.
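The weighted formula above translates directly to code. This sketch assumes no neighbor sits at distance exactly 0 (a zero distance would need special handling, e.g. simply returning that neighbor's mpg value).

```python
def weighted_knn_predict(neighbors):
    # neighbors: list of (mpg, distance) pairs for the k nearest neighbors.
    # Each weight is the inverse of the distance, as in the formula above.
    weights = [1.0 / d for _, d in neighbors]
    weighted_sum = sum(w * m for w, (m, _) in zip(weights, neighbors))
    return weighted_sum / sum(weights)

# A neighbor 3 times farther away gets one third of the weight:
print(weighted_knn_predict([(10, 1.0), (30, 3.0)]))  # prints 15.0
```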
5NN with weight
PREDICTION
13, 8, 150, 4464, 12, 73, chevrolet __________
20, 6, 122, 2807, 13.5, 73, toyota __________
27, 4, 60, 1834, 19, 71, volkswagen __________
36.1, 4, 66, 1800, 14.4, 78, ford __________
5NN with weight
ERROR
root mean-square error (see p. 148) __________
mean absolute error (see p. 148) __________
 Now preprocess each of the predicting numeric attributes
(i.e. cylinders, horsepower, weight, acceleration, and modelyear)
so that they range from 0 to 1. That is, for each numeric attribute (except for
mpg), replace each value with (value - min. value)/(max. value - min. value),
where min. value and max. value are the minimum and maximum values of that
attribute respectively.
 (5 points)
Show your work in your report and show the resulting dataset.
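The min-max rescaling described above is applied one column at a time; this sketch assumes max. value > min. value (a constant-valued column would divide by zero).

```python
def normalize_column(values):
    # Rescale a numeric attribute to [0, 1]: (value - min) / (max - min).
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# modelyear in the dataset above ranges over 70..82:
print(normalize_column([70, 82, 82, 70, 76, 82, 70, 76, 82, 70, 82, 77]))
```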
 (10 points) Compute the 5 nearest neighbors of each of the four test instances below
with respect to this normalized dataset.
For this part, use the Euclidean distance without attribute weights.
Also, use the nominal attribute carname as given:
two different values of this attribute (e.g. "toyota" and "ford")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g. "toyota" and "toyota")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report.
?, 8, 150, 4464, 12, 73, chevrolet
?, 6, 122, 2807, 13.5, 73, toyota
?, 4, 60, 1834, 19, 71, volkswagen
?, 4, 66, 1800, 14.4, 78, ford
 (5 points) Predict the mpg attribute of the test instances using their
5 nearest neighbors without any distance weighting. (Note that now the mpg
value of each instance is provided so that you can compute error values).
Show your work in your report.
5NN
PREDICTION
13, 8, 150, 4464, 12, 73, chevrolet __________
20, 6, 122, 2807, 13.5, 73, toyota __________
27, 4, 60, 1834, 19, 71, volkswagen __________
36.1, 4, 66, 1800, 14.4, 78, ford __________
5NN
ERROR
root mean-square error (see p. 148) __________
mean absolute error (see p. 148) __________
 (5 points) Predict the target attribute mpg of each test instance
using its 5 nearest neighbors weighted by the inverse of the distance.
That is, if the mpg values of the 5 nearest neighbors of a test instance
are m1, m2, m3, m4, and m5 and their respective distances to the test
instance are d1, d2, d3, d4, and d5, then the weights of the 5 nearest neighbors
are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, w4 = 1/d4, and w5 = 1/d5, and
the predicted value for the test instance is:
(w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) + (w5*m5)
________________________________________________
w1 + w2 + w3 + w4 + w5
Show your work in your report.
5NN with weight
PREDICTION
13, 8, 150, 4464, 12, 73, chevrolet __________
20, 6, 122, 2807, 13.5, 73, toyota __________
27, 4, 60, 1834, 19, 71, volkswagen __________
36.1, 4, 66, 1800, 14.4, 78, ford __________
5NN with weight
ERROR
root mean-square error (see p. 148) __________
mean absolute error (see p. 148) __________
 (5 points) Which of the methods above produced the best results? Explain.
 (40 points) Clustering: Simple k-means

Assume that we want to cluster the instances in the dataset subsetautompg above into 3 clusters.
Use the Simple k-means algorithm with Euclidean distance without
modifying the numeric attributes or using attribute weights.
Also, use the nominal attribute carname as given:
two different values of this attribute (e.g. "toyota" and "ford")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g. "toyota" and "toyota")
will contribute 0 towards the Euclidean distance of two data instances.
Assume that the 3 randomly selected initial centroids are:
18, 8, 130, 3504, 12, 70, chevrolet
32, 4, 96, 2665, 13.9, 82, toyota
26, 4, 46, 1835, 20.5, 70, volkswagen
Show the first 2 iterations of the Simple k-means clustering algorithm.
That is:
 (5 points) Assign data instances to the clusters represented by the 3 centroids.
 (5 points) Compute new centroids for the 3 resulting clusters.
 (5 points) Assign data instances to the clusters represented by the 3 new centroids.
 (5 points) Compute new centroids for these 3 new resulting clusters.
Show your work in your report.
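One iteration of the algorithm (an assignment step followed by a centroid-update step) can be sketched as below. Treating the centroid value of the nominal carname attribute as the cluster's most frequent value is an assumption of this sketch (it matches what Weka's SimpleKMeans does); `dist` stands for the Euclidean distance with the nominal-attribute rule stated above.

```python
from collections import Counter

def assign(instances, centroids, dist):
    # Step 1: put each instance in the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for inst in instances:
        j = min(range(len(centroids)), key=lambda k: dist(inst, centroids[k]))
        clusters[j].append(inst)
    return clusters

def new_centroid(cluster, nominal_idx):
    # Step 2: mean of each numeric attribute, mode of each nominal one.
    centroid = []
    for i in range(len(cluster[0])):
        vals = [inst[i] for inst in cluster]
        if i in nominal_idx:
            centroid.append(Counter(vals).most_common(1)[0][0])  # most frequent value
        else:
            centroid.append(sum(vals) / len(vals))
    return centroid
```

Repeating these two steps gives the two iterations requested; the algorithm stops when the assignment no longer changes.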

Repeat the process above but now starting with a normalized
dataset. That is,
assume that we want to cluster the instances in the dataset above into 3 clusters
using the Simple k-means algorithm with Euclidean distance without attribute weights,
but preprocessing each of the numeric attributes
(i.e. mpg, cylinders, horsepower, weight, acceleration, and modelyear)
so that they range from 0 to 1. That is, for each numeric attribute,
replace each value with (value - min. value)/(max. value - min. value),
where min. value and max. value are the minimum and maximum values of that
attribute respectively.
Use the nominal attribute carname as given:
two different values of this attribute (e.g. "toyota" and "ford")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g. "toyota" and "toyota")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report and show the resulting dataset.
Assume that the 3 randomly selected initial centroids are:
18, 8, 130, 3504, 12, 70, chevrolet
32, 4, 96, 2665, 13.9, 82, toyota
26, 4, 46, 1835, 20.5, 70, volkswagen
Show the first 2 iterations of the Simple k-means clustering algorithm.
That is:
 (5 points) Assign data instances to the clusters represented by the 3 centroids.
 (5 points) Compute new centroids for the 3 resulting clusters.
 (5 points) Assign data instances to the clusters represented by the 3 new centroids.
 (5 points) Compute new centroids for these 3 new resulting clusters.
Show your work in your report.
 (20 points) Clustering: Hierarchical Clustering
Assume that we want to cluster the dataset subsetautompg above using
the COBWEB/CLASSIT clustering algorithm.
Assume that we have followed the COBWEB/CLASSIT algorithm and so far we
have created a partial clustering containing just the first 3 instances.
The following tree shows the current partial clustering, where the
numbers in parentheses (1), (2), (3) represent the 1st, 2nd, and 3rd data instances
respectively and "0" denotes just an internal node. Note that in this partial
clustering, instances (1) and (3) are clustered together and instance (2) forms
a cluster on its own.
0
/ \
0 (2)
/ \
(1) (3)
Your job is to describe all the alternatives that are considered by the
COBWEB/CLASSIT algorithm when adding instance (4) to this clustering tree.
 (2 points)
How many such alternatives are considered by the COBWEB/CLASSIT algorithm?
 (15 points)
List each one of those alternatives and show what the resulting
clustering tree would be for each of those alternatives.
 (3 points)
Explain how the COBWEB/CLASSIT algorithm would choose which of those alternatives
to select as the best one. Note: you don't need to compute the category utility
value of each of those alternatives. You just need to explain how the selection
of the best alternative is made by the algorithm.
PROJECT ASSIGNMENT
 Datasets: Consider the following sets of data:
 The
Synthetic Control Chart Time Series Dataset
available from the
University of California Irvine (UCI) KDD Data Repository.
When needed, use the last attribute as the prediction target.
After you run experiments predicting this attribute you may, if you wish,
run additional experiments using a different predicting target of your choice.
GIVEN THAT GRAPHICAL REPRESENTATIONS OF THE TIME SERIES PROVIDE INTUITION AND
CONVEY VERY USEFUL INFORMATION,
it is required in this project that you use a graphing tool (e.g. Excel) to
plot the resulting models that you construct over this time series dataset:
for example, the set of k-nearest neighbors
of a given test data instance in the case of instance based learning, and the
set of resulting clusters in the case of clustering.
Modify the Weka code as needed so that you can gain access to the necessary data for those plots
(i.e. the k-nearest neighbors of a test instance and the instances in each resulting cluster).
 Choose one of the following two datasets (please coordinate with your
group partner so that you both choose the same one):
 The
Abalone Dataset
available from the
University of California Irvine (UCI) Data Repository.
Use the attribute age as the prediction target.
After you run experiments predicting this attribute you may, if you wish,
run additional experiments using a different predicting target of your choice.
 The
Automobile Dataset
available from the
University of California Irvine (UCI) Data Repository.
Use the attribute price as the prediction target.
After you run experiments predicting this attribute you may, if you wish,
run additional experiments using a different predicting target of your choice.
Meaningful graphical representation of the results (i.e. the k-nearest neighbors
of a given test data instance in the case of instance based learning, and the
set of resulting clusters in the case of clustering) for the chosen dataset
will be rewarded with extra credit.
 Readings:
 Textbook:
Read in great detail the following Sections from your textbook:
 Instance Based Learning: Sections 4.7, 6.4.
 Clustering: Section 6.6
 Weka Code:
Read the code of the relevant techniques implemented in the Weka system.
Some of those techniques are enumerated below:
 Instance Based Learning:
 IBk: k-nearest neighbors
 LWL: Locally Weighted Regression
 Clustering:
 Simple k-means
 OPTIONAL (extra points): Hierarchical Clustering: COBWEB/CLASSIT
 OPTIONAL (extra points): Clustering using Expectation Maximization: EM
 Experiments:
For each of the datasets,
use the "Explorer" option of the Weka system to perform the
following operations:

A main part of the project is the PREPROCESSING of your dataset.
You should apply relevant filters to your dataset
before doing the mining and/or using the results of previous mining tasks.
For instance, you may decide to remove apparently irrelevant attributes,
replace missing values if any, etc.
Missing values in arff files are represented with the character "?".
See the weka.filter.ReplaceMissingValuesFilter in Weka.
Play with the filter and read the Java code implementing it.
Your report should contain a detailed description of the preprocessing of
your dataset and justifications of the steps you followed.
If Weka does not provide the functionality you need to preprocess your data
in order to obtain useful patterns, preprocess the data yourself,
e.g. by writing the necessary filters (you can incorporate them into Weka if you wish).

You may restrict your experiments to a subset of the instances in the input data IF
Weka cannot handle your whole dataset (this is unlikely). But remember that the
more accurate your models, the better.
 INDIVIDUAL PROJECT AND WRITTEN REPORT.
Your individual report should contain discussions of all the parts of the individual
work you do for this project.
In particular, it should elaborate on the following topics:
 Code Description:
Describe algorithmically the Weka code of the following 3 methods:
IBk: k-nearest neighbors, Locally Weighted Regression (LWL), and Simple k-means
(optional for extra points: COBWEB/CLASSIT and EM);
and filters that you use in the project.
More precisely, explain the ALGORITHM underlying the code in terms of the input
it receives, the output it produces, and the main steps it follows to produce this
output. PLEASE NOTE THAT WE EXPECT A DETAILED DESCRIPTION OF THE ALGORITHMS USED,
NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE.
 Experiments:
For EACH EXPERIMENT YOU RAN describe:
 Instances: What data did you use for the experiment?
That is, did you use the entire dataset or just a subset of it? Why?
 Any preprocessing done to the data. That is, did you remove
any attributes?
Did you replace missing values?
If so, what strategy did you use to select a replacement of
the missing values?
 Your system parameters.
 For each of the methods covered by this project
IBk: k-nearest neighbors, Locally Weighted Regression (LWL), and Simple k-means
(optional for extra points: COBWEB/CLASSIT and EM);
 Results and detailed ANALYSIS of the results of the experiments you ran
using different ways of testing the model (percentage split and N-fold
cross-validation).
 Error/accuracy measures of the resulting models.
 NEW PART:
Meaningful graphical representation of the results (i.e. the k-nearest neighbors
of a given test data instance in the case of instance based learning, and the
set of resulting clusters in the case of clustering).
This is required for the time series dataset, and optional (extra credit) for
the second dataset.
 Comparison of the results across the methods
used in this project.
 Summary of Results
 For each of the datasets, what were the best results obtained
in your project?
Include (the first 100 lines or so of) this model in your report.
 Strengths and weaknesses of your project.
 GROUP PROJECT AND WRITTEN REPORT.
Your group report should contain discussions of all the parts of the group project.
In particular, it should elaborate on the following topics:
 Experiments: as described in the individual part above.
For the group part, start by running experiments that build upon
the experience that you gained with the individual projects.
 Summary of Results as described in the individual part above.
 GROUP ORAL REPORT.
We will discuss the results from the individual projects during the class
on Tuesday, October 12. Your oral report should summarize the different sections of
your written report as described above.
Each group will have about 4 minutes to explain its results and to discuss
its project in class. Be prepared!
PROJECT SUBMISSION AND DUE DATE
Part II is due Monday, Oct. 11 at 10:00 pm.
Submissions received on Monday, Oct. 11
between 10:01 pm and 12:00 midnight will be penalized with 30% off the grade and
submissions after Oct. 11 won't be accepted.
Please submit the following files using the
myWpi digital drop box:
 [lastname]_proj4_report.[ext]
containing your individual written reports.
This file should be either a PDF file (ext=pdf),
a Word file (ext=doc), or a PostScript file (ext=ps).
For instance, my file would be named following this pattern
(note the use of lower case letters only).
If you are taking this course for grad. credit, state this fact
at the beginning of your report. In this case you submit only
an individual report containing both the "individual" and the
"group" parts, as you are working
all by yourself on the projects.
 [lastname1_lastname2]_proj4_report.[ext]
containing your group written reports.
This file should be either a PDF file (ext=pdf),
a Word file (ext=doc), or a PostScript file (ext=ps).
For instance my file would be named
(note the use of lower case letters only):
 ruiz_smith_proj4_report.pdf if I worked with Joe Smith on this project.
 [lastname1_lastname2]_proj4_slides.[ext]
(or [lastname]_proj4_slides.[ext] in the case of
students taking this course for graduate credit)
containing your slides for your oral reports.
This file should be either a PDF file (ext=pdf)
or a PowerPoint file (ext=ppt).
Your group will have only 4 minutes in class to discuss
the entire project (both individual and group parts, covering both
instance based learning and clustering).
GRADING CRITERIA
(TOTAL: 100 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.
(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:
(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(10 points) Description of the ALGORITHM underlying the data mining
methods used in this project.
(up to 10 extra credit points for an outstanding job)
(extra credit points for description of the optional methods: COBWEB/CLASSIT and EM)
(providing just a structural description of the code, i.e. a list of
classes and methods, will receive 0 points)
(TOTAL: 5 points) PREPROCESSING OF THE DATASET:
Preprocessing attributes as needed and dealing with missing values appropriately
(up to 5 extra credit points)
Trying to do "fancier" things with attributes
(e.g. combining two highly correlated attributes
into one, using background knowledge, etc.)
(TOTAL: 160 points: 80 points for the individual part and 80 points for the group part)
EXPERIMENTS
(TOTAL: 40 points each dataset) FOR EACH DATASET:
(05 points) ran a good number of experiments to get familiar with the
data mining methods in this project
(10 points) good description of the experiment setting and the results
(10 points) good analysis of the results of the experiments
INCLUDING discussion of evaluations statistics returned by
the Weka system (accuracy and/or errors) and discussion of
particularly interesting results
(10 points) good graphical representations of the results for the 1st dataset
(10 extra points) good graphical representations of the results for the 2nd dataset
(05 points) comparison of the results with those obtained using the other
methods in this project.
Argumentation of weaknesses and/or strengths of each of the
methods on this dataset, and argumentation of which method
should be preferred for this dataset and why.
(up to 10 extra credit points) excellent analysis of the results and
comparisons
(up to 10 extra credit points) running additional interesting experiments
(TOTAL 5 points) SLIDES: how well do they concisely summarize
the results of the project? We suggest you summarize the
settings of your experiments and their results in tabular form.
(up to 6 extra credit points) for excellent summary and presentation of results
in the slides.
(TOTAL 15 points) Class presentation: how well your oral presentation concisely
summarized the results of the project and how focused your presentation was
on the most creative/interesting/useful of your experiments and results.
This grade is given individually to each team member.