CS 4445 Data Mining and Knowledge Discovery in Databases  B Term 2006
Homework and Project 4: Instance Based Learning and Clustering
DUE DATE:
 Part I (the individual homework assignment) is due on Tuesday, Dec. 5 at 11:50 am
 Part II (the group project) is due on Friday, Dec. 8 2006
at 11:50 am.
HOMEWORK AND PROJECT DESCRIPTION
The purpose of this project is to construct the most accurate
models of different aspects of the two datasets under consideration
using the following data mining techniques:
 Instance Based Learning:
 IBk: knearest neighbors
 Locally Weighted Regression. This is implemented in
the Weka system. Look for LWL (Locally Weighted Learning) under
the "lazy" classifiers and use "Linear Regression" as the
"classifier" parameter for this LWL.
 Clustering:
 Simple kmeans
 Hierarchical Clustering: COBWEB/CLASSIT
 Clustering using Expectation Maximization: EM
Also, to gain close understanding of how those methods work,
this project also include following those methods by hand
on a toy dataset.
HOMEWORK ASSIGNMENT
See Piotr Mardziel's Homework Solutions.
Consider the dataset below.
This dataset is an adaptation of the
World Happiness Dataset.
@relation world_happiness
%  Life Expectancy from UN Human Development Report (2003)
%  GDP per capita from figure published by the CIA (2006), figure in US$.
%  Access to secondary education rating from UNESCO (2002)
%  SWL (satisfaction with life) index calculated from data published
% by New Economics Foundation (2006).
@attribute country string
@attribute continent {Americas,Africa,Asia,Europe}
@attribute lifeexpectancy numeric
@attribute GDPpercapita numeric
@attribute accesstoeducationscore numeric
@attribute SWLindex numeric
@data
Switzerland, Europe, 80.5, 32.3, 99.9, 273.33
Canada, Americas, 80, 34, 102.6, 253.33
Usa, Americas, 77.4, 41.8, 94.6, 246.67
Germany, Europe, 78.7, 30.4, 99, 240
Mexico, Americas, 75.1, 10, 73.4, 230
France, Europe, 79.5, 29.9, 108.7, 220
Thailand, Asia, 70, 8.3, 79, 216.67
Brazil, Americas, 70.5, 8.4, 103.2, 210
Japan, Asia, 82, 31.5, 102.1, 206.67
India, Asia, 63.3, 3.3, 49.9, 180
Ethiopia, Africa, 47.6, 0.9, 5.2, 156.67
Russia, Asia, 65.3, 11.1, 81.9, 143.3
Note that the attribute country identifies each data instance uniquely
and as such will be disregarded in our analysis. It is provided just for context.
 (40 points + 10 extra credits) Instance Based Learning
Assume that we want to predict the SWLindex attribute (prediction target)
from the other predicting attributes
continent, lifeexpectancy, GDPpercapita, accesstoeducationscore.
 (10 points) Compute the 4 nearest neighbors of each of the three test instances below.
For this part, use the Euclidean distance without
modifying the numeric predicting attributes nor using attribute weights.
Also, use the nominal attribute continent as given:
two different values of this attribute (e.g., "Africa" and "Asia")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g., "Africa" and "Africa")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report.
Costa_Rica, Americas, 78.2, 11.1, 50.9, ?
United_Kingdom, Europe, 78.4, 30.3, 157.2, ?
South_Africa, Africa, 48.4, 12, 90.2, ?
 (5 points) Predict the SWLindex attribute of the test instances using their
4 nearest neighbors without any distance weighting. (Note that now the SWLindex
value of each instance is provided so that you can compute error values).
Show your work in your report.
4NN
PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________
South_Africa, Africa, 48.4, 12, 90.2, 190 __________
4NN
ERROR
root meansquare error (see p. 178) __________
mean absolute error (see p. 178) __________
 (5 points) Predict the target attribute SWLindex of each test instance
using its 4 nearest neighbors weighted by the inverse of the distance.
That is, if the SWLindex values of the 4 nearestneighbors of a test instance
are m1, m2, m3, and m4 and their respective distances to the test
instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors
are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and
the predicted value for the test instance is:
(w1*m1) + (w2*m2) + (w3*m3) + (w4*m4)
_______________________________________
w1 + w2 + w3 + w4
Show your work in your report.
4NN
PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________
South_Africa, Africa, 48.4, 12, 90.2, 190 __________
4NN
ERROR
root meansquare error (see p. 178) __________
mean absolute error (see p. 178) __________
 Now preprocess each of the predicting numeric attributes
(lifeexpectancy, GDPpercapita, accesstoeducationscore)
so that they range from 0 to 1. That is, for each numeric attribute (except for
SWLindex), replace each value with (value  min. value)/(max. value  min. value),
where min. value and max. value are the minimum and maximum values of that
attribute respectively.
 (5 points)
Show your work in your report and show the resulting dataset.
 (10 points) Compute the 4 nearest neighbors of each of the three test instances below
with respect to this normalized dataset.
For this part, use the Euclidean distance without attribute weights.
Also, use the nominal attribute continent as given:
two different values of this attribute (e.g., "Africa" and "Asia")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g., "Africa" and "Africa")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report.
Costa_Rica, Americas, 78.2, 11.1, 50.9, ?
United_Kingdom, Europe, 78.4, 30.3, 157.2, ?
South_Africa, Africa, 48.4, 12, 90.2, ?
 (5 points) Predict the SWLindex attribute of the test instances using their
4 nearest neighbors without any distance weighting. (Note that now the SWLindex
value of each instance is provided so that you can compute error values).
Show your work in your report.
4NN
PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________
South_Africa, Africa, 48.4, 12, 90.2, 190 __________
4NN
ERROR
root meansquare error (see p. 178) __________
mean absolute error (see p. 178) __________
 (5 points) Predict the target attribute SWLindex of each test instance
using its 4 nearest neighbors weighted by the inverse of the distance.
That is, if the SWLindex values of the 4 nearestneighbors of a test instance
are m1, m2, m3, and m4 and their respective distances to the test
instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors
are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and
the predicted value for the test instance is:
(w1*m1) + (w2*m2) + (w3*m3) + (w4*m4)
_______________________________________
w1 + w2 + w3 + w4
Show your work in your report.
4NN
PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________
South_Africa, Africa, 48.4, 12, 90.2, 190 __________
4NN
ERROR
root meansquare error (see p. 178) __________
mean absolute error (see p. 178) __________
 (5 points) Which of the methods above produced the best results? Explain.
 (40 points) Clustering  Simple Kmeans

Assume that we want to cluster the instances in the dataset worldhappiness
above in 3 clusters.
Use the Simple Kmeans algorithm with Euclidean distance without
modifying the numeric predicting attributes nor using attribute weights.
Also, use the nominal attribute continent as given:
two different values of this attribute (e.g., "Africa" and "Asia")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g., "Africa" and "Africa")
will contribute 0 towards the Euclidean distance of two data instances.
Assume that the 3 randomly selected initial centroids are:
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67
South_Africa, Africa, 48.4, 12, 90.2, 190
Show the first 2 iterations of the Simple Kmeans clustering algorithm.
That is:
 (5 points) Assign data instances to the clusters represented by the 3 centroids.
 (5 points) Compute new centroids for the 3 resulting clusters.
 (5 points) Assign data instances to the clusters represented by the 3 new centroids.
 (5 points) Compute new centroids for these 3 new resulting clusters.
Show your work in your report.

Repeat the process above but now starting with a normalized dataset. That is,
assume that we want to cluster the instances in the dataset worldhappiness
above in 3 clusters
using the Simple Kmeans algorithm with Euclidean distance without attribute weights,
but preprocessing each of the numeric attributes
so that they range from 0 to 1. That is, for each numeric attribute,
replace each value with (value  min. value)/(max. value  min. value),
where min. value and max. value are the minimum and maximum values of that
attribute respectively.
Use the nominal attribute continent as given:
two different values of this attribute (e.g., "Africa" and "Asia")
will contribute 1 towards the Euclidean distance, and
two "same values" of this attribute (e.g., "Africa" and "Africa")
will contribute 0 towards the Euclidean distance of two data instances.
Assume that the 3 randomly selected initial centroids are:
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67
South_Africa, Africa, 48.4, 12, 90.2, 190
Show the first 2 iterations of the Simple Kmeans clustering algorithm.
That is:
 (5 points) Assign data instances to the clusters represented by the 3 centroids.
 (5 points) Compute new centroids for the 3 resulting clusters.
 (5 points) Assign data instances to the clusters represented by the 3 new centroids.
 (5 points) Compute new centroids for these 3 new resulting clusters.
Show your work in your report.
 (20 points) Clustering  Hierarchical Clustering
Assume that we want to cluster a dataset using
the COBWEB/CLASSIT clustering algorithm.
Assume that we have followed the COBWEB/CLASSIT algorithm and so far we
have created a partial clustering containing just the first 3 instances.
The following tree shows the current partial clustering, where the
numbers in parenthesis (1), (2), (3) represent the 1st, 2nd, and 3rd data instances
respectively and "0" denotes just an internal node. Note that in this partial
clustering, instances (1) and (3) are clustered together and instance (2) forms
a cluster on its own.
O
/ \
0 (2)
/ \
(1) (3)
Your job is to describe all the alternatives that are considered by the
when COBWEB/CLASSIT algorithm when adding instance (4) to this clustering tree.
 (2 points)
How many such alternatives are considered by the COBWEB/CLASSIT algorithm?
 (15 points)
List each one of those alternatives and show what the resulting
clustering tree would be for each of those alternatives.
 (3 points)
Explain how the COBWEB/CLASSIT algorithm would choose which of those alternatives
to select as the best one. Note: you don't need to compute the category utility
value of each of those alternatives. You just need to explain how the selection
of the best alternative is made by the algorithm.
PROJECT ASSIGNMENT
 Dataset: Consider the following dataset:
Meaningful graphical representation of the results (i.e. knearest neighbors
of a given test data instance, in the case of instance based learning; and the
set of resulting clusters; in the case of clusterings) for the chosen dataset
will be rewarded with extra credit.
 Readings:
 Textbook:
Read in great detail the following Sections from your textbook:
 Instance Based Learning: Sections 4.7, 6.4.
 Clustering: Sections 6.6
 Weka Code:
Read the code of the relevant techniques implemented in the Weka system.
Some of those techniques are enumerated below:
 Instance Based Learning:
 IBk: knearest neighbors
 LWL: Locally Weighted Linear Regression
 Clustering:
 Simple kmeans
 OPTIONAL (extra points): Hierarchical Clustering: COBWEB/CLASSIT
 OPTIONAL (extra points): Clustering using Expectation Maximization: EM
 Experiments:
Use the "Explorer" option of the Weka system to perform the
following operations:

A main part of the project is the PREPROCESSING of your dataset.
You should apply relevant filters to your dataset
before doing the mining and/or using the results of previous mining tasks.
For instance, you may decide to remove apparently irrelevant attributes,
replace missing values if any, etc.
Missing values in arff files are represented with the character "?".
See the weka.filter.ReplaceMissingValuesFilter in Weka.
Play with the filter and read the Java code implementing it.
Your report should contain a detailed description of the preprocessing of
your dataset and justifications of the steps you followed.
If Weka does not provide the functionality you need to preprocess your data
as you need to obtain useful patterns, preprocess the data yourself either
by writing the necessary filters (you can incorporate them in Weka if you wish).

You may restrict your experiments to a subset of the instances in the input data IF
Weka cannot handle your whole dataset (this is unlikely). But remember that the
more accurate your models, the better.
 Before you start running experiments, look at the raw data in detail.
Figure out 5 or more specific, interesting questions that you want
to answer with your experiments. These questions may be phrased as
conjectures that you want to confirm/refute with your experimental
results.
Sample questions include: What's the percentage of knearest neighbors
of an instance that belong to the same continent of the instance?
When clustering is performed, do countries in the same continent tend
to be grouped together?
Do properties of countries on the same continent tend to be similar?
 Design your preprocessing and experiments around answering these
5 questions.
 Analyze your resulting models in the light of your 5 questions.
 (Extra Credit) You may want to modify the Weka code so that
it outputs:
 (15 points) the knearest neighbors of the each test instance
 (15 points for each of the 3 clustering methods)
the specific training instances assigned to each cluster
 GROUP PROJECT AND WRITTEN REPORT.
Your report should contain discussions of all the
work that you do for this project.
In particular, it should elaborate on the the following topics:
 Code Description:
Describe algorithmicly the Weka code of the following 3 methods:
IBk: knearest neighbors, Locally Weighted Regression (LWL), and Simple Kmeans
(optional for extra points: COBWEB/CLASSIT and EM);
and filters that you use in the project.
More precisely, explain the ALGORITHM underlying the code in terms of the input
it receives, the output it produces, and the main steps it follows to produce this
output. PLEASE NOTE THAT WE EXPECT A DETAIL DESCRIPTION OF THE ALGORITHMS USED
NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE.
 Experiments:
For EACH EXPERIMENT YOU RAN describe:
 Which of the 5 or more specific questions/conjectures about
the dataset domain
you aim to answer/validate with your experiment?
 Instances: What data did you use for the experiment?
That is, did you use the entire dataset of just a subset of it? Why?
 Any preprocessing done to the data. That is, did you remove
any attributes?
Did you replace missing values?
If so, what strategy did you use to select a replacement of
the missing values?
 Your system parameters.
 For each of the methods covered by this project
IBk: knearest neighbors, Locally Weighted Regression (LWL), and Simple Kmeans
(optional for extra points: COBWEB/CLASSIT and EM);
 Results and detail ANALYSIS of results of the experiments you ran.
10fold crossvalidation is recommended.
 Error/accuracy measures of the resulting models.
 Detailed analysis of the models obtained.
Elaborate on interesting characteristics of those models.
Do these models answer any of your 5 questions?
If so, what's the answer? If not, why not and what modifications
to the experiments are needed to answer those questions?
 Comparison of the results across the methods
used in this project.
 OPTIONAL (20 EXTRA CREDIT POINTS) PART:
Meaningful graphical representation of the results (i.e. knearest neighbors
of a given test data instance, in the case of instance based learning; and the
set of resulting clusters; in the case of clustering).
 Summary of Results
 For each of the datasets, what were the best results obtained
in your project?
Include (the first 100 lines or so of) this model in your report.
 Strengths and weaknesses of your project.
 GROUP ORAL REPORT.
We will discuss the results from the individual projects during the class
on Thursday, Dec. 14th. Your oral report should summarize the different sections of
your written report as described above.
Each group will have about 4 minutes to explain your results and to discuss
your project in class. Be prepared!
PROJECT SUBMISSION AND DUE DATE
Part II is due Friday, Dec. 8th at 11:50 am.
BRING A HARDCOPY OF THE WRITTEN REPORT WITH YOU TO CLASS.
In addition, you must submit your report electronically as specified below.
Submissions received on Friday, Dec. 8th between 11:51 am and 12:00 midnight
will be penalized with 30% off the grade, submissions received on Saturday
Dec. 9th between 12:01 am (early morning) and 8:00 am will be penalized
with 60% off the grade; and submissions received after Saturday Dec. 9th at
8:00 am won't be accepted.
Please submit the following files using the
myWpi digital drop box:
 [lastname]_proj4_report.[ext]
containing your individual written reports.
This file should be either a PDF file (ext=pdf),
a Word file (ext=doc), or a PostScript file (ext=ps).
For instance my file would be named
(note the use of lower case letters only):
If you are taking this course for grad. credit, state this fact
at the beginning of your report. In this case you submit only
an individual report containing both the "individual" and the
"group" parts, as you are working
all by yourself on the projects.
 [lastname1_lastname2]_proj4_report.[ext]
containing your group written reports.
This file should be either a PDF file (ext=pdf),
a Word file (ext=doc), or a PostScript file (ext=ps).
For instance my file would be named
(note the use of lower case letters only):
 ruiz_smith_proj4_report.pdf if I worked with Joe Smith on this project.
 [lastname1_lastname2]_proj4_slides.[ext]
(or [lastname]_proj4_slides.[ext] in the case of
students taking this course for graduate credit)
containing your slides for your oral reports.
This file should be either a PDF file (ext=pdf)
or a PowerPoint file (ext=ppt).
Your group will have only 4 minutes in class to discuss
the entire project (both individual and group parts, and
classification and association rules).
GRADING CRITERIA
(TOTAL: 100 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.
(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:
NOTES:
Along
with scoring criteria, provided are lists of suggestions. Positive
suggestions (marked with green "+") include aspects of a project that
have positive impacts on the scoring criteria. Negative suggestions
(marked with red "") have negative impacts. Finally, suggestions
marked with "!" result in extra points.
+this has a positive impact
+ +this has a greater positive impact
+ + +this has a significant positive impact
+ + + +this has very significant positive impact
+ + + + +this has huge positive impact
this has a negative impact
 this has a greater negative impact
  this has a significant negative impact
   this has very significant negative impact
    this has huge negative impact
METHODS for this project are:
IBk,
knearest neighbors,
locally weighted regression (LWL \w linear regression), and
simple kmeans.
(OPTIONAL) METHODS for this project are:
COBWEB/CLASSIT and
EM
DATASETS for this project are:
dataset1: world happiness dataset
TOTAL: 15 points: Class Presentation
(15
points) How well your oral presentation summarized concisely the
results of the project and how focus your presentation was on the more
creative/interesting/useful of your experiments and results. This grade
is given individually to each team member.
TOTAL: 185 points: Project 4 Report
TOTAL: 20 points: Algorithmic Description of the Code
(05
points) Description of the algorithms underlying the WEKA filters used.
If you have already described some used filters in previous projects
then copying and pasting your work would be a good idea. If issues were
raised about the old descriptions, consider revising them.
(TOTAL:
15 points) Description of the ALGORITHMS underlying the data mining
methods used in this project (see METHODS). This includes:
(10 points) Descriptions of model construction
(5 points) Descriptions of instance classification/prediction
+(very) highlevel pseudo code with explanation and justification
 raw pseudo code with no justification or reasoning behind the steps involved
    structural description of WEKA code, i.e. classes/members/methods
TOTAL: 5 points: Slides
(5
points) How well do they summarize concisely the results of the
project? We suggest you summarize the setting of your experiments and
their results in a tabular manner.
+ (potential) unanswered questions presented to class
+ visual aids
+ +main ideas and observations summarized
 nothing but accuracy measures
 lack of visual aids
TOTAL: 5 points: PreProcessing of the Datasets
(5 points) Preprocess attributes as needed and dealing with missing values appropriately
TOTAL: 155 points: Experiments Goals
(30
points) Good description of the experiment setting and the results.
This applies to all the experiments done for this project.
+ + + motivation for experiments described
+ specify testing method used (crossvalidation / % split / etc.)
+ + overall summary of your experiments in one concise list that includes relevant parameters
 included are trivial experiment details that are not used in dicscussion (full algorithm parameters, full WEKA output, etc.)
  ambiguous or unclear setting and/or results
(30
points) Good analysis of the results of the experiments INCLUDING
discussion of evaluations statistics returned by the WEKA systems
(accuracy and/or errors) and discussion of particularly interesting
results. This applies to all the experiments done for this project.
    results rewritten in prose used in place of discussion and analysis
 looking only at accuracy/error measures
TOTAL: 35 points: MethodOriented Goals
(35
points) ran a good number of experiments to get familiar with the data
mining methods in this project (see METHODS) using a variety of
datasets (see DATASETS)
+explore variation in nontrivial method parameters (debug/verbose mode IS a trivial parameter)
 looking only at accuracy/error measures resulting from parameter changes
TOTAL: 40 points: DatasetOriented Goals
(20
points) 5 or more specific questions/conjectures about the dataset
domain that you aim to answer/validate with your experiments
questions not specific to the dataset
(20
points) Comparison of the results across all the methods used in this
project (see METHODS). That is, compare the results of one method on
this dataset to the results of the other methods on the same dataset.
 comparing only accuracy/error measures
TOTAL: 20 points: Holistic Goals
(20 points) Argumentation of weaknesses and/or strengths of each of the
methods on this dataset, and argumentation of which method should be
preferred for this dataset and why.
TOTAL: 60 points: Extra Credit
You may want to modify the Weka code so that it outputs:
(15 points) the knearest neighbors of the each test instance
(15 points for each of the 3 clustering methods) the specific training instances assigned to each cluster