HOMEWORK ASSIGNMENT
See Piotr Mardziel's Homework Solutions.
Consider the dataset below.
This dataset is an adaptation of the
World Happiness Dataset.
@relation world_happiness
% - Life Expectancy from UN Human Development Report (2003)
% - GDP per capita from figure published by the CIA (2006), figure in US$.
% - Access to secondary education rating from UNESCO (2002)
% - SWL (satisfaction with life) index calculated from data published
% by New Economics Foundation (2006).
@attribute country string
@attribute continent {Americas,Africa,Asia,Europe}
@attribute life-expectancy numeric
@attribute GDP-per-capita numeric
@attribute access-to-education-score numeric
@attribute SWL-index numeric
@data
Switzerland, Europe, 80.5, 32.3, 99.9, 273.33
Canada, Americas, 80, 34, 102.6, 253.33
USA, Americas, 77.4, 41.8, 94.6, 246.67
Germany, Europe, 78.7, 30.4, 99, 240
Mexico, Americas, 75.1, 10, 73.4, 230
France, Europe, 79.5, 29.9, 108.7, 220
Thailand, Asia, 70, 8.3, 79, 216.67
Brazil, Americas, 70.5, 8.4, 103.2, 210
Japan, Asia, 82, 31.5, 102.1, 206.67
India, Asia, 63.3, 3.3, 49.9, 180
Ethiopia, Africa, 47.6, 0.9, 5.2, 156.67
Russia, Asia, 65.3, 11.1, 81.9, 143.3
Note that the attribute country identifies each data instance uniquely
and as such will be disregarded in our analysis. It is provided just for context.
- (40 points + 10 extra credit points) Instance-Based Learning
Assume that we want to predict the SWL-index attribute (prediction target)
from the other predicting attributes
continent, life-expectancy, GDP-per-capita, access-to-education-score.
- (10 points) Compute the 4 nearest neighbors of each of the three test instances below.
For this part, use the Euclidean distance without
modifying the numeric predicting attributes and without using attribute weights.
Also, use the nominal attribute continent as given:
two different values of this attribute (e.g., "Africa" and "Asia")
will contribute 1 towards the Euclidean distance, and
two identical values of this attribute (e.g., "Africa" and "Africa")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report.
Costa_Rica, Americas, 78.2, 11.1, 50.9, ?
United_Kingdom, Europe, 78.4, 30.3, 157.2, ?
South_Africa, Africa, 48.4, 12, 90.2, ?
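As a sanity check for the hand computation, the distance rule above can be sketched in Python. This is a sketch only, not a substitute for showing your work; the tuple layout (continent, life-expectancy, GDP-per-capita, access-to-education-score) is a convention of this sketch.

```python
# Sketch of the 4-NN retrieval described above: the nominal 'continent'
# contributes 0/1 to the squared distance; numeric attributes are used
# as given (no normalization, no attribute weights).
import math

def distance(a, b):
    # a, b: (continent, life-expectancy, GDP-per-capita, education-score)
    nominal = 0 if a[0] == b[0] else 1
    numeric = sum((x - y) ** 2 for x, y in zip(a[1:], b[1:]))
    return math.sqrt(nominal + numeric)

def k_nearest(test, training, k=4):
    # Return the k training instances closest to `test`, with distances.
    ranked = sorted(training, key=lambda row: distance(test, row[1]))
    return [(name, round(distance(test, feats), 2)) for name, feats in ranked[:k]]

training = [
    ("Switzerland", ("Europe",   80.5, 32.3,  99.9)),
    ("Canada",      ("Americas", 80.0, 34.0, 102.6)),
    ("USA",         ("Americas", 77.4, 41.8,  94.6)),
    ("Germany",     ("Europe",   78.7, 30.4,  99.0)),
    ("Mexico",      ("Americas", 75.1, 10.0,  73.4)),
    ("France",      ("Europe",   79.5, 29.9, 108.7)),
    ("Thailand",    ("Asia",     70.0,  8.3,  79.0)),
    ("Brazil",      ("Americas", 70.5,  8.4, 103.2)),
    ("Japan",       ("Asia",     82.0, 31.5, 102.1)),
    ("India",       ("Asia",     63.3,  3.3,  49.9)),
    ("Ethiopia",    ("Africa",   47.6,  0.9,   5.2)),
    ("Russia",      ("Asia",     65.3, 11.1,  81.9)),
]

print(k_nearest(("Americas", 78.2, 11.1, 50.9), training))  # Costa_Rica
```

The same call with the United_Kingdom and South_Africa feature tuples gives the other two neighbor lists.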
- (5 points) Predict the SWL-index attribute of the test instances using their
4 nearest neighbors without any distance weighting. (Note that now the SWL-index
value of each instance is provided so that you can compute error values).
Show your work in your report.
4-NN PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________
South_Africa, Africa, 48.4, 12, 90.2, 190 __________
4-NN ERROR
root mean-square error (see p. 178) __________
mean absolute error (see p. 178) __________
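The two error measures in the table can be computed as follows (a sketch; `actual` holds the given SWL-index values of the three test instances, and `predicted` would hold your 4-NN predictions):

```python
# Root mean-square error and mean absolute error over the test instances.
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [250, 236.67, 190]  # given SWL-index values of the test instances
```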
- (5 points) Predict the target attribute SWL-index of each test instance
using its 4 nearest neighbors weighted by the inverse of the distance.
That is, if the SWL-index values of the 4 nearest-neighbors of a test instance
are m1, m2, m3, and m4 and their respective distances to the test
instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors
are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and
the predicted value for the test instance is:
(w1*m1) + (w2*m2) + (w3*m3) + (w4*m4)
_______________________________________
w1 + w2 + w3 + w4
Show your work in your report.
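The weighted average above can be sketched as follows. Note that if some distance d_i is 0 the weight 1/d_i is undefined; returning that neighbor's value directly is a common convention and an assumption of this sketch.

```python
def weighted_prediction(neighbors):
    # neighbors: list of (m, d) pairs -- the SWL-index value and distance
    # of each of the 4 nearest neighbors of a test instance.
    for m, d in neighbors:
        if d == 0:  # exact match: 1/d undefined; predict that value (convention)
            return m
    weights = [1.0 / d for _, d in neighbors]   # w_i = 1/d_i
    values = [m for m, _ in neighbors]
    # (w1*m1 + w2*m2 + w3*m3 + w4*m4) / (w1 + w2 + w3 + w4)
    return sum(w * m for w, m in zip(weights, values)) / sum(weights)
```

With equal distances this reduces to the unweighted mean of the previous part, which is a quick way to check the implementation.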
4-NN PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________
South_Africa, Africa, 48.4, 12, 90.2, 190 __________
4-NN ERROR
root mean-square error (see p. 178) __________
mean absolute error (see p. 178) __________
- Now preprocess each of the predicting numeric attributes
(life-expectancy, GDP-per-capita, access-to-education-score)
so that they range from 0 to 1. That is, for each numeric attribute (except for
SWL-index), replace each value with (value - min. value)/(max. value - min. value),
where min. value and max. value are the minimum and maximum values of that
attribute respectively.
- (5 points)
Show your work in your report and show the resulting dataset.
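The min-max scaling above can be sketched per attribute as follows. Whether the test instances are scaled with the training min/max is not stated above and is left to your interpretation; this sketch only scales the 12 training values.

```python
def min_max(values):
    # (value - min) / (max - min), as defined above
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# life-expectancy column of the 12 training instances
life_expectancy = [80.5, 80, 77.4, 78.7, 75.1, 79.5, 70, 70.5, 82, 63.3, 47.6, 65.3]
print([round(v, 3) for v in min_max(life_expectancy)])
```

Apply the same function to the GDP-per-capita and access-to-education-score columns to obtain the full normalized dataset.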
- (10 points) Compute the 4 nearest neighbors of each of the three test instances below
with respect to this normalized dataset.
For this part, use the Euclidean distance without attribute weights.
Also, use the nominal attribute continent as given:
two different values of this attribute (e.g., "Africa" and "Asia")
will contribute 1 towards the Euclidean distance, and
two identical values of this attribute (e.g., "Africa" and "Africa")
will contribute 0 towards the Euclidean distance of two data instances.
Show your work in your report.
Costa_Rica, Americas, 78.2, 11.1, 50.9, ?
United_Kingdom, Europe, 78.4, 30.3, 157.2, ?
South_Africa, Africa, 48.4, 12, 90.2, ?
- (5 points) Predict the SWL-index attribute of the test instances using their
4 nearest neighbors without any distance weighting. (Note that now the SWL-index
value of each instance is provided so that you can compute error values).
Show your work in your report.
4-NN PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________
South_Africa, Africa, 48.4, 12, 90.2, 190 __________
4-NN ERROR
root mean-square error (see p. 178) __________
mean absolute error (see p. 178) __________
- (5 points) Predict the target attribute SWL-index of each test instance
using its 4 nearest neighbors weighted by the inverse of the distance.
That is, if the SWL-index values of the 4 nearest-neighbors of a test instance
are m1, m2, m3, and m4 and their respective distances to the test
instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors
are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and
the predicted value for the test instance is:
(w1*m1) + (w2*m2) + (w3*m3) + (w4*m4)
_______________________________________
w1 + w2 + w3 + w4
Show your work in your report.
4-NN PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250 __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67 __________
South_Africa, Africa, 48.4, 12, 90.2, 190 __________
4-NN ERROR
root mean-square error (see p. 178) __________
mean absolute error (see p. 178) __________
- (5 points) Which of the methods above produced the best results? Explain.
- (40 points) Clustering - Simple K-means
- Assume that we want to cluster the instances in the dataset world-happiness
above into 3 clusters.
Use the Simple K-means algorithm with Euclidean distance without
modifying the numeric predicting attributes and without using attribute weights.
Also, use the nominal attribute continent as given:
two different values of this attribute (e.g., "Africa" and "Asia")
will contribute 1 towards the Euclidean distance, and
two identical values of this attribute (e.g., "Africa" and "Africa")
will contribute 0 towards the Euclidean distance of two data instances.
Assume that the 3 randomly selected initial centroids are:
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67
South_Africa, Africa, 48.4, 12, 90.2, 190
Show the first 2 iterations of the Simple K-means clustering algorithm.
That is:
- (5 points) Assign data instances to the clusters represented by the 3 centroids.
- (5 points) Compute new centroids for the 3 resulting clusters.
- (5 points) Assign data instances to the clusters represented by the 3 new centroids.
- (5 points) Compute new centroids for these 3 new resulting clusters.
Show your work in your report.
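The two steps of each iteration (assign, then recompute centroids) can be sketched as follows. How the nominal component of a centroid is formed is an assumption here: this sketch uses the mode (most frequent value), which matches WEKA's SimpleKMeans behavior, but check your course's convention before relying on it.

```python
# One assignment + centroid-update step of simple k-means, using the 0/1
# nominal distance rule above. Instances and centroids are tuples whose
# first component is the continent and whose remaining components are
# the numeric attributes.
import math
from statistics import mode

def distance(a, b):
    nominal = 0 if a[0] == b[0] else 1
    return math.sqrt(nominal + sum((x - y) ** 2 for x, y in zip(a[1:], b[1:])))

def assign(instances, centroids):
    # Put each instance in the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for inst in instances:
        nearest = min(range(len(centroids)), key=lambda j: distance(inst, centroids[j]))
        clusters[nearest].append(inst)
    return clusters

def new_centroid(cluster):
    # Numeric components: mean; nominal component: mode (assumed convention).
    continents = [inst[0] for inst in cluster]
    numeric_means = [sum(col) / len(col) for col in zip(*(inst[1:] for inst in cluster))]
    return (mode(continents), *numeric_means)
```

Running `assign` with the 3 given centroids, recomputing each centroid with `new_centroid`, and repeating once more reproduces the two requested iterations.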
- Repeat the process above, but now starting from a normalized dataset. That is,
assume that we want to cluster the instances in the dataset world-happiness
above into 3 clusters
using the Simple K-means algorithm with Euclidean distance without attribute weights,
but preprocessing each of the numeric attributes
so that they range from 0 to 1. That is, for each numeric attribute,
replace each value with (value - min. value)/(max. value - min. value),
where min. value and max. value are the minimum and maximum values of that
attribute respectively.
Use the nominal attribute continent as given:
two different values of this attribute (e.g., "Africa" and "Asia")
will contribute 1 towards the Euclidean distance, and
two identical values of this attribute (e.g., "Africa" and "Africa")
will contribute 0 towards the Euclidean distance of two data instances.
Assume that the 3 randomly selected initial centroids are:
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67
South_Africa, Africa, 48.4, 12, 90.2, 190
Show the first 2 iterations of the Simple K-means clustering algorithm.
That is:
- (5 points) Assign data instances to the clusters represented by the 3 centroids.
- (5 points) Compute new centroids for the 3 resulting clusters.
- (5 points) Assign data instances to the clusters represented by the 3 new centroids.
- (5 points) Compute new centroids for these 3 new resulting clusters.
Show your work in your report.
- (20 points) Clustering - Hierarchical Clustering
Assume that we want to cluster a dataset using
the COBWEB/CLASSIT clustering algorithm.
Assume that we have followed the COBWEB/CLASSIT algorithm and so far we
have created a partial clustering containing just the first 3 instances.
The following tree shows the current partial clustering, where the
numbers in parentheses (1), (2), (3) represent the 1st, 2nd, and 3rd data instances
respectively and "0" denotes just an internal node. Note that in this partial
clustering, instances (1) and (3) are clustered together and instance (2) forms
a cluster on its own.
      0
     / \
    0   (2)
   / \
 (1) (3)
Your job is to describe all the alternatives that are considered by the
COBWEB/CLASSIT algorithm when adding instance (4) to this clustering tree.
- (2 points)
How many such alternatives are considered by the COBWEB/CLASSIT algorithm?
- (15 points)
List each one of those alternatives and show what the resulting
clustering tree would be for each of those alternatives.
- (3 points)
Explain how the COBWEB/CLASSIT algorithm would choose which of those alternatives
to select as the best one. Note: you don't need to compute the category utility
value of each of those alternatives. You just need to explain how the selection
of the best alternative is made by the algorithm.
GRADING CRITERIA
(TOTAL: 100 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.
(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:
NOTES:
Along with the scoring criteria, lists of suggestions are provided. Positive
suggestions (marked with a green "+") include aspects of a project that
have a positive impact on the scoring criteria. Negative suggestions
(marked with a red "-") have a negative impact. Finally, suggestions
marked with "!" result in extra points.
+ this has a positive impact
+ + this has a greater positive impact
+ + + this has a significant positive impact
+ + + + this has a very significant positive impact
+ + + + + this has a huge positive impact
- this has a negative impact
- - this has a greater negative impact
- - - this has a significant negative impact
- - - - this has a very significant negative impact
- - - - - this has a huge negative impact
METHODS for this project are:
IBk,
k-nearest neighbors,
locally weighted regression (LWL w/ linear regression), and
simple k-means.
(OPTIONAL) METHODS for this project are:
COBWEB/CLASSIT and
EM
DATASETS for this project are:
dataset1: world happiness dataset
TOTAL: 15 points: Class Presentation
(15 points) How well your oral presentation concisely summarized the
results of the project, and how focused your presentation was on the more
creative/interesting/useful of your experiments and results. This grade
is given individually to each team member.
TOTAL: 185 points: Project 4 Report
TOTAL: 20 points: Algorithmic Description of the Code
(5 points) Description of the algorithms underlying the WEKA filters used.
If you have already described some used filters in previous projects
then copying and pasting your work would be a good idea. If issues were
raised about the old descriptions, consider revising them.
(TOTAL:
15 points) Description of the ALGORITHMS underlying the data mining
methods used in this project (see METHODS). This includes:
(10 points) Descriptions of model construction
(5 points) Descriptions of instance classification/prediction
+(very) high-level pseudo code with explanation and justification
- -raw pseudo code with no justification or reasoning behind the steps involved
- - - - -structural description of WEKA code, i.e. classes/members/methods
TOTAL: 5 points: Slides
(5 points) How well do the slides concisely summarize the results of the
project? We suggest you summarize the setting of your experiments and
their results in tabular form.
+ (potential) unanswered questions presented to class
+ visual aids
+ +main ideas and observations summarized
- nothing but accuracy measures
- lack of visual aids
TOTAL: 5 points: Pre-Processing of the Datasets
(5 points) Preprocessing attributes as needed and dealing with missing values appropriately
TOTAL: 155 points: Experiments Goals
(30
points) Good description of the experiment setting and the results.
This applies to all the experiments done for this project.
+ + + motivation for experiments described
+ specify testing method used (cross-validation / % split / etc.)
+ + overall summary of your experiments in one concise list that includes relevant parameters
- included are trivial experiment details that are not used in discussion (full algorithm parameters, full WEKA output, etc.)
- - ambiguous or unclear setting and/or results
(30
points) Good analysis of the results of the experiments INCLUDING
discussion of evaluations statistics returned by the WEKA systems
(accuracy and/or errors) and discussion of particularly interesting
results. This applies to all the experiments done for this project.
- - - - results restated in prose in place of discussion and analysis
- -looking only at accuracy/error measures
TOTAL: 35 points: Method-Oriented Goals
(35
points) ran a good number of experiments to get familiar with the data
mining methods in this project (see METHODS) using a variety of
datasets (see DATASETS)
+explore variation in non-trivial method parameters (debug/verbose mode IS a trivial parameter)
- -looking only at accuracy/error measures resulting from parameter changes
TOTAL: 40 points: Dataset-Oriented Goals
(20
points) 5 or more specific questions/conjectures about the dataset
domain that you aim to answer/validate with your experiments
-questions not specific to the dataset
(20
points) Comparison of the results across all the methods used in this
project (see METHODS). That is, compare the results of one method on
this dataset to the results of the other methods on the same dataset.
- -comparing only accuracy/error measures
TOTAL: 20 points: Holistic Goals
(20 points) Argumentation of weaknesses and/or strengths of each of the
methods on this dataset, and argumentation of which method should be
preferred for this dataset and why.
TOTAL: 60 points: Extra Credit
You may want to modify the WEKA code so that it outputs:
(15 points) the k-nearest neighbors of each test instance
(15 points for each of the 3 clustering methods) the specific training instances assigned to each cluster