WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006 
Homework and Project 4: Instance Based Learning and Clustering

PROF. CAROLINA RUIZ 

DUE DATE: ------------------------------------------


HOMEWORK AND PROJECT DESCRIPTION

The purpose of this project is to construct the most accurate models of different aspects of the dataset under consideration using the following data mining techniques: instance based learning (k-nearest neighbors and locally weighted regression) and clustering (k-means and hierarchical clustering). Also, to gain a close understanding of how those methods work, this project includes following those methods by hand on a toy dataset.

HOMEWORK ASSIGNMENT

See Piotr Mardziel's Homework Solutions.

Consider the dataset below. This dataset is an adaptation of the World Happiness Dataset.

@relation world_happiness

% - Life Expectancy from UN Human Development Report (2003)
% - GDP per capita from figure published by the CIA (2006), figure in US$.
% - Access to secondary education rating from UNESCO (2002)
% - SWL (satisfaction with life) index calculated from data published 
%   by New Economics Foundation (2006).

@attribute country string
@attribute continent {Americas,Africa,Asia,Europe}
@attribute life-expectancy numeric
@attribute GDP-per-capita numeric
@attribute access-to-education-score numeric
@attribute SWL-index numeric

@data
Switzerland,	Europe,		80.5,	32.3,	99.9,	273.33
Canada,		Americas,	80,	34,	102.6,	253.33
USA,		Americas,	77.4,	41.8,	94.6,	246.67
Germany,	Europe,		78.7,	30.4,	99,	240
Mexico,		Americas,	75.1,	10,	73.4,	230
France,		Europe,		79.5,	29.9,	108.7,	220
Thailand,	Asia,		70,	8.3,	79,	216.67
Brazil,		Americas,	70.5,	8.4,	103.2,	210
Japan,		Asia,		82,	31.5,	102.1,	206.67
India,		Asia,		63.3,	3.3,	49.9,	180
Ethiopia,	Africa,		47.6,	0.9,	5.2,	156.67
Russia,		Asia,		65.3,	11.1,	81.9,	143.3

Note that the attribute country identifies each data instance uniquely and as such will be disregarded in our analysis. It is provided just for context.

  1. (40 points + 10 extra credits) Instance Based Learning

    Assume that we want to predict the SWL-index attribute (prediction target) from the other predicting attributes continent, life-expectancy, GDP-per-capita, access-to-education-score.

    1. (10 points) Compute the 4 nearest neighbors of each of the three test instances below. For this part, use the Euclidean distance without modifying the numeric predicting attributes and without using attribute weights. Also, use the nominal attribute continent as given: two different values of this attribute (e.g., "Africa" and "Asia") contribute 1 towards the Euclidean distance between two data instances, and two identical values (e.g., "Africa" and "Africa") contribute 0.

      Show your work in your report.

      
      Costa_Rica,	Americas, 78.2,	11.1,	50.9,	?
      
      United_Kingdom,	Europe,   78.4,	30.3,	157.2,	?
      
      South_Africa,	Africa,   48.4,	12,	90.2,	?
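
      As an illustration of the distance measure just described, here is a minimal sketch in plain Java (not Weka code; the class and method names are ours, and the attribute values are typed in by hand from the dataset above):

        // Euclidean distance with the mixed attribute types described above:
        // the nominal attribute continent contributes 0 or 1, and each numeric
        // attribute contributes its raw squared difference (no normalization,
        // no attribute weights).
        public class ToyDistance {

            static double distance(String continent1, double[] numeric1,
                                   String continent2, double[] numeric2) {
                double sum = continent1.equals(continent2) ? 0.0 : 1.0;
                for (int i = 0; i < numeric1.length; i++) {
                    double diff = numeric1[i] - numeric2[i];
                    sum += diff * diff;
                }
                return Math.sqrt(sum);
            }

            public static void main(String[] args) {
                // Costa_Rica vs. Switzerland; the numeric attributes are
                // life-expectancy, GDP-per-capita, access-to-education-score.
                System.out.println(distance("Americas", new double[]{78.2, 11.1, 50.9},
                                            "Europe",   new double[]{80.5, 32.3, 99.9}));
            }
        }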
      
      

      1. (5 points) Predict the SWL-index attribute of the test instances using their 4 nearest neighbors without any distance weighting. (Note that now the SWL-index value of each instance is provided so that you can compute error values).

        Show your work in your report.

                                                                 4-NN          
                                                                 PREDICTION   
        
        Costa_Rica,	Americas, 78.2,	11.1,	50.9,	250	__________   
        
        United_Kingdom,	Europe,   78.4,	30.3,	157.2,	236.67	__________  
        
        South_Africa,	Africa,   48.4,	12,	90.2,	190	__________ 
        
        
                                                                 4-NN        
                                                                 ERROR      
        
        root mean-square error (see p. 178)                      __________
        
        mean absolute error (see p. 178)                         __________
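
        The unweighted prediction and the two error measures can be sketched as follows (plain Java, assuming the 4 nearest neighbors have already been found; the helper names are ours, not Weka's):

          // Unweighted k-NN prediction: the plain average of the neighbors'
          // SWL-index values. RMSE and MAE are then computed over the three
          // test instances.
          public class KnnErrors {

              static double predict(double[] neighborValues) {
                  double sum = 0;
                  for (double m : neighborValues) sum += m;
                  return sum / neighborValues.length;
              }

              static double rootMeanSquareError(double[] actual, double[] predicted) {
                  double sum = 0;
                  for (int i = 0; i < actual.length; i++) {
                      double e = actual[i] - predicted[i];
                      sum += e * e;
                  }
                  return Math.sqrt(sum / actual.length);
              }

              static double meanAbsoluteError(double[] actual, double[] predicted) {
                  double sum = 0;
                  for (int i = 0; i < actual.length; i++)
                      sum += Math.abs(actual[i] - predicted[i]);
                  return sum / actual.length;
              }
          }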
        
        

      2. (5 points) Predict the target attribute SWL-index of each test instance using its 4 nearest neighbors weighted by the inverse of the distance. That is, if the SWL-index values of the 4 nearest-neighbors of a test instance are m1, m2, m3, and m4 and their respective distances to the test instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and the predicted value for the test instance is:
               
                 (w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) 
                 _______________________________________
                           w1 + w2 + w3 + w4 
              
             

        Show your work in your report.

                                                                 4-NN          
                                                                 PREDICTION   
        
        Costa_Rica,	Americas, 78.2,	11.1,	50.9,	250	__________ 
        
        United_Kingdom,	Europe,   78.4,	30.3,	157.2,	236.67	__________  
        
        South_Africa,	Africa,   48.4,	12,	90.2,	190	__________ 
        
        
                                                                 4-NN        
                                                                 ERROR      
        
        root mean-square error (see p. 178)                      __________
        
        mean absolute error (see p. 178)                         __________
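
        The weighted prediction above can be sketched as one more method for the KnnErrors class given earlier (again an illustration, not Weka code):

          // Inverse-distance-weighted k-NN prediction, exactly as in the
          // formula above. If a neighbor coincides with the test instance
          // (distance 0), its value is returned directly to avoid dividing
          // by zero.
          static double weightedPredict(double[] neighborValues, double[] distances) {
              double numerator = 0, denominator = 0;
              for (int i = 0; i < neighborValues.length; i++) {
                  if (distances[i] == 0) return neighborValues[i];
                  double w = 1.0 / distances[i];
                  numerator += w * neighborValues[i];
                  denominator += w;
              }
              return numerator / denominator;
          }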
        
        

    2. Now preprocess each of the predicting numeric attributes (life-expectancy, GDP-per-capita, access-to-education-score) so that they range from 0 to 1. That is, for each numeric attribute (except for SWL-index), replace each value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively.

      1. (5 points) Show your work in your report and show the resulting dataset.
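
        A minimal sketch of this min-max rescaling, applied to one numeric column at a time (plain Java; the SWL-index column is left unchanged):

          // Rescale a numeric attribute to [0,1] using
          // (value - min) / (max - min).
          static double[] minMaxNormalize(double[] column) {
              double min = column[0], max = column[0];
              for (double v : column) {
                  if (v < min) min = v;
                  if (v > max) max = v;
              }
              double[] scaled = new double[column.length];
              for (int i = 0; i < column.length; i++)
                  scaled[i] = (column[i] - min) / (max - min);
              return scaled;
          }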

      2. (10 points) Compute the 4 nearest neighbors of each of the three test instances below with respect to this normalized dataset. For this part, use the Euclidean distance without attribute weights. Also, use the nominal attribute continent as given: two different values of this attribute (e.g., "Africa" and "Asia") contribute 1 towards the Euclidean distance between two data instances, and two identical values (e.g., "Africa" and "Africa") contribute 0.

        Show your work in your report.

        
        Costa_Rica,	Americas, 78.2,	11.1,	50.9,	?
        
        United_Kingdom,	Europe,   78.4,	30.3,	157.2,	?
        
        South_Africa,	Africa,   48.4,	12,	90.2,	?
        
        

        1. (5 points) Predict the SWL-index attribute of the test instances using their 4 nearest neighbors without any distance weighting. (Note that now the SWL-index value of each instance is provided so that you can compute error values).

          Show your work in your report.

                                                                   4-NN          
                                                                   PREDICTION   
          
          Costa_Rica,	Americas, 78.2,	11.1,	50.9,	250	__________ 
          
          United_Kingdom,	Europe,   78.4,	30.3,	157.2,	236.67	__________
          
          South_Africa,	Africa,   48.4,	12,	90.2,	190	__________ 
          
          
                                                                   4-NN        
                                                                   ERROR      
          
          root mean-square error (see p. 178)                      __________
          
          mean absolute error (see p. 178)                         __________
          
          

        2. (5 points) Predict the target attribute SWL-index of each test instance using its 4 nearest neighbors weighted by the inverse of the distance. That is, if the SWL-index values of the 4 nearest-neighbors of a test instance are m1, m2, m3, and m4 and their respective distances to the test instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and the predicted value for the test instance is:
                 
                   (w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) 
                   _______________________________________
                             w1 + w2 + w3 + w4 
                
               

          Show your work in your report.

                                                                   4-NN          
                                                                   PREDICTION   
          
          Costa_Rica,	Americas, 78.2,	11.1,	50.9,	250	__________ 
          
          United_Kingdom,	Europe,   78.4,	30.3,	157.2,	236.67	__________  
          
          South_Africa,	Africa,   48.4,	12,	90.2,	190	__________ 
          
          
                                                                   4-NN        
                                                                   ERROR      
          
          root mean-square error (see p. 178)                      __________
          
          mean absolute error (see p. 178)                         __________
          
          

      3. (5 points) Which of the methods above produced the best results? Explain.

  2. (40 points) Clustering - Simple K-means

    1. Assume that we want to cluster the instances in the dataset world-happiness above into 3 clusters. Use the Simple K-means algorithm with the Euclidean distance, without modifying the numeric predicting attributes and without using attribute weights. Also, use the nominal attribute continent as given: two different values of this attribute (e.g., "Africa" and "Asia") contribute 1 towards the Euclidean distance between two data instances, and two identical values (e.g., "Africa" and "Africa") contribute 0.

      Assume that the 3 randomly selected initial centroids are:

      
      Costa_Rica,	Americas, 78.2,	11.1,	50.9,	250	
      
      United_Kingdom,	Europe,   78.4,	30.3,	157.2,	236.67
      
      South_Africa,	Africa,   48.4,	12,	90.2,	190
      
      
      Show the first 2 iterations of the Simple K-means clustering algorithm. That is:

      1. (5 points) Assign data instances to the clusters represented by the 3 centroids.

      2. (5 points) Compute new centroids for the 3 resulting clusters.

      3. (5 points) Assign data instances to the clusters represented by the 3 new centroids.

      4. (5 points) Compute new centroids for these 3 new resulting clusters.

      Show your work in your report.
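
      One full iteration of the algorithm (steps 1-2 above) can be sketched as follows for the numeric attributes (plain Java, names ours; the nominal attribute continent contributes 0/1 to the distance as before, and its centroid value is taken as the cluster's most frequent continent, which is the convention used by Weka's SimpleKMeans):

        // One Simple K-means iteration: assign every instance to its nearest
        // centroid, then recompute each centroid as the attribute-wise mean
        // of its cluster.
        static void kmeansIteration(double[][] data, double[][] centroids, int[] assignment) {
            // Step 1: assignment.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < centroids.length; c++) {
                    double sum = 0;
                    for (int a = 0; a < data[i].length; a++) {
                        double diff = data[i][a] - centroids[c][a];
                        sum += diff * diff;
                    }
                    // squared distance suffices for comparison; sqrt not needed
                    if (sum < bestDist) { bestDist = sum; best = c; }
                }
                assignment[i] = best;
            }
            // Step 2: recompute centroids.
            for (int c = 0; c < centroids.length; c++) {
                double[] mean = new double[centroids[c].length];
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assignment[i] != c) continue;
                    count++;
                    for (int a = 0; a < mean.length; a++) mean[a] += data[i][a];
                }
                if (count > 0)
                    for (int a = 0; a < mean.length; a++) centroids[c][a] = mean[a] / count;
            }
        }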

    2. Repeat the process above but now starting with a normalized dataset. That is, assume that we want to cluster the instances in the dataset world-happiness above into 3 clusters using the Simple K-means algorithm with the Euclidean distance and no attribute weights, but preprocessing each of the numeric attributes so that they range from 0 to 1. That is, for each numeric attribute, replace each value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively. Use the nominal attribute continent as given: two different values of this attribute (e.g., "Africa" and "Asia") contribute 1 towards the Euclidean distance between two data instances, and two identical values (e.g., "Africa" and "Africa") contribute 0.

      Assume that the 3 randomly selected initial centroids are:

      
      Costa_Rica,	Americas, 78.2,	11.1,	50.9,	250	
      
      United_Kingdom,	Europe,   78.4,	30.3,	157.2,	236.67
      
      South_Africa,	Africa,   48.4,	12,	90.2,	190
      
      
      Show the first 2 iterations of the Simple K-means clustering algorithm. That is:

      1. (5 points) Assign data instances to the clusters represented by the 3 centroids.

      2. (5 points) Compute new centroids for the 3 resulting clusters.

      3. (5 points) Assign data instances to the clusters represented by the 3 new centroids.

      4. (5 points) Compute new centroids for these 3 new resulting clusters.

      Show your work in your report.


  3. (20 points) Clustering - Hierarchical Clustering
    Assume that we want to cluster a dataset using the COBWEB/CLASSIT clustering algorithm.

    Assume that we have followed the COBWEB/CLASSIT algorithm and so far we have created a partial clustering containing just the first 3 instances. The following tree shows the current partial clustering, where the numbers in parentheses (1), (2), (3) represent the 1st, 2nd, and 3rd data instances respectively, and "0" denotes an internal node. Note that in this partial clustering, instances (1) and (3) are clustered together and instance (2) forms a cluster on its own.

                                 0
                              /     \
                             0      (2)
                           /   \
                          (1) (3)
    
    Your job is to describe all the alternatives that are considered by the COBWEB/CLASSIT algorithm when adding instance (4) to this clustering tree.

    1. (2 points) How many such alternatives are considered by the COBWEB/CLASSIT algorithm?

    2. (15 points) List each one of those alternatives and show what the resulting clustering tree would be for each of those alternatives.

    3. (3 points) Explain how the COBWEB/CLASSIT algorithm would choose which of those alternatives to select as the best one. Note: you don't need to compute the category utility value of each of those alternatives. You just need to explain how the selection of the best alternative is made by the algorithm.
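
    For reference, the measure that guides this choice is the category utility of the resulting partition {C1, ..., Ck}. In its standard form for nominal attributes ai with values vij (Fisher, 1987), it is:

         sum_c P(Cc) * sum_i sum_j [ P(ai = vij | Cc)^2 - P(ai = vij)^2 ]
         _________________________________________________________________
                                        k

    CLASSIT adapts this measure to numeric attributes by replacing the inner sums with terms based on the inverse of each attribute's standard deviation within each cluster and in the whole dataset. The algorithm evaluates this measure for each alternative and selects the alternative with the highest value.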

PROJECT ASSIGNMENT

  1. Dataset: Consider the World Happiness dataset introduced in the homework assignment above (see DATASETS under the grading criteria below).

    Meaningful graphical representation of the results (i.e. the k-nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters, in the case of clustering) for the chosen dataset will be rewarded with extra credit.

  2. Readings:

    • Textbook: Read in great detail the following Sections from your textbook:
      • Instance Based Learning: Sections 4.7, 6.4.
      • Clustering: Section 6.6

    • Weka Code: Read the code of the relevant techniques implemented in the Weka system. Some of those techniques are enumerated below:
      • Instance Based Learning:
        • IBk: k-nearest neighbors
        • LWL: locally weighted regression (Weka's locally weighted learning scheme)
      • Clustering:
        • Simple k-means
        • OPTIONAL (extra points): Hierarchical Clustering: COBWEB/CLASSIT
        • OPTIONAL (extra points): Clustering using Expectation Maximization: EM
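
    As a starting point for reading and experimenting with this code, the following sketch shows how these schemes are typically invoked programmatically (class and method names follow recent Weka releases and may differ slightly in older distributions; the ARFF file name is an assumption):

      import weka.classifiers.lazy.IBk;
      import weka.clusterers.SimpleKMeans;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;

      public class RunSchemes {
          public static void main(String[] args) throws Exception {
              Instances data = DataSource.read("world_happiness.arff");
              data.setClassIndex(data.numAttributes() - 1);   // SWL-index is the prediction target

              IBk knn = new IBk(4);                           // 4-nearest-neighbor predictor
              knn.buildClassifier(data);
              System.out.println(knn.classifyInstance(data.instance(0)));

              Instances unlabeled = new Instances(data);
              unlabeled.setClassIndex(-1);                    // clusterers take no class attribute
              SimpleKMeans km = new SimpleKMeans();
              km.setNumClusters(3);
              km.buildClusterer(unlabeled);
              System.out.println(km.clusterInstance(unlabeled.instance(0)));
          }
      }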

  3. Experiments: Use the "Explorer" option of the Weka system to perform the following operations:

    • A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, etc. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).
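
      For instance, replacing missing values programmatically looks roughly like this (a sketch; in recent Weka releases the filter is weka.filters.unsupervised.attribute.ReplaceMissingValues, the successor of the ReplaceMissingValuesFilter named above, and the file name is an assumption):

        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;
        import weka.filters.Filter;
        import weka.filters.unsupervised.attribute.ReplaceMissingValues;

        public class Preprocess {
            public static void main(String[] args) throws Exception {
                Instances data = DataSource.read("world_happiness.arff");

                // Replaces each "?" with the attribute's mean (numeric
                // attributes) or mode (nominal attributes).
                ReplaceMissingValues filter = new ReplaceMissingValues();
                filter.setInputFormat(data);                      // declare the input format
                Instances clean = Filter.useFilter(data, filter); // means/modes computed over this batch

                System.out.println(clean.numInstances() + " instances after filtering");
            }
        }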

    • You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your models, the better.

    • Before you start running experiments, look at the raw data in detail. Figure out 5 or more specific, interesting questions that you want to answer with your experiments. These questions may be phrased as conjectures that you want to confirm/refute with your experimental results. Sample questions include: What percentage of the k-nearest neighbors of an instance belong to the same continent as the instance? When clustering is performed, do countries in the same continent tend to be grouped together? Do properties of countries on the same continent tend to be similar?

    • Design your preprocessing and experiments around answering these 5 questions.

    • Analyze your resulting models in the light of your 5 questions.

    • (Extra Credit) You may want to modify the Weka code so that it outputs:
      • (15 points) the k-nearest neighbors of each test instance
      • (15 points for each of the 3 clustering methods) the specific training instances assigned to each cluster

  4. GROUP PROJECT AND WRITTEN REPORT.

    Your report should contain discussions of all the work that you do for this project. In particular, it should elaborate on the following topics:

    1. Code Description: Describe algorithmically the Weka code of the following 3 methods: IBk: k-nearest neighbors, Locally Weighted Regression (LWL), and Simple K-means (optional for extra points: COBWEB/CLASSIT and EM); and the filters that you use in the project. More precisely, explain the ALGORITHM underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output. PLEASE NOTE THAT WE EXPECT A DETAILED DESCRIPTION OF THE ALGORITHMS USED, NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE.

    2. Experiments: For EACH EXPERIMENT YOU RAN describe:
      • Which of the 5 or more specific questions/conjectures about the dataset domain you aim to answer/validate with your experiment?
      • Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it? Why?
      • Any pre-processing done to the data. That is, did you remove any attributes? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
      • Your system parameters.
      • For each of the methods covered by this project (IBk: k-nearest neighbors, Locally Weighted Regression (LWL), and Simple K-means; optional for extra points: COBWEB/CLASSIT and EM):
        • Results and detailed ANALYSIS of the results of the experiments you ran. 10-fold cross-validation is recommended.
        • Error/accuracy measures of the resulting models.
        • Detailed analysis of the models obtained. Elaborate on interesting characteristics of those models. Do these models answer any of your 5 questions? If so, what's the answer? If not, why not and what modifications to the experiments are needed to answer those questions?
        • Comparison of the results across the methods used in this project.
        • OPTIONAL (20 EXTRA CREDIT POINTS) PART: Meaningful graphical representation of the results (i.e. k-nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters; in the case of clustering).

    3. Summary of Results
      • For each of the datasets, what were the best results obtained in your project? Include (the first 100 lines or so of) this model in your report.
      • Strengths and weaknesses of your project.

  5. GROUP ORAL REPORT. We will discuss the results from the individual projects during the class on Thursday, Dec. 14th. Your oral report should summarize the different sections of your written report as described above. Each group will have about 4 minutes to explain its results and to discuss its project in class. Be prepared!

    PROJECT SUBMISSION AND DUE DATE

    Part II is due Friday, Dec. 8th at 11:50 am. BRING A HARDCOPY OF THE WRITTEN REPORT WITH YOU TO CLASS. In addition, you must submit your report electronically as specified below. Submissions received on Friday, Dec. 8th between 11:51 am and 12:00 midnight will be penalized with 30% off the grade, submissions received on Saturday Dec. 9th between 12:01 am (early morning) and 8:00 am will be penalized with 60% off the grade; and submissions received after Saturday Dec. 9th at 8:00 am won't be accepted.

    Please submit the following files using the myWpi digital drop box:

    1. [lastname]_proj4_report.[ext] containing your individual written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
      • ruiz_proj4_report.pdf

      If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.

    2. [lastname1_lastname2]_proj4_report.[ext] containing your group written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
      • ruiz_smith_proj4_report.pdf if I worked with Joe Smith on this project.

    3. [lastname1_lastname2]_proj4_slides.[ext] (or [lastname]_proj4_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project (both individual and group parts, and classification and association rules).

GRADING CRITERIA

(TOTAL: 100 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.

(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:

NOTES:
Along with the scoring criteria, lists of suggestions are provided. Positive suggestions (marked with a green "+") describe aspects of a project that have a positive impact on the scoring criteria. Negative suggestions (marked with a red "-") have a negative impact. Finally, suggestions marked with "!" can earn extra points.
!          this can earn a few extra points (0-8)

! !        this can earn more extra points (0-16)

! ! !      this can earn even more extra points (0-24)

! ! ! !    this can earn a large amount of extra points (0-32)

! ! ! ! !  this can earn a very large amount of extra points (0-40)

+          this has a positive impact

+ +        this has a greater positive impact

+ + +      this has a significant positive impact

+ + + +    this has a very significant positive impact

+ + + + +  this has a huge positive impact

-          this has a negative impact

- -        this has a greater negative impact

- - -      this has a significant negative impact

- - - -    this has a very significant negative impact

- - - - -  this has a huge negative impact


METHODS for this project are:
IBk (k-nearest neighbors),
locally weighted regression (LWL with linear regression), and
simple k-means.

(OPTIONAL) METHODS for this project are:
COBWEB/CLASSIT and
EM

DATASETS for this project are:
dataset1: world happiness dataset

TOTAL: 15 points: Class Presentation
(15 points) How well your oral presentation summarized concisely the results of the project and how focused your presentation was on the most creative/interesting/useful of your experiments and results. This grade is given individually to each team member.

TOTAL: 185 points: Project 4 Report

TOTAL: 20 points: Algorithmic Description of the Code

(05 points) Description of the algorithms underlying the WEKA filters used. If you have already described some used filters in previous projects then copying and pasting your work would be a good idea. If issues were raised about the old descriptions, consider revising them.

(TOTAL: 15 points) Description of the ALGORITHMS underlying the data mining methods used in this project (see METHODS). This includes:
(10 points) Descriptions of model construction

(5 points) Descriptions of instance classification/prediction


! extra credit points for description of the optional methods (see OPTIONAL METHODS)

+ (very) high-level pseudo code with explanation and justification

- - raw pseudo code with no justification or reasoning behind the steps involved

- - - - - structural description of WEKA code, i.e. classes/members/methods


TOTAL: 5 points: Slides
(5 points) How well do they summarize concisely the results of the project? We suggest you summarize the setting of your experiments and their results in a tabular manner.
! excellent summary and presentation of results in the slides

+ (potential) unanswered questions presented to class

+ visual aids

+ + main ideas and observations summarized

- nothing but accuracy measures

- lack of visual aids


TOTAL: 5 points: Pre-Processing of the Datasets
(5 points) Preprocessing attributes as needed and dealing with missing values appropriately
! Trying to do "fancier" things with attributes (e.g., combining two highly correlated attributes into one, using background knowledge, etc.)

TOTAL: 155 points: Experiments Goals
(30 points) Good description of the experiment setting and the results. This applies to all the experiments done for this project.
! ! ! meaningful graphical representation of the results (i.e. the k-nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters, in the case of clustering)

+ + + motivation for experiments described

+ specify testing method used (cross-validation / % split / etc.)

+ + overall summary of your experiments in one concise list that includes relevant parameters

- trivial experiment details that are not used in the discussion (full algorithm parameters, full WEKA output, etc.)

- - ambiguous or unclear setting and/or results


(30 points) Good analysis of the results of the experiments INCLUDING discussion of evaluations statistics returned by the WEKA systems (accuracy and/or errors) and discussion of particularly interesting results. This applies to all the experiments done for this project.
! ! ! ! excellent analysis of the results and comparisons

! ! ! ! running additional interesting experiments

- - - - results rewritten in prose and used in place of discussion and analysis

- - looking only at accuracy/error measures

TOTAL: 35 points: Method-Oriented Goals
(35 points) ran a good number of experiments to get familiar with the data mining methods in this project (see METHODS) using a variety of datasets (see DATASETS)
+ explore variation in non-trivial method parameters (debug/verbose mode IS a trivial parameter)

- - looking only at accuracy/error measures resulting from parameter changes

TOTAL: 40 points: Dataset-Oriented Goals
(20 points) 5 or more specific questions/conjectures about the dataset domain that you aim to answer/validate with your experiments
- questions not specific to the dataset

(20 points) Comparison of the results across all the methods used in this project (see METHODS). That is, compare the results of one method on this dataset to the results of the other methods on the same dataset.
! ! comparison with methods from past projects
- - comparing only accuracy/error measures

TOTAL: 20 points: Holistic Goals
(20 points) Argumentation of weaknesses and/or strengths of each of the methods on this dataset, and argumentation of which method should be preferred for this dataset and why.
TOTAL: 60 points: Extra Credit
You may want to modify the Weka code so that it outputs:
(15 points) the k-nearest neighbors of each test instance
(15 points for each of the 3 clustering methods) the specific training instances assigned to each cluster