WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2004 
Homework and Project 4: Instance Based Learning and Clustering

PROF. CAROLINA RUIZ 

DUE DATE: Part I (the individual homework assignment) is due on Friday, October 8 at 1:00 pm (NO LATE HOMEWORK SUBMISSIONS WILL BE ACCEPTED) and Parts II.1 and II.2 (the individual+group project) are due on Monday, October 11 2004 at 10 pm. 
------------------------------------------


HOMEWORK AND PROJECT DESCRIPTION

The purpose of this project is to construct the most accurate models of different aspects of the two datasets under consideration using the data mining techniques listed below. In addition, to gain a close understanding of how those methods work, this project includes following those methods by hand on a toy dataset.

HOMEWORK ASSIGNMENT

See the solutions to this HW assignment by James Martineau.

Consider the dataset below. This dataset is a subset of the Auto Miles-per-gallon (MPG) dataset that is available at the University of California Irvine (UCI) Data Repository.


@relation subset-auto-mpg

@attribute mpg numeric
@attribute cylinders numeric
@attribute horsepower numeric
@attribute weight numeric
@attribute acceleration numeric
@attribute model-year numeric
@attribute car-name {chevrolet,toyota,volkswagen,ford}

@data

18,  8,  130,  3504,  12,    70,  chevrolet
27,  4,  90,   2950,  17.3,  82,  chevrolet
34,  4,  88,   2395,  18,    82,  chevrolet
24,  4,  95,   2372,  15,    70,  toyota
28,  4,  75,   2155,  16.4,  76,  toyota
32,  4,  96,   2665,  13.9,  82,  toyota
26,  4,  46,   1835,  20.5,  70,  volkswagen
29,  4,  70,   1937,  14.2,  76,  volkswagen
44,  4,  52,   2130,  24.6,  82,  volkswagen
10,  8,  215,  4615,  14,    70,  ford
28,  4,  79,   2625,  18.6,  82,  ford
16,  8,  149,  4335,  14.5,  77,  ford

  1. (40 points + 10 extra credit points) Instance Based Learning
    For this part, we want to predict the mpg attribute (prediction target) from the other predicting attributes cylinders, horsepower, weight, acceleration, model-year, and car-name

    1. (10 points) Compute the 5 nearest neighbors of each of the four test instances below. For this part, use the Euclidean distance without modifying the numeric predicting attributes nor using attribute weights. Also, use the nominal attribute car-name as given: two different values of this attribute (e.g. "toyota" and "ford") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g. "toyota" and "toyota") will contribute 0 towards the Euclidean distance of two data instances.

      Show your work in your report.

      
      ?,    8,  150,  4464,  12,    73,  chevrolet    
      
      ?,    6,  122,  2807,  13.5,  73,  toyota      
      
      ?,    4,  60,   1834,  19,    71,  volkswagen  
      
      ?,    4,  66,   1800,  14.4,  78,  ford       
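As an illustration of the distance computation just described, here is a minimal Python sketch (this is not the Weka implementation; the tuple layout and the helper name `distance` are our own):

```python
import math

def distance(a, b):
    """Euclidean distance between two instances given as tuples
    (cylinders, horsepower, weight, acceleration, model-year, car-name).
    Numeric attributes are used unmodified; the nominal car-name
    contributes 1 if the two values differ and 0 if they are equal."""
    d2 = sum((x - y) ** 2 for x, y in zip(a[:5], b[:5]))  # numeric part
    d2 += 0 if a[5] == b[5] else 1                        # nominal part
    return math.sqrt(d2)

# First training instance vs. the first test instance above
train_inst = (8, 130, 3504, 12, 70, "chevrolet")
test_inst = (8, 150, 4464, 12, 73, "chevrolet")
d = distance(train_inst, test_inst)
```

Note that with the raw attribute values the weight difference dominates the distance, which is what motivates the normalization asked for in part 2 below.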
      
      

      1. (5 points) Predict the mpg attribute of the test instances using their 5 nearest neighbors without any distance weighting. (Note that now the mpg value of each instance is provided so that you can compute error values).

        Show your work in your report.

                                                         5-NN          
                                                         PREDICTION   
        
        13,    8,  150,  4464,  12,    73,  chevrolet    __________
        
        20,    6,  122,  2807,  13.5,  73,  toyota       __________
        
        27,    4,  60,   1834,  19,    71,  volkswagen   __________
        
        36.1,  4,  66,   1800,  14.4,  78,  ford         __________
        
        
                                                         5-NN        
                                                         ERROR      
        
        root mean-square error (see p. 148)              __________
        
        mean absolute error (see p. 148)                 __________
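The two error measures referenced above (textbook p. 148) can be sketched in Python as follows; the helper names and the sample predictions are illustrative only, not the answers:

```python
import math

def rmse(actual, predicted):
    """Root mean-square error: sqrt of the mean of squared differences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical predictions for the four test instances (placeholders only)
actual = [13, 20, 27, 36.1]
predicted = [18.0, 24.0, 28.0, 30.0]
```

RMSE penalizes large individual errors more heavily than MAE, so the two measures can rank prediction methods differently.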
        
        

      2. (5 points) Predict the target attribute mpg of each test instance using its 5 nearest neighbors weighted by the inverse of the distance. That is, if the mpg values of the 5 nearest neighbors of a test instance are m1, m2, m3, m4, and m5 and their respective distances to the test instance are d1, d2, d3, d4, and d5, then the weights of the 5 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, w4 = 1/d4, and w5 = 1/d5, and the predicted value for the test instance is:
               
                 (w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) + (w5*m5) 
                 ________________________________________________
                           w1 + w2 + w3 + w4 + w5
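The formula above can be sketched in code as follows (illustrative helper; `neighbors` pairs each neighbor's mpg value m_i with its distance d_i to the test instance):

```python
def weighted_prediction(neighbors):
    """Inverse-distance-weighted average: given (mpg, distance) pairs
    for the 5 nearest neighbors, weight each mpg value by w_i = 1/d_i."""
    weights = [1.0 / d for _, d in neighbors]
    numerator = sum(w * m for w, (m, _) in zip(weights, neighbors))
    return numerator / sum(weights)
```

When all distances are equal this reduces to the plain 5-NN average; a neighbor at half the distance of another gets twice the weight.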
              
             

        Show your work in your report.

                                                         5-NN with weight 
                                                         PREDICTION
        
        13,    8,  150,  4464,  12,    73,  chevrolet    __________
        
        20,    6,  122,  2807,  13.5,  73,  toyota       __________
        
        27,    4,  60,   1834,  19,    71,  volkswagen   __________
        
        36.1,  4,  66,   1800,  14.4,  78,  ford         __________
        
        
                                                         5-NN with weight 
                                                         ERROR
        
        root mean-square error (see p. 148)              __________
        
        mean absolute error (see p. 148)                 __________
        
        

    2. Now preprocess each of the predicting numeric attributes (i.e. cylinders, horsepower, weight, acceleration, and model-year) so that they range from 0 to 1. That is, for each numeric attribute (except for mpg), replace each value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively.
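This rescaling step can be sketched as follows (illustrative helper name; assumes max. value > min. value for each column):

```python
def min_max_normalize(column):
    """Rescale a list of numeric values to [0, 1] using
    (value - min. value) / (max. value - min. value)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Example: the cylinders column of the dataset above (min 4, max 8)
cylinders = [8, 4, 4, 4, 4, 4, 4, 4, 4, 8, 4, 8]
normalized = min_max_normalize(cylinders)
```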

      1. (5 points) Show your work in your report and show the resulting dataset.

      2. (10 points) Compute the 5 nearest neighbors of each of the four test instances below with respect to this normalized dataset. For this part, use the Euclidean distance without attribute weights. Also, use the nominal attribute car-name as given: two different values of this attribute (e.g. "toyota" and "ford") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g. "toyota" and "toyota") will contribute 0 towards the Euclidean distance of two data instances.

        Show your work in your report.

        
        ?,    8,  150,  4464,  12,    73,  chevrolet    
        
        ?,    6,  122,  2807,  13.5,  73,  toyota      
        
        ?,    4,  60,   1834,  19,    71,  volkswagen  
        
        ?,    4,  66,   1800,  14.4,  78,  ford       
        
        

        1. (5 points) Predict the mpg attribute of the test instances using their 5 nearest neighbors without any distance weighting. (Note that now the mpg value of each instance is provided so that you can compute error values).

          Show your work in your report.

                                                           5-NN          
                                                           PREDICTION   
          
          13,    8,  150,  4464,  12,    73,  chevrolet    __________
          
          20,    6,  122,  2807,  13.5,  73,  toyota       __________
          
          27,    4,  60,   1834,  19,    71,  volkswagen   __________
          
          36.1,  4,  66,   1800,  14.4,  78,  ford         __________
          
          
                                                           5-NN        
                                                           ERROR      
          
          root mean-square error (see p. 148)              __________
          
          mean absolute error (see p. 148)                 __________
          
          

        2. (5 points) Predict the target attribute mpg of each test instance using its 5 nearest neighbors weighted by the inverse of the distance. That is, if the mpg values of the 5 nearest neighbors of a test instance are m1, m2, m3, m4, and m5 and their respective distances to the test instance are d1, d2, d3, d4, and d5, then the weights of the 5 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, w4 = 1/d4, and w5 = 1/d5, and the predicted value for the test instance is:
                 
                   (w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) + (w5*m5) 
                   ________________________________________________
                             w1 + w2 + w3 + w4 + w5
                
               

          Show your work in your report.

                                                           5-NN with weight 
                                                           PREDICTION
          
          13,    8,  150,  4464,  12,    73,  chevrolet    __________
          
          20,    6,  122,  2807,  13.5,  73,  toyota       __________
          
          27,    4,  60,   1834,  19,    71,  volkswagen   __________
          
          36.1,  4,  66,   1800,  14.4,  78,  ford         __________
          
          
                                                           5-NN with weight 
                                                           ERROR
          
          root mean-square error (see p. 148)              __________
          
          mean absolute error (see p. 148)                 __________
          
          

    3. (5 points) Which of the methods above produced the best results? Explain.

  2. (40 points) Clustering - Simple K-means

    1. Assume that we want to cluster the instances in the dataset subset-auto-mpg above in 3 clusters. Use the Simple K-means algorithm with Euclidean distance without modifying the numeric predicting attributes nor using attribute weights. Also, use the nominal attribute car-name as given: two different values of this attribute (e.g. "toyota" and "ford") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g. "toyota" and "toyota") will contribute 0 towards the Euclidean distance of two data instances.

      Assume that the 3 randomly selected initial centroids are:

      18,  8,  130,  3504,  12,    70,  chevrolet
      
      32,  4,  96,   2665,  13.9,  82,  toyota
      
      26,  4,  46,   1835,  20.5,  70,  volkswagen
      
      Show the first 2 iterations of the Simple K-means clustering algorithm. That is:

      1. (5 points) Assign data instances to the clusters represented by the 3 centroids.

      2. (5 points) Compute new centroids for the 3 resulting clusters.

      3. (5 points) Assign data instances to the clusters represented by the 3 new centroids.

      4. (5 points) Compute new centroids for these 3 new resulting clusters.

      Show your work in your report.
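One iteration of the two steps above (assign instances, then recompute centroids) can be sketched as follows. This is an illustration under stated assumptions, not Weka's SimpleKMeans code: instances are 7-tuples (mpg, cylinders, horsepower, weight, acceleration, model-year, car-name), the distance is Euclidean over the six numeric attributes plus the 0/1 nominal contribution, and the nominal value of a centroid is taken to be the most frequent car-name in the cluster.

```python
import math

def dist(a, b):
    """Euclidean distance; the nominal car-name adds 1 when values differ."""
    d2 = sum((x - y) ** 2 for x, y in zip(a[:6], b[:6]))
    return math.sqrt(d2 + (0 if a[6] == b[6] else 1))

def assign(instances, centroids):
    """Steps 1 and 3: put each instance in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for inst in instances:
        nearest = min(range(len(centroids)), key=lambda k: dist(inst, centroids[k]))
        clusters[nearest].append(inst)
    return clusters

def new_centroid(cluster):
    """Steps 2 and 4: mean of each numeric attribute, mode of car-name."""
    n = len(cluster)
    numeric = [sum(inst[k] for inst in cluster) / n for k in range(6)]
    names = [inst[6] for inst in cluster]
    return tuple(numeric) + (max(set(names), key=names.count),)
```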

    2. Repeat the process above but now starting with a normalized dataset. That is, assume that we want to cluster the instances in the dataset above in 3 clusters using the Simple K-means algorithm with Euclidean distance without attribute weights, but preprocessing each of the predicting numeric attributes (i.e. mpg, cylinders, horsepower, weight, acceleration, and model-year) so that they range from 0 to 1. That is, for each numeric attribute, replace each value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively. Use the nominal attribute car-name as given: two different values of this attribute (e.g. "toyota" and "ford") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g. "toyota" and "toyota") will contribute 0 towards the Euclidean distance of two data instances.

      Show your work in your report and show the resulting dataset.

      Assume that the 3 randomly selected initial centroids are:

      18,  8,  130,  3504,  12,    70,  chevrolet
      
      32,  4,  96,   2665,  13.9,  82,  toyota
      
      26,  4,  46,   1835,  20.5,  70,  volkswagen
      
      Show the first 2 iterations of the Simple K-means clustering algorithm. That is:

      1. (5 points) Assign data instances to the clusters represented by the 3 centroids.

      2. (5 points) Compute new centroids for the 3 resulting clusters.

      3. (5 points) Assign data instances to the clusters represented by the 3 new centroids.

      4. (5 points) Compute new centroids for these 3 new resulting clusters.

      Show your work in your report.

  3. (20 points) Clustering - Hierarchical Clustering
    Assume that we want to cluster the dataset subset-auto-mpg above using the COBWEB/CLASSIT clustering algorithm.

    Assume that we have followed the COBWEB/CLASSIT algorithm and so far we have created a partial clustering containing just the first 3 instances. The following tree shows the current partial clustering, where the numbers in parentheses (1), (2), (3) represent the 1st, 2nd, and 3rd data instances respectively and "0" denotes just an internal node. Note that in this partial clustering, instances (1) and (3) are clustered together and instance (2) forms a cluster on its own.

                                 O
                              /     \ 
                             0      (2)
                           /   \
                          (1) (3)
    
    Your job is to describe all the alternatives that are considered by the COBWEB/CLASSIT algorithm when adding instance (4) to this clustering tree.

    1. (2 points) How many such alternatives are considered by the COBWEB/CLASSIT algorithm?

    2. (15 points) List each one of those alternatives and show what the resulting clustering tree would be for each of those alternatives.

    3. (3 points) Explain how the COBWEB/CLASSIT algorithm would choose which of those alternatives to select as the best one. Note: you don't need to compute the category utility value of each of those alternatives. You just need to explain how the selection of the best alternative is made by the algorithm.

PROJECT ASSIGNMENT

  1. Datasets: Consider the following sets of data:

    1. The Synthetic Control Chart Time Series Dataset available from the University of California Irvine (UCI) KDD Data Repository.

      When needed, use the last attribute as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

      Graphical representations of the time series provide intuition and convey very useful information, so it is required in this project that you use a graphing tool (e.g. Excel) to plot the resulting models that you construct over this time series dataset: for example, the set of k nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters, in the case of clustering. Modify the Weka code as needed so that you can gain access to the necessary data for those plots (i.e. the k nearest neighbors of a test instance, and the instances in each resulting cluster).

    2. Choose one of the following two datasets (please coordinate with your group partner so that you both choose the same one):

      1. The Abalone Dataset available from the University of California Irvine (UCI) Data Repository.

        Use the attribute age as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

      2. The Automobile Dataset available from the University of California Irvine (UCI) Data Repository.

        Use the attribute price as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

      Meaningful graphical representation of the results (i.e. k-nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters; in the case of clusterings) for the chosen dataset will be rewarded with extra credit.

  2. Readings:

  3. Experiments: For each of the datasets, use the "Explorer" option of the Weka system to perform the following operations:

  4. INDIVIDUAL PROJECT AND WRITTEN REPORT.

    Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:

    1. Code Description: Describe algorithmically the Weka code of the following 3 methods: IBk: k-nearest neighbors, Locally Weighted Regression (LWL), and Simple K-means (optional for extra points: COBWEB/CLASSIT and EM), as well as any filters that you use in the project. More precisely, explain the ALGORITHM underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output. PLEASE NOTE THAT WE EXPECT A DETAILED DESCRIPTION OF THE ALGORITHMS USED, NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE.

    2. Experiments: For EACH EXPERIMENT YOU RAN describe:
      • Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it? Why?
      • Any pre-processing done to the data. That is, did you remove any attributes? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
      • Your system parameters.
      • For each of the methods covered by this project IBk: k-nearest neighbors, Locally Weighted Regression (LWL), and Simple K-means (optional for extra points: COBWEB/CLASSIT and EM);
        • Results and detail ANALYSIS of results of the experiments you ran using different ways of testing (split ratio and N-fold cross-validation) the model.
        • Error/accuracy measures of the resulting models.
        • NEW PART: Meaningful graphical representation of the results (i.e. k-nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters, in the case of clustering). This is required for the time series dataset, and optional (extra credit) for the second dataset.
        • Comparison of the results across the methods used in this project.

    3. Summary of Results
      • For each of the datasets, what were the best results obtained in your project? Include (the first 100 lines or so of) this model in your report.
      • Strengths and weaknesses of your project.

  5. GROUP PROJECT AND WRITTEN REPORT. Your group report should contain discussions of all the parts of the group project. In particular, it should elaborate on the following topics:

    1. Experiments: as described in the individual part above. For the group part, start by running experiments that build upon the experience that you gained with the individual projects.

    2. Summary of Results as described in the individual part above.

  6. GROUP ORAL REPORT. We will discuss the results from the individual projects during the class on Tuesday, October 12. Your oral report should summarize the different sections of your written report as described above. Each group will have about 4 minutes to explain your results and to discuss your project in class. Be prepared!

    PROJECT SUBMISSION AND DUE DATE

    Part II is due Monday, Oct. 11 at 10:00 pm. Submissions received on Monday, Oct. 11 between 10:01 pm and 12:00 midnight will be penalized with 30% off the grade and submissions after Oct. 11 won't be accepted.

    Please submit the following files using the myWpi digital drop box:

    1. [lastname]_proj4_report.[ext] containing your individual written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
      • ruiz_proj4_report.pdf

      If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.

    2. [lastname1_lastname2]_proj4_report.[ext] containing your group written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
      • ruiz_smith_proj4_report.pdf if I worked with Joe Smith on this project.

    3. [lastname1_lastname2]_proj4_slides.[ext] (or [lastname]_proj4_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project (both individual and group parts, and classification and association rules).

GRADING CRITERIA

(TOTAL: 100 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.

(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:

  (TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE 
  (05 points) Description of the algorithm underlying the Weka filters used
  (10 points) Description of the ALGORITHM underlying the data mining 
              methods used in this project.
  (up to 10 extra credit points for an outstanding job) 
  (extra credit points for description of the optional methods: COBWEB/CLASSIT and EM) 
  (providing just a structural description of the code, i.e. a list of 
  classes and methods, will receive 0 points)
  
  (TOTAL: 5 points) PRE-PROCESSING OF THE DATASET:
  Preprocessing attributes as needed and dealing with missing values appropriately
  (up to 5 extra credit points) 
             Trying to do "fancier" things with attributes
             (e.g. combining two highly correlated attributes
              into one, using background knowledge, etc.)
    
  (TOTAL: 160 points: 80 points for the individual part and 80 points for the group part) 
  EXPERIMENTS
  (TOTAL: 40 points each dataset) FOR EACH DATASET:
         (05 points) ran a good number of experiments to get familiar with the 
                     data mining methods in this project
         (10 points) good description of the experiment setting and the results 
         (10 points) good analysis of the results of the experiments
                     INCLUDING discussion of the evaluation statistics returned by
                     the Weka system (accuracy and/or errors) and discussion of
                     particularly interesting results 
         (10 points) good graphical representations of the results for the 1st dataset 
         (10 extra points) good graphical representations of the results for the 2nd dataset 
         (05 points) comparison of the results with those obtained using the other
                     methods in this project.
                     Argumentation of weaknesses and/or strengths of each of the
                     methods on this dataset, and argumentation of which method
                     should be preferred for this dataset and why. 
       (up to 10 extra credit points) excellent analysis of the results and 
                                       comparisons
         (up to 10 extra credit points) running additional interesting experiments
  
  (TOTAL 5 points) SLIDES - how well do they summarize concisely
          the results of the project? We suggest you summarize the
          setting of your experiments and their results in a tabular manner.
     (up to 6 extra credit points) for excellent summary and presentation of results 
     in the slides.

  (TOTAL 15 points) Class presentation - how well your oral presentation summarized
        concisely the results of the project and how focused your presentation was
        on the more creative/interesting/useful of your experiments and results.
        This grade is given individually to each team member.