WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2004 
Homework and Project 4: Instance Based Learning and Clustering

PROF. CAROLINA RUIZ 

DUE DATE: Part I (the individual homework assignment) is due on Friday, October 8 at 1:00 pm (NO LATE HOMEWORK SUBMISSIONS WILL BE ACCEPTED) and Parts II.1 and II.2 (the individual+group project) are due on Monday, October 11 2004 at 10 pm. 
------------------------------------------


HOMEWORK AND PROJECT DESCRIPTION

The purpose of this project is to construct the most accurate models of different aspects of the two datasets under consideration using the data mining techniques listed below. In addition, to gain a close understanding of how those methods work, this project includes following those methods by hand on a toy dataset.

HOMEWORK ASSIGNMENT

See the solutions to this HW assignment by James Martineau.

Consider the dataset below. This dataset is a subset of the Auto Miles-per-gallon (MPG) dataset that is available at the University of California Irvine (UCI) Data Repository.


@relation subset-auto-mpg

@attribute mpg numeric
@attribute cylinders numeric
@attribute horsepower numeric
@attribute weight numeric
@attribute acceleration numeric
@attribute model-year numeric
@attribute car-name {chevrolet,toyota,volkswagen,ford}

@data

18,  8,  130,  3504,  12,    70,  chevrolet
27,  4,  90,   2950,  17.3,  82,  chevrolet
34,  4,  88,   2395,  18,    82,  chevrolet
24,  4,  95,   2372,  15,    70,  toyota
28,  4,  75,   2155,  16.4,  76,  toyota
32,  4,  96,   2665,  13.9,  82,  toyota
26,  4,  46,   1835,  20.5,  70,  volkswagen
29,  4,  70,   1937,  14.2,  76,  volkswagen
44,  4,  52,   2130,  24.6,  82,  volkswagen
10,  8,  215,  4615,  14,    70,  ford
28,  4,  79,   2625,  18.6,  82,  ford
16,  8,  149,  4335,  14.5,  77,  ford

  1. (40 points + 10 extra credit points) Instance Based Learning
    For this part, we want to predict the mpg attribute (prediction target) from the other predicting attributes cylinders, horsepower, weight, acceleration, model-year, and car-name

    1. (10 points) Compute the 5 nearest neighbors of each of the four test instances below. For this part, use the Euclidean distance without modifying the numeric predicting attributes nor using attribute weights. Also, use the nominal attribute car-name as given: two different values of this attribute (e.g. "toyota" and "ford") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g. "toyota" and "toyota") will contribute 0 towards the Euclidean distance of two data instances.

      Show your work in your report.

      
      ?,    8,  150,  4464,  12,    73,  chevrolet    
      
      ?,    6,  122,  2807,  13.5,  73,  toyota      
      
      ?,    4,  60,   1834,  19,    71,  volkswagen  
      
      ?,    4,  66,   1800,  14.4,  78,  ford       
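As an illustration of the distance computation just described, here is a minimal Python sketch (this is not the Weka implementation; the tuple layout and the helper name `distance` are our own):

```python
import math

def distance(a, b):
    """Euclidean distance between two instances given as tuples
    (cylinders, horsepower, weight, acceleration, model-year, car-name).
    Numeric attributes are used unmodified; the nominal car-name
    contributes 1 if the two values differ and 0 if they are equal."""
    d2 = sum((x - y) ** 2 for x, y in zip(a[:5], b[:5]))  # numeric part
    d2 += 0 if a[5] == b[5] else 1                        # nominal part
    return math.sqrt(d2)

# First training instance vs. the first test instance above
train_inst = (8, 130, 3504, 12, 70, "chevrolet")
test_inst = (8, 150, 4464, 12, 73, "chevrolet")
d = distance(train_inst, test_inst)
```

Note that with the raw attribute values the weight difference dominates the distance, which is what motivates the normalization asked for in part 2 below.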
      
      

      1. (5 points) Predict the mpg attribute of the test instances using their 5 nearest neighbors without any distance weighting. (Note that now the mpg value of each instance is provided so that you can compute error values).

        Show your work in your report.

                                                         5-NN          
                                                         PREDICTION   
        
        13,    8,  150,  4464,  12,    73,  chevrolet    __________
        
        20,    6,  122,  2807,  13.5,  73,  toyota       __________
        
        27,    4,  60,   1834,  19,    71,  volkswagen   __________
        
        36.1,  4,  66,   1800,  14.4,  78,  ford         __________
        
        
                                                         5-NN        
                                                         ERROR      
        
        root mean-square error (see p. 148)              __________
        
        mean absolute error (see p. 148)                 __________
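The two error measures referenced above (textbook p. 148) can be sketched in Python as follows; the helper names and the sample predictions are illustrative only, not the answers:

```python
import math

def rmse(actual, predicted):
    """Root mean-square error: sqrt of the mean of squared differences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical predictions for the four test instances (placeholders only)
actual = [13, 20, 27, 36.1]
predicted = [18.0, 24.0, 28.0, 30.0]
```

RMSE penalizes large individual errors more heavily than MAE, so the two measures can rank prediction methods differently.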
        
        

      2. (5 points) Predict the target attribute mpg of each test instance using its 5 nearest neighbors weighted by the inverse of the distance. That is, if the mpg values of the 5 nearest neighbors of a test instance are m1, m2, m3, m4, and m5 and their respective distances to the test instance are d1, d2, d3, d4, and d5, then the weights of the 5 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, w4 = 1/d4, and w5 = 1/d5, and the predicted value for the test instance is:
               
                 (w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) + (w5*m5) 
                 ________________________________________________
                           w1 + w2 + w3 + w4 + w5
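The formula above can be sketched in code as follows (illustrative helper; `neighbors` pairs each neighbor's mpg value m_i with its distance d_i to the test instance):

```python
def weighted_prediction(neighbors):
    """Inverse-distance-weighted average: given (mpg, distance) pairs
    for the 5 nearest neighbors, weight each mpg value by w_i = 1/d_i."""
    weights = [1.0 / d for _, d in neighbors]
    numerator = sum(w * m for w, (m, _) in zip(weights, neighbors))
    return numerator / sum(weights)
```

When all distances are equal this reduces to the plain 5-NN average; a neighbor at half the distance of another gets twice the weight.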
              
             

        Show your work in your report.

                                                         5-NN with weight 
                                                         PREDICTION
        
        13,    8,  150,  4464,  12,    73,  chevrolet    __________
        
        20,    6,  122,  2807,  13.5,  73,  toyota       __________
        
        27,    4,  60,   1834,  19,    71,  volkswagen   __________
        
        36.1,  4,  66,   1800,  14.4,  78,  ford         __________
        
        
                                                         5-NN with weight 
                                                         ERROR
        
        root mean-square error (see p. 148)              __________
        
        mean absolute error (see p. 148)                 __________
        
        

    2. Now preprocess each of the predicting numeric attributes (i.e. cylinders, horsepower, weight, acceleration, and model-year) so that they range from 0 to 1. That is, for each numeric attribute (except for mpg), replace each value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively.
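This rescaling step can be sketched as follows (illustrative helper name; assumes max. value > min. value for each column):

```python
def min_max_normalize(column):
    """Rescale a list of numeric values to [0, 1] using
    (value - min. value) / (max. value - min. value)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Example: the cylinders column of the dataset above (min 4, max 8)
cylinders = [8, 4, 4, 4, 4, 4, 4, 4, 4, 8, 4, 8]
normalized = min_max_normalize(cylinders)
```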

      1. (5 points) Show your work in your report and show the resulting dataset.

      2. (10 points) Compute the 5 nearest neighbors of each of the four test instances below with respect to this normalized dataset. For this part, use the Euclidean distance without attribute weights. Also, use the nominal attribute car-name as given: two different values of this attribute (e.g. "toyota" and "ford") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g. "toyota" and "toyota") will contribute 0 towards the Euclidean distance of two data instances.

        Show your work in your report.

        
        ?,    8,  150,  4464,  12,    73,  chevrolet    
        
        ?,    6,  122,  2807,  13.5,  73,  toyota      
        
        ?,    4,  60,   1834,  19,    71,  volkswagen  
        
        ?,    4,  66,   1800,  14.4,  78,  ford       
        
        

        1. (5 points) Predict the mpg attribute of the test instances using their 5 nearest neighbors without any distance weighting. (Note that now the mpg value of each instance is provided so that you can compute error values).

          Show your work in your report.

                                                           5-NN          
                                                           PREDICTION   
          
          13,    8,  150,  4464,  12,    73,  chevrolet    __________
          
          20,    6,  122,  2807,  13.5,  73,  toyota       __________
          
          27,    4,  60,   1834,  19,    71,  volkswagen   __________
          
          36.1,  4,  66,   1800,  14.4,  78,  ford         __________
          
          
                                                           5-NN        
                                                           ERROR      
          
          root mean-square error (see p. 148)              __________
          
          mean absolute error (see p. 148)                 __________
          
          

        2. (5 points) Predict the target attribute mpg of each test instance using its 5 nearest neighbors weighted by the inverse of the distance. That is, if the mpg values of the 5 nearest neighbors of a test instance are m1, m2, m3, m4, and m5 and their respective distances to the test instance are d1, d2, d3, d4, and d5, then the weights of the 5 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, w4 = 1/d4, and w5 = 1/d5, and the predicted value for the test instance is:
                 
                   (w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) + (w5*m5) 
                   ________________________________________________
                             w1 + w2 + w3 + w4 + w5
                
               

          Show your work in your report.

                                                           5-NN with weight 
                                                           PREDICTION
          
          13,    8,  150,  4464,  12,    73,  chevrolet    __________
          
          20,    6,  122,  2807,  13.5,  73,  toyota       __________
          
          27,    4,  60,   1834,  19,    71,  volkswagen   __________
          
          36.1,  4,  66,   1800,  14.4,  78,  ford         __________
          
          
                                                           5-NN with weight 
                                                           ERROR
          
          root mean-square error (see p. 148)              __________
          
          mean absolute error (see p. 148)                 __________
          
          

    3. (5 points) Which of the methods above produced the best results? Explain.

  2. (40 points) Clustering - Simple K-means

    1. Assume that we want to cluster the instances in the dataset subset-auto-mpg above in 3 clusters. Use the Simple K-means algorithm with Euclidean distance without modifying the numeric predicting attributes nor using attribute weights. Also, use the nominal attribute car-name as given: two different values of this attribute (e.g. "toyota" and "ford") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g. "toyota" and "toyota") will contribute 0 towards the Euclidean distance of two data instances.

      Assume that the 3 randomly selected initial centroids are:

      18,  8,  130,  3504,  12,    70,  chevrolet
      
      32,  4,  96,   2665,  13.9,  82,  toyota
      
      26,  4,  46,   1835,  20.5,  70,  volkswagen
      
      Show the first 2 iterations of the Simple K-means clustering algorithm. That is:

      1. (5 points) Assign data instances to the clusters represented by the 3 centroids.

      2. (5 points) Compute new centroids for the 3 resulting clusters.

      3. (5 points) Assign data instances to the clusters represented by the 3 new centroids.

      4. (5 points) Compute new centroids for these 3 new resulting clusters.

      Show your work in your report.
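One iteration of the two steps above (assign instances, then recompute centroids) can be sketched as follows. This is an illustration under stated assumptions, not Weka's SimpleKMeans code: instances are 7-tuples (mpg, cylinders, horsepower, weight, acceleration, model-year, car-name), the distance is Euclidean over the six numeric attributes plus the 0/1 nominal contribution, and the nominal value of a centroid is taken to be the most frequent car-name in the cluster.

```python
import math

def dist(a, b):
    """Euclidean distance; the nominal car-name adds 1 when values differ."""
    d2 = sum((x - y) ** 2 for x, y in zip(a[:6], b[:6]))
    return math.sqrt(d2 + (0 if a[6] == b[6] else 1))

def assign(instances, centroids):
    """Steps 1 and 3: put each instance in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for inst in instances:
        nearest = min(range(len(centroids)), key=lambda k: dist(inst, centroids[k]))
        clusters[nearest].append(inst)
    return clusters

def new_centroid(cluster):
    """Steps 2 and 4: mean of each numeric attribute, mode of car-name."""
    n = len(cluster)
    numeric = [sum(inst[k] for inst in cluster) / n for k in range(6)]
    names = [inst[6] for inst in cluster]
    return tuple(numeric) + (max(set(names), key=names.count),)
```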

    2. Repeat the process above but now starting with a normalized dataset. That is, assume that we want to cluster the instances in the dataset above in 3 clusters using the Simple K-means algorithm with Euclidean distance without attribute weights, but preprocessing each of the predicting numeric attributes (i.e. mpg, cylinders, horsepower, weight, acceleration, and model-year) so that they range from 0 to 1. That is, for each numeric attribute, replace each value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively. Use the nominal attribute car-name as given: two different values of this attribute (e.g. "toyota" and "ford") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g. "toyota" and "toyota") will contribute 0 towards the Euclidean distance of two data instances.

      Show your work in your report and show the resulting dataset.

      Assume that the 3 randomly selected initial centroids are:

      18,  8,  130,  3504,  12,    70,  chevrolet
      
      32,  4,  96,   2665,  13.9,  82,  toyota
      
      26,  4,  46,   1835,  20.5,  70,  volkswagen
      
      Show the first 2 iterations of the Simple K-means clustering algorithm. That is:

      1. (5 points) Assign data instances to the clusters represented by the 3 centroids.

      2. (5 points) Compute new centroids for the 3 resulting clusters.

      3. (5 points) Assign data instances to the clusters represented by the 3 new centroids.

      4. (5 points) Compute new centroids for these 3 new resulting clusters.

      Show your work in your report.

  3. (20 points) Clustering - Hierarchical Clustering
    Assume that we want to cluster the dataset subset-auto-mpg above using the COBWEB/CLASSIT clustering algorithm.

    Assume that we have followed the COBWEB/CLASSIT algorithm and so far we have created a partial clustering containing just the first 3 instances. The following tree shows the current partial clustering, where the numbers in parentheses (1), (2), (3) represent the 1st, 2nd, and 3rd data instances respectively and "0" denotes just an internal node. Note that in this partial clustering, instances (1) and (3) are clustered together and instance (2) forms a cluster on its own.

                                 O
                              /     \ 
                             0      (2)
                           /   \
                          (1) (3)
    
    Your job is to describe all the alternatives that are considered by the COBWEB/CLASSIT algorithm when adding instance (4) to this clustering tree.

    1. (2 points) How many such alternatives are considered by the COBWEB/CLASSIT algorithm?

    2. (15 points) List each one of those alternatives and show what the resulting clustering tree would be for each of those alternatives.

    3. (3 points) Explain how the COBWEB/CLASSIT algorithm would choose which of those alternatives to select as the best one. Note: you don't need to compute the category utility value of each of those alternatives. You just need to explain how the selection of the best alternative is made by the algorithm.

PROJECT ASSIGNMENT

  1. Datasets: Consider the following sets of data:

    1. The Synthetic Control Chart Time Series Dataset available from the University of California Irvine (UCI) KDD Data Repository.

      When needed, use the last attribute as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

      Graphical representations of the time series provide intuition and convey very useful information, so it is required in this project that you use a graphing tool (e.g. Excel) to plot the resulting models that you construct over this time series dataset: for example, the set of k nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters, in the case of clustering. Modify the Weka code as needed so that you can gain access to the necessary data for those plots (i.e. the k nearest neighbors of a test instance, and the instances in each resulting cluster).

    2. Choose one of the following two datasets (please coordinate with your group partner so that you both choose the same one):

      1. The Abalone Dataset available from the University of California Irvine (UCI) Data Repository.

        Use the attribute age as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

      2. The Automobile Dataset available from the University of California Irvine (UCI) Data Repository.

        Use the attribute price as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

      Meaningful graphical representation of the results (i.e. k-nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters; in the case of clusterings) for the chosen dataset will be rewarded with extra credit.

  2. Readings:

  3. Experiments: For each of the datasets, use the "Explorer" option of the Weka system to perform the following operations:

  4. INDIVIDUAL PROJECT AND WRITTEN REPORT.

    Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the following topics:

    1. Code Description: Describe algorithmically the Weka code of the following 3 methods: IBk: k-nearest neighbors, Locally Weighted Regression (LWL), and Simple K-means (optional for extra points: COBWEB/CLASSIT and EM), as well as any filters that you use in the project. More precisely, explain the ALGORITHM underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output. PLEASE NOTE THAT WE EXPECT A DETAILED DESCRIPTION OF THE ALGORITHMS USED, NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE.

    2. Experiments: For EACH EXPERIMENT YOU RAN describe:
      • Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it? Why?
      • Any pre-processing done to the data. That is, did you remove any attributes? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
      • Your system parameters.
      • For each of the methods covered by this project IBk: k-nearest neighbors, Locally Weighted Regression (LWL), and Simple K-means (optional for extra points: COBWEB/CLASSIT and EM);
        • Results and detail ANALYSIS of results of the experiments you ran using different ways of testing (split ratio and N-fold cross-validation) the model.
        • Error/accuracy measures of the resulting models.
        • NEW PART: Meaningful graphical representation of the results (i.e. k-nearest neighbors of a given test data instance, in the case of instance based learning; and the set of resulting clusters, in the case of clustering). This is required for the time series dataset, and optional (extra credit) for the second dataset.
        • Comparison of the results across the methods used in this project.

    3. Summary of Results
      • For each of the datasets, what were the best results obtained in your project? Include (the first 100 lines or so of) this model in your report.
      • Strengths and weaknesses of your project.

  5. GROUP PROJECT AND WRITTEN REPORT. Your group report should contain discussions of all the parts of the group project. In particular, it should elaborate on the following topics:

    1. Experiments: as described in the individual part above. For the group part, start by running experiments that build upon the experience that you gained with the individual projects.

    2. Summary of Results as described in the individual part above.

  6. GROUP ORAL REPORT. We will discuss the results from the individual projects during the class on Tuesday, October 12. Your oral report should summarize the different sections of your written report as described above. Each group will have about 4 minutes to explain your results and to discuss your project in class. Be prepared!

    PROJECT SUBMISSION AND DUE DATE

    Part II is due Monday, Oct. 11 at 10:00 pm. Submissions received on Monday, Oct. 11 between 10:01 pm and 12:00 midnight will be penalized with 30% off the grade and submissions after Oct. 11 won't be accepted.

    Please submit the following files using the myWpi digital drop box:

    1. [lastname]_proj4_report.[ext] containing your individual written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
      • ruiz_proj4_report.pdf

      If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.

    2. [lastname1_lastname2]_proj4_report.[ext] containing your group written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
      • ruiz_smith_proj4_report.pdf if I worked with Joe Smith on this project.

    3. [lastname1_lastname2]_proj4_slides.[ext] (or [lastname]_proj4_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project (both individual and group parts, and classification and association rules).

GRADING CRITERIA

(TOTAL: 100 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.

(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:

  (TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE 
  (05 points) Description of the algorithm underlying the Weka filters used
  (10 points) Description of the ALGORITHM underlying the data mining 
              methods used in this project.
  (up to 10 extra credit points for an outstanding job) 
  (extra credit points for description of the optional methods: COBWEB/CLASSIT and EM) 
  (providing just a structural description of the code, i.e. a list of 
  classes and methods, will receive 0 points)
  
  (TOTAL: 5 points) PRE-PROCESSING OF THE DATASET:
  Preprocessing attributes as needed and dealing with missing values appropriately
  (up to 5 extra credit points) 
             Trying to do "fancier" things with attributes
             (e.g. combining two highly correlated attributes
              into one, using background knowledge, etc.)
    
  (TOTAL: 160 points: 80 points for the individual part and 80 points for the group part) 
  EXPERIMENTS
  (TOTAL: 40 points each dataset) FOR EACH DATASET:
         (05 points) ran a good number of experiments to get familiar with the 
                     data mining methods in this project
         (10 points) good description of the experiment setting and the results 
         (10 points) good analysis of the results of the experiments
                     INCLUDING discussion of the evaluation statistics returned by
                     the Weka system (accuracy and/or errors) and discussion of
                     particularly interesting results 
         (10 points) good graphical representations of the results for the 1st dataset 
         (10 extra points) good graphical representations of the results for the 2nd dataset 
         (05 points) comparison of the results with those obtained using the other
                     methods in this project.
                     Argumentation of weaknesses and/or strengths of each of the
                     methods on this dataset, and argumentation of which method
                     should be preferred for this dataset and why. 
       (up to 10 extra credit points) excellent analysis of the results and 
                                       comparisons
         (up to 10 extra credit points) running additional interesting experiments
  
  (TOTAL 5 points) SLIDES - how well do they summarize concisely
          the results of the project? We suggest you summarize the
          setting of your experiments and their results in a tabular manner.
     (up to 6 extra credit points) for excellent summary and presentation of results 
     in the slides.

  (TOTAL 15 points) Class presentation - how well your oral presentation summarized
        concisely the results of the project and how focused your presentation was
        on the more creative/interesting/useful of your experiments and results.
        This grade is given individually to each team member.