CS 4445 A Term 2008 - Homework 4 and Project 4

Computer Science Department

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2008
Homework and Project 4: Instance Based Learning and Clustering

PROF. CAROLINA RUIZ

DUE DATE:

The individual homework assignment is due on Tuesday, Oct. 7th 2008 at 1:00 pm

The group project is due on Friday, Oct. 10th 2008 at 12:00 noon.

Homework and Project Objectives
Homework Assignment
Project Assignment

HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is to construct the most accurate models of different aspects of the datasets under consideration using the following data mining techniques:

Instance Based Learning:
- IBk: k-nearest neighbors
- Locally Weighted Regression. This is implemented in the Weka system. Look for LWL (Locally Weighted Learning) under the "lazy" classifiers and use "Linear Regression" as the "classifier" parameter for this LWL.
Clustering:
- Simple k-means
- Hierarchical Clustering: COBWEB/CLASSIT

Also, to gain close understanding of how those methods work, this project also include following those methods by hand on a toy dataset.

INDIVIDUAL HOMEWORK ASSIGNMENT

See Solutions by Piotr Mardziel and Amro Khasawneh.

Consider the dataset below. This dataset is an adaptation of the IQ Brain Size Dataset.

@relation small_twin_dataset

@attribute CCMIDSA numeric 		% Corpus Collasum Surface Area (cm2)
@attribute GENDER {male,female}
@attribute TOTVOL numeric 		% Total Brain Volume (cm3)
@attribute WEIGHT numeric 		% Body Weight (kg)
@attribute FIQ numeric 			% Full-Scale IQ

@data
6.08,	female,	1005,	57.607,		96
5.73,	female,	963,	58.968,		89
7.99,	female,	1281,	63.958,		101
8.42,	female,	1272,	61.69,		103
6.84,	female,	1079,	107.503,	96
6.43,	female,	1070,	83.009,		126
7.6,	male,	1347,	97.524,		94
6.03,	male,	1029,	81.648,		97
7.52,	male,	1204,	79.38,		113
7.67,	male,	1160,	72.576,		124

(40 points + 10 extra credits) Instance Based Learning

Assume that we want to predict the FIQ attribute (prediction target) from the other predicting attributes CCMIDSA, GENDER, TOTVOL, and WEIGHT.
1. (10 points) Compute the 4 nearest neighbors of each of the two test instances below. For this part, use the Euclidean distance without modifying the numeric predicting attributes nor using attribute weights. Also, use the nominal attribute GENDER as given: two different values of this attribute (e.g., "female" and "male") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g., "female" and "female") will contribute 0 towards the Euclidean distance of two data instances.
  Show your work in your report.
```
CCMIDSA	GENDER 	TOTVOL 	WEIGHT 		FIQ

6.48,	female,	1034,	62.143,		?

6.59,	male,	1100,	88.452,		?
```
  1. (5 points) Predict the FIQ attribute of the test instances using their 4 nearest neighbors without any distance weighting. (Note that now the FIQ value of each instance is provided so that you can compute error values).
    Show your work in your report.
```
                                                         4-NN          
                                                         PREDICTION   
CCMIDSA	GENDER 	TOTVOL 	WEIGHT 		FIQ

6.48,	female,	1034,	62.143,		127		___________ 

6.59,	male,	1100,	88.452,		114		___________ 


                                                         4-NN        
                                                         ERROR      

root mean-square error (see p. 178)                      __________

mean absolute error (see p. 178)                         __________
```
  2. (5 points) Predict the target attribute FIQ of each test instance using its 4 nearest neighbors weighted by the inverse of the distance. That is, if the FIQ values of the 4 nearest-neighbors of a test instance are m1, m2, m3, and m4 and their respective distances to the test instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and the predicted value for the test instance is:
```
       
         (w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) 
         _______________________________________
                   w1 + w2 + w3 + w4 
      
     
```
    Show your work in your report.
```
                                                         4-NN          
                                                         PREDICTION   
CCMIDSA	GENDER 	TOTVOL 	WEIGHT 		FIQ

6.48,	female,	1034,	62.143,		127		__________

6.59,	male,	1100,	88.452,		114		__________


                                                         4-NN        
                                                         ERROR      

root mean-square error (see p. 178)                      __________

mean absolute error (see p. 178)                         __________
```
2. Now preprocess each of the predicting numeric attributes (CCMIDSA, TOTVOL, and WEIGHT.) so that they range from 0 to 1. That is, for each numeric attribute (except for FIQ), replace each value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively.
  1. (5 points) Show your work in your report and show the resulting dataset.
  2. (10 points) Compute the 4 nearest neighbors of each of the two test instances below with respect to this normalized dataset. For this part, use the Euclidean distance without attribute weights. Also, use the nominal attribute GENDER as given: two different values of this attribute (e.g., "female" and "male") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g., "female" and "female") will contribute 0 towards the Euclidean distance of two data instances.
    Show your work in your report.
```
CCMIDSA	GENDER 	TOTVOL 	WEIGHT 		FIQ

6.48,	female,	1034,	62.143,		?

6.59,	male,	1100,	88.452,		?
```
    1. (5 points) Predict the FIQ attribute of the test instances using their 4 nearest neighbors without any distance weighting. (Note that now the FIQ value of each instance is provided so that you can compute error values).
      Show your work in your report.
```
                                                         4-NN          
                                                         PREDICTION   

CCMIDSA	GENDER 	TOTVOL 	WEIGHT 		FIQ

6.48,	female,	1034,	62.143,		127		__________

6.59,	male,	1100,	88.452,		114		__________ 

                                                         4-NN        
                                                         ERROR      

root mean-square error (see p. 178)                      __________

mean absolute error (see p. 178)                         __________
```
    2. (5 points) Predict the target attribute FIQ of each test instance using its 4 nearest neighbors weighted by the inverse of the distance. That is, if the FIQ values of the 4 nearest-neighbors of a test instance are m1, m2, m3, and m4 and their respective distances to the test instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4, and the predicted value for the test instance is:
```
       
         (w1*m1) + (w2*m2) + (w3*m3) + (w4*m4) 
         _______________________________________
                   w1 + w2 + w3 + w4 
      
     
```
      Show your work in your report.
```
                                                         4-NN          
                                                         PREDICTION   

CCMIDSA	GENDER 	TOTVOL 	WEIGHT 		FIQ

6.48,	female,	1034,	62.143,		127		__________

6.59,	male,	1100,	88.452,		114		__________ 

                                                         4-NN        
                                                         ERROR      

root mean-square error (see p. 178)                      __________

mean absolute error (see p. 178)                         __________
```
  3. (5 points) Which of the methods above produced the best results? Explain.
(40 points) Clustering - Simple K-means
1. Assume that we want to cluster the instances in the small-twin-dataset above in 2 clusters. Use the Simple K-means algorithm with Euclidean distance without modifying the numeric attributes nor using attribute weights. Also, use the nominal attribute GENDER as given: two different values of this attribute (e.g., "female" and "male") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g., "female" and "female") will contribute 0 towards the Euclidean distance of two data instances.
  Assume that the 2 randomly selected initial centroids are:
```
CCMIDSA	GENDER 	TOTVOL 	WEIGHT 		FIQ

6.48,	female,	1034,	62.143,		127

6.59,	male,	1100,	88.452,		114
```
  Show the first 2 iterations of the Simple K-means clustering algorithm. That is:
  1. (5 points) Assign data instances to the clusters represented by the 2 centroids.
  2. (5 points) Compute new centroids for the 2 resulting clusters.
  3. (5 points) Assign data instances to the clusters represented by the 2 new centroids.
  4. (5 points) Compute new centroids for these 2 new resulting clusters.
  Show your work in your report.
2. Repeat the process above but now starting with a normalized dataset. That is, assume that we want to cluster the instances in the small-twin-dataset above in 2 clusters using the Simple K-means algorithm with Euclidean distance without attribute weights, but preprocessing each of the numeric attributes so that they range from 0 to 1. That is, for each numeric attribute, including FIQ, replace each value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively. Use the nominal attribute GENDER as given: two different values of this attribute (e.g., "female" and "male") will contribute 1 towards the Euclidean distance, and two "same values" of this attribute (e.g., "female" and "female") will contribute 0 towards the Euclidean distance of two data instances.
  Assume that the 2 randomly selected initial centroids are:
```
CCMIDSA	GENDER 	TOTVOL 	WEIGHT 		FIQ

6.48,	female,	1034,	62.143,		127

6.59,	male,	1100,	88.452,		114
```
  Show the first 2 iterations of the Simple K-means clustering algorithm. That is:
  1. (5 points) Assign data instances to the clusters represented by the 2 centroids.
  2. (5 points) Compute new centroids for the 2 resulting clusters.
  3. (5 points) Assign data instances to the clusters represented by the 2 new centroids.
  4. (5 points) Compute new centroids for these 2 new resulting clusters.
  Show your work in your report.
(20 points) Clustering - Hierarchical Clustering
Assume that we want to cluster a dataset using the COBWEB/CLASSIT clustering algorithm.
Assume that we have followed the COBWEB/CLASSIT algorithm and so far we have created a partial clustering containing just the first 4 instances. The following tree shows the current partial clustering, where the numbers in parenthesis (1), (2), (3), (4) represent the 1st, 2nd, 3rd, and 4th data instances respectively and "o" denotes just an internal node.
```
                             o
                          /     \ 
                         o      (2)
                       /   \
                      o    (3)
                    /   \
                   (1) (4)
```
Your job is to describe all the alternatives that are considered by the COBWEB/CLASSIT algorithm when adding instance (5) to this clustering tree.
1. (2 points) How many such alternatives are considered by the COBWEB/CLASSIT algorithm?
2. (15 points) List each one of those alternatives and show what the resulting clustering tree would be for each of those alternatives.
3. (3 points) Explain how the COBWEB/CLASSIT algorithm would choose which of those alternatives to select as the best one. Note: you don't need to compute the category utility value of each of those alternatives. You just need to explain how the selection of the best alternative is made by the algorithm.

PROJECT ASSIGNMENT
[800 points: 100 points per new data mining technique (instance based learning, clustering); 80 points per previously studied data mining technique; 40 points for meaningful interpretation of the resulting clusters for each of the clustering methods dataset. See below for the distribution of these points.]

Readings:

Textbook: Read in great detail the following Sections from your textbook:
- Instance Based Learning: Sections 4.7, 6.4.
- Clustering: Sections 6.6

Project Instructions: THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.
For this final project, no individual part is included. That is, all the work on this project (except for the individual homework above) is to be done with your group partner.
Datasets: In this project, you will use one dataset, which will be analyzed thoroughly using all the techniques covered in this course. This dataset is to be chosen by you and your partner depending on your and your group partner's own insterests. The better your choice of dataset, the more interesting your project. It should contain enough instances (at least 200 instances) and several attributes (at least 10). Ideally it should contain a good mix of numeric and nominal attributes. I include below some links to Data Repositories containing multiple datasets to choose from:
You can use other data repositories if you wish.
Data Mining Technique(s): The main purpose of this project is to thoroughly analyze your dataset using Instance Based Learning, Clustering, and all the other techniques studied in this course. Your objectives of the analysis, see Experiments' Guidelines item 3, should be the same for all methods, so choose those objectives wisely. Run experiments using the following techniques:
- Instance Based Learning: Follow the Experiments' Guidelines (items 1-8).
  - (100 points) IBk: k-nearest neighbors
  - (100 points) Locally Weighted Regression. This is implemented in the Weka system. Look for LWL (Locally Weighted Learning) under the "lazy" classifiers and use "Linear Regression" as the "classifier" parameter for this LWL.
- Clustering: Follow the Experiments' Guidelines (items 1-8).
  - (100 points) Simple k-means
  - (100 points) Hierarchical Clustering: COBWEB/CLASSIT
- Previously studied data mining methods: Follow the Experiments' Guidelines (items 4-8).
  - (80 points) Decision trees: ID3, J4.8
  - (80 points) Classification Rules: Prism
  - (80 points) Association Rules: Apriori
  - (80 points) Regression: Linear Regression, Model trees, Regression Trees.
Performance Metric(s):
- Instance Based Learning:
  Use classification accuracy, if the target of an experiment is nominal; or the numeric prediction evalution metrics, if the target is numeric.
- Clustering:
  (80 points: 40 per clustering method) An important part of this project is to find meaningful ways of evaluating and interpreting the resulting clusters. Devise a variety of approaches to do so, including but not limited to visualization of the resulting clusters, inspection of the clusters' members to find commonalities, etc. The more creative/ingenious your approaches, the better. You might want to extend the Weka code to provide the evaluation/interpretation functionality you need.
- Previously studied data mining methods:
  Use the appropriate performace metrics.
General Comments
- Focus on experimenting with different ways of preprocessing the data, varying the parameters of all algorithms (including instance based learning and clustering), and providing your own methods to evaluating and interpreting the results.
- Provide meaningful analysis of results and a thorough comparison of the quantitative and qualitative traits of the models constructed.

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2008 Homework and Project 4: Instance Based Learning and Clustering

PROF. CAROLINA RUIZ

HOMEWORK AND PROJECT OBJECTIVES

INDIVIDUAL HOMEWORK ASSIGNMENT

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2008
Homework and Project 4: Instance Based Learning and Clustering