### CS 444X Data Mining and Knowledge Discovery in Databases  D Term 2004 Project 4: Numeric Predictions, Instance Based Learning, and Clustering

#### PROF. CAROLINA RUIZ

DUE DATE: This project is due on Wednesday, April 28, 2004 at 12 noon

#### PROJECT DESCRIPTION

The purpose of this project is to construct the most accurate models of different aspects of the two datasets under consideration using the following data mining techniques:
• Numeric Predictions:
• Linear Regression
• Regression Trees
• Model Trees
• Instance Based Learning:
• IBk: k-nearest neighbors
• [Optional - Locally Weighted Regression. This is implemented in the Weka system. Look for LWL (Locally Weighted Learning) under the "lazy" classifiers and use "Linear Regression" as the "classifier" parameter for this LWL.]
• Clustering:
• Simple k-means
Also, to gain a close understanding of how these methods work, this project includes carrying out those methods by hand on a toy dataset.

#### PROJECT ASSIGNMENT

1. Part I.
Consider the dataset
training_cpu.arff. This dataset is a subset of the cpu dataset that is available with the Weka system. See page 15 of your textbook for a description of this dataset.
1. (40 points) Numeric Predictions
Follow the procedure described in the textbook to construct a model tree and a regression tree that use the predictive attributes vendor, MYCT, MMIN, MMAX, CACH, CHMIN, and CHMAX to predict the attribute CLASS in the training_cpu.arff dataset. Remember to:
1. (5 points) Start by translating the nominal attribute vendor into boolean/numeric attributes. This is done by taking the average of the CLASS values associated with each of the vendor values burroughs, dec, hp, honeywell, ibm. Sort the vendor values in decreasing order by these averages. Then create new boolean attributes, one for each possible split of these five nominal values in the sorted order. After this translation, all the predicting attributes are numeric.
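This nominal-to-boolean conversion can be sketched as follows (a minimal sketch; the (vendor, CLASS) pairs below are hypothetical toy data, not the actual training_cpu.arff contents):

```python
# Sketch: translate the nominal attribute "vendor" into boolean attributes,
# ordered by average CLASS value as described above.
from collections import defaultdict

# hypothetical training instances: (vendor, CLASS)
data = [("ibm", 220), ("ibm", 31), ("hp", 54), ("hp", 20),
        ("dec", 38), ("dec", 18), ("burroughs", 29), ("burroughs", 41),
        ("honeywell", 25), ("honeywell", 32)]

# 1. average CLASS value per vendor value
groups = defaultdict(list)
for vendor, cls in data:
    groups[vendor].append(cls)
averages = {v: sum(c) / len(c) for v, c in groups.items()}

# 2. sort vendor values in decreasing order of average CLASS
order = sorted(averages, key=averages.get, reverse=True)

# 3. one boolean attribute per possible split of the sorted list:
#    attribute i is 1 iff the instance's vendor is among the first i vendors
def booleanize(vendor):
    pos = order.index(vendor)
    return [1 if pos < i else 0 for i in range(1, len(order))]
```

With five vendor values this yields four boolean attributes, one per split point in the sorted order.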
2. (5 points) Sort the values of each attribute, say in increasing order. Define a "split point" of an attribute as the midpoint between two consecutive values of that attribute.
3. (5 points) Consider the set of split points of all attributes. Select as the condition for the root node of your tree the split point that maximizes the value of the following formula:
```
        SDR = sd(CLASS over all instances)
              - ( (k1/n) * sd(CLASS of instances with attribute value below the split point)
                + (k2/n) * sd(CLASS of instances with attribute value above the split point) )

        where sd stands for standard deviation,
              k1 is the number of instances with attribute value below the split point,
              k2 is the number of instances with attribute value above the split point,
              n  is the total number of instances.
```
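The split-point enumeration and the SDR formula above can be sketched as follows (a sketch assuming the population standard deviation and counting instances whose attribute value equals the split point as "above"; your textbook's variant may differ on these details):

```python
# Sketch: candidate split points and the SDR of one candidate split.
import statistics

def sd(values):
    # population standard deviation; 0 for empty or single-value lists
    return statistics.pstdev(values) if len(values) > 1 else 0.0

def split_points(values):
    # midpoints between consecutive distinct sorted attribute values
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def sdr(pairs, split):
    """pairs: list of (attribute value, CLASS value); split: candidate split point."""
    below = [c for a, c in pairs if a < split]
    above = [c for a, c in pairs if a >= split]
    n = len(pairs)
    return (sd([c for _, c in pairs])
            - (len(below) / n) * sd(below)
            - (len(above) / n) * sd(above))
```

The root condition is the split, over all attributes, with the largest SDR value.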
4. (10 points) Continue the construction of the tree by applying the same procedure recursively. Remember that for each internal node, the procedure above is applied only to the data instances that belong to that node. You can stop splitting a node when it contains fewer than 4 data instances and/or when the standard deviation of the CLASS values of the node's instances is less than 0.05*sda, where sda is the standard deviation of the CLASS attribute over the entire input dataset. See Figure 6.11 on p. 206 of your textbook.
5. (10 points) For each leaf node in the tree:
• Compute the value that would be predicted by that leaf in the case of a Regression Tree.
• Compute the linear regression formula that would be used by that leaf to predict the CLASS value in the case of a Model Tree. In order to find the coefficients of the linear regression formula, run the linear regression method implemented in the Weka system for the appropriate data instances (those that belong to the leaf).
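The two kinds of leaf predictions can be sketched as follows (the one-attribute least-squares fit is only for illustration; the assignment asks you to obtain the model-tree coefficients from Weka's linear regression over the leaf's instances):

```python
# Sketch: leaf predictions for the two kinds of trees.

def regression_leaf(classes):
    # a regression-tree leaf predicts the mean CLASS value of its instances
    return sum(classes) / len(classes)

def model_leaf(xs, ys):
    """A model-tree leaf predicts with a linear model; here a simple
    one-attribute least-squares fit y = w0 + w1*x (assumes xs vary)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return lambda x: w0 + w1 * x
```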
6. (5 extra points) For each of the trees constructed (model tree and regression tree), predict the CLASS values for each of the test instances in testing_cpu.arff. That is, complete the following table:
```%
% As used by Kilpatrick, D. & Cameron-Jones, M. (1998). Numeric prediction
% using instance-based learning with encoding length selection. In Progress
% in Connectionist-Based Information Systems. Singapore: Springer-Verlag.
%
@relation 'cpu'
@attribute vendor {burroughs, dec, hp, honeywell, ibm}
@attribute MYCT real
@attribute MMIN real
@attribute MMAX real
@attribute CACH real
@attribute CHMIN real
@attribute CHMAX real
@attribute class real
@data
INSTANCE (actual CLASS is the last value)     MODEL TREE    REGRESSION TREE
                                              PREDICTION    PREDICTION

burroughs, 143,  512, 5000,  0, 7, 32,  29    _________    ___________

burroughs, 110, 3100, 6200,  0, 6, 64,  45    _________    ___________

dec,       200,  512, 8000,  8, 1,  8,  38    _________    ___________

dec,       810,  512,  512,  8, 1,  1,  18    _________    ___________

hp,         75, 3000, 8000,  8, 3, 48,  54    _________    ___________

hp,        175,  256, 2000,  0, 3, 24,  20    _________    ___________

honeywell, 330, 1000, 4000,  0, 3,  6,  25    _________    ___________

honeywell, 140, 2000, 4000,  8, 1, 20,  32    _________    ___________

ibm,        26, 8000, 32000, 0, 8, 24, 220    _________    ___________

ibm,       225, 2000, 4000,  8, 3,  6,  31    _________    ___________

MODEL TREE    REGRESSION TREE
ERROR         ERROR

root mean-square error (see p. 148)           __________    ___________

mean absolute error (see p. 148)              __________    ___________

```
SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
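The two error measures requested in the table (root mean-square error and mean absolute error, p. 148) can be sketched as:

```python
# Sketch: error measures over the test instances.
import math

def rmse(actual, predicted):
    # root mean-square error
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    # mean absolute error
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```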

2. (20 points) Instance Based Learning
Assume now that we want to predict the attribute CLASS in the training_cpu.arff dataset, using the remaining attributes as predicting attributes.
1. (10 points) Find the 4 nearest neighbors of each of the test instances. Show your work in your report.
2. (5 points) Predict the CLASS value of each test instance using these 4 nearest neighbors, without any weighting.
3. (5 points) Predict the CLASS value of each test instance using these 4 nearest neighbors, weighted by the inverse of their distances.
```
@relation 'cpu'
@attribute vendor {burroughs, dec, hp, honeywell, ibm}
@attribute MYCT real
@attribute MMIN real
@attribute MMAX real
@attribute CACH real
@attribute CHMIN real
@attribute CHMAX real
@attribute class real
@data
INSTANCE (actual CLASS is the last value)     4-NN          4-NN (weighted)
                                              PREDICTION    PREDICTION

burroughs, 143,  512, 5000,  0, 7, 32,  29    _________    ___________

dec,       200,  512, 8000,  8, 1,  8,  38    _________    ___________

hp,         75, 3000, 8000,  8, 3, 48,  54    _________    ___________

honeywell, 330, 1000, 4000,  0, 3,  6,  25    _________    ___________

ibm,        26, 8000, 32000, 0, 8, 24, 220    _________    ___________

4-NN         4-NN with weight
ERROR         ERROR

root mean-square error (see p. 148)           __________    ___________

mean absolute error (see p. 148)              __________    ___________

```
SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
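The 4-NN predictions above, unweighted and inverse-distance weighted, can be sketched as follows (a sketch assuming Euclidean distance over already-numeric attributes; note that Weka's IBk normalizes numeric attributes before computing distances):

```python
# Sketch: k-nearest-neighbor numeric prediction.
import math

def knn_predict(train, query, k=4, weighted=False):
    """train: list of (attribute vector, CLASS value); query: attribute vector."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    if not weighted:
        # unweighted: predict the mean CLASS of the k nearest neighbors
        return sum(c for _, c in nearest) / k
    # weighted: each neighbor contributes with weight 1/distance
    # (a tiny floor avoids division by zero for exact matches)
    weights = [1.0 / max(dist(v, query), 1e-12) for v, _ in nearest]
    return sum(w * c for w, (_, c) in zip(weights, nearest)) / sum(weights)
```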

2. Part II.

1. Datasets: Consider the following sets of data:

1. The Titanic Dataset. Look at the dataset description and the data instances.

I suggest you use the following nominal values for the attributes, rather than 0's and 1's, to make the resulting models easier to read:

```Class (0 = crew, 1 = first, 2 = second, 3 = third)
Age   (1 = adult, 0 = child)
Sex   (1 = male, 0 = female)
Survived (1 = yes, 0 = no)
```
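Following the suggestion above, the Titanic data could be declared in ARFF with nominal values along these lines (a hypothetical sketch; the relation name and the two data rows are made up for illustration):

```
% hypothetical titanic.arff sketch with nominal values
@relation 'titanic'
@attribute Class {crew, first, second, third}
@attribute Age {adult, child}
@attribute Sex {male, female}
@attribute Survived {yes, no}
@data
first, adult, female, yes
crew,  adult, male,   no
```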

2. The entire cpu dataset that is available with the Weka system. See page 15 of your textbook for a description of this dataset.
For the classification methods (numeric prediction and instance-based learning), select appropriate (nominal or numeric) prediction targets (i.e., "class attributes") for your experiments.

• Textbook: Read in great detail the following Sections from your textbook:
• Numeric Predictions: Sections 4.6, 6.5, 5.8.
• Instance Based Learning: Sections 4.7, 6.4.
• Clustering: Section 6.6

• Weka Code: Read the code of the relevant techniques implemented in the Weka system. Some of those techniques are enumerated below:
• Numeric Predictions:
• M5PRIME: Linear Regression, Regression Trees, Model Trees
• Instance Based Learning:
• IBk: k-nearest neighbors
• (Optional - LWR: Locally Weighted Regression)
• Clustering:
• Simple k-means

3. Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

1. Load the data. Note that you need to translate the dataset into the arff format first.

2. Preprocessing of the Data:

A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining, and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data to obtain useful patterns, preprocess the data yourself, for instance by writing the necessary filters (you can incorporate them into Weka if you wish).

In particular,

• explore different ways of handling missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it.

To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting models are easy to read.

3. Mining of Models: The following are guidelines for the construction of your models:

• Code: Use Weka's algorithms listed above to generate models of each of the two datasets.

• Training and Testing Instances:

You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your models, the better.

4. Evaluation and Testing: Supply input data to Weka and use an appropriate split ratio (say 66% training vs. 34% testing) for training and testing data.

Analyze in detail the results obtained. For classification models, analyze the accuracy of the resulting models. For numeric predictions analyze the errors reported by Weka and explain their meaning.

#### REPORTS AND DUE DATE

```
---------------------------------------------------------------------------
(TOTAL: 60 points) FOR PART I OF THE PROJECT

(TOTAL: 70 points) FOR PART II OF THE PROJECT

(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(10 points) Description of the ALGORITHM underlying the data mining
methods used in this project.
(up to 10 extra credit points for an outstanding job)
(providing just a structural description of the code, i.e. a list of
classes and methods, will receive 0 points)

(TOTAL: 5 points) PRE-PROCESSING OF THE DATASET:
Discretizing attributes IF needed, and dealing with missing values appropriately
(up to 5 extra credit points)
Trying to do "fancier" things with attributes
(i.e. combining two attributes highly correlated
into one, using background knowledge, etc.)

(TOTAL: 46 points) EXPERIMENTS
(TOTAL: 23 points each dataset) FOR EACH DATASET:
(05 points) ran a good number of experiments to get familiar with the
data mining methods in this project
(05 points) good description of the experiment setting and the results
(08 points) good analysis of the results of the experiments
INCLUDING discussion of evaluations statistics returned by
the Weka systems (accuracy and/or errors) and discussion of
particularly interesting results
(05 points) comparison of the results with those obtained using other
methods in this and previous projects
Argumentation of weaknesses and/or strengths of each of the
methods on this dataset, and argumentation of which method
should be preferred for this dataset and why.
(up to 10 extra credit points) excellent analysis of the results and
comparisons
(up to 10 extra credit points) running additional interesting experiments

(TOTAL 4 points) SLIDES - how well do they summarize concisely
the results of the project? We suggest you summarize the
setting of your experiments and their results in a tabular manner.
(up to 6 extra credit points) for excellent summary and presentation of results
in the slides.

```