CS 4445 Data Mining and Knowledge Discovery in Databases  A Term 2008
Homework and Project 3: Numeric Predictions
DUE DATES:

The individual homework assignment is due on Tuesday, Sept. 30 at 1:00 pm, and

The individual + group project is due on Friday, Oct. 3rd at 12 noon.
HOMEWORK AND PROJECT OBJECTIVES
The purpose of this project is to construct accurate
numeric prediction models for the two datasets under consideration
using the following techniques:
 Linear Regression
 Regression Trees
 Model Trees
Also, to gain a close understanding of how those methods work,
this project includes applying them by hand
on a toy dataset.
Readings:
 Textbook:
Read in great detail the following Sections from your textbook:
 Numeric Predictions: Sections 4.6, 6.5, 5.8.
INDIVIDUAL HOMEWORK ASSIGNMENT
See Solutions by Piotr Mardziel and Amro Khasawneh.
Consider the dataset below.
This dataset is an adaptation of the
IQ Brain Size Dataset.
@relation small_twin_dataset
@attribute CCMIDSA numeric % Corpus Callosum Surface Area (cm2)
@attribute GENDER {male,female}
@attribute TOTVOL numeric % Total Brain Volume (cm3)
@attribute WEIGHT numeric % Body Weight (kg)
@attribute FIQ numeric % Full-Scale IQ
@data
6.08, female, 1005, 57.607, 96
5.73, female, 963, 58.968, 89
7.99, female, 1281, 63.958, 101
8.42, female, 1272, 61.69, 103
6.84, female, 1079, 107.503, 96
6.43, female, 1070, 83.009, 126
7.6, male, 1347, 97.524, 94
6.03, male, 1029, 81.648, 97
7.52, male, 1204, 79.38, 113
7.67, male, 1160, 72.576, 124
For this homework, we want to predict the FIQ attribute (prediction target)
from the other predicting attributes
CCMIDSA,
GENDER,
TOTVOL, and
WEIGHT.
 (15 points) Linear Regression
 (5 points)
Write down the general form of the linear regression equation that would result from using linear
regression to solve this numeric prediction problem.
 (5 points)
Describe in detail the procedure that would be followed by linear regression to
find appropriate parameters for this equation.
 (5 points)
Run linear regression in the Weka system over this dataset and provide the precise
linear regression equation output by Weka (you'll need it for the testing part below).
Set the "attributeSelectionMethod" parameter to No Attribute Selection and the
"eliminateColinearAttributes" parameter of linear regression to False
so that your linear regression formula includes all the predicting attributes.
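As a sanity check on Weka's output, the least-squares fit that underlies linear regression can be reproduced directly. The sketch below is an illustration only, not Weka's exact procedure: it assumes an ad-hoc 0/1 coding of GENDER (female=0, male=1), whereas Weka chooses its own binary coding, so the coefficients will generally differ from Weka's.

```python
import numpy as np

# Toy dataset from the homework; GENDER coded as female=0, male=1
# (an arbitrary illustrative choice -- Weka makes its own binary coding).
X = np.array([
    [6.08, 0, 1005, 57.607],
    [5.73, 0,  963, 58.968],
    [7.99, 0, 1281, 63.958],
    [8.42, 0, 1272, 61.690],
    [6.84, 0, 1079, 107.503],
    [6.43, 0, 1070, 83.009],
    [7.60, 1, 1347, 97.524],
    [6.03, 1, 1029, 81.648],
    [7.52, 1, 1204, 79.380],
    [7.67, 1, 1160, 72.576],
])
y = np.array([96, 89, 101, 103, 96, 126, 94, 97, 113, 124])

# Append a column of ones for the intercept, then solve the least-squares
# problem min ||A b - y||^2 -- the criterion linear regression minimizes.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef)       # weights for CCMIDSA, GENDER, TOTVOL, WEIGHT, intercept
print(A @ coef)   # fitted FIQ values on the training data
```

By construction, the fitted values have a smaller sum of squared errors than any other linear formula over these attributes, including the constant predictor equal to the mean FIQ.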
 (45 points) Regression Trees and Model Trees
Follow the procedure described in the textbook to construct a model tree and a
regression tree to solve this numeric prediction problem.
Remember to:
 (5 points)
Start by translating the nominal attribute GENDER into
boolean/numeric attributes.
This is done by taking the average of the CLASS values associated
with each of the gender values Female, Male.
Sort them in decreasing order by average. Now, create new boolean
attributes, one for each possible split of these nominal values
in the order listed.
After this translation, all the predicting attributes are numeric.
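The averaging and ordering step described above can be sketched as follows. The code uses the (GENDER, FIQ) pairs from the toy dataset and is meant only to show the computation, not the full attribute translation:

```python
# (GENDER, FIQ) pairs taken from the toy dataset above.
data = [
    ("female", 96), ("female", 89), ("female", 101), ("female", 103),
    ("female", 96), ("female", 126),
    ("male", 94), ("male", 97), ("male", 113), ("male", 124),
]

# Average the CLASS (FIQ) values associated with each gender value.
groups = {}
for gender, fiq in data:
    groups.setdefault(gender, []).append(fiq)
averages = {g: sum(v) / len(v) for g, v in groups.items()}

# Sort the nominal values in decreasing order by average; with k ordered
# values there are k-1 binary split attributes (here k=2, so a single one).
order = sorted(averages, key=averages.get, reverse=True)
print(averages)   # female ~ 101.83, male = 107.0
print(order)      # ['male', 'female']
```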
 (5 points) Sort the values of each attribute in, say, increasing order.
Define a "split point" of an attribute as the midpoint between two
subsequent values of the attribute.
 (5 points) Consider the set of split points of all attributes.
Select as the condition for the root node on your tree,
the split point that maximizes the value of the following formula:
SDR = sd(CLASS over all instances)
      - [ (k1/n) * sd(CLASS of instances with attribute value below split point)
        + (k2/n) * sd(CLASS of instances with attribute value above split point) ]
where sd stands for standard deviation.
k1 is the number of instances with attribute value below split point.
k2 is the number of instances with attribute value above split point.
n is the number of instances.
To reduce the number of calculations that you need to perform,
your HW solutions can be limited to the following split points:
attribute           split points
binary attributes   0.5
CCMIDSA             (6.08+6.43)/2     (6.84+7.52)/2     (7.67+7.99)/2
TOTVOL              (963+1005)/2      (1079+1160)/2     (1160+1204)/2
WEIGHT              (58.968+61.69)/2  (72.576+79.38)/2  (83.009+97.524)/2
If during the construction of the tree you encounter an attribute such that
none of the split points listed above apply to the instances in the node, then use
instead all the attribute's split points that apply to that collection of instances.
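As an illustration, the SDR formula can be evaluated for one candidate split from the list above, TOTVOL < (963+1005)/2 = 984. The sketch below uses the population standard deviation; if your textbook edition uses the sample variant, the numbers will differ slightly:

```python
from statistics import pstdev  # population standard deviation

def sdr(below, above):
    """Standard-deviation reduction for one candidate split point,
    per the SDR formula above (below/above hold the CLASS values)."""
    all_values = below + above
    n = len(all_values)
    return (pstdev(all_values)
            - len(below) / n * pstdev(below)
            - len(above) / n * pstdev(above))

# FIQ values on either side of the split TOTVOL < (963 + 1005) / 2 = 984:
below = [89]                                    # the instance with TOTVOL 963
above = [96, 101, 103, 96, 126, 94, 97, 113, 124]
print(round(sdr(below, above), 3))              # roughly 1.63
```

Repeating this for every candidate split point and picking the maximum gives the condition for the node.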
 (20 points) Continue the construction of the tree following the same procedure
recursively. Remember that for each internal node, the procedure above is
applied only to the data instances that belong to that node.
You can stop splitting a node when the node contains fewer than 4 data instances
and/or when the standard deviation of the CLASS value of the node's instances
is less than 0.05*sda, where sda is the standard deviation of the CLASS attribute
over the entire input dataset. See Figure 6.15 on p. 248 of your textbook.
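The recursion with the two stopping rules can be sketched as below. For brevity this illustration uses a single numeric attribute (TOTVOL) and population standard deviations; it is a simplification for the toy dataset, not the full M5 procedure from the textbook:

```python
from statistics import mean, pstdev

def best_split(xs, ys):
    """Return the (split_point, sdr) pair maximizing SDR for one attribute."""
    pairs = sorted(zip(xs, ys))
    best_point, best_sdr = None, -1.0
    for i in range(1, len(pairs)):
        point = (pairs[i - 1][0] + pairs[i][0]) / 2
        below = [y for x, y in pairs if x < point]
        above = [y for x, y in pairs if x >= point]
        n = len(pairs)
        sdr = (pstdev(ys) - len(below) / n * pstdev(below)
                          - len(above) / n * pstdev(above))
        if sdr > best_sdr:
            best_point, best_sdr = point, sdr
    return best_point, best_sdr

def build(xs, ys, sda, min_instances=4):
    """Recursive tree construction with the two stopping rules above:
    fewer than min_instances instances, or sd(CLASS) < 0.05 * sda."""
    if len(ys) < min_instances or pstdev(ys) < 0.05 * sda:
        return {"leaf": mean(ys)}   # regression-tree leaf: mean CLASS value
    point, _ = best_split(xs, ys)
    below = [(x, y) for x, y in zip(xs, ys) if x < point]
    above = [(x, y) for x, y in zip(xs, ys) if x >= point]
    return {"split": point,
            "below": build(*zip(*below), sda, min_instances),
            "above": build(*zip(*above), sda, min_instances)}

# TOTVOL as the single predictor and FIQ as CLASS, from the toy dataset:
totvol = [1005, 963, 1281, 1272, 1079, 1070, 1347, 1029, 1204, 1160]
fiq = [96, 89, 101, 103, 96, 126, 94, 97, 113, 124]
tree = build(totvol, fiq, sda=pstdev(fiq))
print(tree)
```

In your hand-constructed tree the same recursion runs over all four predicting attributes at each node, restricted to the split points listed earlier.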
 (10 points) For each leaf node in the tree:
 Compute the value that would be predicted by that leaf in the case of
a Regression Tree.
 Compute the linear regression formula that would be used by that leaf
to predict the CLASS value in the case of a Model Tree.
In order to find the coefficients of the linear regression formula,
run the linear regression method implemented in the Weka system for
the appropriate data instances (those that belong to the leaf).
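The two kinds of leaf can be contrasted on a small hypothetical leaf; the three instances and the single attribute below are illustrative only, not an actual leaf of your tree:

```python
import numpy as np

# Hypothetical leaf holding three training instances (TOTVOL, FIQ pairs).
totvol = np.array([1005.0, 963.0, 1029.0])
fiq = np.array([96.0, 89.0, 97.0])

# Regression-tree leaf: predict the mean CLASS value of the leaf's instances.
regression_leaf = fiq.mean()

# Model-tree leaf: fit a linear formula FIQ = w*TOTVOL + b on the same
# instances (in the homework, Weka's linear regression plays this role).
A = np.column_stack([totvol, np.ones_like(totvol)])
w, b = np.linalg.lstsq(A, fiq, rcond=None)[0]

print(regression_leaf)   # 94.0
print(w * 1000 + b)      # model-tree prediction for TOTVOL = 1000
```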
 (15 points) Testing
Use each of the three numeric predicting models constructed
(linear regression equation, model tree and regression tree),
to predict the CLASS values (i.e., the FIQ)
for each of the test instances below.
That is, complete the following table:
CCMIDSA GENDER  TOTVOL  WEIGHT  FIQ     LINEAR       MODEL TREE   REGRESSION TREE
                                        REGRESSION   PREDICTION   PREDICTION
                                        PREDICTION
6.22,  female, 1035, 64.184,  87        __________   ___________  __________
6.48,  female, 1034, 62.143, 127        __________   ___________  __________
7.99,  male,   1173, 61.236, 101        __________   ___________  __________
6.59,  male,   1100, 88.452, 114        __________   ___________  __________
                                        LINEAR       MODEL TREE   REGRESSION TREE
                                        REGRESSION   ERROR        ERROR
                                        ERROR
root mean-squared error (see p. 178)    __________   ___________  __________
mean absolute error (see p. 178)        __________   ___________  __________
SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
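The two error measures in the table can be computed as follows. This is a minimal sketch; the `predicted` values are placeholders to be replaced by your models' outputs for the four test instances:

```python
from math import sqrt

def rms_error(actual, predicted):
    """Root mean-squared error (Table 5.8, p. 178 of the textbook)."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mean_absolute_error(actual, predicted):
    """Mean absolute error (Table 5.8, p. 178 of the textbook)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Actual FIQ of the four test instances; predictions are placeholders.
actual = [87, 127, 101, 114]
predicted = [90.0, 120.0, 100.0, 110.0]
print(rms_error(actual, predicted), mean_absolute_error(actual, predicted))
```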
INDIVIDUAL + GROUP PROJECT ASSIGNMENT
[600 points: 100 points per data mining technique, per dataset, for each of the individual and group parts.
See
Project Guidelines
for the detailed distribution of these points]
 Project Instructions:
THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
 Individual part and group part of this project:
 Individual Part:
[100 points per technique per dataset]
Using the dataset for the individual part (see below),
follow the
Experiments' Guidelines described
in the
Project Guidelines and record your observations and results as described there
in the individual section of your written report.
 Group Part:
[100 points per technique per dataset]
Using the dataset for the group part (see below), and in collaboration with your group partner,
follow the
Experiments' Guidelines (items 3-8) described
in the
Project Guidelines and record your joint observations and results as described there
in the group section of your written report.
 Data Mining Technique(s):
We will run experiments using regression techniques.
You need to use:
 Linear Regression (under "functions" in Weka)
 Regression Trees: M5P (under "trees" in Weka) with BuildRegressionTree=True
 Model Trees: M5P (under "trees" in Weka) with BuildRegressionTree=False
on each dataset.
 Datasets:
In this project, we will use two datasets,
one for the individual part and one for the group part of the project:
 Dataset for the Individual Part of the Project:
The
Automobile Dataset
available from the
The University of California Irvine (UCI) Data Repository.
Use the attribute price as the prediction target.
After you run experiments predicting this attribute you may, if you wish,
run additional experiments using other numeric predicting targets of your choice.
 Dataset for the Group Part of the Project:
A dataset that you choose depending on your and your group partner's
own interests.
It should contain enough instances (at least 200 instances) and
several attributes (at least 10). Ideally it should contain a good mix of
numeric and nominal attributes.
I include below some links to Data Repositories containing
multiple datasets to choose from.
You can use other data repositories if you wish.
 Performance Metric(s):
Use the metrics listed in Table 5.8 (page 178) of the textbook
to measure the goodness of your models.
A major part of this project is to try to make sense of these
performance metrics and to become familiar with them.
When comparing the performance of different models, use tables
like Table 5.9 (page 179) of the textbook.
 Miscellaneous:
Remember to experiment with and compare the effect of the different parameters of
the techniques included in this project:
 Linear Regression:
Attribute selection method used; and whether elimination of colinear attributes is
used or not.
 M5P:
Model vs. Regression trees; pruning vs. non-pruning; smoothing vs. non-smoothing;
and variations in the value of minimum number of instances in a leaf.