 
 

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2008 
Homework and Project 3: Numeric Predictions
 DUE DATES:
- The individual homework assignment is due on Tuesday, Sept. 30 at 1:00 pm, and
- The individual + group project is due on Friday, Oct. 3rd at 12 noon.

 HOMEWORK AND PROJECT OBJECTIVES 
The purpose of this project is to construct accurate
numeric prediction models for the two datasets under consideration 
using the following techniques:
     
-  Linear Regression
-  Regression Trees
-  Model Trees
     
Also, to gain a close understanding of how those methods work,
this project includes working through those methods by hand
on a toy dataset.
Readings:
- Textbook:
Read in great detail the following Sections from your textbook:
-  Numeric Predictions: Sections 4.6, 6.5, 5.8.
 
 INDIVIDUAL HOMEWORK ASSIGNMENT 
See Solutions by Piotr Mardziel and Amro Khasawneh.
Consider the dataset below. 
This dataset is an adaptation of the 
IQ Brain Size Dataset.
@relation small_twin_dataset
@attribute CCMIDSA numeric 		% Corpus Callosum Surface Area (cm2)
@attribute GENDER {male,female}
@attribute TOTVOL numeric 		% Total Brain Volume (cm3)
@attribute WEIGHT numeric 		% Body Weight (kg)
@attribute FIQ numeric 			% Full-Scale IQ
@data
6.08,	female,	1005,	57.607,		96
5.73,	female,	963,	58.968,		89
7.99,	female,	1281,	63.958,		101
8.42,	female,	1272,	61.69,		103
6.84,	female,	1079,	107.503,	96
6.43,	female,	1070,	83.009,		126
7.6,	male,	1347,	97.524,		94
6.03,	male,	1029,	81.648,		97
7.52,	male,	1204,	79.38,		113
7.67,	male,	1160,	72.576,		124
For this homework, we want to predict the FIQ attribute (prediction target)
from the other predicting attributes: CCMIDSA, GENDER, TOTVOL, and WEIGHT.
- (15 points) Linear Regression 
 
-  (5 points) Write down the general form of the linear regression equation that would
result from using linear regression to solve this numeric prediction problem.
-  (5 points) Describe in detail the procedure that would be followed by linear regression to
find appropriate parameters for this equation.
-  (5 points) Run linear regression in the Weka system over this dataset and provide the precise
linear regression equation output by Weka (you'll need it for the testing part below).
Set the "attributeSelectionMethod" parameter to No Attribute Selection and the
"eliminateColinearAttributes" parameter of linear regression to False
so that your linear regression formula includes all the predicting attributes.
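
As a rough illustration of the least-squares idea behind linear regression, the Python sketch below fits FIQ = w0 + w1*CCMIDSA + w2*GENDER + w3*TOTVOL + w4*WEIGHT by solving the normal equations. The 1/0 coding of GENDER and the helper names (solve, fit_linear, predict) are assumptions made for this sketch only. Weka may encode the nominal attribute differently, so the equation reported by Weka remains the one to use in the testing part.

```python
# Least-squares fit via the normal equations (X^T X) w = X^T y.
# Assumption: GENDER is coded 1 for female and 0 for male (Weka may differ).

# Rows are (CCMIDSA, GENDER, TOTVOL, WEIGHT, FIQ), copied from the dataset above.
ROWS = [
    (6.08, "female", 1005, 57.607, 96),
    (5.73, "female", 963, 58.968, 89),
    (7.99, "female", 1281, 63.958, 101),
    (8.42, "female", 1272, 61.69, 103),
    (6.84, "female", 1079, 107.503, 96),
    (6.43, "female", 1070, 83.009, 126),
    (7.6, "male", 1347, 97.524, 94),
    (6.03, "male", 1029, 81.648, 97),
    (7.52, "male", 1204, 79.38, 113),
    (7.67, "male", 1160, 72.576, 124),
]


def solve(a, b):
    """Solve the square system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(xs, ys):
    """Return [w0, w1, ..., wk] minimizing the sum of squared errors."""
    design = [[1.0] + list(x) for x in xs]  # leading 1 gives the intercept w0
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def predict(w, x):
    """Apply the fitted equation to one instance x (without the leading 1)."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


xs = [(cc, 1.0 if g == "female" else 0.0, tv, wt) for cc, g, tv, wt, _ in ROWS]
ys = [row[4] for row in ROWS]
weights = fit_linear(xs, ys)
print("w0, CCMIDSA, GENDER, TOTVOL, WEIGHT:", weights)
```

The sketch uses no attribute selection and no collinearity elimination, which corresponds in spirit to the Weka settings requested above, but small numeric differences from Weka's output are to be expected.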
 
- (45 points) Regression Trees and Model Trees 
 Follow the procedure described in the textbook to construct a model tree and a 
regression tree to solve this numeric prediction problem. 
Remember to:
 -  (5 points)
     Start by translating the nominal attribute GENDER into 
     boolean/numeric attributes.
     This is done by taking the average of the CLASS values associated
     with each of the gender values female and male.
     Sort them in decreasing order by average. Now, create new boolean
     attributes, one for each possible split of these nominal values
     in the order listed.
     After this translation, all the predicting attributes are numeric.
 
-  (5 points) Sort the values of each attribute in, say, increasing order.
     Define a "split point" of an attribute as the midpoint between two
     consecutive values of the attribute.
 
-  (5 points) Consider the set of split points of all attributes.
     Select as the condition for the root node on your tree,
     the split point that maximizes the value of the following formula:
     
   SDR = sd(CLASS over all instances)
         - ((k1/n)*sd(CLASS of instances with attribute value below split point)
            + (k2/n)*sd(CLASS of instances with attribute value above split point))
   where sd stands for standard deviation.
   k1 is the number of instances with attribute value below split point.
   k2 is the number of instances with attribute value above split point.
   n is the number of instances.
   To reduce the number of calculations that you need to perform,
   your HW solutions can be limited to the following split points:

   binary attributes   CCMIDSA          TOTVOL           WEIGHT
   0.5                 (6.08+6.43)/2    (963+1005)/2     (58.968+61.69)/2
                       (6.84+7.52)/2    (1079+1160)/2    (72.576+79.38)/2
                       (7.67+7.99)/2    (1160+1204)/2    (83.009+97.524)/2

   If during the construction of the tree you encounter an attribute such that 
  none of the split points listed above apply to the instances in the node, then use
  instead all the attribute's split points that apply to that collection of instances.
-  (20 points) Continue the construction of the tree following the same procedure
       recursively. Remember that for each internal node, the procedure above is
       applied only to the data instances that belong to that node.
       You can stop splitting a node when the node contains fewer than 4 data instances
       and/or when the standard deviation of the CLASS value of the node's instances
       is less than 0.05*sda, where sda is the standard deviation of the CLASS attribute
       over the entire input dataset. See Figure 6.15 on p. 248 of your textbook.
  
-  (10 points) For each leaf node in the tree:
       
       -  Compute the value that would be predicted by that leaf in the case of
            a Regression Tree.
       
-  Compute the linear regression formula that would be used by that leaf 
            to predict the CLASS value in the case of a Model Tree.
            In order to find the coefficients of the linear regression formula,
            run the linear regression method implemented in the Weka system for
            the appropriate data instances (those that belong to the leaf).
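
The steps in the list above (translating GENDER, enumerating split points, scoring them with SDR, and checking the stopping rule) can be sketched in Python as follows. This is only an illustration under stated assumptions: the population standard deviation (statistics.pstdev) is used for sd, the gender value with the highest average FIQ is coded 0 and the other 1, "and/or" in the stopping rule is read as "or", and all midpoints are considered rather than the reduced list allowed for the homework. The function names are hypothetical.

```python
from statistics import mean, pstdev

# Rows are (CCMIDSA, GENDER, TOTVOL, WEIGHT, FIQ), copied from the dataset above.
ROWS = [
    (6.08, "female", 1005, 57.607, 96),
    (5.73, "female", 963, 58.968, 89),
    (7.99, "female", 1281, 63.958, 101),
    (8.42, "female", 1272, 61.69, 103),
    (6.84, "female", 1079, 107.503, 96),
    (6.43, "female", 1070, 83.009, 126),
    (7.6, "male", 1347, 97.524, 94),
    (6.03, "male", 1029, 81.648, 97),
    (7.52, "male", 1204, 79.38, 113),
    (7.67, "male", 1160, 72.576, 124),
]
NAMES = ["CCMIDSA", "GENDER", "TOTVOL", "WEIGHT"]
CLASS = 4  # index of FIQ in each row


def gender_to_binary(rows):
    """Code GENDER as 0/1 following the order of decreasing average CLASS value."""
    genders = {r[1] for r in rows}
    avg = {g: mean([r[CLASS] for r in rows if r[1] == g]) for g in genders}
    order = sorted(avg, key=avg.get, reverse=True)
    # Two values give a single binary attribute: first in the order vs. the rest.
    return [(r[0], float(order.index(r[1])), r[2], r[3], r[4]) for r in rows]


def split_points(values):
    """Midpoints between consecutive distinct values, in increasing order."""
    vals = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]


def sdr(rows, attr, point):
    """Standard deviation reduction obtained by splitting rows on attr at point."""
    below = [r[CLASS] for r in rows if r[attr] < point]
    above = [r[CLASS] for r in rows if r[attr] >= point]
    n = len(rows)
    total = pstdev([r[CLASS] for r in rows])
    return total - (len(below) / n * pstdev(below) + len(above) / n * pstdev(above))


def best_split(rows):
    """Return (SDR, attribute name, split point) with the largest SDR."""
    candidates = [
        (sdr(rows, a, p), NAMES[a], p)
        for a in range(len(NAMES))
        for p in split_points([r[a] for r in rows])
    ]
    return max(candidates)


def should_stop(node_rows, all_rows):
    """Stop when the node has fewer than 4 instances or its sd is below 0.05*sda."""
    sda = pstdev([r[CLASS] for r in all_rows])
    return len(node_rows) < 4 or pstdev([r[CLASS] for r in node_rows]) < 0.05 * sda


coded = gender_to_binary(ROWS)
gain, name, point = best_split(coded)
print(f"root split: {name} at {point} (SDR = {gain:.3f})")
```

Applying best_split recursively to the two subsets produced by each split, until should_stop holds, mirrors the by-hand procedure. A regression-tree leaf would then predict the mean CLASS value of its instances.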
       
 
 
- (15 points) Testing 
 Use each of the three numeric predicting models constructed 
        (linear regression equation, model tree and regression tree),
        to predict the CLASS values (i.e., the FIQ) 
	for each of the test instances below.
        That is, complete the following table:
CCMIDSA  GENDER   TOTVOL  WEIGHT   FIQ    LINEAR        MODEL TREE    REGRESSION TREE
                                          REGRESSION    PREDICTION    PREDICTION
                                          PREDICTION
6.22,    female,  1035,   64.184,  87     __________    ___________   __________
6.48,    female,  1034,   62.143,  127    __________    ___________   __________
7.99,    male,    1173,   61.236,  101    __________    ___________   __________
6.59,    male,    1100,   88.452,  114    __________    ___________   __________

                                          LINEAR        MODEL TREE    REGRESSION TREE
                                          REGRESSION    ERROR         ERROR
                                          ERROR
root mean-square error (see p. 178)       __________    ___________   __________
mean absolute error (see p. 178)          __________    ___________   __________
SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
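
The two error measures requested in the table can be computed as in the Python sketch below. The function names are hypothetical, and the predictions used here come from a trivial baseline (always predicting the mean FIQ of the ten training instances), chosen only to exercise the formulas. They are not the predictions of any of the three models.

```python
from math import sqrt


def root_mean_squared_error(predicted, actual):
    """Square root of the average squared difference between predictions and targets."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# FIQ values of the training instances and of the four test instances above.
train_fiq = [96, 89, 101, 103, 96, 126, 94, 97, 113, 124]
test_fiq = [87, 127, 101, 114]

# Baseline used only for illustration: predict the training mean for every test instance.
baseline = sum(train_fiq) / len(train_fiq)
baseline_predictions = [baseline] * len(test_fiq)

print("baseline RMSE:", root_mean_squared_error(baseline_predictions, test_fiq))
print("baseline MAE: ", mean_absolute_error(baseline_predictions, test_fiq))
```

Replacing baseline_predictions with the four predictions of a given model yields that model's column in the error table.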
INDIVIDUAL + GROUP PROJECT ASSIGNMENT
[600 points: 100 points per data mining technique per dataset per individual/group parts.
See Project Guidelines for the detailed distribution of these points]
- Project Instructions:
THOROUGHLY READ AND FOLLOW THE 
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
-  Individual part and group part of this project:
- Individual Part: [100 points per technique per dataset]
   Using the dataset for the individual part (see below), follow the
   Experiments' Guidelines described in the Project Guidelines and record
   your observations and results as described there in the individual
   section of your written report.
- Group Part: [100 points per technique per dataset]
   Using the dataset for the group part (see below), and in collaboration with
   your group partner, follow the Experiments' Guidelines (items 3-8) described
   in the Project Guidelines and record your joint observations and results as
   described there in the group section of your written report.
 
- Data Mining Technique(s):
We will run experiments using regression techniques.
You need to use:
-  Linear Regression (under "functions" in Weka)
-  Regression Trees: M5P (under "trees" in Weka) with BuildRegressionTree=True
-  Model Trees: M5P (under "trees" in Weka) with BuildRegressionTree=False
 on each dataset.
- Datasets:
In this project, we will use two datasets, 
one for the individual part and one for the group part of the project:
 
-  Dataset for the Individual Part of the Project:
 The Automobile Dataset, available from the
University of California Irvine (UCI) Data Repository.
 Use the attribute price as the prediction target.
After you run experiments predicting this attribute you may, if you wish,
run additional experiments using other numeric prediction targets of your choice.
-  Dataset for the Group Part of the Project:
 A dataset that you choose depending on your and your group partner's
own interests.
It should contain enough instances (at least 200 instances) and
several attributes (at least 10). Ideally it should contain a good mix of
numeric and nominal attributes. 
I include below some links to Data Repositories containing 
multiple datasets to choose from:
 
You can use other data repositories if you wish.
 
- Performance Metric(s):
Use the metrics listed in Table 5.8 (page 178) of the textbook
to measure the goodness of your models.
 A major part of this project is to try to make sense of these
performance metrics and to become familiar with them. 
When comparing the performance of different models, use tables 
like Table 5.9 (page 179) of the textbook.
- Miscellaneous:
Remember to experiment with and compare the effect of the different parameters of 
the techniques included in this project:
-  Linear Regression: 
Attribute selection method used; and whether elimination of collinear attributes is
used or not.
-  M5P:  
Model vs. regression trees; pruning vs. no pruning; smoothing vs. no smoothing;
and variations in the value of the minimum number of instances in a leaf.