CS 4445 Data Mining and Knowledge Discovery in Databases  A Term 2008
Homework and Project 3: Numeric Predictions
DUE DATES:

The individual homework assignment is due on Tuesday, Sept. 30 at 1:00 pm, and

The individual + group project is due on Friday, Oct. 3rd at 12 noon.
HOMEWORK AND PROJECT OBJECTIVES
The purpose of this project is to construct accurate
numeric prediction models for the two datasets under consideration
using the following techniques:
 Linear Regression
 Regression Trees
 Model Trees
Also, to gain a close understanding of how those methods work,
this project includes applying them by hand
on a toy dataset.
Readings:
 Textbook:
Read in great detail the following Sections from your textbook:
 Numeric Predictions: Sections 4.6, 6.5, 5.8.
INDIVIDUAL HOMEWORK ASSIGNMENT
See Solutions by Piotr Mardziel and Amro Khasawneh.
Consider the dataset below.
This dataset is an adaptation of the
IQ Brain Size Dataset.
@relation small_twin_dataset
@attribute CCMIDSA numeric % Corpus Callosum Surface Area (cm2)
@attribute GENDER {male,female}
@attribute TOTVOL numeric % Total Brain Volume (cm3)
@attribute WEIGHT numeric % Body Weight (kg)
@attribute FIQ numeric % Full-Scale IQ
@data
6.08, female, 1005, 57.607, 96
5.73, female, 963, 58.968, 89
7.99, female, 1281, 63.958, 101
8.42, female, 1272, 61.69, 103
6.84, female, 1079, 107.503, 96
6.43, female, 1070, 83.009, 126
7.6, male, 1347, 97.524, 94
6.03, male, 1029, 81.648, 97
7.52, male, 1204, 79.38, 113
7.67, male, 1160, 72.576, 124
For this homework, we want to predict the FIQ attribute (prediction target)
from the other predicting attributes
CCMIDSA,
GENDER,
TOTVOL, and
WEIGHT.
 (15 points) Linear Regression
 (5 points)
Write down the general form of the linear regression equation that would result from using linear
regression to solve this numeric prediction problem.
 (5 points)
Describe in detail the procedure that would be followed by linear regression to
find appropriate parameters for this equation.
 (5 points)
Run linear regression in the Weka system over this dataset and provide the precise
linear regression equation output by Weka (you'll need it for the testing part below).
Set the "attributeSelectionMethod" parameter to No Attribute Selection and the
"eliminateColinearAttributes" parameter of linear regression to False
so that your linear regression formula includes all the predicting attributes.
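As a sanity check on Weka's output, the least-squares fit that underlies linear regression can be reproduced directly. The sketch below is an illustration only, not Weka's exact procedure: it assumes an ad-hoc 0/1 coding of GENDER (female=0, male=1), whereas Weka chooses its own binary coding, so the coefficients will generally differ from Weka's.

```python
import numpy as np

# Toy dataset from the homework; GENDER coded as female=0, male=1
# (an arbitrary illustrative choice -- Weka makes its own binary coding).
X = np.array([
    [6.08, 0, 1005, 57.607],
    [5.73, 0,  963, 58.968],
    [7.99, 0, 1281, 63.958],
    [8.42, 0, 1272, 61.690],
    [6.84, 0, 1079, 107.503],
    [6.43, 0, 1070, 83.009],
    [7.60, 1, 1347, 97.524],
    [6.03, 1, 1029, 81.648],
    [7.52, 1, 1204, 79.380],
    [7.67, 1, 1160, 72.576],
])
y = np.array([96, 89, 101, 103, 96, 126, 94, 97, 113, 124])

# Append a column of ones for the intercept, then solve the least-squares
# problem min ||A b - y||^2 -- the criterion linear regression minimizes.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef)       # weights for CCMIDSA, GENDER, TOTVOL, WEIGHT, intercept
print(A @ coef)   # fitted FIQ values on the training data
```

By construction, the fitted values have a smaller sum of squared errors than any other linear formula over these attributes, including the constant predictor equal to the mean FIQ.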
 (45 points) Regression Trees and Model Trees
Follow the procedure described in the textbook to construct a model tree and a
regression tree to solve this numeric prediction problem.
Remember to:
 (5 points)
Start by translating the nominal attribute GENDER into
boolean/numeric attributes.
This is done by taking the average of the CLASS values associated
with each of the gender values Female, Male.
Sort them in decreasing order by average. Now, create new boolean
attributes, one for each possible split of these nominal values
in the order listed.
After this translation, all the predicting attributes are numeric.
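The averaging and ordering step described above can be sketched as follows. The code uses the (GENDER, FIQ) pairs from the toy dataset and is meant only to show the computation, not the full attribute translation:

```python
# (GENDER, FIQ) pairs taken from the toy dataset above.
data = [
    ("female", 96), ("female", 89), ("female", 101), ("female", 103),
    ("female", 96), ("female", 126),
    ("male", 94), ("male", 97), ("male", 113), ("male", 124),
]

# Average the CLASS (FIQ) values associated with each gender value.
groups = {}
for gender, fiq in data:
    groups.setdefault(gender, []).append(fiq)
averages = {g: sum(v) / len(v) for g, v in groups.items()}

# Sort the nominal values in decreasing order by average; with k ordered
# values there are k-1 binary split attributes (here k=2, so a single one).
order = sorted(averages, key=averages.get, reverse=True)
print(averages)   # female ~ 101.83, male = 107.0
print(order)      # ['male', 'female']
```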
 (5 points) Sort the values of each attribute in, say, increasing order.
Define a "split point" of an attribute as the midpoint between two
subsequent values of the attribute.
 (5 points) Consider the set of split points of all attributes.
Select as the condition for the root node on your tree,
the split point that maximizes the value of the following formula:
SDR = sd(CLASS over all instances)
      - [ (k1/n) * sd(CLASS of instances with attribute value below split point)
        + (k2/n) * sd(CLASS of instances with attribute value above split point) ]
where sd stands for standard deviation.
k1 is the number of instances with attribute value below split point.
k2 is the number of instances with attribute value above split point.
n is the number of instances.
To reduce the number of calculations that you need to perform,
your HW solutions can be limited to the following split points:
attribute           split points
binary attributes   0.5
CCMIDSA             (6.08+6.43)/2     (6.84+7.52)/2     (7.67+7.99)/2
TOTVOL              (963+1005)/2      (1079+1160)/2     (1160+1204)/2
WEIGHT              (58.968+61.69)/2  (72.576+79.38)/2  (83.009+97.524)/2
If during the construction of the tree you encounter an attribute such that
none of the split points listed above apply to the instances in the node, then use
instead all the attribute's split points that apply to that collection of instances.
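As an illustration, the SDR formula can be evaluated for one candidate split from the list above, TOTVOL < (963+1005)/2 = 984. The sketch below uses the population standard deviation; if your textbook edition uses the sample variant, the numbers will differ slightly:

```python
from statistics import pstdev  # population standard deviation

def sdr(below, above):
    """Standard-deviation reduction for one candidate split point,
    per the SDR formula above (below/above hold the CLASS values)."""
    all_values = below + above
    n = len(all_values)
    return (pstdev(all_values)
            - len(below) / n * pstdev(below)
            - len(above) / n * pstdev(above))

# FIQ values on either side of the split TOTVOL < (963 + 1005) / 2 = 984:
below = [89]                                    # the instance with TOTVOL 963
above = [96, 101, 103, 96, 126, 94, 97, 113, 124]
print(round(sdr(below, above), 3))              # roughly 1.63
```

Repeating this for every candidate split point and picking the maximum gives the condition for the node.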
 (20 points) Continue the construction of the tree following the same procedure
recursively. Remember that for each internal node, the procedure above is
applied only to the data instances that belong to that node.
You can stop splitting a node when the node contains fewer than 4 data instances
and/or when the standard deviation of the CLASS value of the node's instances
is less than 0.05*sda, where sda is the standard deviation of the CLASS attribute
over the entire input dataset. See Figure 6.15 on p. 248 of your textbook.
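The recursion with the two stopping rules can be sketched as below. For brevity this illustration uses a single numeric attribute (TOTVOL) and population standard deviations; it is a simplification for the toy dataset, not the full M5 procedure from the textbook:

```python
from statistics import mean, pstdev

def best_split(xs, ys):
    """Return the (split_point, sdr) pair maximizing SDR for one attribute."""
    pairs = sorted(zip(xs, ys))
    best_point, best_sdr = None, -1.0
    for i in range(1, len(pairs)):
        point = (pairs[i - 1][0] + pairs[i][0]) / 2
        below = [y for x, y in pairs if x < point]
        above = [y for x, y in pairs if x >= point]
        n = len(pairs)
        sdr = (pstdev(ys) - len(below) / n * pstdev(below)
                          - len(above) / n * pstdev(above))
        if sdr > best_sdr:
            best_point, best_sdr = point, sdr
    return best_point, best_sdr

def build(xs, ys, sda, min_instances=4):
    """Recursive tree construction with the two stopping rules above:
    fewer than min_instances instances, or sd(CLASS) < 0.05 * sda."""
    if len(ys) < min_instances or pstdev(ys) < 0.05 * sda:
        return {"leaf": mean(ys)}   # regression-tree leaf: mean CLASS value
    point, _ = best_split(xs, ys)
    below = [(x, y) for x, y in zip(xs, ys) if x < point]
    above = [(x, y) for x, y in zip(xs, ys) if x >= point]
    return {"split": point,
            "below": build(*zip(*below), sda, min_instances),
            "above": build(*zip(*above), sda, min_instances)}

# TOTVOL as the single predictor and FIQ as CLASS, from the toy dataset:
totvol = [1005, 963, 1281, 1272, 1079, 1070, 1347, 1029, 1204, 1160]
fiq = [96, 89, 101, 103, 96, 126, 94, 97, 113, 124]
tree = build(totvol, fiq, sda=pstdev(fiq))
print(tree)
```

In your hand-constructed tree the same recursion runs over all four predicting attributes at each node, restricted to the split points listed earlier.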
 (10 points) For each leaf node in the tree:
 Compute the value that would be predicted by that leaf in the case of
a Regression Tree.
 Compute the linear regression formula that would be used by that leaf
to predict the CLASS value in the case of a Model Tree.
In order to find the coefficients of the linear regression formula,
run the linear regression method implemented in the Weka system for
the appropriate data instances (those that belong to the leaf).
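The two kinds of leaf can be contrasted on a small hypothetical leaf; the three instances and the single attribute below are illustrative only, not an actual leaf of your tree:

```python
import numpy as np

# Hypothetical leaf holding three training instances (TOTVOL, FIQ pairs).
totvol = np.array([1005.0, 963.0, 1029.0])
fiq = np.array([96.0, 89.0, 97.0])

# Regression-tree leaf: predict the mean CLASS value of the leaf's instances.
regression_leaf = fiq.mean()

# Model-tree leaf: fit a linear formula FIQ = w*TOTVOL + b on the same
# instances (in the homework, Weka's linear regression plays this role).
A = np.column_stack([totvol, np.ones_like(totvol)])
w, b = np.linalg.lstsq(A, fiq, rcond=None)[0]

print(regression_leaf)   # 94.0
print(w * 1000 + b)      # model-tree prediction for TOTVOL = 1000
```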
 (15 points) Testing
Use each of the three numeric predicting models constructed
(linear regression equation, model tree and regression tree),
to predict the CLASS values (i.e., the FIQ)
for each of the test instances below.
That is, complete the following table:
CCMIDSA GENDER  TOTVOL  WEIGHT  FIQ     LINEAR       MODEL TREE   REGRESSION TREE
                                        REGRESSION   PREDICTION   PREDICTION
                                        PREDICTION
6.22,  female, 1035, 64.184,  87        __________   ___________  __________
6.48,  female, 1034, 62.143, 127        __________   ___________  __________
7.99,  male,   1173, 61.236, 101        __________   ___________  __________
6.59,  male,   1100, 88.452, 114        __________   ___________  __________
                                        LINEAR       MODEL TREE   REGRESSION TREE
                                        REGRESSION   ERROR        ERROR
                                        ERROR
root mean-squared error (see p. 178)    __________   ___________  __________
mean absolute error (see p. 178)        __________   ___________  __________
SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
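The two error measures in the table can be computed as follows. This is a minimal sketch; the `predicted` values are placeholders to be replaced by your models' outputs for the four test instances:

```python
from math import sqrt

def rms_error(actual, predicted):
    """Root mean-squared error (Table 5.8, p. 178 of the textbook)."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mean_absolute_error(actual, predicted):
    """Mean absolute error (Table 5.8, p. 178 of the textbook)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Actual FIQ of the four test instances; predictions are placeholders.
actual = [87, 127, 101, 114]
predicted = [90.0, 120.0, 100.0, 110.0]
print(rms_error(actual, predicted), mean_absolute_error(actual, predicted))
```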
INDIVIDUAL + GROUP PROJECT ASSIGNMENT
[600 points: 100 points per data mining technique, per dataset, for each of the individual and group parts.
See
Project Guidelines
for the detailed distribution of these points]
 Project Instructions:
THOROUGHLY READ AND FOLLOW THE
PROJECT GUIDELINES.
These guidelines contain detailed information about how to structure your
project, and how to prepare your written and oral reports.
 Individual part and group part of this project:
 Individual Part:
[100 points per technique per dataset]
Using the dataset for the individual part (see below),
follow the
Experiments' Guidelines described
in the
Project Guidelines and record your observations and results as described there
in the individual section of your written report.
 Group Part:
[100 points per technique per dataset]
Using the dataset for the group part (see below), and in collaboration with your group partner,
follow the
Experiments' Guidelines (items 3-8) described
in the
Project Guidelines and record your joint observations and results as described there
in the group section of your written report.
 Data Mining Technique(s):
We will run experiments using regression techniques.
You need to use:
 Linear Regression (under "functions" in Weka)
 Regression Trees: M5P (under "trees" in Weka) with BuildRegressionTree=True
 Model Trees: M5P (under "trees" in Weka) with BuildRegressionTree=False
on each dataset.
 Datasets:
In this project, we will use two datasets,
one for the individual part and one for the group part of the project:
 Dataset for the Individual Part of the Project:
The
Automobile Dataset
available from the
The University of California Irvine (UCI) Data Repository.
Use the attribute price as the prediction target.
After you run experiments predicting this attribute you may, if you wish,
run additional experiments using other numeric predicting targets of your choice.
 Dataset for the Group Part of the Project:
A dataset that you choose depending on your and your group partner's
own interests.
It should contain enough instances (at least 200 instances) and
several attributes (at least 10). Ideally it should contain a good mix of
numeric and nominal attributes.
I include below some links to Data Repositories containing
multiple datasets to choose from.
You can use other data repositories if you wish.
 Performance Metric(s):
Use the metrics listed in Table 5.8 (page 178) of the textbook
to measure the goodness of your models.
A major part of this project is to try to make sense of these
performance metrics and to become familiar with them.
When comparing the performance of different models, use tables
like Table 5.9 (page 179) of the textbook.
 Miscellaneous:
Remember to experiment with and compare the effect of the different parameters of
the techniques included in this project:
 Linear Regression:
Attribute selection method used; and whether elimination of colinear attributes is
used or not.
 M5P:
Model vs. Regression trees; pruning vs. non-pruning; smoothing vs. non-smoothing;
and variations in the value of minimum number of instances in a leaf.