CS 525D Spring 2008

Computer Science Department

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING - Spring 2008
Project 3: Regression

PROF. CAROLINA RUIZ

DUE DATE: Thursday March 6, 2008.

Slides: Submit by email by 2:30 pm.
Written report: Hand in a hardcopy by 3:30 pm.
Oral Presentation: during class that day.

This assignment consists of two parts:

- A homework part in which you will focus on the construction and/or pruning of the models.
- A project part in which you will focus on the experimental evaluation and analysis of the models.

I. Homework Part

[100 points] In this part of the assignment, we will investigate the pruning techniques used for regression trees and for model trees.

Consider the dataset below. This dataset is an adaptation of the World Happiness Dataset.

@relation world_happiness

% - Life Expectancy from UN Human Development Report (2003)
% - GDP per capita from figure published by the CIA (2006), figure in US$.
% - Access to secondary education rating from UNESCO (2002)
% - SWL (satisfaction with life) index calculated from data published 
%   by New Economics Foundation (2006).

@attribute country string
@attribute continent {Americas,Africa,Asia,Europe}
@attribute life-expectancy numeric
@attribute GDP-per-capita numeric
@attribute access-to-education-score numeric
@attribute SWL-index numeric

@data
Switzerland,	Europe,		80.5,	32.3,	99.9,	273.33
Canada,		Americas,	80,	34,	102.6,	253.33
Usa,		Americas,	77.4,	41.8,	94.6,	246.67
Germany,	Europe,		78.7,	30.4,	99,	240
Mexico,		Americas,	75.1,	10,	73.4,	230
France,		Europe,		79.5,	29.9,	108.7,	220
Thailand,	Asia,		70,	8.3,	79,	216.67
Brazil,		Americas,	70.5,	8.4,	103.2,	210
Japan,		Asia,		82,	31.5,	102.1,	206.67
India,		Asia,		63.3,	3.3,	49.9,	180
Ethiopia,	Africa,		47.6,	0.9,	5.2,	156.67
Russia,		Asia,		65.3,	11.1,	81.9,	143.3

For this homework, we want to predict the SWL-index attribute (the prediction target) from the other attributes: continent, life-expectancy, GDP-per-capita, and access-to-education-score. Note that the attribute country uniquely identifies each data instance and so will be disregarded in our analysis; it is provided just for context.
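As a quick sanity check outside Weka, the data rows can be loaded and the country attribute dropped with a few lines of Python. This is only an illustrative sketch: the helper name `load_instances` and the dictionary layout are assumptions made here, not part of the assignment, and the rows are copied from the @data section with the tab padding removed.

```python
import csv
import io

# Attribute names in the order declared in the @relation header.
ATTRIBUTES = ["country", "continent", "life-expectancy",
              "GDP-per-capita", "access-to-education-score", "SWL-index"]

# Rows copied from the @data section (tab padding removed).
ROWS = """\
Switzerland,Europe,80.5,32.3,99.9,273.33
Canada,Americas,80,34,102.6,253.33
Usa,Americas,77.4,41.8,94.6,246.67
Germany,Europe,78.7,30.4,99,240
Mexico,Americas,75.1,10,73.4,230
France,Europe,79.5,29.9,108.7,220
Thailand,Asia,70,8.3,79,216.67
Brazil,Americas,70.5,8.4,103.2,210
Japan,Asia,82,31.5,102.1,206.67
India,Asia,63.3,3.3,49.9,180
Ethiopia,Africa,47.6,0.9,5.2,156.67
Russia,Asia,65.3,11.1,81.9,143.3
"""


def load_instances(text):
    """Parse the rows, drop 'country', and convert numeric attributes to float."""
    instances = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        row = dict(zip(ATTRIBUTES, (value.strip() for value in record)))
        row.pop("country")  # unique identifier, disregarded in the analysis
        for name in ATTRIBUTES[2:]:
            row[name] = float(row[name])
        instances.append(row)
    return instances


instances = load_instances(ROWS)
targets = [row["SWL-index"] for row in instances]  # prediction target
```

In Weka itself, the same effect is obtained by removing the country attribute in the Preprocess tab before running M5P.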

1. [5 points] Build a regression tree for this dataset in Weka using M5P with the following parameters:
   - build regression tree: True
   - unpruned: True
   - useUnsmoothed: True
   - default values for the remaining parameters
   Record the tree in your report.
2. [45 points] Build a regression tree for this dataset in Weka using M5P with the following parameters:
   - build regression tree: True
   - unpruned: False
   - useUnsmoothed: True
   - default values for the remaining parameters
   Record the tree in your report [5 points]. Then follow the regression tree pruning procedure by hand (read the corresponding Weka code in detail for this), writing down in your report each step the pruning procedure takes. Include all the formulas and calculations needed to prune the regression tree of Part 1 above into the pruned regression tree of this part [40 points].
3. [5 points] Build a model tree for this dataset in Weka using M5P with the following parameters:
   - build regression tree: False
   - unpruned: True
   - useUnsmoothed: True
   - default values for the remaining parameters
   Record the tree in your report.
4. [45 points] Build a model tree for this dataset in Weka using M5P with the following parameters:
   - build regression tree: False
   - unpruned: False
   - useUnsmoothed: True
   - default values for the remaining parameters
   Record the tree in your report [5 points]. Then follow the model tree pruning procedure by hand (read the corresponding Weka code in detail for this), writing down in your report each step the pruning procedure takes. Include all the formulas and calculations needed to prune the model tree of Part 3 above into the pruned model tree of this part [40 points].
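To help organize the hand calculations, the kind of comparison made at each internal node during pruning can be sketched in Python. This is a minimal sketch under stated assumptions, not Weka's implementation: it follows the textbook description of M5 pruning, in which the average absolute error at a node is inflated by the factor (n + v)/(n - v), where n is the number of training instances reaching the node and v is the number of parameters in the node's model. The function names are hypothetical, and Weka's M5P code may use a different adjustment factor, so verify every step against the source code as the assignment requires.

```python
def adjusted_error(actual, predicted, num_params):
    """Average absolute error inflated by (n + v) / (n - v).

    Assumption: this is the textbook compensation factor; Weka's code may differ.
    """
    n = len(actual)
    if n <= num_params:
        raise ValueError("need more instances than model parameters")
    mean_abs_error = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    return mean_abs_error * (n + num_params) / (n - num_params)


def subtree_error(left_error, left_n, right_error, right_n):
    """Combine the two branch errors, weighted by the instances in each branch."""
    return (left_n * left_error + right_n * right_error) / (left_n + right_n)


def should_prune(node_error, combined_subtree_error):
    """Replace the subtree by a leaf when the node's own model is no worse."""
    return node_error <= combined_subtree_error


# Hypothetical numbers, not taken from the homework dataset:
node_err = adjusted_error([1.0, 3.0, 5.0], [3.0, 3.0, 3.0], num_params=1)
tree_err = subtree_error(left_error=1.0, left_n=1, right_error=3.0, right_n=2)
print(node_err, tree_err, should_prune(node_err, tree_err))
```

For a regression tree, the model at a node is a constant (the mean of the target values reaching it), whereas for a model tree it is a linear model, so v differs between Parts 2 and 4.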

II. Project Part

[500 points: 50 points for linear regression, 100 points for regression trees, and 100 points for model trees, per dataset. See Project Guidelines for the detailed distribution of these points]

Project Instructions: Thoroughly read and follow the Project Guidelines. These guidelines contain detailed information about how to structure your project, and how to prepare your written and oral reports.
Data Mining Technique(s): We will run experiments using regression techniques. You need to use:
- Linear Regression (under "functions" in Weka)
- Regression Trees: M5P (under "trees" in Weka)
- Model Trees: M5P (under "trees" in Weka)
on each dataset.
Dataset(s): In this project, we will use two datasets:
- The World Happiness Dataset with Continents information added by Paul Sader.
  Use the attribute SWL-index as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different prediction target of your choice. Since SWL-ranking can be derived from SWL-index, remove SWL-ranking from consideration. Also remove the attribute country, as each of its values identifies an instance uniquely.
- A dataset that you choose depending on your own interests. It may be a dataset you are working with for your research or your job. It should contain enough instances (at least 200) and several attributes (at least 10). Ideally it should contain a good mix of numeric and nominal attributes.
  I include below some links to Data Repositories containing multiple datasets to choose from:
Performance Metric(s): Use the metrics listed in Table 5.8 (page 178) of the textbook to measure the goodness of your models.
A major part of this project is to try to make sense of these performance metrics and to become familiar with them. When comparing the performance of different models, use tables like Table 5.9 (page 179) of the textbook.
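To build intuition for these metrics, it can help to compute them by hand on a handful of predictions. The sketch below is an illustration only: it assumes the usual measures for numeric prediction (mean-squared error, root mean-squared error, mean absolute error, relative squared error, root relative squared error, relative absolute error, and correlation coefficient), and it takes the baseline for the relative measures to be the mean of the supplied actual values, which may differ from the baseline Weka reports against. The function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Common numeric-prediction error measures for paired actual/predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    squared = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    absolute = sum(abs(p - a) for a, p in zip(actual, predicted))

    # Baseline: always predicting the mean of the actual values (an assumption).
    squared_base = sum((a - mean_a) ** 2 for a in actual)
    absolute_base = sum(abs(a - mean_a) for a in actual)

    covariance = sum((p - mean_p) * (a - mean_a) for a, p in zip(actual, predicted))
    variance_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "MSE": squared / n,
        "RMSE": math.sqrt(squared / n),
        "MAE": absolute / n,
        "RSE": squared / squared_base,
        "RRSE": math.sqrt(squared / squared_base),
        "RAE": absolute / absolute_base,
        "correlation": covariance / math.sqrt(variance_p * squared_base),
    }


# Hypothetical values for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The relative measures divide by the error of the simplest possible predictor, so values close to 1 (or 100%) indicate a model that is barely better than always predicting the mean.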
