### CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2016 Project 2: Decision Trees, Linear Regression, Model Trees, Regression Trees

#### PROF. CAROLINA RUIZ

DUE DATE: Tuesday October 11th, 2016.
• Slides: Submit via myWPI by 2:00 pm.
• Written report: Hand in a hardcopy by the beginning of class (by 3:59 pm).

### Project Assignment:

1. Study Sections 4.1-4.5 and Appendix D of the textbook in great detail.

2. Study Witten, Frank, and Hall's textbook (available on reserve in the WPI Library), Sect. 3.3, 4.6 (linear regression), and 6.6.

3. Study all the materials posted on the course Lecture Notes:
In particular, you should know the algorithms to construct decision trees, regression trees, and model trees very well, and be able to use these algorithms to construct trees from data by hand during the test. See examples provided in the Lecture Notes linked above. (Note: for model and regression trees, a software tool will be used to obtain the necessary linear regressions.)

4. THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, how to prepare your written report, and how to study for the test.

*** You must use the Project 2 Template provided for your written report. Do not exceed the page limits stated in the template nor decrease the font size ***. (If you prefer not to use Word, you can copy and paste this format in a different editor as long as you respect the stated page structure and page limit.)

• Data Mining Technique(s): Run experiments in Weka AND in Python using the following techniques:

• Pre-processing Techniques: Feature selection, feature creation, dimensionality reduction, noise reduction, attribute discretization, ...

• Classification Techniques:
• Zero-R (majority class)
• One-R
• Decision trees: Using Weka (J4.8) and Python.
Since these decision tree implementations can handle numeric attributes and missing values directly, make sure to run some experiments with no pre-processing and some experiments with pre-processing (discretizing continuous attributes and replacing missing values beforehand), and compare the results.
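The two kinds of experiment above can be sketched as follows in Python. This is an illustration only, not the graded pipeline: it uses scikit-learn in place of Weka's J4.8, and synthetic data stands in for the credit-card dataset.

```python
# Sketch: a decision tree on raw numeric attributes versus one trained after
# discretizing continuous attributes and replacing missing values, as the
# assignment asks. Synthetic data; scikit-learn instead of J4.8.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))            # stand-in for continuous attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for default / no-default
X[rng.random(X.shape) < 0.05] = np.nan   # inject some missing values

# Experiment 1: minimal pre-processing (only imputation, which this
# implementation requires before fitting).
raw_tree = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(max_depth=5, random_state=0))

# Experiment 2: replace missing values AND discretize continuous attributes.
disc_tree = make_pipeline(
    SimpleImputer(strategy="median"),
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
    DecisionTreeClassifier(max_depth=5, random_state=0))

raw_acc = cross_val_score(raw_tree, X, y, cv=5).mean()
disc_acc = cross_val_score(disc_tree, X, y, cv=5).mean()
print(f"raw accuracy:         {raw_acc:.3f}")
print(f"discretized accuracy: {disc_acc:.3f}")
```

Comparing the two cross-validated accuracies is exactly the kind of side-by-side result the written report should discuss.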

• Regression Techniques:
• Linear Regression: Weka (under "functions") and Python.
• Regression Trees: Weka (M5P under "trees") and Python.
• Model Trees: Weka (M5P under "trees") and Python.
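For the Python side of the regression experiments, note that scikit-learn has no M5P implementation: its `DecisionTreeRegressor` predicts a constant per leaf (a regression tree), whereas Weka's M5P fits linear models in the leaves (a model tree). The sketch below, on made-up piecewise-linear data, contrasts a plain linear regression with a regression tree:

```python
# Sketch: linear regression vs. a CART-style regression tree on data whose
# true relationship is piecewise linear. Synthetic data for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 2))
# Target: slope on X[:,1] flips sign depending on X[:,0] -- a single global
# linear model cannot capture this, but a tree can split on X[:,0].
y = np.where(X[:, 0] > 0, 2.0 * X[:, 1], -1.0 * X[:, 1]) \
    + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lin = LinearRegression().fit(X_tr, y_tr)
tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_tr, y_tr)

print("linear MAE:", round(mean_absolute_error(y_te, lin.predict(X_te)), 3))
print("tree MAE:  ", round(mean_absolute_error(y_te, tree.predict(X_te)), 3))
```

A model tree such as M5P would do even better here: one split on the first attribute followed by a linear fit in each leaf recovers the data-generating process almost exactly.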

• Dataset: Use the Default of Credit Card Clients Data Set. This dataset is available at the UCI Machine Learning Repository.

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

• X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
• X2: Gender (1 = male; 2 = female).
• X3: Education (1 = graduate school; 2 = university; 3 = high school; 0, 4, 5, 6 = others).
• X4: Marital status (1 = married; 2 = single; 3 = divorced; 0 = others).
• X5: Age (year).
• X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -2 = no consumption; -1 = paid in full; 0 = use of revolving credit; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
• X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
• X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
• Y: client's default status: Y = 0 means no default; Y = 1 means default.

Use the following attributes as continuous:

`X1, X5, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23.`
Run some experiments using the following attributes as nominal and some other experiments using them as continuous, and compare the results:
`X2, X3, X4, X6, X7, X8, X9, X10, X11.`
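In Python, the nominal-versus-continuous choice comes down to how the attribute is encoded before training. A minimal sketch (column names follow the dataset description above; the values are made up):

```python
# Sketch: the same coded attributes treated as continuous (raw integer codes,
# which imply an ordering) versus nominal (one-hot encoded, no ordering).
import pandas as pd

df = pd.DataFrame({"X2": [1, 2, 2, 1],        # gender code
                   "X3": [1, 2, 3, 0],        # education code
                   "X5": [24, 37, 29, 51]})   # age, genuinely continuous

# Treated as continuous: a tree or regression sees the raw codes and may
# exploit a spurious ordering (e.g. a split like "education <= 1.5").
continuous_view = df.astype(float)

# Treated as nominal: one-hot encode so no ordering is implied.
nominal_view = pd.get_dummies(df, columns=["X2", "X3"])
print(list(nominal_view.columns))
```

Comparing models built from `continuous_view` against models built from `nominal_view` is one concrete way to run the requested experiments.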

• For classification tasks, use the default payment (Yes = 1, No = 0) attribute (that is, attribute "Y"). Use this attribute as a nominal attribute in all of your experiments.

• For regression tasks, use the PAY_AMT6 attribute (that is, attribute "X23").

Run experiments with and without discretizing the predicting attributes; with and without removing attributes that are too related to the target or that make the trees too long; and with any other pre-processing and post-processing that produces useful and meaningful models.
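One simple way to spot predicting attributes that are "too related" to the regression target is to inspect their correlation with it before training. A sketch on synthetic data (the 0.95 threshold is an arbitrary illustration, not a prescribed value):

```python
# Sketch: flagging and dropping attributes whose correlation with the target
# is suspiciously high. Synthetic stand-ins for the payment attributes; X23
# is the regression target per the assignment.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"X18": rng.normal(size=200),
                   "X22": rng.normal(size=200)})
# Make X22 a near-duplicate of the target to simulate a leaky attribute.
df["X23"] = df["X22"] + rng.normal(scale=0.01, size=200)

corr = df.corr()["X23"].drop("X23").abs()
too_related = corr[corr > 0.95].index.tolist()
print("dropping:", too_related)
reduced = df.drop(columns=too_related)
```

Attributes flagged this way tend to dominate the tree's top splits and hide more interesting structure, which is why the assignment asks you to compare runs with and without them.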

• Performance Metric(s):
• Use the following metrics or evaluation methods:
1. For classification tasks: use classification accuracy, precision, recall, ROC Area, and confusion matrices.
For regression tasks: use correlation coefficient AND any subset of the following error metrics that you find appropriate: mean-squared error, root mean-squared error, mean absolute error, relative squared error, root relative squared error, and relative absolute error. An important part of the data mining evaluation in this project is to try to make sense of these performance metrics and to become familiar with them.
2. size of the tree,
3. readability of the tree, and
4. time it took to construct the tree,
as separate measures to evaluate the "goodness" of your models.
• Compare each accuracy/error you obtained against those of benchmarking techniques such as ZeroR and OneR over the same (sub-)set of data instances you used in the corresponding experiment.
• Remember to experiment with pruning your tree: experiment with pre- and/or post-pruning of the tree in order to increase the classification accuracy, reduce the prediction error, and/or reduce the size of the tree.
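The classification metrics and the ZeroR baseline comparison above can all be computed in a few lines of Python. A sketch on synthetic data, using scikit-learn's `DummyClassifier` as the ZeroR stand-in:

```python
# Sketch: required classification metrics (accuracy, precision, recall,
# ROC Area, confusion matrix) plus a ZeroR majority-class baseline.
# Synthetic data stands in for the credit-card dataset.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, confusion_matrix)

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = (X[:, 0] - X[:, 2] > 0.2).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

zero_r = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)  # ZeroR
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

pred = tree.predict(X_te)
print("ZeroR accuracy:", accuracy_score(y_te, zero_r.predict(X_te)))
print("tree accuracy: ", accuracy_score(y_te, pred))
print("precision:", precision_score(y_te, pred))
print("recall:   ", recall_score(y_te, pred))
print("ROC AUC:  ", roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]))
print("confusion matrix:\n", confusion_matrix(y_te, pred))
```

A model that does not clearly beat the ZeroR accuracy on the same instances has learned little, which is the point of the benchmark comparison.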

• Advanced Topic(s): Investigate in more depth (experimentally, theoretically, or both) a topic of your choice that is related to decision or model/regression trees and that was not covered already in this project, class lectures, or the textbook. This tree-related topic might be something that was described or mentioned briefly in the textbook or in class; comes from your own research; or is related to your interests. Just a few sample ideas are: the prune functions in Python; C4.5; C4.5 pruning methods (for trees or for rules); any of the additional tree classifiers in Weka: DecisionStump, LMT, RandomForest, RandomTree, REPTree; meta-learning applied to decision trees (see Classifier -> Choose -> meta); other useful functionality in Python; an idea from a research paper that you find intriguing; or any other tree-related topic.