### CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2019 Project 2: Decision Trees, Linear Regression, Model Trees, Regression Trees

#### PROF. CAROLINA RUIZ

DUE DATE: Thursday October 10th, 2019 at 2:00 pm.

### Project Assignment:

1. Group Project: This is a group project. Students should work in groups of 2 students. Please do not split the project such that each student does only a portion of the work. Instead, each student is expected to work on the entire project individually and then meet with the group to clarify doubts, share findings, and combine the project solutions into one group report. Submit just one written report. Help or assistance from other groups, other people, or online resources is NOT allowed.

2. Study Chapter 3, Sections 10.1, 10.3 and Appendix D (online) of the textbook in great detail.

3. Study decision tree pruning using the Weka book: Section 6.1.

4. Study linear regression, model trees and regression trees using the Weka book: Sections 3.2, 3.3, 4.6 and
• If you have Witten, Frank, and Hall's 3rd edition of the book (available on reserve in the WPI Library under "CS548"): Section 6.6 (Numeric Predictions with Local Linear Models).
• If you have Witten, Frank, Hall, and Pal's 4th edition of the book: Section 7.3 (Numeric Predictions with Local Linear Models).

5. Study all the materials posted on the course Lecture Notes:
In particular, you should know the algorithms to construct decision trees, regression trees, and model trees very well, and be able to use these algorithms to construct trees from data by hand during the test. See examples provided in the Lecture Notes linked above. (Note: for model and regression trees, a software tool will be used to obtain the necessary linear regressions.)
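Since the test requires constructing decision trees by hand, it may help to see the core entropy / information-gain computation spelled out. The sketch below uses made-up class counts (not from the course datasets) purely for illustration.

```python
# Hedged sketch: the entropy / information-gain computation at the heart of
# decision-tree construction, on a toy candidate split. All counts below
# are illustrative, not taken from any course dataset.
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

parent = ["yes"] * 9 + ["no"] * 5    # 9 positive, 5 negative examples
left   = ["yes"] * 6 + ["no"] * 2    # one branch of a candidate split
right  = ["yes"] * 3 + ["no"] * 3    # the other branch

# Information gain = parent entropy minus the weighted child entropies
gain = entropy(parent) \
    - (len(left) / len(parent)) * entropy(left) \
    - (len(right) / len(parent)) * entropy(right)
print(f"information gain: {gain:.3f}")  # ~0.048 for these counts
```

The attribute whose split maximizes this gain (or a related criterion such as gain ratio or Gini) is chosen at each node; the same computation is applied recursively to each branch.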

6. THOROUGHLY READ AND FOLLOW THE PROJECT GUIDELINES. These guidelines contain detailed information about how to structure your project, how to prepare your written summary, and how to study for the test.

*** You must use the Project 2 Template provided here for your written report. Do NOT change the structure of the report, do NOT exceed the page limits stated in the template, and do NOT decrease the font size ***. (If you prefer not to use Word, you can copy and paste this format into a different editor as long as you respect the stated page structure and page limit.)

• Data Mining Technique(s): Run all project experiments in Python, using the following techniques:

• Pre-processing Techniques:
Consider the pre-processing techniques (feature selection, feature creation, dimensionality reduction, noise reduction, attribute transformations, ...) discussed in class, the textbook and used in project 1.
1. Determine which pre-processing techniques are necessary to pre-process the given dataset before you can mine predictive (either classification or regression) models from this data. The less pre-processing at first, the better. List the necessary pre-processing in your report.
2. Determine which pre-processing techniques would be useful (though not necessary) for this dataset in order to construct better prediction models. Do this by running experiments with and without applying these pre-processing techniques and comparing how they affect the performance and readability of the prediction models.
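One way to run such a with/without comparison is to evaluate the same model on the raw data and on the pre-processed data under identical cross-validation. The sketch below is a minimal example, assuming scikit-learn; the dataset, the pre-processing step (variance-based feature selection), and its threshold are illustrative stand-ins, not the ones prescribed for this project.

```python
# Hedged sketch: comparing a decision tree's cross-validated accuracy with
# and without a simple pre-processing step. The dataset and the variance
# threshold are illustrative assumptions, not project requirements.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Baseline: no pre-processing
base = cross_val_score(DecisionTreeClassifier(random_state=0),
                       X, y, cv=10).mean()

# With pre-processing: drop near-constant features first
X_sel = VarianceThreshold(threshold=0.01).fit_transform(X)
sel = cross_val_score(DecisionTreeClassifier(random_state=0),
                      X_sel, y, cv=10).mean()

print(f"baseline accuracy: {base:.3f}, after feature selection: {sel:.3f}")
```

The same pattern applies to any pre-processing step: hold the model and cross-validation setup fixed, vary only the pre-processing, and compare both the scores and the resulting models' readability.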

• Classification Techniques:
• "Zero-R" (majority class).
• "One-R".
• Decision trees.
• Random forests.
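In scikit-learn terms, the four techniques above could be set up roughly as follows. This is a hedged sketch: Zero-R maps to `DummyClassifier(strategy="most_frequent")`, and One-R is approximated here by a depth-1 decision stump (which, like One-R, bases its prediction on a single attribute, though its split criterion differs from Holte's original rule learner). The dataset is a placeholder, not the project dataset.

```python
# Hedged sketch: the four classification techniques via scikit-learn.
# The iris data is an illustrative placeholder; the One-R stand-in is an
# approximation (a one-level tree), not Holte's exact 1R algorithm.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Zero-R": DummyClassifier(strategy="most_frequent"),
    "One-R (stump)": DecisionTreeClassifier(max_depth=1, random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=10).mean()
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Note that Zero-R and One-R serve as the benchmarks required in the Performance Metrics section below, so running all four under the same cross-validation split keeps the comparison fair.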

• Regression Techniques:
• Linear Regression.
• Regression Trees.
• Model Trees (we will discuss in class how to implement this method in Python).
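For the first two regression techniques, a minimal scikit-learn sketch is shown below; the dataset and tree depth are illustrative assumptions. scikit-learn has no built-in M5-style model tree (linear models at the leaves), which is why its implementation will be discussed in class.

```python
# Hedged sketch: linear regression and a regression tree in scikit-learn.
# The diabetes data and max_depth=4 are illustrative choices only.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

lin_r2 = cross_val_score(LinearRegression(), X, y,
                         cv=10, scoring="r2").mean()
tree_r2 = cross_val_score(DecisionTreeRegressor(max_depth=4, random_state=0),
                          X, y, cv=10, scoring="r2").mean()
print(f"linear regression R^2: {lin_r2:.3f}, regression tree R^2: {tree_r2:.3f}")
```

A regression tree predicts the mean target value at each leaf; a model tree would instead fit a linear regression over the instances reaching each leaf, combining the two techniques above.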

• Dataset:

• Students taking CS548: Use the Census-Income (KDD) Data Set (use the census-income.data.gz data file with k-fold cross-validation, so no need to use the census-income.test.gz data file). This dataset is available at the UCI Machine Learning Repository.
• For classification tasks: Experiment with using income (<\$50K or >\$50K) OR sex as the target attribute. After you run initial experiments, we will decide in class which of these two attributes would be a better target and you will use it for all your remaining classification experiments.
• The majority of the class voted to use income as the target.
• For regression tasks: Experiment with using weeks worked in year OR age as the regression target. After you run initial experiments, we will decide in class which of these two attributes would be a better target and you will use it for all your remaining regression experiments.
• The majority of the class voted to use age as the target.

• Students taking BCB503/CS583: Use the Mice Protein Expression Data Set. This dataset is available at the UCI Machine Learning Repository.
• For classification tasks, use attribute Class (that is, attribute #82) as the target attribute. The large number of values of this attribute (8) may make the classification task hard. Run preliminary experiments to decide whether to use this attribute as the target or to convert it into a binary attribute (e.g., stimulated to learn vs. not stimulated to learn; or injected with saline vs. injected with memantine) or another reasonable option.
• For regression tasks: Pick a numeric/continuous attribute of your choice as the regression target.
• Note: If you prefer to pick another biological/biomedical dataset for this project, please discuss your proposed dataset with me.

• Performance Metric(s):
• Use the following metrics or evaluation methods:
1. For classification tasks: use classification accuracy, precision, recall, ROC Area, and confusion matrices.
For regression tasks: use correlation coefficient AND any subset of the following error metrics that you find appropriate: mean-squared error, root mean-squared error, mean absolute error, relative squared error, root relative squared error, and relative absolute error. An important part of the data mining evaluation in this project is to try to make sense of these performance metrics and to become familiar with them.
2. size of the tree,
3. readability of the tree, as a separate qualitative criterion to evaluate the "goodness" of your models, and
4. time it took to construct the tree.
• Compare each accuracy/error you obtained against those of benchmarking techniques such as ZeroR and OneR over the same (sub-)set of data instances you used in the corresponding experiment.
• Remember to experiment with pruning of your tree: Experiment with pre- and/or post-pruning of the tree in order to increase the classification accuracy, reduce the prediction error, and/or reduce the size of the tree.
If post-pruning is not available in Python, you are not required to experiment with post-pruning. If you choose to do so, you can implement tree post-pruning as part of your advanced topic.
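The metrics and pruning experiments above can be sketched together in scikit-learn. This is a hedged example, assuming a binary-classification placeholder dataset: the required metrics (accuracy, precision, recall, ROC area, confusion matrix) plus tree size and build time are computed for an unpruned tree, followed by a pre-pruned variant (`max_depth`, `min_samples_leaf`) and a post-pruned one via cost-complexity pruning (`ccp_alpha`, available in scikit-learn 0.22 and later; earlier versions support only pre-pruning, which is why implementing post-pruning yourself can count as an advanced topic).

```python
# Hedged sketch: required classification metrics, tree size, and build time,
# plus pre-pruning and cost-complexity post-pruning. The dataset, depth,
# leaf size, and ccp_alpha value are all illustrative assumptions.
import time

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

start = time.perf_counter()
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
build_time = time.perf_counter() - start          # metric 4: build time

pred = full.predict(X_te)
acc = accuracy_score(y_te, pred)
prec = precision_score(y_te, pred)
rec = recall_score(y_te, pred)
auc = roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])
cm = confusion_matrix(y_te, pred)
size = full.tree_.node_count                      # metric 2: tree size

print(f"acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} AUC={auc:.3f}")
print(f"nodes={size} build_time={build_time:.4f}s")
print(cm)

# Pruned variants: both shrink the tree, sometimes improving test accuracy.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             random_state=0).fit(X_tr, y_tr)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)
for name, m in [("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: {m.tree_.node_count} nodes, "
          f"test accuracy {m.score(X_te, y_te):.3f}")
```

Reporting the pruned and unpruned trees side by side makes it easy to discuss the trade-off between tree size/readability and predictive performance that the criteria above ask about.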

• Advanced Topic(s): Investigate in more depth (experimentally, theoretically, or both) a topic of your choice that is related to decision or model/regression trees and that was not covered already in this project, class lectures, or the textbook. This tree-related topic might be something that was described or mentioned briefly in the textbook or in class; comes from your own research; or is related to your interests. Just a few sample ideas are: The prune functions in Python; C4.5; C4.5 pruning methods (for trees or for rules); advanced topics in random forests; CART; visualization of decision trees and/or regression trees; other useful functionality in Python; an idea from a research paper that you find intriguing; or any other tree-related topic.
Remember that you need to investigate your advanced topic in depth, at a "graduate level".