WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006 
Homework and Project 1: Data Pre-processing, Mining, and Evaluation of Decision Trees

PROF. CAROLINA RUIZ 

DUE DATES:
Part I (the individual homework assignment) is due on Friday, Nov. 3rd at 12:00 noon and
Parts II.1 and II.2 (the individual+group project) are due on Tuesday, Nov. 7th 2006 at 12 noon. 

------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is multi-fold: to gain experience constructing, pruning, and evaluating decision trees by hand, and to use the Weka system to pre-process data, mine decision trees, and evaluate the results.

PROJECT ASSIGNMENT

Readings: Read Sections 4.3 and 6.1 of your textbook in great detail.

This project consists of two parts:

  1. Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See Solutions to this HW1 by Piotr Mardziel (pdf).

    Consider the following dataset, adapted from the Car Evaluation Dataset available at the University of California Irvine (UCI) Machine Learning Data Repository. See Part II of this assignment to learn more about the Car Evaluation Dataset.

    ATTRIBUTES:      POSSIBLE VALUES:
    buying-price     {vhigh, high, med, low}
    maintenance      {vhigh, high, med, low}
    persons          {2, 4, more}    % assumed to be a nominal attribute
    safety           {low, med, high}
    recommendation   {unacc, acc, good}
    
    buying-price  maintenance  persons  safety  recommendation
    med           vhigh        more     low     unacc
    med           vhigh        2        med     unacc
    vhigh         vhigh        more     med     unacc
    med           high         4        low     unacc
    high          med          4        high    good
    low           med          2        med     unacc
    low           high         2        high    unacc
    low           vhigh        more     med     acc
    med           vhigh        4        med     acc
    med           vhigh        4        med     acc
    vhigh         vhigh        4        med     unacc
    med           med          more     med     acc
    med           vhigh        2        med     unacc
    med           med          4        low     unacc
    med           vhigh        more     low     unacc
    med           low          4        med     acc
    high          low          2        high    unacc
    high          med          4        low     unacc
    med           low          4        low     unacc
    high          high         4        low     unacc
    low           med          4        high    good
    low           low          2        high    unacc

    1. (50 points) Construct the full ID3 decision tree using entropy to rank the predictive attributes (buying-price, maintenance, persons, safety) with respect to the target/classification attribute (recommendation).

      Show all the steps of the calculations. Make sure you compute logs in base b (for the appropriate b) correctly, as some calculators don't have a log_b primitive for all values of b.
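
      As a sanity check for your hand calculations, the following minimal Java sketch (not part of the assignment) computes the entropy of a class distribution using the change-of-base identity log_b(x) = ln(x)/ln(b), which works on any calculator or programming language:

        public class Entropy {
            // Change of base: log_b(x) = ln(x) / ln(b).
            static double logBase(double x, double b) {
                return Math.log(x) / Math.log(b);
            }

            // Entropy of a class distribution given as counts, using logs in base b.
            static double entropy(double b, int... counts) {
                int total = 0;
                for (int c : counts) total += c;
                double h = 0.0;
                for (int c : counts) {
                    if (c == 0) continue;           // treat 0 * log(0) as 0
                    double p = (double) c / total;
                    h -= p * logBase(p, b);
                }
                return h;
            }

            public static void main(String[] args) {
                // Class distribution of the 22 training instances above:
                // 15 unacc, 5 acc, 2 good.
                System.out.println(entropy(2, 15, 5, 2));  // base 2 (bits)
                System.out.println(entropy(3, 15, 5, 2));  // base 3 (one per class value)
            }
        }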

    2. (10 points) Compute the accuracy of the decision tree you constructed on the following test data instances:

      buying-price  maintenance  persons  safety  recommendation  YOUR DECISION TREE PREDICTION
      med           low          4        med     acc             ________
      high          med          2        low     unacc           ________
      med           low          4        high    good            ________
      med           vhigh        more     med     acc             ________
      low           high         4        low     unacc           ________
      med           med          2        low     unacc           ________
      med           low          2        low     unacc           ________
      med           med          4        low     unacc           ________
      med           vhigh        2        med     unacc           ________
      low           high         2        low     unacc           ________
      vhigh         low          4        med     acc             ________

      The accuracy of your decision tree on this test data is: ________________
      
      The confusion matrix of your decision tree on this test data is: ....
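
      The following small Java sketch (with placeholder arrays, not the answer) shows how the accuracy and the confusion matrix are computed once you have filled in the prediction column: accuracy is the fraction of test instances classified correctly, and cell (i, j) of the confusion matrix counts instances of actual class i predicted as class j.

        import java.util.*;

        public class Evaluate {
            public static void main(String[] args) {
                String[] classes = {"unacc", "acc", "good"};
                // actual[i] comes from the test table above; predicted[i] from your tree.
                // These values are placeholders only.
                String[] actual    = {"acc", "unacc", "good"};
                String[] predicted = {"acc", "unacc", "unacc"};

                Map<String, Integer> idx = new HashMap<>();
                for (int i = 0; i < classes.length; i++) idx.put(classes[i], i);

                int correct = 0;
                int[][] confusion = new int[classes.length][classes.length]; // rows: actual, cols: predicted
                for (int i = 0; i < actual.length; i++) {
                    if (actual[i].equals(predicted[i])) correct++;
                    confusion[idx.get(actual[i])][idx.get(predicted[i])]++;
                }
                System.out.printf("accuracy = %d/%d = %.3f%n", correct, actual.length,
                                  (double) correct / actual.length);
                for (int[] row : confusion) System.out.println(Arrays.toString(row));
            }
        }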
      

    3. Consider the subtree replacement pruning technique to make your decision tree more accurate over this test data.
      • (2 points) What node in your decision tree above would be considered first for subtree replacement pruning? Why?

      • (8 points) Show in detail how this pruning technique works on the one node identified in your previous answer. Will the node be pruned or not by the technique? Explain your answer.

      • (30 points) Show in detail how this pruning technique works on the remaining nodes in the (transformed) decision tree. Explain your answer. (A generic code sketch of the procedure follows.)
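
      For reference, here is a generic sketch (a hypothetical Node structure, not Weka's implementation) of subtree replacement pruning driven by error on a pruning/test set: working bottom-up, a subtree is replaced by a leaf predicting its majority class whenever the leaf makes no more errors on the pruning data than the subtree does.

        import java.util.*;

        class Node {
            String splitAttribute;                         // null for a leaf
            Map<String, Node> children = new HashMap<>();  // attribute value -> child
            String majorityClass;                          // majority class of the training instances here

            boolean isLeaf() { return splitAttribute == null; }

            String classify(Map<String, String> inst) {
                if (isLeaf()) return majorityClass;
                Node child = children.get(inst.get(splitAttribute));
                return child == null ? majorityClass : child.classify(inst);
            }

            int errors(List<Map<String, String>> data, String classAttr) {
                int e = 0;
                for (Map<String, String> inst : data)
                    if (!classify(inst).equals(inst.get(classAttr))) e++;
                return e;
            }

            void pruneBySubtreeReplacement(List<Map<String, String>> pruneData, String classAttr) {
                if (isLeaf()) return;
                // Partition the pruning data down the branches and prune children first (bottom-up).
                for (Map.Entry<String, Node> e : children.entrySet()) {
                    List<Map<String, String>> subset = new ArrayList<>();
                    for (Map<String, String> inst : pruneData)
                        if (e.getKey().equals(inst.get(splitAttribute))) subset.add(inst);
                    e.getValue().pruneBySubtreeReplacement(subset, classAttr);
                }
                // Replace this subtree by a leaf if that does not increase the error count.
                int errorsAsSubtree = errors(pruneData, classAttr);
                int errorsAsLeaf = 0;
                for (Map<String, String> inst : pruneData)
                    if (!majorityClass.equals(inst.get(classAttr))) errorsAsLeaf++;
                if (errorsAsLeaf <= errorsAsSubtree) {
                    splitAttribute = null;   // this node becomes a leaf
                    children.clear();
                }
            }
        }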

  2. Part II. INDIVIDUAL + GROUP PROJECT ASSIGNMENT
    For this and other course projects, we will use the Weka system (http://www.cs.waikato.ac.nz/ml/weka/). Weka is an excellent machine-learning/data-mining environment. It provides a large collection of Java-based mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is open source software issued under the GNU General Public License. For more information on the Weka system, to download the system, and to get its documentation, see Weka's webpage (http://www.cs.waikato.ac.nz/ml/weka/).

    1. Part II.1 INDIVIDUAL PART OF THE PROJECT

      Each student in the class should complete the following steps on his/her own:

      1. Download and use the latest stable GUI version of the Weka system.

      2. Read Chapters 9, 10, and 12 of your textbook to learn more about the Weka system.

      3. Datasets: Consider the following sets of data:

        1. The Car Evaluation Dataset available at the University of California Irvine (UCI) Machine Learning Data Repository. The file car.names contains the description of the attributes; the target attribute is called "class" there. The file car.data contains the data instances.

        2. The census-income dataset from the US Census Bureau, which is also available at the University of California Irvine Repository.
          The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean target attribute, class, classifying the income of the person as belonging to one of two categories: >50K or <=50K.

      4. Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

        1. Load the data. Note that you need to translate the dataset into the ARFF format first; a sketch of what such a file looks like appears below.
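
          An ARFF file is a plain-text header listing the attributes, followed by the comma-separated instances. Using the (adapted) attribute list from Part I for brevity (the actual car.names file lists additional attributes), a car ARFF file might start like this:

            @relation car

            @attribute buying-price   {vhigh, high, med, low}
            @attribute maintenance    {vhigh, high, med, low}
            @attribute persons        {2, 4, more}
            @attribute safety         {low, med, high}
            @attribute recommendation {unacc, acc, good}

            @data
            med,vhigh,more,low,unacc
            med,vhigh,2,med,unacc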

        2. Preprocessing of the Data:

          A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining, and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, deal with a skewed target attribute if necessary, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data so as to obtain useful patterns, preprocess the data yourself, e.g., by writing the necessary filters (you can incorporate them into Weka if you wish).

          In particular,

          • explore different ways of discretizing continuous attributes, that is, converting numeric attributes into "nominal" ones by binning numeric values into intervals - see the weka.filter.DiscretizeFilter in Weka. Play with the filter and read the Java code implementing it.
          • explore different ways of replacing missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it. (A code sketch applying both filters appears below.)
          • devise ways of dealing with a skewed target attribute, if present.

          To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting decision trees are easy to read.
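
          As an illustration, the following sketch applies the two filters mentioned above programmatically. Class names follow the newer weka.filters.unsupervised.attribute package layout; your Weka version may use the older names cited in the bullets, and the file name is a placeholder:

            import java.io.BufferedReader;
            import java.io.FileReader;
            import weka.core.Instances;
            import weka.filters.Filter;
            import weka.filters.unsupervised.attribute.Discretize;
            import weka.filters.unsupervised.attribute.ReplaceMissingValues;

            public class PreprocessSketch {
                public static void main(String[] args) throws Exception {
                    Instances data = new Instances(
                        new BufferedReader(new FileReader("census-income.arff"))); // placeholder name
                    data.setClassIndex(data.numAttributes() - 1);

                    // Replace each "?" with the mean (numeric) or mode (nominal) of the attribute.
                    ReplaceMissingValues rmv = new ReplaceMissingValues();
                    rmv.setInputFormat(data);
                    data = Filter.useFilter(data, rmv);

                    // Bin numeric attributes into intervals (equal-width by default).
                    Discretize disc = new Discretize();
                    disc.setBins(10);               // experiment with different bin counts
                    disc.setInputFormat(data);
                    data = Filter.useFilter(data, disc);

                    System.out.println(data.toSummaryString());
                }
            }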

        3. Mining of Patterns:

          1. Use the "ZeroR" and "OneR" classifiers under the "Classify" tab. They will provide you with benchmark classification accuracies against which to evaluate your decision trees below.

          2. Decision Trees: The following are guidelines for the construction of your decision tree:

            • Code: Use the decision tree methods implemented in the Weka system: ID3 and J4.8. Read the Weka code implementing ID3 and J4.8 in great detail.

            • Training and Testing Instances:

              You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision trees, the better.

        4. Evaluation and Testing: Use different ways of testing your results for each of the mining techniques employed (i.e. ZeroR, OneR, ID3, J4.8); a code sketch of two of these evaluation modes appears after this list.

          1. Supply input data to Weka, then mine and evaluate your model over this same input data.

          2. Supply separate training and testing data to Weka.

          3. Supply input data to Weka and experiment with several split ratios for training and testing data.

          4. Supply input data to Weka and use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
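
          The sketch below (package names per recent Weka versions; car.arff is assumed to be the translated dataset) shows two of these evaluation modes programmatically, using ZeroR as a baseline and J4.8:

            import java.io.BufferedReader;
            import java.io.FileReader;
            import java.util.Random;
            import weka.classifiers.Evaluation;
            import weka.classifiers.rules.ZeroR;
            import weka.classifiers.trees.J48;
            import weka.core.Instances;

            public class EvalSketch {
                public static void main(String[] args) throws Exception {
                    Instances data = new Instances(new BufferedReader(new FileReader("car.arff")));
                    data.setClassIndex(data.numAttributes() - 1);

                    // Baseline: ZeroR predicts the majority class; 10-fold cross-validation.
                    Evaluation evalZr = new Evaluation(data);
                    evalZr.crossValidateModel(new ZeroR(), data, 10, new Random(1));
                    System.out.println("ZeroR:\n" + evalZr.toSummaryString());

                    // J4.8 evaluated on its own training data (an optimistic estimate).
                    J48 j48 = new J48();
                    j48.buildClassifier(data);
                    Evaluation evalTrain = new Evaluation(data);
                    evalTrain.evaluateModel(j48, data);
                    System.out.println("J4.8 on training data:\n" + evalTrain.toSummaryString());
                }
            }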

        5. Pruning of your decision tree:

          Experiment with Weka's J4.8 classifier to see how it performs pre- and/or post-pruning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree.
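
          For instance, J4.8's pruning behavior can be controlled through the same options that appear in the Explorer's object editor (a minimal sketch; the option values are just examples to vary):

            import weka.classifiers.trees.J48;
            import weka.core.Utils;

            public class PruningSketch {
                public static void main(String[] args) throws Exception {
                    J48 j48 = new J48();
                    // -C: pruning confidence (lower => more aggressive pruning; default 0.25)
                    // -M: minimum number of instances per leaf (default 2)
                    j48.setOptions(Utils.splitOptions("-C 0.1 -M 5"));
                    // Alternatives: -U builds an unpruned tree; -R uses reduced-error pruning.
                    System.out.println(java.util.Arrays.toString(j48.getOptions()));
                }
            }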

      5. Written Report of Part II.1

        Submit an individual report of the work you have done on your own as described above. See the details of the submission below. Your report should contain the following sections with the corresponding discussions:

        1. Code Description: Describe the Weka code of the classifiers and filters that you used in the project. More precisely, explain the algorithm underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output.

        2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, skewness of the target attribute, and other relevant characteristics.

          Provide a detailed description of the preprocessing of your data. Justify the preprocessing you applied and explain why the resulting data is appropriate for mining decision trees.

        3. Experiments: For each experiment you ran describe:
          • Motivation for and purpose of the experiment.
          • Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it?
          • Any pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values? Is the target attribute skewed? If so, what did you do to deal with it?
          • Your system parameters.
          • For the ZeroR, OneR, ID3, and J4.8 classifiers,
            • Results and detailed ANALYSIS of the results of the experiments you ran using different ways of testing the classifier (cross-validation, etc.).
            • Accuracy of the resulting models.
            • Comparison of the classification accuracies of the models obtained with the ZeroR, OneR, ID3, and J4.8 classifiers.

        4. Strengths and weaknesses of your individual project.

    2. Part II.2 GROUP PART OF THE PROJECT

      Once you have completed Part II.1 of the project on your own, work with your project partner to analyze the experiments and results that each of you obtained. This joint analysis of the results should include:

      1. Description of the similarities and differences of the experiments run by each of you, and hence of the results obtained. Explain in detail those similarities and differences that you observed.

      2. Based on the joint analysis and the experience you have gained, design new, interesting/useful experiments to run together with your partner. Run those experiments and analyze the results together. Follow the same guidelines for experimentation described above in Part II.1.

      3. Written Report of Part II.2

        Submit a joint report of the work you have done together described above. Only one of you needs to submit the joint report. See the details of the submission below. Your joint report should contain the following sections with the corresponding discussions:

        1. Group members: Name of the group members and a description of what precisely each group member contributed to the project.

        2. Joint analysis of the individual results as described above.

        3. Joint Experiments: For each joint experiment you ran describe:
          • Motivation for the experiment (based on the joint analysis of the individual results) and purpose of the experiment.
          • Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it?
          • Any pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values? Is the target attribute skewed? If so, what did you do to deal with it?
          • Your system parameters.
          • For the ZeroR, OneR, ID3, and J4.8 classifiers,
            • Results and detailed ANALYSIS of the results of the experiments you ran using different ways of testing the classifier (cross-validation, etc.).
            • Accuracy of the resulting models.
            • Comparison of the classification accuracies of the models obtained with the ZeroR, OneR, ID3, and J4.8 classifiers.

        4. Summary of Results: For each of the datasets analyzed,
          • What was the accuracy of the most accurate decision tree constructed in your project?
          • Include the most accurate tree obtained in your report. If your tree is longer than say 100 lines of text, include just the first 100 lines in your report to save space.
          • What was the most intuitive (easy to understand) decision tree constructed in your project?
          • Include the most intuitive tree obtained in your report. If your tree is longer than say 100 lines of text, include just the first 100 lines in your report to save space.
          • Strengths and weaknesses of your joint project.

REPORTS AND DUE DATE


GRADING CRITERIA

TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(TOTAL: 20 points) ALGORITHMIC DESCRIPTION OF THE CODE
(04 points) Description of the algorithm underlying the Weka filters used
(02 points) Description of the algorithm underlying Weka's ZeroR and OneR codes
(04 points) Description of the algorithm underlying Weka's ID3 code
(05 points) Description of the algorithm underlying Weka's J4.8 code

(TOTAL 40 points: 20 on the individual part and 20 on the joint part) 
PRE-PROCESSING OF THE DATASET:
(05 points) Translating both input datasets into .arff
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
           (i.e. using nominal values instead of numeric when appropriate,
            using as many of them as possible, handling a skewed target
            attribute, etc.)
(up to 10 extra credit points) 
           Trying to do "fancier" things with attributes
           (i.e. combining two attributes highly correlated
            into one, using background knowledge, etc.)
    
(TOTAL 120: 60 points for the individual part and 60 points for the joint part) 
EXPERIMENTS
(TOTAL: 30 points each dataset) FOR EACH DATASET:
   (02 points) ran at least a reasonable number of experiments
               to get familiar with ZeroR and OneR
   (TOTAL: 26 points) For each decision tree method required
       ID3 and J4.8 (13 points each):
       (05 points) ran at least a reasonable number of experiments
                   to get familiar with the decision tree method and
                   different evaluation methods (%split, cross-validation,...)
       (03 points) good description of the motivation and purpose of the 
                   experiment, of experiment setting and the results 
       (05 points) good analysis of the results of the experiments
       (up to 4 extra credit points)
                   excellent analysis of the results
   (02 points) comparison of the results obtained with ZeroR, OneR,
               ID3, and J4.8 and summary of the project

(TOTAL 5 points) SLIDES - how well do they concisely summarize 
        the results of the project?

(TOTAL 15 points) Class presentation - how well your oral presentation concisely
        summarized the results of the project and how focused your presentation was
        on the more creative/interesting/useful of your experiments and results.
        This grade is given individually to each team member.