This project consists of two parts:
Consider the following subset of the Mushroom dataset.
@relation sample-mushroom

@attribute cap-surface {fibrous,grooves,scaly,smooth}
@attribute bruises? {bruises,no}
@attribute gill-size {broad,narrow}
@attribute habitat {grasses,leaves,meadows,paths,urban,waste,woods}
@attribute poisonousness {edible,poisonous}

@data
scaly,bruises,broad,waste,edible
smooth,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
scaly,no,narrow,leaves,poisonous
scaly,bruises,broad,paths,edible
smooth,no,broad,leaves,edible
scaly,no,broad,woods,poisonous
scaly,no,narrow,woods,poisonous
smooth,no,broad,leaves,edible
fibrous,no,broad,paths,poisonous
fibrous,bruises,broad,woods,edible
smooth,bruises,narrow,grasses,poisonous
fibrous,no,broad,paths,poisonous
smooth,bruises,narrow,grasses,poisonous
scaly,no,narrow,leaves,poisonous
scaly,no,narrow,woods,poisonous
fibrous,no,broad,grasses,edible
scaly,bruises,broad,woods,edible
fibrous,no,broad,grasses,edible
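If you want to sanity-check your counts programmatically, the following is a minimal, hand-rolled Python parser for this nominal ARFF listing. It assumes the listing above is saved verbatim in a file; the filename "sample-mushroom.arff" is a hypothetical placeholder.

attributes, rows = [], []
with open("sample-mushroom.arff") as f:
    in_data = False
    for line in f:
        line = line.strip()
        if not line or line.startswith("%"):
            continue                                # skip blanks and ARFF comments
        if line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])      # keep the attribute name only
        elif line.lower().startswith("@data"):
            in_data = True                          # everything after @data is an instance
        elif in_data:
            rows.append(line.split(","))

print(attributes)   # ['cap-surface', 'bruises?', 'gill-size', 'habitat', 'poisonousness']
labels = [r[-1] for r in rows]
print(labels.count("edible"), labels.count("poisonous"))   # 10 10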
To learn more about the Mushroom Data Set, see Part II of this assignment.
Show all the steps of your calculations. For your convenience, the base-2 logarithms of selected values are provided below. For any log_2 values you need that are not listed here, make sure you compute them correctly, e.g. as log_2(x) = log_10(x) / log_10(2) = ln(x) / ln(2), since some calculators do not have a log_2 key.
x       | 1/2 | 1/3  | 1/4 | 3/4  | 1/5  | 2/5  | 3/5  | 1/6  | 5/6  | 1/7  | 2/7  | 3/7  | 4/7  | 1
log2(x) | -1  | -1.6 | -2  | -0.4 | -2.3 | -1.3 | -0.7 | -2.6 | -0.3 | -2.8 | -1.8 | -1.2 | -0.8 | 0
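To check your hand calculations, here is a minimal Python sketch of the entropy and information-gain arithmetic that ID3 performs when choosing the root attribute. This is illustrative code only, not Weka's ID3 implementation.

from math import log2
from collections import Counter

data = [
    ("scaly","bruises","broad","waste","edible"),
    ("smooth","no","narrow","woods","poisonous"),
    ("fibrous","no","broad","grasses","edible"),
    ("scaly","bruises","broad","woods","edible"),
    ("scaly","no","narrow","leaves","poisonous"),
    ("scaly","bruises","broad","paths","edible"),
    ("smooth","no","broad","leaves","edible"),
    ("scaly","no","broad","woods","poisonous"),
    ("scaly","no","narrow","woods","poisonous"),
    ("smooth","no","broad","leaves","edible"),
    ("fibrous","no","broad","paths","poisonous"),
    ("fibrous","bruises","broad","woods","edible"),
    ("smooth","bruises","narrow","grasses","poisonous"),
    ("fibrous","no","broad","paths","poisonous"),
    ("smooth","bruises","narrow","grasses","poisonous"),
    ("scaly","no","narrow","leaves","poisonous"),
    ("scaly","no","narrow","woods","poisonous"),
    ("fibrous","no","broad","grasses","edible"),
    ("scaly","bruises","broad","woods","edible"),
    ("fibrous","no","broad","grasses","edible"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    n = len(rows)
    # partition the class labels by the attribute's value
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attr], []).append(r[-1])
    # weighted entropy of the partitions ("remainder" in the ID3 formula)
    remainder = sum(len(ls) / n * entropy(ls) for ls in by_value.values())
    return entropy([r[-1] for r in rows]) - remainder

# the class is split 10 edible / 10 poisonous, so the initial entropy is exactly 1 bit
print(entropy([r[-1] for r in data]))

for i, name in enumerate(["cap-surface", "bruises?", "gill-size", "habitat"]):
    print(f"gain({name}) = {info_gain(data, i):.3f}")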
fibrous,no,broad,grasses,poisonous    YOUR DECISION TREE PREDICTS: __________
scaly,bruises,broad,grasses,edible    YOUR DECISION TREE PREDICTS: __________
scaly,no,broad,grasses,poisonous      YOUR DECISION TREE PREDICTS: __________
scaly,no,broad,paths,poisonous        YOUR DECISION TREE PREDICTS: __________
smooth,bruises,broad,grasses,edible   YOUR DECISION TREE PREDICTS: __________
smooth,bruises,broad,waste,edible     YOUR DECISION TREE PREDICTS: __________
smooth,no,broad,grasses,edible        YOUR DECISION TREE PREDICTS: __________
smooth,no,broad,leaves,edible         YOUR DECISION TREE PREDICTS: __________
smooth,no,narrow,leaves,poisonous     YOUR DECISION TREE PREDICTS: __________
smooth,no,narrow,paths,poisonous      YOUR DECISION TREE PREDICTS: __________

The accuracy of your decision tree on this test data is: ________________
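Once you have filled in the blanks, the accuracy is simply the fraction of the 10 test instances whose true label matches your tree's prediction. A minimal Python check follows; the predictions list is a placeholder for your own tree's outputs, not an answer key.

# true labels of the 10 test instances, in the order listed above
true_labels = ["poisonous","edible","poisonous","poisonous","edible",
               "edible","edible","edible","poisonous","poisonous"]
predictions = ["?"] * 10   # replace with the labels your decision tree predicts

correct = sum(t == p for t, p in zip(true_labels, predictions))
print(f"accuracy = {correct}/{len(true_labels)} = {correct / len(true_labels):.0%}")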
Each student in the class should complete the following steps on his/her own:
A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values (if any), discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data so as to obtain useful patterns, preprocess the data yourself, for instance by writing the necessary filters (you can incorporate them into Weka if you wish).
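As one illustration (not a prescribed recipe), the following Python/pandas sketch shows typical preprocessing steps performed outside Weka. The file and column names here ("mushroom.csv", "veil-type", "odor", "stalk-length") are hypothetical placeholders, not necessarily attributes of your dataset.

import pandas as pd

df = pd.read_csv("mushroom.csv")   # hypothetical input file

# 1. Remove an apparently irrelevant attribute (e.g., one with a single value).
df = df.drop(columns=["veil-type"])

# 2. Replace missing nominal values (encoded as "?") with the attribute's mode.
df = df.replace("?", pd.NA)
df["odor"] = df["odor"].fillna(df["odor"].mode()[0])

# 3. Discretize a numeric attribute into equal-width bins with nominal labels.
df["stalk-length"] = pd.cut(df["stalk-length"], bins=3,
                            labels=["short", "medium", "long"])

df.to_csv("mushroom-clean.csv", index=False)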
In particular, keep the following points in mind:
To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting decision trees are easy to read.
You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision trees, the better.
Experiment with Weka's J4.8 classifier to see how it performs pre- and/or post-pruning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree. (A simplified sketch of post-pruning follows this list.)
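For intuition about post-pruning, here is a simplified, self-contained Python sketch of reduced-error pruning. Note that this is NOT J4.8's actual procedure (J4.8 uses C4.5-style pessimistic error estimates computed from the training data); it only illustrates the general idea of replacing a subtree with a leaf when doing so does not hurt accuracy on a held-out pruning set.

from collections import Counter

class Node:
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                 # index of the attribute tested at this node
        self.children = children or {}   # attribute value -> subtree
        self.label = label               # class label if this node is a leaf

    @property
    def is_leaf(self):
        return self.label is not None

def classify(node, inst):
    while not node.is_leaf:
        node = node.children[inst[node.attr]]   # assumes all values were seen in training
    return node.label

def errors(node, insts):
    return sum(classify(node, i) != i[-1] for i in insts)

def prune(node, insts):
    if node.is_leaf or not insts:
        return node
    # prune bottom-up: simplify children before deciding about this node
    for val, child in list(node.children.items()):
        node.children[val] = prune(child, [i for i in insts if i[node.attr] == val])
    # candidate leaf predicts the majority class of the pruning subset
    majority = Counter(i[-1] for i in insts).most_common(1)[0][0]
    leaf = Node(label=majority)
    # on a tie, prefer the simpler tree and prune
    return leaf if errors(leaf, insts) <= errors(node, insts) else node

# toy usage: a root testing attribute 0, with two leaf children
root = Node(attr=0, children={"narrow": Node(label="poisonous"),
                              "broad": Node(label="edible")})
pruning_set = [("narrow", "poisonous"), ("broad", "edible"), ("broad", "poisonous")]
pruned = prune(root, pruning_set)
print(pruned.is_leaf, pruned.label)   # the tie in error counts collapses the tree

The bottom-up order matters: children must be simplified first, since pruning a child can change whether the parent is worth keeping as an internal node.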
Submit an individual report of the work you have done on your own as described above. See the details of the submission below. Your report should contain the following sections with the corresponding discussions:
Provide a detailed description of the preprocessing of your data. Justify the preprocessing you applied and explain why the resulting data is appropriate for mining decision trees.
Once you have completed Part II.1 of the project on your own, work with your project partner to analyze the experiments and results that each of you obtained. This joint analysis of the results should include:
Submit a joint report of the work you have done together as described above. Only one of you needs to submit the joint report. See the details of the submission below. Your joint report should contain the following sections with the corresponding discussions:
Given the short time allowed for presentations, you should use at most 4 to 6 slides. Describe your experiments and results using tables. For instance, you could use a table with the pre-processing variants of the dataset as rows and the mining technique and system parameters as columns, with the size and accuracy of the resulting tree in the cells. Any other good way of summarizing your results is fine as well. DURING YOUR PRESENTATION, TRY TO FOCUS ON THE MOST INTERESTING RESULTS YOU OBTAINED AND/OR THE MOST INTERESTING/UNUSUAL IDEAS THAT YOU TRIED.
Please submit the following files using the myWpi digital drop box:
If you are taking this course for graduate credit, state this fact at the beginning of your report. In that case, submit only an individual report and not a joint report, since you are working on the projects by yourself.
TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

ALGORITHMIC DESCRIPTION OF THE CODE (TOTAL: 20 points)
- (04 points) Description of the algorithm underlying the Weka filters used
- (02 points) Description of the algorithm underlying Weka's ZeroR code
- (04 points) Description of the algorithm underlying Weka's ID3 code
- (05 points) Description of the algorithm underlying Weka's J4.8 code

PRE-PROCESSING OF THE DATASET (TOTAL: 40 points: 20 on the individual part and 20 on the joint part)
- (05 points) Translating both input datasets into .arff
- (05 points) Discretizing attributes as needed
- (05 points) Dealing with missing values appropriately
- (05 points) Dealing with attributes appropriately (i.e., using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
- (up to 10 extra credit points) Trying "fancier" things with attributes (e.g., combining two highly correlated attributes into one, using background knowledge, etc.)

EXPERIMENTS (TOTAL: 120 points: 60 for the individual part and 60 for the joint part; 30 points per dataset)
FOR EACH DATASET:
- (02 points) Ran at least a reasonable number of experiments to get familiar with ZeroR
- (TOTAL: 26 points) For each required decision tree method, ID3 and J4.8 (13 points each):
  - (05 points) Ran at least a reasonable number of experiments to get familiar with the decision tree method and different evaluation methods (%split, cross-validation, ...)
  - (03 points) Good description of the motivation and purpose of each experiment, of the experimental setting, and of the results
  - (05 points) Good analysis of the results of the experiments
  - (up to 4 extra credit points) Excellent analysis of the results
- (02 points) Comparison of the results obtained with ZeroR, ID3, and J4.8, and summary of the project

SLIDES (TOTAL: 5 points) How well do they concisely summarize the results of the project?

CLASS PRESENTATION (TOTAL: 15 points) How well your oral presentation concisely summarized the results of the project, and how focused it was on the most creative/interesting/useful of your experiments and results. This grade is given individually to each team member.