WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006 
Homework and Project 1: Data Pre-processing, Mining, and Evaluation of Decision Trees

PROF. CAROLINA RUIZ 

DUE DATES:
Part I (the individual homework assignment) is due on Friday, Nov. 3rd at 12:00 noon and
Parts II.1 and II.2 (the individual+group project) are due on Tuesday, Nov. 7th 2006 at 12 noon. 

------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is multi-fold: to gain experience constructing, pruning, and evaluating decision trees by hand, and to use the Weka system to pre-process data, mine decision trees, and evaluate the results.

PROJECT ASSIGNMENT

Readings: Read Sections 4.3 and 6.1 of your textbook in great detail.

This project consists of two parts:

  1. Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See Solutions to this HW1 by Piotr Mardziel (pdf).

    Consider the following dataset, adapted from the Car Evaluation Dataset available at the University of California Irvine (UCI) Machine Learning Data Repository. See Part II of this assignment to learn more about the Car Evaluation Dataset.

    ATTRIBUTES:      POSSIBLE VALUES:
    buying-price     {vhigh, high, med, low}
    maintenance      {vhigh, high, med, low}
    persons          {2, 4, more}    % assumed to be a nominal attribute
    safety           {low, med, high}
    recommendation   {unacc, acc, good}
    
    buying-price  maintenance  persons  safety  recommendation
    med           vhigh        more     low     unacc
    med           vhigh        2        med     unacc
    vhigh         vhigh        more     med     unacc
    med           high         4        low     unacc
    high          med          4        high    good
    low           med          2        med     unacc
    low           high         2        high    unacc
    low           vhigh        more     med     acc
    med           vhigh        4        med     acc
    med           vhigh        4        med     acc
    vhigh         vhigh        4        med     unacc
    med           med          more     med     acc
    med           vhigh        2        med     unacc
    med           med          4        low     unacc
    med           vhigh        more     low     unacc
    med           low          4        med     acc
    high          low          2        high    unacc
    high          med          4        low     unacc
    med           low          4        low     unacc
    high          high         4        low     unacc
    low           med          4        high    good
    low           low          2        high    unacc

    1. (50 points) Construct the full ID3 decision tree using entropy to rank the predictive attributes (buying-price, maintenance, persons, safety) with respect to the target/classification attribute (recommendation).

      Show all the steps of the calculations. Make sure you compute logs in base b (for the appropriate b) correctly, as some calculators don't have a log_b primitive for all values of b.
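
      As a sanity check for your hand calculations, the following minimal Java sketch (not part of the assignment) computes the entropy of a class distribution using the change-of-base identity log_b(x) = ln(x)/ln(b), which works on any calculator or programming language:

        public class Entropy {
            // Change of base: log_b(x) = ln(x) / ln(b).
            static double logBase(double x, double b) {
                return Math.log(x) / Math.log(b);
            }

            // Entropy of a class distribution given as counts, using logs in base b.
            static double entropy(double b, int... counts) {
                int total = 0;
                for (int c : counts) total += c;
                double h = 0.0;
                for (int c : counts) {
                    if (c == 0) continue;           // treat 0 * log(0) as 0
                    double p = (double) c / total;
                    h -= p * logBase(p, b);
                }
                return h;
            }

            public static void main(String[] args) {
                // Class distribution of the 22 training instances above:
                // 15 unacc, 5 acc, 2 good.
                System.out.println(entropy(2, 15, 5, 2));  // base 2 (bits)
                System.out.println(entropy(3, 15, 5, 2));  // base 3 (one per class value)
            }
        }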

    2. (10 points) Compute the accuracy of the decision tree you constructed on the following test data instances:

      buying-price  maintenance  persons  safety  recommendation  YOUR DECISION TREE PREDICTION
      med           low          4        med     acc             ________
      high          med          2        low     unacc           ________
      med           low          4        high    good            ________
      med           vhigh        more     med     acc             ________
      low           high         4        low     unacc           ________
      med           med          2        low     unacc           ________
      med           low          2        low     unacc           ________
      med           med          4        low     unacc           ________
      med           vhigh        2        med     unacc           ________
      low           high         2        low     unacc           ________
      vhigh         low          4        med     acc             ________

      The accuracy of your decision tree on this test data is: ________________
      
      The confusion matrix of your decision tree on this test data is: ....
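
      The following small Java sketch (with placeholder arrays, not the answer) shows how the accuracy and the confusion matrix are computed once you have filled in the prediction column: accuracy is the fraction of test instances classified correctly, and cell (i, j) of the confusion matrix counts instances of actual class i predicted as class j.

        import java.util.*;

        public class Evaluate {
            public static void main(String[] args) {
                String[] classes = {"unacc", "acc", "good"};
                // actual[i] comes from the test table above; predicted[i] from your tree.
                // These values are placeholders only.
                String[] actual    = {"acc", "unacc", "good"};
                String[] predicted = {"acc", "unacc", "unacc"};

                Map<String, Integer> idx = new HashMap<>();
                for (int i = 0; i < classes.length; i++) idx.put(classes[i], i);

                int correct = 0;
                int[][] confusion = new int[classes.length][classes.length]; // rows: actual, cols: predicted
                for (int i = 0; i < actual.length; i++) {
                    if (actual[i].equals(predicted[i])) correct++;
                    confusion[idx.get(actual[i])][idx.get(predicted[i])]++;
                }
                System.out.printf("accuracy = %d/%d = %.3f%n", correct, actual.length,
                                  (double) correct / actual.length);
                for (int[] row : confusion) System.out.println(Arrays.toString(row));
            }
        }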
      

    3. Consider the subtree replacement pruning technique to make your decision tree more accurate over this test data.
      • (2 points) What node in your decision tree above would be considered first for subtree replacement pruning? Why?

      • (8 points) Show in detail how this pruning technique works on the one node identified in your previous answer. Will the node be pruned or not by the technique? Explain your answer.

      • (30 points) Show in detail how this pruning technique works on the remaining nodes in the (transformed) decision tree. Explain your answer. (A generic code sketch of the procedure follows.)
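
      For reference, here is a generic sketch (a hypothetical Node structure, not Weka's implementation) of subtree replacement pruning driven by error on a pruning/test set: working bottom-up, a subtree is replaced by a leaf predicting its majority class whenever the leaf makes no more errors on the pruning data than the subtree does.

        import java.util.*;

        class Node {
            String splitAttribute;                         // null for a leaf
            Map<String, Node> children = new HashMap<>();  // attribute value -> child
            String majorityClass;                          // majority class of the training instances here

            boolean isLeaf() { return splitAttribute == null; }

            String classify(Map<String, String> inst) {
                if (isLeaf()) return majorityClass;
                Node child = children.get(inst.get(splitAttribute));
                return child == null ? majorityClass : child.classify(inst);
            }

            int errors(List<Map<String, String>> data, String classAttr) {
                int e = 0;
                for (Map<String, String> inst : data)
                    if (!classify(inst).equals(inst.get(classAttr))) e++;
                return e;
            }

            void pruneBySubtreeReplacement(List<Map<String, String>> pruneData, String classAttr) {
                if (isLeaf()) return;
                // Partition the pruning data down the branches and prune children first (bottom-up).
                for (Map.Entry<String, Node> e : children.entrySet()) {
                    List<Map<String, String>> subset = new ArrayList<>();
                    for (Map<String, String> inst : pruneData)
                        if (e.getKey().equals(inst.get(splitAttribute))) subset.add(inst);
                    e.getValue().pruneBySubtreeReplacement(subset, classAttr);
                }
                // Replace this subtree by a leaf if that does not increase the error count.
                int errorsAsSubtree = errors(pruneData, classAttr);
                int errorsAsLeaf = 0;
                for (Map<String, String> inst : pruneData)
                    if (!majorityClass.equals(inst.get(classAttr))) errorsAsLeaf++;
                if (errorsAsLeaf <= errorsAsSubtree) {
                    splitAttribute = null;   // this node becomes a leaf
                    children.clear();
                }
            }
        }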

  2. Part II. INDIVIDUAL + GROUP PROJECT ASSIGNMENT
    For this and other course projects, we will use the Weka system (http://www.cs.waikato.ac.nz/ml/weka/). Weka is an excellent machine-learning/data-mining environment. It provides a large collection of Java-based mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is open source software issued under the GNU General Public License. For more information on the Weka system, to download the system, and to get its documentation, see Weka's webpage (http://www.cs.waikato.ac.nz/ml/weka/).

    1. Part II.1 INDIVIDUAL PART OF THE PROJECT

      Each student in the class should complete the following steps on his/her own:

      1. Download and use the latest stable GUI version of the Weka system.

      2. Read Chapters 9, 10, and 12 of your textbook to learn more about the Weka system.

      3. Datasets: Consider the following sets of data:

        1. The Car Evaluation Dataset available at the University of California Irvine (UCI) Machine Learning Data Repository. The file car.names contains the description of the attributes; the target attribute is called "class" there. The file car.data contains the data instances.

        2. The census-income dataset from the US Census Bureau, which is also available at the University of California Irvine Repository.
          The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean target attribute, class, classifying the income of the person as belonging to one of two categories: >50K or <=50K.

      4. Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

        1. Load the data. Note that you need to translate the dataset into the ARFF format first; a sketch of what such a file looks like appears below.
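
          An ARFF file is a plain-text header listing the attributes, followed by the comma-separated instances. Using the (adapted) attribute list from Part I for brevity (the actual car.names file lists additional attributes), a car ARFF file might start like this:

            @relation car

            @attribute buying-price   {vhigh, high, med, low}
            @attribute maintenance    {vhigh, high, med, low}
            @attribute persons        {2, 4, more}
            @attribute safety         {low, med, high}
            @attribute recommendation {unacc, acc, good}

            @data
            med,vhigh,more,low,unacc
            med,vhigh,2,med,unacc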

        2. Preprocessing of the Data:

          A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining, and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, deal with a skewed target attribute if necessary, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data so as to obtain useful patterns, preprocess the data yourself, e.g., by writing the necessary filters (you can incorporate them into Weka if you wish).

          In particular,

          • explore different ways of discretizing continuous attributes, that is, converting numeric attributes into "nominal" ones by binning numeric values into intervals - see the weka.filter.DiscretizeFilter in Weka. Play with the filter and read the Java code implementing it.
          • explore different ways of replacing missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it. (A code sketch applying both filters appears below.)
          • devise ways of dealing with a skewed target attribute, if present.

          To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting decision trees are easy to read.
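
          As an illustration, the following sketch applies the two filters mentioned above programmatically. Class names follow the newer weka.filters.unsupervised.attribute package layout; your Weka version may use the older names cited in the bullets, and the file name is a placeholder:

            import java.io.BufferedReader;
            import java.io.FileReader;
            import weka.core.Instances;
            import weka.filters.Filter;
            import weka.filters.unsupervised.attribute.Discretize;
            import weka.filters.unsupervised.attribute.ReplaceMissingValues;

            public class PreprocessSketch {
                public static void main(String[] args) throws Exception {
                    Instances data = new Instances(
                        new BufferedReader(new FileReader("census-income.arff"))); // placeholder name
                    data.setClassIndex(data.numAttributes() - 1);

                    // Replace each "?" with the mean (numeric) or mode (nominal) of the attribute.
                    ReplaceMissingValues rmv = new ReplaceMissingValues();
                    rmv.setInputFormat(data);
                    data = Filter.useFilter(data, rmv);

                    // Bin numeric attributes into intervals (equal-width by default).
                    Discretize disc = new Discretize();
                    disc.setBins(10);               // experiment with different bin counts
                    disc.setInputFormat(data);
                    data = Filter.useFilter(data, disc);

                    System.out.println(data.toSummaryString());
                }
            }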

        3. Mining of Patterns:

          1. Use the "ZeroR" and "OneR" classifiers under the "Classify" tab. They will provide you with benchmark classification accuracies against which to evaluate your decision trees below.

          2. Decision Trees: The following are guidelines for the construction of your decision tree:

            • Code: Use the decision tree methods implemented in the Weka system: ID3 and J4.8. Read the Weka code implementing ID3 and J4.8 in great detail.

            • Training and Testing Instances:

              You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision trees, the better.

        4. Evaluation and Testing: Use different ways of testing your results for each of the mining techniques employed (i.e. ZeroR, OneR, ID3, J4.8); a code sketch of two of these evaluation modes appears after this list.

          1. Supply input data to Weka, then mine and evaluate your model over this same input data.

          2. Supply separate training and testing data to Weka.

          3. Supply input data to Weka and experiment with several split ratios for training and testing data.

          4. Supply input data to Weka and use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
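
          The sketch below (package names per recent Weka versions; car.arff is assumed to be the translated dataset) shows two of these evaluation modes programmatically, using ZeroR as a baseline and J4.8:

            import java.io.BufferedReader;
            import java.io.FileReader;
            import java.util.Random;
            import weka.classifiers.Evaluation;
            import weka.classifiers.rules.ZeroR;
            import weka.classifiers.trees.J48;
            import weka.core.Instances;

            public class EvalSketch {
                public static void main(String[] args) throws Exception {
                    Instances data = new Instances(new BufferedReader(new FileReader("car.arff")));
                    data.setClassIndex(data.numAttributes() - 1);

                    // Baseline: ZeroR predicts the majority class; 10-fold cross-validation.
                    Evaluation evalZr = new Evaluation(data);
                    evalZr.crossValidateModel(new ZeroR(), data, 10, new Random(1));
                    System.out.println("ZeroR:\n" + evalZr.toSummaryString());

                    // J4.8 evaluated on its own training data (an optimistic estimate).
                    J48 j48 = new J48();
                    j48.buildClassifier(data);
                    Evaluation evalTrain = new Evaluation(data);
                    evalTrain.evaluateModel(j48, data);
                    System.out.println("J4.8 on training data:\n" + evalTrain.toSummaryString());
                }
            }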

        5. Pruning of your decision tree:

          Experiment with Weka's J4.8 classifier to see how it performs pre- and/or post-pruning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree.
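
          For instance, J4.8's pruning behavior can be controlled through the same options that appear in the Explorer's object editor (a minimal sketch; the option values are just examples to vary):

            import weka.classifiers.trees.J48;
            import weka.core.Utils;

            public class PruningSketch {
                public static void main(String[] args) throws Exception {
                    J48 j48 = new J48();
                    // -C: pruning confidence (lower => more aggressive pruning; default 0.25)
                    // -M: minimum number of instances per leaf (default 2)
                    j48.setOptions(Utils.splitOptions("-C 0.1 -M 5"));
                    // Alternatives: -U builds an unpruned tree; -R uses reduced-error pruning.
                    System.out.println(java.util.Arrays.toString(j48.getOptions()));
                }
            }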

      5. Written Report of Part II.1

        Submit an individual report of the work you have done on your own as described above. See the details of the submission below. Your report should contain the following sections with the corresponding discussions:

        1. Code Description: Describe the Weka code of the classifiers and filters that you used in the project. More precisely, explain the algorithm underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output.

        2. Data: Describe the dataset that you selected in terms of the attributes present in the data, the number of instances, missing values, skewness of the target attribute, and other relevant characteristics.

          Provide a detailed description of the preprocessing of your data. Justify the preprocessing you applied and explain why the resulting data is appropriate for mining decision trees.

        3. Experiments: For each experiment you ran describe:
          • Motivation for and purpose of the experiment.
          • Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it?
          • Any pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values? Is the target attribute skewed? If so, what did you do to deal with it?
          • Your system parameters.
          • For the ZeroR, OneR, ID3, and J4.8 classifiers,
            • Results and detailed ANALYSIS of the results of the experiments you ran using different ways of testing the classifier (cross-validation, etc.).
            • Accuracy of the resulting models.
            • Comparison of the classification accuracies of the models obtained with the ZeroR, OneR, ID3, and J4.8 classifiers.

        4. Strengths and weaknesses of your individual project.

    2. Part II.2 GROUP PART OF THE PROJECT

      Once you have completed Part II.1 of the project on your own, work with your project partner to analyze the experiments and results that each of you obtained. This joint analysis of the results should include:

      1. Description of the similarities and differences of the experiments run by each of you, and hence of the results obtained. Explain in detail those similarities and differences that you observed.

      2. Based on the joint analysis and the experience you have gained, design new, interesting/useful experiments to run together with your partner. Run those experiments and analyze the results together. Follow the same guidelines for experimentation described above in Part II.1.

      3. Written Report of Part II.2

        Submit a joint report of the work you have done together described above. Only one of you needs to submit the joint report. See the details of the submission below. Your joint report should contain the following sections with the corresponding discussions:

        1. Group members: Name of the group members and a description of what precisely each group member contributed to the project.

        2. Joint analysis of the individual results as described above.

        3. Joint Experiments: For each joint experiment you ran describe:
          • Motivation for the experiment (based on the joint analysis of the individual results) and purpose of the experiment.
          • Instances: What data did you use for the experiment? That is, did you use the entire dataset or just a subset of it?
          • Any pre-processing done to the data. That is, did you remove any attributes? Did you discretize any continuous attribute? If so, what strategy did you use to bin the values? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values? Is the target attribute skewed? If so, what did you do to deal with it?
          • Your system parameters.
          • For the ZeroR, OneR, ID3, and J4.8 classifiers,
            • Results and detailed ANALYSIS of the results of the experiments you ran using different ways of testing the classifier (cross-validation, etc.).
            • Accuracy of the resulting models.
            • Comparison of the classification accuracies of the models obtained with the ZeroR, OneR, ID3, and J4.8 classifiers.

        4. Summary of Results: For each of the datasets analyzed,
          • What was the accuracy of the most accurate decision tree constructed in your project?
          • Include the most accurate tree obtained in your report. If your tree is longer than say 100 lines of text, include just the first 100 lines in your report to save space.
          • What was the most intuitive (easy to understand) decision tree constructed in your project?
          • Include the most intuitive tree obtained in your report. If your tree is longer than say 100 lines of text, include just the first 100 lines in your report to save space.
          • Strengths and weaknesses of your joint project.

REPORTS AND DUE DATE


GRADING CRITERIA

TOTAL: 200 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(TOTAL: 20 points) ALGORITHMIC DESCRIPTION OF THE CODE
(04 points) Description of the algorithm underlying the Weka filters used
(02 points) Description of the algorithm underlying Weka's ZeroR and OneR codes
(04 points) Description of the algorithm underlying Weka's ID3 code
(05 points) Description of the algorithm underlying Weka's J4.8 code

(TOTAL 40 points: 20 on the individual part and 20 on the joint part) 
PRE-PROCESSING OF THE DATASET:
(05 points) Translating both input datasets into .arff
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
           (i.e. using nominal values instead of numeric when appropriate,
            using as many of them as possible, handling a skewed target
            attribute, etc.)
(up to 10 extra credit points) 
           Trying to do "fancier" things with attributes
           (i.e. combining two attributes highly correlated
            into one, using background knowledge, etc.)
    
(TOTAL 120: 60 points for the individual part and 60 points for the joint part) 
EXPERIMENTS
(TOTAL: 30 points each dataset) FOR EACH DATASET:
   (02 points) ran at least a reasonable number of experiments
               to get familiar with ZeroR and OneR
   (TOTAL: 26 points) For each decision tree method required
       ID3 and J4.8 (13 points each):
       (05 points) ran at least a reasonable number of experiments
                   to get familiar with the decision tree method and
                   different evaluation methods (%split, cross-validation,...)
       (03 points) good description of the motivation and purpose of the 
                   experiment, of experiment setting and the results 
       (05 points) good analysis of the results of the experiments
       (up to 4 extra credit points)
                   excellent analysis of the results
   (02 points) comparison of the results obtained with ZeroR, OneR,
               ID3, and J4.8 and summary of the project

(TOTAL 5 points) SLIDES - how well do they concisely summarize 
        the results of the project?

(TOTAL 15 points) Class presentation - how well your oral presentation concisely
        summarized the results of the project and how focused your presentation was
        on the more creative/interesting/useful of your experiments and results.
        This grade is given individually to each team member.