WPI Worcester Polytechnic Institute

Computer Science Department

CS 444X Data Mining and Knowledge Discovery in Databases - D Term 2004 
Project 1: Data Pre-processing, Mining, and Evaluation of Decision Trees


DUE DATE: This project is due on Wednesday, March 31 2004 at 12 NOON. 


The purpose of this project is multi-fold:


For this and other course projects, we will use the Weka system (http://www.cs.waikato.ac.nz/ml/weka/). Weka is an excellent machine-learning/data-mining environment. It provides a large collection of Java-based mining algorithms, data preprocessing filters, and experimentation capabilities. Weka is open source software issued under the GNU General Public License. For more information on the Weka system, to download it, and to get its documentation, see Weka's webpage (http://www.cs.waikato.ac.nz/ml/weka/).

  1. You should download and use the latest stable GUI version of the system.

  2. Study the tutorial (Chapter 8 of your textbook) provided with the Weka system. Note that the tutorial uses Weka's command line to illustrate how to run the system, but you can actually use the GUI provided with the system to execute the same commands.

  3. Datasets: Consider the following sets of data:

    1. The Mushroom Data Set. The classification target is the "edible/poisonous" attribute.

    2. 1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The necessary files are: Let's make "private/public" the classification target. Note that even though the values of this attribute are 0s and 1s, this is a nominal (not a numeric!) attribute.

  4. Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

    1. Load the data. Note that you need to translate the dataset into the arff format first.
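As a rough illustration of the translation step, the following Python sketch turns a small comma-separated sample into ARFF text. The relation name, the column names, and the sample rows are hypothetical and are not taken from the actual datasets; the sketch also assumes that values contain no spaces or commas, which would require quoting.

```python
import csv
import io


def to_arff(relation, header, rows, nominal):
    """Render rows as ARFF text. `nominal` names the columns declared as nominal."""
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        if name in nominal:
            # Nominal attribute: list the distinct observed values, skipping "?".
            values = sorted({row[i] for row in rows if row[i] != "?"})
            lines.append(f"@attribute {name} {{{','.join(values)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical three-row sample; "?" marks a missing value.
sample_csv = "private,tuition\n1,7440\n0,?\n1,12280\n"
table = list(csv.reader(io.StringIO(sample_csv)))
header, data = table[0], table[1:]

# "private" holds 0s and 1s but is declared nominal, not numeric.
arff_text = to_arff("colleges", header, data, nominal={"private"})
print(arff_text)
```

Declaring the 0/1 column as nominal here mirrors the note above about the "private/public" attribute.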

    2. Preprocessing of the Data:

      A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data well enough to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).

      In particular,

      • explore different ways of discretizing continuous attributes. That is, convert numeric attributes into "nominal" ones by binning numeric values into intervals - See the weka.filter.DiscretizeFilter in Weka. Play with the filter and read the Java code implementing it.
      • explore different ways of handling missing values (for instance, replacing them). Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it.
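To make the two preprocessing ideas above concrete, here is a minimal Python sketch of equal-width binning and of replacing "?" with the mean (numeric attributes) or the most frequent value (nominal attributes). It is only one simple scheme of each kind and is not claimed to reproduce what the Weka filters do; the function names and sample values are hypothetical.

```python
from collections import Counter

MISSING = "?"


def equal_width_bins(values, n_bins):
    """Map numeric strings to equal-width interval labels; "?" passes through."""
    known = [float(v) for v in values if v != MISSING]
    lo, hi = min(known), max(known)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all values are equal
    labels = []
    for v in values:
        if v == MISSING:
            labels.append(MISSING)
            continue
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((float(v) - lo) / width), n_bins - 1)
        labels.append(f"{lo + idx * width:g}-{lo + (idx + 1) * width:g}")
    return labels


def fill_missing(values, numeric):
    """Replace "?" with the mean (numeric) or the most frequent value (nominal)."""
    known = [v for v in values if v != MISSING]
    if numeric:
        fill = f"{sum(map(float, known)) / len(known):g}"
    else:
        fill = Counter(known).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in values]


tuition = ["1000", "?", "3000", "5000"]  # hypothetical values
binned = equal_width_bins(tuition, n_bins=2)
filled = fill_missing(tuition, numeric=True)
print(binned)
print(filled)
```

The order of the two steps matters: filling first and binning afterwards assigns the former missing entries to a bin, while binning first keeps "?" as it is.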

      To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting decision trees are easy to read.

    3. Mining of Patterns:

      1. Use the "ZeroR" classifier under the "Classify" tab. This provides a benchmark classification accuracy against which to compare the accuracy of your decision trees below.
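The idea behind such a baseline can be sketched in a few lines of Python: ignore every attribute and always predict the most frequent class. This is an illustration of the idea only, not Weka's code, and the class labels below are hypothetical.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the most frequent class label; all other attributes are ignored."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical labels: three edible ("e") and two poisonous ("p") instances.
labels = ["e", "e", "e", "p", "p"]
majority = zero_r(labels)
baseline = accuracy([majority] * len(labels), labels)
print(majority, baseline)
```

A decision tree that does not beat this number has learned nothing useful from the attributes.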

      2. Decision Trees: The following are guidelines for the construction of your decision tree:

        • Code: Use the decision tree methods implemented in the Weka system: ID3 and J4.8. Read the Weka code implementing ID3 in detail. Look also at the J4.8 classifier code.

        • Training and Testing Instances:

          You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your decision trees, the better.
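Before reading the ID3 code named in the guidelines above, it may help to see the attribute selection criterion usually associated with ID3, information gain, in isolation. The sketch below is a minimal Python illustration under that assumption; the attribute names and rows are hypothetical and the code is not a transcription of Weka's implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def best_attribute(rows, attrs, target):
    """Pick the attribute with the highest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, a, target))


# Hypothetical mushroom-like rows: odor separates the classes, cap color does not.
rows = [
    {"odor": "foul", "cap": "brown", "class": "poisonous"},
    {"odor": "foul", "cap": "white", "class": "poisonous"},
    {"odor": "none", "cap": "brown", "class": "edible"},
    {"odor": "none", "cap": "white", "class": "edible"},
]
root = best_attribute(rows, ["odor", "cap"], "class")
print(root)
```

A full tree builder would apply the same selection recursively to each subset of rows until the subsets are pure or no attributes remain.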

    4. Evaluation and Testing: Use different ways of testing your results for each of the mining techniques employed (i.e. ZeroR, ID3, J4.8).

      1. Supply input data and mine and evaluate your model over this same input data.

      2. Supply separate training and testing data to Weka.

      3. Supply input data to Weka and experiment with several split ratios for training and testing data.

      4. Supply input data to Weka and use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
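The testing options listed above differ only in which instances are used for training and which for testing. The following Python sketch shows evaluation on the training data, a percentage split, and n-fold cross-validation, using a majority-class learner as a stand-in model. It is a simplified illustration: it does not stratify the folds, the seed and data are hypothetical, and it is not Weka's evaluation code.

```python
import random
from collections import Counter


def majority_classifier(train):
    """Stand-in learner: always predicts the most frequent training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: majority


def accuracy(model, test):
    """Fraction of (features, label) pairs the model classifies correctly."""
    return sum(model(x) == y for x, y in test) / len(test)


def percentage_split(data, train_fraction, seed=1):
    """Shuffle a copy of the data and cut it into training and testing parts."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def cross_validate(data, n_folds, learn, seed=1):
    """Test on each fold in turn, training on the rest; pool the correct counts."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    correct = 0
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = learn(train)
        correct += sum(model(x) == y for x, y in test)
    return correct / len(data)


# Hypothetical data: seven "e" and three "p" instances with a dummy feature.
data = [((i,), "e") for i in range(7)] + [((i,), "p") for i in range(7, 10)]

on_training_data = accuracy(majority_classifier(data), data)
train_part, test_part = percentage_split(data, 0.66)
on_split = accuracy(majority_classifier(train_part), test_part)
on_cv = cross_validate(data, 10, majority_classifier)
print(on_training_data, on_split, on_cv)
```

For a real learner, accuracy measured on the training data is typically optimistic, which is why the held-out options are worth comparing against it.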

    5. Pruning of your decision tree:

      Experiment with Weka's J4.8 classifier to see how it performs pre- and/or post-pruning of the decision tree in order to increase the classification accuracy and/or to reduce the size of the decision tree.
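One way to think about post-pruning is the pessimistic error estimate described in textbook treatments of C4.5, the algorithm on which J4.8 is based. The Python sketch below computes an upper confidence bound on a leaf's error rate and replaces a subtree by a leaf when that does not increase the estimate. The default z = 0.69 is the value commonly quoted for a 25% confidence level and should be treated as an assumption; the counts are hypothetical, and Weka's actual code may compute the bound differently.

```python
import math


def pessimistic_error(errors, n, z=0.69):
    """Upper confidence bound on the true error rate, given `errors` out of `n`."""
    f = errors / n
    numerator = (
        f
        + z * z / (2 * n)
        + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    )
    return numerator / (1 + z * z / n)


def should_replace_subtree(node_errors, leaf_counts, z=0.69):
    """Compare the node treated as a single leaf against its current leaves.

    leaf_counts is a list of (errors, n) pairs, one per leaf of the subtree.
    node_errors is the number of training errors if the node became a leaf.
    """
    total = sum(n for _, n in leaf_counts)
    subtree_estimate = (
        sum(n * pessimistic_error(e, n, z) for e, n in leaf_counts) / total
    )
    leaf_estimate = pessimistic_error(node_errors, total, z)
    return leaf_estimate <= subtree_estimate


# Hypothetical subtree: three leaves covering 14 training instances.
leaves = [(2, 6), (1, 2), (2, 6)]
prune = should_replace_subtree(node_errors=5, leaf_counts=leaves)
print(prune)
```

Small leaves receive a large penalty under this estimate, which is why a subtree with several tiny leaves can look worse than a single leaf that makes the same number of training errors.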




(05 points) Translating both input datasets into .arff
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
           (e.g. using nominal values instead of numeric
            when appropriate, using as many of them
            as possible, etc.)
(up to 5 extra credit points)
           Trying to do "fancier" things with attributes
           (e.g. combining two highly correlated attributes
            into one, using background knowledge, etc.)
(04 points) Description of the algorithm underlying the Weka filters used
(02 points) Description of the algorithm underlying Weka's ZeroR code
(04 points) Description of the algorithm underlying Weka's ID3 code
(05 points) Description of the algorithm underlying Weka's J4.8 code

(TOTAL: 28 points each dataset) FOR EACH DATASET:
   (02 points) ran at least a reasonable number of experiments
               to get familiar with ZeroR
   (TOTAL: 26 points) For each required decision tree method,
       ID3 and J4.8 (13 points each):
       (05 points) ran at least a reasonable number of experiments
                   to get familiar with the decision tree method and
                   different evaluation methods (%split, cross-validation,...)
       (03 points) good description of the experiment setting and the results 
       (05 points) good analysis of the results of the experiments
       (up to 4 extra credit points)
                   excellent analysis of the results
(04 points) comparison of the results obtained with ZeroR,
            ID3, and J4.8 and summary of the project

(TOTAL 5 points) SLIDES - how well do they summarize concisely
        the results of the project?