WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 444X Data Mining and Knowledge Discovery in Databases 
D Term 2003
Project 2: Data Pre-processing, Mining, and Evaluation of Classification Rules

PROF. CAROLINA RUIZ 

DUE DATE: This project is due on Wednesday, April 02, 2003 at 12 noon  
------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is to construct the most accurate set of classification rules possible for each of the following classification tasks: (1) predict the "edible/poisonous" attribute of the Mushroom Dataset; and (2) predict the "public/private" attribute of the College Data (see below).

PROJECT ASSIGNMENT

  • Datasets: Consider the following sets of data:

    1. The Mushroom Data Set. The classification target is the "edible/poisonous" attribute.

    2. 1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The classification target is the "private/public" attribute. Note that even though the values of this attribute are 0s and 1s, it is a nominal (not a numeric!) attribute.

  • Readings: Read in great detail Sections 4.1, 4.4, 6.2 from your textbook.

  • Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

    1. Load the data. Note that you need to translate the dataset into the arff format first; a minimal sketch of that format appears below.
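
      For orientation, here is a minimal, hypothetical ARFF sketch; the attribute names and values are illustrative and do not reflect the actual layout of either dataset. Note how the class attribute "private" is declared nominal with values {0,1} even though those values look numeric, and how a missing value is written as "?":

        @relation college
        % attribute names and values below are made up for illustration
        @attribute out_of_state_tuition numeric
        @attribute grad_rate numeric
        @attribute private {0,1}
        @data
        18420,94,1
        4860,?,0
        9800,71,1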

    2. Preprocessing of the Data:

      A main part of this project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining, and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data so as to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).

      In particular,

      • explore different ways of discretizing (if needed) continuous attributes, that is, of converting numeric attributes into "nominal" ones by binning their values into intervals. See the weka.filter.DiscretizeFilter in Weka. Play with the filter and read the Java code implementing it.
      • explore different ways of handling missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it.

      To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting classification rules are easy to read.
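
      If you prefer to drive the preprocessing from Java code rather than the Explorer, a minimal sketch along the following lines is one possibility. It uses the filter class names from more recent Weka releases (weka.filters.unsupervised.attribute.Discretize and ReplaceMissingValues, the successors of the DiscretizeFilter and ReplaceMissingValuesFilter named above); the file name is a placeholder.

        import java.io.BufferedReader;
        import java.io.FileReader;
        import weka.core.Instances;
        import weka.filters.Filter;
        import weka.filters.unsupervised.attribute.Discretize;
        import weka.filters.unsupervised.attribute.ReplaceMissingValues;

        public class PreprocessSketch {
            public static void main(String[] args) throws Exception {
                // "college.arff" is a placeholder for your translated dataset
                Instances data =
                    new Instances(new BufferedReader(new FileReader("college.arff")));

                // replace every missing value ("?") with the attribute's mean/mode
                ReplaceMissingValues rm = new ReplaceMissingValues();
                rm.setInputFormat(data);
                data = Filter.useFilter(data, rm);

                // bin each numeric attribute into 10 equal-width intervals
                Discretize disc = new Discretize();
                disc.setBins(10);
                disc.setInputFormat(data);
                data = Filter.useFilter(data, disc);

                // set the class attribute last, after all filtering is done
                data.setClassIndex(data.numAttributes() - 1);
                System.out.println(data.toSummaryString());
            }
        }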

    3. Mining of Classification Rules: The following are guidelines for the construction of your classification rules:

      • Code: Use PRISM, the covering algorithm for generating classification rules implemented in the Weka system. Read the Weka code implementing PRISM in great detail (you need to describe the algorithm used in PRISM in your written report). A compact sketch of PRISM's covering strategy appears at the end of this item.

      • Training and Testing Instances:

        You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your set of rules, the better.
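
      To make the covering strategy concrete before you dig into Weka's code, here is a self-contained Java sketch of PRISM as described by Cendrowska (the algorithm Weka's Prism class is based on). The toy dataset, class names, and data layout are illustrative assumptions, not Weka's internal representation; each row is nominal, with the class value in the last column.

        import java.util.*;
        import java.util.stream.Collectors;

        public class PrismSketch {

            public static void main(String[] args) {
                // toy weather data (assumed example): outlook, windy -> play
                String[][] data = {
                    {"sunny",    "false", "no"},
                    {"sunny",    "true",  "no"},
                    {"overcast", "false", "yes"},
                    {"rainy",    "false", "yes"},
                    {"rainy",    "true",  "no"},
                };
                for (String cls : new String[]{"yes", "no"})
                    for (Map<Integer, String> rule : rulesFor(data, cls))
                        System.out.println("IF " + rule + " THEN class = " + cls);
            }

            /** Induce rules for one class value, removing covered instances. */
            static List<Map<Integer, String>> rulesFor(String[][] data, String cls) {
                int nAtts = data[0].length - 1;
                List<String[]> remaining = new ArrayList<>(Arrays.asList(data));
                List<Map<Integer, String>> rules = new ArrayList<>();
                while (remaining.stream().anyMatch(r -> r[nAtts].equals(cls))) {
                    Map<Integer, String> rule = new LinkedHashMap<>(); // att -> value
                    List<String[]> covered = new ArrayList<>(remaining);
                    // grow the rule one attribute-value test at a time until it
                    // covers only instances of cls, or every attribute is used
                    while (covered.stream().anyMatch(r -> !r[nAtts].equals(cls))
                            && rule.size() < nAtts) {
                        int bestAtt = -1, bestP = -1;
                        String bestVal = null;
                        double bestAcc = -1.0;
                        for (int a = 0; a < nAtts; a++) {
                            if (rule.containsKey(a)) continue;
                            for (String v : distinctValues(covered, a)) {
                                int t = 0, p = 0; // t = covered, p = covered AND of class cls
                                for (String[] r : covered)
                                    if (r[a].equals(v)) { t++; if (r[nAtts].equals(cls)) p++; }
                                double acc = (double) p / t;
                                // PRISM maximizes p/t, breaking ties by larger p
                                if (acc > bestAcc || (acc == bestAcc && p > bestP)) {
                                    bestAcc = acc; bestP = p; bestAtt = a; bestVal = v;
                                }
                            }
                        }
                        final int a = bestAtt; final String v = bestVal;
                        rule.put(a, v);
                        covered = covered.stream().filter(r -> r[a].equals(v))
                                         .collect(Collectors.toList());
                    }
                    rules.add(rule);
                    remaining.removeAll(covered); // covering step: drop covered instances
                }
                return rules;
            }

            static Set<String> distinctValues(List<String[]> rows, int att) {
                Set<String> vals = new LinkedHashSet<>();
                for (String[] r : rows) vals.add(r[att]);
                return vals;
            }
        }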

  • Evaluation and Testing: Evaluate the results of the mining technique you employed in several different ways (a Java sketch of the corresponding Weka calls follows this list):

    1. Supply input data to Weka, mine it, and evaluate your model over this same input data (i.e., test on the training set).

    2. Supply separate training and testing data to Weka.

    3. Supply input data to Weka and experiment with several split ratios for training and testing data.

    4. Supply input data to Weka and use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
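
    The Explorer exposes all four evaluation modes directly; if you prefer to script them, the following Java sketch shows one possibility. It assumes a Weka release where PRISM lives in weka.classifiers.rules.Prism (older releases used weka.classifiers.Prism), that the class is the last attribute, and that the data has already been preprocessed into nominal attributes with no missing values (which Prism requires); file names are placeholders. For evaluation mode 2 (separate training and testing files), load the test set the same way and pass it to evaluateModel.

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.util.Random;
      import weka.classifiers.Evaluation;
      import weka.classifiers.rules.Prism;
      import weka.core.Instances;

      public class EvaluateSketch {
          public static void main(String[] args) throws Exception {
              Instances data = load("mushroom.arff"); // placeholder file name

              // (1) train and test on the same data (an optimistic estimate)
              Prism prism = new Prism();
              prism.buildClassifier(data);
              Evaluation onTrain = new Evaluation(data);
              onTrain.evaluateModel(prism, data);
              System.out.println(onTrain.toSummaryString("\n== Training data ==", false));

              // (2)/(3) percentage split, here 66% train / 34% test
              Instances shuffled = new Instances(data);
              shuffled.randomize(new Random(1));
              int trainSize = (int) Math.round(shuffled.numInstances() * 0.66);
              Instances train = new Instances(shuffled, 0, trainSize);
              Instances test  = new Instances(shuffled, trainSize,
                                              shuffled.numInstances() - trainSize);
              Prism split = new Prism();
              split.buildClassifier(train);
              Evaluation onSplit = new Evaluation(train);
              onSplit.evaluateModel(split, test);
              System.out.println(onSplit.toSummaryString("\n== 66% split ==", false));

              // (4) n-fold cross-validation; vary the 10 to experiment with folds
              Evaluation cv = new Evaluation(data);
              cv.crossValidateModel(new Prism(), data, 10, new Random(1));
              System.out.println(cv.toSummaryString("\n== 10-fold CV ==", false));
          }

          static Instances load(String file) throws Exception {
              Instances data = new Instances(new BufferedReader(new FileReader(file)));
              data.setClassIndex(data.numAttributes() - 1); // class is last attribute
              return data;
          }
      }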

  • Pruning of the rules:

    Determine if/how the PRISM method prunes rules during their construction and/or after each rule is constructed. If pruning is done, determine exactly how it is done.


    REPORTS AND DUE DATE


    GRADING CRITERIA

    TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY
    
    (TOTAL: 15 points) PRE-PROCESSING OF THE DATASET:
    (05 points) Discretizing attributes as needed
    (05 points) Dealing with missing values appropriately
    (05 points) Dealing with attributes appropriately
               (e.g. using nominal values instead of numeric
                when appropriate, using as many of them
                as possible, etc.)
    (up to 5 extra credit points)
               Trying to do "fancier" things with attributes
               (e.g. combining two highly correlated attributes
                into one, using background knowledge, etc.)
        
    (TOTAL: 20 points) ALGORITHMIC DESCRIPTION OF THE CODE
    (05 points) Description of the algorithm underlying the Weka filters used
    (15 points) Description of the algorithm underlying the construction and
                pruning of classification rules in Weka's PRISM code
    (up to 5 extra credit points for an outstanding job)
    (providing just a structural description of the code, i.e. a list of 
    classes and methods, will receive 0 points)
    
    (TOTAL: 60 points) EXPERIMENTS
    (TOTAL: 30 points each dataset) FOR EACH DATASET:
           (06 points) ran a good number of experiments
                       to get familiar with the PRISM classification method and
                       different evaluation methods (%split, cross-validation,...)
           (08 points) good description of the experiment setting and the results 
           (08 points) good analysis of the results of the experiments
           (08 points) comparison of the results obtained with Prism and the
                       classifiers from the previous project (ZeroR, ID3, and J4.8),
                       argumentation of the weaknesses and/or strengths of each of
                       the methods on this dataset, and argumentation of which
                       method should be preferred for this dataset and why.
           (up to 5 extra credit points) excellent analysis of the results and 
                                         comparisons
           (up to 10 extra credit points) running additional interesting experiments
                       selecting other classification attributes instead of the ones
                       required in this project statement ("edible/poisonous",
                       "private/public")
    
    (TOTAL 5 points) SLIDES - how well do they summarize concisely
            the results of the project? We suggest you summarize the
            setting of your experiments and their results in a tabular manner.