WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 444X Data Mining and Knowledge Discovery in Databases - D Term 2004 
Project 2: Data Pre-processing, Mining, and Evaluation of Classification Rules

PROF. CAROLINA RUIZ 

DUE DATE: This project is due on Wednesday, April 7, 2004 at 12 NOON. 
------------------------------------------


PROJECT DESCRIPTION

The purpose of this assignment is to gain first-hand experience with the construction of classification rules.

HOMEWORK ASSIGNMENT

See Peter Mardziel's solutions to this homework assignment.

Consider the loan applications dataset discussed in class:

@relation credit-data

@attribute credit_history {bad, unknown, good}
@attribute debt {low, high}
@attribute collateral {none, adequate}
@attribute income {0-15, 15-35, >35}
@attribute risk {low, moderate, high}

@data
bad, low, none, 0-15, high
unknown, high, none, 15-35, high
unknown, low, none, 15-35, moderate
bad, low, none, 15-35, moderate
unknown, low, adequate, >35, low
unknown, low, none, >35, low
unknown, high, none, 0-15, high
bad, low, adequate, >35, moderate
good, low, none, >35, low
good, high, adequate, >35, low
good, high, none, 0-15, high
good, high, none, 15-35, moderate
good, high, none, >35, low
bad, high, none, 15-35, high

  1. (20 points) Construct "by hand" all the perfect classification rules that the Prism algorithm would output for this dataset, using the ratio p/t to rank the attribute-value pairs that are candidates for inclusion in a rule. Your written solutions should show all your work: that is, the list of all attribute-value pairs that were candidates during each stage of the rule construction process, and which ones were selected.

  2. (20 points) Repeat part 1 above, but now using p*[log_2(p/t) - log_2(P/T)] to rank the attribute-value pairs that are candidates for inclusion in a rule.

  3. (10 points) Assume that a function m: Rules -> Real Numbers is given that receives a rule R as its input and outputs the likelihood that the improvement in classification accuracy given by the rule R (over the accuracy of ZeroR) occurs by chance. Hence, the lower m(R) is, the better R is.

    Discuss how the function m is used to prune a collection of perfect rules constructed by the Prism algorithm.
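The hand construction in questions 1 and 2 can be checked mechanically. The following is a minimal sketch of the Prism covering algorithm run on the credit-data set above; it is illustrative code, not Weka's implementation, and it assumes a particular tie-breaking policy (prefer the candidate covering more positive instances p, then the first candidate in attribute order) where the algorithm leaves the choice open.

```python
# Sketch of the Prism covering algorithm on the credit-data set.
# Assumptions: ties on the ranking measure are broken by larger p,
# then by first candidate encountered (attribute order, sorted values).
from math import log2

ATTRS = ["credit_history", "debt", "collateral", "income"]
DATA = [  # (credit_history, debt, collateral, income, risk)
    ("bad",     "low",  "none",     "0-15",  "high"),
    ("unknown", "high", "none",     "15-35", "high"),
    ("unknown", "low",  "none",     "15-35", "moderate"),
    ("bad",     "low",  "none",     "15-35", "moderate"),
    ("unknown", "low",  "adequate", ">35",   "low"),
    ("unknown", "low",  "none",     ">35",   "low"),
    ("unknown", "high", "none",     "0-15",  "high"),
    ("bad",     "low",  "adequate", ">35",   "moderate"),
    ("good",    "low",  "none",     ">35",   "low"),
    ("good",    "high", "adequate", ">35",   "low"),
    ("good",    "high", "none",     "0-15",  "high"),
    ("good",    "high", "none",     "15-35", "moderate"),
    ("good",    "high", "none",     ">35",   "low"),
    ("bad",     "high", "none",     "15-35", "high"),
]

def covers(rule, inst):
    """A rule is a list of (attribute_index, value) tests, ANDed."""
    return all(inst[a] == v for a, v in rule)

def prism(target, score):
    """Return the perfect rules Prism builds for class `target`."""
    E, rules = list(DATA), []
    while any(inst[4] == target for inst in E):
        rule, covered = [], list(E)
        # Add tests until the rule covers only `target` instances.
        while any(inst[4] != target for inst in covered):
            P = sum(1 for inst in covered if inst[4] == target)
            T = len(covered)
            used = {a for a, _ in rule}
            best = None
            for a in range(4):
                if a in used:
                    continue
                for v in sorted({inst[a] for inst in covered}):
                    t = [inst for inst in covered if inst[a] == v]
                    p = sum(1 for inst in t if inst[4] == target)
                    if p == 0:
                        continue
                    key = (score(p, len(t), P, T), p)  # rank, tie-break on p
                    if best is None or key > best[0]:
                        best = (key, (a, v))
            rule.append(best[1])
            covered = [inst for inst in covered if covers(rule, inst)]
        rules.append(rule)
        # Remove the instances this rule covers and repeat.
        E = [inst for inst in E if not covers(rule, inst)]
    return rules

ratio = lambda p, t, P, T: p / t                            # question 1
gain = lambda p, t, P, T: p * (log2(p / t) - log2(P / T))   # question 2

for rule in prism("high", ratio):
    print("IF " + " AND ".join(f"{ATTRS[a]} = {v}" for a, v in rule)
          + " THEN risk = high")
```

With the p/t measure this sketch finds "IF income = 0-15 THEN risk = high" as its first rule, since that pair has p/t = 3/3 = 1 on the full dataset. Hand-constructed rules may legitimately differ wherever candidates tie on both the measure and p, since that choice is arbitrary.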


PROJECT ASSIGNMENT


REPORTS AND DUE DATE


GRADING CRITERIA

FOR THE PROJECT ASSIGNMENT PART (excluding the homework assignment part)
TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(TOTAL: 15 points) PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
           (e.g. using nominal values instead of numeric
            when appropriate, using as many of them 
            as possible, etc.) 
(up to 5 extra credit points) 
           Trying to do "fancier" things with attributes
           (e.g. combining two highly correlated attributes
            into one, using background knowledge, etc.)
    
(TOTAL: 20 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(15 points) Description of the algorithm underlying the construction and
            pruning of classification rules in Weka's PRISM code
(up to 5 extra credit points for an outstanding job) 
(providing just a structural description of the code, i.e. a list of 
classes and methods, will receive 0 points)

(TOTAL: 60 points) EXPERIMENTS
(TOTAL: 30 points each dataset) FOR EACH DATASET:
       (06 points) ran a good number of experiments
                   to get familiar with the PRISM classification method and
                   different evaluation methods (%split, cross-validation,...)
       (08 points) good description of the experiment setting and the results 
       (08 points) good analysis of the results of the experiments
       (08 points) comparison of the results obtained with Prism and the
                   classifiers from the previous project (ZeroR, ID3, and J4.8),
                   argumentation of the weaknesses and/or strengths of each of
                   the methods on this dataset, and argumentation of which
                   method should be preferred for this dataset and why. 
       (up to 5 extra credit points) excellent analysis of the results and 
                                     comparisons
       (up to 10 extra credit points) running additional interesting experiments
                   selecting other classification attributes instead of the ones 
                   required in this project statement ("private/public", "salary")

(TOTAL 5 points) SLIDES - how well do they summarize concisely
        the results of the project? We suggest you summarize the
        setting of your experiments and their results in a tabular manner.