Consider the loan applications dataset discussed in class:
@relation credit-data
@attribute credit_history {bad, unknown, good}
@attribute debt {low, high}
@attribute collateral {none, adequate}
@attribute income {0-15, 15-35, >35}
@attribute risk {low, moderate, high}
@data
bad, low, none, 0-15, high
unknown, high, none, 15-35, high
unknown, low, none, 15-35, moderate
bad, low, none, 15-35, moderate
unknown, low, adequate, >35, low
unknown, low, none, >35, low
unknown, high, none, 0-15, high
bad, low, adequate, >35, moderate
good, low, none, >35, low
good, high, adequate, >35, low
good, high, none, 0-15, high
good, high, none, 15-35, moderate
good, high, none, >35, low
bad, high, none, 15-35, high
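If you want to experiment with this dataset outside Weka, a minimal parser for the listing above can be sketched as follows (this is a quick illustration, not a full ARFF reader; Weka or a library such as `scipy.io.arff` would be the usual tools):

```python
from collections import Counter

ARFF = """\
@relation credit-data
@attribute credit_history {bad, unknown, good}
@attribute debt {low, high}
@attribute collateral {none, adequate}
@attribute income {0-15, 15-35, >35}
@attribute risk {low, moderate, high}
@data
bad, low, none, 0-15, high
unknown, high, none, 15-35, high
unknown, low, none, 15-35, moderate
bad, low, none, 15-35, moderate
unknown, low, adequate, >35, low
unknown, low, none, >35, low
unknown, high, none, 0-15, high
bad, low, adequate, >35, moderate
good, low, none, >35, low
good, high, adequate, >35, low
good, high, none, 0-15, high
good, high, none, 15-35, moderate
good, high, none, >35, low
bad, high, none, 15-35, high
"""

def parse_arff(text):
    """Very small ARFF subset parser: nominal attributes, no quoting."""
    attrs, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        low = line.lower()
        if low.startswith('@attribute'):
            attrs.append(line.split()[1])   # second token is the name
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            rows.append(dict(zip(attrs, [v.strip() for v in line.split(',')])))
    return attrs, rows

attrs, rows = parse_arff(ARFF)
dist = Counter(r['risk'] for r in rows)   # class distribution of the 14 instances
```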
Discuss how the function m is used to prune a collection of perfect rules constructed by the Prism algorithm.
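One common instantiation of such a function — used here only as a hedged illustration; check the course notes for the exact definition given in class — is the m-estimate of a rule's accuracy. It blends the rule's observed accuracy with the prior probability of the class, so a "perfect" rule that covers very few instances scores lower than a near-perfect rule with broad coverage and becomes a candidate for pruning:

```python
def m_estimate(p, n, prior, m):
    """m-estimate of rule accuracy: (p + m*prior) / (p + n + m).

    p     -- positive examples covered by the rule
    n     -- negative examples covered by the rule
    prior -- prior probability of the positive class
    m     -- smoothing parameter (larger m pulls the estimate
             toward the prior, penalizing small coverage)
    """
    return (p + m * prior) / (p + n + m)

# A perfect rule covering only 2 instances (p=2, n=0) scores lower
# than a near-perfect rule covering 20 instances (p=19, n=1):
narrow = m_estimate(2, 0, prior=0.5, m=2)    # 0.75
broad  = m_estimate(19, 1, prior=0.5, m=2)   # 20/22, about 0.909
```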
A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values (if any), discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications for the steps you followed. If Weka does not provide the functionality you need to preprocess your data so that it yields useful patterns, preprocess the data yourself, for instance by writing the necessary filters (you can incorporate them into Weka if you wish).
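As one concrete preprocessing step, Weka's unsupervised Discretize filter defaults to equal-width binning of numeric attributes. A minimal sketch of that idea (a simplified illustration, not Weka's actual code) is:

```python
def equal_width_bins(values, k):
    """Assign each numeric value to one of k equal-width bins,
    mimicking the default behavior of an unsupervised equal-width
    discretization filter (a sketch, not Weka's implementation)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:               # all values identical -> single bin
        return [0] * len(values)
    # Clamp to k-1 so the maximum value falls in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

# e.g. binning raw incomes into three ranges:
incomes = [5, 12, 20, 30, 40, 90]
bins = equal_width_bins(incomes, 3)
```

Whether equal-width, equal-frequency, or manually chosen cut points are appropriate for your dataset is exactly the kind of decision your report should justify.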
In particular,
To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting classification rules are easy to read.
You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your set of rules, the better.
Determine if/how the PRISM method prunes rules during their construction and/or after each rule is constructed. If pruning is done, determine exactly how it is done.
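To ground that analysis, here is a sketch of the basic PRISM covering loop as described in the literature (this reflects my reading of the published algorithm, not Weka's code, which you are asked to inspect yourselves):

```python
def prism_rules_for_class(instances, attrs, target):
    """Basic PRISM covering loop for one class value (a sketch of the
    published algorithm, not of Weka's implementation).

    instances: list of dicts mapping attribute name -> nominal value,
               with the class stored under the key 'class'.
    attrs:     dict mapping attribute name -> list of possible values.
    """
    rules = []
    pool = list(instances)
    remaining = [x for x in pool if x['class'] == target]
    while remaining:
        conds, covered = {}, list(pool)
        # Grow one rule: greedily add the test maximizing p/t, the
        # fraction of covered instances in the target class (ties
        # broken by larger p, then by attribute order).
        while any(x['class'] != target for x in covered):
            best = None  # (p/t, p, attribute, value)
            for a, values in attrs.items():
                if a in conds:
                    continue
                for v in values:
                    sub = [x for x in covered if x[a] == v]
                    if not sub:
                        continue
                    p = sum(1 for x in sub if x['class'] == target)
                    key = (p / len(sub), p)
                    if best is None or key > best[:2]:
                        best = (*key, a, v)
            if best is None:       # no test left to add
                break
            _, _, a, v = best
            conds[a] = v
            covered = [x for x in covered if x[a] == v]
        if not conds:              # safety: no discriminating test exists
            break
        rules.append(conds)
        # Separate: drop the instances the new rule covers, then
        # repeat until every target-class instance is covered.
        uncovered = lambda x: any(x[a] != v for a, v in conds.items())
        pool = [x for x in pool if uncovered(x)]
        remaining = [x for x in remaining if uncovered(x)]
    return rules

# Usage on the 14-instance credit dataset from the homework listing:
NAMES = ['credit_history', 'debt', 'collateral', 'income', 'class']
DATA = [dict(zip(NAMES, r.split())) for r in [
    'bad low none 0-15 high',      'unknown high none 15-35 high',
    'unknown low none 15-35 moderate', 'bad low none 15-35 moderate',
    'unknown low adequate >35 low',    'unknown low none >35 low',
    'unknown high none 0-15 high',     'bad low adequate >35 moderate',
    'good low none >35 low',           'good high adequate >35 low',
    'good high none 0-15 high',        'good high none 15-35 moderate',
    'good high none >35 low',          'bad high none 15-35 high',
]]
ATTRS = {a: sorted({x[a] for x in DATA}) for a in NAMES[:-1]}
rules = prism_rules_for_class(DATA, ATTRS, 'low')
```

Note that this basic loop builds perfect rules on the training data and does no pruning at all — which is precisely what you must check against Weka's actual PRISM code.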
Your report should contain discussions of all the parts described in the PROJECT ASSIGNMENT section above and in addition should elaborate on the following topics:
Provide a detailed description of the preprocessing of your data. Justify the preprocessing you apply and explain why the resulting data is appropriate for mining classification rules.
Please submit the following files using the turnin system by 12 noon on Wed, April 07 2004. For your turnin submission, THE NAME OF THE PROJECT IS "project2". PLEASE MAKE JUST ONE PROJECT SUBMISSION PER GROUP. Submissions received on Wed, April 07 between 12 noon and 4:00 pm will be penalized with 30% off the grade and submissions after April 07 AT 4:00 pm won't be accepted.
FOR THE PROJECT ASSIGNMENT PART (excluding the homework assignment part)
TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(TOTAL: 15 points) PRE-PROCESSING OF THE DATASET:
- (05 points) Discretizing attributes as needed
- (05 points) Dealing with missing values appropriately
- (05 points) Dealing with attributes appropriately (i.e. using nominal values instead of numeric when appropriate, using as many of them as possible, etc.)
- (up to 5 extra credit points) Trying to do "fancier" things with attributes (i.e. combining two highly correlated attributes into one, using background knowledge, etc.)

(TOTAL: 20 points) ALGORITHMIC DESCRIPTION OF THE CODE:
- (05 points) Description of the algorithm underlying the Weka filters used
- (15 points) Description of the algorithm underlying the construction and pruning of classification rules in Weka's PRISM code (up to 5 extra credit points for an outstanding job). Providing just a structural description of the code, i.e. a list of classes and methods, will receive 0 points.

(TOTAL: 60 points) EXPERIMENTS (TOTAL: 30 points each dataset). FOR EACH DATASET:
- (06 points) Running a good number of experiments to get familiar with the PRISM classification method and different evaluation methods (%split, cross-validation, ...)
- (08 points) Good description of the experiment setting and the results
- (08 points) Good analysis of the results of the experiments
- (08 points) Comparison of the results obtained with Prism and the classifiers from the previous project (ZeroR, ID3, and J4.8), argumentation of the weaknesses and/or strengths of each of the methods on this dataset, and argumentation of which method should be preferred for this dataset and why.
- (up to 5 extra credit points) Excellent analysis of the results and comparisons
- (up to 10 extra credit points) Running additional interesting experiments, selecting other classification attributes instead of those required in this project statement ("private/public", "salary")

(TOTAL: 5 points) SLIDES - how well do they summarize concisely the results of the project? We suggest you summarize the setting of your experiments and their results in tabular form.