### CS 444X Data Mining and Knowledge Discovery in Databases - D Term 2004  Project 2: Data Pre-processing, Mining, and Evaluation of Classification Rules

#### PROF. CAROLINA RUIZ

DUE DATE: This project is due on Wednesday, April 7 2004 at 12 NOON.

#### PROJECT DESCRIPTION

The purpose of this assignment is to gain first-hand experience with the construction of classification rules.

• Homework Assignment: The purpose of the homework is to construct by hand collections of classification rules following the Prism algorithm over a loan applications dataset.

• Project Assignment: The purpose of this project is to construct the most accurate set of classification rules possible for each of the following classification tasks: (1) Predict the "public/private" attribute of the College Data (see below). (2) Predict the "Salary" attribute of the Census-Income Data (see below).

#### HOMEWORK ASSIGNMENT

See Peter Mardziel's solutions to this homework assignment.

Consider the loan applications dataset discussed in class:

```
@relation credit-data

@attribute debt {low, high}
@attribute income {0-15, 15-35, >35}
@attribute risk {low, moderate, high}

@data
unknown, high, none, 15-35, high
unknown, low, none, 15-35, moderate
unknown, low, none, >35, low
unknown, high, none, 0-15, high
good, low, none, >35, low
good, high, none, 0-15, high
good, high, none, 15-35, moderate
good, high, none, >35, low
```

1. (20 points) Construct "by hand" all the perfect classification rules that the Prism algorithm would output for this dataset using the ratio p/t to rank the attribute-values that are candidates for inclusion in a rule. Your written solutions should show all your work, that is, the list of all attribute-values that were candidates during each of the stages of the rule construction process and which ones were selected.

2. (20 points) Repeat part 1 above but now using p*[log_2(p/t) - log_2(P/T)] to rank the attribute-values that are candidates for inclusion in a rule.

3. (10 points) Assume that a function m: Rules -> Real Numbers is given, such that this function receives a rule R as its input and outputs the likelihood that the improvement in classification accuracy given by the rule R (over the accuracy of Zero-R) occurs by chance. Hence, the lower m(R), the better R is.

Discuss how the function m is used to prune a collection of perfect rules constructed by the Prism algorithm.
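As an aid for checking hand calculations, the two ranking measures from parts 1 and 2 can be sketched in Python over the rows listed above. The column names `credit_history` and `collateral` are assumptions, since the listing declares only three of the five columns. The function `chance_probability` is one possible choice for m (a hypergeometric tail probability); the assignment does not specify m, so treat it purely as an illustration.

```python
from math import comb, log2

# Column names are ASSUMED: the listing's header declares only debt, income, risk.
ATTRS = ["credit_history", "debt", "collateral", "income"]
ROWS = [
    ("unknown", "high", "none", "15-35", "high"),
    ("unknown", "low",  "none", "15-35", "moderate"),
    ("unknown", "low",  "none", ">35",   "low"),
    ("unknown", "high", "none", "0-15",  "high"),
    ("good",    "low",  "none", ">35",   "low"),
    ("good",    "high", "none", "0-15",  "high"),
    ("good",    "high", "none", "15-35", "moderate"),
    ("good",    "high", "none", ">35",   "low"),
]

def rank_candidates(rows, target):
    """Return {(attr, value): (p, t, p/t, gain)} for one target class.

    t = instances covered by attr=value, p = those of class `target`,
    T = all instances, P = all instances of class `target`.
    """
    T = len(rows)
    P = sum(1 for r in rows if r[-1] == target)
    scores = {}
    for i, attr in enumerate(ATTRS):
        for value in sorted({r[i] for r in rows}):
            covered = [r for r in rows if r[i] == value]
            t = len(covered)
            p = sum(1 for r in covered if r[-1] == target)
            # By convention the gain is taken as 0 when p == 0 (log of 0 is undefined).
            gain = p * (log2(p / t) - log2(P / T)) if p > 0 else 0.0
            scores[(attr, value)] = (p, t, p / t, gain)
    return scores

def chance_probability(p, t, P, T):
    """ASSUMED form of m: probability that a rule covering t of T instances
    picks up at least p of the P target-class instances purely by chance."""
    tail = sum(comb(P, i) * comb(T - P, t - i) for i in range(p, min(t, P) + 1))
    return tail / comb(T, t)

if __name__ == "__main__":
    for (attr, value), (p, t, ratio, gain) in rank_candidates(ROWS, "low").items():
        print(f"{attr}={value}: p={p} t={t} p/t={ratio:.2f} gain={gain:.3f}")
```

For the class `low`, the candidate `income = >35` covers three instances, all of them `low`, so its p/t is 1. Comparing `chance_probability` for a rule and for shorter versions of the same rule is one way such a measure could inform pruning.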

#### PROJECT ASSIGNMENT

• Datasets: Consider the following sets of data:

1. 1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The necessary files are: Let's make "private/public" the classification target. Note that even though the values of this attribute are 0s and 1s, this is a nominal (not a numeric!) attribute.

2. The census-income dataset from the US Census Bureau which is available at the Univ. of California Irvine Repository.
The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean attribute class classifying the income of the person as belonging to one of two categories: >50K, <=50K.

• Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

1. Load the data. Note that you need to translate the dataset into the arff format first.
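A minimal Python sketch of the translation into arff format, assuming the raw data is comma-separated with a header row and uses "?" for missing values. The function name `csv_to_arff`, the sample column names, and the relation name are hypothetical. Values containing spaces or commas would need quoting, which this sketch does not handle.

```python
import csv
import io

def csv_to_arff(csv_text, relation, nominal=()):
    """Convert comma-separated text (with a header row) into ARFF text.

    Columns named in `nominal` get an enumerated value set; all other
    columns are declared numeric. "?" is kept as the missing-value marker.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        if name in nominal:
            values = sorted({r[i] for r in data if r[i] != "?"})
            lines.append(f"@attribute {name} {{{','.join(values)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(r) for r in data]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    sample = "private,apps\n1,1660\n0,?\n"   # hypothetical two-column excerpt
    print(csv_to_arff(sample, "college", nominal=("private",)))
```

Declaring `private` as nominal here reflects the note above that its 0/1 values are category labels rather than numbers.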

2. Preprocessing of the Data:

A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before mining it, and you may use the results of previous mining tasks to decide which filters to apply. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data so as to obtain useful patterns, preprocess the data yourself by writing the necessary filters (you can incorporate them into Weka if you wish).

In particular,

• explore different ways of discretizing (if needed) continuous attributes. That is, convert numeric attributes into "nominal" ones by binning numeric values into intervals - See the weka.filter.DiscretizeFilter in Weka. Play with the filter and read the Java code implementing it.
• explore different ways of removing missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it.

To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting classification rules are easy to read.
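The two preprocessing ideas above, binning numeric values into intervals and replacing missing values, can be sketched in plain Python. This is an illustration of equal-width binning and mean/mode replacement under the assumption that "?" marks a missing value. It is not a description of how the Weka filters are implemented, which should be checked against their Java code.

```python
from collections import Counter

MISSING = "?"

def equal_width_bins(values, bins):
    """Map numeric strings to interval labels such as "0-5"; missing stays "?"."""
    nums = [float(v) for v in values if v != MISSING]
    lo, hi = min(nums), max(nums)
    width = (hi - lo) / bins or 1.0   # avoid zero width when all values are equal
    labels = []
    for v in values:
        if v == MISSING:
            labels.append(MISSING)
            continue
        k = min(int((float(v) - lo) / width), bins - 1)  # clamp the maximum into the last bin
        labels.append(f"{lo + k * width:g}-{lo + (k + 1) * width:g}")
    return labels

def replace_missing_nominal(values):
    """Replace "?" with the most frequent observed value (the mode)."""
    mode = Counter(v for v in values if v != MISSING).most_common(1)[0][0]
    return [mode if v == MISSING else v for v in values]

def replace_missing_numeric(values):
    """Replace "?" with the mean of the observed values."""
    nums = [float(v) for v in values if v != MISSING]
    mean = sum(nums) / len(nums)
    return [mean if v == MISSING else float(v) for v in values]

if __name__ == "__main__":
    print(equal_width_bins(["0", "5", "10", "?"], 2))
    print(replace_missing_nominal(["a", "?", "b", "a"]))
    print(replace_missing_numeric(["1", "?", "3"]))
```

Interval labels like "0-5" also serve the readability goal mentioned above, since they appear verbatim in the conditions of the resulting rules.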

3. Mining of Classification Rules: The following are guidelines for the construction of your classification rules:

• Code: Use PRISM, the covering algorithm for generating classification rules, as implemented in the Weka system. Read the Weka code implementing PRISM in great detail (you need to describe the algorithm used in PRISM in your written report). Read in great detail Sections 4.1, 4.4, 6.2 from your textbook.

• Training and Testing Instances:

You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your set of rules, the better.

• Evaluation and Testing: Use different ways of testing the results of the mining technique employed:

1. Supply input data and mine and evaluate your model over this same input data.

2. Supply separate training and testing data to Weka.

3. Supply input data to Weka and experiment with several split ratios for training and testing data.

4. Supply input data to Weka and use n-fold cross-validation to test your results. Experiment with different values for the number of folds.
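To make the difference between these testing modes concrete, here is a small Python sketch, independent of Weka. The learner `train_majority` is a stand-in that always predicts the most common class (in the spirit of ZeroR); all function names are hypothetical, and the last element of each row is assumed to be the class.

```python
import random
from collections import Counter

def train_majority(rows):
    """Stand-in learner: always predict the most common class in `rows`."""
    label = Counter(r[-1] for r in rows).most_common(1)[0][0]
    return lambda row: label

def accuracy(model, rows):
    """Fraction of rows whose class the model predicts correctly."""
    return sum(model(r) == r[-1] for r in rows) / len(rows)

def percentage_split(rows, train_fraction, seed=0):
    """Shuffle, then cut into (training, testing) parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def cross_validation(rows, folds, train=train_majority, seed=0):
    """n-fold cross-validation: each instance is tested exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    correct = 0
    for k in range(folds):
        test = shuffled[k::folds]
        training = [r for i, r in enumerate(shuffled) if i % folds != k]
        model = train(training)
        correct += sum(model(r) == r[-1] for r in test)
    return correct / len(shuffled)

if __name__ == "__main__":
    data = [("a", "yes")] * 6 + [("b", "no")] * 2
    print("training-set accuracy:", accuracy(train_majority(data), data))
    train_part, test_part = percentage_split(data, 0.75)
    print("split accuracy:", accuracy(train_majority(train_part), test_part))
    print("4-fold accuracy:", cross_validation(data, 4))
```

Evaluating on the training data itself (mode 1) tends to give optimistic figures for rule learners that fit the data perfectly, which is why the other modes are worth comparing.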

• Pruning of the rules:

Determine if/how the PRISM method prunes rules during their construction and/or after each rule is constructed. If pruning is done, determine exactly how it is done.
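As a reference point when reading the Weka code, the basic covering scheme can be sketched in Python as follows. This is a simplified sketch that ranks candidates by p/t, breaks ties in favor of larger p (an assumption), and performs no pruning at all. It is not Weka's implementation, and the toy dataset and attribute names are hypothetical.

```python
def prism(rows, attrs):
    """Return a list of (conditions, cls) rules.

    rows: tuples whose last element is the class; attrs: names of the
    other columns, in order. conditions is a dict {attribute: value}.
    """
    rules = []
    for cls in sorted({r[-1] for r in rows}):
        remaining = list(rows)               # each class starts from the full set
        while any(r[-1] == cls for r in remaining):
            conditions, covered = {}, remaining
            # Grow the rule until it is perfect or no attributes are left.
            while any(r[-1] != cls for r in covered) and len(conditions) < len(attrs):
                best = None
                for i, attr in enumerate(attrs):
                    if attr in conditions:
                        continue
                    for value in sorted({r[i] for r in covered}):
                        subset = [r for r in covered if r[i] == value]
                        p = sum(r[-1] == cls for r in subset)
                        key = (p / len(subset), p)   # p/t first, then larger p
                        if best is None or key > best[0]:
                            best = (key, attr, value, subset)
                _, attr, value, covered = best
                conditions[attr] = value
            rules.append((conditions, cls))
            # Remove the instances covered by the new rule.
            remaining = [r for r in remaining if r not in covered]
    return rules

if __name__ == "__main__":
    toy = [
        ("sunny", "no", "play"),
        ("sunny", "yes", "play"),
        ("rainy", "no", "play"),
        ("rainy", "yes", "stay"),
    ]
    for conditions, cls in prism(toy, ["outlook", "windy"]):
        print(conditions, "=>", cls)
```

Whether and how Weka's PRISM deviates from this scheme, in particular with respect to pruning, is exactly what the question above asks you to determine from its code.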

#### REPORTS AND DUE DATE

```
FOR THE PROJECT ASSIGNMENT PART (excluding the homework assignment part)
TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

(TOTAL: 15 points) PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(05 points) Dealing with attributes appropriately
(i.e. using nominal values instead of numeric
when appropriate, using as many of them
as possible, etc.)
(up to 5 extra credit points)
Trying to do "fancier" things with attributes
(i.e. combining two attributes highly correlated
into one, using background knowledge, etc.)

(TOTAL: 20 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(15 points) Description of the algorithm underlying the construction and
pruning of classification rules in Weka's PRISM code
(up to 5 extra credit points for an outstanding job)
(providing just a structural description of the code, i.e. a list of
classes and methods, will receive 0 points)

(TOTAL: 60 points) EXPERIMENTS
(TOTAL: 30 points each dataset) FOR EACH DATASET:
(06 points) ran a good number of experiments
to get familiar with the PRISM classification method and
different evaluation methods (%split, cross-validation,...)
(08 points) good description of the experiment setting and the results
(08 points) good analysis of the results of the experiments
(08 points) comparison of the results obtained with Prism and the
classifiers from previous project (ZeroR, ID3, and J4.8)
and argumentation of weaknesses and/or strengths of each of the
methods on this dataset, and argumentation of which method
should be preferred for this dataset and why.
(up to 5 extra credit points) excellent analysis of the results and
comparisons
(up to 10 extra credit points) running additional interesting experiments
selecting other classification attributes instead of the
required in this project statement ("private/public", "salary")

(TOTAL 5 points) SLIDES - how well do they summarize concisely
the results of the project? We suggest you summarize the
setting of your experiments and their results in a tabular manner.

```