WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 444X Data Mining and Knowledge Discovery in Databases - D Term 2004 
Project 3: Data Pre-processing, Mining, and Evaluation of Association Rules

PROF. CAROLINA RUIZ 

DUE DATE: This project is due on Wednesday, April 14 2004 at 12 NOON. 
------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is to mine the best sets of association rules possible from three different datasets.

PROJECT ASSIGNMENT

Readings: Read in great detail Section 4.5 from your textbook.

This project consists of two parts:

  1. Part I.
    See Peter Mardziel's solutions to this homework assignment.

    Mine association rules by hand from the loan applications dataset discussed in class:

    @relation credit-data
    
    @attribute credit_history {bad, unknown, good}
    @attribute debt {low, high}
    @attribute collateral {none, adequate}
    @attribute income {0-15, 15-35, >35}
    @attribute risk {low, moderate, high}
    
    @data
    bad, low, none, 0-15, high
    unknown, high, none, 15-35, high
    unknown, low, none, 15-35, moderate
    bad, low, none, 15-35, moderate
    unknown, low, adequate, >35, low
    unknown, low, none, >35, low
    unknown, high, none, 0-15, high
    bad, low, adequate, >35, moderate
    good, low, none, >35, low
    good, high, adequate, >35, low
    good, high, none, 0-15, high
    good, high, none, 15-35, moderate
    good, high, none, >35, low
    bad, high, none, 15-35, high
    
    by faithfully following the Apriori algorithm with minimum support = 20% and minimum confidence = 90%. That is, start by generating candidate itemsets and frequent itemsets level by level; after all frequent itemsets have been generated, produce from them all the rules with confidence greater than or equal to the minimum confidence. SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
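    The level-wise procedure described above can be sketched in Python. This is a hand-rolled illustration of the algorithm on the credit dataset, not Weka's implementation; the item encoding (attribute=value strings) and helper names are my own:

```python
from itertools import combinations

# The 14 loan-application instances from the assignment.
ATTRS = ["credit_history", "debt", "collateral", "income", "risk"]
DATA = [
    ("bad", "low", "none", "0-15", "high"),
    ("unknown", "high", "none", "15-35", "high"),
    ("unknown", "low", "none", "15-35", "moderate"),
    ("bad", "low", "none", "15-35", "moderate"),
    ("unknown", "low", "adequate", ">35", "low"),
    ("unknown", "low", "none", ">35", "low"),
    ("unknown", "high", "none", "0-15", "high"),
    ("bad", "low", "adequate", ">35", "moderate"),
    ("good", "low", "none", ">35", "low"),
    ("good", "high", "adequate", ">35", "low"),
    ("good", "high", "none", "0-15", "high"),
    ("good", "high", "none", "15-35", "moderate"),
    ("good", "high", "none", ">35", "low"),
    ("bad", "high", "none", "15-35", "high"),
]
# Represent each instance as a set of attribute=value items.
ROWS = [frozenset(f"{a}={v}" for a, v in zip(ATTRS, row)) for row in DATA]

def support(itemset):
    return sum(1 for r in ROWS if itemset <= r) / len(ROWS)

def apriori(min_sup=0.20, min_conf=0.90):
    items = sorted({i for r in ROWS for i in r})
    frequent = {}  # frequent itemset -> its support
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    while level:
        for s in level:
            frequent[s] = support(s)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # then prune candidates that have an infrequent k-subset.
        cands = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = [c for c in cands
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, len(c) - 1))
                 and support(c) >= min_sup]
    # Rule generation: split each frequent itemset s into LHS -> RHS and
    # keep rules with confidence = support(s) / support(LHS) >= min_conf.
    rules = []
    for s, sup in frequent.items():
        for n in range(1, len(s)):
            for lhs in map(frozenset, combinations(s, n)):
                conf = sup / frequent[lhs]
                if conf >= min_conf:
                    rules.append((sorted(lhs), sorted(s - lhs), sup, conf))
    return frequent, rules

frequent, rules = apriori()
print(f"{len(frequent)} frequent itemsets, {len(rules)} rules")
```

    Note that with 14 instances, the 20% support threshold means an itemset must occur in at least 3 instances (20% of 14 is 2.8).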

  2. Part II.
    Use the Apriori implementation in Weka to mine association rules from the following two datasets.

    1. Datasets: Consider the following sets of data:

      1. The census-income dataset from the US Census Bureau which is available at the Univ. of California Irvine Repository.
        The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a binary class attribute classifying the income of the person as belonging to one of two categories: >50K or <=50K.

      2. The Titanic Dataset. Look at the dataset description and the Data instances.

        I suggest you use the following nominal values for the attributes rather than 0's and 1's to make the association rules easier to read:

        Class (0 = crew, 1 = first, 2 = second, 3 = third)
        Age   (1 = adult, 0 = child)
        Sex   (1 = male, 0 = female)
        Survived (1 = yes, 0 = no)
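        The recoding suggested above can be done with a short script. The sketch below (my own helper, not part of the dataset distribution) maps the 0/1 codes of the raw Titanic data to the suggested nominal labels and emits the corresponding arff lines:

```python
# Map the numeric codes of the raw Titanic data to nominal labels.
CLASS_LABELS = {"0": "crew", "1": "first", "2": "second", "3": "third"}
AGE_LABELS = {"1": "adult", "0": "child"}
SEX_LABELS = {"1": "male", "0": "female"}
SURVIVED_LABELS = {"1": "yes", "0": "no"}

def to_nominal(row):
    """row is (class, age, sex, survived) as code strings."""
    c, a, s, v = row
    return (CLASS_LABELS[c], AGE_LABELS[a], SEX_LABELS[s], SURVIVED_LABELS[v])

def titanic_arff(rows):
    """Build an arff file (as a string) with nominal attribute values."""
    header = [
        "@relation titanic",
        "@attribute class {crew, first, second, third}",
        "@attribute age {adult, child}",
        "@attribute sex {male, female}",
        "@attribute survived {yes, no}",
        "@data",
    ]
    return "\n".join(header + [", ".join(to_nominal(r)) for r in rows])

print(titanic_arff([("1", "1", "0", "1"), ("0", "1", "1", "0")]))
```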
        

    2. Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

      1. Load the data. Note that you need to translate the dataset into the arff format first.
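        As one way to do the translation, a minimal CSV-to-arff converter for all-nominal data can be written in a few lines. This is a hypothetical helper of my own, not a Weka utility; it treats every column as nominal and collects the distinct values for each @attribute declaration:

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    """Minimal CSV -> ARFF sketch: treats every column as nominal and
    collects its distinct values for the @attribute declarations."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names = [n.strip() for n in rows[0]]
    data = [[v.strip() for v in r] for r in rows[1:]]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(names):
        values = sorted({r[i] for r in data})
        lines.append(f"@attribute {name} {{{', '.join(values)}}}")
    lines += ["", "@data"] + [", ".join(r) for r in data]
    return "\n".join(lines)

print(csv_to_arff("sex,class\nmale,>50K\nfemale,<=50K", "census"))
```

        Numeric columns (age, fnlwgt, etc.) would instead need a "@attribute name numeric" declaration; handling that is left to your preprocessing.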

      2. Preprocessing of the Data:

        A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data in order to obtain useful patterns, preprocess the data yourself, for instance by writing the necessary filters (you can incorporate them into Weka if you wish).

        In particular,

        • explore different ways of discretizing (if needed) continuous attributes. That is, convert numeric attributes into "nominal" ones by binning numeric values into intervals; see the weka.filter.DiscretizeFilter in Weka. Play with the filter and read the Java code implementing it.
        • explore different ways of removing missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it.
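        As a rough sketch of what these two filters do, the functions below implement equal-width binning and most-frequent-value replacement in Python. These are illustrations of the general techniques under my own simplifying assumptions; Weka's filters offer more options (e.g., equal-frequency binning, per-class handling):

```python
def discretize_equal_width(values, bins=3):
    """Bin numeric values into `bins` equal-width intervals, returning
    nominal labels such as '(15.0-25.0]'."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against all-equal values
    labels = []
    for v in values:
        b = min(int((v - lo) / width), bins - 1)  # clamp max value into last bin
        labels.append(f"({lo + b * width:.1f}-{lo + (b + 1) * width:.1f}]")
    return labels

def replace_missing(values):
    """Replace '?' with the most frequent non-missing value (nominal case);
    for numeric attributes one would use the mean instead."""
    present = [v for v in values if v != "?"]
    mode = max(set(present), key=present.count)
    return [mode if v == "?" else v for v in values]

print(discretize_equal_width([15, 25, 35, 45], bins=3))
print(replace_missing(["a", "?", "a", "b"]))
```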

        To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting association rules are easy to read.

      3. Mining of Association Rules: The following are guidelines for the construction of your association rules:

        • Code: Use the Apriori algorithm to generate association rules implemented in the Weka system. Read the Weka code implementing Apriori in great detail (you need to describe the algorithm used in Apriori in your written report). Read in great detail Section 4.5 from your textbook.

        • Training and Testing Instances:

          You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is likely). But remember that the more interesting your set of rules, the better.

        • Input Parameters: Run multiple experiments by modifying the input data and the input parameters offered by the Weka implementation of Apriori. These input parameters include confidence, support, minimum number of rules, and others.
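        If you prefer the command line to the Explorer, these parameters map onto flags of Weka's Apriori class. The flag values below are illustrative, and the path to weka.jar depends on your installation:

```shell
# -t: training file, -N: number of rules to find,
# -C: minimum confidence, -M: lower bound on minimum support
java -cp weka.jar weka.associations.Apriori -t titanic.arff -N 30 -C 0.9 -M 0.2
```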

REPORTS AND DUE DATE


GRADING CRITERIA

TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

---------------------------------------------------------------------------
(TOTAL: 25 points) FOR PART I OF THE PROJECT
Following the Apriori algorithm by hand over the loan applications dataset.
(15 points) generation of the candidate itemsets and frequent itemsets
            level by level
(10 points) generation of the association rules from the frequent itemsets

---------------------------------------------------------------------------
(TOTAL: 75 points) FOR PART II OF THE PROJECT

(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE
(05 points) Description of the algorithm underlying the Weka filters used
(10 points) Description of the Apriori algorithm for the construction of
            frequent itemsets and association rules. 
(up to 5 extra credit points for an outstanding job) 
(providing just a structural description of the code, i.e. a list of 
classes and methods, will receive 0 points)

(TOTAL: 10 points) PRE-PROCESSING OF THE DATASET:
(05 points) Discretizing attributes as needed
(05 points) Dealing with missing values appropriately
(up to 5 extra credit points) 
           Trying to do "fancier" things with attributes
           (e.g., combining two highly correlated attributes
            into one, using background knowledge, etc.)
    
(TOTAL: 46 points) EXPERIMENTS
(TOTAL: 23 points each dataset) FOR EACH DATASET:
       (05 points) ran a good number of experiments to get familiar with the 
                   Apriori algorithm varying the input parameters 
       (05 points) good description of the experiment setting and the results 
       (08 points) good analysis of the results of the experiments
                   INCLUDING discussion of particularly interesting association 
                   rules obtained.
       (05 points) comparison of the association rules obtained by Apriori and 
                   the classification rules obtained by Prism in project 2.
                    Argumentation of weaknesses and/or strengths of each of the
                   methods on this dataset, and argumentation of which method
                   should be preferred for this dataset and why. 
       (up to 5 extra credit points) excellent analysis of the results and 
                                     comparisons
       (up to 10 extra credit points) running additional interesting experiments

(TOTAL 4 points) SLIDES - how well do they summarize concisely
        the results of the project? We suggest you summarize the
        setting of your experiments and their results in a tabular manner.
   (up to 6 extra credit points) for excellent summary and presentation of results 
   in the slides.