WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2010 
Homework and Project 4: Association Rules

PROF. CAROLINA RUIZ 

DUE DATES: Friday, Dec. 3, 9:00 am (electronic submission) and 11:00 am (hardcopy submission) 
------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold.

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read in great detail Sections 6.1, 6.2, 6.3, and 6.7 of your textbook.

This project consists of two parts:

  • Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See Solutions to this homework assignment by Yutao Wang.

    Consider the zoo.arff dataset, converted to ARFF format from the Zoo Data Set available at the Univ. of California Irvine KDD Data Repository.

    1. Load this dataset into Weka. Remove the 1st attribute (animal_name), which is a string. Go to "Associate" and run Apriori with "numRules = 30", "outputItemSets = True", "verbose = True", and default values for the remaining parameters.
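
    For reference, the same run can be reproduced outside the GUI. Below is a minimal Java sketch using the Weka API (class and setter names as in Weka 3.6; the file path is an assumption); the step 2 variation is noted in a comment.

      import weka.associations.Apriori;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;
      import weka.filters.Filter;
      import weka.filters.unsupervised.attribute.Remove;

      public class ZooApriori {
          public static void main(String[] args) throws Exception {
              // Load the dataset (file path is an assumption).
              Instances data = DataSource.read("zoo.arff");

              // Remove the 1st attribute (animal_name), a string attribute.
              Remove remove = new Remove();
              remove.setAttributeIndices("1");
              remove.setInputFormat(data);
              data = Filter.useFilter(data, remove);

              // Configure Apriori as in this step; everything else stays
              // at its default value.
              Apriori apriori = new Apriori();
              apriori.setNumRules(30);
              apriori.setOutputItemSets(true);
              apriori.setVerbose(true);
              // For step 2 below, additionally:
              // apriori.setTreatZeroAsMissing(true);

              apriori.buildAssociations(data);
              System.out.println(apriori); // prints itemsets and rules
          }
      }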

    2. Now run Apriori with "numRules = 30", "outputItemSets = True", "verbose = True", "treatZeroAsMissing = True", and default values for the remaining parameters.

      1. [5 points] What difference do you see between the rules obtained in Parts 1 and 2 above? Explain.

      2. [5 points] From now on, consider just the second set of rules (that is, the ones obtained with "treatZeroAsMissing = True"). Choose an association rule you find interesting and explain it. Include the rule's confidence and support values in your explanation.

      3. [10 points] What are "lift", "leverage", and "conviction"? Provide an explicit formula for each one of them (look at the Weka code to find those formulas). Use the values of these metrics for the association rule you chose in the previous part to judge how interesting/useful this rule is.
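
      For orientation, the definitions commonly used in the literature are sketched below in LaTeX notation, for a rule A => B; the question still asks you to confirm the exact formulas (and counting conventions) in Weka's source.

        \mathrm{lift}(A \Rightarrow B) = \frac{P(A \wedge B)}{P(A)\,P(B)} = \frac{\mathrm{conf}(A \Rightarrow B)}{P(B)}

        \mathrm{leverage}(A \Rightarrow B) = P(A \wedge B) - P(A)\,P(B)

        \mathrm{conviction}(A \Rightarrow B) = \frac{P(A)\,P(\neg B)}{P(A \wedge \neg B)}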

      4. Look at the itemsets generated. Let's consider in particular the generation of 5-itemsets from 4-itemsets:
        Minimum support: 0.35 (35 instances)
        
        ...
        
        Size of set of large itemsets L(4): 8
        
        Large Itemsets L(4):
        hair=1 milk=1 toothed=1 backbone=1 38
        hair=1 milk=1 toothed=1 breathes=1 38
        hair=1 milk=1 backbone=1 breathes=1 39
        hair=1 toothed=1 backbone=1 breathes=1 38
        milk=1 toothed=1 backbone=1 breathes=1 40
        milk=1 backbone=1 breathes=1 tail=1 35
        toothed=1 backbone=1 breathes=1 legs=4 35
        toothed=1 backbone=1 breathes=1 tail=1 38
        
        Size of set of large itemsets L(5): 1
        
        Large Itemsets L(5):
        hair=1 milk=1 toothed=1 backbone=1 breathes=1 38
        

        1. [5 points] State what the "join" condition is (called "merge" in the F(k-1) x F(k-1) method in your textbook, p. 341). Show how the "join" condition was used to generate 5-itemsets from 4-itemsets. (Warning: not all candidate 5-itemsets are shown above.)

        2. [5 points] State what the "subset" condition is (called "candidate pruning" in the F(k-1) x F(k-1) method in your textbook, p. 341). Show how the "subset" condition was used to eliminate candidate 5-itemsets from consideration before unnecessarily counting their support.
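
        To make both conditions concrete, here is a minimal, self-contained Java sketch (not Weka's actual code) that joins the L(4) itemsets listed above and prunes any candidate with an infrequent 4-subset; items are kept in Weka's attribute order, as in the output.

          import java.util.*;

          public class AprioriGen {
              public static void main(String[] args) {
                  // L(4) copied from the Weka output above.
                  String[][] l4 = {
                      {"hair=1","milk=1","toothed=1","backbone=1"},
                      {"hair=1","milk=1","toothed=1","breathes=1"},
                      {"hair=1","milk=1","backbone=1","breathes=1"},
                      {"hair=1","toothed=1","backbone=1","breathes=1"},
                      {"milk=1","toothed=1","backbone=1","breathes=1"},
                      {"milk=1","backbone=1","breathes=1","tail=1"},
                      {"toothed=1","backbone=1","breathes=1","legs=4"},
                      {"toothed=1","backbone=1","breathes=1","tail=1"}};
                  List<List<String>> lk = new ArrayList<>();
                  for (String[] s : l4) lk.add(Arrays.asList(s));
                  Set<List<String>> frequent = new HashSet<>(lk);

                  for (int i = 0; i < lk.size(); i++)
                      for (int j = i + 1; j < lk.size(); j++) {
                          List<String> a = lk.get(i), b = lk.get(j);
                          // Join: merge two 4-itemsets that agree on
                          // their first 3 items.
                          if (!a.subList(0, 3).equals(b.subList(0, 3)))
                              continue;
                          List<String> cand = new ArrayList<>(a);
                          cand.add(b.get(3));
                          // Subset check: every 4-subset of the candidate
                          // must itself be in L(4), else prune.
                          boolean kept = true;
                          for (int d = 0; d < cand.size() && kept; d++) {
                              List<String> sub = new ArrayList<>(cand);
                              sub.remove(d);
                              kept = frequent.contains(sub);
                          }
                          System.out.println(cand
                              + (kept ? "  kept" : "  pruned"));
                      }
              }
          }

        Running it produces exactly one surviving 5-itemset, matching L(5) above; the other join result fails the subset check.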

      5. [10 points] Consider the following frequent 4-itemset:
        milk=1 backbone=1 breathes=1 tail=1 
        
        Use Algorithms 6.2 and 6.3 (pp. 351-352), which are based on Theorem 6.2, to construct all rules with Confidence = 100% from this 4-itemset. Show your work by neatly constructing a lattice similar to the one depicted in Figure 6.15 (but you don't need to expand/include pruned rules).
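
        As a cross-check for this question, here is a minimal Java sketch of the first level of Algorithm 6.2 (single-item consequents) applied to this 4-itemset. The 3-subset support counts below are hypothetical placeholders, since that part of Weka's itemset output is elided above; substitute the real counts before drawing conclusions.

          import java.util.*;

          public class ConfRules {
              public static void main(String[] args) {
                  List<String> f = Arrays.asList("milk=1", "backbone=1",
                                                 "breathes=1", "tail=1");
                  int suppF = 35; // support count from L(4) above

                  // Hypothetical support counts of the 3-subsets.
                  Map<Set<String>, Integer> supp = new HashMap<>();
                  supp.put(new HashSet<>(Arrays.asList("backbone=1",
                      "breathes=1", "tail=1")), 40);     // hypothetical
                  supp.put(new HashSet<>(Arrays.asList("milk=1",
                      "breathes=1", "tail=1")), 35);     // hypothetical
                  supp.put(new HashSet<>(Arrays.asList("milk=1",
                      "backbone=1", "tail=1")), 35);     // hypothetical
                  supp.put(new HashSet<>(Arrays.asList("milk=1",
                      "backbone=1", "breathes=1")), 41); // hypothetical

                  // Level 1 of Algorithm 6.2: move one item at a time into
                  // the consequent; conf = supp(f) / supp(antecedent).
                  for (String item : f) {
                      Set<String> ant = new HashSet<>(f);
                      ant.remove(item);
                      double conf = (double) suppF / supp.get(ant);
                      System.out.printf("%s ==> [%s]  conf = %.2f%n",
                          ant, item, conf);
                      // Theorem 6.2: if conf < minconf here, every rule
                      // whose consequent contains `item` is pruned too.
                  }
                  // Algorithm 6.3 would now merge the surviving 1-item
                  // consequents into 2-item consequents and iterate.
              }
          }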

    3. [5 points] Explain how the process of mining association rules in Weka's Apriori is performed in terms of the following parameters: lowerBoundMinSupport, upperBoundMinSupport, delta, metricType, minMetric, numRules. (A toy sketch of this iterative loop appears at the end of Part I below.)

    4. [10 points] Exercise 16, p. 411 of the textbook.
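
    Regarding step 3 above: Weka's documentation describes an iterative search that starts near upperBoundMinSupport and repeatedly lowers the minimum support by delta until numRules rules score at least minMetric on the chosen metricType, or lowerBoundMinSupport is reached. The toy Java sketch below illustrates that loop; countRules is a hypothetical stand-in for a full Apriori pass with a fake rule count, and the exact stepping should be confirmed in the Weka source.

      public class DeltaLoop {
          // Hypothetical stand-in for a full Apriori pass at the given
          // minimum support: pretend the rule count grows as the support
          // threshold is relaxed.
          static int countRules(double minSupport) {
              return (int) Math.round(60 * (1.0 - minSupport));
          }

          public static void main(String[] args) {
              double upper = 1.0, lower = 0.1, delta = 0.05; // Weka defaults
              int numRules = 30;
              double minSupport = upper - delta; // first pass below upper
              while (countRules(minSupport) < numRules
                      && minSupport - delta >= lower) {
                  minSupport -= delta; // relax support and mine again
              }
              System.out.printf("stopped at minSupport = %.2f with %d rules%n",
                  minSupport, countRules(minSupport));
          }
      }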

  • Part II. GROUP PROJECT ASSIGNMENT