WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2012 
Homework and Project 4: Classification Rules and Association Rules

Prof. Carolina Ruiz and Ken Loomis 

DUE DATES: Friday, Nov. 30, 11:00 am (electronic submission) and 1:00 pm (hardcopy submission) 
------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold:

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read Sections 5.1, 6.1, 6.2, 6.3, 6.7 of your textbook in great detail.

This project consists of two parts:

  1. Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See solutions by Ken Loomis.

    Consider the reduced_userprofile.arff dataset. This dataset was constructed from the userprofile.csv file of the Restaurant & consumer data available at the UCI Data Repository.

    1. Classification Rules using RIPPER:
      Use Weka, Excel, your own code, or another application to help you calculate the metrics used by RIPPER (a small sketch of the rule-growing metric appears after this item).
      1. (20 points) Construct the first rule that RIPPER would build over this dataset. Show your work and explain each step of the process.
      2. (10 points) Describe in words the process that RIPPER would follow to prune this rule. (You do not need to prune the rule; just describe what RIPPER would do.)
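      Below is a minimal sketch (not RIPPER's actual implementation) of FOIL's information gain, the metric RIPPER uses to choose the next conjunct while growing a rule. The example counts are hypothetical and are not taken from reduced_userprofile.arff.

        import math

        def foil_gain(p0, n0, p1, n1):
            """FOIL information gain for extending rule R0 (covers p0 positive /
            n0 negative instances) into R1 (covers p1 positive / n1 negative)
            by adding one conjunct."""
            if p1 == 0:
                return float("-inf")   # the extended rule covers no positives
            return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

        # Hypothetical counts: the current rule covers 70 positives and 68 negatives;
        # adding a candidate conjunct leaves 55 positives and 30 negatives covered.
        print(foil_gain(p0=70, n0=68, p1=55, n1=30))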

    2. Association Rules using Apriori:
      Use Weka, Excel, your own code, or another application to help you calculate the support of the itemsets constructed below.
      1. Generate all frequent itemsets over this dataset, following the Apriori algorithm level by level. Show each step of the process. Let min-support = 40% (that is, min-support-count = 55 instances).
        1. (5 points) State what the "join" condition is (called "merge" in the F_{k-1} × F_{k-1} method in your textbook, p. 341).
        2. (5 points) State what the "subset" condition is (called "candidate pruning" in the F_{k-1} × F_{k-1} method in your textbook, p. 341).
        3. For each level,
          1. (10 points) Show how the "join" condition was used to generate k-itemsets (this level's itemsets) from frequent (k-1)-itemsets (previous level's frequent itemsets).
          2. (10 points) Show how the "subset" condition was used to eliminate candidate itemsets from consideration before unnecessarily counting their support.
          3. (10 points) Count support for all remaining itemsets in the level.
        4. (5 points) What is the termination condition for this process? Explain. (A sketch of the level-by-level procedure follows this item.)
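      The following is a minimal level-by-level sketch of frequent-itemset generation, under the assumptions that each instance has already been converted into a set of attribute=value items and that min-support is given as an absolute count (55 here); it is not Weka's implementation.

        from itertools import combinations

        def apriori_itemsets(transactions, min_support_count=55):
            """Frequent-itemset generation, level by level (F_{k-1} x F_{k-1} style).
            Itemsets are represented as sorted tuples of items."""
            # Level 1: count individual items and keep the frequent ones.
            counts = {}
            for t in transactions:
                for item in t:
                    counts[(item,)] = counts.get((item,), 0) + 1
            frequent = {c: v for c, v in counts.items() if v >= min_support_count}
            all_frequent = dict(frequent)

            k = 2
            while frequent:                      # terminate when a level produces nothing
                prev = sorted(frequent)
                # "Join"/merge: combine two frequent (k-1)-itemsets that agree on
                # their first k-2 items into a candidate k-itemset.
                candidates = set()
                for i in range(len(prev)):
                    for j in range(i + 1, len(prev)):
                        if prev[i][:-1] == prev[j][:-1]:
                            candidates.add(tuple(sorted(set(prev[i]) | set(prev[j]))))
                # "Subset"/candidate pruning: drop candidates that have an
                # infrequent (k-1)-subset before counting their support.
                candidates = {c for c in candidates
                              if all(s in frequent for s in combinations(c, k - 1))}
                # Support counting for the surviving candidates.
                counts = {c: sum(1 for t in transactions if set(c) <= set(t))
                          for c in candidates}
                frequent = {c: v for c, v in counts.items() if v >= min_support_count}
                all_frequent.update(frequent)
                k += 1
            return all_frequent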
      2. (10 points) What are "lift", "leverage", and "conviction"? Provide an explicit formula for each of them (look at the Weka code to find those formulas). Pick one association rule from those that you generate in the next part below, and use the values of these metrics to judge how interesting/useful this rule is. (A sketch of these metrics follows this item.)
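      A minimal sketch using the standard definitions of these metrics (which you should still confirm against the Weka source, as the question asks); the counts below are hypothetical.

        def rule_metrics(n, count_a, count_b, count_ab):
            """Confidence, lift, leverage and conviction for a rule A -> B,
            given absolute counts over n transactions."""
            p_a, p_b, p_ab = count_a / n, count_b / n, count_ab / n
            confidence = p_ab / p_a
            lift = confidence / p_b                      # P(A,B) / (P(A) * P(B))
            leverage = p_ab - p_a * p_b                  # P(A,B) - P(A) * P(B)
            conviction = (float("inf") if confidence == 1
                          else (1 - p_b) / (1 - confidence))
            return confidence, lift, leverage, conviction

        # Hypothetical counts: 138 instances, A covers 80, B covers 90,
        # and A and B occur together in 60.
        print(rule_metrics(138, 80, 90, 60))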
      3. Take the first frequent itemset of the last level you constructed above.
        1. (5 points) Generate all rules from this frequent itemset that have exactly 2 items in the right-hand-side of the rule.
        2. (10 points) For each rule, calculate its confidence and lift. (A sketch of this rule-generation step follows this item.)
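      A minimal sketch of generating the rules with a 2-item consequent from a frequent itemset and scoring them; the itemset {a, b, c} and its support values are hypothetical placeholders, not results from the dataset.

        from itertools import combinations

        def rules_with_two_item_consequent(itemset, support):
            """All rules X -> Y from a frequent itemset with |Y| = 2, together
            with their confidence and lift. `support` maps frozenset itemsets
            to relative support."""
            items = frozenset(itemset)
            rules = []
            for rhs in combinations(sorted(items), 2):
                rhs = frozenset(rhs)
                lhs = items - rhs
                if not lhs:
                    continue                     # the antecedent must be non-empty
                confidence = support[items] / support[lhs]
                lift = confidence / support[rhs]
                rules.append((lhs, rhs, confidence, lift))
            return rules

        # Hypothetical supports for a frequent 3-itemset {a, b, c}.
        support = {frozenset("abc"): 0.42, frozenset("ab"): 0.55, frozenset("ac"): 0.50,
                   frozenset("bc"): 0.48, frozenset("a"): 0.70, frozenset("b"): 0.65,
                   frozenset("c"): 0.60}
        for lhs, rhs, conf, lift in rules_with_two_item_consequent("abc", support):
            print(set(lhs), "->", set(rhs), "conf=%.2f" % conf, "lift=%.2f" % lift)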
      4. (5 points) Explain how the process of mining association rules in Weka's Apriori is performed in terms of the following parameters: lowerBoundMinSupport, upperBoundMinSupport, delta, metricType, minMetric, numRules. (A rough sketch of this control loop follows this item.)
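      A rough sketch (not Weka's code) of how these parameters interact; mine_rules is a hypothetical stand-in for one mining pass, and details such as the exact starting support should be verified against the Weka source.

        def apriori_control_loop(mine_rules, lower_bound_min_support=0.1,
                                 upper_bound_min_support=1.0, delta=0.05,
                                 min_metric=0.9, num_rules=10):
            """mine_rules(min_support, min_metric) is assumed to return the rules
            meeting both thresholds, sorted by the chosen metricType score."""
            min_support = upper_bound_min_support - delta   # start just below the upper bound
            while True:
                rules = mine_rules(min_support, min_metric)
                # Stop once enough rules are found or support cannot be lowered further.
                if len(rules) >= num_rules or min_support <= lower_bound_min_support:
                    return rules[:num_rules]
                min_support -= delta                        # otherwise relax support and retry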

  2. Part II. GROUP PROJECT ASSIGNMENT

  3. Dataset: We will work with the same dataset used in Projects 1, 2, and 3. The following 2 files contain the dataset:
    Important: For all experiments, perform missing-value replacement for the target attribute: replace the missing values with a new nominal value called "Missing", or use the dataset that you may have saved for Project 2, as suggested at the beginning of the moderate challenge. (A small preprocessing sketch follows this item.)
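    A minimal preprocessing sketch under the assumption that the dataset is available as a CSV export; the file name and the attribute name "target" are placeholders for the actual project files and target attribute.

      import pandas as pd

      # Placeholder file and column names: replace missing values of the target
      # attribute with the new nominal value "Missing" before any experiment.
      df = pd.read_csv("project_dataset.csv")
      df["target"] = df["target"].fillna("Missing")
      df.to_csv("project_dataset_missing_replaced.csv", index=False)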

  4. Challenges: In each of the following challenges, provide a detailed description of the preprocessing techniques used, the motivation for using these techniques, and any hypothesis/intuition gained about the information represented in the dataset. Answer the questions provided, and include the information described in the PROJECT GUIDELINES.
  5. Grading sheet for this project.