WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2014 
Homework and Project 4: Classification Rules and Association Rules

Prof. Carolina Ruiz 

DUE DATES: Monday, Dec. 8th, 9:00 am (electronic submission) and 11:00 am (hardcopy submission) 
------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold:

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read Sections 5.1, 6.1, 6.2, 6.3, 6.7 of your textbook in great detail.

This project consists of two parts:

  1. Part I. GROUP HOMEWORK ASSIGNMENT

    See solutions to this homework:

    Consider the following cars dataset. This dataset was constructed by taking a small sample of data instances and attributes from the Car Evaluation Dataset available at the UCI Data Repository.

    @relation 'cars-weka.filters.unsupervised.attribute.Remove-R3,5-weka.filters.unsu
    pervised.instance.Resample-S1-Z2.0'
    
    @attribute buying {vhigh,high,med,low}
    @attribute maint {vhigh,high,med,low}
    @attribute persons {2,4,more}
    @attribute safety {low,med,high}
    @attribute class {unacc,acc,good}
    
    @data
    med,vhigh,more,low,unacc
    med,vhigh,2,med,unacc
    vhigh,vhigh,more,med,unacc
    med,high,4,low,unacc
    high,med,4,high,good
    low,med,2,med,unacc
    low,high,2,high,unacc
    low,vhigh,more,med,acc
    med,vhigh,4,med,acc
    med,vhigh,4,med,acc
    vhigh,vhigh,4,med,unacc
    med,med,more,med,acc
    med,vhigh,2,med,unacc
    med,med,4,low,unacc
    med,vhigh,more,low,unacc
    med,low,4,med,acc
    high,low,2,high,unacc
    high,med,4,low,unacc
    med,low,4,low,unacc
    high,high,4,low,unacc
    low,med,4,high,good
    low,low,2,high,unacc
    
    1. Classification Rules using RIPPER:
      You can choose between two possibilities:
      Option 1: Solve part (a) below by hand. In this case you have to answer part (b) below too.
      Option 2: Write your own code to solve part (a) below. In this case you don't have to answer part (b) below. Submit your code by email (with instructions on how to run it) and it will be read for correctness and tested. Your code will be graded over 25 points (separately from the score you receive in part (a)).

      1. Use Weka, Excel, your own code (in a programming language of your choice), or other application to help you calculate the metrics used by RIPPER.
        1. (20 points) Construct the first rule that RIPPER would contruct over this dataset. Show your work and explain each step of the process.
        2. (10 points) Describe in words the process that RIPPER would follow to prune this rule. (No need for you to prune the rule, you just need to describe what RIPPER would do.)

      2. Existing real-world application of Classification Rules:
        Each group should search for an existing real-world successful application of Classification Rules. This sucessful story should be about mining these rule-based models from data to discover novel and useful patterns that have made a difference in a certain industry or field. The application domain is up to the group (e.g., finance, sports, healthcare, science, ...). The application must be a current application (ideally not older than 5 years).
        Your homework report for this part should contain (at most 1 page):

        • (10 points) Name and description of the chosen application.
        • (10 points) Detailed description of how this application uses Classification Rules.
        • (5 points) List of (recent) references and sources that you used to identify and investigate this application.

    2. Association Rules using Apriori:
      You can choose between two possibilities:
      Option 1: Solve part (i) below by hand. In this case you have to answer part (ii) below too.
      Option 2: Write your own code to solve part (i) below. In this case you don't have to answer part (ii) below. Submit your code by email (with instructions on how to run it) and it will be read for correctness and tested. Your code will be graded over 25 points (separately from the score you receive in part (i)).

      1. Use Weka, Excel, your own code (in a programming language of your choice), or other application to help you calculate support of the itemsets constructed below. If you write your own correct, well documented code to solve this part of homework, you will ...
        1. Generate all frequent itemsets over this dataset, following the Apriori algorithm level by level. Show each step of the process. Let min-support-count = 3 instances (that is, min-support about 13%).
          1. (5 points) State what the "join" condition is (called "merge" in the Fk-1xFk-1 method in your textbook p. 341).
          2. (5 points) State what the "subset" condition is (called "candidate pruning" in the Fk-1xFk-1 method in your textbook p. 341).
          3. For each level,
            1. (10 points) Show how the "join" condition was used to generate k-itemsets (this level's itemsets) from frequent (k-1)-itemsets (previous level's frequent itemsets).
            2. (10 points) Show how the "subset" condition was used to eliminate candidate itemsets from consideration before unnecessarily counting their support.
            3. (10 points) Count support for all remaining itemsets in the level.
          4. (5 points) What's the termination condition for this process? Explain.
        2. (10 points) What are "lift", "leverage", and "conviction"? Provide an explicit formula for each one of them (look at the Weka code to find those formulas). Pick one association rule from those that you generate in the next part below, and use the values of these metrics for this association rule to judge how interesting/useful this rule is.
        3. Take the first frequent itemset of the last level you constructed above.
          1. (5 points) Generate all rules from this frequent itemset that have exactly 2 items in the right-hand-side of the rule.
          2. (10 points) For each rule, calculate the confidence and the lift of the rule.
        4. (5 points) Explain how the processs of mining association rules in Weka's Apriori is performed in terms of the following parameters: lowerBoundMinSupport, upperBoundMinSupport, delta, metricType, minMetric, numRules.

      2. Existing real-world application of Association Rules:
        Each group should search for an existing real-world successful application of Association Rules. This sucessful story should be about mining association rules from data to discover novel and useful patterns that have made a difference in a certain industry or field. The application domain is up to the group (e.g., finance, sports, healthcare, science, ...). The application must be a current application (ideally not older than 5 years).
        Your homework report for this part should contain (at most 1 page):

        • (10 points) Name and description of the chosen application.
        • (10 points) Detailed description of how this application uses Association Rules.
        • (5 points) List of (recent) references and sources that you used to identify and investigate this application.

  2. Part II. GROUP PROJECT ASSIGNMENT