WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2010 
Homework and Project 3: Classification Rules and Instance Based Learning

PROF. CAROLINA RUIZ 

DUE DATES: Tuesday, Nov. 23, 8:00 am (electronic submission) and 11:00 am (hardcopy submission) 
------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold: to gain experience constructing classification rules by hand with the sequential covering algorithm, to study how the RIPPER algorithm prunes rules using a validation set, and to implement and experiment with variations of the k-nearest neighbors algorithm.

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read in great detail Sections 5.1 and 5.2 of your textbook.

This project consists of two parts:

  • Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See Solutions to this homework assignment by Yutao Wang.

    1. Consider the following dataset, adapted from the Shuttle Landing Control Data Set available at the University of California Irvine (UCI) Machine Learning Data Repository. Visit that repository to learn more about this dataset.
      @relation shuttle-landing-control
      
      @attribute STABILITY   numeric
      @attribute ERROR       numeric
      @attribute WIND        {head, tail}
      @attribute VISIBILITY  {yes, no}
      @attribute Class       {noauto, auto}
      
      @data
      ( 1) 60, 0.5, tail, no,  auto
      ( 2) 75, 1.0, head, yes, noauto
      ( 3) 40, 0.9, head, no,  auto
      ( 4) 65, 0.0, head, no,  auto
      ( 5) 45, 0.2, head, yes, auto
      ( 6) 80, 0.1, tail, yes, noauto
      ( 7) 30, 0.4, head, yes, noauto
      ( 8) 90, 0.6, head, no,  auto
      ( 9) 65, 0.1, head, no,  auto
      (10) 85, 0.5, head, yes, noauto
      (11) 25, 0.6, tail, yes, auto
      (12) 40, 0.4, tail, yes, noauto
      (13) 15, 0.6, tail, yes, noauto
      (14) 25, 0.8, head, yes, noauto
      (15) 30, 0.2, head, yes, auto
      (16) 35, 0.4, head, yes, noauto
      (17) 70, 0.6, tail, no,  auto
      (18) 20, 0.5, tail, yes, auto
      (19) 75, 0.1, tail, no,  auto
      (20) 80, 0.2, head, yes, noauto
      (21) 85, 0.8, tail, yes, noauto
      (22) 60, 0.9, tail, yes, noauto
      

      (50 points) Classification Rules

      [See solutions to a similar problem from a previous offering of this course.]

      In this part, you will construct classification rules using the sequential covering algorithm (called Prism in Weka). Note that the dataset contains continuous attributes. Handle those continuous attributes as J4.8 would handle them, that is, using binary splits. To reduce the amount of work, consider only the following split points:

         split point for STABILITY: 50
         split points for ERROR:    0.3 and 0.7
      
      regardless of what values for those attributes are present in the (subset of the) dataset under consideration. That is, the only predicates that can appear in the rules are: STABILITY ≤ 50, STABILITY > 50, ERROR ≤ 0.3, ERROR > 0.3, ERROR ≤ 0.7, ERROR > 0.7.
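
      For concreteness, the following sketch (in Python; a minimal illustration, not part of the required solution, and the helper names are my own) enumerates these candidate predicates, together with the value tests for the nominal attributes, and computes their p/t ratios over a given set of instances. Note that during rule construction the ratios must be recomputed over only the instances still under consideration at each stage, so pass the appropriate subset rather than the full dataset:

        # Training data: (STABILITY, ERROR, WIND, VISIBILITY, Class)
        data = [
            (60, 0.5, "tail", "no",  "auto"),    # ( 1)
            (75, 1.0, "head", "yes", "noauto"),  # ( 2)
            (40, 0.9, "head", "no",  "auto"),    # ( 3)
            (65, 0.0, "head", "no",  "auto"),    # ( 4)
            (45, 0.2, "head", "yes", "auto"),    # ( 5)
            (80, 0.1, "tail", "yes", "noauto"),  # ( 6)
            (30, 0.4, "head", "yes", "noauto"),  # ( 7)
            (90, 0.6, "head", "no",  "auto"),    # ( 8)
            (65, 0.1, "head", "no",  "auto"),    # ( 9)
            (85, 0.5, "head", "yes", "noauto"),  # (10)
            (25, 0.6, "tail", "yes", "auto"),    # (11)
            (40, 0.4, "tail", "yes", "noauto"),  # (12)
            (15, 0.6, "tail", "yes", "noauto"),  # (13)
            (25, 0.8, "head", "yes", "noauto"),  # (14)
            (30, 0.2, "head", "yes", "auto"),    # (15)
            (35, 0.4, "head", "yes", "noauto"),  # (16)
            (70, 0.6, "tail", "no",  "auto"),    # (17)
            (20, 0.5, "tail", "yes", "auto"),    # (18)
            (75, 0.1, "tail", "no",  "auto"),    # (19)
            (80, 0.2, "head", "yes", "noauto"),  # (20)
            (85, 0.8, "tail", "yes", "noauto"),  # (21)
            (60, 0.9, "tail", "yes", "noauto"),  # (22)
        ]

        # The only allowed predicates: the fixed binary splits for the
        # numeric attributes plus every value test for the nominal ones.
        candidates = [
            ("STABILITY <= 50",  lambda r: r[0] <= 50),
            ("STABILITY > 50",   lambda r: r[0] > 50),
            ("ERROR <= 0.3",     lambda r: r[1] <= 0.3),
            ("ERROR > 0.3",      lambda r: r[1] > 0.3),
            ("ERROR <= 0.7",     lambda r: r[1] <= 0.7),
            ("ERROR > 0.7",      lambda r: r[1] > 0.7),
            ("WIND = head",      lambda r: r[2] == "head"),
            ("WIND = tail",      lambda r: r[2] == "tail"),
            ("VISIBILITY = yes", lambda r: r[3] == "yes"),
            ("VISIBILITY = no",  lambda r: r[3] == "no"),
        ]

        def p_over_t(test, rows, target="noauto"):
            """p = covered instances of the target class, t = all covered instances."""
            covered = [r for r in rows if test(r)]
            p = sum(1 for r in covered if r[4] == target)
            return p, len(covered)

        for name, test in candidates:     # pass the remaining instances
            p, t = p_over_t(test, data)   # instead of `data` at later stages
            print(f"{name:18}  p/t = {p}/{t}" + (f" = {p / t:.2f}" if t else ""))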

      1. Assume that the algorithm has produced the following two rules for Class=noauto so far:
        If ERROR > 0.7
           and VISIBILITY = yes then noauto
        If VISIBILITY = yes
           and STABILITY > 50 then noauto
        
        Starting from here, follow the sequential covering algorithm to construct "by hand" the 3rd rule for Class=noauto.

        1. (5 points) List the instances that are under consideration during the construction of this 3rd rule for Class=noauto.

        2. (10 points) Now, construct the 3rd rule using those instances. Use the ratio p/t to rank the attribute-value pairs that are candidates for inclusion in the rule, where t is the total number of (remaining) instances covered by the candidate rule and p is the number of those instances that belong to the target class. Your written solutions should show all your work. That is, at each stage during the construction of this rule, list all the candidate attribute-value pairs together with their p/t ratios, state which one was selected, and explain why.

      2. Assume now that the algorithm has produced ALL the rules for Class=noauto. (You only have to construct the 3rd one, as described above, not all of them.) Starting from this point, follow the sequential covering algorithm to construct "by hand" the 1st rule for Class=auto. Follow the same steps as before:

        1. (5 points) List the instances that are under consideration during the construction of this 1st rule for Class=auto.

        2. (10 points) Now, construct this 1st rule using those instances. At each stage during the construction of this rule, list all the attribute-value pairs (together with their p/t ratios) that are candidates for inclusion in the rule, which one was selected, and why.

      3. (20 points) Rule Pruning. In this part, you will investigate how the RIPPER algorithm prunes a rule using a validation set. See Section 5.1, pp. 220-221, of your textbook, and the JRip method in Weka, under Classification Rules (click "More" to see an algorithmic description of the method, and also read the Weka code implementing it).

        Given the rule

        If VISIBILITY = yes
           and ERROR ≤ 0.7 
           and ERROR >  0.3 
           and STABILITY ≤ 50
           and WIND = tail 
        then Class=noauto
        
        and the validation set:
              STABILITY  ERROR   WIND   VISIBILITY  Class
        (v1 ) 35,        0.1,    head,  no,         auto
        (v2 ) 80,        0.6,    tail,  yes,        noauto
        (v3 ) 35,        0.1,    head,  no,         auto
        (v4 ) 10,        0.6,    tail,  yes,        noauto
        (v5 ) 40,        0.5,    tail,  yes,        auto
        (v6 ) 80,        0.6,    tail,  yes,        noauto
        (v7 ) 25,        0.4,    tail,  yes,        auto
        (v8 ) 80,        0.6,    tail,  yes,        auto
        (v9 ) 20,        0.6,    tail,  yes,        noauto
        (v10) 35,        0.1,    head,  no,         auto
        (v11) 40,        0.5,    head,  yes,        noauto
        (v12) 15,        0.4,    head,  yes,        noauto
        
        show each step of the pruning method used by RIPPER on this rule over the above validation set. Show your work.
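
        As a point of reference, here is a minimal sketch of this pruning process, assuming the textbook's formulation: the rule is scored on the validation set by the metric (p - n)/(p + n), where p and n are the numbers of positive and negative validation instances the rule covers, and conjuncts are considered for removal starting from the last one added. The helper names are my own, and the tie-breaking behavior may differ from JRip's actual implementation:

          # Validation set: (STABILITY, ERROR, WIND, VISIBILITY, Class)
          validation = [
              (35, 0.1, "head", "no",  "auto"),    # (v1)
              (80, 0.6, "tail", "yes", "noauto"),  # (v2)
              (35, 0.1, "head", "no",  "auto"),    # (v3)
              (10, 0.6, "tail", "yes", "noauto"),  # (v4)
              (40, 0.5, "tail", "yes", "auto"),    # (v5)
              (80, 0.6, "tail", "yes", "noauto"),  # (v6)
              (25, 0.4, "tail", "yes", "auto"),    # (v7)
              (80, 0.6, "tail", "yes", "auto"),    # (v8)
              (20, 0.6, "tail", "yes", "noauto"),  # (v9)
              (35, 0.1, "head", "no",  "auto"),    # (v10)
              (40, 0.5, "head", "yes", "noauto"),  # (v11)
              (15, 0.4, "head", "yes", "noauto"),  # (v12)
          ]

          # The rule's conjuncts, in the order in which they were added.
          conjuncts = [
              ("VISIBILITY = yes", lambda r: r[3] == "yes"),
              ("ERROR <= 0.7",     lambda r: r[1] <= 0.7),
              ("ERROR > 0.3",      lambda r: r[1] > 0.3),
              ("STABILITY <= 50",  lambda r: r[0] <= 50),
              ("WIND = tail",      lambda r: r[2] == "tail"),
          ]

          def score(rule, rows, target="noauto"):
              """Validation metric (p - n)/(p + n) for the conjunction `rule`."""
              covered = [r for r in rows if all(test(r) for _, test in rule)]
              p = sum(1 for r in covered if r[4] == target)
              n = len(covered) - p
              return (p - n) / (p + n) if covered else float("-inf")

          rule = list(conjuncts)
          while len(rule) > 1:
              kept = score(rule, validation)
              dropped = score(rule[:-1], validation)
              print(f"{kept:+.3f} with '{rule[-1][0]}', {dropped:+.3f} without")
              if dropped > kept:       # prune only on strict improvement here;
                  rule = rule[:-1]     # tie handling may differ in JRip itself
              else:
                  break
          print("pruned rule: IF", " AND ".join(name for name, _ in rule), "THEN noauto")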

      (50 points) Instance-Based Learning

      [See solutions to a similar problem from a previous offering of this course.]

      Assume that we want to predict the Class attribute (prediction target) of the following two new data instances:
           STABILITY  ERROR   WIND   VISIBILITY
      (23) 35,        0.1,    head,  no  
      (24) 80,        0.6,    tail,  yes
      
      using the k-nearest neighbors algorithm on the same training set of 22 instances above.

      For each of the variations of the k-nearest neighbors algorithm listed below, do the following:

      • (5 points per variation) Implement a program/script that uses the described distance (similarity) metric to find the 4 nearest neighbors of each of the two instances above. Make the output of your code as verbose as possible so that we can see the work it does, and include both the output and the code in your report. When computing the Euclidean distance between two instances, a nominal attribute contributes 1 if the two values differ and 0 if they are the same. (A minimal sketch of one possible implementation appears after the list of variations below.)
      • (5 points per variation) Use those 4 nearest neighbors to classify each of the two given instances (23) and (24) using the "voting" methods described below. Show your work in your report.

      Variations:

        Variation 1:
        • Data Attributes: Use the data attributes as provided.
        • Distance metric: (Plain) Euclidean distance.
        • "Voting" method: Majority vote with no distance weighting.

        Variation 2:
        • Data Attributes: Use the data attributes as provided.
        • Distance metric: (Plain) Euclidean distance.
        • "Voting" method: Majority vote using distance weighting: use the 4 nearest neighbors weighted by the inverse of the distance. That is, if their respective distances to the test instance are d1, d2, d3, and d4, then the weights of the 4 nearest neighbors are w1 = 1/d1, w2 = 1/d2, w3 = 1/d3, and w4 = 1/d4 for the weighted majority vote.

        Variation 3:
        • Data Attributes: Preprocess STABILITY so that it ranges from 0 to 1. That is, replace each STABILITY value with (value - min. value)/(max. value - min. value), where min. value and max. value are the minimum and maximum values of that attribute respectively. Apply the same transformation to the STABILITY values of the test instances.
        • Distance metric: (Plain) Euclidean distance.
        • "Voting" method: Majority vote with no distance weighting.

        Variation 4:
        • Data Attributes: Preprocess STABILITY so that it ranges from 0 to 1, as described in Variation 3.
        • Distance metric: (Plain) Euclidean distance.
        • "Voting" method: Majority vote with distance weighting, as described in Variation 2.

        Variation 5:
        • Data Attributes: Preprocess STABILITY so that it ranges from 0 to 1, as described in Variation 3.
        • Distance metric: Weighted Euclidean distance. Use the following weights for each attribute:
          STABILITY    2
          ERROR        1
          WIND         4.5
          VISIBILITY   3

        • "Voting" method: Majority vote with distance weighting, as described in Variation 2.

  • Part II. GROUP PROJECT ASSIGNMENT