WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2008 
Homework and Project 1: Data Pre-processing, Mining, and Evaluation of Decision Trees

PROF. CAROLINA RUIZ 

DUE DATES:
Part I (the individual homework assignment) is due on Tuesday, Sept. 9 at 1:00 pm and
Part II (the individual+group project) is due on Friday, Sept. 12, 2008 at 12 noon. 

------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold:

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read Sections 4.3 and 6.1 of your textbook in great detail.

This project consists of two parts:

  1. Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See Amro Khasawneh's HW1 Solutions.

    Consider the following dataset, adapted from the Zoo Data Set available at the University of California, Irvine (UCI) Machine Learning Data Repository. Visit those webpages to learn more about this dataset.

    ATTRIBUTES:	POSSIBLE VALUES:
    @attribute hair {no,yes}      % Does the animal have hair?
    @attribute eggs {no,yes}      % Does the animal lay eggs?
    @attribute toothed {no,yes}   % Does the animal have teeth?
    @attribute legs {0,2,4,5,6,8} % Number of legs - Assumed to be a nominal attribute
    @attribute type {type1,type2,type3,type4,type5,type6,type7}  % Type of animal
    
    animal      hair  eggs  toothed  legs  type
    (ignore!)
    dolphin     no    no    yes      0     type1
    frog        no    yes   yes      4     type5
    gnat        no    yes   no       6     type6
    herring     no    yes   yes      0     type4
    ladybird    no    yes   no       6     type6
    lynx        yes   no    yes      4     type1
    mongoose    yes   no    yes      4     type1
    ostrich     no    yes   no       2     type2
    stingray    no    yes   yes      0     type4
    termite     no    yes   no       6     type6
    toad        no    yes   yes      4     type5
    tuna        no    yes   yes      0     type4
    vole        yes   no    yes      4     type1
    wasp        yes   yes   no       6     type6
    wren        no    yes   no       2     type2

    1. (60 points) Construct the full ID3 decision tree using entropy to rank the predictive attributes (hair, eggs, toothed, legs) with respect to the target/classification attribute (type).

      Show all the steps of your calculations. Make sure you compute logarithms in base b (for the appropriate b) correctly, since some calculators don't have a log_b primitive for every b; recall the change-of-base formula log_b(x) = log(x)/log(b).
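      You can check your hand calculations for this step with a short script. This is only a sanity-check sketch, not part of the required work (the helper names `entropy` and `info_gain` are mine); ID3 places at each node the attribute with the highest information gain:

      ```python
      from collections import Counter
      from math import log2

      # Training data from the assignment: (hair, eggs, toothed, legs, type).
      # The animal name is omitted since it is ignored during tree construction.
      data = [
          ("no",  "no",  "yes", "0", "type1"),  # dolphin
          ("no",  "yes", "yes", "4", "type5"),  # frog
          ("no",  "yes", "no",  "6", "type6"),  # gnat
          ("no",  "yes", "yes", "0", "type4"),  # herring
          ("no",  "yes", "no",  "6", "type6"),  # ladybird
          ("yes", "no",  "yes", "4", "type1"),  # lynx
          ("yes", "no",  "yes", "4", "type1"),  # mongoose
          ("no",  "yes", "no",  "2", "type2"),  # ostrich
          ("no",  "yes", "yes", "0", "type4"),  # stingray
          ("no",  "yes", "no",  "6", "type6"),  # termite
          ("no",  "yes", "yes", "4", "type5"),  # toad
          ("no",  "yes", "yes", "0", "type4"),  # tuna
          ("yes", "no",  "yes", "4", "type1"),  # vole
          ("yes", "yes", "no",  "6", "type6"),  # wasp
          ("no",  "yes", "no",  "2", "type2"),  # wren
      ]
      attrs = ["hair", "eggs", "toothed", "legs"]

      def entropy(rows):
          """Entropy (base 2) of the class distribution of `rows`."""
          counts = Counter(r[-1] for r in rows)
          n = len(rows)
          return -sum(c / n * log2(c / n) for c in counts.values())

      def info_gain(rows, i):
          """Information gain of splitting `rows` on attribute index i."""
          n = len(rows)
          groups = {}
          for r in rows:
              groups.setdefault(r[i], []).append(r)
          remainder = sum(len(g) / n * entropy(g) for g in groups.values())
          return entropy(rows) - remainder

      print(f"entropy(type) = {entropy(data):.4f}")
      for i, a in enumerate(attrs):
          print(f"Gain({a}) = {info_gain(data, i):.4f}")
      ```

      Repeat the gain computation on each subset of the data as you grow the tree; the script above only checks the choice at the root.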

    2. (10 points) Compute the accuracy of the decision tree you constructed on the following test data instances:

      animal      hair  eggs  toothed  legs  type    YOUR DECISION
      (ignore!)                              (*)     TREE PREDICTION
      bass        no    yes   yes      0     type4   _______________
      buffalo     yes   no    yes      4     type1   _______________
      chicken     no    yes   no       2     type2   _______________
      crayfish    no    yes   no       6     type7   _______________
      deer        yes   no    yes      4     type1   _______________
      dove        no    yes   no       2     type2   _______________
      goat        yes   no    yes      4     type1   _______________
      pike        no    yes   yes      0     type4   _______________
      toad        no    yes   yes      4     type5   _______________
      vampire     yes   no    yes      2     type1   _______________

      (*) Ignore during classification; use to calculate accuracy.

      (5 points) The accuracy of your decision tree on this test data is: ________________
      
      (10 points) The confusion matrix of your decision tree on this test data is: ....
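
      Once you have filled in your tree's predictions, you can check your accuracy and confusion-matrix figures with a small helper. This is a sketch only; the `evaluate` helper and the placeholder predictions are mine, not part of the assignment:

      ```python
      from collections import Counter

      def evaluate(y_true, y_pred):
          """Return (accuracy, confusion counts keyed by (actual, predicted))."""
          assert len(y_true) == len(y_pred)
          correct = sum(t == p for t, p in zip(y_true, y_pred))
          return correct / len(y_true), Counter(zip(y_true, y_pred))

      # Actual classes of the ten test instances, in the order listed above.
      y_true = ["type4", "type1", "type2", "type7", "type1",
                "type2", "type1", "type4", "type5", "type1"]
      # Placeholder: replace with the class your tree predicts for each instance.
      y_pred = ["?"] * len(y_true)

      accuracy, confusion = evaluate(y_true, y_pred)
      print(f"accuracy = {accuracy:.2f}")
      for (actual, predicted), count in sorted(confusion.items()):
          print(f"actual {actual} -> predicted {predicted}: {count}")
      ```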
      

    3. Consider the following questions:
      • (5 points) What would your decision tree predict for the following test instance? Explain your answer.

        animal      hair  eggs  toothed  legs  YOUR DECISION     EXPLANATION OF
        (ignore!)                              TREE PREDICTION   YOUR ANSWER
        scorpion    no    no    no       8     _______________   _______________

      • (10 points) Explain in detail how J4.8 would classify the following instance (which contains a missing value) with respect to the decision tree that you constructed above.

        animal      hair  eggs  toothed  legs  YOUR DECISION     EXPLANATION OF
        (ignore!)                              TREE PREDICTION   YOUR ANSWER
        no-name     yes   no    no       ?     _______________   _______________
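
        As background: J4.8 handles a missing value for the attribute tested at a node by sending the instance down every branch with a weight proportional to the fraction of training instances that took that branch, then summing the weighted class distributions of the leaves reached. A sketch with illustrative numbers (the weights and distributions below come from a root-level split on legs in the training data; they are placeholders, not the values from your completed tree):

        ```python
        # Each branch of the tested attribute maps to (weight = fraction of
        # training instances that took this branch, class distribution of the
        # subtree it leads to). Illustrative placeholder values only.
        branches = {
            "0": (4/15, {"type1": 1/4, "type4": 3/4}),
            "2": (2/15, {"type2": 1.0}),
            "4": (5/15, {"type1": 3/5, "type5": 2/5}),
            "6": (4/15, {"type6": 1.0}),
        }

        # The instance with legs = ? is split into fractional pieces, one per
        # branch, and the weighted class distributions are summed.
        combined = {}
        for weight, dist in branches.values():
            for cls, p in dist.items():
                combined[cls] = combined.get(cls, 0.0) + weight * p

        # J4.8 predicts the class with the largest combined probability.
        prediction = max(combined, key=combined.get)
        print(combined, "->", prediction)
        ```

        With these placeholder numbers type1 and type6 tie at 4/15 each; in your actual tree the legs=0 and legs=4 subtrees split further on the remaining attributes, so the combined distribution (and hence the prediction) will differ.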

    4. Study how J4.8 performs post-pruning by reading in detail:
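
      As context for the reading: J4.8's post-pruning rests on a pessimistic (upper-confidence-bound) estimate of a node's true error rate, computed from the observed error rate f over N instances. A sketch of that estimate, assuming the formula as presented in Witten & Frank's textbook (the function name is mine, and z ≈ 0.69 is the value corresponding to the default 25% confidence factor; verify both against the reading):

      ```python
      from math import sqrt

      def pessimistic_error(f, N, z=0.69):
          """Upper confidence bound on a node's true error rate, given the
          observed error rate f on N instances. z ~ 0.69 corresponds to the
          default 25% confidence factor. This is the estimate C4.5/J4.8 uses
          when deciding whether to replace a subtree by a leaf."""
          return ((f + z * z / (2 * N)
                   + z * sqrt(f / N - f * f / N + z * z / (4 * N * N)))
                  / (1 + z * z / N))

      # Even a node that makes no errors on 6 instances gets a nonzero
      # pessimistic error estimate:
      print(pessimistic_error(0.0, 6))
      ```

      A subtree is replaced by a leaf when the leaf's pessimistic error is no larger than the weighted combination of the pessimistic errors of the subtree's leaves; note that the estimate grows as N shrinks, which is what makes small, overfit subtrees prone to pruning.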

  2. Part II. INDIVIDUAL + GROUP PROJECT ASSIGNMENT
    [400 points: 100 points per data mining technique per individual/group parts. See
    Project Guidelines for the detailed distribution of these points]