WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2010 
Homework and Project 2: Data Pre-processing, Mining, and Evaluation of Decision Trees

PROF. CAROLINA RUIZ 

DUE DATES: Tuesday, Nov. 16, 8:00 am (electronic submission) and 11:00 am (hardcopy submission) 
------------------------------------------


HOMEWORK AND PROJECT OBJECTIVES

The purpose of this project is multi-fold:

HOMEWORK AND PROJECT ASSIGNMENTS

Readings: Read in great detail Chapter 4 of your textbook.

This project consists of two parts:

  1. Part I. INDIVIDUAL HOMEWORK ASSIGNMENT

    See Solutions by Hao Wan.

    Consider the following dataset, adapted from the Shuttle Landing Control Data Set available at the University of California, Irvine (UCI) Machine Learning Data Repository. Visit those webpages to learn more about this dataset.

    @relation shuttle-landing-control
    
    @attribute STABILITY	{stab, xstab}
    @attribute ERROR	{LX, MM, SS}
    @attribute WIND 	{head,tail}
    @attribute VISIBILITY	{yes, no}
    @attribute Class 	{noauto,auto}
    
    @data
    ( 1) xstab, MM, tail, no,  auto
    ( 2) xstab, LX, head, yes, noauto
    ( 3) stab,  LX, head, no,  auto
    ( 4) xstab, SS, head, no,  auto
    ( 5) stab,  SS, head, yes, auto
    ( 6) xstab, SS, tail, yes, noauto
    ( 7) stab,  MM, head, yes, noauto
    ( 8) xstab, MM, head, no,  auto
    ( 9) xstab, SS, head, no,  auto
    (10) xstab, MM, head, yes, noauto
    (11) stab,  MM, tail, yes, auto
    (12) stab,  MM, tail, yes, noauto
    (13) stab,  MM, tail, yes, noauto
    (14) stab,  LX, head, yes, noauto
    (15) stab,  SS, head, yes, auto
    (16) stab,  MM, head, yes, noauto
    (17) xstab, MM, tail, no,  auto
    (18) stab,  MM, tail, yes, auto
    (19) xstab, SS, tail, no,  auto
    (20) xstab, SS, head, yes, noauto
    (21) xstab, LX, tail, yes, noauto
    (22) xstab, LX, tail, yes, noauto
    

    1. (20 points) Complete the construction of the full ID3 decision tree shown below using entropy to rank the predictive attributes (STABILITY, ERROR, WIND, VISIBILITY) with respect to the target/classification attribute (Class).

      Note that the root of the decision tree (VISIBILITY) is given, so you do NOT need to show the gain or entropy calculations used to determine that this attribute should be the root node. However, you need to show all the steps of the calculations from there on. Make sure you compute logarithms in the correct base b, since some calculators don't have a log_b primitive for every b (you can use log_b(x) = log(x)/log(b)). Also, state explicitly in your tree exactly which instances belong to each tree node, using the line numbers provided next to each data instance in the dataset above.

          VISIBILITY
           /      \
      yes /        \ no
         /          \
        ?            ?
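As a way to check the hand calculations (not a substitute for showing them), the entropy and information-gain arithmetic can be sketched in Python. The dataset is transcribed from the listing above, and `entropy`/`gain` are the ordinary ID3 definitions:

```python
from collections import Counter
from math import log2

# (line, STABILITY, ERROR, WIND, VISIBILITY, Class) -- transcribed from above
data = [
    (1,  "xstab", "MM", "tail", "no",  "auto"),
    (2,  "xstab", "LX", "head", "yes", "noauto"),
    (3,  "stab",  "LX", "head", "no",  "auto"),
    (4,  "xstab", "SS", "head", "no",  "auto"),
    (5,  "stab",  "SS", "head", "yes", "auto"),
    (6,  "xstab", "SS", "tail", "yes", "noauto"),
    (7,  "stab",  "MM", "head", "yes", "noauto"),
    (8,  "xstab", "MM", "head", "no",  "auto"),
    (9,  "xstab", "SS", "head", "no",  "auto"),
    (10, "xstab", "MM", "head", "yes", "noauto"),
    (11, "stab",  "MM", "tail", "yes", "auto"),
    (12, "stab",  "MM", "tail", "yes", "noauto"),
    (13, "stab",  "MM", "tail", "yes", "noauto"),
    (14, "stab",  "LX", "head", "yes", "noauto"),
    (15, "stab",  "SS", "head", "yes", "auto"),
    (16, "stab",  "MM", "head", "yes", "noauto"),
    (17, "xstab", "MM", "tail", "no",  "auto"),
    (18, "stab",  "MM", "tail", "yes", "auto"),
    (19, "xstab", "SS", "tail", "no",  "auto"),
    (20, "xstab", "SS", "head", "yes", "noauto"),
    (21, "xstab", "LX", "tail", "yes", "noauto"),
    (22, "xstab", "LX", "tail", "yes", "noauto"),
]
ATTRS = {"STABILITY": 1, "ERROR": 2, "WIND": 3, "VISIBILITY": 4}

def entropy(rows):
    """H(Class) = -sum_i p_i * log2(p_i) over the class distribution of rows."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r[-1] for r in rows).values())

def gain(rows, attr):
    """Information gain of splitting rows on attr."""
    i, n = ATTRS[attr], len(rows)
    return entropy(rows) - sum(
        k / n * entropy([r for r in rows if r[i] == v])
        for v, k in Counter(r[i] for r in rows).items())

# The VISIBILITY = yes branch: instances 2,5,6,7,10-16,18,20,21,22
yes_subset = [r for r in data if r[4] == "yes"]
for a in ("STABILITY", "ERROR", "WIND"):
    print(a, round(gain(yes_subset, a), 4))
```

Running it prints the gain of each remaining attribute within the VISIBILITY = yes branch; the VISIBILITY = no branch is pure (all seven instances are `auto`), so its entropy is 0, which is why no further splitting is needed there.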
      

    2. (5 points) Propose approaches to using your decision tree above to classify instances that contain missing values. Use the following instance to illustrate your ideas.
      VISIBILITY = yes, STABILITY = stab, ERROR  = ?, WIND = head
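One standard approach (the one used by C4.5) is to send such an instance down every branch of the node whose test attribute is missing, weighting each branch by the fraction of training instances that followed it, and then summing the resulting class distributions. A minimal sketch follows; the subtree and its leaf labels are hypothetical stand-ins (they are NOT the answer to Question 1), though the branch counts 4/7/4 for LX/MM/SS do match the VISIBILITY = yes subset of the data above:

```python
# C4.5-style handling of a missing attribute value: when the tested
# attribute is unknown, descend every branch with weight proportional to
# the number of training instances that took it, and combine the results.

def classify(node, instance):
    """Return a dict mapping class -> probability for `instance`,
    where `instance` may have None for a missing attribute value."""
    if "label" in node:                                  # leaf node
        return {node["label"]: 1.0}
    value = instance.get(node["attribute"])
    if value is not None and value in node["branches"]:
        return classify(node["branches"][value], instance)
    # Missing value: split the instance across all branches by training weight.
    total = sum(node["counts"].values())
    combined = {}
    for v, child in node["branches"].items():
        w = node["counts"][v] / total
        for cls, p in classify(child, instance).items():
            combined[cls] = combined.get(cls, 0.0) + w * p
    return combined

# HYPOTHETICAL subtree under VISIBILITY = yes testing ERROR (leaf labels
# are illustrative only; counts 4/7/4 come from the training data).
tree = {"attribute": "ERROR",
        "counts": {"LX": 4, "MM": 7, "SS": 4},
        "branches": {"LX": {"label": "noauto"},
                     "MM": {"label": "noauto"},
                     "SS": {"label": "auto"}}}

print(classify(tree, {"VISIBILITY": "yes", "STABILITY": "stab",
                      "ERROR": None, "WIND": "head"}))
```

The same machinery also covers a simpler strategy, following only the most populous branch, by replacing the weighted loop with a single descent into the branch maximizing `node["counts"]`.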
      

    3. Study how J4.8 performs post-pruning by reading in detail:
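      As context for that reading: J4.8 (Weka's implementation of C4.5) prunes using a pessimistic estimate of a node's error rate, namely the upper bound of a binomial confidence interval on the observed training error, with a default confidence factor of 25% (z ≈ 0.69). A sketch of that estimate, following the formula commonly given for C4.5 (the example leaf counts below are made up for illustration):

```python
from math import sqrt

def pessimistic_error(errors, n, z=0.69):
    """Upper confidence bound on the true error rate of a node that
    misclassifies `errors` of its `n` training instances.
    z = 0.69 approximates C4.5's default 25% confidence factor."""
    f = errors / n
    return (f + z * z / (2 * n)
            + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

# A subtree is replaced by a leaf when the leaf's pessimistic error is no
# worse than the instance-weighted pessimistic error of the subtree's leaves.
subtree = [(2, 6), (1, 2), (2, 6)]            # hypothetical (errors, n) per leaf
total = sum(n for _, n in subtree)
subtree_err = sum(n * pessimistic_error(e, n) for e, n in subtree) / total
leaf_err = pessimistic_error(5, total)        # majority-class replacement leaf
print(round(subtree_err, 3), round(leaf_err, 3))
```

Here the replacement leaf's pessimistic error is lower than the subtree's weighted estimate, so this subtree would be pruned to a leaf.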

  2. Part II. GROUP PROJECT ASSIGNMENT