WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 444X Data Mining and Knowledge Discovery in Databases 
D Term 2004
Project 4: Numeric Predictions, Instance Based Learning, and Clustering

PROF. CAROLINA RUIZ 

DUE DATE: This project is due on Wednesday, April 28, 2004 at 12 noon  
------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is to construct the most accurate models of different aspects of the two datasets under consideration using the following data mining techniques: numeric predictions, instance-based learning, and clustering. Also, to gain a close understanding of how those methods work, this project includes following those methods by hand on a toy dataset.

PROJECT ASSIGNMENT

  1. Part I.
    Consider the dataset
    training_cpu.arff. This dataset is a subset of the cpu dataset that is available with the Weka system. See page 15 of your textbook for a description of this dataset.
    1. (40 points) Numeric Predictions
      Follow the procedure described in the textbook to construct a model tree and a regression tree that use the predictive attributes vendor, MYCT, MMIN, MMAX, CACH, CHMIN, and CHMAX to predict the attribute CLASS in the training_cpu.arff dataset. Remember to:
      1. (5 points) Start by translating the nominal attribute vendor into boolean/numeric attributes. This is done by taking the average of the CLASS values associated with each of the vendor values burroughs, dec, hp, honeywell, ibm. Sort them in decreasing order by average. Now, create new boolean attributes, one for each possible split of these five nominal values in the order listed. After this translation, all the predicting attributes are numeric. (A sketch of this translation appears after the table at the end of this part.)
      2. (5 points) Sort the values of each attribute in, say, increasing order. Define a "split point" of an attribute as the midpoint between two consecutive values of the attribute.
      3. (5 points) Consider the set of split points of all attributes. Select as the condition for the root node of your tree the split point that maximizes the value of the following formula:
                SDR = sd(CLASS over all instances)
                      - ((k1/n)*sd(CLASS of instances with attribute value below split point)
                         + (k2/n)*sd(CLASS of instances with attribute value above split point))
        
                where sd stands for standard deviation.
                k1 is the number of instances with attribute value below split point.
                k2 is the number of instances with attribute value above split point.
                n is the number of instances.
             
      4. (10 points) Continue the construction of the tree following the same procedure recursively. Remember that for each internal node, the procedure above is applied only to the data instances that belong to that node. You can stop splitting a node when the node contains fewer than 4 data instances and/or when the standard deviation of the CLASS value of the node's instances is less than 0.05*sda, where sda is the standard deviation of the CLASS attribute over the entire input dataset. See Figure 6.11 on p. 206 of your textbook. (A sketch of this split-point search and stopping rule appears after the table at the end of this part.)
      5. (10 points) For each leaf node in the tree:
        • Compute the value that would be predicted by that leaf in the case of a Regression Tree.
        • Compute the linear regression formula that would be used by that leaf to predict the CLASS value in the case of a Model Tree. In order to find the coefficients of the linear regression formula, run the linear regression method implemented in the Weka system for the appropriate data instances (those that belong to the leaf).
      6. (5 extra points) For each of the trees constructed (model tree and regression tree), predict the CLASS values for each of the test instances in testing_cpu.arff. That is, complete the following table:
        %
        % As used by Kilpatrick, D. & Cameron-Jones, M. (1998). Numeric prediction
        % using instance-based learning with encoding length selection. In Progress
        % in Connectionist-Based Information Systems. Singapore: Springer-Verlag.
        %
        @relation 'cpu'
        @attribute vendor {burroughs, dec, hp, honeywell, ibm}
        @attribute MYCT real
        @attribute MMIN real
        @attribute MMAX real
        @attribute CACH real
        @attribute CHMIN real
        @attribute CHMAX real
        @attribute class real
        @data
                                              CLASS   MODEL TREE    REGRESSION TREE 
                                                      PREDICTION    PREDICTION
        
        burroughs, 143,  512, 5000,  0, 7, 32,  29    _________    ___________
        
        burroughs, 110, 3100, 6200,  0, 6, 64,  45    _________    ___________
        
        dec,       200,  512, 8000,  8, 1,  8,  38    _________    ___________
        
        dec,       810,  512,  512,  8, 1,  1,  18    _________    ___________
        
        hp,         75, 3000, 8000,  8, 3, 48,  54    _________    ___________
        
        hp,        175,  256, 2000,  0, 3, 24,  20    _________    ___________
        
        honeywell, 330, 1000, 4000,  0, 3,  6,  25    _________    ___________
        
        honeywell, 140, 2000, 4000,  8, 1, 20,  32    _________    ___________
        
        ibm,        26, 8000, 32000, 0, 8, 24, 220    _________    ___________
        
        ibm,       225, 2000, 4000,  8, 3,  6,  31    _________    ___________
        
        
        
                                                      MODEL TREE    REGRESSION TREE 
                                                      ERROR         ERROR
        
        
        root mean-square error (see p. 148)           __________    ___________
        
        mean absolute error (see p. 148)              __________    ___________
        
        
        SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
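
      The following is a minimal Python sketch (not Weka code; the
      representation of instances as dictionaries and the new attribute
      names are illustrative assumptions) of the vendor translation
      described in step 1:

        from collections import defaultdict

        def binarize_vendor(instances):
            # instances: list of dicts with keys "vendor" and "CLASS".
            totals, counts = defaultdict(float), defaultdict(int)
            for inst in instances:
                totals[inst["vendor"]] += inst["CLASS"]
                counts[inst["vendor"]] += 1
            # Average CLASS per vendor value, sorted in decreasing order.
            averages = {v: totals[v] / counts[v] for v in totals}
            order = sorted(averages, key=averages.get, reverse=True)
            # One boolean attribute per possible split of the ordered values:
            # the i-th new attribute is 1 iff the instance's vendor is among
            # the first i values in the ordering.
            for inst in instances:
                pos = order.index(inst["vendor"])
                for i in range(1, len(order)):
                    inst["vendor_top%d" % i] = 1 if pos < i else 0
            return order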
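
      A companion sketch of steps 2-4: enumerating the split points of every
      (now numeric) attribute, scoring each with the SDR formula above, and
      applying the stopping rule. The use of the population standard
      deviation and the dictionary-based instances are assumptions of the
      sketch:

        import statistics

        def sdr(instances, attr, split, target="CLASS"):
            # SDR = sd(all CLASS values) - weighted sd of the two branches.
            below = [i[target] for i in instances if i[attr] < split]
            above = [i[target] for i in instances if i[attr] >= split]
            sd, n = statistics.pstdev, len(instances)
            return (sd([i[target] for i in instances])
                    - (len(below) / n * sd(below) + len(above) / n * sd(above)))

        def best_split(instances, attrs, target="CLASS"):
            # Try the midpoint between every pair of consecutive attribute
            # values and keep the (attribute, split point) maximizing SDR.
            best = None
            for attr in attrs:
                vals = sorted({i[attr] for i in instances})
                for lo, hi in zip(vals, vals[1:]):
                    split = (lo + hi) / 2.0
                    score = sdr(instances, attr, split, target)
                    if best is None or score > best[0]:
                        best = (score, attr, split)
            return best

        def should_stop(instances, sda, target="CLASS"):
            # Stop when a node has fewer than 4 instances or its CLASS
            # standard deviation is below 0.05 * sda (sda = sd of CLASS
            # over the entire input dataset).
            return (len(instances) < 4
                    or statistics.pstdev([i[target] for i in instances])
                       < 0.05 * sda)

      A regression-tree leaf then predicts the mean CLASS value of its
      instances; a model-tree leaf predicts with a linear-regression formula
      fitted to its instances (per step 5, obtain the coefficients from
      Weka's linear regression).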

    2. (20 points) Instance Based Learning
      Assume now that we want to predict the attribute CLASS in the training_cpu.arff dataset, using vendor, MYCT, MMIN, MMAX, CACH, CHMIN, and CHMAX as predicting attributes. (A sketch of the 4-NN computations appears after the table below.)
      1. (10 points) Find the 4 nearest neighbors of each of the test instances. Show your work in your report.
      2. (5 points) Predict the CLASS value of the test instances using these 4 nearest neighbors without any weighting.
      3. (5 points) Predict the CLASS value of the test instances using these 4 nearest neighbors weighted by the inverse of their distance.
      %
      % As used by Kilpatrick, D. & Cameron-Jones, M. (1998). Numeric prediction
      % using instance-based learning with encoding length selection. In Progress
      % in Connectionist-Based Information Systems. Singapore: Springer-Verlag.
      %
      @relation 'cpu'
      @attribute vendor {burroughs, dec, hp, honeywell, ibm}
      @attribute MYCT real
      @attribute MMIN real
      @attribute MMAX real
      @attribute CACH real
      @attribute CHMIN real
      @attribute CHMAX real
      @attribute class real
      @data
                                            CLASS   4-NN         4-NN with weight 
                                                    PREDICTION   PREDICTION
      
      burroughs, 143,  512, 5000,  0, 7, 32,  29    _________    ___________
      
      dec,       200,  512, 8000,  8, 1,  8,  38    _________    ___________
      
      hp,         75, 3000, 8000,  8, 3, 48,  54    _________    ___________
      
      honeywell, 330, 1000, 4000,  0, 3,  6,  25    _________    ___________
      
      ibm,        26, 8000, 32000, 0, 8, 24, 220    _________    ___________
      
      
      
                                                    4-NN         4-NN with weight 
                                                    ERROR         ERROR
      
      
      root mean-square error (see p. 148)           __________    ___________
      
      mean absolute error (see p. 148)              __________    ___________
      
      
      SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
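
      A minimal Python sketch of the 4-NN predictions and the two error
      measures (not Weka code; it assumes the attributes have already been
      normalized to [0, 1], as in the textbook's instance-based learning
      sections, and the small epsilon guarding against division by zero is
      an assumption of the sketch):

        import math

        def dist(a, b, attrs):
            # Euclidean distance over the (pre-normalized) numeric attributes.
            return math.sqrt(sum((a[k] - b[k]) ** 2 for k in attrs))

        def knn_predict(train, x, attrs, k=4, weighted=False, target="CLASS"):
            nn = sorted(train, key=lambda t: dist(t, x, attrs))[:k]
            if not weighted:
                # Unweighted: plain average of the neighbors' CLASS values.
                return sum(t[target] for t in nn) / float(k)
            # Weighted: each neighbor counts with weight 1/distance.
            w = [1.0 / (dist(t, x, attrs) + 1e-9) for t in nn]
            return sum(wi * t[target] for wi, t in zip(w, nn)) / sum(w)

        def mae_rmse(actual, predicted):
            # The two error measures of the table (see p. 148).
            n = len(actual)
            mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
            rmse = math.sqrt(sum((a - p) ** 2
                                 for a, p in zip(actual, predicted)) / n)
            return mae, rmse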

  2. Part II.

    1. Datasets: Consider the following sets of data:

      1. The Titanic Dataset. Look at the dataset description and the data instances.

        I suggest you use the following nominal values for the attributes rather than 0's and 1's to make the resulting models easier to read:

        Class (0 = crew, 1 = first, 2 = second, 3 = third)
        Age   (1 = adult, 0 = child)
        Sex   (1 = male, 0 = female)
        Survived (1 = yes, 0 = no)
        

      2. The entire cpu dataset that is available with the Weka system. See page 15 of your textbook for a description of this dataset.
      For the classification methods (numeric predictions and instance-based learning), select appropriate (nominal or numeric) classification targets (i.e., "class attributes") for your experiments.

    2. Readings:
      • Textbook: Read in great detail the following Sections from your textbook:
        • Numeric Predictions: Sections 4.6, 6.5, 5.8.
        • Instance Based Learning: Sections 4.7, 6.4.
        • Clustering: Section 6.6

      • Weka Code: Read the code of the relevant techniques implemented in the Weka system. Some of those techniques are enumerated below:
        • Numeric Predictions:
          • M5PRIME: Linear Regression, Regression Trees, Model Trees
        • Instance Based Learning:
          • IBk: k-nearest neighbors
          • (Optional - LWR: Locally Weighted Regression)
        • Clustering:
          • Simple k-means
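
        For orientation while reading Weka's simple k-means implementation,
        here is a minimal Python sketch of the k-means procedure itself
        (random seeding and plain Euclidean distance are assumptions of the
        sketch, not a description of the Weka code):

          import random

          def kmeans(points, k, iters=100):
              # points: a list of equal-length numeric tuples.
              centroids = random.sample(points, k)
              clusters = []
              for _ in range(iters):
                  # Assign every point to its nearest centroid.
                  clusters = [[] for _ in range(k)]
                  for p in points:
                      nearest = min(range(k),
                                    key=lambda c: sum((a - b) ** 2
                                                      for a, b in zip(p, centroids[c])))
                      clusters[nearest].append(p)
                  # Recompute each centroid as the mean of its cluster.
                  new = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl
                         else centroids[i]
                         for i, cl in enumerate(clusters)]
                  if new == centroids:      # converged: no centroid moved
                      break
                  centroids = new
              return centroids, clusters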

    3. Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

      1. Load the data. Note that you need to translate the dataset into the arff format first.

      2. Preprocessing of the Data:

        A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data to obtain useful patterns, preprocess the data yourself, for instance by writing the necessary filters (you can incorporate them into Weka if you wish).

        In particular,

        • explore different ways of removing missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it.
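
        As a point of comparison while reading the filter, here is a minimal
        Python sketch of the usual replacement scheme (numeric attributes
        get the attribute mean, nominal ones the mode; loading "?" as None
        is an assumption of the sketch):

          import statistics
          from collections import Counter

          def replace_missing(instances, numeric_attrs, nominal_attrs):
              for a in numeric_attrs:
                  known = [i[a] for i in instances if i[a] is not None]
                  fill = statistics.mean(known)        # numeric: use the mean
                  for i in instances:
                      if i[a] is None:
                          i[a] = fill
              for a in nominal_attrs:
                  known = [i[a] for i in instances if i[a] is not None]
                  fill = Counter(known).most_common(1)[0][0]   # nominal: mode
                  for i in instances:
                      if i[a] is None:
                          i[a] = fill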

        To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting models are easy to read.

      3. Mining of Models: The following are guidelines for the construction of your models:

        • Code: Use Weka's algorithms listed above to generate models of each of the two datasets.

        • Training and Testing Instances:

          You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your models, the better.

    4. Evaluation and Testing: Supply input data to Weka and use an appropriate split of the data into training and testing sets (say, 66% for training and 34% for testing).

      Analyze in detail the results obtained. For classification models, analyze the accuracy of the resulting models. For numeric predictions, analyze the errors reported by Weka and explain their meaning. (A sketch of these error measures appears below.)
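
      For the numeric-prediction runs, Weka reports the mean absolute error
      and root mean squared error defined above, plus relative versions that
      compare against always predicting a mean CLASS value. A sketch of how
      these are computed (using the mean of the actual values as the
      baseline is a simplification of this sketch; Weka bases its relative
      measures on the training data):

        import math

        def regression_errors(actual, predicted):
            n = len(actual)
            m = sum(actual) / n            # baseline: always predict the mean
            abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
            sq_err = [(a - p) ** 2 for a, p in zip(actual, predicted)]
            return {
                "mean absolute error": sum(abs_err) / n,
                "root mean squared error": math.sqrt(sum(sq_err) / n),
                "relative absolute error":
                    sum(abs_err) / sum(abs(a - m) for a in actual),
                "root relative squared error":
                    math.sqrt(sum(sq_err) / sum((a - m) ** 2 for a in actual)),
            }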


REPORTS AND DUE DATE


GRADING CRITERIA


---------------------------------------------------------------------------
(TOTAL: 60 points) FOR PART I OF THE PROJECT

(TOTAL: 70 points) FOR PART II OF THE PROJECT

  (TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE 
  (05 points) Description of the algorithm underlying the Weka filters used
  (10 points) Description of the ALGORITHM underlying the data mining 
              methods used in this project.
  (up to 10 extra credit points for an outstanding job) 
  (providing just a structural description of the code, i.e. a list of 
  classes and methods, will receive 0 points)
  
  (TOTAL: 5 points) PRE-PROCESSING OF THE DATASET:
  Discretizing attributes IF needed, and dealing with missing values appropriately
  (up to 5 extra credit points) 
             Trying to do "fancier" things with attributes
             (e.g., combining two highly correlated attributes
              into one, using background knowledge, etc.)
    
  (TOTAL: 46 points) EXPERIMENTS
  (TOTAL: 23 points each dataset) FOR EACH DATASET:
         (05 points) ran a good number of experiments to get familiar with the 
                     data mining methods in this project
         (05 points) good description of the experiment setting and the results 
         (08 points) good analysis of the results of the experiments
                     INCLUDING discussion of evaluation statistics returned by
                     the Weka system (accuracy and/or errors) and discussion of
                     particularly interesting results 
         (05 points) comparison of the results with those obtained using other
                     methods in this and previous projects
                      Argumentation of weaknesses and/or strengths of each of the
                     methods on this dataset, and argumentation of which method
                     should be preferred for this dataset and why. 
         (up to 10 extra credit points) excellent analysis of the results and 
                                        comparisons
         (up to 10 extra credit points) running additional interesting experiments
  
  (TOTAL 4 points) SLIDES - how well do they summarize concisely
          the results of the project? We suggest you summarize the
          setting of your experiments and their results in a tabular manner.
     (up to 6 extra credit points) for excellent summary and presentation of results 
     in the slides.