WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2004 
Homework and Project 3: Numeric Predictions

PROF. CAROLINA RUIZ 

DUE DATE: Part I (the individual homework assignment) is due on Friday, Oct. 01 at 1:00 pm and Parts II.1 and II.2 (the individual+group project) are due on Monday, October 4th 2004 at 5 pm. 
------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is to construct the most accurate numeric prediction models for the two datasets under consideration using the following techniques: Also, to gain close understanding of how those methods work, this project also include following those methods by hand on a toy dataset.

PROJECT ASSIGNMENT

  1. Part I.
    See
    solutions to this homework assignment by Abraao Lourenco with accompanying Excel tables.

    Consider the dataset below. This dataset is a subset of the Auto Miles-per-galon (MPG) dataset that is available at the The University of California Irvine (UCI) Data Repository.

    
    @relation subset-auto-mpg
    
    @attribute mpg numeric
    @attribute cylinders numeric
    @attribute horsepower numeric
    @attribute weight numeric
    @attribute acceleration numeric
    @attribute model-year numeric
    @attribute car-name {chevrolet,toyota,volkswagen,ford}
    
    @data
    
    18,  8,  130,  3504,  12,    70,  chevrolet
    27,  4,  90,   2950,  17.3,  82,  chevrolet
    34,  4,  88,   2395,  18,    82,  chevrolet
    24,  4,  95,   2372,  15,    70,  toyota
    28,  4,  75,   2155,  16.4,  76,  toyota
    32,  4,  96,   2665,  13.9,  82,  toyota
    26,  4,  46,   1835,  20.5,  70,  volkswagen
    29,  4,  70,   1937,  14.2,  76,  volkswagen
    44,  4,  52,   2130,  24.6,  82,  volkswagen
    10,  8,  215,  4615,  14,    70,  ford
    28,  4,  79,   2625,  18.6,  82,  ford
    16,  8,  149,  4335,  14.5,  77,  ford
    
    For this homework, we want to predict the mpg attribute (prediction target) from the other predicting attributes cylinders, horsepower, weight, acceleration, model-year, and car-name

    1. (15 points) Linear Regression

      • (5 points) Describe the linear regression equation that would result from using linear regression to solve this numeric prediction problem.

      • (10 points) Describe in detail the procedure that would be followed by linear regression to find appropriate parameters for this equation. Run linear regression in the Weka system over this dataset and provide the precise linear regression equation output by Weka (you'll need it for the testing part below).

    2. (45 points) Regression Trees and Model Trees
      Follow the procedure described in the textbook to construct a model tree and a regression tree to solve this numeric prediction problem. Remember to:

      1. (5 points) Start by translating the nominal attribute car-name into boolean/numeric attributes. This is done by taking the average of the CLASS values associated with each of the vendor values chevrolet,toyota,volkswagen,ford Sort them in decresing order by average. Now, create new boolean attributes, one for each possible split of these four nominal values in the order listed. After this translation, all the predicting attributes are numeric.

      2. (5 points) Sort the values of each attribute in say increasing order. Define a "split point" of an attribute as the midpoint between two subsequent values of the attribute.

      3. (5 points) Consider the set of split points of all attributes. Select as the condition for the root node on your tree, the split point that maximizes the value of the following formula:
                SDR = sd(CLASS over all instances)
                      - ((k1/n)*sd(CLASS of instances with attribute value below split point)
                         + (k2/n)*sd(CLASS of instances with attribute value above split point))
        
                where sd stands for standard deviation.
                k1 is the number of instances with attribute value below split point.
                k2 is the number of instances with attribute value above split point.
                n is the number of instances.
             

      4. (20 points) Continue the construction of the tree following the same procedure recursively. Remember that for each internal node, the procedure above is applied only to the data instances that belong that that node. You can stop splitting a node when the node contains less than 4 data instances and/or when the standard deviation of the CLASS value of the node's instances is less than 0.05*sda, where sda is the standard deviation of the CLASS attribute over the entire input dataset. See Figure 6.11 on p. 206 of your textbook.

      5. (10 points) For each leaf node in the tree:
        • Compute the value that would be predicted by that leaf in the case of a Regression Tree.
        • Compute the linear regression formula that would be used by that leaf to predict the CLASS value in the case of a Model Tree. In order to find the coefficients of the linear regression formula, run the linear regression method implemented in the Weka system for the appropriate data instances (those that belong to the leaf).

    3. (15 points) Testing
      Use each of the three numeric predicting models constructed (linear regression equation, model tree and regression tree), to predict the CLASS values (i.e. the mpg) for each of the test instances below. That is, complete the following table:
                                                       LINEAR        MODEL TREE    REGRESSION TREE 
                                                       REGRESSION    PREDICTION    PREDICTION
      
      
      13,    8,  150,  4464,  12,    73,  chevrolet    __________    ___________   __________
      
      21,    4,  72,   2401,  19.5,  73,  chevrolet    __________    ___________   __________
      
      20,    6,  122,  2807,  13.5,  73,  toyota       __________    ___________   __________
      
      27.5,  4,  95,   2560,  14.2,  78,  toyota       __________    ___________   __________
      
      27,    4,  60,   1834,  19,    71,  volkswagen   __________    ___________   __________
      
      31.5,  4,  71,   1990,  14.9,  78,  volkswagen   __________    ___________   __________
      
      21,    4,  86,   2226,  16.5,  72,  ford         __________    ___________   __________
      
      36.1,  4,  66,   1800,  14.4,  78,  ford         __________    ___________   __________
      
      
      
      
      
                                                       LINEAR        MODEL TREE    REGRESSION TREE 
                                                       REGRESSION    ERROR         ERROR
                                                       ERROR
      
      root mean-square error (see p. 148)              __________    ___________   __________
      
      mean absolute error (see p. 148)                 __________    ___________   __________
      
      
      SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.

  • Part II.

    1. Datasets: Consider the following sets of data:

      1. The Abalone Dataset available from the The University of California Irvine (UCI) Data Repository.

        Use the attribute age as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

      2. The Automobile Dataset available from the The University of California Irvine (UCI) Data Repository.

        Use the attribute price as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

    2. Readings:
      • Textbook: Read in great detail the following Sections from your textbook:
        • Numeric Predictions: Sections 4.6, 6.5, 5.8.

      • Weka Code: Read the code of the relevant techiques implemented in the Weka system. Some of those techniques are enumerated below:
        • Numeric Predictions:
          • M5PRIME: Linear Regression, Regression Trees, Model Trees

      Experiments:

      For each of the datasets, use the "Explorer" option of the Weka system to perform the following operations:

      • A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, etc. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data as you need to obtain useful patterns, preprocess the data yourself either by writing the necessary filters (you can incorporate them in Weka if you wish).

      • To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting models are easy to read.

      • You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your models, the better.

      • INDIVIDUAL PROJECT AND WRITTEN REPORT.

        Your individual report should contain discussions of all the parts of the individual work you do for this project. In particular, it should elaborate on the the following topics:

        1. Code Description: Describe algorithmicly the Weka code of the 3 methods covered by this project: LINEAR REGRESSION, MODEL TREES, AND REGRESSION TREES; and filters that you use in the project. More precisely, explain the ALGORITHM underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output. PLEASE NOTE THAT WE EXPECT A DETAIL DESCRIPTION OF THE ALGORITHMS USED NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE.

        2. Experiments: For EACH EXPERIMENT YOU RAN describe:
          • Instances: What data did you use for the experiment? That is, did you use the entire dataset of just a subset of it? Why?
          • Any pre-processing done to the data. That is, did you remove any attributes? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
          • Your system parameters.
          • For each of the methods covered by this project (i.e. LINEAR REGRESSION, MODEL TREES, AND REGRESSION TREES):
            • Results and detail ANALYSIS of results of the experiments you ran using different ways of testing (split ratio and N-fold cross-validation) the model.
            • Error measures of the resulting models.
            • Comparison of the results across the 3 numeric prediction methods used in this project.

        3. Summary of Results
          • For each of the datasets, what were the lowest error values obtained in your project? Include this model in your report.
          • Strengths and weaknesses of your project.

      • GROUP PROJECT AND WRITTEN REPORT. Your group report should contain discussions of all the parts of the group project. In particular, it should elaborate on the the following topics:

        1. Experiments: as described in the individual part above. For the group part, start by running experiments that build upon the experience that you gained with the individual projects.

          Once that are done with these join experiments, investigate the pruning method used in M5PRIME (i.e. when the option UNPRUNED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the pruning is done and run experiments to see how it modifies the resulting tree and how this affects the error values reported for the tree.

          Investigate also the smoothing method used in M5PRIME (i.e. when the option useUNSMOOTHED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the smoothing is done and run experiments to see how it modifies the resulting tree and/or how it affects the error values reported for the tree.

        2. Summary of Results as described in the individual part above

      • GROUP ORAL REPORT. We will discuss the results from the individual projects during the class on Monday, October 4. Your oral report should summarize the different sections of your written report as described above. Each group will have about 4 minutes to explain your results and to discuss your project in class. Be prepared!

      PROJECT SUBMISSION AND DUE DATE

      Part II is due Monday, Oct. 4nd at 5:00 pm. Submissions received on Monday, Oct. 4 between 5:01 pm and 7:00 pm will be penalized with 30% off the grade and submissions after Oct. 4 at 7:00 pm won't be accepted.

      Please submit the following files using the myWpi digital drop box:

      1. [lastname]_proj3_report.[ext] containing your individual written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
        • ruiz_proj3_report.pdf

        If you are taking this course for grad. credit, state this fact at the beginning of your report. In this case you submit only an individual report containing both the "individual" and the "group" parts, as you are working all by yourself on the projects.

      2. [lastname1_lastname2]_proj3_report.[ext] containing your group written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
        • ruiz_smith_proj3_report.pdf if I worked with Joe Smith on this project.

      3. [lastname1_lastname2]_proj3_slides.[ext] (or [lastname]_proj3_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project (both individual and group parts, and classification and association rules).

    GRADING CRITERIA

    
    ---------------------------------------------------------------------------
    
    (TOTAL:  75 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.
    
    (TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:
    
      (TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE 
      (05 points) Description of the algorithm underlying the Weka filters used
      (10 points) Description of the ALGORITHM undelying the data mining 
                  methods used in this project.
      (up to 10 extra credit points for an outanding job) 
      (providing just a structural description of the code, i.e. a list of 
      classes and methods, will receive 0 points)
      
      (TOTAL: 5 points) PRE-PROCESSING OF THE DATASET:
      Preprocess attributes as needed and dealing with missing values appropriately
      (up to 5 extra credit points) 
                 Trying to do "fancier" things with attributes
                 (i.e. combining two attributes highly correlated
                  into one, using background knowledge, etc.)
        
      (TOTAL: 160 points: 80 points for the individual part and 80 points for the group part) 
      EXPERIMENTS
      (TOTAL: 40 points each dataset) FOR EACH DATASET:
             (05 points) ran a good number of experiments to get familiar with the 
                         data mining methods in this project
             (15 points) good description of the experiment setting and the results 
             (15 points) good analysis of the results of the experiments
                         INCLUDING discussion of evaluations statistics returned by
                       the Weka systems (accuracy and/or errors) and discussion of
                         particularly interesting results 
             (05 points) comparison of the results with those obtained using the other
                         methods in this project.
                         Argumentation of weknesses and/or strenghts of each of the
                         methods on this dataset, and argumentation of which method
                         should be preferred for this dataset and why. 
           (up to 10 extra credit points) excellent analysis of the results and 
                                           comparisons
             (up to 10 extra credit points) running additional interesting experiments
      
      (TOTAL 5 points) SLIDES - how well do they summarize concisely
              the results of the project? We suggest you summarize the
              setting of your experiments and their results in a tabular manner.
         (up to 6 extra credit points) for excellent summary and presentation of results 
         in the slides.
    
      (TOTAL 15 points) Class presentation - how well your oral presentation summarized
            concisely the results of the project and how focus your presentation was
            on the more creative/interesting/usuful of your experiments and results.
            This grade is given individually to each team member.