WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 444X Data Mining and Knowledge Discovery in Databases 
D Term 2003
Project 4: Numeric Predictions, Instance Based Learning, and Clustering

PROF. CAROLINA RUIZ 

DUE DATE: This project is due on Wednesday, April 23, 2003 at 12 noon  
------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is to construct the most accurate models of different aspects of the two datasets under consideration (Mushroom data and College data) using the following data mining techniques: numeric predictions, instance based learning, and clustering. Also, to gain a close understanding of how those methods work, this project includes following those methods by hand on a toy dataset.

PROJECT ASSIGNMENT

  1. Part I.
    Consider the dataset
    ten_percent_stratified_iris.arff. This dataset is a subset of the Iris dataset obtained using stratified random sampling.
    1. (15 points) Numeric Predictions
      Follow the procedure described in the textbook to construct the root of a model tree (the same procedure applies to a regression tree) that uses the predictive attributes sepalwidth, petallength, petalwidth, and iris-type to predict the attribute sepallength in the ten_percent_stratified_iris.arff dataset. Remember to:
      1. (5 points) Start by translating the nominal attribute iris-type into two boolean attributes. This is done by taking the average of the sepallength values associated with "Iris-setosa", with "Iris-versicolor", and with "Iris-virginica". Sort the three nominal values in decreasing order by average. Now create new boolean attributes, one for each possible split of these three nominal values in the order listed. After this translation, all the predicting attributes are numeric.
      2. (5 points) Sort the values of each attribute in, say, increasing order. Define a "split point" of an attribute as the midpoint between two consecutive values of the attribute.
      3. (5 points) Consider the set of split points of all attributes. Select as the condition for the root node of your tree the split point that maximizes the value of the following formula:
                SDR = sd(sepallength over all instances)
                      - ((k1/n)*sd(sepallength of instances with attribute value below split point)
                         + (k2/n)*sd(sepallength of instances with attribute value above split point))
        
                where sd stands for standard deviation.
                k1 is the number of instances with attribute value below split point.
                k2 is the number of instances with attribute value above split point.
                n is the number of instances.
             
        Note that you don't need to construct the whole tree, just the root node. SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
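
        To make the split-point search concrete, here is a minimal sketch of
        step 3 for a single numeric attribute. It is only an illustration:
        the function and variable names are ours, the data below is a made-up
        stand-in rather than the actual Iris subset, and it uses the
        population standard deviation (switch to statistics.stdev if your
        hand computation uses the sample version).

          import statistics

          def sd(values):
              # Standard deviation, as in the SDR formula above.
              return statistics.pstdev(values)

          def best_split(attr_values, targets):
              # Return the (split_point, SDR) pair maximizing SDR for one attribute.
              pairs = sorted(zip(attr_values, targets))
              n = len(pairs)
              best = (None, float("-inf"))
              for i in range(n - 1):
                  if pairs[i][0] == pairs[i + 1][0]:
                      continue  # no split point between two equal values
                  split = (pairs[i][0] + pairs[i + 1][0]) / 2
                  below = [t for v, t in pairs if v < split]
                  above = [t for v, t in pairs if v > split]
                  sdr = (sd(targets)
                         - (len(below) / n * sd(below)
                            + len(above) / n * sd(above)))
                  if sdr > best[1]:
                      best = (split, sdr)
              return best

          # Made-up sepalwidth values and their sepallength targets:
          sepalwidth  = [3.0, 3.2, 3.1, 3.6, 2.7]
          sepallength = [4.9, 4.7, 4.6, 5.0, 5.8]
          print(best_split(sepalwidth, sepallength))

        Repeating this search over every attribute (including the boolean
        attributes from step 1, encoded as 0/1) and keeping the overall best
        split gives the condition for the root node.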
    2. (15 points) Instance Based Learning
      Consider the new instance
      5.1,3.8,1.5,0.3,Iris-setosa.
      Assume now that we want to predict the attribute iris-type using sepallength, sepalwidth, petallength, and petalwidth as predicting attributes in the ten_percent_stratified_iris.arff dataset.
      1. (5 points) Find the 5 nearest neighbors of this new instance.
      2. (5 points) Classify this new instance using these 5 nearest neighbors without any weighting.
      3. (5 points) Classify this new instance using these 5 nearest neighbors weighted by the inverse of the distance.
      SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
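
      The following sketch illustrates both classification schemes. Again,
      this is only an illustration: the helper names are ours, the training
      set below is made up rather than taken from the actual file, and the
      distances are computed on raw values (distance-based methods, including
      Weka's IBk, normally normalize numeric attributes first).

        import math
        from collections import defaultdict

        def euclidean(a, b):
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

        def knn_classify(train, query, k=5, weighted=False):
            # train is a list of (feature_vector, label) pairs.
            neighbors = sorted(train, key=lambda inst: euclidean(inst[0], query))[:k]
            votes = defaultdict(float)
            for features, label in neighbors:
                if weighted:
                    d = euclidean(features, query)
                    # Inverse-distance weight; an exact match dominates the vote.
                    votes[label] += 1.0 / d if d > 0 else float("inf")
                else:
                    votes[label] += 1.0
            return max(votes, key=votes.get)

        # Made-up training instances and the new instance from the assignment:
        train = [
            ([5.0, 3.4, 1.5, 0.2], "Iris-setosa"),
            ([4.6, 3.1, 1.5, 0.2], "Iris-setosa"),
            ([6.4, 3.2, 4.5, 1.5], "Iris-versicolor"),
            ([5.5, 2.3, 4.0, 1.3], "Iris-versicolor"),
            ([5.9, 3.0, 5.1, 1.8], "Iris-virginica"),
        ]
        new_instance = [5.1, 3.8, 1.5, 0.3]
        print(knn_classify(train, new_instance, weighted=False))
        print(knn_classify(train, new_instance, weighted=True))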

  2. Part II.

    1. Datasets: Consider the following sets of data:

      1. The Mushroom Data Set.

      2. 1995 Data Analysis Exposition. This dataset contains college data taken from the U.S. News & World Report's Guide to America's Best Colleges. The necessary files are:
      For the classification methods (numeric predictions and instance based learning), select appropriate (nominal or numeric) classification targets (i.e., "class attributes") for your experiments.

    2. Readings:
      • Textbook: Read in great detail the following Sections from your textbook:
        • Numeric Predictions: Sections 4.6, 6.5, 5.8.
        • Instance Based Learning: Sections 4.7, 6.4.
        • Clustering: Section 6.6

      • Weka Code: Read the code of the relevant techniques implemented in the Weka system. Some of those techniques are enumerated below:
        • Numeric Predictions:
          • M5PRIME: Linear Regression, Regression Trees, Model Trees
        • Instance Based Learning:
          • IBk: k-nearest neighbors
          • (Optional - LWR: Locally Weighted Regression)
        • Clustering:
          • k-means

    3. Experiments: For each of the above datasets, use the "Explorer" option of the Weka system to perform the following operations:

      1. Load the data. Note that you need to translate the dataset into the arff format first.
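
        For reference, an arff file is a plain-text file that declares the
        relation and its attributes before listing the data rows, with "?"
        marking a missing value. A minimal, entirely made-up example:

          @relation college
          @attribute name string
          @attribute tuition numeric
          @attribute region {northeast, midwest, south, west}
          @data
          'Worcester Polytechnic Institute', 18000, northeast
          ?, 12000, midwest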

      2. Preprocessing of the Data:

        A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or use the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, discretize attributes in a different way, etc. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data in order to obtain useful patterns, preprocess the data yourself, for instance by writing the necessary filters (you can incorporate them into Weka if you wish).

        In particular,

        • explore different ways of handling missing values. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it. (A sketch of the mean/mode replacement strategy appears below.)

        To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting models are easy to read.
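
        The sketch below shows the basic mean/mode replacement strategy that
        a filter like ReplaceMissingValuesFilter applies, one column at a
        time. The function name and the sample columns are ours, and "?"
        entries are assumed to have already been parsed into None:

          import statistics

          def replace_missing(column, numeric=True):
              # Fill None entries with the column mean (numeric) or mode (nominal).
              present = [v for v in column if v is not None]
              fill = statistics.mean(present) if numeric else statistics.mode(present)
              return [fill if v is None else v for v in column]

          print(replace_missing([5.1, None, 4.9, None, 5.0]))           # mean fill
          print(replace_missing(["a", None, "a", "b"], numeric=False))  # mode fill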

      3. Mining of Models: The following are guidelines for the construction of your models:

        • Code: Use Weka's algorithms listed above to generate models of each of the two datasets.

        • Training and Testing Instances:

          You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your models, the better.

    4. Evaluation and Testing: Supply input data to Weka and use an appropriate split ratio (say 66% for training vs 34% for testing).

      Analyze in detail the results obtained. For classification models, analyze the accuracy of the resulting models. For numeric predictions analyze the errors reported by Weka and explain their meaning.
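
      For the numeric-prediction error measures, two of the statistics Weka
      reports are the mean absolute error and the root mean squared error;
      the sketch below computes both on made-up values so you can check your
      reading of the output:

        import math

        def mae(actual, predicted):
            # Mean absolute error: average magnitude of the prediction errors.
            return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

        def rmse(actual, predicted):
            # Root mean squared error: penalizes large errors more heavily.
            return math.sqrt(sum((a - p) ** 2
                                 for a, p in zip(actual, predicted)) / len(actual))

        actual    = [5.1, 4.9, 6.3, 5.8]   # made-up sepallength values
        predicted = [5.0, 5.2, 6.0, 5.9]
        print(mae(actual, predicted), rmse(actual, predicted))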


REPORTS AND DUE DATE


GRADING CRITERIA

TOTAL: 100 POINTS + EXTRA POINTS DEPENDING ON EXCEPTIONAL QUALITY

---------------------------------------------------------------------------
(TOTAL: 30 points) FOR PART I OF THE PROJECT

(TOTAL: 70 points) FOR PART II OF THE PROJECT

(TOTAL: 15 points) ALGORITHMIC DESCRIPTION OF THE CODE 
(05 points) Description of the algorithm underlying the Weka filters used
(10 points) Description of the ALGORITHM underlying the data mining 
            methods used in this project.
(up to 10 extra credit points for an outstanding job) 
(providing just a structural description of the code, i.e. a list of 
classes and methods, will receive 0 points)

(TOTAL: 5 points) PRE-PROCESSING OF THE DATASET:
Discretizing attributes IF needed, and dealing with missing values appropriately
(up to 5 extra credit points) 
           Trying to do "fancier" things with attributes
           (e.g., combining two highly correlated attributes
            into one, using background knowledge, etc.)
    
(TOTAL: 46 points) EXPERIMENTS
(TOTAL: 23 points each dataset) FOR EACH DATASET:
       (05 points) ran a good number of experiments to get familiar with the 
                   data mining methods in this project
       (05 points) good description of the experiment setting and the results 
       (08 points) good analysis of the results of the experiments
                    INCLUDING discussion of evaluation statistics returned by
                    the Weka system (accuracy and/or errors) and discussion of
                   particularly interesting results 
       (05 points) comparison of the results with those obtained using other
                   methods in this and previous projects
                    Argumentation of weaknesses and/or strengths of each of the
                   methods on this dataset, and argumentation of which method
                   should be preferred for this dataset and why. 
       (up to 10 extra credit points) excellent analysis of the results and 
                                     comparisons
       (up to 10 extra credit points) running additional interesting experiments

(TOTAL 4 points) SLIDES - how well do they summarize concisely
        the results of the project? We suggest you summarize the
        setting of your experiments and their results in a tabular manner.
   (up to 6 extra credit points) for excellent summary and presentation of results 
   in the slides.