WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2006 
Homework and Project 3: Numeric Predictions

PROF. CAROLINA RUIZ 

DUE DATES: ------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is to construct accurate numeric prediction models for the two datasets under consideration using the following techniques: Also, to gain close understanding of how those methods work, this project also include following those methods by hand on a toy dataset.

PROJECT ASSIGNMENT

  1. Part I. INDIVIDUAL Homework

    See Piotr Mardziel's Homework Solutions.

    Consider the dataset below. This dataset is an adaptation of the World Happiness Dataset.

    @relation world_happiness
    
    % - Life Expectancy from UN Human Development Report (2003)
    % - GDP per capita from figure published by the CIA (2006), figure in US$.
    % - Access to secondary education rating from UNESCO (2002)
    % - SWL (satisfaction with life) index calculated from data published 
    %   by New Economics Foundation (2006).
    
    @attribute country string
    @attribute continent {Americas,Africa,Asia,Europe}
    @attribute life-expectancy numeric
    @attribute GDP-per-capita numeric
    @attribute access-to-education-score numeric
    @attribute SWL-index numeric
    
    @data
    Switzerland,	Europe,		80.5,	32.3,	99.9,	273.33
    Canada,		Americas,	80,	34,	102.6,	253.33
    Usa,		Americas,	77.4,	41.8,	94.6,	246.67
    Germany,	Europe,		78.7,	30.4,	99,	240
    Mexico,		Americas,	75.1,	10,	73.4,	230
    France,		Europe,		79.5,	29.9,	108.7,	220
    Thailand,	Asia,		70,	8.3,	79,	216.67
    Brazil,		Americas,	70.5,	8.4,	103.2,	210
    Japan,		Asia,		82,	31.5,	102.1,	206.67
    India,		Asia,		63.3,	3.3,	49.9,	180
    Ethiopia,	Africa,		47.6,	0.9,	5.2,	156.67
    Russia,		Asia,		65.3,	11.1,	81.9,	143.3
    
    
    For this homework, we want to predict the SWL-index attribute (prediction target) from the other predicting attributes continent, life-expectancy, GDP-per-capita, access-to-education-score. Note that the attribute country identifies each data instance uniquely and as such will be disregarded in our analysis. It is provided just for context.

    1. (15 points) Linear Regression

      • (5 points) Describe the linear regression equation that would result from using linear regression to solve this numeric prediction problem.

      • (5 points) Describe in detail the procedure that would be followed by linear regression to find appropriate parameters for this equation.

      • (5 points) Run linear regression in the Weka system over this dataset and provide the precise linear regression equation output by Weka (you'll need it for the testing part below). Set the "eliminateColinearAttributes" parameter of linear regression to False so that your linear regression formula includes all the predicting attributes.

    2. (45 points) Regression Trees and Model Trees
      Follow the procedure described in the textbook to construct a model tree and a regression tree to solve this numeric prediction problem. Remember to:

      1. (5 points) Start by translating the nominal attribute continent into boolean/numeric attributes. This is done by taking the average of the CLASS values associated with each of the continent values Americas,Africa,Asia,Europe. Sort them in decresing order by average. Now, create new boolean attributes, one for each possible split of these four nominal values in the order listed. After this translation, all the predicting attributes are numeric.

      2. (5 points) Sort the values of each attribute in say increasing order. Define a "split point" of an attribute as the midpoint between two subsequent values of the attribute.

      3. (5 points) Consider the set of split points of all attributes. Select as the condition for the root node on your tree, the split point that maximizes the value of the following formula:
           SDR = sd(CLASS over all instances)
                 - ((k1/n)*sd(CLASS of instances with attribute value below split point)
                    + (k2/n)*sd(CLASS of instances with attribute value above split point))
        
           where sd stands for standard deviation.
           k1 is the number of instances with attribute value below split point.
           k2 is the number of instances with attribute value above split point.
           n is the number of instances.
             
        To reduce the number of calculations that you need to perform, your HW solutions can be limited to the following split points:
          binary attributes	life-expectancy	GDP-per-capita	access-to-education-score
          0.5			(63.3+47.6)/2 	(3.3+0.9)/2	(49.9+5.2)/2
        			(70.0+65.3)/2	(8.3+3.3)/2	(73.4+49.9)/2
        			(75.1+70.5)/2	(29.9+11.1)/2	(94.6+81.9)/2
        			(78.7+77.4)/2	(32.3+31.5)/2	(102.1+99.9)/2
        			(82.0+80.5)/2	(41.8+34.0)/2	(108.7+103.2)/2
          
        If during the construction of the tree you encounter an attribute such that none of the split points listed above apply to the instances in the node, then use instead all the attribute's split points that apply to that collection of instances.

      4. (20 points) Continue the construction of the tree following the same procedure recursively. Remember that for each internal node, the procedure above is applied only to the data instances that belong that that node. You can stop splitting a node when the node contains less than 4 data instances and/or when the standard deviation of the CLASS value of the node's instances is less than 0.05*sda, where sda is the standard deviation of the CLASS attribute over the entire input dataset. See Figure 6.15 on p. 248 of your textbook.

      5. (10 points) For each leaf node in the tree:
        • Compute the value that would be predicted by that leaf in the case of a Regression Tree.
        • Compute the linear regression formula that would be used by that leaf to predict the CLASS value in the case of a Model Tree. In order to find the coefficients of the linear regression formula, run the linear regression method implemented in the Weka system for the appropriate data instances (those that belong to the leaf).

    3. (15 points) Testing
      Use each of the three numeric predicting models constructed (linear regression equation, model tree and regression tree), to predict the CLASS values (i.e., the SWL-index) for each of the test instances below. That is, complete the following table:
                                                       	LINEAR        MODEL TREE    REGRESSION TREE 
                                                       	REGRESSION    PREDICTION    PREDICTION
      
      
      Costa_Rica,	Americas, 78.2,	11.1,	50.9,	250	__________    ___________   __________
      
      United_Kingdom,	Europe,   78.4,	30.3,	157.2,	236.67	__________    ___________   __________
      
      South_Africa,	Africa,   48.4,	12,	90.2,	190	__________    ___________   __________
      
      Lithuania,	Europe,   72.3,	13.7,	93.4,	156.67	__________    ___________   __________
      
      
      
                                                       	LINEAR        MODEL TREE    REGRESSION TREE 
                                                       	REGRESSION    ERROR         ERROR
                                                       	ERROR
      
      root mean-square error (see p. 178)              	__________    ___________   __________
      
      mean absolute error (see p. 178)                 	__________    ___________   __________
      
      
      SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.

  • Part II. Project

    1. Datasets: Two datasets will be analyzed in this part:

      1. The World Happiness Dataset. See also the World Happiness Dataset with Continents information added by Paul Sader and converted into arff. Thanks, Paul!!

        Use the attribute SWL-index as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice. Since the SWL-ranking can be derived from SWL-index, remove SWL-ranking from consideration. Also, remove the attribute country as each of its values identifies an instance uniquely. Note that the access-to-education-score attribute contains missing values, marked with ".". Remember to replace them with "?" in your arff file.

      2. Together with your project partner, choose one dataset from the following options:

        • The Abalone Dataset available from the The University of California Irvine (UCI) Data Repository.

          Use the attribute age as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

        • The Automobile Dataset available from the The University of California Irvine (UCI) Data Repository.

          Use the attribute price as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.

        • A dataset of your choice. This dataset can be one available on a public, online data repository (including but not limited to the datasets used on Projects 1 or 2) or any other valid source. The dataset should contain at least 500 data instances with at least 5 different attributes (ideally some numeric and some nominal). For ideas, see the Data Sets listed on the course webpage/syllabus.

    2. Readings:
      • Textbook: Read in great detail the following Sections from your textbook:
        • Numeric Predictions: Sections 4.6, 6.5, 5.8.

      • Weka Code: Read the code of the relevant techiques implemented in the Weka system. Some of those techniques are enumerated below:
        • Numeric Predictions:
          • Linear Regression (under "functions")
          • M5P (under "trees"): Regression Trees, Model Trees

      Experiments:

      For each of the datasets, use the Weka system to perform the following operations:

      • A main part of the project is the PREPROCESSING of your dataset. You should apply relevant filters to your dataset before doing the mining and/or using the results of previous mining tasks. For instance, you may decide to remove apparently irrelevant attributes, replace missing values if any, etc. Missing values in arff files are represented with the character "?". See the weka.filter.ReplaceMissingValuesFilter in Weka. Play with the filter and read the Java code implementing it. Your report should contain a detailed description of the preprocessing of your dataset and justifications of the steps you followed. If Weka does not provide the functionality you need to preprocess your data as you need to obtain useful patterns, preprocess the data yourself either by writing the necessary filters (you can incorporate them in Weka if you wish).

      • To the extent possible/necessary, modify the attribute names and the nominal value names so that the resulting models are easy to read.

      • You may restrict your experiments to a subset of the instances in the input data IF Weka cannot handle your whole dataset (this is unlikely). But remember that the more accurate your models, the better.

      • Ideas for making this project even more interesting:
        • Start by choosing a dataset in a domain that excites you.
        • Before you start running experiments, look at the raw data in detail. Figure out 5 or more specific, interesting questions that you want to answer with your experiments. These questions may be phrased as conjectures that you want to confirm/refute with your experimental results. Sample questions on the World Happiness are: what's the relative importance of access to education, per-capita income, and life expectancy in satisfaction with life? What factor is most important? Are there general trends by continent?
          If you pursue this last question, add continent information to your dataset. The first group/individual to submit a correct dataset including continent information to the course mailing list will receive a 25 points bonus. To make it standard, let's assume there are 6 continents: Antarctica, Americas, Europe, Asia, Africa, Australia.
          Paul Sader was the first to submit a correct dataset with continent information added to it. Thanks, Paul!
        • Design your preprocessing and experiments around answering these 5 questions.
        • Analyze your resulting models in the light of your 5 questions.

      • WRITTEN REPORT.

        Your project report should contain discussions of all the parts of the work you do for this project. In particular, it should elaborate on the the following topics:

        1. Code Description: Describe algorithmicly the Weka code of the 3 methods covered by this project: LINEAR REGRESSION, MODEL TREES, AND REGRESSION TREES; and filters that you use in the project. More precisely, explain the ALGORITHM underlying the code in terms of the input it receives, the output it produces, and the main steps it follows to produce this output. PLEASE NOTE THAT WE EXPECT A DETAIL DESCRIPTION OF THE ALGORITHMS USED NOT A LIST OF OBJECTS AND METHODS IMPLEMENTED IN THE CODE.

        2. Experiments: For EACH EXPERIMENT YOU RAN describe:
          • 5 or more specific questions/conjectures about the dataset domain that you aim to answer/validate with your experiments. See the section on "Ideas for making this project even more interesting" above.
          • Instances: What data did you use for the experiment? That is, did you use the entire dataset of just a subset of it? Why?
          • Any pre-processing done to the data. That is, did you remove any attributes? Did you replace missing values? If so, what strategy did you use to select a replacement of the missing values?
          • Your system parameters.
          • For each of the methods covered by this project (i.e. LINEAR REGRESSION, MODEL TREES, AND REGRESSION TREES):
            • Results and detail ANALYSIS of results of the experiments you ran using different ways of testing (split ratio and N-fold cross-validation) the model.
            • Error measures of the resulting models.
            • Comparison of the results across the 3 numeric prediction methods used in this project.
            • Detailed analysis of the models obtained. Examine the linear regression formulas and the trees obtained. Elaborate on interesting attribute weights in those formulas, as well as presence/absence of predicting attributes in the trees. Do these models answer any of your 5 questions? If so, what's the answer? If not, why not and what modifications to the experiments are needed to answer those questions?
          • Investigate the pruning method used in M5P (i.e., when the option UNPRUNED=FALSE is chosen from the M5P GUI), which is explained in your textbook. Describe in detail how the pruning is done and run experiments to see how it modifies the resulting tree and how this affects the error values reported for the tree.

            Investigate also the smoothing method used in M5PRIME (i.e. when the option useUNSMOOTHED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the smoothing is done and run experiments to see how it modifies the resulting tree and/or how it affects the error values reported for the tree.

        3. Summary of Results
          • For each of the datasets, what were the lowest error values obtained in your project? Include this model in your report as well as interesting aspects of this model.
          • Strengths and weaknesses of your project.

      • ORAL REPORT. We will discuss the results from the individual projects during the class on Tuesday, Dec. 5. Your oral report should summarize the different sections of your written report as described above. Each group will have about 4 minutes to explain your results and to discuss your project in class. Be prepared!

      PROJECT SUBMISSION AND DUE DATE

      Part II is due Friday, Dec. 1st at 11:50 am. BRING A HARDCOPY OF THE WRITTEN REPORT WITH YOU TO CLASS. In addition, you must submit your report electronically as specified below. Submissions received on Friday, Dec. 1st between 11:51 am and 12:00 midnight will be penalized with 30% off the grade, submissions received on Saturday Dec. 2th between 12:01 am (early morning) and 8:00 am will be penalized with 60% off the grade; and submissions received after Saturday Dec. 2th at 8:00 am won't be accepted.

      Please submit the following files using the myWpi digital drop box:

      1. [lastname]_proj3_report.[ext] for those of you who are working alone on this project. If you are taking this course for grad. credit, state this fact at the beginning of your report. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
        • ruiz_proj3_report.pdf

      2. [lastname1_lastname2]_proj3_report.[ext] containing your group written reports. This file should be either a PDF file (ext=pdf), a Word file (ext=doc), or a PostScript file (ext=ps). For instance my file would be named (note the use of lower case letters only):
        • ruiz_smith_proj3_report.pdf if I worked with Joe Smith on this project.

      3. [lastname1_lastname2]_proj3_slides.[ext] (or [lastname]_proj3_slides.[ext] in the case of students taking this course for graduate credit) containing your slides for your oral reports. This file should be either a PDF file (ext=pdf) or a PowerPoint file (ext=ppt). Your group will have only 4 minutes in class to discuss the entire project.

    GRADING CRITERIA

    
    (TOTAL:  75 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.
    
    (TOTAL: 200 points) FOR THE PROJECT (PART II) as follows:
    
    
    NOTES:
    Along with scoring criteria, provided are lists of suggestions. Positive suggestions (marked with green "+") include aspects of a project that have positive impacts on the scoring criteria. Negative suggestions (marked with red "-") have negative impacts. Finally, suggestions marked with "!" result in extra points.
    !this can earn a few extra points (0-8)

    ! !this can earn a more extra points (0-16)

    ! ! !this can earn even more extra points (0-24)

    ! ! ! !this can earn a large extra points (0-32)

    ! ! ! ! !this can earn a very large amount of extra points (0-40)

    +this has a positive impact

    + +this has a greater positive impact

    + + +this has a significant positive impact

    + + + +this has very significant positive impact

    + + + + +this has huge positive impact

    -this has a negative impact

    - -this has a greater negative impact

    - - -this has a significant negative impact

    - - - -this has very significant negative impact

    - - - - -this has huge negative impact


    METHODS for this project are:
    linear regression,
    model tree, and
    regression tree.

    DATASETS for this project are:
    dataset1: world happiness dataset
    dataset2: abalone dataset, automobile dataset, OR a dataset of your choice

    TOTAL: 200 points: Project 3 Report
    TOTAL: 20 points: Algorithmic Description of the Code

    (05 points) Description of the algorithms underlying the WEKA filters used. If you have already described some used filters in previous projects then copying and pasting your work would be a good idea. If issues were raised about the old descriptions, consider revising them.

    (TOTAL: 15 points) Description of the ALGORITHMS underlying the data mining methods used in this project (see METHODS). This includes:
    (10 points) Descriptions of model construction

    (5 points) Descriptions of instance classification/prediction


    +(very) high-level pseudo code with explanation and justification

    - -raw pseudo code with no justification or reasoning behind the steps involved

    - - - - -structural description of WEKA code, i.e. classes/members/methods


    TOTAL: 5 points: Slides
    (5 points) How well do they summarize concisely the results of the project? We suggest you summarize the setting of your experiments and their results in a tabular manner.
    ! excellent summary and presentation of results in the slides

    + (potential) unanswered questions presented to class

    + visual aids

    + +main ideas and observations summarized

    - nothing but accuracy measures

    - lack of visual aids


    TOTAL: 5 points: Pre-Processing of the Datasets
    (5 points) Preprocess attributes as needed and dealing with missing values appropriately
    !Trying to do "fancier" things with attributes (i.e. combining two attributes highly correlated into one, using background knowledge, etc.)

    TOTAL: 15 points: Class Presentation
    (15 points) How well your oral presentation summarized concisely the results of the project and how focus your presentation was on the more creative/interesting/useful of your experiments and results. This grade is given individually to each team member.

    TOTAL: 155 points: Experiments Goals
    (30 points) Good description of the experiment setting and the results. This applies to all the experiments done for this project.
    + + + motivation for experiments described

    + specify testing method used (cross-validation / % split / etc.)

    + + overall summary of your experiments in one concise list that includes relevant parameters

    - included are trivial experiment details that are not used in dicscussion (full algorithm parameters, full WEKA output, etc.)

    - - ambiguous or unclear setting and/or results


    (30 points) Good analysis of the results of the experiments INCLUDING discussion of evaluations statistics returned by the WEKA systems (accuracy and/or errors) and discussion of particularly interesting results. This applies to all the experiments done for this project.
    ! ! ! !excellent analysis of the results and comparisons

    ! ! ! !running additional interesting experiments

    - - - - results rewritten in prose used in place of discussion and analysis


    TOTAL: 35 points: Method-Oriented Goals
    (15 points) ran a good number of experiments to get familiar with the data mining methods in this project (see METHODS) using a variety of datasets (see DATASETS)
    +explore variation in non-trivial method parameters (debug/verbose mode IS a trivial parameter)

    - - -use only one of the datasets


    (10 points) Investigate the pruning method used in M5P (i.e., when the option UNPRUNED=FALSE is chosen from the M5P GUI), which is explained in your textbook. Describe in detail how the pruning is done and run experiments to see how it modifies the resulting tree and how this affects the error values reported for the tree.
    - - -use only one of the datasets


    (10 points) Investigate also the smoothing method used in M5PRIME (i.e. when the option useUNSMOOTHED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the smoothing is done and run experiments to see how it modifies the resulting tree and/or how it affects the error values reported for the tree.
    - - -use only one of the datasets


    TOTAL: 40 points: Dataset-Oriented Goals
    For each dataset (see DATASETS):
    (10 points) 5 or more specific questions/conjectures about the dataset domain that you aim to answer/validate with your experiments

    (10 points) Comparison of the results across all the methods used in this project (see METHODS). That is, compare the results of one method on this dataset to the results of the other methods on the same dataset.
    ! !comparison with methods from past projects

    TOTAL: 20 points: Holistic Goals
    For each dataset (see DATASETS):
    (10 points) Argumentation of weaknesses and/or strengths of each of the methods on this dataset, and argumentation of which method should be preferred for this dataset and why.