CS444X Data Mining and Knowledge Discovery in Databases. D-2003
Exam 2 - April 29, 2003

Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute

Solutions by Carolina Ruiz.

Instructions


Problem I. Data Mining Techniques (40 points)

For each of the following data mining applications, state what data mining task best matches the objectives described (possible tasks are classification, clustering, association, and regression). Also state what the instances are and what the attributes are in each case. Note that the training and test data are not described explicitly, so you will have to infer what these would be like from the statement.

  1. (10 Points) Predicting the approximate number of Olympic medals won by athletes of a given country based on information about the country such as population, per capita income, type of government, and geographic location.
    Solution:
    

    • (2 points) Task: This is a regression task.

    • (4 points) Justify your answer: The objective is similar to classification, but there is a large number of possible "classes", and the numerical distance between them plays a role in evaluating prediction quality.

    • (2 points) Instances: The instances are countries.

    • (2 points) Attributes: The attributes are population, per capita income, type of government, geographic location, etc.

  2. (10 Points) Discovering subspecies within an animal species based on characteristics of individual animals like size, speed, longevity, and fur color.
      Solution:

    • (2 points) Task: This is clustering.

    • (4 points) Justify your answer: The subspecies are not known in advance, thus we have an unsupervised learning problem. Instances are to be grouped into subsets within which similarity is higher than it is between different subgroups.

    • (2 points) Instances: The instances are individual animals.

    • (2 points) Attributes: The attributes are the characteristics listed (size, speed, ...)

  3. (10 Points) Describing relationships among people's personality traits and/or their interests, for example "introverted people like action movies", or "outgoing people are analytical".
      Solution:

    • (2 points) Task: Association.

    • (4 points) Justify your answer: There is no particular target attribute, and no grouping by similarity is sought; instead, the objective is to discover co-occurrences of attribute values.

    • (2 points) Instances: The instances are people.

    • (2 points) Attributes: The attributes are the personality traits and interests described in the statement.

  4. (10 Points) Predicting whether a person will buy T-shirts or not based on the person's answers to survey questions.
      Solution:

    • (2 points) Task: Classification.

    • (4 points) Justify your answer: There is a clear class attribute (buys-T-shirts) that must be predicted. There is no numeric prediction nor notion of distance between class values that is used to penalize some class errors less than others (as would be the case in regression).

    • (2 points) Instances: The instances are people.

    • (2 points) Attributes: The attributes are (answers to) the survey questions.

Problem II. Association Rules (30 + 10 extra points)

This problem refers to the credit dataset discussed in class, which is shown below in ARFF format. The instances of this dataset may be interpreted as transactions as discussed in class. Each transaction is a list of items. Each item is an attribute-value pair of the form attribute=value.
@relation credit-data

@attribute credit_history {bad, unknown, good}
@attribute debt {minor, major}
@attribute collateral {none, adequate}
@attribute risk {low, moderate, high}

@data
%
% 14 instances
%

bad,      minor,  none,      high
bad,      minor,  none,      moderate
bad,      minor,  adequate,  moderate
bad,      major,  none,      high
unknown,  minor,  none,      moderate
unknown,  minor,  adequate,  low
unknown,  minor,  none,      low
unknown,  major,  none,      high
unknown,  major,  none,      high
good,     minor,  none,      low
good,     major,  none,      high
good,     major,  none,      moderate
good,     major,  none,      low
good,     major,  adequate,  low

  1. (10 Points) Compute the support and the confidence of the association rule credit-history=bad & collateral=none => risk=high relative to the dataset shown. You may leave your answers in the form of fractions if you wish.
    Solution:
    
      Support of a rule is the percentage of instances in the dataset that
      contain all the items in the rule:
    
        support = P(credit-history=bad & collateral=none & risk=high) = 2/14 = 1/7
    
    
      Confidence of a rule is the percentage of instances that contain all the items
      in the right-hand side of the rule among those instances that contain all the
      items in the left-hand side of the rule:
    
         confidence = P(risk=high | credit-history=bad & collateral=none) = 2/3
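
    For comparison, here is a minimal Python sketch (not part of the original
    exam) that recomputes the support and confidence of this rule directly
    from the 14 instances listed above:

        # The 14 credit instances, in the ARFF column order
        # (credit_history, debt, collateral, risk).
        rows = [t.split(",") for t in """
            bad,minor,none,high         bad,minor,none,moderate
            bad,minor,adequate,moderate bad,major,none,high
            unknown,minor,none,moderate unknown,minor,adequate,low
            unknown,minor,none,low      unknown,major,none,high
            unknown,major,none,high     good,minor,none,low
            good,major,none,high        good,major,none,moderate
            good,major,none,low         good,major,adequate,low
            """.split()]

        lhs  = [r for r in rows if r[0] == "bad" and r[2] == "none"]   # antecedent
        both = [r for r in lhs if r[3] == "high"]                      # antecedent and consequent

        support    = len(both) / len(rows)    # 2/14 = 1/7
        confidence = len(both) / len(lhs)     # 2/3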
    
    

  2. (10 Points) How many *different* 1-itemsets are there for this dataset?
    Solution:
    
    	As many as there are items of the form attribute=value, that is,
    	as many as the total number of different attribute values:
    
               credit_history: 3  {bad, unknown, good}
               debt:           2  {minor, major}
               collateral:     2  {none, adequate}
               risk:           3  {low, moderate, high}
    
    
    	The total number is 3 + 2 + 2 + 3 = 10.
    

  3. (10 Points) Find all frequent 1-itemsets for a minimum support threshold of 0.40 (i.e. 40%). Express each itemset in the form attribute=value. In each case, write the number of instances in which the 1-itemset appears next to the itemset. Note the approximate values 5/14=.36, 6/14=.43.
    Solution:
    
    	The minimum support of 0.40 requires that a frequent 1-itemset appear
    	in at least 0.40 * 14 = 5.6, i.e. 6 or more, of the 14 instances.
    	Here are the counts and the winners:
    
    
            Attribute-value pair          support   frequent?
    
               credit_history=bad           4/14    no
               credit_history=unknown       5/14    no
               credit_history=good          5/14    no
               debt=minor                   7/14    yes
               debt=major                   7/14    yes
               collateral=none             11/14    yes
               collateral=adequate          3/14    no
               risk=low                     5/14    no
               risk=moderate                4/14    no
               risk=high                    5/14    no
    
    
    

  4. (10 extra points) Based on your answer for the number of frequent 1-itemsets, how many *candidate* 2-itemsets would the Apriori algorithm consider when seeking frequent 2-itemsets? You may assume that only 2-itemsets corresponding to two different attributes are considered here.
    Solution:
    
    	The candidate 2-itemsets are pairs of frequent 1-itemsets. The three
    	frequent 1-itemsets (debt=minor, debt=major, collateral=none) form three
    	pairs, but {debt=minor, debt=major} combines two values of the same
    	attribute, so only 2 candidate 2-itemsets remain:
    
            {debt=minor, collateral=none}
            {debt=major, collateral=none}
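
    For items 3 and 4, the counting and the Apriori candidate-generation step
    can also be sketched in a few lines of Python (again, not part of the
    original exam; the rows list simply restates the 14 instances above):

        from collections import Counter
        from itertools import combinations

        attrs = ["credit_history", "debt", "collateral", "risk"]
        rows = [t.split(",") for t in """
            bad,minor,none,high         bad,minor,none,moderate
            bad,minor,adequate,moderate bad,major,none,high
            unknown,minor,none,moderate unknown,minor,adequate,low
            unknown,minor,none,low      unknown,major,none,high
            unknown,major,none,high     good,minor,none,low
            good,major,none,high        good,major,none,moderate
            good,major,none,low         good,major,adequate,low
            """.split()]

        # Item 3: support counts of all 1-itemsets; keep those with support >= 0.40.
        counts = Counter((a, v) for row in rows for a, v in zip(attrs, row))
        frequent1 = [item for item, c in counts.items() if c / len(rows) >= 0.40]
        # -> debt=minor (7), debt=major (7), collateral=none (11)

        # Item 4: candidate 2-itemsets are pairs of frequent 1-itemsets
        #         whose items come from two different attributes.
        candidates2 = [{a, b} for a, b in combinations(frequent1, 2) if a[0] != b[0]]
        # -> {debt=minor, collateral=none} and {debt=major, collateral=none}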
              
    

Problem III. Numeric Predictions (35 points + 10 extra credit points)

Consider the following dataset consisting of two attributes age and weight (measured in pounds). Note that instances have been labeled with a number in parentheses in case you need to refer to them in your solutions.
@relation how-much-you-weigh

@attribute age {young,adult,elderly}
@attribute weight numeric

@data
(1) young,    110
(2) young,     90
(3) adult,    150
(4) adult,    170
(5) elderly,  130
(6) elderly,  160

The purpose of this problem is to predict the weight of a person based on his/her age.

  1. (15 Points) Nominal to Binary: Translate the nominal attribute age into as many binary attributes as needed. Show your work.
    
    Solution:
    
      In order to convert the nominal attribute age into binary ones, we calculate the
      average weight for each age value:
    
       young:  average weight = 100 = (110 + 90)/2
       adult:  average weight = 160 = (150 + 170)/2
       elderly: average weight = 145 = (130 + 160)/2
    
     Now, we sort those nominal values in decreasing order of average weight:
        adult, elderly, young.

     We introduce two binary attributes:
      age=adult  and  age=elderly,adult
    
    
    Show below the dataset that results from converting the age attribute into the binary attribute(s) (each of which will be treated as numeric).
    @relation modified-how-much-you-weigh
    
    @attribute age=adult numeric
    @attribute age=elderly,adult numeric
    @attribute weight numeric

    @data
    (1) 0,  0,  110
    (2) 0,  0,   90
    (3) 1,  1,  150
    (4) 1,  1,  170
    (5) 0,  1,  130
    (6) 0,  1,  160
    
    
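    A small Python sketch (not from the exam) of this nominal-to-binary
    conversion: sort the age values by decreasing average weight, then add one
    binary attribute per proper prefix of that ordering:

        data = [("young", 110), ("young", 90), ("adult", 150),
                ("adult", 170), ("elderly", 130), ("elderly", 160)]

        # average weight of each age value, used to order the values
        def avg_weight(v):
            ws = [w for x, w in data if x == v]
            return sum(ws) / len(ws)

        order = sorted({v for v, _ in data}, key=avg_weight, reverse=True)
        # order == ['adult', 'elderly', 'young']

        # one binary attribute per proper prefix of the ordering:
        # {'adult'} -> age=adult,  {'adult', 'elderly'} -> age=elderly,adult
        groups = [set(order[:i]) for i in range(1, len(order))]

        converted = [tuple(int(v in g) for g in groups) + (w,) for v, w in data]
        # e.g. ('adult', 150) -> (1, 1, 150)   ('elderly', 130) -> (0, 1, 130)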

  2. (10 Points) Root of Model/Regression Tree:
    Assume that we want to construct a model (or a regression) tree to predict weight in terms of the binary attributes you constructed above. Identify the split points that you need to consider and select from them the one for the root of the model/regression tree. Show your work.
    
    Solution:

    Note that there are two split points (one for each binary attribute)
    that we need to consider:
       
      age=adult = 0.5  and  age=elderly,adult = 0.5
    
    Let's denote the first of them as p1 (i.e. p1 = "age=adult = 0.5") and the
    second as p2 (i.e. p2 = "age=elderly,adult = 0.5").
    
    We select as the condition for the root node of our tree
    the split point that maximizes the value of the following formula:
    
      SDR = sd(weight over all instances)
             - [(k1/n)*sd(weight of instances with attribute value below split point)
                  + (k2/n)*sd(weight of instances with attribute value above split point)]
    
            where sd stands for standard deviation.
            k1 is the number of instances with attribute value below split point.
            k2 is the number of instances with attribute value above split point.
            n is the number of instances.
    
    Using the standard deviations (sd) calculated by Ethan (thanks Ethan!!),
    we compute the standard deviation reduction of each of these split points:

    p1: SDR = sd(110,90,150,170,130,160) - [(4/6)*sd(110,90,130,160) + (2/6)*sd(150,170)]

            = 30.82207 - [(4/6)*29.86079 + (2/6)*14.14214]

            = 6.200833

    p2: SDR = sd(110,90,150,170,130,160) - [(2/6)*sd(110,90) + (4/6)*sd(150,170,130,160)]

            = 30.82207 - [(2/6)*14.14214 + (4/6)*17.07825]

            = 14.72252
    
    
    Hence, p2 is used for the root of the model/regression tree:
    
    
    
                         -------------------
                        | age=elderly,adult |
                         -------------------
                       /                     \
                 < 0.5/                       \ > 0.5
                     /                         \
    
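    A minimal Python sketch (not part of the exam) that recomputes the SDR of
    the two candidate split points, using the sample standard deviation as in
    the calculation above:

        from statistics import stdev

        weights           = [110, 90, 150, 170, 130, 160]   # instances (1)-(6)
        age_adult         = [0,   0,  1,   1,   0,   0]
        age_elderly_adult = [0,   0,  1,   1,   1,   1]

        def sdr(attribute, split=0.5):
            below = [w for a, w in zip(attribute, weights) if a < split]
            above = [w for a, w in zip(attribute, weights) if a > split]
            n = len(weights)
            return stdev(weights) - (len(below)/n)*stdev(below) - (len(above)/n)*stdev(above)

        print(sdr(age_adult))            # ~6.20   (p1)
        print(sdr(age_elderly_adult))    # ~14.72  (p2) -- larger, so p2 is chosen for the root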

  3. (10 Points) Regression Tree:
    Continue the construction of the tree to obtain a REGRESSION TREE to predict weight. Show what the remaining internal nodes are and show the precise function used to make predictions in each of the leaves.
    
    Solution:
    
    Let's construct the left-most and the right-most children of the root node.
    
    LEFT-MOST CHILD age=elderly,adult < 0.5:
    
      Note that instances (1) and (2) fall into this child. Both of them
      have age=adult < 0.5 as well, so the remaining split point cannot
      separate them and there is no need to add more "internal" nodes
      (i.e. tests) to this branch.
    
      So the left-most child of the root is a leaf node with predicting 
      function equal to the average weight of instances (1) and (2).
    
      Predicting function = (90 + 110)/2 = 100
      
    
    RIGHT-MOST CHILD age=elderly,adult > 0.5:
    
      Note that instances (3), (4), (5), and (6) lie in this child. The only
      remaining split point is p1, so we need to use it:
    
    
    
                         -------------------
                        | age=elderly,adult |
                         -------------------
                       /                     \
                 < 0.5/                       \ > 0.5
                     /                         \
    
              ~~~~~~~                    -----------
             |  100  |                  | age=adult | 
              ~~~~~~~                    -----------
           instances 1, 2              /             \
                                 < 0.5/               \ > 0.5
                                     /                 \
                                 ~~~~~~~~           ~~~~~~~~         
                                |  145   |         |  160   |
                                 ~~~~~~~~           ~~~~~~~~         
                              instances 5, 6     instances 3, 4
                 
    
        The predictive function of each of the last two leaves is the
        average weight of the instances that belong to that leaf.
    
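    The finished regression tree can be read as a small prediction function;
    here is a sketch (not part of the exam) with the two binary attributes
    passed as 0/1 values:

        def predict_weight(age_adult, age_elderly_adult):
            # root split: age=elderly,adult < 0.5 ?
            if age_elderly_adult < 0.5:
                return 100.0     # leaf for instances (1), (2): (110 + 90) / 2
            # right branch split: age=adult < 0.5 ?
            if age_adult < 0.5:
                return 145.0     # leaf for instances (5), (6): (130 + 160) / 2
            return 160.0         # leaf for instances (3), (4): (150 + 170) / 2

        print(predict_weight(0, 1))      # an elderly person -> 145.0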

  4. (10 extra credit points) Model Tree:
    Now, modify the leaves of your regression tree above to obtain a MODEL TREE to predict weight. Show the precise function used to make predictions in each of the leaves.
    Solution:
    
    The structure of the model tree is identical to that of the regression
    tree constructed above. The only difference is the predictive function
    in each of the leaves: they are linear regressions that predict the
    target attribute (weight) in terms of the attributes age=adult and
    age=elderly,adult.
    
    
    
    
                         -------------------
                        | age=elderly,adult |
                         -------------------
                       /                     \
                 < 0.5/                       \ > 0.5
                     /                         \
    
              ~~~~~~~                    -----------
             |  L1   |                  | age=adult | 
              ~~~~~~~                    -----------
           instances 1, 2              /             \
                                 < 0.5/               \ > 0.5
                                     /                 \
                                 ~~~~~~~~           ~~~~~~~~         
                                |   L2   |         |   L3   |
                                 ~~~~~~~~           ~~~~~~~~         
                              instances 5, 6     instances 3, 4
                 
    
    Where:
    
    L1 = w0 + w1*age=adult + w2*age=elderly,adult
    
         with w0, w1, and w2 computed using linear regression over instances 1, 2.
    
    
    L2 = w3 + w4*age=adult + w5*age=elderly,adult
    
         with w3, w4, and w5 computed using linear regression over instances 5, 6.
    
    
    L3 = w6 + w7*age=adult + w8*age=elderly,adult
    
         with w6, w7, and w8 computed using linear regression over instances 3, 4.
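
    As a sketch of how those leaf models could be fitted (numpy is not part of
    the exam; this is just an illustration), ordinary least squares over the
    instances in each leaf gives:

        import numpy as np

        # Rows are [1, age=adult, age=elderly,adult]; targets are the weights.
        leaves = {
            "L1": ([[1, 0, 0], [1, 0, 0]], [110, 90]),    # instances 1, 2
            "L2": ([[1, 0, 1], [1, 0, 1]], [130, 160]),   # instances 5, 6
            "L3": ([[1, 1, 1], [1, 1, 1]], [150, 170]),   # instances 3, 4
        }

        for name, (X, y) in leaves.items():
            w, *_ = np.linalg.lstsq(np.array(X, float), np.array(y, float), rcond=None)
            print(name, w)

        # With only two instances per leaf, both sharing the same attribute values,
        # the fitted model cannot do better than the leaf mean (100, 145, and 160),
        # which matches the constant leaf models in the Weka output below.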
    
    
    -------------------
    
    For comparison purposes, I include below Weka's output for this problem.
    
    
    === Run information ===
    
    Scheme:       weka.classifiers.m5.M5Prime -O m -U -F 10.0 -V 5
    Relation:     how-much-you-weigh
    Instances:    6
    Attributes:   2
                  age
                  weight
    Test mode:    evaluate on training data
    
    === Classifier model (full training set) ===
    
    Unpruned training model tree:
    
    age=elderly,adult <= 0.5 : LM1 (2/40%)
    age=elderly,adult >  0.5 : 
    |   age=adult <= 0.5 : LM2 (2/60%)
    |   age=adult >  0.5 : LM3 (2/40%)
    
    Models at the leaves:
    
        LM1:  weight = 100
        LM2:  weight = 145
        LM3:  weight = 160
    
    Pruned training model tree:
    
    LM1 (6/53.5%)
    
    Models at the leaves:
    
      Unsmoothed (simple):
    
        LM1:  weight = 100 + 52.5age=elderly,adult
    
    Number of Rules : 1
    
    Time taken to build model: 0 seconds
    
    === Evaluation on training set ===
    === Summary ===
    
    Correlation coefficient                  0.8796
    Mean absolute error                     11.6667
    Root mean squared error                 13.3853
    Relative absolute error                 46.6667 %
    Root relative squared error             47.5727 %
    Total Number of Instances                6