Exam 2 - April 29, 2003

Department of Computer Science

Worcester Polytechnic Institute

**Solutions by Carolina Ruiz.**

- Show your work
- **Justify** your answers
- Use the space provided to write your answers
- Ask in case of doubt

**(10 Points)** Predicting the approximate number of Olympic medals won by athletes of a given country based on information about the country such as population, per capita income, type of government, and geographic location.

**Solution:**

- **(2 points) Task:** This is a regression task.
- **(4 points) Justify your answer:** The objective is similar to classification, but there is a large number of possible "classes", and the numerical distance between them plays a role in evaluating prediction quality.
- **(2 points) Instances:** The instances are countries.
- **(2 points) Attributes:** The attributes are population, per capita income, type of government, geographic location, etc.

**(10 Points)** Discovering subspecies within an animal species based on characteristics of individual animals like size, speed, longevity, and fur color.

**Solution:**

- **(2 points) Task:** This is clustering.
- **(4 points) Justify your answer:** The subspecies are not known in advance, thus we have an unsupervised learning problem. Instances are to be grouped into subsets within which similarity is higher than it is between different subgroups.
- **(2 points) Instances:** The instances are individual animals.
- **(2 points) Attributes:** The attributes are the characteristics listed (size, speed, ...)

**(10 Points)** Describing relationships among people's personality traits and/or their interests, for example "introverted people like action movies" or "outgoing people are analytical".

**Solution:**

- **(2 points) Task:** Association.
- **(4 points) Justify your answer:** There is no particular target attribute, and no grouping by similarity is sought. Instead, co-occurrences of attribute values are the objective.
- **(2 points) Instances:** The instances are people.
- **(2 points) Attributes:** The attributes are the personality traits and interests described in the statement.

**(10 Points)** Predicting whether a person will buy T-shirts or not based on the person's answers to survey questions.

**Solution:**

- **(2 points) Task:** Classification.
- **(4 points) Justify your answer:** There is a clear class attribute (buys-T-shirts) that must be predicted. There is no numeric prediction nor notion of distance between class values that is used to penalize some class errors less than others (as would be the case in regression).
- **(2 points) Instances:** The instances are people.
- **(2 points) Attributes:** The attributes are (answers to) the survey questions.

Consider the following dataset:

    @relation credit-data
    @attribute credit_history {bad, unknown, good}
    @attribute debt {minor, major}
    @attribute collateral {none, adequate}
    @attribute risk {low, moderate, high}
    @data
    %
    % 14 instances
    %
    bad, minor, none, high
    bad, minor, none, moderate
    bad, minor, adequate, moderate
    bad, major, none, high
    unknown, minor, none, moderate
    unknown, minor, adequate, low
    unknown, minor, none, low
    unknown, major, none, high
    unknown, major, none, high
    good, minor, none, low
    good, major, none, high
    good, major, none, moderate
    good, major, none, low
    good, major, adequate, low

**(10 Points)** a) Compute the support and the confidence of the association rule **credit-history=bad & collateral=none => risk=high** relative to the dataset shown. You may leave your answers in the form of fractions if you wish.

**Solution:** The support of a rule is the percentage of instances in the dataset that contain all the items in the rule:

    support = P(credit-history=bad & collateral=none & risk=high) = 2/14 = 1/7

The confidence of a rule is the percentage of instances that contain all the items in the right-hand side of the rule among those instances that contain all the items in the left-hand side of the rule:

    confidence = P(risk=high | credit-history=bad & collateral=none) = 2/3

**(10 Points)** b) How many *different* 1-itemsets are there for this dataset?

**Solution:** As many as there are constraints of the form attribute=value, that is, as many as the total number of different attribute values:

- credit_history: 3 {bad, unknown, good}
- debt: 2 {minor, major}
- collateral: 2 {none, adequate}
- risk: 3 {low, moderate, high}

The total number is 10.

**(10 Points)** c) Find all frequent 1-itemsets for a minimum support threshold of 0.40 (i.e. 40%). Express each itemset in the form attribute=value. In each case, write the number of instances in which the 1-itemset appears next to the itemset. Note the approximate values 5/14 = .36 and 6/14 = .43.

**Solution:** The minimum support value of 0.40 requires that a frequent 1-itemset appear in 6 or more instances. Here are the counts and the winners:

| Attribute-value pair | Support | Frequent? |
| --- | --- | --- |
| credit_history=bad | 4/14 | no |
| credit_history=unknown | 5/14 | no |
| credit_history=good | 5/14 | no |
| debt=minor | 7/14 | yes |
| debt=major | 7/14 | yes |
| collateral=none | 11/14 | yes |
| collateral=adequate | 3/14 | no |
| risk=low | 5/14 | no |
| risk=moderate | 4/14 | no |
| risk=high | 5/14 | no |

**(10 extra points)** d) Based on your answer for the number of frequent 1-itemsets, how many *candidate* 2-itemsets would the Apriori algorithm consider when seeking frequent 2-itemsets? You may assume that only 2-itemsets corresponding to two different attributes are considered here.

**Solution:** The candidate 2-itemsets are pairs of frequent 1-itemsets. There are 2 such pairs that combine 1-itemsets from different attributes:

- {debt=minor, collateral=none}
- {debt=major, collateral=none}
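As a sanity check (not part of the original exam), the counts above can be reproduced with a few lines of Python; the helper `count` and the variable names are just for illustration:

```python
from itertools import combinations

# The 14 credit-data instances as (credit_history, debt, collateral, risk).
data = [
    ("bad", "minor", "none", "high"),
    ("bad", "minor", "none", "moderate"),
    ("bad", "minor", "adequate", "moderate"),
    ("bad", "major", "none", "high"),
    ("unknown", "minor", "none", "moderate"),
    ("unknown", "minor", "adequate", "low"),
    ("unknown", "minor", "none", "low"),
    ("unknown", "major", "none", "high"),
    ("unknown", "major", "none", "high"),
    ("good", "minor", "none", "low"),
    ("good", "major", "none", "high"),
    ("good", "major", "none", "moderate"),
    ("good", "major", "none", "low"),
    ("good", "major", "adequate", "low"),
]
attrs = ["credit_history", "debt", "collateral", "risk"]

def count(itemset):
    """Number of instances containing every attribute=value pair in itemset."""
    return sum(all(row[attrs.index(a)] == v for a, v in itemset) for row in data)

n = len(data)
lhs = [("credit_history", "bad"), ("collateral", "none")]
rule = lhs + [("risk", "high")]
print("support    =", count(rule), "/", n)           # 2/14
print("confidence =", count(rule), "/", count(lhs))  # 2/3

# Frequent 1-itemsets at min support 0.40 (i.e. at least 6 of 14 instances).
one_itemsets = sorted({(a, row[i]) for row in data for i, a in enumerate(attrs)})
frequent = [(a, v) for a, v in one_itemsets if count([(a, v)]) >= 0.40 * n]
print("frequent 1-itemsets:", frequent)  # debt=minor, debt=major, collateral=none

# Candidate 2-itemsets: pairs of frequent 1-itemsets over different attributes.
candidates = [(p, q) for p, q in combinations(frequent, 2) if p[0] != q[0]]
print("candidate 2-itemsets:", candidates)  # the 2 pairs listed above
```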

    @relation how-much-you-weigh
    @attribute age {young,adult,elderly}
    @attribute weight numeric
    @data
    (1) young,   110
    (2) young,    90
    (3) adult,   150
    (4) adult,   170
    (5) elderly, 130
    (6) elderly, 160

The purpose of this problem is to predict the weight of a person based on his/her age.

**(15 Points) Nominal to Binary**

Translate the nominal attribute **age** into as many binary attributes as needed. Show your work.

**Solution:** In order to convert the nominal attribute age into binary ones, we calculate the average weight for each age value:

- young: average weight = (110 + 90)/2 = 100
- adult: average weight = (150 + 170)/2 = 160
- elderly: average weight = (130 + 160)/2 = 145

Now, we sort those nominal values in decreasing order of average weight: adult, elderly, young. We introduce two binary attributes, age=adult and age=elderly,adult:

    @relation modified-how-much-you-weigh
    @attribute age=adult numeric
    @attribute age=elderly,adult numeric
    @attribute weight numeric
    @data
    (1) 0, 0, 110
    (2) 0, 0,  90
    (3) 1, 1, 150
    (4) 1, 1, 170
    (5) 0, 1, 130
    (6) 0, 1, 160
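As an aside, this transform (the one used by model-tree learners in the M5 family) is easy to script. Below is a minimal Python sketch, with hypothetical variable names, that reproduces the table above: sort the nominal values by decreasing average target value, then add one indicator attribute per prefix of that ordering.

```python
from collections import defaultdict

# Instances of the how-much-you-weigh dataset: (age, weight).
data = [("young", 110), ("young", 90), ("adult", 150),
        ("adult", 170), ("elderly", 130), ("elderly", 160)]

# 1. Average weight per age value.
totals, counts = defaultdict(float), defaultdict(int)
for age, weight in data:
    totals[age] += weight
    counts[age] += 1
averages = {v: totals[v] / counts[v] for v in totals}
# -> {'young': 100.0, 'adult': 160.0, 'elderly': 145.0}

# 2. Sort values by decreasing average weight; k nominal values yield
#    k-1 binary attributes, the i-th asking "is the value among the top i?"
order = sorted(averages, key=averages.get, reverse=True)  # adult, elderly, young
prefixes = [set(order[:i + 1]) for i in range(len(order) - 1)]
# -> [{'adult'}, {'adult', 'elderly'}], i.e. age=adult and age=elderly,adult

# 3. Re-encode each instance.
for age, weight in data:
    print([int(age in p) for p in prefixes] + [weight])
# [0, 0, 110], [0, 0, 90], [1, 1, 150], [1, 1, 170], [0, 1, 130], [0, 1, 160]
```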

**(10 Points) Root of Model/Regression Tree:**

Assume that we want to construct a model (or a regression) tree to predict weight in terms of the binary attributes you constructed above. Identify the split points that you need to consider and select from them the one for the root of the model/regression tree. Show your work.

**Solution:** Note that there are two split points (one for each binary attribute) that we need to consider: age=adult = 0.5 and age=elderly,adult = 0.5. Let's denote the first of them as p1 (i.e. p1 = "age=adult = 0.5") and the second of them as p2 (i.e. p2 = "age=elderly,adult = 0.5"). We select as the condition for the root node of our tree the split point that maximizes the value of the following formula:

    SDR = sd(weight over all instances)
          - [ (k1/n) * sd(weight of instances with attribute value below the split point)
            + (k2/n) * sd(weight of instances with attribute value above the split point) ]

where

- **sd** stands for standard deviation,
- k1 is the number of instances with attribute value below the split point,
- k2 is the number of instances with attribute value above the split point,
- n is the total number of instances.

Using the standard deviations (sd) calculated by Ethan (thanks Ethan!!), we calculated the standard deviation reduction of each of these split points:

    p1: SDR = sd(110,90,150,170,130,160) - [(4/6)*sd(110,90,130,160) + (2/6)*sd(150,170)]
            = 30.82207 - [(4/6)*29.86079 + (2/6)*14.14214]
            = 6.200833

    p2: SDR = sd(110,90,150,170,130,160) - [(2/6)*sd(110,90) + (4/6)*sd(150,170,130,160)]
            = 30.82207 - [(2/6)*14.14214 + (4/6)*17.07825]
            = 14.72252

Hence, p2 is used for the root of the model/regression tree:

        ---------------------
        | age=elderly,adult |
        ---------------------
          /             \
    < 0.5/               \> 0.5
        /                 \
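These numbers are easy to verify; the sketch below recomputes both SDR values in Python. Note that the figures in the solution correspond to the *sample* standard deviation, which is what `statistics.stdev` computes:

```python
from statistics import stdev  # sample standard deviation, as used above

weights = [110, 90, 150, 170, 130, 160]
# Per-instance values of (age=adult, age=elderly,adult) from the binary encoding.
binary = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 1), (0, 1)]

def sdr(attr, split=0.5):
    """Standard deviation reduction for splitting on one binary attribute."""
    below = [w for b, w in zip(binary, weights) if b[attr] < split]
    above = [w for b, w in zip(binary, weights) if b[attr] > split]
    n = len(weights)
    return stdev(weights) - (len(below) / n * stdev(below)
                             + len(above) / n * stdev(above))

print("p1 (age=adult = 0.5):         SDR =", round(sdr(0), 5))  # ~6.20083
print("p2 (age=elderly,adult = 0.5): SDR =", round(sdr(1), 5))  # ~14.72252
```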

**(10 Points) Regression Tree:**

Continue the construction of the tree to obtain a **REGRESSION TREE** to predict weight. Show what the remaining internal nodes are and show the precise function used to make predictions in each of the leaves.

**Solution:** Let's construct the left-most and the right-most children of the root node.

LEFT-MOST CHILD (age=elderly,adult < 0.5): Note that instances (1) and (2) lie in this child. Both of them are such that age=adult < 0.5, so there is no need to add more "internal" nodes (i.e. tests) to this branch. So the left-most child of the root is a leaf node with predicting function equal to the average weight of instances (1) and (2):

    predicting function = (90 + 110)/2 = 100

RIGHT-MOST CHILD (age=elderly,adult > 0.5): Note that instances (3), (4), (5), and (6) lie in this child. The only remaining split point is p1, so we need to use it:

            ---------------------
            | age=elderly,adult |
            ---------------------
              /             \
        < 0.5/               \> 0.5
            /                 \
       ~~~~~~~           -------------
       | 100 |           | age=adult |
       ~~~~~~~           -------------
    instances 1, 2         /       \
                     < 0.5/         \> 0.5
                         /           \
                    ~~~~~~~~~     ~~~~~~~~~
                    |  145  |     |  160  |
                    ~~~~~~~~~     ~~~~~~~~~
                instances 5, 6   instances 3, 4

The predicting function of each of the last two leaves is the average weight of the instances that belong to the leaf.
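For readers who prefer code, here is a minimal recursive sketch of the greedy procedure used above. It is an illustration, not Weka's implementation; I assume a branch becomes a leaf when no remaining split leaves at least two instances on each side (so that a standard deviation is defined):

```python
from statistics import mean, stdev

# Rows: (age=adult, age=elderly,adult, weight), as encoded earlier.
ROWS = [(0, 0, 110), (0, 0, 90), (1, 1, 150),
        (1, 1, 170), (0, 1, 130), (0, 1, 160)]

def build(rows, attrs):
    """Grow a regression tree; each leaf predicts the mean weight of its rows."""
    targets = [r[2] for r in rows]
    splits = []
    for a in attrs:  # a is the column index of a binary attribute (0 or 1)
        lo = [r for r in rows if r[a] < 0.5]
        hi = [r for r in rows if r[a] > 0.5]
        if len(lo) >= 2 and len(hi) >= 2:  # assumed stopping rule
            reduction = stdev(targets) - (
                len(lo) / len(rows) * stdev([r[2] for r in lo])
                + len(hi) / len(rows) * stdev([r[2] for r in hi]))
            splits.append((reduction, a, lo, hi))
    if not splits:
        return mean(targets)  # leaf: average weight of its instances
    _, best, lo, hi = max(splits)  # split with the largest SDR
    rest = [a for a in attrs if a != best]
    name = ("age=adult", "age=elderly,adult")[best]
    return {name: (build(lo, rest), build(hi, rest))}

print(build(ROWS, [0, 1]))
# -> {'age=elderly,adult': (100, {'age=adult': (145, 160)})}
```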

**(10 extra credit Points) Model Tree:**

Now, modify the leaves of your regression tree above to obtain a **MODEL TREE** to predict weight. Show the precise function used to make predictions in each of the leaves.

**Solution:** The structure of the model tree is identical to that of the regression tree constructed above. The only difference is the predicting function in each of the leaves. They are linear regressions that predict the target attribute (weight) in terms of the attributes age=adult and age=elderly,adult.

            ---------------------
            | age=elderly,adult |
            ---------------------
              /             \
        < 0.5/               \> 0.5
            /                 \
       ~~~~~~~           -------------
       | L1  |           | age=adult |
       ~~~~~~~           -------------
    instances 1, 2         /       \
                     < 0.5/         \> 0.5
                         /           \
                    ~~~~~~~~~     ~~~~~~~~~
                    |  L2   |     |  L3   |
                    ~~~~~~~~~     ~~~~~~~~~
                instances 5, 6   instances 3, 4

where:

- L1 = w0 + w1*age=adult + w2*age=elderly,adult, with w0, w1, and w2 computed using linear regression over instances 1, 2.
- L2 = w3 + w4*age=adult + w5*age=elderly,adult, with w3, w4, and w5 computed using linear regression over instances 5, 6.
- L3 = w6 + w7*age=adult + w8*age=elderly,adult, with w6, w7, and w8 computed using linear regression over instances 3, 4.

For comparison purposes, I include below Weka's output for this problem.

    === Run information ===

    Scheme:       weka.classifiers.m5.M5Prime -O m -U -F 10.0 -V 5
    Relation:     how-much-you-weigh
    Instances:    6
    Attributes:   2
                  age
                  weight
    Test mode:    evaluate on training data

    === Classifier model (full training set) ===

    Unpruned training model tree:

    age=elderly,adult <= 0.5 : LM1 (2/40%)
    age=elderly,adult >  0.5 :
    |   age=adult <= 0.5 : LM2 (2/60%)
    |   age=adult >  0.5 : LM3 (2/40%)

    Models at the leaves:

      LM1:  weight = 100
      LM2:  weight = 145
      LM3:  weight = 160

    Pruned training model tree:

    LM1 (6/53.5%)

    Models at the leaves:

      Unsmoothed (simple):
        LM1:  weight = 100 + 52.5age=elderly,adult

    Number of Rules : 1

    Time taken to build model: 0 seconds

    === Evaluation on training set ===
    === Summary ===

    Correlation coefficient                  0.8796
    Mean absolute error                     11.6667
    Root mean squared error                 13.3853
    Relative absolute error                 46.6667 %
    Root relative squared error             47.5727 %
    Total Number of Instances                6
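As a side note on why Weka's unpruned leaf models are constants: within each leaf both binary attributes are constant, so a least-squares fit over the leaf's instances can do no better than the leaf's average weight. A small numpy sketch (illustrative, not Weka's code) makes this concrete:

```python
import numpy as np

# (age=adult, age=elderly,adult, weight) rows, grouped by leaf as in the tree.
leaves = {
    "L1 (instances 1, 2)": [(0, 0, 110), (0, 0, 90)],
    "L2 (instances 5, 6)": [(0, 1, 130), (0, 1, 160)],
    "L3 (instances 3, 4)": [(1, 1, 150), (1, 1, 170)],
}

for name, rows in leaves.items():
    # Design matrix with an intercept column: weight ~ w0 + w1*a1 + w2*a2.
    X = np.array([[1, r[0], r[1]] for r in rows], dtype=float)
    y = np.array([r[2] for r in rows], dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Both attributes are constant inside a leaf, so the fitted model's
    # prediction collapses to the leaf's average weight (100, 145, 160).
    print(name, "-> predicts", round(float(X[0] @ w), 1))
```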