See Piotr Mardziel's Homework Solutions.
Consider the dataset below. This dataset is an adaptation of the World Happiness Dataset.
@relation world_happiness

% - Life Expectancy from UN Human Development Report (2003)
% - GDP per capita from figure published by the CIA (2006), figure in US$.
% - Access to secondary education rating from UNESCO (2002)
% - SWL (satisfaction with life) index calculated from data published
%   by New Economics Foundation (2006).

@attribute country string
@attribute continent {Americas,Africa,Asia,Europe}
@attribute life-expectancy numeric
@attribute GDP-per-capita numeric
@attribute access-to-education-score numeric
@attribute SWL-index numeric

@data
Switzerland, Europe, 80.5, 32.3, 99.9, 273.33
Canada, Americas, 80, 34, 102.6, 253.33
Usa, Americas, 77.4, 41.8, 94.6, 246.67
Germany, Europe, 78.7, 30.4, 99, 240
Mexico, Americas, 75.1, 10, 73.4, 230
France, Europe, 79.5, 29.9, 108.7, 220
Thailand, Asia, 70, 8.3, 79, 216.67
Brazil, Americas, 70.5, 8.4, 103.2, 210
Japan, Asia, 82, 31.5, 102.1, 206.67
India, Asia, 63.3, 3.3, 49.9, 180
Ethiopia, Africa, 47.6, 0.9, 5.2, 156.67
Russia, Asia, 65.3, 11.1, 81.9, 143.3

For this homework, we want to predict the SWL-index attribute (prediction target) from the other predicting attributes: continent, life-expectancy, GDP-per-capita, and access-to-education-score. Note that the attribute country identifies each data instance uniquely and as such will be disregarded in our analysis. It is provided just for context.
SDR = sd(CLASS over all instances) - ((k1/n)*sd(CLASS of instances with attribute value below split point) + (k2/n)*sd(CLASS of instances with attribute value above split point))

where:
sd stands for standard deviation.
k1 is the number of instances with attribute value below split point.
k2 is the number of instances with attribute value above split point.
n is the number of instances.

To reduce the number of calculations that you need to perform, your HW solutions can be limited to the following split points:
binary attributes   life-expectancy   GDP-per-capita    access-to-education-score
0.5                 (63.3+47.6)/2     (3.3+0.9)/2       (49.9+5.2)/2
                    (70.0+65.3)/2     (8.3+3.3)/2       (73.4+49.9)/2
                    (75.1+70.5)/2     (29.9+11.1)/2     (94.6+81.9)/2
                    (78.7+77.4)/2     (32.3+31.5)/2     (102.1+99.9)/2
                    (82.0+80.5)/2     (41.8+34.0)/2     (108.7+103.2)/2

If during the construction of the tree you encounter an attribute such that none of the split points listed above apply to the instances in the node, then use instead all the attribute's split points that apply to that collection of instances.
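The SDR formula and the restricted split points above can be checked with a short script. What follows is a minimal sketch rather than the required hand calculation. It assumes that sd is the population standard deviation (the formula above does not say whether the population or the sample version is intended) and that "below" means strictly less than the split point. The names ROWS, sd, sdr and LIFE_EXPECTANCY_SPLITS are illustrative only.

```python
from statistics import pstdev

# (life-expectancy, SWL-index) pairs copied from the @data section above.
ROWS = [
    (80.5, 273.33), (80.0, 253.33), (77.4, 246.67), (78.7, 240.0),
    (75.1, 230.0), (79.5, 220.0), (70.0, 216.67), (70.5, 210.0),
    (82.0, 206.67), (63.3, 180.0), (47.6, 156.67), (65.3, 143.3),
]

# Split points listed for life-expectancy in the table above.
LIFE_EXPECTANCY_SPLITS = [
    (63.3 + 47.6) / 2, (70.0 + 65.3) / 2, (75.1 + 70.5) / 2,
    (78.7 + 77.4) / 2, (82.0 + 80.5) / 2,
]


def sd(values):
    """Population standard deviation (an assumption); 0.0 for fewer than two values."""
    return pstdev(values) if len(values) > 1 else 0.0


def sdr(rows, split_point):
    """Standard deviation reduction of a binary split on a numeric attribute.

    rows is a list of (attribute value, target value) pairs.
    """
    targets = [target for _, target in rows]
    below = [target for value, target in rows if value < split_point]
    above = [target for value, target in rows if value >= split_point]
    n = len(rows)
    weighted = (len(below) / n) * sd(below) + (len(above) / n) * sd(above)
    return sd(targets) - weighted


if __name__ == "__main__":
    for point in LIFE_EXPECTANCY_SPLITS:
        print(f"life-expectancy split at {point:.2f}: SDR = {sdr(ROWS, point):.3f}")
```

The same function can be reused for GDP-per-capita and access-to-education-score by swapping the first element of each pair. If the sample standard deviation is wanted instead, replace pstdev with statistics.stdev.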
                                                      LINEAR       MODEL TREE    REGRESSION TREE
                                                      REGRESSION   PREDICTION    PREDICTION
Costa_Rica, Americas, 78.2, 11.1, 50.9, 250           __________   ___________   __________
United_Kingdom, Europe, 78.4, 30.3, 157.2, 236.67     __________   ___________   __________
South_Africa, Africa, 48.4, 12, 90.2, 190             __________   ___________   __________
Lithuania, Europe, 72.3, 13.7, 93.4, 156.67           __________   ___________   __________

                                                      LINEAR       MODEL TREE    REGRESSION TREE
                                                      REGRESSION   ERROR         ERROR
                                                      ERROR
root mean-square error (see p. 178)                   __________   ___________   __________
mean absolute error (see p. 178)                      __________   ___________   __________

SHOW IN DETAIL ALL THE STEPS OF THE PROCESS.
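The two error measures requested in the table can be sketched as below, using the usual definitions: the root mean-square error is the square root of the mean squared difference between predicted and actual values, and the mean absolute error is the mean of the absolute differences. The function names are illustrative, and the predicted values are made-up placeholders, not the output of any of the three models.

```python
from math import sqrt


def root_mean_squared_error(actual, predicted):
    """Square root of the mean of squared differences (predicted - actual)."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mean_absolute_error(actual, predicted):
    """Mean of the absolute differences |predicted - actual|."""
    absolute = [abs(p - a) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


if __name__ == "__main__":
    # Actual SWL-index values of the four test instances listed above.
    actual = [250.0, 236.67, 190.0, 156.67]
    # Hypothetical placeholder predictions; substitute each model's own output.
    predicted = [240.0, 230.0, 200.0, 170.0]
    print("RMSE:", root_mean_squared_error(actual, predicted))
    print("MAE: ", mean_absolute_error(actual, predicted))
```

Running the two functions once per model (linear regression, model tree, regression tree) fills one column of the error table each.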
Use the attribute SWL-index as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice. Since the SWL-ranking can be derived from SWL-index, remove SWL-ranking from consideration. Also, remove the attribute country as each of its values identifies an instance uniquely. Note that the access-to-education-score attribute contains missing values, marked with ".". Remember to replace them with "?" in your arff file.
Use the attribute age as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.
Use the attribute price as the prediction target. After you run experiments predicting this attribute you may, if you wish, run additional experiments using a different predicting target of your choice.
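For the dataset whose missing access-to-education-score values are marked with ".", the replacement with "?" can be scripted instead of done by hand. The sketch below assumes a simple comma-separated @data section with no quoted values that contain commas. The function names mark_missing and convert are hypothetical helpers, not part of Weka.

```python
def mark_missing(row):
    """Replace every field that consists solely of '.' with '?' in one data row."""
    fields = [field.strip() for field in row.split(",")]
    return ",".join("?" if field == "." else field for field in fields)


def convert(lines):
    """Rewrite the rows after @data; header lines and % comments are left as they are."""
    in_data = False
    converted = []
    for line in lines:
        stripped = line.strip()
        if in_data and stripped and not stripped.startswith("%"):
            converted.append(mark_missing(stripped))
        else:
            converted.append(line.rstrip("\n"))
            if stripped.lower() == "@data":
                in_data = True
    return converted


if __name__ == "__main__":
    # Made-up example rows, only to show the effect of the conversion.
    sample = [
        "@relation example",
        "@attribute life-expectancy numeric",
        "@attribute access-to-education-score numeric",
        "@data",
        "80.5, 99.9",
        "47.6, .",
    ]
    print("\n".join(convert(sample)))
```

Only fields that are exactly "." are replaced, so decimal values such as 80.5 are left untouched.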
If you pursue this last question, add continent information to your dataset. The first group/individual to submit a correct dataset including continent information to the course mailing list will receive a 25-point bonus. To make it standard, let's assume there are 6 continents: Antarctica, Americas, Europe, Asia, Africa, Australia.

Paul Sader was the first to submit a correct dataset with continent information added to it. Thanks, Paul!
Your project report should contain discussions of all the parts of the work you do for this project. In particular, it should elaborate on the following topics:
Investigate also the smoothing method used in M5PRIME (i.e. when the option useUNSMOOTHED=FALSE is chosen from the M5PRIME GUI). Describe in detail how the smoothing is done and run experiments to see how it modifies the resulting tree and/or how it affects the error values reported for the tree.
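As a starting point for that investigation, the smoothing step usually described for M5-style model trees combines the prediction coming up from a node with the prediction of the linear model at its parent: p' = (n*p + k*q) / (n + k), where p is the prediction passed up from below, q is the value predicted by the model at the current node, n is the number of training instances that reach the node below, and k is a smoothing constant. The sketch below illustrates that formula only. It is not taken from the M5PRIME source, the default k = 15 is an assumption, and the exact behaviour should be verified against the implementation you run.

```python
def smooth(p, q, n, k=15.0):
    """One smoothing step: blend the prediction from below (p) with this node's model (q).

    n is the number of training instances reaching the node below;
    k is the smoothing constant (15.0 is an assumed default).
    """
    return (n * p + k * q) / (n + k)


def smoothed_prediction(leaf_prediction, path_to_root, k=15.0):
    """Apply smooth() at every ancestor, from the leaf's parent up to the root.

    path_to_root is a list of (n_below, q) pairs, one per ancestor, where
    n_below is the instance count of the child on the path and q is the
    ancestor model's prediction for the instance being classified.
    """
    prediction = leaf_prediction
    for n_below, q in path_to_root:
        prediction = smooth(prediction, q, n_below, k)
    return prediction


if __name__ == "__main__":
    # Made-up numbers, only to show how the leaf value is pulled toward its ancestors.
    print(smoothed_prediction(200.0, [(4, 220.0), (12, 210.0)]))
```

With k = 0 the leaf prediction is returned unchanged, and larger values of k weight the ancestor models more heavily, which is one way to frame the smoothed versus unsmoothed experiments.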
Please submit the following files using the myWpi digital drop box:
(TOTAL: 75 points) FOR THE HOMEWORK (PART I) as stated in the Homework assignment above.
(TOTAL: 200 points) FOR THE PROJECT (PART II) as follows: