The purpose of this problem is to construct a tree to predict the attribute mpg (miles per gallon) using the other four attributes (carname, cylinders, weight, modelyear).
      carname     cylinders  weight  modelyear  mpg
(1)   chevrolet   8          3504    70         18
(2)   chevrolet   4          2950    82         27
(3)   chevrolet   4          2395    82         34
(4)   toyota      4          2372    70         24
(5)   toyota      4          2155    76         28
(6)   toyota      4          2665    82         32
(7)   volkswagen  4          1835    70         26
(8)   volkswagen  4          1937    76         29
(9)   volkswagen  4          2130    82         44
(10)  ford        8          4615    70         10
(11)  ford        4          2665    82         28
(12)  ford        8          4335    77         16
The partial tree below is the result of applying the model/regression tree construction algorithm. The tree contains 5 leaves, marked with LM1, LM2, LM3, LM4, LM5.
cylinders <= 6 :
  modelyear <= 79 :
    modelyear <= 73 : LM1    This leaf contains instances (4) and (7)
    modelyear > 73 : LM2     This leaf contains instances (5) and (8)
  modelyear > 79 :
    weight <= ? : LM3        This leaf contains instances ?
    weight > ? : LM4         This leaf contains instances ?
cylinders > 6 : LM5          This leaf contains instances (1), (10), and (12)
Solutions:
      carname     cylinders  weight  modelyear  mpg
(9)   volkswagen  4          2130    82         44
(3)   chevrolet   4          2395    82         34
(6)   toyota      4          2665    82         32
(11)  ford        4          2665    82         28
(2)   chevrolet   4          2950    82         27
Solutions:
SP1 = (2395 + 2130)/2 = 2262.5
SP2 = (2665 + 2395)/2 = 2530
SP3 = (2950 + 2665)/2 = 2807.5
std({44, 34, 32, 28, 27}) = 6.8
std({44}) = 0
std({34, 32, 28, 27}) = 3.3
std({44, 34}) = 7.1
std({32, 28, 27}) = 2.6
std({44, 34, 32, 28}) = 6.8
std({27}) = 0
SHOW YOUR WORK.
Solutions: We select the split point that maximizes the value of the following formula:
SDR = sd(mpg over all instances) - [(k1/n)*sd(mpg of instances with attribute value below split point) + (k2/n)*sd(mpg of instances with attribute value above split point)]
where sd stands for standard deviation, k1 is the number of instances with attribute value below the split point, k2 is the number of instances with attribute value above the split point, and n is the number of instances.
SDR of split point SP1 = 2262.5
SDR = std({44, 34, 32, 28, 27}) - [(1/5)*std({44}) + (4/5)*std({34, 32, 28, 27})] = 6.8 - [(1/5)*0 + (4/5)*3.3] = 4.16
SDR of split point SP2 = 2530
SDR = std({44, 34, 32, 28, 27}) - [(2/5)*std({44, 34}) + (3/5)*std({32, 28, 27})] = 6.8 - [(2/5)*7.1 + (3/5)*2.6] = 2.4
SDR of split point SP3 = 2807.5
SDR = std({44, 34, 32, 28, 27}) - [(4/5)*std({44, 34, 32, 28}) + (1/5)*std({27})] = 6.8 - [(4/5)*6.8 + (1/5)*0] = 1.36
Solutions: The best split point is SP1 (weight = 2262.5), since it is the split point with the highest standard deviation reduction (SDR).
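The split selection can be checked with a short script. This is a minimal sketch, not part of the original solution: the function names `sd` and `sdr` are made up for illustration, the standard deviation is assumed to be the sample standard deviation (which reproduces the given values such as 6.8 and 3.3), and a single-element set is assigned a deviation of 0 as in the given table. Because the script does not round the intermediate standard deviations, its SDR values come out slightly lower than the hand calculation (about 4.14 rather than 4.16 for SP1), but the ranking of the split points is the same.

```python
from statistics import stdev

# (weight, mpg) for the five instances that reach the node modelyear > 79.
rows = [(2130, 44), (2395, 34), (2665, 32), (2665, 28), (2950, 27)]

def sd(values):
    # Sample standard deviation; a single value is given sd 0,
    # matching std({44}) = 0 in the table of given values.
    return stdev(values) if len(values) > 1 else 0.0

def sdr(rows, split):
    # Standard deviation reduction for a binary split on weight.
    all_mpg = [m for _, m in rows]
    below = [m for w, m in rows if w <= split]
    above = [m for w, m in rows if w > split]
    n = len(rows)
    weighted = len(below) / n * sd(below) + len(above) / n * sd(above)
    return sd(all_mpg) - weighted

# Candidate split points: midpoints between consecutive distinct weights.
weights = sorted({w for w, _ in rows})
splits = [(a + b) / 2 for a, b in zip(weights, weights[1:])]

scores = {s: sdr(rows, s) for s in splits}
best_split = max(scores, key=scores.get)

for s in splits:
    print(s, round(scores[s], 2))
print("best split:", best_split)
```

Instances (6) and (11) share the weight 2665, so no split point can separate them; this is why five instances yield only three candidates.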
cylinders <= 6 :
  modelyear <= 79 :
    modelyear <= 73 : LM1    This leaf contains instances (4) and (7)
    modelyear > 73 : LM2     This leaf contains instances (5) and (8)
  modelyear > 79 :
    weight <= 2262.5 : LM3   This leaf contains instance (9)
    weight > 2262.5 : LM4    This leaf contains instances (2), (3), (6), and (11)
cylinders > 6 : LM5          This leaf contains instances (1), (10), and (12)
DESCRIBE how to calculate the leaf values (that is, the value that each of the leaf nodes will output as its prediction).
Solutions: The average of the target attribute for all the instances at a given leaf node is used as the predicted value for an instance classified by that leaf node.
CALCULATE the precise value that the leaf marked as LM5 in the tree above will output. Show your work.
Solutions: LM5 = (18 + 10 + 16)/3 = 14.67
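The averaging rule can be applied to every leaf of the completed tree. The sketch below is only an illustration: the names `leaf_mpg`, `leaf_value`, and `leaf_for` are hypothetical, the mpg values per leaf are read off the data table using the instance lists given for each leaf, and the weight threshold 2262.5 is the split point selected above.

```python
# mpg values of the training instances that fall in each leaf.
leaf_mpg = {
    "LM1": [24, 26],          # instances (4), (7)
    "LM2": [28, 29],          # instances (5), (8)
    "LM3": [44],              # instance (9)
    "LM4": [27, 34, 32, 28],  # instances (2), (3), (6), (11)
    "LM5": [18, 10, 16],      # instances (1), (10), (12)
}

# Regression-tree leaf value: the mean mpg of the instances in the leaf.
leaf_value = {leaf: sum(v) / len(v) for leaf, v in leaf_mpg.items()}

def leaf_for(cylinders, weight, modelyear):
    # Follow the tests of the tree from the root down to a leaf.
    if cylinders > 6:
        return "LM5"
    if modelyear <= 73:
        return "LM1"
    if modelyear <= 79:
        return "LM2"
    return "LM3" if weight <= 2262.5 else "LM4"

def predict(cylinders, weight, modelyear):
    return leaf_value[leaf_for(cylinders, weight, modelyear)]

print(round(leaf_value["LM5"], 2))
print(predict(cylinders=8, weight=4000, modelyear=75))
```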
DESCRIBE how to calculate the leaf values (that is, the value that each of the leaf nodes will output as its prediction).
Solutions: Each leaf node will output its predictions based on a linear equation. The linear equation at each leaf node is formed by a linear regression on the training instances found at that particular leaf node.
ILLUSTRATE what the function/formula that the leaf marked as LM5 in the tree above will use to produce its output is like. (You don't have to produce the precise function; just illustrate what the function will be like.) To simplify your answer, you can disregard the nominal attribute carname.
Solutions:
LM5 : mpg = w0 + w1*cylinders + w2*weight + w3*modelyear
where the weights w0, w1, w2, and w3 are found using Linear Regression over the data instances (1), (10), and (12).
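To give a feel for how such weights are obtained, here is a deliberately simplified sketch. It is an assumption made for illustration, not the solution's method: with only three instances in LM5 the four weights cannot be determined uniquely, and cylinders equals 8 for all three instances, so it cannot be told apart from the constant term. The sketch therefore fits mpg on weight alone by ordinary least squares; the names `w0`, `w1`, and `lm5` mirror the formula but cover only the intercept and the weight coefficient.

```python
# (weight, mpg) for instances (1), (10), and (12), the members of leaf LM5.
points = [(3504, 18), (4615, 10), (4335, 16)]

n = len(points)
mean_w = sum(w for w, _ in points) / n
mean_m = sum(m for _, m in points) / n

# Ordinary least squares for one predictor:
# slope = covariance(weight, mpg) / variance(weight)
sxy = sum((w - mean_w) * (m - mean_m) for w, m in points)
sxx = sum((w - mean_w) ** 2 for w, _ in points)
w1 = sxy / sxx
w0 = mean_m - w1 * mean_w

def lm5(weight):
    # Simplified leaf model: mpg = w0 + w1 * weight
    return w0 + w1 * weight

print(w0, w1)
print(lm5(4000))
```

The fitted slope is negative, so within this leaf heavier cars are predicted to have lower mpg, whereas a regression-tree leaf would output the same average for all of them.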
TRAINING DATA and TEST DATA (tables not present in this copy)


Solutions: I'll use Euclidean distance for this problem. The Euclidean distances of each training instance to the test instance are:
TRAINING INSTANCE    DISTANCE TO TEST INSTANCE
(1)                  sqrt(1 + 7^2)    <-- where "7^2" denotes "7 squared"
(2)                  sqrt(1 + 5^2)
(3)                  sqrt(1 + 2^2)
(4)                  sqrt(1 + 5^2)
(5)                  sqrt(1 + 1^2)
(6)                  sqrt(1 + 5^2)
(7)                  sqrt(0 + 7^2)
(8)                  sqrt(0 + 5^2)
Hence instance (5) is the nearest neighbor of the test instance.
Solutions: The same as the mpg value of the nearest neighbor of the test instance, namely 29.
Solutions: From the distance calculations above, the 3 nearest neighbors of the test instance are (5), (3), and (8).
Solutions: The average mpg value among the 3 nearest neighbors of the test instance, namely (29 + 28 + 28)/3 = 28.33.
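The nearest-neighbor answers above can be reproduced with a few lines of code. This is a sketch under stated assumptions: the per-attribute differences in `diffs` are copied from the distance table in the solution, and the `mpg` dictionary holds only the three values quoted in the solutions (29 for instance (5), and 28 each for instances (3) and (8)), so the hypothetical `predict` function works only for k up to 3.

```python
import math

# Differences between each training instance and the test instance,
# one value per attribute, as they appear in the distance table.
diffs = {
    1: (1, 7), 2: (1, 5), 3: (1, 2), 4: (1, 5),
    5: (1, 1), 6: (1, 5), 7: (0, 7), 8: (0, 5),
}

# mpg values quoted in the solutions; other instances are not known here.
mpg = {5: 29, 3: 28, 8: 28}

def nearest(k):
    # Euclidean distance: square root of the sum of squared differences.
    dist = {i: math.hypot(a, b) for i, (a, b) in diffs.items()}
    return sorted(dist, key=dist.get)[:k]

def predict(k):
    # k-NN regression: average the mpg of the k nearest neighbors (k <= 3).
    ids = nearest(k)
    return sum(mpg[i] for i in ids) / k

print(nearest(1), predict(1))
print(nearest(3), round(predict(3), 2))
```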
Solutions: I include here two alternate solutions, taken from students' exam solutions:
 Sample Solution 1: (by Berk Birand)
The results of clustering through leader clustering and k-means clustering are not necessarily identical. To see this, we can peek into how the algorithms work. In k-means, we will always get k clusters as a result. On the other hand, leader clustering might output more clusters if there are instances lying far away (as set by t) from the centroids.
We can also observe that if t is big enough (larger than the largest distance between two instances in the dataset), then the two methods will be equal, since leader clustering will never create new leaders (i.e., clusters).
 Sample Solution 2: (by Elijah ForbesSummers)
Not always, though it can produce the same clustering.
Values of t range from 0 to infinity.
As t approaches 0, there will be more and more smaller clusters, reaching its maximum when t = 0 and every instance is its own cluster.
As t approaches infinity, leader clustering becomes k-means clustering; that is, no pair of instances will be at a distance greater than t.
Solutions: I include here two alternate solutions, taken from students' exam solutions:
 Sample Solution 1: (by Elijah ForbesSummers)
Leader clustering, with an appropriate value for t. k-means will always try to put instances into the nearest cluster, which may cause clusters to shift "artificially" (that is, the outlier does not belong in them). By including a threshold t, there is a cutoff for reasonable values in a cluster, and outliers are thus explicitly defined and singled out.
 Sample Solution 2: (by Chris Gianfrancesco)
If a reasonable value for t can be found, leader clustering can deal better with outliers by clustering them away from the "normal" data. As a tradeoff though, one has less control over the number of clusters developed than with simple k-means.
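The effect of the threshold t on outliers can be seen in a small experiment. The sketch below assumes the common one-pass formulation of leader clustering (the first instance becomes a leader; each later instance joins its nearest leader if it lies within distance t, and otherwise becomes a new leader), which may differ in detail from the variant used in the course. The function name `leader_clusters` and the one-dimensional sample data are made up for illustration.

```python
def leader_clusters(points, t):
    # One-pass leader clustering on 1-D data with distance threshold t.
    leaders = []
    clusters = []
    for p in points:
        if leaders:
            # Index of the leader closest to p.
            j = min(range(len(leaders)), key=lambda i: abs(p - leaders[i]))
            if abs(p - leaders[j]) <= t:
                clusters[j].append(p)
                continue
        # No leader within distance t: p starts a cluster of its own.
        leaders.append(p)
        clusters.append([p])
    return clusters

# Two tight groups plus one outlier at 20.0 (hypothetical values).
data = [1.0, 1.2, 0.9, 5.0, 5.1, 20.0]

print(leader_clusters(data, t=1.0))
print(len(leader_clusters(data, t=0.0)))
```

With t = 1.0 the outlier ends up alone in its own cluster, and with t = 0 every instance becomes its own cluster, as described in the sample solutions.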
Solutions: (taken from Berk Birand's exam) Web usage mining involves looking at the web server's log files to get information about the uses of a web site. The data can then include date/time of the load, number of customers a day, etc.
Web content mining attempts to find patterns on the web pages themselves, that is, in the written HTML code. It can deal with searching for patterns in the text, summarizing a web site, etc.
Solutions: (taken from Kerri Edlund's exam) Data mining takes data that is understandable to machines and makes it understandable to humans, such as finding patterns that a human can evaluate. Text mining takes something that is understandable to humans and makes it understandable to machines, such as determining which keywords best describe a document so that a search on keywords can later be performed.