CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2006
Solutions Exam 2 - December 12, 2006

By Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute


Problem I. Numeric Predictions (45 points)

Consider the following dataset, a small subset adapted from the Auto Miles-per-gallon (MPG) dataset available at the University of California Irvine (UCI) Data Repository. Note that instances have been labeled with a number in parentheses so that you can refer to them in your solutions.
       car-name    cylinders  weight  model-year  mpg
 (1)   chevrolet       8       3504       70       18
 (2)   chevrolet       4       2950       82       27
 (3)   chevrolet       4       2395       82       34
 (4)   toyota          4       2372       70       24
 (5)   toyota          4       2155       76       28
 (6)   toyota          4       2665       82       32
 (7)   volkswagen      4       1835       70       26
 (8)   volkswagen      4       1937       76       29
 (9)   volkswagen      4       2130       82       44
(10)   ford            8       4615       70       10
(11)   ford            4       2665       82       28
(12)   ford            8       4335       77       16
The purpose of this problem is to construct a tree to predict the attribute mpg (miles-per-gallon) using the other four attributes (car-name, cylinders, weight, model-year).

The partial tree below is the result of applying the model/regression tree construction algorithm. The tree contains 5 leaves, marked with LM1, LM2, LM3, LM4, LM5.


cylinders <= 6 : 
|
|   model-year <= 79 : 
|   |
|   |   model-year <= 73 :   LM1 This leaf contains instances (4) and (7) 
|   |
|   |   model-year >  73 :   LM2 This leaf contains instances (5) and (8) 
|   |
|   model-year >  79 : 
|   |
|   |   weight <= ? :        LM3 This leaf contains instances ? 
|   |
|   |   weight >  ? :        LM4 This leaf contains instances ? 
|    
cylinders >  6 :             LM5 This leaf contains instances (1), (10), and (12)

  1. Constructing the remaining internal node: We need to determine the correct value of weight at which to split the node marked with a "?" in the tree above; that is, the split value x of weight that yields the highest SDR.

    1. (5 Points) Relevant data instances: List all the data instances that need to be considered when splitting the node marked with "?" above. Suggestion: list these instances sorted in increasing order of weight.
      Solutions:

             car-name    cylinders  weight  model-year  mpg
        (9)  volkswagen      4       2130       82       44
        (3)  chevrolet       4       2395       82       34
        (6)  toyota          4       2665       82       32
       (11)  ford            4       2665       82       28
        (2)  chevrolet       4       2950       82       27

    2. (5 Points) Candidate split points: List all the candidate split points for the attribute weight that need to be considered to find the correct value for "?" in the nodes "weight <= ?" and "weight > ?" in the tree above.
      Solutions:

      
      SP1 = (2395 + 2130)/2 = 2262.5
      
      SP2 = (2665 + 2395)/2 = 2530
      
      SP3 = (2950 + 2665)/2 = 2807.5
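
      As a quick illustrative sketch, these midpoints can be computed in
      Python, since the candidate splits are the midpoints between consecutive
      distinct weight values at this node:

        # distinct weights of instances (9), (3), (6)/(11), (2), sorted
        weights = sorted({2130, 2395, 2665, 2950})
        split_points = [(a + b) / 2 for a, b in zip(weights, weights[1:])]
        print(split_points)   # [2262.5, 2530.0, 2807.5]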
      
      

    3. (15 Points) Evaluating candidate split points: Compute the SDR (Standard Deviation Reduction) of each of the candidate split points that you listed above. For your convenience, the following standard deviations (std) are provided:
        std({44, 34, 32, 28, 27}) = 6.8
      
        std({44}) = 0 			std({34, 32, 28, 27}) =  3.3
      
        std({44, 34}) = 7.1 			std({32, 28, 27}) = 2.6 
      
        std({44, 34, 32, 28}) = 6.8		std({27}) = 0
      
      SHOW YOUR WORK.
      
      
      Solutions:

      We select the split point that maximizes the value of the following formula:

        SDR = sd(mpg over all instances)
              - [ (k1/n)*sd(mpg of instances with attribute value below the split point)
                + (k2/n)*sd(mpg of instances with attribute value above the split point) ]

      where sd stands for standard deviation, k1 is the number of instances with
      attribute value below the split point, k2 is the number of instances with
      attribute value above the split point, and n is the total number of instances.

      SDR of split point SP1 = 2262.5:

        SDR = std({44, 34, 32, 28, 27}) - [(1/5)*std({44}) + (4/5)*std({34, 32, 28, 27})]
            = 6.8 - [(1/5)*0 + (4/5)*3.3]
            = 4.16

      SDR of split point SP2 = 2530:

        SDR = std({44, 34, 32, 28, 27}) - [(2/5)*std({44, 34}) + (3/5)*std({32, 28, 27})]
            = 6.8 - [(2/5)*7.1 + (3/5)*2.6]
            = 2.4

      SDR of split point SP3 = 2807.5:

        SDR = std({44, 34, 32, 28, 27}) - [(4/5)*std({44, 34, 32, 28}) + (1/5)*std({27})]
            = 6.8 - [(4/5)*6.8 + (1/5)*0]
            = 1.36
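
      A small illustrative Python sketch of this computation (it uses the
      sample standard deviation, which matches the std values provided above,
      so its printed SDRs agree with the hand calculations up to rounding):

        import statistics

        # (weight, mpg) for instances (9), (3), (6), (11), (2), sorted by weight
        data = [(2130, 44), (2395, 34), (2665, 32), (2665, 28), (2950, 27)]

        def sdr(split):
            # SDR of splitting on weight <= split vs. weight > split
            sd = lambda xs: statistics.stdev(xs) if len(xs) > 1 else 0.0
            below = [mpg for w, mpg in data if w <= split]
            above = [mpg for w, mpg in data if w > split]
            n = len(data)
            return (sd([mpg for _, mpg in data])
                    - (len(below)/n)*sd(below) - (len(above)/n)*sd(above))

        for sp in (2262.5, 2530, 2807.5):
            print(sp, round(sdr(sp), 2))   # SP1 = 2262.5 yields the highest SDR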

    4. (3 Points) Choosing the best candidate split point: According to the SDRs that you computed above, select the best split point.
      Solutions:

      The best split point is SP1: weight = 2262.5, since it is the split point with the highest standard deviation reduction (SDR).

    5. (2 Points) Completing the tree: Replace the "?" marks in the following tree with the split point that you determined to be the correct one. ALSO, WRITE DOWN WHICH INSTANCES BELONG TO leaves LM3 and LM4.
      
      cylinders <= 6 : 
      |
      |   model-year <= 79 : 
      |   |
      |   |   model-year <= 73 :       LM1 This leaf contains instances (4) and (7) 
      |   |
      |   |   model-year >  73 :       LM2 This leaf contains instances (5) and (8) 
      |   |
      |   model-year >  79 : 
      |   |
      |   |   weight <= 2262.5 :       LM3 This leaf contains instance (9)
      |   |
      |   |   weight >  2262.5 :       LM4 This leaf contains instances (2), (3), (6), and (11)
      |
      cylinders >  6 :                 LM5 This leaf contains instances (1), (10), and (12)
      
      
      
      
      

  2. Constructing the leaves of the tree

    1. (7 Points) Regression Tree: Assume that we will use the tree above as a regression tree.

      DESCRIBE how to calculate the leaf values (that is, the value that each of the leaf nodes will output as its prediction).

      Solutions:

      The average of the target attribute (mpg) over all the training instances at a given leaf node is used as the predicted value for any instance classified into that leaf node.

      CALCULATE the precise value that the leaf marked as LM5 in the tree above will output. Show your work.
      Solutions:

      LM5 = (18 + 10 + 16)/3 = 44/3 ≈ 14.67
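
      A one-line Python check of this average:

        lm5_mpg = [18, 10, 16]                # mpg of instances (1), (10), (12)
        print(sum(lm5_mpg) / len(lm5_mpg))    # 14.666...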

    2. (8 Points) Model Tree: Assume that we will use the tree above as a model tree.

      DESCRIBE how to calculate the leaf values (that is, the value that each of the leaf nodes will output as its prediction).

      Solutions:

      Each leaf node will output its predictions based on a linear equation. The linear equation at each leaf node is obtained by linear regression over the training instances found at that particular leaf node.

      ILLUSTRATE what the function/formula that the leaf marked as LM5 in the tree above will use to produce its output looks like. (You don't have to produce the precise function; just illustrate what the function will be like.) To simplify your answer, you may disregard the nominal attribute car-name.
      Solutions:

        LM5 : mpg = w0 + w1*cylinders + w2*weight + w3*model-year
      
      where the weights w0, w1, w2, and w3 are found using Linear Regression over the data instances (1), (10), and (12).
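
      As an illustrative sketch, such a fit could be computed with numpy's
      least squares. Note that with only 3 instances and 4 weights the system
      is underdetermined (and cylinders is constant at 8 here), so the
      resulting weights are just one possible solution:

        import numpy as np

        # predictors: cylinders, weight, model-year of instances (1), (10), (12)
        X = np.array([[8, 3504, 70],
                      [8, 4615, 70],
                      [8, 4335, 77]], dtype=float)
        y = np.array([18, 10, 16], dtype=float)     # mpg targets

        A = np.hstack([np.ones((3, 1)), X])         # prepend intercept column for w0
        w, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares weights w0..w3
        print("mpg = %.3f + %.3f*cylinders + %.3f*weight + %.3f*model-year" % tuple(w))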

Problem II. Instance-based Learning (25 points)

Consider the following dataset, a small subset adapted from the Auto Miles-per-gallon (MPG) dataset available at the UCI Data Repository. Note that instances have been labeled with a number in parentheses so that you can refer to them in your solutions.

TRAINING DATA

          car-name    model-year  mpg
    (1)   chevrolet       70       18
    (2)   chevrolet       82       27
    (3)   toyota          75       28
    (4)   toyota          82       32
    (5)   volkswagen      76       29
    (6)   volkswagen      82       44
    (7)   ford            70       10
    (8)   ford            82       28

TEST DATA

          car-name    model-year  mpg
 (test)   ford            77        ?

  1. (10 points) Use the 1-nearest-neighbor algorithm to predict the mpg value of the test instance. That is, answer the following questions:

    1. (5 points) Which training instance is the 1-nearest neighbor of the test instance?
      Solutions:

      I'll use Euclidean distance for this problem, with the nominal attribute car-name contributing distance 0 if the car names match and 1 otherwise. The Euclidean distances of each training instance to the test instance are:

      TRAINING INSTANCE   DISTANCE TO TEST INSTANCE
           (1)               sqrt(1 + 7^2)   <-- where "7^2" denotes "7 squared"
           (2)               sqrt(1 + 5^2)
           (3)               sqrt(1 + 2^2)
           (4)               sqrt(1 + 5^2)
           (5)               sqrt(1 + 1^2)
           (6)               sqrt(1 + 5^2)
           (7)               sqrt(0 + 7^2)
           (8)               sqrt(0 + 5^2)
      
      Hence instance (5) is the nearest neighbor of the test instance.

    2. (5 points) What is the mpg value of the test instance predicted by the 1-nearest neighbor algorithm?
      Solutions:

      The same as the mpg value of the nearest neighbor of the test instance, namely 29.
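
      As an illustrative Python sketch of the 1-NN computation, using the 0/1
      distance convention for the nominal attribute car-name noted above:

        from math import sqrt

        train = [  # (label, car-name, model-year, mpg)
            (1, "chevrolet",  70, 18), (2, "chevrolet",  82, 27),
            (3, "toyota",     75, 28), (4, "toyota",     82, 32),
            (5, "volkswagen", 76, 29), (6, "volkswagen", 82, 44),
            (7, "ford",       70, 10), (8, "ford",       82, 28),
        ]
        test_name, test_year = "ford", 77

        def dist_to_test(name, year):
            name_d = 0 if name == test_name else 1   # nominal attribute: 0 or 1
            return sqrt(name_d**2 + (year - test_year)**2)

        nearest = min(train, key=lambda inst: dist_to_test(inst[1], inst[2]))
        print(nearest[0], nearest[3])                # instance (5), mpg 29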

  2. (15 points) Use the 3-nearest-neighbor algorithm (without scaling the attributes or weighting instances by their distance to the test instance) to predict the mpg value of the test instance. That is, answer the following questions:

    1. (10 points) Which 3 training instances are the 3-nearest neighbors of the test instance?
      Solutions:

      From the distance calculations above, the 3 nearest neighbors of the test instance are (5), (3), and (8).

    2. (5 points) What is the mpg value of the test instance predicted by the 3-nearest neighbor algorithm? EXPLAIN YOUR ANSWER.
      Solutions:

      The average mpg value among the 3 nearest neighbors of the test instance, namely (29 + 28 + 28)/3 ≈ 28.33.
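
      Extending the 1-NN sketch above (reusing its train and dist_to_test):

        neighbors = sorted(train, key=lambda i: dist_to_test(i[1], i[2]))[:3]
        print([i[0] for i in neighbors])             # [5, 3, 8]
        print(sum(i[3] for i in neighbors) / 3)      # 28.33...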


Problem III. Clustering (20 points)

Consider the following clustering method, called Leader Clustering. It receives two parameters: an integer value k and a real value t. It works like k-means clustering: it starts by selecting k instances (which will be called leaders) and then assigns each training instance to the cluster of its closest leader, except that if the distance of a training instance to its closest leader is greater than the input threshold t, then that training instance becomes a new leader. Once all the training instances have been assigned to a leader's cluster, the centroids of the resulting clusters are calculated and the process is repeated with these centroids as the new leaders, until a stable clustering is found.
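
As an illustrative sketch, one assignment pass of Leader Clustering might look as follows in Python, assuming numeric instances and Euclidean distance (neither is specified above); the centroid recomputation and the iteration until stability would wrap around this function:

    from math import dist   # Euclidean distance between coordinate sequences

    def assign_to_leaders(instances, leaders, t):
        leaders = list(leaders)                  # may grow during the pass
        clusters = [[] for _ in leaders]
        for x in instances:
            d, i = min((dist(x, L), i) for i, L in enumerate(leaders))
            if d > t:                            # farther than t from every leader:
                leaders.append(x)                # x becomes a new leader ...
                clusters.append([x])             # ... with its own one-instance cluster
            else:
                clusters[i].append(x)
        return leaders, clusters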

  1. (15 points) Given a dataset, a value k and a value t, is the clustering produced by Leader Clustering the same as the clustering produced by k-means Clustering? Assume that the initial k instances selected as leaders/centroids by both clustering methods are the same. Consider cases depending on the input parameter t. Explain your answer.
    Solutions:

    I include here two alternate solutions, taken from students' exam solutions:

    • Sample Solution 1: (by Berk Birand)

      The results of leader clustering and k-means clustering are not necessarily identical. To see this, we can peek at how the algorithms work. In k-means, we always get exactly k clusters as a result. On the other hand, leader clustering might output more clusters if there are instances lying far away (as set by t) from the centroids.

      We can also observe that if t is big enough (larger than the largest distance between any two instances in the dataset), then the two methods will produce the same clustering, since leader clustering will never create new leaders (i.e., clusters).

    • Sample Solution 2: (by Elijah Forbes-Summers)

      Not always, though it can produce the same clustering.

                  values of t
       |------------------------------------------|
       0                                        infinity
      as t approaches 0                  as t approaches infinity
      there are more and more            leader clustering becomes
      smaller clusters, reaching         k-means clustering - that
      the maximum when t = 0,            is, no pair of instances
      where every instance is its        will be at a distance
      own cluster.                       greater than t.
      

  2. (5 points) Which of the two methods will be better at dealing with outliers (i.e., data instances that are "far away" from, or very different from, the other instances in the dataset)? Explain your answer.
    Solutions:

    I include here two alternate solutions, taken from students' exam solutions:

    • Sample Solution 1: (by Elijah Forbes-Summers)

      Leader clustering, with an appropriate value for t. k-means will always try to put instances into the nearest cluster, which may cause clusters to shift "artificially" (that is, the outlier does not belong in them). By including a threshold t, there is a cutoff for reasonable values in a cluster, and outliers are thus explicitly defined and singled out.

    • Sample Solution 2: (by Chris Gianfrancesco)

      If a reasonable value for t can be found, leader clustering can deal better with outliers by clustering them away from the "normal" data. As a tradeoff, though, one has less control over the number of clusters produced than with plain k-means.


Problem IV. Applications (10 points + 5 extra points)

  1. (7 points) The two main subareas of web mining are web usage mining and web content mining. Briefly describe the difference between these two subareas.
    Solutions: (taken from Berk Birand's exam)

    Web usage mining involves looking at a web server's log files to get information about the usage of a web site. The data can include the date/time of each page load, the number of customers per day, etc.

    Web content mining attempts to find patterns in the web pages themselves, that is, in the written HTML code. It can involve searching for patterns in the text, summarizing a web site, etc.

  2. (8 points) Describe one of the differences between data mining and text mining.
    Solutions: (taken from Kerri Edlund's exam)

    Data mining takes data that is understandable to machines and makes it understandable to humans, such as by finding patterns that a human can evaluate. Text mining takes something that is understandable to humans and makes it understandable to machines, such as by determining which keywords best describe a document so that a keyword search can be performed later.