CS 525D Spring 2004

Computer Science Department

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING
HOMEWORK - Spring 2004

PROF. CAROLINA RUIZ

DUE DATE: Tuesday March 9 at 3:00 pm. Late submissions will NOT be accepted.

Instructions

You must work on this homework individually.
Hand in hardcopies with written solutions to all the 5 problems below.
The homework solutions are due at 3:00 pm on Tuesday March 9. LATE HOMEWORK SUBMISSIONS WILL NOT BE ACCEPTED.
Your homework solutions must be your own. Help or assistance from classmates, other people, or online resources is NOT allowed. Any type of cheating will be penalized with an F grade for the course and will be reported to the WPI Judicial Board in accordance with the Academic Honesty Policy.
Show your work.
Justify your answers.

Problem 1. Knowledge Discovery in Databases (10 points)

(3 points) Define knowledge discovery in databases.
(4 points) Briefly describe the steps of the knowledge discovery in databases process.
(3 points) Define data mining.

Problem 2. Data Integration, Data Warehousing and OLAP (30 points)

(7 points) Chapter 2: Exercise 2.3.
(7 points) Chapter 2: Exercise 2.4.
(7 points) Chapter 2: Exercise 2.10.
(9 points) Describe the main differences between the mediation approach and the data warehousing approach for data integration.

Problem 3. Data Preprocessing (30 points)

Consider the following dataset.
```
   DATE       OUTLOOK         TEMPERATURE   HUMIDITY    WIND    PLAYS? 

   02/13/00   mostly sunny    47            25          strong  no 
   03/10/00   mostly cloudy   66            57          weak    yes
   06/28/00   cloudy          91            75          medium  yes
   07/12/00   sunny           82            27          strong  no
   08/30/00   rainy           76            80          weak    no
   09/23/00   drizzle         66            70          weak    yes
   11/24/00   sunny           52            60          medium  no
   12/19/00   mostly sunny    41            30          strong  no
   01/12/01   cloudy          36            40          ??      no
   04/13/01   mostly cloudy   57            40          weak    yes
   05/20/01   mostly sunny    68            50          medium  yes
   06/28/01   drizzle         73            20          weak    yes
   07/06/01   sunny           95            85          weak    no
   08/20/01   rainy           91            60          weak    yes
   09/01/01   mostly sunny    80            10          medium  no
   10/23/01   mostly cloudy   52            44          weak    no 
```
1. (3 points) Assuming that the missing value (marked with "??") for Wind cannot be ignored, discuss 3 different alternatives to fill in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the attribute PLAYS? is the target attribute.
2. (3 points) Define a concept hierarchy over the attribute OUTLOOK so that the number of different values for that attribute can be reduced to just 3.
3. (3 points) Discretize the attribute TEMPERATURE by binning it into 4 equi-width intervals. Explain your answer.
4. (3 points) Discretize the attribute HUMIDITY by binning it into 4 equi-depth intervals. Explain your answer.
5. (3 points) Would you include the attribute DATE into your task-relevant data when mining for patterns that predict the values for the PLAYS? attribute? Explain your answer.
6. (3 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
```
 [mean - (k+1)*sd, mean - k*sd)   
 for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
```
  Assume that the mean of the attribute HUMIDITY above is 48 and that the standard deviation sd of this attribute is 22.5. Discretize HUMIDITY using this new approach. Show your work.
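The interval rule in item 6 can be turned into a direct computation: a value x falls in the interval with index k = ceil((mean - x) / sd) - 1. The sketch below is only an illustration of that rule. The function names `sd_bin_index` and `sd_bin_interval` and the mean, sd, and sample values are hypothetical and are not taken from the dataset above.

```python
import math

def sd_bin_index(x, mean, sd):
    """Return the integer k with mean - (k+1)*sd <= x < mean - k*sd."""
    return math.ceil((mean - x) / sd) - 1

def sd_bin_interval(k, mean, sd):
    """Return the half-open interval [low, high) for bin index k."""
    return (mean - (k + 1) * sd, mean - k * sd)

# Hypothetical statistics and values, chosen only for illustration.
mean, sd = 50.0, 10.0
for x in [57, 50, 40, 39]:
    k = sd_bin_index(x, mean, sd)
    low, high = sd_bin_interval(k, mean, sd)
    print(f"{x} -> k = {k}, interval [{low}, {high})")
```

Note that negative values of k index intervals above the mean and non-negative values index intervals below it. Values that lie extremely close to an interval boundary may be affected by floating-point rounding in the division.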
(5 points) Chapter 3: Exercise 3.3.
(5 points) Chapter 3: Exercise 3.5.
(5 points) Chapter 3: Exercise 3.7.
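As a reminder of the two binning schemes named in items 3 and 4 of this problem, here is a minimal sketch. It assumes one common convention: equi-width bins split the range [min, max] into intervals of equal length, with the maximum value placed in the last bin, and equi-depth bins split the sorted values into groups of (nearly) equal size. The function names and the sample values are hypothetical. Other conventions, for example for handling ties in equi-depth binning, are possible.

```python
def equi_width_index(x, lo, hi, n_bins):
    """Index (0-based) of the equal-width bin of x over [lo, hi]; assumes hi > lo."""
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return min(int((x - lo) / width), n_bins - 1)

def equi_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups whose sizes differ by at most 1."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

# Hypothetical values, chosen only for illustration.
values = [4, 8, 15, 16, 23, 42, 50, 60]
lo, hi = min(values), max(values)
print([equi_width_index(v, lo, hi, 4) for v in values])
print(equi_depth_bins(values, 4))
```

On skewed data the two schemes behave differently: equi-width bins can end up with very uneven counts, while equi-depth bins can have very uneven widths.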

Problem 4. Relevance Analysis (30 points)

The following table contains training examples that help a robot janitor predict whether or not an office contains a recycling bin.

	STATUS	DEPT.	OFFICE SIZE	RECYCLING BIN?
1.	faculty	ee	large	no
2.	staff	ee	small	no
3.	faculty	cs	medium	yes
4.	student	ee	large	yes
5.	staff	cs	medium	no
6.	faculty	cs	large	yes
7.	student	ee	small	yes
8.	staff	cs	medium	no

(10 points) Compute the entropy of each of the attributes STATUS, DEPT., and OFFICE SIZE with respect to the attribute RECYCLING BIN?. Show your work. Then rank the attributes STATUS, DEPT., and OFFICE SIZE according to their relevance in predicting the target attribute RECYCLING BIN?. List the most relevant attribute first and the least relevant one last. Explain your answer.
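The sketch below shows one way such an entropy computation can be organized. It assumes that "the entropy of an attribute with respect to the target" means the expected entropy of the target after partitioning the examples by the attribute's values, as used in information gain; the definition given in class takes precedence if it differs. The function names and the four-row example are hypothetical and are not drawn from the table above.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def conditional_entropy(attribute_values, labels):
    """Weighted average entropy of the labels within each attribute value."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())

# Hypothetical toy data: one attribute with values "a"/"b" and a yes/no target.
attribute = ["a", "a", "b", "b"]
target = ["yes", "yes", "no", "yes"]
print("entropy of target:", entropy(target))
print("entropy after splitting on attribute:", conditional_entropy(attribute, target))
```

Under this definition, a lower value after the split means the attribute leaves less uncertainty about the target, so it is the more relevant attribute.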

(10 points + 2 extra points) Consider the Singular Value Decomposition (SVD) method as presented in the paper

M.W. Berry, Z. Drmac, and E.R. Jessup. "Matrices, Vector Spaces, and Information Retrieval". SIAM Review. Vol. 41, No. 2, pp. 335-362, 1999.

which was distributed in class. Given the SVD of the specific matrix A into 3 matrices U, Sigma, and V shown on page 349 of the paper, answer the following questions. Explain your answers.

(2 points) What is the rank of A?
(2 points) What collection of vectors naturally derived from the SVD of A forms a basis for the column space of A?
(2 points) What collection of vectors naturally derived from the SVD of A forms a basis for the row space of A?
(2 points) How can one obtain the best possible rank-3 approximation of A?
(4 points) Suppose that A represents the database that you want to mine for patterns. Explain how you would use SVD to reduce the number of features in A before mining.
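For readers who want to experiment with these ideas numerically, the following sketch uses NumPy's `numpy.linalg.svd` on a small hypothetical matrix; it is not the matrix from page 349 of the paper. It assumes that rows are records and columns are features, and it shows a truncated (rank-k) reconstruction together with a projection of the rows onto the first k right singular vectors. The helper names `rank_k_approximation` and `reduce_features` are illustrative only.

```python
import numpy as np

# Hypothetical 4x3 data matrix: rows are records, columns are features.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.1],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approximation(U, s, Vt, k):
    """Keep only the k largest singular values and their singular vectors."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def reduce_features(A, Vt, k):
    """Project each row of A onto the first k right singular vectors."""
    return A @ Vt[:k, :].T

k = 2
A_k = rank_k_approximation(U, s, Vt, k)
reduced = reduce_features(A, Vt, k)

print("singular values:", s)
print("numerical rank:", np.linalg.matrix_rank(A))
print("Frobenius error of rank-k approximation:", np.linalg.norm(A - A_k))
print("shape of reduced data:", reduced.shape)
```

The Frobenius-norm error of the truncated reconstruction equals the square root of the sum of the squared discarded singular values, which gives a quick check on how much information a given k loses.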

(10 points + 2 extra points) Read the paper

R. Agrawal, C. Faloutsos and A. Swami. "Efficient Similarity Search in Sequence Databases". Foundations of Data Organization and Algorithms (FODO) Conference, Evanston, Illinois, Oct. 13-15, 1993.

that was distributed in class and that is available online at http://www-2.cs.cmu.edu/~christos/cpub.html (see item 22 under Refereed Conferences). Answer the following questions about this paper:

(3 points) Describe in terms of the inputs and outputs the problem that the paper is solving.
(4 points) Describe the steps followed by the solution proposed by the authors to solve the problem (i.e. to go from the inputs to the desired outputs). In particular, describe how the Fourier Transform is used for feature reduction.
(5 points) List and explain at least 3 properties of the Fourier Transform that make it desirable and appropriate for feature reduction.
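To experiment with the general idea of Fourier-based feature reduction, the sketch below computes a Discrete Fourier Transform with 1/sqrt(n) normalization and keeps only the first k coefficients of each sequence as its features. It is an illustration under stated assumptions, not the authors' implementation: the function names and the two sequences are hypothetical, and the indexing structure used in the paper is not modeled. With this normalization, the distance computed on the truncated features never exceeds the Euclidean distance between the original sequences.

```python
import cmath
import math

def dft(x):
    """Discrete Fourier Transform with 1/sqrt(n) normalization."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            / math.sqrt(n)
            for f in range(n)]

def fourier_features(x, k):
    """Use the first k Fourier coefficients as a reduced representation."""
    return dft(x)[:k]

def distance(a, b):
    """Euclidean distance; works for real or complex coordinates."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical sequences of equal length, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
y = [1.5, 2.5, 2.5, 4.5, 2.0, 2.5, 0.5, 0.5]

print("distance in the time domain:   ", distance(x, y))
print("distance using all coefficients:", distance(dft(x), dft(y)))
print("distance using 2 coefficients:  ", distance(fourier_features(x, 2),
                                                   fourier_features(y, 2)))
```

Comparing the three printed distances for different values of k shows how much of the distance between two sequences is carried by the low-frequency coefficients.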