WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING  
HOMEWORK - Spring 2004

PROF. CAROLINA RUIZ 

DUE DATE: Tuesday March 9 at 3:00 pm. Late submissions will NOT be accepted  
------------------------------------------

Instructions


Problem 1. Knowledge Discovery in Databases (10 points)

  1. (3 points) Define knowledge discovery in databases.

  2. (4 points) Briefly describe the steps of the knowledge discovery in databases process.

  3. (3 points) Define data mining.

Problem 2. Data Integration, Data Warehousing and OLAP (30 points)

  1. (7 points) Chapter 2: Exercise 2.3.

  2. (7 points) Chapter 2: Exercise 2.4.

  3. (7 points) Chapter 2: Exercise 2.10.

  4. (9 points) Describe the main differences between the mediation approach and the data warehousing approach for data integration.


Problem 3. Data Preprocessing (30 points)

  1. Consider the following dataset.
       DATE       OUTLOOK         TEMPERATURE   HUMIDITY    WIND    PLAYS? 
    
       02/13/00   mostly sunny    47            25          strong  no 
       03/10/00   mostly cloudy   66            57          weak    yes
       06/28/00   cloudy          91            75          medium  yes
       07/12/00   sunny           82            27          strong  no
       08/30/00   rainy           76            80          weak    no
       09/23/00   drizzle         66            70          weak    yes
       11/24/00   sunny           52            60          medium  no
       12/19/00   mostly sunny    41            30          strong  no
       01/12/01   cloudy          36            40          ??      no
       04/13/01   mostly cloudy   57            40          weak    yes
       05/20/01   mostly sunny    68            50          medium  yes
       06/28/01   drizzle         73            20          weak    yes
       07/06/01   sunny           95            85          weak    no
       08/20/01   rainy           91            60          weak    yes
       09/01/01   mostly sunny    80            10          medium  no
       10/23/01   mostly cloudy   52            44          weak    no 
    

    1. (3 points) Assuming that the missing value (marked with "??") for Wind cannot be ignored, discuss 3 different alternatives to fill in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the attribute PLAYS? is the target attribute.

    2. (3 points) Define a concept hierarchy over the attribute OUTLOOK so that the number of different values for that attribute can be reduced to just 3.

    3. (3 points) Discretize the attribute TEMPERATURE by binning it into 4 equi-width intervals. Explain your answer.

    4. (3 points) Discretize the attribute HUMIDITY by binning it into 4 equi-depth intervals. Explain your answer.

    5. (3 points) Would you include the attribute DATE into your task-relevant data when mining for patterns that predict the values for the PLAYS? attribute? Explain your answer.

    6. (3 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
       [mean - (k+1)*sd, mean - k*sd)   
       for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
      
      Assume that the mean of the attribute HUMIDITY above is 48 and that the standard deviation sd of this attribute is 22.5. Discretize HUMIDITY using this new approach. Show your work.

  2. (5 points) Chapter 3: Exercise 3.3.

  3. (5 points) Chapter 3: Exercise 3.5.

  4. (5 points) Chapter 3: Exercise 3.7.

Problem 4. Relevance Analysis (30 points)

  1. The following table contains training examples that help a robot janitor predict whether or not an office contains a recycling bin.

    STATUS DEPT. OFFICE SIZE RECYCLING BIN?
    1. faculty ee large no
    2. staff ee small no
    3. faculty cs medium yes
    4. student ee large yes
    5. staff cs medium no
    6. faculty cs large yes
    7. student ee small yes
    8. staff cs medium no
    1. (10 points) Compute the entropy of each of the attributes STATUS, DEPT., and OFFICE SIZE with respect to the attribute RECYCLING BIN? Show your work. Rank the attributes STATUS, DEPT., and OFFICE SIZE according to their relevance in predicting the target attribute RECYCLING BIN?. List first the most relevant one, and last the least relevant one. Explain your answer.

  2. (10 points + 2 extra points ) Consider the Singular Value Decomposition (SVD) method as presented in the paper
    M.W. Berry, Z. Drmac, and E.R. Jessup. "Matrices, Vector Spaces, and Information Retrieval" SIAM Reviews. Vol. 41, No. 2, pp. 335-362.
    which was distributed in class. Given the SVD of the specific matrix A into 3 matrices U, Sigma, and V shown on page 349 of the paper, answer the following questions. Explain your answers.
    1. (2 points) What is the rank of A?
    2. (2 points) What collection of vectors naturally derived from the SVD of A forms a basis for the column space of A?
    3. (2 points) What collection of vectors naturally derived from the SVD of A forms a basis for the row space of A?
    4. (2 points) How can one obtain the best possible rank-3 approximation of A?
    5. (4 points) Suppose that A represents the database that you want to mine for patterns. Explain how you would use SVD to reduce the number of features in A before mining.

  3. (10 points + 2 extra points) Read the paper
    R. Agrawal, C. Faloutsos and A. Swami. "Efficient Similarity Search in Sequence Databases Foundations of Data Organization and Algorithms". (FODO) Conference, Oct. 1993, Evanston, Illinois, Oct. 13-15, 1993.
    that was distributed in class and that is available online at http://www-2.cs.cmu.edu/~christos/cpub.html (see item 22 under Refereed Conferences). Answer the following questions about this paper:
    1. (3 points) Describe in terms of the inputs and outputs the problem that the paper is solving.
    2. (4 points) Describe the steps followed by the solution proposed by the authors to solve the problem (i.e. to go from the inputs to the desired outputs). In particular, describe how Fourier Transform is used for feature reduction.
    3. (5 points) List and explain at least 3 properties of the Fourier Transform that make the Fourier Transform desirable and appropriate for feature reduction.