WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING  
HOMEWORK - Spring 2008

PROF. CAROLINA RUIZ 

DUE DATE: Thursday February 7 at 3:30 pm.
------------------------------------------

Instructions


Problem I. Knowledge Discovery in Databases (25 points)

  1. (7 points) Define knowledge discovery in databases.

  2. (12 points) Briefly describe the steps of the knowledge discovery in databases process.

  3. (7 points) Define data mining.

Base your answers on the class handouts and the paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.

Problem II. Data Preprocessing (75 points)

Consider the following dataset.
   DATE       OUTLOOK         TEMPERATURE   HUMIDITY    WIND    PLAYS 

   02/13/06   mostly sunny    47            25          strong  no 
   03/10/06   mostly cloudy   66            57          weak    yes
   06/28/06   cloudy          91            75          medium  yes
   07/12/06   sunny           82            27          strong  no
   08/30/06   rainy           76            80          weak    no
   09/23/06   drizzle         66            70          weak    yes
   11/24/06   sunny           52            60          medium  no
   12/19/06   mostly sunny    41            30          strong  no
   01/12/07   cloudy          36            40          ?       no
   04/13/07   mostly cloudy   57            40          weak    yes
   05/20/07   mostly sunny    68            50          medium  yes
   06/28/07   drizzle         73            20          weak    yes
   07/06/07   sunny           95            85          weak    yes
   08/20/07   rainy           91            60          weak    yes
   09/01/07   mostly sunny    80            10          medium  no
   10/23/07   mostly cloudy   52            44          weak    no 

  1. (5 points) Assuming that the missing value (marked with "?") for WIND cannot be ignored, discuss 3 different alternatives to fill in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the attribute PLAYS is the target attribute.

  2. (5 points) Describe a reasonable transformation of the attribute OUTLOOK so that the number of different values for that attribute is reduced to just 3.

  3. (10 points) Discretize the attribute TEMPERATURE by binning it into 4 equi-width intervals using unsupervised discretization. Explain your answer.

  4. (10 points) Discretize the attribute HUMIDITY by binning it into 4 equi-depth intervals using unsupervised discretization. Explain your answer.

  5. (5 points) Would you keep the attribute DATE in your dataset when mining for patterns that predict the values of the PLAYS attribute? Explain your answer.

  6. (10 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
     [mean - (k+1)*sd, mean - k*sd)   
     for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
    
    Assume that the mean of the attribute HUMIDITY above is 48 and that the standard deviation sd of this attribute is 22.5. Discretize HUMIDITY using this new approach. Show your work.

  7. (30 points) Use the supervised discretization filter in Weka (with useKononenko=False) to discretize the TEMPERATURE attribute. Describe the resulting intervals. Looking at the Weka code and at the textbook, explain precisely how those intervals were obtained. Show your work.
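
The three unsupervised binning schemes named in items 3, 4, and 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a model solution: the function names are hypothetical, the equi-depth helper cuts the sorted values by position without keeping tied values together, and the interval convention for the mean/sd scheme is the half-open one given in item 6.

```python
import math

# TEMPERATURE and HUMIDITY columns copied from the dataset above, in row order.
TEMPERATURE = [47, 66, 91, 82, 76, 66, 52, 41, 36, 57, 68, 73, 95, 91, 80, 52]
HUMIDITY = [25, 57, 75, 27, 80, 70, 60, 30, 40, 40, 50, 20, 85, 60, 10, 44]


def equi_width_edges(values, n_bins):
    """Return n_bins + 1 edges that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def equi_depth_bins(values, n_bins):
    """Sort the values and cut them into n_bins groups of (nearly) equal size.

    Assumption: cuts are purely positional, so tied values could in general
    land in different bins; other tie-handling policies are possible.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


def sd_interval(value, mean, sd):
    """Return (k, low, high) such that low <= value < high, where
    low = mean - (k+1)*sd and high = mean - k*sd for an integer k."""
    k = math.ceil((mean - value) / sd) - 1
    return k, mean - (k + 1) * sd, mean - k * sd
```

For example, `equi_width_edges(TEMPERATURE, 4)` returns the five interval boundaries, and `sd_interval(v, 48, 22.5)` can be applied to each HUMIDITY value `v` to find the interval it falls in.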

Problem III. Feature Selection (60 points)

Consider the weather.arff dataset that comes with the Weka system. In this problem you will explain how Correlation-based Feature Selection (CFS) works on this dataset.
  1. (5 points) Run the CFS filter of Weka on this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your homework solutions.
  2. Looking at the code that implements this CFS filter, as well as its description in the textbook and in class, describe in detail the process followed by CFS:
    1. (5 points) What's the initial (sub)set of attributes under consideration? Is forward or backward search used?
    2. (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this you can add print instructions to the Weka code so that it tells you the order in which it considers the subsets and the goodness value of each of these subsets. Explain your answer.
    3. (25 points) Use the CFS formulas to calculate the goodness of the "best" (sub)set of attributes considered. Show your work.

      weather_data_attribute_latice.gif

      Figure 7.1 (p.293) taken from the textbook
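
As a starting point for the hand calculation in item 2.3, the sketch below shows the usual form of the CFS merit, k * r_cf / sqrt(k + k*(k-1) * r_ff), where k is the number of attributes in the subset, r_cf is the mean attribute-class correlation, and r_ff is the mean attribute-attribute correlation. It assumes that correlations between nominal attributes are measured with symmetrical uncertainty; the function names are hypothetical, and details such as how Weka discretizes the numeric attributes beforehand should be checked against the Weka source.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # convention chosen here when both attributes are constant
    hxy = entropy(list(zip(xs, ys)))
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def cfs_merit(class_corrs, pair_corrs):
    """Merit of a subset given its attribute-class correlations (one per
    attribute) and its attribute-attribute correlations (one per pair)."""
    k = len(class_corrs)
    r_cf = sum(class_corrs) / k
    r_ff = sum(pair_corrs) / len(pair_corrs) if pair_corrs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

With a single attribute the merit reduces to that attribute's correlation with the class, which is a convenient sanity check when comparing against the values printed by Weka.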


Problem IV. Dimensionality Reduction (60 points)

Consider the Iris dataset that comes with the Weka system (iris.arff). In this problem we'll investigate the effects of applying Principal Components Analysis (PCA) to this dataset. The Iris dataset has 4 numeric predictive attributes: sepallength, sepalwidth, petallength, and petalwidth; and a nominal CLASS with 3 possible values: Iris-setosa, Iris-versicolor, and Iris-virginica.
  1. Visualization of the original dataset: Use a software package (e.g., Excel, MATLAB, ...) that allows you to produce the following plots of this dataset: Include your plots in your written report (10 points) and describe any observations you can make from these plots (10 points).

  2. Dimensionality Reduction: Load the dataset onto Weka and apply PCA (with default parameters) to it. Include in your document the results you obtain together with an explanation of them (15 points).

  3. Visualization of the transformed dataset: Save the transformed dataset using Weka. Using the visualization tool you used above, construct a plot of CLASS as a function of the two most significant attributes produced by PCA. Include your plot in your written report (10 points) and describe any observations you can make from this plot, especially in comparison with the plots of the original dataset (15 points).
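
To make the idea behind PCA concrete, here is a minimal sketch restricted to two numeric attributes, where the eigenvalues of the 2x2 sample covariance matrix have a closed form. It is an illustration only: the function name is hypothetical, the Iris data has four attributes rather than two, and Weka's PCA filter may standardize the attributes first, which this sketch does not do.

```python
import math


def pca_2d(xs, ys):
    """Return ((l1, l2), v): the eigenvalues of the sample covariance matrix
    in descending order, and the unit eigenvector v of the largest one."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Entries of the symmetric covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    l1, l2 = half_trace + root, half_trace - root
    if b != 0:
        v = (b, l1 - a)  # solves (A - l1*I) v = 0
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)  # matrix is already diagonal
    norm = math.hypot(v[0], v[1])
    return (l1, l2), (v[0] / norm, v[1] / norm)
```

The ratio l1 / (l1 + l2) is the proportion of variance captured by the first component, which is the quantity to look for when interpreting the PCA output for the full dataset.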

Problem V. Data Integration, Data Warehousing and OLAP (30 points)

  1. (10 points) Describe the main differences between the mediation approach and the data warehousing approach for data integration.

  2. (20 points) (Adapted from Han and Kamber's textbook.) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
    1. Enumerate three classes of schemas that are popularly used for modeling data warehouses.
    2. Draw a schema diagram for the above data warehouse using one of the schema classes listed in your previous answer.
    3. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2005?
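
For intuition about how OLAP operations act on a cuboid such as [day, doctor, patient], the sketch below models cells as a Python dictionary and implements a generic roll-up (aggregation to a coarser key) followed by a slice (selection on one dimension value). The records, identifiers, and function name are all invented for illustration, and the sequence shown is one possible reading rather than the expected written answer.

```python
from collections import defaultdict

# Hypothetical base-cuboid cells: (day, doctor, patient) -> charge.
BASE_CUBOID = {
    ("2004-12-30", "dr_a", "p1"): 100.0,
    ("2005-01-15", "dr_a", "p1"): 80.0,
    ("2005-03-02", "dr_a", "p2"): 120.0,
    ("2005-03-02", "dr_b", "p3"): 60.0,
}


def roll_up(cells, key_fn):
    """Sum the measure after mapping each cell key to a coarser key."""
    out = defaultdict(float)
    for key, charge in cells.items():
        out[key_fn(key)] += charge
    return dict(out)


# Roll up the time dimension from day to year (first four characters of the date).
by_year = roll_up(BASE_CUBOID, lambda k: (k[0][:4], k[1], k[2]))
# Roll up the patient dimension to "all" by dropping it from the key.
by_year_doctor = roll_up(by_year, lambda k: (k[0], k[1]))
# Slice on year = 2005, leaving one total per doctor.
fees_2005 = {doctor: total
             for (year, doctor), total in by_year_doctor.items()
             if year == "2005"}
```

In a real warehouse the same steps would be expressed against the time hierarchy defined in the schema rather than by string slicing on a date.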