### CS 525D KNOWLEDGE DISCOVERY AND DATA MINING   HOMEWORK - Fall 2009

#### PROF. CAROLINA RUIZ

DUE DATE: Thursday Oct. 1st at 2:00 pm.

#### Instructions

• You must work on this homework individually. That is, your homework solutions must be your own. Help or assistance from classmates, other people, or online resources is NOT allowed.
• Hand in hardcopies with written solutions to all the problems below.

#### Problem I. Knowledge Discovery in Databases (25 points)

1. (7 points) Define knowledge discovery in databases.

2. (12 points) Briefly describe the steps of the knowledge discovery in databases process.

3. (7 points) Define data mining.

Base your answers on the class handouts and the paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.

#### Problem II. Data Preprocessing (75 points)

Consider the following dataset.
```
DATE       OUTLOOK         TEMPERATURE   HUMIDITY    WIND    PLAYS

02/13/06   mostly sunny    47            25          strong  no
03/10/06   mostly cloudy   66            57          weak    yes
06/28/06   cloudy          91            75          medium  yes
07/12/06   sunny           82            27          strong  no
08/30/06   rainy           76            80          weak    no
09/23/06   drizzle         66            70          weak    yes
11/24/06   sunny           52            60          medium  no
12/19/06   mostly sunny    41            30          strong  no
01/12/07   cloudy          36            40          ?       no
04/13/07   mostly cloudy   57            40          weak    yes
05/20/07   mostly sunny    68            50          medium  yes
06/28/07   drizzle         73            20          weak    yes
07/06/07   sunny           95            85          weak    yes
08/20/07   rainy           91            60          weak    yes
09/01/07   mostly sunny    80            10          medium  no
10/23/07   mostly cloudy   52            44          weak    no
```

1. (5 points) Assuming that the missing value (marked with "?") for WIND cannot be ignored, discuss 3 different alternatives to fill in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the attribute PLAYS is the target attribute.

2. (5 points) Describe a reasonable transformation of the attribute OUTLOOK so that the number of different values for that attribute is reduced to just 3.

3. (10 points) Discretize the attribute TEMPERATURE by binning it into 4 equi-width intervals using unsupervised discretization. Explain your answer.

4. (10 points) Discretize the attribute HUMIDITY by binning it into 4 equi-depth intervals using unsupervised discretization. Explain your answer.

5. (5 points) Would you keep the attribute DATE in your dataset when mining for patterns that predict the values of the PLAYS attribute? Explain your answer.

6. (10 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
```
[mean - (k+1)*sd, mean - k*sd)
for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
```
Assume that the mean of the attribute HUMIDITY above is 48 and that the standard deviation sd of this attribute is 22.5. Discretize HUMIDITY using this new approach. Show your work.

7. (30 points) Use the supervised discretization filter in Weka (with useKononenko=False) to discretize the TEMPERATURE attribute. Describe the resulting intervals. Looking at the Weka code and at the textbook, explain precisely how those intervals were obtained. Show your work.
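The three unsupervised schemes named in items 3, 4, and 6 can be sketched in a few lines of Python. This is only an illustrative sketch, not a required method: the function names are hypothetical, the equi-width version assumes the maximum value is placed in the last interval, and the equi-depth version assumes that tied values do not straddle a cut point (a tie-handling policy would be needed otherwise). The two lists are copied from the dataset above.

```python
import math

# TEMPERATURE and HUMIDITY columns copied from the dataset above, in row order.
TEMPERATURE = [47, 66, 91, 82, 76, 66, 52, 41, 36, 57, 68, 73, 95, 91, 80, 52]
HUMIDITY = [25, 57, 75, 27, 80, 70, 60, 30, 40, 40, 50, 20, 85, 60, 10, 44]


def equal_width_edges(values, bins):
    """Return bins + 1 edges that split [min, max] into equally wide intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    return [low + i * width for i in range(bins + 1)]


def equal_width_index(value, edges):
    """Map a value to its interval index; the maximum goes in the last interval."""
    width = edges[1] - edges[0]
    return min(int((value - edges[0]) / width), len(edges) - 2)


def equal_depth_bins(values, bins):
    """Sort the values and cut them into groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // bins:(i + 1) * n // bins] for i in range(bins)]


def sd_interval(value, mean, sd):
    """Return (k, low, high) with low = mean-(k+1)*sd <= value < mean-k*sd = high."""
    k = math.ceil((mean - value) / sd) - 1
    return k, mean - (k + 1) * sd, mean - k * sd


if __name__ == "__main__":
    edges = equal_width_edges(TEMPERATURE, 4)
    print("equi-width edges:", edges)
    print("equi-width bin per row:", [equal_width_index(t, edges) for t in TEMPERATURE])
    print("equi-depth bins:", equal_depth_bins(HUMIDITY, 4))
    for h in sorted(set(HUMIDITY)):
        print(h, sd_interval(h, mean=48, sd=22.5))
```

The value of k in `sd_interval` follows from rearranging the interval definition: mean - (k+1)*sd <= v < mean - k*sd holds exactly when k < (mean - v)/sd <= k + 1, so k is the ceiling of (mean - v)/sd minus one.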

#### Problem III. Feature Selection (60 points)

Consider the weather.arff dataset that comes with the Weka system. In this problem you will explain how Correlation-based Feature Selection (CFS) works on this dataset.

1. (5 points) Run the CFS filter of Weka on this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your homework solutions.

2. Looking at the code that implements this CFS filter, as well as its description in the textbook and in class, describe in detail the process followed by CFS:
   1. (5 points) What is the initial (sub)set of attributes under consideration? Is forward or backward search used?
   2. (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this you can add print instructions to the Weka code so that it tells you the order in which it considers the subsets and the goodness value of each of these subsets. Explain your answer.
   3. (25 points) Use the CFS formulas to calculate the goodness of the "best" (sub)set of attributes considered. Show your work.

Figure 7.1 (p.293) taken from the textbook
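As a reminder of the shape of the computation, the commonly stated CFS merit of a subset with k attributes is k * r_cf / sqrt(k + k*(k-1) * r_ff), where r_cf is the average attribute-class correlation and r_ff is the average attribute-attribute correlation. The sketch below only evaluates that expression. The function name and the correlation values are hypothetical placeholders, not values computed from weather.arff, and Weka's own evaluator may estimate the correlations differently, so check the Weka code and the textbook for the exact measure.

```python
import math


def cfs_merit(class_corr, pair_corr):
    """Merit of a subset from per-attribute class correlations and the
    pairwise attribute-attribute correlations (empty for a single attribute)."""
    k = len(class_corr)
    mean_cf = sum(class_corr) / k
    mean_ff = sum(pair_corr) / len(pair_corr) if pair_corr else 0.0
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


if __name__ == "__main__":
    # Hypothetical correlations, chosen only to show how redundancy lowers merit.
    print(cfs_merit([0.5], []))           # a single attribute
    print(cfs_merit([0.5, 0.5], [0.0]))   # two uncorrelated attributes
    print(cfs_merit([0.5, 0.5], [1.0]))   # two fully redundant attributes
```

With these placeholder numbers, adding an uncorrelated attribute raises the merit, while adding a fully redundant one leaves it unchanged, which is the trade-off the search over the lattice is exploiting.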

#### Problem IV. Dimensionality Reduction (60 points)

Consider the Iris dataset that comes with the Weka system (iris.arff). In this problem we'll investigate the effects of applying Principal Components Analysis (PCA) to this dataset. The Iris dataset has 4 numeric predictive attributes: sepallength, sepalwidth, petallength, and petalwidth; and a nominal CLASS with 3 possible values: Iris-setosa, Iris-versicolor, and Iris-virginica.
1. Visualization of the original dataset: Use a software package (e.g., Excel, MATLAB, ...) that allows you to produce the following plots of this dataset:
   • A plot of CLASS as a function of sepallength, sepalwidth, and petallength;
   • A plot of CLASS as a function of sepallength, sepalwidth, and petalwidth;
   • A plot of CLASS as a function of sepallength, petallength, and petalwidth;
   • A plot of CLASS as a function of sepalwidth, petallength, and petalwidth.

   Include your plots in your written report (10 points) and describe any observations you can make from these plots (10 points).

2. Dimensionality Reduction: Load the dataset into Weka and apply PCA (with default parameters) to it. Include in your document the results you obtain together with an explanation of them (15 points).

3. Visualization of the transformed dataset: Save the transformed dataset using Weka. Using the visualization tool you used above, construct a plot of CLASS as a function of the two most significant attributes produced by PCA. Include your plot in your written report (10 points) and describe any observations you can make from this plot, especially in comparison with the plots of the original dataset (15 points).

#### Problem V. Data Integration, Data Warehousing and OLAP (60 points)

1. (10 points) Describe the main differences between the mediation approach and the data warehousing approach for data integration.

2. (20 points) (Adapted from Han and Kamber's textbook.) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
   1. Enumerate three classes of schemas that are popularly used for modeling data warehouses.
   2. Draw a schema diagram for the above data warehouse using one of the schema classes listed in your previous answer.
   3. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2005?

3. (30 points) Consider the following relational table:

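   | MODEL | YEAR | COLOR | SALES |
   |-------|------|-------|-------|
   | Chevy | 1990 | red   | 5     |
   | Chevy | 1990 | white | 87    |
   | Chevy | 1990 | blue  | 62    |
   | Chevy | 1991 | red   | 54    |
   | Chevy | 1991 | white | 95    |
   | Chevy | 1991 | blue  | 49    |
   | Chevy | 1992 | red   | 31    |
   | Chevy | 1992 | white | 54    |
   | Chevy | 1992 | blue  | 71    |
   | Ford  | 1990 | red   | 64    |
   | Ford  | 1990 | white | 62    |
   | Ford  | 1990 | blue  | 63    |
   | Ford  | 1991 | red   | 52    |
   | Ford  | 1991 | white | 9     |
   | Ford  | 1991 | blue  | 55    |
   | Ford  | 1992 | red   | 27    |
   | Ford  | 1992 | white | 62    |
   | Ford  | 1992 | blue  | 39    |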

   1. (5 points) Depict the data in the relational table above as a multidimensional cuboid.
   2. (5 points) Illustrate the result of rolling-up MODEL from individual models to all.
   3. (5 points) Illustrate the result of drilling-down time from YEAR to month.
   4. (5 points) Illustrate the result of slicing for MODEL=Chevy.
   5. (5 points) Illustrate the result of dicing for MODEL=Chevy and YEAR=1991.
   6. (5 points) Starting with the base cuboid [model, year, color, sales], what specific OLAP operations should one perform in order to obtain the total number of red cars sold?
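For readers who want to experiment with the operations named above, the following sketch treats the relational table as a list of tuples and expresses slice/dice as selection on fixed dimension values and roll-up as aggregation over the dropped dimensions. The helper names `select` and `roll_up` are hypothetical, and this is only one way to mimic the operations, not a substitute for the illustrations the questions ask for. Drill-down from YEAR to month is not shown because the table contains no monthly data.

```python
from collections import defaultdict

DIMENSIONS = ("model", "year", "color")

# Rows copied from the relational table above: (MODEL, YEAR, COLOR, SALES).
SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]


def select(rows, **fixed):
    """Slice (one fixed dimension) or dice (several): keep the matching rows."""
    return [
        row for row in rows
        if all(row[DIMENSIONS.index(dim)] == value for dim, value in fixed.items())
    ]


def roll_up(rows, keep):
    """Sum SALES over every dimension that is not listed in `keep`."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMENSIONS.index(dim)] for dim in keep)
        totals[key] += row[3]
    return dict(totals)


if __name__ == "__main__":
    print(select(SALES, model="Chevy"))                # slice
    print(select(SALES, model="Chevy", year=1991))     # dice
    print(roll_up(SALES, keep=("year", "color")))      # roll MODEL up to "all"
    print(roll_up(select(SALES, color="red"), keep=()))  # slice, then roll up everything
```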