CS 525D Fall 2009

Computer Science Department

CS 525D KNOWLEDGE DISCOVERY AND DATA MINING
HOMEWORK - Fall 2009

PROF. CAROLINA RUIZ

DUE DATE: Thursday Oct. 1st at 2:00 pm.

Instructions

You must work on this homework individually. That is, your homework solutions must be your own. Help or assistance from classmates, other people, or online resources is NOT allowed.
Hand in hardcopies with written solutions to all the problems below.
Show your work and justify your answers.

Problem I. Knowledge Discovery in Databases (25 points)

(7 points) Define knowledge discovery in databases.
(12 points) Briefly describe the steps of the knowledge discovery in databases process.
(7 points) Define data mining.

Base your answers on the class handouts and the paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.

Problem II. Data Preprocessing (75 points)

Consider the following dataset.

   DATE       OUTLOOK         TEMPERATURE   HUMIDITY    WIND    PLAYS 

   02/13/06   mostly sunny    47            25          strong  no 
   03/10/06   mostly cloudy   66            57          weak    yes
   06/28/06   cloudy          91            75          medium  yes
   07/12/06   sunny           82            27          strong  no
   08/30/06   rainy           76            80          weak    no
   09/23/06   drizzle         66            70          weak    yes
   11/24/06   sunny           52            60          medium  no
   12/19/06   mostly sunny    41            30          strong  no
   01/12/07   cloudy          36            40          ?       no
   04/13/07   mostly cloudy   57            40          weak    yes
   05/20/07   mostly sunny    68            50          medium  yes
   06/28/07   drizzle         73            20          weak    yes
   07/06/07   sunny           95            85          weak    yes
   08/20/07   rainy           91            60          weak    yes
   09/01/07   mostly sunny    80            10          medium  no
   10/23/07   mostly cloudy   52            44          weak    no

(5 points) Assuming that the missing value (marked with "?") for WIND cannot be ignored, discuss 3 different alternatives to fill in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the attribute PLAYS is the target attribute.
(5 points) Describe a reasonable transformation of the attribute OUTLOOK so that the number of different values for that attribute is reduced to just 3.
(10 points) Discretize the attribute TEMPERATURE by binning it into 4 equi-width intervals using unsupervised discretization. Explain your answer.
(10 points) Discretize the attribute HUMIDITY by binning it into 4 equi-depth intervals using unsupervised discretization. Explain your answer.
(5 points) Would you keep the attribute DATE in your dataset when mining for patterns that predict the values of the PLAYS attribute? Explain your answer.
(10 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
```
 [mean - (k+1)*sd, mean - k*sd)   
 for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
```
Assume that the mean of the attribute HUMIDITY above is 48 and that the standard deviation sd of this attribute is 22.5. Discretize HUMIDITY using this new approach. Show your work.
(30 points) Use the supervised discretization filter in Weka (with useKononenko=False) to discretize the TEMPERATURE attribute. Describe the resulting intervals. Looking at the Weka code and at the textbook, explain precisely how those intervals were obtained. Show your work.
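
The unsupervised discretization schemes named in this problem (equi-width binning, equi-depth binning, and the mean/standard-deviation intervals) can be sketched in a few lines of Python. This is a minimal illustration, not a reference solution: the function names and the sample values are hypothetical and are not taken from the dataset above, ties in equi-depth binning are not treated specially, and equi-width binning assumes the values are not all equal.

```python
import math

def equal_width_bins(values, n_bins):
    """Return the 0-based bin index of each value under equi-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

def sd_interval(value, mean, sd):
    """Return (k, low, high) such that low <= value < high, where the
    interval is [mean - (k+1)*sd, mean - k*sd) for an integer k."""
    k = math.ceil((mean - value) / sd) - 1
    return k, mean - (k + 1) * sd, mean - k * sd

# Hypothetical values, chosen only for illustration.
sample = [10, 12, 15, 20, 22, 30, 35, 50]
print(equal_width_bins(sample, 4))     # [0, 0, 0, 1, 1, 2, 2, 3]
print(equal_depth_bins(sample, 4))     # [[10, 12], [15, 20], [22, 30], [35, 50]]
print(sd_interval(13, mean=10, sd=2))  # (-2, 12, 14)
```

Note how the two schemes differ on the same input: equi-width fixes the interval length (10 here) and lets the counts vary, while equi-depth fixes the count per bin (2 here) and lets the interval lengths vary.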

Problem III. Feature Selection (60 points)

Consider the weather.arff dataset that comes with the Weka system. In this problem you will explain how Correlation-based Feature Selection (CFS) works on this dataset.

(5 points) Run the CFS filter of Weka on this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your homework solutions.
Looking at the code that implements this CFS filter, as well as its description in the textbook and in class, describe in detail the process followed by CFS:
1. (5 points) What's the initial (sub)set of attributes under consideration? Is forward or backward search used?
2. (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this you can add print instructions to the Weka code so that it tells you the order in which it considers the subsets and the goodness value of each of these subsets. Explain your answer.
3. (25 points) Use the CFS formulas to calculate the goodness of the "best" (sub)set of attributes considered. Show your work.
  
  Figure 7.1 (p.293) taken from the textbook
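
As a reminder of the kind of computation involved, the subset "goodness" (merit) heuristic commonly given for CFS rewards high average feature-class correlation and penalizes high average feature-feature correlation. The sketch below assumes the correlations are already available as plain numbers; the function name and the input values are hypothetical, and the exact correlation measure used by Weka should be checked against the code and the textbook.

```python
import math

def cfs_merit(class_corrs, pair_corrs):
    """Merit of a feature subset under the usual CFS heuristic.

    class_corrs: feature-class correlation for each of the k features.
    pair_corrs:  feature-feature correlation for each of the k*(k-1)/2 pairs.
    """
    k = len(class_corrs)
    if k == 0:
        return 0.0
    mean_cf = sum(class_corrs) / k
    mean_ff = sum(pair_corrs) / len(pair_corrs) if pair_corrs else 0.0
    # merit = k * mean_cf / sqrt(k + k*(k-1) * mean_ff)
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)

# Hypothetical correlations, for illustration only.
uncorrelated_pair = cfs_merit([0.5, 0.5], [0.0])  # two independent features
redundant_pair = cfs_merit([0.5, 0.5], [1.0])     # two fully redundant features
print(uncorrelated_pair, redundant_pair)
```

With these inputs, the redundant pair scores no better than either feature alone, which is the behavior the heuristic is designed to produce.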

Problem IV. Dimensionality Reduction (60 points)

Consider the Iris dataset that comes with the Weka system (iris.arff). In this problem we'll investigate the effects of applying Principal Components Analysis (PCA) to this dataset. The Iris dataset has 4 numeric predictive attributes: sepallength, sepalwidth, petallength, and petalwidth; and a nominal CLASS with 3 possible values: Iris-setosa, Iris-versicolor, and Iris-virginica.

Visualization of the original dataset: Use a software package (e.g., Excel, MATLAB, ...) that allows you to produce the following plots of this dataset:
- A plot of CLASS as a function of sepallength, sepalwidth, and petallength;
- A plot of CLASS as a function of sepallength, sepalwidth, and petalwidth;
- A plot of CLASS as a function of sepallength, petallength, and petalwidth;
- A plot of CLASS as a function of sepalwidth, petallength, and petalwidth;
Include your plots in your written report (10 points) and describe any observations you can make from these plots (10 points).
Dimensionality Reduction: Load the dataset onto Weka and apply PCA (with default parameters) to it. Include in your document the results you obtain together with an explanation of them (15 points).
Visualization of the transformed dataset: Save the transformed dataset using Weka. Using the visualization tool you used above, construct a plot of CLASS as a function of the two most significant attributes produced by PCA. Include your plot in your written report (10 points) and describe any observations you can make from this plot, especially in comparison with the plots of the original dataset (15 points).
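
To make the idea behind PCA concrete, the following sketch computes the covariance matrix of a small dataset and extracts its first principal component by power iteration. It is an illustration under stated assumptions, not a description of Weka's implementation: the data points and function names are hypothetical, the data is only centered (Weka's filter may standardize attributes depending on its options), and power iteration is just one simple way to find the leading eigenvector.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]

def first_principal_component(cov, iterations=200):
    """Leading eigenvalue and unit eigenvector of cov via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return eigenvalue, v

# Hypothetical 2-D points lying close to a diagonal line.
points = [(2.0, 1.9), (1.0, 1.1), (3.0, 3.2), (4.0, 3.8)]
cov = covariance_matrix(points)
eigenvalue, direction = first_principal_component(cov)
explained = eigenvalue / (cov[0][0] + cov[1][1])  # share of total variance
print(direction, explained)
```

Because the sample points are nearly collinear, almost all of the variance falls along the first component, which is why a projection onto one or two components can preserve most of the structure of a dataset.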

Problem V. Data Integration, Data Warehousing and OLAP (60 points)

(10 points) Describe the main differences between the mediation approach and the data warehousing approach for data integration.
(20 points) (Adapted from Han and Kamber's textbook.) Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
1. Enumerate three classes of schemas that are popularly used for modeling data warehouses.
2. Draw a schema diagram for the above data warehouse using one of the schema classes listed in your previous answer.
3. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2005?

(30 points) Consider the following relational table:

   MODEL   YEAR   COLOR   SALES

   Chevy   1990   red     5
   Chevy   1990   white   87
   Chevy   1990   blue    62
   Chevy   1991   red     54
   Chevy   1991   white   95
   Chevy   1991   blue    49
   Chevy   1992   red     31
   Chevy   1992   white   54
   Chevy   1992   blue    71
   Ford    1990   red     64
   Ford    1990   white   62
   Ford    1990   blue    63
   Ford    1991   red     52
   Ford    1991   white   9
   Ford    1991   blue    55
   Ford    1992   red     27
   Ford    1992   white   62
   Ford    1992   blue    39

(5 points) Depict the data in the relational table above as a multidimensional cuboid.
(5 points) Illustrate the result of rolling-up MODEL from individual models to all.
(5 points) Illustrate the result of drilling-down time from YEAR to month.
(5 points) Illustrate the result of slicing for MODEL=Chevy.
(5 points) Illustrate the result of dicing for MODEL=Chevy and YEAR=1991.
(5 points) Starting with the base cuboid [model, year, color, sales], what specific OLAP operations should one perform in order to obtain the total number of red cars sold?
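
For reference, the OLAP operations named above can be mimicked on a small in-memory cube. The sketch below stores the relational table as a dictionary keyed by (MODEL, YEAR, COLOR) and applies roll-up, slice, and dice to dimensions other than the ones asked about in the questions. The function names and the dictionary representation are illustrative assumptions, not the notation of any particular OLAP system.

```python
from collections import defaultdict

# (MODEL, YEAR, COLOR) -> SALES, copied from the relational table above.
SALES = {
    ("Chevy", 1990, "red"): 5,  ("Chevy", 1990, "white"): 87, ("Chevy", 1990, "blue"): 62,
    ("Chevy", 1991, "red"): 54, ("Chevy", 1991, "white"): 95, ("Chevy", 1991, "blue"): 49,
    ("Chevy", 1992, "red"): 31, ("Chevy", 1992, "white"): 54, ("Chevy", 1992, "blue"): 71,
    ("Ford", 1990, "red"): 64,  ("Ford", 1990, "white"): 62,  ("Ford", 1990, "blue"): 63,
    ("Ford", 1991, "red"): 52,  ("Ford", 1991, "white"): 9,   ("Ford", 1991, "blue"): 55,
    ("Ford", 1992, "red"): 27,  ("Ford", 1992, "white"): 62,  ("Ford", 1992, "blue"): 39,
}

def roll_up(cube, axis):
    """Aggregate one dimension up to 'all' by summing it away."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop that dimension."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cube.items() if k[axis] == value}

def dice(cube, allowed):
    """Keep the cells whose coordinates fall in the allowed set for each
    listed axis; allowed maps an axis index to a set of values."""
    return {k: v for k, v in cube.items()
            if all(k[axis] in values for axis, values in allowed.items())}

by_model_year = roll_up(SALES, 2)          # roll COLOR up to 'all'
sales_1992 = slice_cube(SALES, 1, 1992)    # slice for YEAR = 1992
ford_red_blue = dice(SALES, {0: {"Ford"}, 2: {"red", "blue"}})
print(by_model_year[("Chevy", 1990)])      # 154
```

A roll-up reduces the dimensionality by aggregation, a slice reduces it by selection on one dimension, and a dice keeps all dimensions but restricts the range of values on one or more of them.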

