HOMEWORK - Fall 2009

- You must work on this homework individually. That is, your homework solutions must be your own. Help or assistance from classmates, other people, or online resources is NOT allowed.
- Hand in **hardcopies** with written solutions to all the problems below.
- Show your work and justify your answers.

- (7 points) Define **knowledge discovery in databases**.
- (12 points) Briefly describe the steps of the **knowledge discovery in databases process**.
- (7 points) Define **data mining**.

| DATE | OUTLOOK | TEMPERATURE | HUMIDITY | WIND | PLAYS |
|---|---|---|---|---|---|
| 02/13/06 | mostly sunny | 47 | 25 | strong | no |
| 03/10/06 | mostly cloudy | 66 | 57 | weak | yes |
| 06/28/06 | cloudy | 91 | 75 | medium | yes |
| 07/12/06 | sunny | 82 | 27 | strong | no |
| 08/30/06 | rainy | 76 | 80 | weak | no |
| 09/23/06 | drizzle | 66 | 70 | weak | yes |
| 11/24/06 | sunny | 52 | 60 | medium | no |
| 12/19/06 | mostly sunny | 41 | 30 | strong | no |
| 01/12/07 | cloudy | 36 | 40 | ? | no |
| 04/13/07 | mostly cloudy | 57 | 40 | weak | yes |
| 05/20/07 | mostly sunny | 68 | 50 | medium | yes |
| 06/28/07 | drizzle | 73 | 20 | weak | yes |
| 07/06/07 | sunny | 95 | 85 | weak | yes |
| 08/20/07 | rainy | 91 | 60 | weak | yes |
| 09/01/07 | mostly sunny | 80 | 10 | medium | no |
| 10/23/07 | mostly cloudy | 52 | 44 | weak | no |

- (5 points) Assuming that the missing value (marked with "?")
for WIND cannot be
ignored, discuss 3 different alternatives to fill in that missing
value. In each case, state what the selected value would be and the
advantages and disadvantages of the approach.
You may assume that the attribute PLAYS is the target attribute.
- (5 points) Describe a reasonable transformation of the attribute OUTLOOK
so that the number of different values for that attribute is
reduced to just 3.
- (10 points) Discretize the attribute TEMPERATURE by binning it into
4 equi-width intervals using unsupervised discretization. Explain your answer.
- (10 points) Discretize the attribute HUMIDITY by binning it into
4 equi-depth intervals using unsupervised discretization. Explain your answer.
- (5 points) Would you keep the attribute DATE in your
dataset when mining for patterns that predict the values
for the PLAYS attribute? Explain your answer.
- (10 points)
Consider the following new approach to discretizing a numeric
attribute: Given the **mean** and the standard deviation (**sd**) of the attribute values, bin the attribute values into the following intervals:

  [**mean** - (k+1) * **sd**, **mean** - k * **sd**) for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...

  Assume that the **mean** of the attribute HUMIDITY above is 48 and that the standard deviation **sd** of this attribute is 22.5. Discretize HUMIDITY using this new approach. Show your work.
- (30 points) Use the supervised discretization filter in Weka (with useKononenko=False) to discretize the TEMPERATURE attribute. Describe the resulting intervals. Looking at the Weka code and at the textbook, explain precisely how those intervals were obtained. Show your work.
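
As a rough illustration of the three unsupervised schemes named above (equi-width, equi-depth, and the mean/sd intervals), here is a minimal Python sketch. The function names and the sample values are made up for illustration and are deliberately not the homework data. The handling of ties and of the maximum value is an assumption; other conventions are possible.

```python
import math

def equi_width_bins(values, n_bins):
    """Split [min, max] into n_bins intervals of equal width.

    Returns the bin index (0-based) of every value. Assumes the values
    are not all identical (otherwise the width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is placed in the last bin instead of opening a new one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equi_depth_bins(values, n_bins):
    """Give every bin (roughly) the same number of values.

    Values are ranked by size; equal values may land in different bins,
    which is one of several possible tie conventions.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def sd_bin(value, mean, sd):
    """Return the integer k with mean - (k+1)*sd <= value < mean - k*sd."""
    return math.ceil((mean - value) / sd) - 1

# Hypothetical, skewed sample (not the homework dataset).
sample = [10, 12, 14, 16, 20, 40, 60, 90]
print(equi_width_bins(sample, 4))  # [0, 0, 0, 0, 0, 1, 2, 3]
print(equi_depth_bins(sample, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
print(sd_bin(35, mean=50, sd=10))  # 1, i.e. the interval [30, 40)
```

On skewed data the equi-width bins can be very unevenly filled, while the equi-depth bins have equal counts but unequal widths.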

- (5 points) Run the CFS filter of Weka on this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your homework solutions.
- Looking at the code that implements this CFS filter, as well
as its description in the textbook and in class, describe in detail
the process followed by CFS:
  - (5 points) What's the initial (sub)set of attributes under consideration? Is forward or backward search used?
  - (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this you can add print statements to the Weka code so that it tells you the order in which it considers the subsets and the goodness value of each of these subsets. Explain your answer.
  - (25 points) Use the CFS formulas to calculate the goodness of the "best"
(sub)set of attributes considered. Show your work.
Figure 7.1 (p.293) taken from the textbook
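
For the goodness calculation, the merit heuristic usually given for CFS (Hall's formulation) is merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the number of attributes in the subset, r_cf is the average attribute-class correlation, and r_ff is the average attribute-attribute correlation. The sketch below only evaluates that expression for given correlation values. The function name and the numbers are hypothetical, and how the correlations themselves are estimated is left out, so check the textbook and the Weka source for the exact definitions used there.

```python
import math

def cfs_merit(attr_class_corr, attr_attr_corr):
    """Evaluate k * r_cf / sqrt(k + k*(k-1) * r_ff) for one attribute subset.

    attr_class_corr: one correlation with the class per attribute in the subset.
    attr_attr_corr:  one correlation per pair of attributes in the subset
                     (empty when the subset has a single attribute).
    """
    k = len(attr_class_corr)
    r_cf = sum(attr_class_corr) / k
    r_ff = sum(attr_attr_corr) / len(attr_attr_corr) if attr_attr_corr else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical correlation values, chosen only to show the trade-off.
print(cfs_merit([0.5], []))          # single attribute: 0.5
print(cfs_merit([0.5, 0.5], [0.0]))  # two uncorrelated attributes: about 0.707
print(cfs_merit([0.5, 0.5], [1.0]))  # two fully redundant attributes: 0.5
```

The three calls show the intended behavior of the heuristic: adding an attribute helps when it is predictive of the class and not redundant with attributes already in the subset.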

**Visualization of the original dataset:** Use a software package (e.g., Excel, MATLAB, ...) that allows you to produce the following plots of this dataset:
- A plot of CLASS as a function of sepallength, sepalwidth, and petallength;
- A plot of CLASS as a function of sepallength, sepalwidth, and petalwidth;
- A plot of CLASS as a function of sepallength, petallength, and petalwidth;
- A plot of CLASS as a function of sepalwidth, petallength, and petalwidth;

**Dimensionality Reduction:** Load the dataset into Weka and apply PCA (with default parameters) to it. Include in your document the results you obtain together with an explanation of them (15 points).

**Visualization of the transformed dataset:** Save the transformed dataset using Weka. Using the visualization tool you used above, construct a plot of CLASS as a function of the two most significant attributes produced by PCA. Include your plot in your written report (10 points) and describe any observations you can make from this plot, especially in comparison with the plots of the original dataset (15 points).
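
To make concrete what a "most significant" PCA attribute is, the following is a minimal two-attribute sketch in plain Python. It uses the closed-form eigen-decomposition of a 2x2 sample covariance matrix on centered (not standardized) data. The function name and the sample points are hypothetical, and Weka's own PCA filter may preprocess the data differently, so this is not a reproduction of its output.

```python
import math

def pca_2d(points):
    """Return ((lambda1, lambda2), first_component) for 2-D points.

    lambda1 >= lambda2 are the eigenvalues of the sample covariance matrix
    [[a, b], [b, c]]; first_component is the unit eigenvector for lambda1.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

    mid = (a + c) / 2
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + rad, mid - rad

    if b != 0:
        vx, vy = b, lam1 - a          # solves (A - lam1 * I) v = 0
    elif a >= c:
        vx, vy = 1.0, 0.0             # already axis-aligned
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    return (lam1, lam2), (vx / norm, vy / norm)

# Hypothetical points lying exactly on a line: one component explains everything.
(lam1, lam2), direction = pca_2d([(1, 1), (2, 2), (3, 3)])
print(lam1 / (lam1 + lam2))  # share of variance on the first component
print(direction)             # unit vector along the diagonal
```

The ratio `lam1 / (lam1 + lam2)` is the proportion of variance captured by the first component, which is the quantity to look for when ranking the transformed attributes.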

- (10 points) Describe the main differences between the **mediation approach** and the **data warehousing approach** for data integration.
- (20 points) (Adapted from Han and Kamber's textbook.)
Suppose that a data warehouse consists of the three dimensions *time*, *doctor*, and *patient*, and the two measures *count* and *charge*, where *charge* is the fee that a doctor charges a patient for a visit.
  - Enumerate three classes of schemas that are popularly used for modeling data warehouses.
  - Draw a schema diagram for the above data warehouse using one of the schema classes listed in your previous answer.
  - Starting with the base cuboid [*day*, *doctor*, *patient*], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2005?

- (30 points) Consider the following relational table:
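
| MODEL | YEAR | COLOR | SALES |
|---|---|---|---|
| Chevy | 1990 | red | 5 |
| Chevy | 1990 | white | 87 |
| Chevy | 1990 | blue | 62 |
| Chevy | 1991 | red | 54 |
| Chevy | 1991 | white | 95 |
| Chevy | 1991 | blue | 49 |
| Chevy | 1992 | red | 31 |
| Chevy | 1992 | white | 54 |
| Chevy | 1992 | blue | 71 |
| Ford | 1990 | red | 64 |
| Ford | 1990 | white | 62 |
| Ford | 1990 | blue | 63 |
| Ford | 1991 | red | 52 |
| Ford | 1991 | white | 9 |
| Ford | 1991 | blue | 55 |
| Ford | 1992 | red | 27 |
| Ford | 1992 | white | 62 |
| Ford | 1992 | blue | 39 |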

- (5 points) Depict the data in the relational table above as a multidimensional cuboid.
- (5 points) Illustrate the result of rolling-up MODEL from individual models to **all**.
- (5 points) Illustrate the result of drilling-down time from YEAR to month.
- (5 points) Illustrate the result of slicing for MODEL=Chevy.
- (5 points) Illustrate the result of dicing for MODEL=Chevy and YEAR=1991.
- (5 points) Starting with the basic cuboid
*model, year, color, sales*, what specific OLAP operations should one perform in order to obtain the total number of red cars sold?
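
As a reminder of what the OLAP operations named above do to a cuboid, here is a minimal Python sketch on a tiny made-up fact table. The table, the dimension names, and the helper functions `roll_up` and `select` are all hypothetical and are not the car-sales data; the sketch only illustrates the mechanics of aggregating a dimension away and of restricting a cube to chosen dimension values.

```python
from collections import defaultdict

# Hypothetical fact table: (store, quarter, item) -> units sold.
DIMS = ("store", "quarter", "item")
FACTS = {
    ("north", "Q1", "pen"): 4,
    ("north", "Q1", "ink"): 1,
    ("north", "Q2", "pen"): 3,
    ("south", "Q1", "pen"): 2,
    ("south", "Q2", "ink"): 5,
}

def roll_up(cube, dims, drop):
    """Roll one dimension up to 'all' by summing the measure over it.

    Returns the smaller cube and its remaining dimension names.
    """
    idx = dims.index(drop)
    out = defaultdict(int)
    for key, measure in cube.items():
        out[key[:idx] + key[idx + 1:]] += measure
    return dict(out), dims[:idx] + dims[idx + 1:]

def select(cube, dims, **fixed):
    """Keep only the cells whose dimensions match every name=value pair.

    With one condition this corresponds to a slice, with several to a dice.
    Here the fixed dimensions stay in the key instead of being dropped.
    """
    return {
        key: measure
        for key, measure in cube.items()
        if all(key[dims.index(name)] == value for name, value in fixed.items())
    }

# Total pens sold: restrict to item=pen, then roll up store and quarter.
pens = select(FACTS, DIMS, item="pen")
by_quarter, dims = roll_up(pens, DIMS, "store")
total, dims = roll_up(by_quarter, dims, "quarter")
print(by_quarter)  # {('Q1', 'pen'): 6, ('Q2', 'pen'): 3}
print(total)       # {('pen',): 9}
```

Drill-down is the inverse direction and cannot be computed from an aggregated cube alone; it needs the finer-grained data to be available.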