WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2018  
Project 1: Data Pre-processing

PROF. CAROLINA RUIZ 

Due Date: Sept. 13, 2018.
------------------------------------------

Instructions


Problem I. Knowledge Discovery in Databases (20 points)

  1. (5 points) Define knowledge discovery in databases.

  2. (10 points) Briefly describe the steps of the knowledge discovery in databases process.

  3. (5 points) Define data mining.
Base your answers on the definitions presented in class, the textbook, and the following paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, 17(3), pp. 37-54, Fall 1996. However, your answers must be written in your own words.

Problem II. Data Preprocessing (60 points)

Consider the following small subset of the Adult Dataset available at the UCI Machine Learning Data Repository. See the link above for a description of this dataset.
  Attributes SEX and CLASS are discrete;
  Attributes AGE, EDUCATION_NUM and HOURS_PER_WEEK are continuous.

    SEX		AGE	EDUCATION_NUM	HOURS_PER_WEEK	CLASS 
    Male	27	9		40		<=50K
    Female	28	13		40		<=50K
    Male	29	10		50		<=50K
    Male	30	9		40		<=50K
    Male	35	11		40		<=50K
    Female	36	9		40		<=50K
    Female	37	14		40		<=50K
    Male	38	9		?		<=50K
    Male	40	16		60		>50K
    Female	44	14		40		<=50K
    Male	45	14		40		>50K
    Female	47	14		50		<=50K
    Male	48	9		46		<=50K
    Male	49	11		40		>50K
    Male	49	9		40		>50K
    Female	49	9		40		<=50K
    Male	50	13		55		>50K
    Male	52	9		45		>50K
    Male	52	13		40		<=50K
    Male	54	10		60		>50K

  1. (5 points) Assuming that the missing value (marked with "?") in HOURS_PER_WEEK cannot be ignored, discuss 3 different alternatives to filling in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the CLASS attribute is the target attribute.

  2. (5 points) Describe a reasonable transformation of the attribute EDUCATION_NUM so that the number of different values for that attribute is reduced to just 3. [First investigate the meaning of this attribute in the dataset webpage provided above.]

  3. (5 points) Discretize the AGE attribute by binning it into 4 equi-width intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.
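A by-hand equi-width result can be verified with a short script. A sketch, using bin width (max - min) / 4 = 6.75; the convention that the maximum value falls in the last bin is an assumption you should state explicitly in your answer:

```python
from collections import Counter

ages = [27, 28, 29, 30, 35, 36, 37, 38, 40, 44, 45, 47, 48, 49,
        49, 49, 50, 52, 52, 54]

lo, hi = min(ages), max(ages)
width = (hi - lo) / 4                       # (54 - 27) / 4 = 6.75
edges = [lo + i * width for i in range(5)]  # [27, 33.75, 40.5, 47.25, 54]

# bin index per age; the maximum (54) is placed in the last bin
bins = [min(int((a - lo) // width), 3) for a in ages]
print(edges, Counter(bins))  # bin counts: 4, 5, 3, 8
```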

  4. (5 points) Discretize the AGE attribute by binning it into 4 equi-depth (= equal-frequency) intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.
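A sketch for cross-checking the equi-depth result (20 values, so 5 per bin). Note that the value 49 appears on both sides of the third boundary, so by hand you must decide, and justify, which side the tied values go:

```python
ages = sorted([27, 28, 29, 30, 35, 36, 37, 38, 40, 44, 45, 47, 48,
               49, 49, 49, 50, 52, 52, 54])

depth = len(ages) // 4                                  # 5 values per bin
chunks = [ages[i*depth:(i+1)*depth] for i in range(4)]
# candidate cut points halfway between adjacent chunks
cuts = [(chunks[i][-1] + chunks[i+1][0]) / 2 for i in range(3)]
print(cuts)  # [35.5, 44.5, 49.0]; the last cut lands on the tied value 49
```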

  5. (10 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
     [mean - (k+1)*sd, mean - k*sd)   
     for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
    
    Assume that the mean of the attribute AGE above is 42 and that the standard deviation sd of this attribute is 8. Discretize AGE by hand using this new approach. Show your work.
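The intervals can be enumerated mechanically. A sketch: with the given mean = 42 and sd = 8, the only boundaries that matter for ages 27-54 are 26, 34, 42, 50, and 58:

```python
import math
from collections import Counter

mean_age, sd = 42, 8   # given in the problem statement
ages = [27, 28, 29, 30, 35, 36, 37, 38, 40, 44, 45, 47, 48, 49,
        49, 49, 50, 52, 52, 54]

# age a falls in [mean + j*sd, mean + (j+1)*sd), which is the problem's
# interval [mean - (k+1)*sd, mean - k*sd) with k = -(j+1)
j = [math.floor((a - mean_age) / sd) for a in ages]
counts = Counter(j)
print(sorted(counts.items()))
# [(-2, 4), (-1, 5), (0, 7), (1, 4)]
# i.e. [26,34): 4 ages, [34,42): 5, [42,50): 7, [50,58): 4
```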

  6. (30 points) Use the supervised discretization filter in Weka (with useKononenko=False and useBetterEncoding=True, and default values for the other parameters) to discretize the HOURS_PER_WEEK attribute. Describe the resulting intervals. Find the Java code that implements this filter in the directories that contain the Weka files. (See the instructions for locating Weka's source code at the beginning of this project assignment.) Read the code carefully and describe, in your own words, the algorithm it follows in your written report.
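Before reading Weka's source, it helps to have the underlying method in mind: the supervised Discretize filter is based on Fayyad and Irani's entropy/MDL discretization, which recursively picks the cut point minimizing the class entropy and stops when the information gain fails an MDL test. A minimal sketch of that idea follows; it is not a transcription of Weka's code, and the useBetterEncoding option changes the stopping term, so Weka's intervals may differ from this sketch's:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_splits(values, labels):
    """Recursive entropy-based splitting with the Fayyad-Irani MDL stop test."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([l for _, l in pairs])
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # cuts only between distinct values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or e < best[0]:
            best = (e, i, left, right)
    if best is None:
        return []
    e, i, left, right = best
    gain = base - e
    k, k1, k2 = len(set(left + right)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * base - k1 * entropy(left)
                                     - k2 * entropy(right))
    if gain <= (math.log2(n - 1) + delta) / n:   # MDL: split not worth encoding
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    vals = [v for v, _ in pairs]
    return mdl_splits(vals[:i], left) + [cut] + mdl_splits(vals[i:], right)

# the 19 known HOURS_PER_WEEK values from the small table, with their classes
hours = [40, 40, 50, 40, 40, 40, 40, 60, 40, 40, 50, 46, 40, 40, 40, 55, 45, 40, 60]
labels = ["<=50K"] * 7 + [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K",
                          ">50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]
print(mdl_splits(hours, labels))  # []: on this tiny subset the MDL test rejects every cut
```

On the small 20-row table, the best candidate cut (52.5) is rejected by the MDL test, leaving a single interval; the full assignment dataset, and Weka's own encoding choices, can behave differently, which is exactly what your report should investigate.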

Problem III. Feature Selection (60 points)

Consider the weather.nominal.arff dataset that comes with the Weka system. In this problem you will explain how Correlation-based Feature Selection (CFS) works on this dataset. (See Witten's and Frank's textbook slides - Chapter 7, Slides 5-6, and also Mark A. Hall's PhD thesis.) See Section 2.4.6 of the Tan, Steinbach, Karpatne and Kumar textbook for the definition and formulas for Mutual Information.
  1. (5 points) Apply Weka's CfsSubsetEval (available under the Select attributes tab) to this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your project solutions.
  2. Looking at the code that implements CfsSubsetEval, as well as its description in the textbook and in class, describe in detail the process that it follows:
    1. (5 points) What's the initial (sub)set of attributes under consideration? Is forward or backward search used?
    2. (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this, add print statements to the Weka code so that it tells you the order in which the subsets are considered and the goodness value of each subset. Explain your answer.
    3. (25 points) Use the CfsSubsetEval formulas to calculate the goodness of the "best" (sub)set of attributes considered. Show your work.

      weather_data_attribute_latice.gif

      Taken from Witten's and Frank's textbook slides - Chapter 7.
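The quantities needed in parts 2-3 can be prototyped outside Weka before you instrument the Java code. A sketch of the CFS merit computation, assuming symmetric uncertainty as the correlation measure (the measure Hall's CFS uses for nominal data); the weather.nominal values are typed in directly:

```python
import math
from collections import Counter

def H(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def H_cond(xs, ys):
    # conditional entropy H(X | Y)
    n = len(ys)
    return sum((cnt / n) * H([x for x, y2 in zip(xs, ys) if y2 == y])
               for y, cnt in Counter(ys).items())

def su(xs, ys):
    # symmetric uncertainty: 2 * gain / (H(X) + H(Y)), a value in [0, 1]
    hx, hy = H(xs), H(ys)
    return 0.0 if hx + hy == 0 else 2 * (hx - H_cond(xs, ys)) / (hx + hy)

def merit(subset, data, target="play"):
    # CFS: k * avg(feature-class corr) / sqrt(k + k*(k-1)*avg(feature-feature corr))
    k = len(subset)
    rcf = sum(su(data[f], data[target]) for f in subset) / k
    if k == 1:
        return rcf
    pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
    rff = sum(su(data[f], data[g]) for f, g in pairs) / len(pairs)
    return k * rcf / math.sqrt(k + k * (k - 1) * rff)

data = {  # weather.nominal.arff, 14 instances
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                    "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "humidity": ["high", "high", "high", "high", "normal", "normal", "normal",
                 "high", "normal", "normal", "normal", "high", "normal", "high"],
    "windy": ["FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE",
              "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"],
    "play": ["no", "no", "yes", "yes", "yes", "no", "yes",
             "no", "yes", "yes", "yes", "yes", "yes", "no"],
}

for attr in ["outlook", "temperature", "humidity", "windy"]:
    print(attr, round(merit([attr], data), 3))
```

The forward best-first search in part 1 repeatedly evaluates merit(subset plus one more attribute) for each unused attribute, which is exactly the quantity this sketch computes, so it is a useful sanity check against the goodness values your print statements report.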


Problem IV. Exploring Real Data (65 points)

Consider this given subset of the Adult Dataset, extracted from the full Adult Dataset available at the UCI Machine Learning Repository.

Load this dataset into Weka by opening your arff dataset from the "Explorer" window in Weka. Load it into Python as well.

  1. Dataset Exploration. (40 points) Use Python, your own code, or Weka to complete the following parts. Please state in your report which tool from the above list you used for each part.

    1. (5 points) Start by familiarizing yourself with the dataset. Carefully look at the data directly (for this, use Excel or a file editor, as well as Weka's and Python's functionality to explore and visualize the data). Describe in your report your observations about what is good about this data (mention at least 2 different good things) and what is problematic about it (mention at least 2 different bad things). If appropriate, include visualizations of those good/bad things.

    2. For the AGE attribute:
      1. (5 points) Calculate the quartiles, mean, median, range, and variance of this attribute.
      2. (5 points) Plot a histogram of this attribute using 10 bins.

    3. In this part, use only the following attributes in the dataset: AGE, EDUCATION-NUM, RACE, SEX, CAPITAL-GAIN, CAPITAL-LOSS, HOURS-PER-WEEK, and CLASS. For these attributes calculate:
      1. (10 points) the covariance matrix and
      2. (10 points) the correlation matrix of these attributes.
        Construct a visualization of each of these matrices (e.g., heatmap) using Python to more easily understand them.
        See Section 2.4.5 of the Tan, Steinbach, Karpatne and Kumar's textbook for the definitions and formulas for correlation and covariance.
      3. (5 points) If you had to remove 2 of the attributes above from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
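All of the statistics in part 1 are a few NumPy calls. A sketch using stand-in data (the 19 complete rows of the small Problem II table, with attributes AGE, EDUCATION_NUM, HOURS_PER_WEEK); for the real assignment, load the provided adult subset instead, e.g. with pandas.read_csv, and use the full attribute list:

```python
import numpy as np

# Stand-in: the 19 complete rows of the Problem II table
rows = np.array([
    [27, 9, 40], [28, 13, 40], [29, 10, 50], [30, 9, 40], [35, 11, 40],
    [36, 9, 40], [37, 14, 40], [40, 16, 60], [44, 14, 40], [45, 14, 40],
    [47, 14, 50], [48, 9, 46], [49, 11, 40], [49, 9, 40], [49, 9, 40],
    [50, 13, 55], [52, 9, 45], [52, 13, 40], [54, 10, 60],
], dtype=float)

age = rows[:, 0]
q1, q2, q3 = np.percentile(age, [25, 50, 75])       # quartiles
stats = {"mean": age.mean(), "median": np.median(age),
         "range": age.max() - age.min(), "variance": age.var(ddof=1)}
counts, edges = np.histogram(age, bins=10)           # 10-bin histogram

cov = np.cov(rows, rowvar=False)        # covariance matrix (attributes as columns)
corr = np.corrcoef(rows, rowvar=False)  # correlation matrix
print(stats, counts, sep="\n")
```

A heatmap of corr (e.g. matplotlib.pyplot.matshow(corr)) makes the pairwise relationships easier to scan; attribute pairs with near-duplicate rows in the correlation matrix are natural candidates for the removal question in part 3.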

  2. Dimensionality Reduction.

    You must perform each of the parts of this problem both in Weka and separately in Python.

    1. (10 points) For this part, USE ONLY THE CONTINUOUS (denoted as "numeric") attributes in the dataset. Apply Principal Components Analysis in Weka and separately in Python to reduce the dimensionality of the full dataset. In Weka, use the PrincipalComponents option from the "Select attributes" tab. Use parameter values: centerData=True, varianceCovered=0.99. How many dimensions (= attributes) does the original dataset contain? How many dimensions are obtained after PCA? How much of the variance do they explain? Include in your report the linear combination that defines the first new attribute (= component) obtained. Look at the results and elaborate on any interesting observations you can make about them.
    2. (5 points) Repeat the PCA experiments above but adding now the MARITAL-STATUS attribute to the dataset (that is, all continuous attributes and MARITAL-STATUS). Explain in your report any changes in the results. Describe also how the MARITAL-STATUS attribute was transformed from discrete to continuous so that PCA could handle it.
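The Python side of these PCA experiments can be done with sklearn.decomposition.PCA or, equivalently, a few lines of NumPy that mirror what centerData=True and varianceCovered=0.99 mean. A sketch on the same stand-in data as above (substitute the assignment's adult subset in practice):

```python
import numpy as np

# Stand-in data: continuous columns (AGE, EDUCATION_NUM, HOURS_PER_WEEK)
# of the 19 complete Problem II rows
X = np.array([
    [27, 9, 40], [28, 13, 40], [29, 10, 50], [30, 9, 40], [35, 11, 40],
    [36, 9, 40], [37, 14, 40], [40, 16, 60], [44, 14, 40], [45, 14, 40],
    [47, 14, 50], [48, 9, 46], [49, 11, 40], [49, 9, 40], [49, 9, 40],
    [50, 13, 55], [52, 9, 45], [52, 13, 40], [54, 10, 60],
], dtype=float)

Xc = X - X.mean(axis=0)                   # centerData=True
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]         # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()       # fraction of variance per component
# smallest number of components covering 99% of variance (varianceCovered=0.99)
k = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)
scores = Xc @ eigvecs[:, :k]              # the transformed dataset
print(k, np.round(explained, 3), np.round(eigvecs[:, 0], 3))
```

The column eigvecs[:, 0] is the linear combination of the original attributes that defines the first component; its entries are the coefficients your report should list (up to an arbitrary overall sign).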

  3. Feature Selection. (10 points)

    You must perform each of the parts of this problem both in Weka and separately in Python.

    For this part, USE ONLY THE DISCRETE attributes in the dataset. Use the CLASS attribute as the target classification attribute. Apply Correlation-based Feature Selection (CFS) (see Witten's and Frank's textbook slides - Chapter 7, Slides 5-6). For this, use Weka's CfsSubsetEval available under the Select attributes tab with default parameters. Separately, use Python for the same purpose. Look at the results to determine which attributes were selected by this method and elaborate on any interesting observations you can make about the results.


ORAL AND WRITTEN REPORTS AND DUE DATE