WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2014 
Project 1: Data Exploration and Data Pre-processing

PROF. CAROLINA RUIZ 

DUE DATE: Thursday, Nov. 6, 2014 at the beginning of class (11:00 am)
------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is multi-fold:

PROJECT ASSIGNMENT

Readings:

Written Report: Your written report should consist of your answers to each of the parts in the assignment below. Both members of the team are expected to be involved in and contribute to each and every problem on this project.

Assignment:

  1. Weka and Dataset.

    1. Weka: Download and install the developer version of the Weka system as described in the Course Webpage. Determine the name/path of the directory created to store the Weka files (e.g., C:\Program Files\Weka-3-7\). We'll call that directory WekaDirectory in the remainder of this project description.

      You can find the Weka code in a file called "weka-src.jar", which should be located in the directory where Weka was installed. This "weka-src.jar" file is a zip file. Hence you need to unzip it to extract its contents. Inside, you will find the .java files that implement Weka.

      Read the "Explorer Guide" and the "Experimenter Tutorial" provided with the Weka system. Browse through the "Package Documentation" to become familiar with it.

      When needed, use the following command to increase the amount of main memory available to Weka. Here, I'm increasing it to 768 MB, but you can specify any other size instead of 768 if more memory is needed/available:

      java -Xmx768m -jar weka.jar
      

    2. Dataset: In this project we will use the Census-Income (also known as "Adult") Dataset available from the University of California, Irvine (UCI) Machine Learning Repository.

      In particular,

      • Use the data in the file "adult.data", and the description of the data in the file "adult.names".
      • Use the nominal attribute "salary" (with values >50K and <=50K in the data files) as the classification target.

    3. Convert the dataset to the arff format. For this you can either use tools provided by Weka or make the conversion outside the Weka system using other tools (e.g., a text editor, Excel, etc.). Create a CensusIncome.arff file with the converted dataset.
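
      If you take the Weka route, a minimal sketch in Java is shown below. It assumes you first add a header row naming the attributes (including "salary" for the last column) to a copy of adult.data, since Weka's CSVLoader reads attribute names from the first row; "?" is already CSVLoader's default missing-value marker, and the file names are placeholders.

      import java.io.File;
      import weka.core.Instances;
      import weka.core.converters.ArffSaver;
      import weka.core.converters.CSVLoader;

      // Sketch: convert the comma-separated adult data to ARFF with Weka's converters.
      public class CensusToArff {
        public static void main(String[] args) throws Exception {
          CSVLoader loader = new CSVLoader();
          loader.setSource(new File("adult-with-header.csv")); // adult.data plus a header row
          Instances data = loader.getDataSet();

          ArffSaver saver = new ArffSaver();
          saver.setInstances(data);
          saver.setFile(new File("CensusIncome.arff"));
          saver.writeBatch();
        }
      }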

    4. (5 points) Include in your report the header of your CensusIncome.arff file together with the first 10 data instances of the dataset. (Do NOT include the full dataset - just the first 10 data instances.)

    5. Load this dataset into Weka by opening your arff dataset from the "Explorer" window in Weka. Increase the memory available to Weka as needed.

  2. Data Exploration. See Chapter 3 of the textbook. Use the full dataset and Excel, Matlab, your own code, Weka, or other software to complete the following parts. Please state in your report which tool(s) from the above list you used for each part below.

    1. (5 points) Are there any attributes that you would remove from the dataset beforehand? If so, which? Provide an explanation of why you would remove each of them.

    2. For each of the following nominal attributes
      workclass,
      education, and
      sex
      
      1. (5 points) Calculate the value frequencies and the mode of the attribute.
      2. (10 points) Provide a graphical depiction of the distribution of the target classification attribute for each of the values of the attribute under consideration.
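
      If you write your own code for parts 2.2.1 and 2.2.2, a rough sketch using the Weka API is given below; attribute and class names assume your CensusIncome.arff uses the names from adult.names plus "salary" for the class.

      import weka.core.AttributeStats;
      import weka.core.Instance;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;

      // Sketch: value frequencies and mode of one nominal attribute, plus the
      // salary-class distribution for each of its values.
      public class NominalCounts {
        public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("CensusIncome.arff");
          data.setClassIndex(data.attribute("salary").index());
          int idx = data.attribute("workclass").index();   // repeat for education, sex

          AttributeStats stats = data.attributeStats(idx);
          int[] counts = stats.nominalCounts;
          int mode = 0;
          for (int v = 0; v < counts.length; v++) {
            System.out.println(data.attribute(idx).value(v) + ": " + counts[v]);
            if (counts[v] > counts[mode]) mode = v;
          }
          System.out.println("mode = " + data.attribute(idx).value(mode));

          // Cross-tabulate attribute values against the two salary classes (part 2.2.2).
          int[][] classDist = new int[counts.length][data.numClasses()];
          for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            if (!inst.isMissing(idx) && !inst.classIsMissing())
              classDist[(int) inst.value(idx)][(int) inst.classValue()]++;
          }
          for (int v = 0; v < counts.length; v++)
            System.out.println(data.attribute(idx).value(v) + " -> "
                + data.classAttribute().value(0) + ": " + classDist[v][0] + ", "
                + data.classAttribute().value(1) + ": " + classDist[v][1]);
        }
      }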

    3. For each of the following continuous attributes
      age
      education-num
      capital-gain
      
      1. (10 points) Calculate the percentiles (in increments of 10, as in Table 3.2 of the textbook, page 101), mean, median, range, and variance of the attribute.
      2. (20 points) Plot a histogram of the attribute using 10 or 20 bins (you choose the best value for each attribute). For examples, see Figures 3.7 and 3.8 in the textbook, page 113.
      3. (10 points) For the capital-gain attribute only, plot a graph in which the Y axis corresponds to the classification target (salary), the X axis is the capital-gain attribute, and for each data instance in the dataset there is a point (x,y), where x and y are respectively the capital-gain value and the salary value of that instance. For example, the plot will contain the point (2174, <=50K), which corresponds to the first data instance of the dataset.
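
      For part 2.3.1, a sketch of the summary statistics for one continuous attribute is given below (attribute names as in adult.names). The percentiles use the nearest-rank convention on the sorted values; a tool that interpolates will give slightly different numbers. These three attributes have no missing values in adult.data, so no special handling is needed.

      import java.util.Arrays;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;

      // Sketch: percentiles, mean, median, range and (sample) variance of age.
      public class NumericSummary {
        public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("CensusIncome.arff");
          double[] vals = data.attributeToDoubleArray(data.attribute("age").index());
          Arrays.sort(vals);
          int n = vals.length;

          double mean = 0;
          for (double v : vals) mean += v;
          mean /= n;
          double var = 0;
          for (double v : vals) var += (v - mean) * (v - mean);
          var /= (n - 1);
          double median = (n % 2 == 1) ? vals[n / 2] : (vals[n / 2 - 1] + vals[n / 2]) / 2.0;
          double range = vals[n - 1] - vals[0];

          System.out.printf("mean=%.2f median=%.1f range=%.1f variance=%.2f%n",
              mean, median, range, var);
          for (int p = 0; p <= 100; p += 10) {
            int pos = (int) Math.round(p / 100.0 * (n - 1));
            System.out.println(p + "th percentile: " + vals[pos]);
          }
        }
      }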

    4. For the set of all continuous (= numeric) attributes in the dataset:
      1. (20 points) Calculate the covariance matrix and the correlation matrix of these attributes. See notes on using Matlab and Excel to calculate these matrices.
      2. (5 points) If you had to remove some of these continuous attributes from the dataset based on these two matrices, which attributes would you remove if any and why? Explain your answer.
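
      For part 2.4.1, if you prefer your own code to Matlab or Excel, a sketch over all numeric attributes is given below. It computes the sample covariance (dividing by n-1); a tool that divides by n will give slightly different values.

      import java.util.ArrayList;
      import java.util.List;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;

      // Sketch: covariance and correlation matrices of the numeric attributes.
      // cov(a,b) = sum((x_a - mean_a)(x_b - mean_b)) / (n-1);  corr = cov / (sd_a * sd_b)
      public class CovCorr {
        public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("CensusIncome.arff");
          List<Integer> num = new ArrayList<Integer>();
          for (int i = 0; i < data.numAttributes(); i++)
            if (data.attribute(i).isNumeric()) num.add(i);

          int n = data.numInstances(), k = num.size();
          double[][] x = new double[k][];
          double[] mean = new double[k];
          for (int j = 0; j < k; j++) {
            x[j] = data.attributeToDoubleArray(num.get(j));
            for (double v : x[j]) mean[j] += v;
            mean[j] /= n;
          }

          double[][] cov = new double[k][k];
          for (int a = 0; a < k; a++)
            for (int b = 0; b < k; b++) {
              double s = 0;
              for (int i = 0; i < n; i++) s += (x[a][i] - mean[a]) * (x[b][i] - mean[b]);
              cov[a][b] = s / (n - 1);
            }

          for (int a = 0; a < k; a++)
            for (int b = 0; b < k; b++)
              System.out.printf("%s / %s: cov=%.2f corr=%.3f%n",
                  data.attribute(num.get(a)).name(), data.attribute(num.get(b)).name(),
                  cov[a][b], cov[a][b] / Math.sqrt(cov[a][a] * cov[b][b]));
        }
      }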

  3. Data Preprocessing. See Chapter 2 of your textbook. Upload the dataset into Weka as described above. We'll refer to this dataset as the "input dataset" below.

    1. Sampling.

      1. (5 points) Use Weka's unsupervised Resample filter to obtain a 50% subsample of the input dataset without replacement. Include in your report the distribution of the target attribute in the subsample.

      2. (5 points) Use Weka's supervised Resample filter to obtain a 50% subsample of the input dataset without replacement. Include in your report the distribution of the target attribute in the subsample.

      3. (5 points) Are the above two distributions different? Why is that?
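
      The same two subsamples can also be produced programmatically; a sketch follows. Note that the two Resample filters live in different packages (unsupervised.instance vs. supervised.instance) and that the supervised one needs the class index set.

      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;
      import weka.filters.Filter;

      // Sketch: 50% subsamples without replacement, unsupervised vs. supervised.
      public class Subsample {
        public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("CensusIncome.arff");
          data.setClassIndex(data.attribute("salary").index());

          weka.filters.unsupervised.instance.Resample unsup =
              new weka.filters.unsupervised.instance.Resample();
          unsup.setSampleSizePercent(50);
          unsup.setNoReplacement(true);
          unsup.setInputFormat(data);
          printClassDist(Filter.useFilter(data, unsup));

          weka.filters.supervised.instance.Resample sup =
              new weka.filters.supervised.instance.Resample();
          sup.setSampleSizePercent(50);
          sup.setNoReplacement(true);
          sup.setInputFormat(data);      // uses the class (salary) attribute
          printClassDist(Filter.useFilter(data, sup));
        }

        static void printClassDist(Instances d) {
          int[] counts = d.attributeStats(d.classIndex()).nominalCounts;
          for (int v = 0; v < counts.length; v++)
            System.out.println(d.classAttribute().value(v) + ": " + counts[v]);
        }
      }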

    2. Attribute Discretization. Starting with the input dataset (before sampling):

      1. (10 points) Use Weka's unsupervised Discretize filter to discretize the continuous attribute age of the input dataset into 10 bins using equal frequency (i.e., useEqualFrequency=True). Include the results in your report, as well as the distribution of the target attribute in each of the bins.

      2. (10 points) Use Weka's unsupervised Discretize filter to discretize the continuous attribute age of the input dataset into 10 bins using equal width (i.e., useEqualFrequency=False). Include the results in your report, as well as the distribution of the target attribute in each of the bins.

      3. (10 points) Use Weka's supervised Discretize filter to discretize the continuous attribute age of the input dataset with respect to the class attribute. Include the results in your report, as well as the distribution of the target attribute in each of the resulting bins.

      4. Weka Code. Find the Java code that implements the unsupervised discretization filter in the directories that contain the Weka files, following the instructions provided above.
        1. (5 points) Include the first 20 lines of the code implementing this filter in your report, and
        2. (10 points) Describe the algorithm followed by this code when doing unsupervised equal-frequency discretization (i.e., useEqualFrequency=True) in your own words.
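
      Separately from the implementation code you are asked to read in part 3.2.4, the three filters of parts 3.2.1-3.2.3 can be applied programmatically roughly as follows; the sketch assumes age is the first attribute, as in the adult.names ordering.

      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;
      import weka.filters.Filter;

      // Sketch: discretizing age three ways.
      public class DiscretizeAge {
        public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("CensusIncome.arff");

          // 3.2.1 / 3.2.2: unsupervised, 10 bins, equal frequency vs. equal width
          weka.filters.unsupervised.attribute.Discretize unsup =
              new weka.filters.unsupervised.attribute.Discretize();
          unsup.setAttributeIndices("first");     // the age attribute
          unsup.setBins(10);
          unsup.setUseEqualFrequency(true);       // set to false for equal width
          unsup.setInputFormat(data);
          Instances equalFreq = Filter.useFilter(data, unsup);
          System.out.println(equalFreq.attribute(0));   // shows the bin boundaries

          // 3.2.3: supervised (entropy-based), needs the class attribute set
          data.setClassIndex(data.attribute("salary").index());
          weka.filters.supervised.attribute.Discretize sup =
              new weka.filters.supervised.attribute.Discretize();
          sup.setAttributeIndices("first");
          sup.setInputFormat(data);
          Instances supervised = Filter.useFilter(data, sup);
          System.out.println(supervised.attribute(0));
        }
      }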

    3. Missing Values. Starting with the input dataset before sampling and before discretization:

      Use Weka's unsupervised ReplaceMissingValues filter to fill in the missing values in the attribute occupation.

      1. (10 points) Describe the Weka code implementing this filter in your report.
      2. (10 points) Compare the distribution of the original occupation attribute against the distribution of this attribute after replacing the missing values.
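
      Applied programmatically, the filter looks roughly as sketched below. (According to its documentation, ReplaceMissingValues fills missing nominal values with the attribute's mode and missing numeric values with its mean, which is worth keeping in mind for part 3.3.2.)

      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;
      import weka.filters.Filter;
      import weka.filters.unsupervised.attribute.ReplaceMissingValues;

      // Sketch: fill in missing values, then compare the occupation attribute.
      public class FillMissing {
        public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("CensusIncome.arff");
          data.setClassIndex(data.attribute("salary").index());
          int occ = data.attribute("occupation").index();

          ReplaceMissingValues rmv = new ReplaceMissingValues();
          rmv.setInputFormat(data);
          Instances filled = Filter.useFilter(data, rmv);

          System.out.println("before: " + data.attributeStats(occ).missingCount + " missing");
          System.out.println("after:  " + filled.attributeStats(occ).missingCount + " missing");
        }
      }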

    4. Dimensionality Reduction. Starting with the input dataset before sampling, before discretization, and before replacing missing values:

      Apply Principal Components Analysis to reduce the dimensionality of the input dataset. For this, use Weka's PrincipalComponents option from the "Select attributes" tab. Use parameter values: centerData=True, varianceCovered=0.95.

      1. (3 points) How many dimensions (= attributes) does the original dataset contain?
      2. (3 points) How many dimensions are obtained after PCA?
      3. (3 points) How much of the variance do they explain?
      4. (5 points) Include in your report the linear combinations that define the first two new attributes (= components) obtained.
      5. (6 points) Look at the results and elaborate on any interesting observations you can make about the results.
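
      The same analysis can be run through the Weka API; the sketch below uses the method names of Weka 3.7, which may differ slightly in other versions.

      import weka.attributeSelection.AttributeSelection;
      import weka.attributeSelection.PrincipalComponents;
      import weka.attributeSelection.Ranker;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;

      // Sketch: PCA with centerData=True and varianceCovered=0.95.
      public class RunPCA {
        public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("CensusIncome.arff");
          data.setClassIndex(data.attribute("salary").index());

          PrincipalComponents pca = new PrincipalComponents();
          pca.setCenterData(true);
          pca.setVarianceCovered(0.95);

          AttributeSelection sel = new AttributeSelection();
          sel.setEvaluator(pca);
          sel.setSearch(new Ranker());
          sel.SelectAttributes(data);      // note: capital S in this Weka method name
          System.out.println(sel.toResultsString());
        }
      }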

    5. Feature Selection. Starting with the input dataset before sampling, before discretization, before replacing missing values, and before dimensionality reduction:

      Apply Correlation-Based Feature Selection (see Witten and Frank's textbook slides, Chapter 7, Slides 5-6) to the input dataset. For this, use Weka's CfsSubsetEval, available under the Select attributes tab, with default parameters.

      1. (3 points) Include in your report which attributes were selected by this method.
      2. (5 points) Also, what can you observe about these selected attributes with respect to the covariance matrix and the correlation matrix you computed for part 2.4.1 above?
      3. (2 points) Were the attributes (if any) you chose to remove in part 2.4.2 above kept or removed by CfsSubsetEval?
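
      A sketch of the same selection through the Weka API, using CfsSubsetEval with its default BestFirst search (matching the Explorer's defaults):

      import weka.attributeSelection.AttributeSelection;
      import weka.attributeSelection.BestFirst;
      import weka.attributeSelection.CfsSubsetEval;
      import weka.core.Instances;
      import weka.core.converters.ConverterUtils.DataSource;

      // Sketch: correlation-based feature subset selection.
      public class RunCFS {
        public static void main(String[] args) throws Exception {
          Instances data = DataSource.read("CensusIncome.arff");
          data.setClassIndex(data.attribute("salary").index());

          AttributeSelection sel = new AttributeSelection();
          sel.setEvaluator(new CfsSubsetEval());
          sel.setSearch(new BestFirst());
          sel.SelectAttributes(data);
          System.out.println(sel.toResultsString());
        }
      }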

REPORTS AND DUE DATE

Hand in a hardcopy of your written report at the beginning of class on the day the project is due. We will discuss the results from the project during class, so be prepared to give an oral presentation.