CS 4445 B Term 2012

Computer Science Department

CS 4445 Data Mining and Knowledge Discovery in Databases - B Term 2012
Project 1: Data Exploration and Data Pre-processing

PROF. CAROLINA RUIZ

DUE DATE: Friday, Nov. 2, 2012 at the beginning of class (1:00 pm)
** This is an individual project **

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

The purpose of this project is multi-fold:

To gain experience with data exploration and data pre-processing.
To gain familiarity with the Weka system, its GUI, its code, and its input data format (arff).

PROJECT ASSIGNMENT

Readings:

Read Chapters 1, 2, and 3 of your textbook.
Read the manual that comes with the Weka system as needed.

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

Weka and Dataset.

Weka: Download and install the developer version of the Weka system as described in the Course Webpage. Determine the name/path of the directory created to store the Weka files (e.g., C:\Program Files\Weka-3-7\). We'll call that directory WekaDirectory in the remainder of this project description.
You can find the Weka code in a file called "weka-src.jar", which should be located in the directory where Weka was installed. This "weka-src.jar" file is a zip file. Hence you need to unzip it to extract its contents. Inside, you will find the .java files that implement Weka.
Read the "Explorer Guide" and the "Experimenter Tutorial" provided with the Weka system. Browse through the "Package Documentation" to become familiar with it.
When needed, use the following command to increase the amount of main memory available to Weka. The example below sets the maximum Java heap size to 768 MB; you can replace 768 with a larger value if more memory is needed and available:
java -Xmx768m -jar weka.jar

Dataset: In this project we will use a dataset that Ken Loomis collected (thanks, Ken!) from Pennsylvania's Department of Education 2011-2012 PSSA and AYP Results. This dataset describes Pennsylvania's standardized testing scores by school, grade level, and subject. Ken aggregated the Pennsylvania System of School Assessment (PSSA) results for Math, Reading, Science, and Writing. He also added the past 10 years' history of whether or not each school achieved Adequate Yearly Progress (AYP), as well as demographic and other information about the schools.
The following 2 files contain the dataset:

PA_School_Dataset_Description.txt: contains a description of the dataset, and
PA_School_Dataset.csv: contains the data instances.

Let's use the nominal AYPProceedingLevel2012 attribute as the classification target. This target attribute has the following possible values (= classes): MadeAYP, SchoolImprovement, CorrectiveAction, MakingProgress, and Warning.
Convert the dataset to the arff format. For this you can either use any tools provided by Weka, or you can make the conversion outside the Weka system using other tools (e.g., a text editor, Excel, etc.). Create a PA_School_Dataset.arff file with the converted dataset.
(5 points) Include in your report the header of your PA_School_Dataset.arff file together with the first 10 data instances of the dataset. (Do NOT include the full dataset, just the first 10 data instances.)
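If you choose to do the conversion with your own code, the following is a minimal Python sketch of one way to turn CSV text into arff text. It is only an illustration under stated assumptions, not the conversion Weka itself performs: the function names (`csv_to_arff`, `quote`, `is_number`) are hypothetical, empty cells and `?` are assumed to mark missing values, and a column is declared numeric only if all of its non-missing cells parse as numbers. The sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def quote(value):
    """Wrap a name or nominal label in single quotes if it contains special characters."""
    if value == "" or any(ch in value for ch in " ,'\"%{}\t"):
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return value


def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = attribute names) into arff text.

    Assumptions of this sketch: '' and '?' denote missing values; a column whose
    non-missing cells all parse as numbers (or that is entirely missing) is
    declared numeric; every other column is declared nominal.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    missing = {"", "?"}
    lines = ["@relation " + quote(relation), ""]
    numeric = []
    for col, name in enumerate(header):
        values = [row[col] for row in data if row[col] not in missing]
        col_is_numeric = all(is_number(v) for v in values)
        numeric.append(col_is_numeric)
        if col_is_numeric:
            lines.append("@attribute " + quote(name) + " numeric")
        else:
            labels = ",".join(quote(v) for v in sorted(set(values)))
            lines.append("@attribute " + quote(name) + " {" + labels + "}")
    lines += ["", "@data"]
    for row in data:
        cells = []
        for col, cell in enumerate(row):
            if cell in missing:
                cells.append("?")          # arff marker for a missing value
            elif numeric[col]:
                cells.append(cell)
            else:
                cells.append(quote(cell))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Made-up sample rows, for illustration only.
sample = (
    "SchoolType,PctAdvancedMath,AYPProceedingLevel2012\n"
    "TypeA,44.4,MadeAYP\n"
    "Type B,,Warning\n"
)
print(csv_to_arff(sample, "PA_School_Dataset"))
```

Whatever tool you use, check the resulting header by hand: automatic type guessing can declare an attribute numeric or nominal when you intended the opposite.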
Load this dataset into Weka by opening your arff dataset from the "Explorer" window in Weka. Increase the memory available to Weka as needed.
Data Exploration. See Chapter 3 of the textbook. Use the full dataset and Excel, Matlab, your own code, Weka or other software, to complete the following parts. Please state in your report which tool(s) from the above list you used for each part below.
1. (5 points) Are there any attributes that you would remove from the dataset beforehand? If so, which? Provide an explanation of why you would remove each of them.
2. For each of the following nominal attributes:
```
SchoolType,
AYPProceedingLevel2004, and
AYPProceedingLevel2012
```
  1. (5 points) Calculate the frequency of each value and the mode(s) of the attribute.
  2. (10 points) Provide a graphical depiction of the distribution of the target classification attribute (AYPProceedingLevel2012) for each of the values of the attribute under consideration (except for AYPProceedingLevel2012 itself).
3. For each of the following continuous attributes:
```
PctAdvancedMath
PctAdvancedReading
PctAdvancedScience
PctAdvancedWriting
```
  1. (10 points) Calculate the percentiles (in increments of 10, as in Table 3.2 of the textbook, page 101), mean, median, range, and variance of the attribute.
  2. (20 points) Plot a histogram of the attribute using 10 or 20 bins (you choose the best value for each attribute). For examples, see Figures 3.7 and 3.8 in the textbook, page 113.
  3. (10 points) For the PctAdvancedMath attribute only, plot a graph in which the X axis corresponds to the classification target (AYPProceedingLevel2012), the Y axis is the PctAdvancedMath attribute, and for each data instance in the dataset, there is a point (x,y) in the plot where x and y are respectively the AYPProceedingLevel2012 value and the PctAdvancedMath value of the data instance. For example, the plot will contain the point (MadeAYP, 44.4) which corresponds to the first data instance of the dataset.
4. For the set of all continuous (= numeric) attributes in the dataset:
  1. (20 points) Calculate the covariance matrix and the correlation matrix of these attributes. See notes on using Matlab and Excel to calculate these matrices.
  2. (5 points) If you had to remove 3 of these continuous attributes from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
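For those who decide to compute the two matrices of part 4 with their own code, the following Python sketch shows the underlying definitions. It is a simplified illustration, not the Matlab or Excel procedure from the course notes: the function names are hypothetical, the three toy columns are made up, the sample covariance (dividing by n - 1) is assumed, and the sketch assumes that instances with missing values have already been dropped or filled in.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Sample covariance of two equally long columns (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: the covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def matrix(columns, pair_fn):
    """Apply pair_fn to every pair of columns, giving a square symmetric matrix."""
    return [[pair_fn(a, b) for b in columns] for a in columns]


# Made-up toy columns: the second is twice the first, the third is the first reversed.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
for row in matrix(columns, covariance):
    print(["%.3f" % v for v in row])
for row in matrix(columns, correlation):
    print(["%.3f" % v for v in row])
```

The diagonal of the covariance matrix holds each attribute's variance, and the diagonal of the correlation matrix is always 1. Correlation entries are unit-free and lie between -1 and 1, which makes them easier to compare across attribute pairs than raw covariances.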
Data Preprocessing. See Chapter 2 of your textbook. Upload the dataset into Weka as described above. We'll refer to this dataset as the "input dataset" below.
1. Sampling.
  1. (5 points) Use Weka's unsupervised Resample filter to obtain a 50% subsample of the input dataset without replacement. Include in your report the distribution of the target attribute in the subsample.
  2. (5 points) Use Weka's supervised Resample filter to obtain a 50% subsample of the input dataset without replacement. Include in your report the distribution of the target attribute in the subsample.
  3. (5 points) Are the above two distributions different? Why is that?
2. Attribute Discretization. Starting with the input dataset (before sampling):
  1. (10 points) Use Weka's unsupervised Discretize filter to discretize the continuous attribute PctAdvancedMath of the input dataset into 10 bins using equal frequency (i.e., useEqualFrequency=True). Include the results in your report, as well as the distribution of the target attribute in each of the bins.
  2. (10 points) Use Weka's unsupervised Discretize filter to discretize the continuous attribute PctAdvancedMath of the input dataset into 10 bins using equal width (i.e., useEqualFrequency=False). Include the results in your report, as well as the distribution of the target attribute in each of the bins.
  3. (10 points) Use Weka's supervised Discretize filter to discretize the continuous attribute PctAdvancedMath of the input dataset with respect to the class attribute. Include the results in your report, as well as the distribution of the target attribute in each of the resulting bins.
  4. Weka Code. Find the Java code that implements the unsupervised discretization filter in the directories that contain the Weka files, following the instructions provided above.
    1. (5 points) Include the code implementing this filter in your report, and
    2. (10 points) Describe the algorithm followed by this code when doing unsupervised equal-frequency discretization (i.e., useEqualFrequency=True) in your own words.
3. Missing Values. Starting with the input dataset before sampling and before discretization:
  Use Weka's unsupervised ReplaceMissingValues filter to fill in the missing values in the attribute PctAdvancedMath.
  1. (10 points) Describe the Weka code implementing this filter in your report.
  2. (10 points) Compare the distribution of the original PctAdvancedMath attribute against the distribution of this attribute after replacing the missing values.
4. Dimensionality Reduction. Starting with the input dataset before sampling, before discretization, and before replacing missing values:
  Apply Principal Components Analysis to reduce the dimensionality of the input dataset. For this, use Weka's PrincipalComponents option from the "Select attributes" tab. Use parameter values: centerData=True, varianceCovered=0.95.
  1. (3 points) How many dimensions (= attributes) does the original dataset contain?
  2. (3 points) How many dimensions are obtained after PCA?
  3. (3 points) How much of the variance do they explain?
  4. (5 points) Include in your report the linear combinations that define the first two new attributes (= components) obtained.
  5. (6 points) Look at the results and elaborate on any interesting observations you can make about the results.
5. Feature Selection. Starting with the input dataset before sampling, before discretization, before replacing missing values, and before dimensionality reduction:
  Apply Correlation Based Feature Selection (see Witten and Frank's textbook slides - Chapter 7 Slides 5-6) to the input dataset. For this, use Weka's CfsSubsetEval available under the "Select attributes" tab with default parameters.
  1. (3 points) Include in your report which attributes were selected by this method.
  2. (5 points) Also, what can you observe about these selected attributes with respect to the covariance matrix and the correlation matrix you computed in part 4.1 of the Data Exploration section above?
  3. (2 points) Were the 3 attributes you chose to remove in part 4.2 of the Data Exploration section above kept or removed by CfsSubsetEval?
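As background for the Attribute Discretization part above, the following Python sketch contrasts equal-width and equal-frequency binning on a made-up list of values. It is a simplified illustration of the two general ideas only. It is not Weka's Discretize implementation, whose handling of ties, missing values, and cut points you still need to read in the Weka source. All function names are hypothetical, and the sketch assumes at least as many values as bins.

```python
def equal_width_edges(values, bins):
    """Cut points that split [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def equal_frequency_edges(values, bins):
    """Cut points chosen so each interval holds roughly len(values) / bins values.

    Simplification: each cut is the midpoint between the two sorted values that
    straddle the ideal split position; repeated values get no special treatment.
    """
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(1, bins):
        pos = i * n // bins
        edges.append((ordered[pos - 1] + ordered[pos]) / 2)
    return edges


def bin_counts(values, edges):
    """Count values per bin; a value equal to a cut point falls in the lower bin."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v > e)
        counts[idx] += 1
    return counts


# Made-up values with one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 100]
print(bin_counts(values, equal_width_edges(values, 4)))      # [7, 0, 0, 1]
print(bin_counts(values, equal_frequency_edges(values, 4)))  # [2, 2, 2, 2]
```

In this toy example the single outlier stretches the equal-width intervals so that almost all values land in the first bin, while equal-frequency binning keeps the bins balanced at the cost of intervals with very different widths.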

REPORTS AND DUE DATE

Hand in a hardcopy of your written report at the beginning of class on the day the project is due. We will discuss the results from the project during class, so be prepared to give an oral presentation.
