CS 4445 A Term 2008

Computer Science Department

CS 4445 Data Mining and Knowledge Discovery in Databases - A Term 2008
Project 0: Data Pre-processing, Mining, and Evaluation of Patterns

PROF. CAROLINA RUIZ

DUE DATE: Friday, Sept. 5 2008 at the beginning of class (1:00 pm)
This is an individual project

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

The purpose of this project is multi-fold:

To gain familiarity with the Weka system, its GUI, its code, and its input data format (arff).
To gain experience "pre-processing" datasets to clean, normalize, and discretize data attributes, and, when needed, reduce the dimensionality of the data.
To gain experience with the evaluation of the models/patterns constructed with a data mining technique.

PROJECT ASSIGNMENT

Readings: Read Chapters 9, 10, and 12 of your textbook to learn more about the Weka system.

Written Report: Your written report should consist of your answers to the items marked with an asterisk "*" in the assignment description below.

Assignment:

See Amro Khasawneh's Proj0 Solutions.
Download and install the developer version of the Weka system as described in the Course Webpage. Determine the name/path of the directory created to store the Weka files (e.g., C:\Program Files\Weka-3-5\). We'll call that directory WekaDirectory in the remainder of this project description.
Part I: The Iris Dataset.
In the first part of this project you'll get familiar and work with one of the sample datasets provided with the Weka system: The Iris dataset.
1. Looking at the dataset: Below we use three alternative ways of looking at a dataset:
  1. Using Weka's ArffViewer:
    1. Run the Weka system.
    2. Select "Tools" and then "ArffViewer".
    3. Go to "File" → "Open" → "data" → "iris.arff" and open this file.
  2. Using a text editor to access the dataset file directly:
    1. Separately, use a text editor (e.g., WordPad, emacs, ...) to open the file where the Iris dataset is stored. If WekaDirectory is the name of the directory where the Weka files reside, then the Iris dataset is stored at: WekaDirectory/data/iris.arff
    2. * Include the header of this dataset in your report (just the header, NOT the data).
    3. Compare this file with the display of this dataset in the ArffViewer above.
  3. Using Weka's Explorer:
    1. From the main Weka window, select "Applications" and then "Explorer".
    2. Go to "Preprocess" → "Open file" → "data" → "iris.arff" and open this file.
    3. * Record in your report the minimum and maximum value for each numeric attribute in the dataset (you can find this info in the Explorer display).
2. Analyzing the raw dataset
  1. From Weka's Explorer window, select "Classify".
  2. Click on "Choose", select "ZeroR" and click on "Start".
  3. * Record in your report:
    1. The class predicted by ZeroR. Also, provide a clear and concise explanation of why that's the class chosen by ZeroR.
    2. The accuracy of this ZeroR prediction (that is, the percentage of correctly classified instances).
    3. The confusion matrix. Give a brief description of why this confusion matrix contains non-zero values in just one column.
  4. Go back to the "Classify" window of Weka's Explorer. Click on "Choose", select "OneR" and click on "Start".
  5. * Record in your report:
    1. The rule output by OneR.
    2. The accuracy of OneR's prediction (that is, the percentage of correctly classified instances).
    3. The confusion matrix of OneR's prediction.
    4. Select and Right-click on "rules.OneR" under the "Result list" part of the Classify window. Select "Visualize classifier errors". Play with different selections of "X" and "Y" axes for the display until you find a display that you can understand and explain. Include this display together with your explanation of it in your report.
3. Preprocessing the dataset before analyzing it
  1. Discretize the numeric attributes in this dataset:
    1. On the Preprocess tab of Weka's Explorer window, go to "Filter" and click "Choose". Select "filters" → "supervised" → "attribute" → "Discretize".
    2. Back on the Filter part of the Preprocess tab of Weka's Explorer window, click on "Apply".
    3. * Using the attributes' information displayed on the Explorer window, include in your report the values of each of the attributes in the filtered dataset and how many data instances contain each value.
  2. * Using this discretized dataset, run ZeroR again. Record in your report the class predicted by ZeroR, the accuracy of this prediction, and the confusion matrix. Are there any differences between the results for ZeroR over the discretized data and over the raw data? Explain why there are or are not differences.
  3. * Using this discretized dataset, run OneR again. Record in your report the rule output by OneR, the accuracy of its prediction, and the confusion matrix. Are there any differences between the results for OneR over the discretized data and over the raw data? Explain why there are or are not differences.
4. Weka Code. * Find the Java code that implements the discretization filter in the directories that contain the Weka files. Include the code implementing this filter in your report, and describe the algorithm followed by this code in your own words.
  You can find the Weka code in a file called "weka-src.jar", which should be located in the directory where Weka was installed. This "weka-src.jar" file is a zip file, so you need to extract its contents (e.g., with WinZip or unzip). Inside, you will find the .java files that implement Weka.
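To make the two baseline classifiers used in Part I more concrete, here is a minimal Python sketch of the ideas behind ZeroR and OneR. It is not Weka's code: the tiny dataset, the function names, and the restriction to nominal attributes are all assumptions made for illustration, and errors are counted on the training data rather than with the evaluation options the Explorer offers.

```python
from collections import Counter, defaultdict

def zero_r(labels):
    """ZeroR idea: ignore all attributes and predict the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

def one_r(rows, labels):
    """OneR idea: for each attribute, map every value to its majority class,
    then keep the attribute whose rule makes the fewest training errors."""
    best = None
    for col in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[col]][label] += 1
        rule = {value: counts.most_common(1)[0][0]
                for value, counts in by_value.items()}
        errors = sum(label != rule[row[col]] for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (col, rule, errors)
    return best  # (attribute index, value -> class rule, training errors)

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Hypothetical toy data (NOT the Iris dataset): columns are (outlook, windy).
ROWS = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes"),
        ("overcast", "no"), ("overcast", "yes"), ("overcast", "no")]
LABELS = ["no", "no", "yes", "no", "yes", "yes", "yes"]

majority = zero_r(LABELS)
zero_r_matrix = confusion_matrix(LABELS, [majority] * len(LABELS), ["yes", "no"])
best_col, best_rule, best_errors = one_r(ROWS, LABELS)
```

On this toy data, `zero_r_matrix` has non-zero counts only in the column of the majority class, and `one_r` selects the first attribute because its rule misclassifies fewer training rows than the rule built on the second attribute.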
Part II: The Census-Income Dataset.
In the second part of this project you'll get familiar and work with a dataset NOT provided with the Weka system: The census-income dataset.
The census-income dataset comes from the US Census Bureau and is available at the Univ. of California Irvine (UCI) Data Repository.
The census-income dataset contains census information for 48,842 people. It has 14 attributes for each person (age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country) and a boolean attribute, class, classifying the income of the person as belonging to one of two categories: >50K or <=50K.
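As a rough illustration of what an arff file for data like this looks like, the Python sketch below builds the header and data section for a three-attribute excerpt. The helper names, the two data rows, and the choice of which attributes to show are assumptions made for the example; the nominal value lists for the real dataset must be taken from the actual data.

```python
import csv
import io

def quote(token):
    """Quote an ARFF token if it contains spaces or other delimiter characters."""
    if any(ch in token for ch in " ,{}%'\""):
        return "'" + token.replace("'", "\\'") + "'"
    return token

def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind), where kind is the string "numeric"
    or a list of the allowed nominal values.
    """
    lines = ["@relation " + quote(relation), ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append("@attribute " + quote(name) + " numeric")
        else:
            values = ",".join(quote(v) for v in kind)
            lines.append("@attribute " + quote(name) + " {" + values + "}")
    lines += ["", "@data"]
    for row in rows:
        # Strip the blanks that often follow commas in raw comma-separated files.
        lines.append(",".join(quote(cell.strip()) for cell in row))
    return "\n".join(lines) + "\n"

# Hypothetical excerpt: three of the attributes and two made-up instances.
ATTRIBUTES = [("age", "numeric"),
              ("sex", ["Female", "Male"]),
              ("class", [">50K", "<=50K"])]
RAW_CSV = "39, Male, <=50K\n52, Female, >50K\n"
ROWS = list(csv.reader(io.StringIO(RAW_CSV)))
ARFF_TEXT = to_arff("census-income", ATTRIBUTES, ROWS)
print(ARFF_TEXT)
```

The header (the @relation and @attribute lines) declares each attribute's name and type, and every line after @data is one instance with values in the same order as the declarations.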
1. Convert the census-income data to the arff format. For this you can either use any tools provided by Weka, or you can make the conversion outside the Weka system using other tools (e.g., a text editor, Excel, etc.). Create a census-income.arff file with the converted dataset.
2. * Include in your report the header of your census-income.arff file together with the 10 first data instances of the dataset. (Do NOT include the full dataset - Just the first 10 data instances.)
3. Load this dataset into Weka by opening your census-income.arff dataset from the "Explorer" window in Weka. The full dataset should fit in main memory, but if it doesn't, remove as many instances as you need (but not more!) from your census-income.arff file (do this outside Weka using, for instance, a text editor or Excel).
4. * Run the "ReplaceMissingValues" filter (available at Filters → Unsupervised → Attribute) over this dataset. Explain clearly in your report what this filter does to the dataset.
5. * Run the "Resample" filter (available at Filters → Unsupervised → Instance). This filter selects a random sample from the data instances in the current dataset.
  1. Click on "More" to learn more about this filter.
  2. Use, say, "20.0" in the "SampleSizePercent" to select a subsample of 20% of the current instances.
  3. Click "OK".
  4. Click "Apply" on the main Weka Window.
  Describe in your report the meaning of each of the parameters of this Resample filter.
6. Apply ZeroR and OneR to this dataset. (No need to include your results in the written report.)
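The two preprocessing steps used in Part II can also be sketched outside Weka. The Python below is an illustration under stated assumptions, not a description of Weka's implementation: it assumes missing values are marked with "?" (the arff convention), fills numeric gaps with the column mean and nominal gaps with the most frequent value, and draws the subsample with replacement using a fixed seed. The function names and sample values are hypothetical; consult the filters' "More" text for their actual parameters and behavior.

```python
import random
from collections import Counter
from statistics import mean

MISSING = "?"  # arff marker for a missing value

def replace_missing(column, numeric):
    """Fill missing entries with the mean (numeric) or the mode (nominal)
    of the values that are present in the column."""
    present = [v for v in column if v != MISSING]
    if numeric:
        fill = mean(float(v) for v in present)
        return [fill if v == MISSING else float(v) for v in column]
    fill = Counter(present).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in column]

def resample(rows, percent, seed=1):
    """Draw a random subsample holding `percent` % of the rows.
    Assumption: sampling is done with replacement, so rows may repeat."""
    rng = random.Random(seed)
    size = round(len(rows) * percent / 100.0)
    return [rng.choice(rows) for _ in range(size)]

# Hypothetical columns and rows for illustration only.
ages = replace_missing(["20", "?", "40"], numeric=True)
workclass = replace_missing(["Private", "?", "Private", "State-gov"], numeric=False)
subsample = resample(list(range(50)), 20.0)
```

With a fixed seed the same subsample is drawn on every run, which makes results reproducible; changing the seed gives a different random sample of the same size.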

REPORTS AND DUE DATE

Hand in a hardcopy of your written report at the beginning of class the day the project is due. We will discuss the results from the project during class.
