WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

CS 548 KNOWLEDGE DISCOVERY AND DATA MINING - Fall 2018  
Project 1: Data Pre-processing

PROF. CAROLINA RUIZ 

Due Date: Sept. 13, 2018.
------------------------------------------

Instructions


Problem I. Knowledge Discovery in Databases (20 points)

  1. (5 points) Define knowledge discovery in databases.

  2. (10 points) Briefly describe the steps of the knowledge discovery in databases process.

  3. (5 points) Define data mining.
Base your answers on the definitions presented in class, the textbook, and the following paper: Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, 17(3), pp. 37-54, Fall 1996. However, your answers must be written in your own words.

Problem II. Data Preprocessing (60 points)

Consider the following small subset of the Adult Dataset available at the UCI Machine Learning Data Repository. See the link above for a description of this dataset.
  Attributes SEX and CLASS are discrete;
  Attributes AGE, EDUCATION_NUM and HOURS_PER_WEEK are continuous.

    SEX		AGE	EDUCATION_NUM	HOURS_PER_WEEK	CLASS 
    Male	27	9		40		<=50K
    Female	28	13		40		<=50K
    Male	29	10		50		<=50K
    Male	30	9		40		<=50K
    Male	35	11		40		<=50K
    Female	36	9		40		<=50K
    Female	37	14		40		<=50K
    Male	38	9		?		<=50K
    Male	40	16		60		>50K
    Female	44	14		40		<=50K
    Male	45	14		40		>50K
    Female	47	14		50		<=50K
    Male	48	9		46		<=50K
    Male	49	11		40		>50K
    Male	49	9		40		>50K
    Female	49	9		40		<=50K
    Male	50	13		55		>50K
    Male	52	9		45		>50K
    Male	52	13		40		<=50K
    Male	54	10		60		>50K

  1. (5 points) Assuming that the missing value (marked with "?") in HOURS_PER_WEEK cannot be ignored, discuss 3 different alternatives to filling in that missing value. In each case, state what the selected value would be and the advantages and disadvantages of the approach. You may assume that the CLASS attribute is the target attribute.

  2. (5 points) Describe a reasonable transformation of the attribute EDUCATION_NUM so that the number of different values for that attribute is reduced to just 3. [First investigate the meaning of this attribute in the dataset webpage provided above.]

  3. (5 points) Discretize the AGE attribute by binning it into 4 equi-width intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.
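A by-hand equi-width result can be verified with a short script. A sketch, using bin width (max - min) / 4 = 6.75; the convention that the maximum value falls in the last bin is an assumption you should state explicitly in your answer:

```python
from collections import Counter

ages = [27, 28, 29, 30, 35, 36, 37, 38, 40, 44, 45, 47, 48, 49,
        49, 49, 50, 52, 52, 54]

lo, hi = min(ages), max(ages)
width = (hi - lo) / 4                       # (54 - 27) / 4 = 6.75
edges = [lo + i * width for i in range(5)]  # [27, 33.75, 40.5, 47.25, 54]

# bin index per age; the maximum (54) is placed in the last bin
bins = [min(int((a - lo) // width), 3) for a in ages]
print(edges, Counter(bins))  # bin counts: 4, 5, 3, 8
```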

  4. (5 points) Discretize the AGE attribute by binning it into 4 equi-depth (= equal-frequency) intervals using unsupervised discretization. Perform this discretization by hand (i.e., do not use Weka). Explain your answer.
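A sketch for cross-checking the equi-depth result (20 values, so 5 per bin). Note that the value 49 appears on both sides of the third boundary, so by hand you must decide, and justify, which side the tied values go:

```python
ages = sorted([27, 28, 29, 30, 35, 36, 37, 38, 40, 44, 45, 47, 48,
               49, 49, 49, 50, 52, 52, 54])

depth = len(ages) // 4                                  # 5 values per bin
chunks = [ages[i*depth:(i+1)*depth] for i in range(4)]
# candidate cut points halfway between adjacent chunks
cuts = [(chunks[i][-1] + chunks[i+1][0]) / 2 for i in range(3)]
print(cuts)  # [35.5, 44.5, 49.0]; the last cut lands on the tied value 49
```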

  5. (10 points) Consider the following new approach to discretizing a numeric attribute: Given the mean and the standard deviation (sd) of the attribute values, bin the attribute values into the following intervals:
     [mean - (k+1)*sd, mean - k*sd)   
     for all integer values k, i.e. k = ..., -4, -3, -2, -1, 0, 1, 2, ...
    
    Assume that the mean of the attribute AGE above is 42 and that the standard deviation sd of this attribute is 8. Discretize AGE by hand using this new approach. Show your work.
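The intervals can be enumerated mechanically. A sketch: with the given mean = 42 and sd = 8, the only boundaries that matter for ages 27-54 are 26, 34, 42, 50, and 58:

```python
import math
from collections import Counter

mean_age, sd = 42, 8   # given in the problem statement
ages = [27, 28, 29, 30, 35, 36, 37, 38, 40, 44, 45, 47, 48, 49,
        49, 49, 50, 52, 52, 54]

# age a falls in [mean + j*sd, mean + (j+1)*sd), which is the problem's
# interval [mean - (k+1)*sd, mean - k*sd) with k = -(j+1)
j = [math.floor((a - mean_age) / sd) for a in ages]
counts = Counter(j)
print(sorted(counts.items()))
# [(-2, 4), (-1, 5), (0, 7), (1, 4)]
# i.e. [26,34): 4 ages, [34,42): 5, [42,50): 7, [50,58): 4
```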

  6. (30 points) Use the supervised discretization filter in Weka (with useKononenko=False and useBetterEncoding=True, and default values for the other parameters) to discretize the HOURS_PER_WEEK attribute. Describe the resulting intervals. Find the Java code that implements this filter in the directories that contain the Weka files. (See the instructions for locating Weka's source code at the beginning of this project assignment.) Read the code carefully and describe, in your own words, the algorithm it follows in your written report.
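Before reading Weka's source, it helps to have the underlying method in mind: the supervised Discretize filter is based on Fayyad and Irani's entropy/MDL discretization, which recursively picks the cut point minimizing the class entropy and stops when the information gain fails an MDL test. A minimal sketch of that idea follows; it is not a transcription of Weka's code, and the useBetterEncoding option changes the stopping term, so Weka's intervals may differ from this sketch's:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_splits(values, labels):
    """Recursive entropy-based splitting with the Fayyad-Irani MDL stop test."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([l for _, l in pairs])
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # cuts only between distinct values
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or e < best[0]:
            best = (e, i, left, right)
    if best is None:
        return []
    e, i, left, right = best
    gain = base - e
    k, k1, k2 = len(set(left + right)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * base - k1 * entropy(left)
                                     - k2 * entropy(right))
    if gain <= (math.log2(n - 1) + delta) / n:   # MDL: split not worth encoding
        return []
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    vals = [v for v, _ in pairs]
    return mdl_splits(vals[:i], left) + [cut] + mdl_splits(vals[i:], right)

# the 19 known HOURS_PER_WEEK values from the small table, with their classes
hours = [40, 40, 50, 40, 40, 40, 40, 60, 40, 40, 50, 46, 40, 40, 40, 55, 45, 40, 60]
labels = ["<=50K"] * 7 + [">50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K",
                          ">50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]
print(mdl_splits(hours, labels))  # []: on this tiny subset the MDL test rejects every cut
```

On the small 20-row table, the best candidate cut (52.5) is rejected by the MDL test, leaving a single interval; the full assignment dataset, and Weka's own encoding choices, can behave differently, which is exactly what your report should investigate.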

Problem III. Feature Selection (60 points)

Consider the weather.nominal.arff dataset that comes with the Weka system. In this problem you will explain how Correlation-based Feature Selection (CFS) works on this dataset. (See Witten's and Frank's textbook slides - Chapter 7, Slides 5-6, and also Mark A. Hall's PhD thesis.) See Section 2.4.6 of the Tan, Steinbach, Karpatne and Kumar textbook for the definition and formulas for Mutual Information.
  1. (5 points) Apply Weka's CfsSubsetEval (available under the Select attributes tab) to this dataset (using BestFirst as the search method, with default parameters) to determine what attributes are selected. Include the results in your project solutions.
  2. Looking at the code that implements CfsSubsetEval, as well as its description in the textbook and in class, describe in detail the process that it follows:
    1. (5 points) What's the initial (sub)set of attributes under consideration? Is forward or backward search used?
    2. (25 points) Using the lattice of attribute subsets below, show step by step the process that the algorithm follows (i.e., show the search process in detail). For this, add print statements to the Weka code so that it tells you the order in which the subsets are considered and the goodness value of each subset. Explain your answer.
    3. (25 points) Use the CfsSubsetEval formulas to calculate the goodness of the "best" (sub)set of attributes considered. Show your work.

      weather_data_attribute_latice.gif

      Taken from Witten's and Frank's textbook slides - Chapter 7.
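The quantities needed in parts 2-3 can be prototyped outside Weka before you instrument the Java code. A sketch of the CFS merit computation, assuming symmetric uncertainty as the correlation measure (the measure Hall's CFS uses for nominal data); the weather.nominal values are typed in directly:

```python
import math
from collections import Counter

def H(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def H_cond(xs, ys):
    # conditional entropy H(X | Y)
    n = len(ys)
    return sum((cnt / n) * H([x for x, y2 in zip(xs, ys) if y2 == y])
               for y, cnt in Counter(ys).items())

def su(xs, ys):
    # symmetric uncertainty: 2 * gain / (H(X) + H(Y)), a value in [0, 1]
    hx, hy = H(xs), H(ys)
    return 0.0 if hx + hy == 0 else 2 * (hx - H_cond(xs, ys)) / (hx + hy)

def merit(subset, data, target="play"):
    # CFS: k * avg(feature-class corr) / sqrt(k + k*(k-1)*avg(feature-feature corr))
    k = len(subset)
    rcf = sum(su(data[f], data[target]) for f in subset) / k
    if k == 1:
        return rcf
    pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
    rff = sum(su(data[f], data[g]) for f, g in pairs) / len(pairs)
    return k * rcf / math.sqrt(k + k * (k - 1) * rff)

data = {  # weather.nominal.arff, 14 instances
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                    "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "humidity": ["high", "high", "high", "high", "normal", "normal", "normal",
                 "high", "normal", "normal", "normal", "high", "normal", "high"],
    "windy": ["FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE",
              "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"],
    "play": ["no", "no", "yes", "yes", "yes", "no", "yes",
             "no", "yes", "yes", "yes", "yes", "yes", "no"],
}

for attr in ["outlook", "temperature", "humidity", "windy"]:
    print(attr, round(merit([attr], data), 3))
```

The forward best-first search in part 1 repeatedly evaluates merit(subset plus one more attribute) for each unused attribute, which is exactly the quantity this sketch computes, so it is a useful sanity check against the goodness values your print statements report.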


Problem IV. Exploring Real Data (65 points)

Consider this given subset of the Adult Dataset, extracted from the full Adult Dataset available at the UCI Machine Learning Repository.

Load this dataset into Weka by opening your arff dataset from the "Explorer" window in Weka. Load it into Python as well.

  1. Dataset Exploration. (40 points) Use Python, your own code, or Weka to complete the following parts. Please state in your report which tool from the above list you used for each part.

    1. (5 points) Start by familiarizing yourself with the dataset. Carefully look at the data directly (for this, use Excel or a file editor, as well as Weka's and Python's functionality to explore and visualize the data). Describe in your report your observations about what is good about this data (mention at least 2 different good things) and what is problematic about it (mention at least 2 different bad things). If appropriate, include visualizations of those good/bad things.

    2. For the AGE attribute:
      1. (5 points) Calculate the quartiles, mean, median, range, and variance of this attribute.
      2. (5 points) Plot a histogram of this attribute using 10 bins.

    3. In this part, use only the following attributes in the dataset: AGE, EDUCATION-NUM, RACE, SEX, CAPITAL-GAIN, CAPITAL-LOSS, HOURS-PER-WEEK, and CLASS. For these attributes calculate:
      1. (10 points) the covariance matrix and
      2. (10 points) the correlation matrix of these attributes.
        Construct a visualization of each of these matrices (e.g., heatmap) using Python to more easily understand them.
        See Section 2.4.5 of the Tan, Steinbach, Karpatne and Kumar's textbook for the definitions and formulas for correlation and covariance.
      3. (5 points) If you had to remove 2 of the attributes above from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
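All of the statistics in part 1 are a few NumPy calls. A sketch using stand-in data (the 19 complete rows of the small Problem II table, with attributes AGE, EDUCATION_NUM, HOURS_PER_WEEK); for the real assignment, load the provided adult subset instead, e.g. with pandas.read_csv, and use the full attribute list:

```python
import numpy as np

# Stand-in: the 19 complete rows of the Problem II table
rows = np.array([
    [27, 9, 40], [28, 13, 40], [29, 10, 50], [30, 9, 40], [35, 11, 40],
    [36, 9, 40], [37, 14, 40], [40, 16, 60], [44, 14, 40], [45, 14, 40],
    [47, 14, 50], [48, 9, 46], [49, 11, 40], [49, 9, 40], [49, 9, 40],
    [50, 13, 55], [52, 9, 45], [52, 13, 40], [54, 10, 60],
], dtype=float)

age = rows[:, 0]
q1, q2, q3 = np.percentile(age, [25, 50, 75])       # quartiles
stats = {"mean": age.mean(), "median": np.median(age),
         "range": age.max() - age.min(), "variance": age.var(ddof=1)}
counts, edges = np.histogram(age, bins=10)           # 10-bin histogram

cov = np.cov(rows, rowvar=False)        # covariance matrix (attributes as columns)
corr = np.corrcoef(rows, rowvar=False)  # correlation matrix
print(stats, counts, sep="\n")
```

A heatmap of corr (e.g. matplotlib.pyplot.matshow(corr)) makes the pairwise relationships easier to scan; attribute pairs with near-duplicate rows in the correlation matrix are natural candidates for the removal question in part 3.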

  2. Dimensionality Reduction.

    You must perform each of the parts of this problem both in Weka and separately in Python.

    1. (10 points) For this part, USE ONLY THE CONTINUOUS (denoted as "numeric") attributes in the dataset. Apply Principal Components Analysis in Weka and separately in Python to reduce the dimensionality of the full dataset. In Weka, use the PrincipalComponents option from the "Select attributes" tab. Use parameter values: centerData=True, varianceCovered=0.99. How many dimensions (= attributes) does the original dataset contain? How many dimensions are obtained after PCA? How much of the variance do they explain? Include in your report the linear combination that defines the first new attribute (= component) obtained. Look at the results and elaborate on any interesting observations you can make about them.
    2. (5 points) Repeat the PCA experiments above but adding now the MARITAL-STATUS attribute to the dataset (that is, all continuous attributes and MARITAL-STATUS). Explain in your report any changes in the results. Describe also how the MARITAL-STATUS attribute was transformed from discrete to continuous so that PCA could handle it.
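The Python side of these PCA experiments can be done with sklearn.decomposition.PCA or, equivalently, a few lines of NumPy that mirror what centerData=True and varianceCovered=0.99 mean. A sketch on the same stand-in data as above (substitute the assignment's adult subset in practice):

```python
import numpy as np

# Stand-in data: continuous columns (AGE, EDUCATION_NUM, HOURS_PER_WEEK)
# of the 19 complete Problem II rows
X = np.array([
    [27, 9, 40], [28, 13, 40], [29, 10, 50], [30, 9, 40], [35, 11, 40],
    [36, 9, 40], [37, 14, 40], [40, 16, 60], [44, 14, 40], [45, 14, 40],
    [47, 14, 50], [48, 9, 46], [49, 11, 40], [49, 9, 40], [49, 9, 40],
    [50, 13, 55], [52, 9, 45], [52, 13, 40], [54, 10, 60],
], dtype=float)

Xc = X - X.mean(axis=0)                   # centerData=True
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]         # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()       # fraction of variance per component
# smallest number of components covering 99% of variance (varianceCovered=0.99)
k = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)
scores = Xc @ eigvecs[:, :k]              # the transformed dataset
print(k, np.round(explained, 3), np.round(eigvecs[:, 0], 3))
```

The column eigvecs[:, 0] is the linear combination of the original attributes that defines the first component; its entries are the coefficients your report should list (up to an arbitrary overall sign).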

  3. Feature Selection. (10 points)

    You must perform each of the parts of this problem both in Weka and separately in Python.

    For this part, USE ONLY THE DISCRETE attributes in the dataset. Use the CLASS attribute as the target classification attribute. Apply Correlation-based Feature Selection (CFS) (see Witten's and Frank's textbook slides - Chapter 7, Slides 5-6). For this, use Weka's CfsSubsetEval available under the Select attributes tab with default parameters. Separately, use Python for the same purpose. Look at the results to determine which attributes were selected by this method and elaborate on any interesting observations you can make about the results.


ORAL AND WRITTEN REPORTS AND DUE DATE