BCB4003/503 CS4803/583 A Term / Fall 2013

BCB4003/503 CS4803/583 Biological and Biomedical Database Mining
Prof. Carolina Ruiz
Problem Set 1 - A term / Fall 2013
Data Exploration, Data Preprocessing, and Feature Selection

DUE DATE: Friday, Sept. 6, 2013. Slides (by email) by 12 noon; Written Report (hardcopy) at the beginning of class (1:00 pm)
** This is an individual problem set **

Problem Set Description
Problem Set Assignment
Report Submission and Due Date

PROBLEM SET DESCRIPTION

The purpose of this problem set is multi-fold:

To gain experience with data acquisition, data exploration, and data pre-processing.
To gain familiarity with the Weka system, its GUI, its code, and its input data format (arff).
To gain familiarity with Matlab.

PROBLEM SET ASSIGNMENT

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

Weka:

Download and install the developer version of the Weka system as described in the Course Webpage. Determine the name/path of the directory created to store the Weka files (e.g., C:\Program Files\Weka-3-7\). We'll call that directory WekaDirectory in the remainder of this problem set description.
You can find the Weka code in a file called "weka-src.jar", which should be located in the directory where Weka was installed. You might need to unzip this file, or use jar utilities on it, to extract its contents. Inside, you will find the .java files that implement Weka.

Consult the "README" file, the "documentation" webpage, and the "WekaManual" provided with the Weka system (in the same directory where Weka was downloaded). Browse through the "Package Documentation" to become familiar with it.

When needed, use the following command to increase the maximum amount of main memory (Java heap) available to Weka. This example sets the limit to 768 MB, but you can specify another size instead of 768 if more memory is needed and available:
java -Xmx768m -jar weka.jar
Matlab:
Access Matlab from the CCC as described in the Course Webpage.
Dataset:
1. The dataset for this problem set is GSE7390_transbig2006affy_demo.txt (this is a local copy of the dataset).
  This dataset is part of the NCBI's GSE7390 Data Set. This dataset contains information for 198 untreated patients of the TRANSBIG validation study. See the README.txt file that describes the dataset. (To simplify the data downloading process, you can find the same files described above in a local copy of the dataset.)
2. Remove the following dataset attributes from consideration: "samplename", "id", "geo_accn", "filename", and "hospital".
3. (10 points) Convert the dataset to the .arff format for Weka. For this you can either use any tools provided by Weka, or you can make the conversion outside the Weka system using other tools (e.g., Excel, your own script, etc.). Include in your report the file header defining the attributes and the first 10 data instances of the dataset in your .arff file. (Do NOT include the full dataset in your written report - Just the header and the first 10 data instances.) However, use the entire dataset in the remaining parts of this problem set.
  Make sure to use "?" in the .arff file to represent any missing values in the dataset.
4. (10 points) Convert the dataset to the .dat format for Matlab (or if you prefer, any other format that you can load into Matlab). For this you may want to convert the Boolean attributes to numeric by replacing "GOOD" with 1, and "POOR" with 0. Include in your report the file header and the first 10 data instances of the dataset in your .dat file. (Do NOT include the full dataset in your written report - Just the header and the first 10 data instances.) However, use the entire dataset in the remaining parts of this problem set.
  Make sure to use the appropriate Matlab convention to represent any missing values in the dataset.
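As an illustration of the "your own script" option for the .arff conversion, here is a minimal Python sketch. It is only one possible approach, not a required one. It assumes the input is tab-separated text with a header row and that missing values appear as empty cells or "NA"; check both assumptions against the actual file. The function name to_arff, its parameters, and the sample rows are hypothetical and made up for illustration.

```python
import csv
import io

def to_arff(relation, rows, nominal, drop=(), missing_tokens=("", "NA")):
    """Render a list of dict rows as ARFF text.

    nominal maps an attribute name to its list of allowed values;
    every other kept attribute is declared numeric.
    Attributes listed in drop are left out entirely.
    """
    names = [n for n in rows[0] if n not in drop]
    lines = ["@relation " + relation, ""]
    for name in names:
        if name in nominal:
            lines.append("@attribute %s {%s}" % (name, ",".join(nominal[name])))
        else:
            lines.append("@attribute %s numeric" % name)
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        cells = ["?" if row[n].strip() in missing_tokens else row[n].strip()
                 for n in names]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"

# Made-up sample rows (assumed tab-separated); not taken from the real dataset.
sample = "id\tage\tsize\trisk\n1\t57\t3.0\tGOOD\n2\t46\tNA\tPOOR\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
arff_text = to_arff("demo", rows, nominal={"risk": ["GOOD", "POOR"]}, drop=("id",))
print(arff_text)
```

For the real file, the nominal mapping and the drop list would have to be filled in from the attribute descriptions in README.txt.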
Data Exploration.
(30 points) Use Excel, Matlab, your own code, Weka, R, or other software to explore the dataset. That is, become familiar with the different attributes of the dataset, their distributions, and any salient characteristics of the dataset.
1. (10 points) Include in your report any interesting observations and visualizations that you obtain during this exploration. State in your report which tool(s) from the above list you used for each of these observations and visualizations.
2. (15 points) Calculate both the covariance matrix and the correlation matrix of the numeric attributes. See notes on using Matlab and Excel to calculate these matrices. Include these two matrices in your report. Try to construct a visualization of each of these matrices (e.g., heatmap) to more easily understand them.
3. (5 points) If you had to remove 3 of these continuous attributes from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
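For reference, the two matrices asked for above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the course's prescribed method. It assumes complete numeric columns (no missing values) and uses the sample covariance with an n - 1 denominator; other tools may normalize differently, so compare definitions before comparing numbers. The columns x, y, and z are made up.

```python
import math

def covariance_matrix(columns):
    """Sample covariance (n - 1 denominator) between every pair of columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
             for cb, mb in zip(columns, means)]
            for ca, ma in zip(columns, means)]

def correlation_matrix(columns):
    """Pearson correlation: each covariance divided by both standard deviations."""
    cov = covariance_matrix(columns)
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Made-up columns, chosen so the relationships are obvious.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # exactly 2 * x
z = [4.0, 3.0, 2.0, 1.0]   # decreases as x increases
cov = covariance_matrix([x, y, z])
corr = correlation_matrix([x, y, z])
for row in corr:
    print(["%.2f" % v for v in row])
```

Covariance depends on the units of each attribute, while correlation is scaled to the range from -1 to 1, which is why the two matrices can suggest different things about the same pair of attributes.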
Data Preprocessing.
(50 points) Create a second version of the .arff file containing the same dataset but with nominal (rather than numeric) values for the following attributes: "Histtype", "Angioinv", "Lymp_infil", and "grade".
For the remainder of this problem set, assume that "veridex_risk" is the target attribute.
1. Sampling.
  1. (5 points) Use Weka's unsupervised Resample filter to obtain a 50% subsample of the input dataset without replacement. Include in your report the distribution of the target attribute (that is, the percentage of instances with "veridex_risk"=Good, and the percentage of instances with "veridex_risk"=Poor) in the subsample.
  2. (5 points) Use Weka's supervised Resample filter to obtain a 50% subsample of the input dataset without replacement. Include in your report the distribution of the target attribute in the subsample.
  3. (5 points) Are the above two distributions different? Why is that?
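To make the idea behind the two kinds of sampling concrete, the following Python sketch contrasts sampling that ignores the class with sampling done separately within each class. It is not Weka's implementation, and how Weka's filters behave depends on their parameter settings. The function names and the 60/40 label list are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def simple_subsample(labels, fraction, rng):
    """Pick indices uniformly at random, without replacement, ignoring the class."""
    k = round(len(labels) * fraction)
    return rng.sample(range(len(labels)), k)

def stratified_subsample(labels, fraction, rng):
    """Sample each class separately so class proportions are roughly preserved."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    chosen = []
    for indices in by_class.values():
        chosen += rng.sample(indices, round(len(indices) * fraction))
    return chosen

def class_percentages(labels, indices):
    """Percentage of the selected instances that fall in each class."""
    counts = Counter(labels[i] for i in indices)
    return {c: 100.0 * n / len(indices) for c, n in counts.items()}

labels = ["Good"] * 60 + ["Poor"] * 40   # made-up class labels
rng = random.Random(0)
simple = simple_subsample(labels, 0.5, rng)
strat = stratified_subsample(labels, 0.5, rng)
print("simple:    ", class_percentages(labels, simple))
print("stratified:", class_percentages(labels, strat))
```

In this sketch the stratified subsample reproduces the 60/40 split by construction, while the simple subsample may drift from it by chance.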
2. Attribute Discretization. Starting with the input dataset (before sampling):
  1. (5 points) Use Weka's unsupervised Discretize filter to discretize the continuous attribute "age" of the input dataset into 4 bins using equal frequency (i.e., useEqualFrequency=True). Include the results in your report, as well as the distribution of the target attribute in each of the bins.
  2. (5 points) Use Weka's unsupervised Discretize filter to discretize the continuous attribute "age" of the input dataset into 4 bins using equal width (i.e., useEqualFrequency=False). Include the results in your report, as well as the distribution of the target attribute in each of the bins.
  3. (5 points) Use Weka's supervised Discretize filter to discretize the continuous attribute "age" of the input dataset with respect to the class attribute. Include the results in your report, as well as the distribution of the target attribute in each of the resulting bins.
  4. (5 points) Weka Code. Find the Java code that implements the unsupervised discretization filter in the directories that contain the Weka files, following the instructions provided above. Include the first 10 lines of that code in your written report.
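The difference between equal-width and equal-frequency binning can be sketched as follows. This is a simplified illustration in Python, not the code of Weka's Discretize filter, whose exact cut points may differ. The helper names and the ages list are made up.

```python
def equal_width_edges(values, bins):
    """Cut points that split [min, max] into intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]

def equal_frequency_edges(values, bins):
    """Cut points chosen so each interval holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for i in range(1, bins):
        pos = i * n // bins
        # Place the cut halfway between the two neighbouring sorted values.
        edges.append((ordered[pos - 1] + ordered[pos]) / 2)
    return edges

def assign_bin(value, edges):
    """Index of the interval containing value (intervals closed on the right)."""
    return sum(value > e for e in edges)

def bin_counts(values, edges):
    """Number of values that land in each interval."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[assign_bin(v, edges)] += 1
    return counts

ages = [30, 32, 35, 41, 44, 47, 52, 70]   # made-up values
width_edges = equal_width_edges(ages, 4)
freq_edges = equal_frequency_edges(ages, 4)
print(width_edges, bin_counts(ages, width_edges))
print(freq_edges, bin_counts(ages, freq_edges))
```

With skewed data such as the made-up list above, equal-width bins end up unevenly populated, while equal-frequency bins have uneven widths.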
3. Missing Values. (5 points) Starting with the input dataset before sampling and before discretization:
  1. Determine if the dataset has any missing values.
  2. If so, use Weka's unsupervised ReplaceMissingValues filter to fill in the missing values.
  3. Compare the distribution of the original attribute(s) with missing values against the distribution of the same attribute after replacing the missing values.
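Weka's documentation describes ReplaceMissingValues as filling in missing values with the mean (numeric attributes) or the mode (nominal attributes). The Python sketch below illustrates that idea only; it is not Weka's code. Representing a missing value as None, the function name replace_missing, and the two sample columns are all assumptions made for illustration.

```python
from collections import Counter

def replace_missing(column, numeric):
    """Fill None entries with the mean (numeric) or the mode (nominal)
    of the known values in the same column."""
    known = [v for v in column if v is not None]
    if numeric:
        fill = sum(known) / len(known)
    else:
        fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

sizes = [2.0, None, 4.0, 3.0]      # made-up numeric column
grades = ["2", "3", None, "3"]     # made-up nominal column
filled_sizes = replace_missing(sizes, numeric=True)
filled_grades = replace_missing(grades, numeric=False)
print(filled_sizes)
print(filled_grades)
```

Filling with the mean leaves the attribute's mean unchanged but reduces its spread, and filling with the mode inflates the count of the most common value. Both effects are worth looking for when comparing the distributions before and after.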
4. Feature Selection. Starting with the input dataset before sampling, before discretization, and before replacing missing values:
  Apply Correlation-based Feature Selection (see Witten and Frank's textbook slides - Chapter 7 Slides 5-6) to the input dataset. For this, use Weka's CfsSubsetEval, available under the Select attributes tab, with default parameters.
  1. (3 points) Include in your report which attributes were selected by this method.
  2. (5 points) Also, what can you observe about these selected attributes with respect to the covariance matrix and the correlation matrix you computed above?
  3. (2 points) Were the 3 attributes you chose to remove above kept or removed by CfsSubsetEval?
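The intuition behind correlation-based feature selection is that a good subset contains attributes that are highly correlated with the class but not with each other. The Python sketch below computes the merit heuristic commonly given for CFS, k * r_cf / sqrt(k + k(k - 1) * r_ff), where r_cf is the mean attribute-class correlation and r_ff is the mean attribute-attribute correlation. It is a sketch under stated assumptions: the correlation values are made up, and Weka's CfsSubsetEval uses its own correlation measure and a search over subsets that are not reproduced here.

```python
import math

def cfs_merit(class_corrs, pair_corrs):
    """Merit of an attribute subset.

    class_corrs: correlation of each attribute in the subset with the class.
    pair_corrs:  correlation between each pair of attributes in the subset
                 (empty when the subset has a single attribute).
    """
    k = len(class_corrs)
    r_cf = sum(class_corrs) / k
    r_ff = sum(pair_corrs) / len(pair_corrs) if pair_corrs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Two attributes, equally predictive, in two made-up scenarios.
redundant = cfs_merit([0.6, 0.6], [0.9])       # attributes strongly correlated
complementary = cfs_merit([0.6, 0.6], [0.1])   # attributes nearly independent
single = cfs_merit([0.6], [])
print(round(redundant, 3), round(complementary, 3), round(single, 3))
```

In these made-up numbers, adding a second attribute raises the merit noticeably only when it is weakly correlated with the first, which is one way to relate the selected attributes back to the correlation matrix computed earlier.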
Optional Part
(20 Extra points) The dataset used in this problem set contains demographic information of the patients. This is part of a larger dataset containing microarray data for each of these patients. See a local copy of the microarray files. See for instance, the GSM177885.cel file.
Investigate on your own what microarray data is, and what the contents of the given GSM177885.cel file mean. Explain your findings in your report. Also include visualizations of the contents of that file, and/or any other interesting observations.

REPORTS AND DUE DATE

Slides: We will discuss the results from the problem set during class, so you should prepare a few slides (say 4 or 5) summarizing your findings and including any visualizations or graphs you want to share with the rest of the class. Be prepared to give an oral presentation.

Submit the following file with your slides for your oral report by email to me before 12:00 noon the day the problem set is due (that is, at least 1 hour before class):

[your-lastname]_pbmset1_slides.[ext]

where [ext] is pdf, ppt, or pptx. Please use only lower case letters in the file name. For instance, the file with my slides for this problem set would be named ruiz_pbmset1_slides.pptx

Written Report: Hand in a hardcopy of your written report at the beginning of class the day the problem set is due.

Grading sheet for this problem set.
