- Weka:
- Matlab:
Access Matlab from the CCC
as described in the Course Webpage.
- Dataset:
- The dataset for this problem set is
GSE7390_transbig2006affy_demo.txt, which is part of NCBI's
GSE7390 Data Set. It contains information on 198 untreated patients from the TRANSBIG validation study.
See the README.txt file, which describes the dataset.
(To simplify downloading, the same files are also available in a local copy of the dataset.)
- Remove the following dataset attributes from consideration:
"samplename", "id", "geo_accn", "filename", and "hospital".
- (10 points)
Convert the dataset to the .arff format for Weka. For this
you can either use any tools provided by Weka, or you can make the
conversion outside the Weka system using other tools (e.g., Excel, your own
script, etc.).
Include in your report the file header defining the attributes and the first 10 data instances
in your .arff file. (Do NOT
include the full dataset in your written report, just the header and the first 10 data instances.)
However, use the entire dataset for the remaining parts of this problem set.
Make sure to use "?" in the .arff file to represent any missing values in the dataset.
- (10 points)
Convert the dataset to the .dat format for Matlab
(or if you prefer, any other format that you can upload onto Matlab).
For this you may want to convert the Boolean attributes to numeric by replacing
"GOOD" with 1, and "POOR" with 0.
Include in your report the file header and the first 10 data instances of the dataset
in your .dat file.
(Do NOT include the full dataset in your written report,
just the header and the first 10 data instances.)
However, use the entire dataset for the remaining parts of this problem set.
Make sure to use the appropriate Matlab convention to represent any missing values in the dataset.
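The numeric recoding described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the rows are invented, missing entries are assumed to appear as non-numeric text such as "NA", and NaN is used because it is Matlab's usual marker for missing numeric data. Whether a particular Matlab import function accepts the literal text NaN should be verified separately.

```python
import math

# Hypothetical rows (age, veridex_risk); the values are invented.
ROWS = [("57", "GOOD"), ("NA", "POOR"), ("43", "good")]

RISK_CODES = {"GOOD": 1.0, "POOR": 0.0}


def encode(value):
    """Map GOOD/POOR to 1/0, parse numbers, and turn anything else into NaN."""
    key = value.strip().upper()
    if key in RISK_CODES:
        return RISK_CODES[key]
    try:
        return float(key)
    except ValueError:
        return math.nan


def to_dat_line(row):
    # Write NaN as the literal text "NaN"; other numbers in compact form.
    return " ".join("NaN" if math.isnan(v) else "%g" % v for v in map(encode, row))


dat_lines = [to_dat_line(r) for r in ROWS]
print("\n".join(dat_lines))
```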
- Data Exploration.
(30 points)
Use Excel, Matlab, your own code, Weka, R, or other software to explore
the dataset, that is, to become familiar with its attributes,
their distributions, and any salient characteristics of
the dataset.
- (10 points) Include in your report any interesting observations and
visualizations that you obtain during this exploration. State in your
report which tool(s) from the above list you used for each of these
observations and visualizations.
- (15 points) Calculate both the covariance matrix and
the correlation matrix of the numeric attributes.
See notes on using Matlab and Excel to calculate these matrices.
Include these two matrices in your report.
Try to construct a visualization of each of these matrices (e.g., heatmap) to more easily understand them.
- (5 points)
If you had to remove 3 of these continuous attributes from the dataset based
on these two matrices, which attributes would you remove and why?
Explain your answer.
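For the "your own code" route, the two matrices can be computed directly from their definitions. The sketch below is a minimal pure-Python illustration on invented data; the attribute names and values are hypothetical, and it uses the sample covariance (dividing by n - 1), which may differ from the convention of a given tool. Ranking attribute pairs by absolute correlation is just one possible way to look for redundant attributes, not the expected answer.

```python
import math

# Hypothetical numeric columns; the values are invented for illustration.
DATA = {
    "age":  [40.0, 50.0, 60.0, 70.0],
    "size": [1.0, 2.0, 3.0, 4.0],
    "time": [9.0, 7.0, 8.0, 2.0],
}


def covariance(x, y):
    """Sample covariance (divides by n - 1)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


names = list(DATA)
cov = [[covariance(DATA[a], DATA[b]) for b in names] for a in names]
corr = [[correlation(DATA[a], DATA[b]) for b in names] for a in names]

# List attribute pairs from most to least strongly correlated.
pairs = sorted(
    ((abs(corr[i][j]), names[i], names[j])
     for i in range(len(names)) for j in range(i + 1, len(names))),
    reverse=True,
)
for r, a, b in pairs:
    print("%s / %s: |r| = %.3f" % (a, b, r))
```

Unlike covariance, correlation does not depend on the units of the attributes, which is why the two matrices can suggest different things about the same pair.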
- Data Preprocessing.
(50 points)
Create a second version of the .arff file containing the same dataset
but with nominal (rather than numeric) values for the following attributes:
"Histtype", "Angioinv", "Lymp_infil", and "grade".
For the remainder of this problem set, assume that "veridex_risk"
is the target attribute.
- Sampling.
- (5 points) Use Weka's unsupervised Resample filter to obtain a 50%
subsample of the input dataset without replacement.
Include in your report the distribution of the target
attribute
(that is, the percentage of instances with "veridex_risk"=Good,
and the percentage of instances with "veridex_risk"=Poor)
in the subsample.
- (5 points) Use Weka's supervised Resample filter to obtain a 50%
subsample of the input dataset without replacement.
Include in your report the distribution of the target
attribute in the subsample.
- (5 points) Are the above two distributions different? Why is that?
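The difference between sampling that ignores the class and sampling that takes it into account can be sketched in Python as below. This is not Weka's implementation. It assumes that the supervised filter, with its default settings, tries to maintain the class distribution, which should be checked against the filter's documentation. The labels and the 70/30 split are invented.

```python
import random
from collections import Counter

# Hypothetical class labels: 70% Good, 30% Poor (not the real distribution).
LABELS = ["Good"] * 14 + ["Poor"] * 6


def simple_sample(labels, fraction, rng):
    """Draw without replacement, ignoring the class label."""
    return rng.sample(labels, round(len(labels) * fraction))


def stratified_sample(labels, fraction, rng):
    """Draw the same fraction without replacement from each class separately."""
    picked = []
    for cls in sorted(set(labels)):
        members = [x for x in labels if x == cls]
        picked += rng.sample(members, round(len(members) * fraction))
    return picked


rng = random.Random(0)  # fixed seed so the run is repeatable
simple = Counter(simple_sample(LABELS, 0.5, rng))
stratified = Counter(stratified_sample(LABELS, 0.5, rng))
print("simple:    ", dict(simple))
print("stratified:", dict(stratified))
```

In this sketch the stratified draw always returns 7 Good and 3 Poor, while the simple draw can drift away from the 70/30 split by chance.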
- Attribute Discretization.
Starting with the input dataset (before sampling):
- (5 points) Use Weka's unsupervised Discretize filter to discretize
the continuous attribute "age" of the input dataset into 4 bins
using equal frequency (i.e., useEqualFrequency=True).
Include the results in your report, as well as the distribution
of the target attribute in each of the bins.
- (5 points) Use Weka's unsupervised Discretize filter to discretize
the continuous attribute "age" of the input dataset into 4 bins
using equal width (i.e., useEqualFrequency=False).
Include the results in your report, as well as the distribution
of the target attribute in each of the bins.
- (5 points) Use Weka's supervised Discretize filter to discretize
the continuous attribute "age" of the input dataset with
respect to the class attribute.
Include the results in your report, as well as the distribution
of the target attribute in each of the resulting bins.
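To make the contrast between the two unsupervised settings concrete, here is a minimal Python sketch of equal-width and equal-frequency binning. It is a simplified illustration, not Weka's algorithm: the ages are invented, and the equal-frequency version assigns bins purely by rank, so tied values could be split across bins.

```python
# Hypothetical ages, invented for illustration.
AGES = [30, 31, 32, 33, 34, 35, 36, 70]


def equal_width_bins(values, k):
    """Assign each value to one of k intervals of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bins by rank so each bin holds about len(values) / k values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


width_bins = equal_width_bins(AGES, 4)
freq_bins = equal_frequency_bins(AGES, 4)
print("equal width:    ", width_bins)
print("equal frequency:", freq_bins)
```

With one outlying age, equal-width binning puts seven of the eight values into the first bin and leaves two bins empty, whereas equal-frequency binning places two values in each bin.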
- (5 points) Weka Code.
Find the Java code that implements the unsupervised discretization filter
in the directories that contain the Weka files, following the instructions
provided above.
Include the first 10 lines of that code in your written report.
- Missing Values.
(5 points)
Starting with the input dataset before sampling and before discretization:
- Determine if the dataset has any missing values.
- If so, use Weka's unsupervised ReplaceMissingValues filter to fill in the missing
values.
- Compare the distribution of the original attribute(s) with missing values against the
distribution of the same attribute after replacing the missing values.
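The kind of before/after comparison asked for here can be sketched in Python for a numeric attribute. The sketch assumes the missing entries are filled with the mean of the observed values, which is how Weka documents ReplaceMissingValues for numeric attributes (nominal attributes get the mode). The data are invented.

```python
from statistics import mean, pstdev

# Hypothetical attribute with missing entries (None); values are invented.
SIZES = [1.0, 2.0, None, 3.0, None, 6.0]


def impute_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


filled = impute_mean(SIZES)
observed = [v for v in SIZES if v is not None]

print("mean before: %.2f  after: %.2f" % (mean(observed), mean(filled)))
print("std  before: %.2f  after: %.2f" % (pstdev(observed), pstdev(filled)))
```

Mean imputation leaves the mean unchanged but shrinks the spread, because every filled-in value sits exactly at the center of the distribution.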
- Feature Selection.
Starting with the input dataset before sampling, before discretization, and
before replacing missing values:
Apply Correlation Based Feature Selection
(see Witten and Frank's textbook slides, Chapter 7, Slides 5-6)
to the input dataset.
For this, use Weka's CfsSubsetEval available under the Select attributes tab
with default parameters.
- (3 points) Include in your report which attributes were selected by this method.
- (5 points) Also, what can you observe about these selected attributes with respect to
the covariance matrix and the correlation matrix you computed
above?
- (2 points) Were the 3 attributes you chose to remove
above kept or removed by CfsSubsetEval?
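As background for relating the selected attributes to the correlation matrix, the subset merit heuristic commonly given for correlation-based feature selection can be sketched in Python as follows. This is not Weka's CfsSubsetEval code, which also involves a search method and its own correlation measure; the function name and the correlation values are hypothetical.

```python
import math


def cfs_merit(class_corrs, feature_corrs):
    """CFS heuristic merit of an attribute subset.

    class_corrs: correlation of each subset attribute with the class.
    feature_corrs: correlations between pairs of subset attributes.
    """
    k = len(class_corrs)
    r_cf = sum(class_corrs) / k
    r_ff = sum(feature_corrs) / len(feature_corrs) if feature_corrs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical correlations, invented for illustration.
redundant = cfs_merit([0.6, 0.6], [0.9])      # two attributes that mostly overlap
complementary = cfs_merit([0.6, 0.6], [0.1])  # two nearly independent attributes
print("redundant pair:     %.3f" % redundant)
print("complementary pair: %.3f" % complementary)
```

The heuristic rewards attributes that correlate with the class and penalizes attributes that correlate with each other, so two equally predictive subsets can receive different merits.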
- Optional Part
(20 Extra points)
The dataset used in this problem set contains demographic information about the patients.
It is part of a larger dataset containing microarray data for each
of these patients. See
the local copy of the microarray files, for instance the GSM177885.cel file.
Investigate on your own what microarray data are and what the contents of the given GSM177885.cel file mean. Explain your findings in your report. Also include visualizations of the contents of that file and/or any other interesting observations.