Written Report: Your written report should consist of your answers to each of the parts in the assignment below.
Assignment:
The dataset for this project is part of NCBI's GSE7390 data set. It contains information on 198 untreated patients from the TRANSBIG validation study. See the README.txt file, which describes the dataset. (To simplify the data downloading process, you can find the same files in a local copy of the dataset.)
Use Excel, Matlab, your own code, Weka, or other software to explore the dataset, that is, to become familiar with its attributes, their distributions, and any salient characteristics of the data. Include in your report any interesting observations and/or visualizations that you obtain during this exploration. State in your report which tool(s) you used for each of these observations.
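If you choose the "your own code" option for the exploration step, a first pass could look like the Python sketch below. It is only an illustration under stated assumptions: it handles simple, dense ARFF files with unquoted, comma-separated values, and the sample data (including the attribute names age and er) is made up for the example and is not taken from the project dataset. The helper names parse_arff and summarize are hypothetical.

```python
import statistics
from collections import Counter


def parse_arff(text):
    """Return (attribute_names, rows) from a simple, dense ARFF file.

    Assumes unquoted, comma-separated values and no sparse rows.
    """
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


def summarize(names, rows):
    """Per attribute: min/max/mean if all values parse as numbers, else value counts."""
    summary = {}
    for i, name in enumerate(names):
        values = [row[i] for row in rows if row[i] != "?"]  # "?" marks a missing value
        if not values:
            summary[name] = {"kind": "empty", "n": 0}
            continue
        try:
            nums = [float(v) for v in values]
        except ValueError:
            summary[name] = {"kind": "nominal", "n": len(values),
                             "counts": dict(Counter(values))}
        else:
            summary[name] = {"kind": "numeric", "n": len(nums),
                             "min": min(nums), "max": max(nums),
                             "mean": statistics.mean(nums)}
    return summary


# Made-up sample for illustration only (NOT the project data).
SAMPLE = """\
@relation demo
@attribute age numeric
@attribute grade numeric
@attribute er {pos,neg}
@data
40,1,pos
50,3,neg
60,?,pos
"""

summary = summarize(*parse_arff(SAMPLE))
for name, stats in summary.items():
    print(name, stats)
```

To run it on the project data, read the .arff file into a string and pass it to parse_arff in place of SAMPLE. Whatever tool you use, remember to name it in your report.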
Create a second version of the .arff file containing the same dataset but with nominal (rather than numeric) values for the following attributes: "Histtype", "Angioinv", "Lymp_infil", and "grade".
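One way to produce the nominal version is to rewrite the attribute declarations in the ARFF header and leave the data rows untouched. The Python sketch below illustrates the idea under stated assumptions: the file is a simple, dense ARFF file with unquoted values, the attributes to convert hold numeric codes, and the set of nominal labels is taken from the values that actually occur in the data. The function name to_nominal and the sample data are hypothetical and are not part of the project files.

```python
def to_nominal(arff_text, attributes):
    """Rewrite the declarations of the given attributes as nominal.

    The allowed labels for each attribute are the distinct non-missing
    values found in its column of the data section.
    """
    wanted = {name.lower() for name in attributes}
    lines = arff_text.splitlines()

    # Pass 1: record attribute order and find where the data section starts.
    names, data_start = [], None
    for idx, line in enumerate(lines):
        stripped = line.strip()
        lower = stripped.lower()
        if lower.startswith("@attribute"):
            names.append(stripped.split()[1].strip("'\""))
        elif lower.startswith("@data"):
            data_start = idx + 1
            break
    if data_start is None:
        raise ValueError("no @data section found")

    # Pass 2: collect the distinct values of each column to convert.
    seen = {i: set() for i, name in enumerate(names) if name.lower() in wanted}
    for line in lines[data_start:]:
        stripped = line.strip()
        if not stripped or stripped.startswith("%"):
            continue
        cells = [cell.strip() for cell in stripped.split(",")]
        for i in seen:
            if cells[i] != "?":  # "?" marks a missing value
                seen[i].add(cells[i])

    # Pass 3: replace the matching declarations, keep every other line as is.
    out, attr_index = [], 0
    for idx, line in enumerate(lines):
        stripped = line.strip()
        if idx < data_start and stripped.lower().startswith("@attribute"):
            if attr_index in seen:
                name = stripped.split()[1]
                labels = sorted(seen[attr_index], key=float)
                line = "@attribute %s {%s}" % (name, ",".join(labels))
            attr_index += 1
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up sample for illustration only (NOT the project data).
SAMPLE = """\
@relation demo
@attribute age numeric
@attribute grade numeric
@data
40,1
50,3
60,?
55,2
"""

converted = to_nominal(SAMPLE, ["grade"])
print(converted)
```

For the project file you would pass the four attribute names listed above instead of ["grade"]. Weka also provides a NumericToNominal filter that can do this conversion from within the tool; whichever route you take, open the resulting file in Weka to check that the four attributes are reported as nominal.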
Apply the following clustering methods to your dataset. Describe in your written report the results of your experiments.
Experiment both with Weka and with Matlab (see the Statistics Toolbox), using:
Include in your written report (ideally summarized in a table):
Experiment both with Weka and with Matlab (see the Statistics Toolbox), using:
Include in your written report (ideally summarized in a table):
(Extra points) The dataset used in this project contains demographic information about the patients. It is part of a larger dataset containing microarray data for each of these patients. See a local copy of the microarray files. See, for instance, the GSM177885.cel file.
Investigate on your own what microarray data is, and what the contents of the given GSM177885.cel file mean. Explain your findings in your report. Also include visualizations of the contents of that file, and/or any other interesting observations.
Submit the following file with your slides for your oral report by email to me before 12:00 noon the day the project is due (that is, at least 1 hour before class):
[your-lastname]_proj1_slides.[ext] where: [ext] is pdf, ppt, or pptx. Please use only lower case letters in the file name. For instance, the file with my slides for this project would be named ruiz_proj1_slides.pptx