WPI Worcester Polytechnic Institute

------------------------------------------

BCB4003 / BCB503 Biological and Biomedical Database Mining
Project 1 - A term / Fall 2011

PROF. CAROLINA RUIZ 

DUE DATE: Friday, Sept. 16, 2011. Slides (by email) by 12 noon, and Written Report (hardcopy) at the beginning of class (1:00 pm).
** This is an individual project **

------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is multi-fold:

PROJECT ASSIGNMENT

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

  1. Dataset.

    1. The dataset for this project is GSE7390_transbig2006affy_demo.txt (this is a local copy of the dataset).

      This dataset is part of NCBI's GSE7390 Data Set and contains information for 198 untreated patients of the TRANSBIG validation study. See the README.txt file that describes the dataset. (To simplify the data downloading process, you can find the same files described above in a local copy of the dataset.)

    2. Remove the following dataset attributes from consideration: "samplename", "id", "geo_accn", "filename", and "hospital".
    3. Convert the dataset to the .arff format for Weka. For this you can either use any tools provided by Weka, or you can make the conversion outside the Weka system using other tools (e.g., Excel, your own script, etc.). Include in your report the first 10 data instances of the dataset in your .arff file. (Do NOT include the full dataset, just the first 10 data instances.)
    4. Convert the dataset to the .dat format for Matlab. For this you may want to convert the Boolean attributes to numeric by replacing "GOOD" with 1, and "POOR" with 0. Include in your report the first 10 data instances of the dataset in your .dat file. (Do NOT include the full dataset, just the first 10 data instances.)
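    If you choose to write your own script for parts 3 and 4, the conversions could look roughly like the sketch below. It is only a starting point and rests on several assumptions that you should check against README.txt: that the file is tab-delimited with a header row, that missing values appear as empty fields or "NA", and that attribute names and values contain no spaces or commas (otherwise ARFF requires quoting). The function names are made up for this illustration.

```python
import csv
import io

# Attributes to remove, as listed in part 2 of the Dataset section.
DROP = {"samplename", "id", "geo_accn", "filename", "hospital"}
MISSING = ("", "NA")                      # assumption: how missing values are coded
TO_NUMERIC = {"GOOD": "1", "POOR": "0"}   # replacement suggested in part 4


def read_rows(text, delimiter="\t"):
    """Parse delimited text with a header row and drop the excluded attributes."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    keep = [name for name in reader.fieldnames if name not in DROP]
    return keep, [[row[name] for name in keep] for row in reader]


def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def to_arff(relation, names, rows):
    """Build ARFF text; a column is numeric if all its present values parse as numbers."""
    lines = ["@relation " + relation, ""]
    for j, name in enumerate(names):
        present = [r[j] for r in rows if r[j] not in MISSING]
        if all(is_number(v) for v in present):
            lines.append("@attribute %s numeric" % name)
        else:
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(present)))))
    lines += ["", "@data"]
    for r in rows:
        lines.append(",".join("?" if v in MISSING else v for v in r))
    return "\n".join(lines) + "\n"


def to_dat(rows):
    """Build space-separated text with GOOD/POOR replaced by 1/0.

    Other non-numeric or missing values are left untouched and would still
    need to be recoded before Matlab can load the file as a numeric matrix.
    """
    return "\n".join(" ".join(TO_NUMERIC.get(v, v) for v in r) for r in rows) + "\n"
```

    To use it, read the dataset file into a string, pass it to read_rows, and write the results of to_arff and to_dat to your .arff and .dat files.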

  2. Data Exploration.

    Use Excel, Matlab, your own code, Weka, or other software to explore the dataset, that is, to become familiar with the different attributes of the dataset, their distributions, and any salient characteristics of the dataset. Include in your report any interesting observations and/or visualizations that you obtain during this exploration. State in your report which tool(s) you used for each of these observations.
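    If you explore the data with your own code, a per-attribute summary is a reasonable first step. The sketch below is one possible way to do it, assuming each attribute is available as a list of strings and that missing values are coded as "", "NA", or "?" (an assumption; check the data). The function name is hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize(values, missing=("", "NA", "?")):
    """Summarize one attribute: range and mean if numeric, value counts otherwise."""
    present = [v for v in values if v not in missing]
    summary = {"count": len(present), "missing": len(values) - len(present)}
    try:
        nums = [float(v) for v in present]
    except ValueError:
        # At least one value is not a number: treat the attribute as nominal.
        summary["counts"] = dict(Counter(present))
        return summary
    if nums:
        summary.update(min=min(nums), max=max(nums),
                       mean=mean(nums), stdev=pstdev(nums))
    return summary
```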

  3. Data Preprocessing.

    Create a second version of the .arff file containing the same dataset but with nominal (rather than numeric) values for the following attributes: "Histtype", "Angioinv", "Lymp_infil", and "grade".
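    One way to produce the second file is to rewrite the attribute declarations of the first one. The sketch below assumes a simple ARFF file (no quoted names or values, no sparse format) and lists as nominal values whatever values actually occur in the data; the function name is made up for this example. Weka's NumericToNominal filter is an alternative if you prefer to stay inside Weka.

```python
# Attributes that should become nominal, as listed in the assignment.
NOMINAL = {"Histtype", "Angioinv", "Lymp_infil", "grade"}


def make_nominal(arff_text, nominal=NOMINAL):
    """Return ARFF text in which the given attributes are declared nominal."""
    lines = arff_text.splitlines()
    data_start = next(i for i, line in enumerate(lines)
                      if line.strip().lower() == "@data")
    header = lines[:data_start]
    names = [line.split()[1] for line in header
             if line.lower().startswith("@attribute")]
    rows = [line.split(",") for line in lines[data_start + 1:]
            if line.strip() and not line.startswith("%")]
    out = []
    for line in header:
        parts = line.split()
        if line.lower().startswith("@attribute") and parts[1] in nominal:
            j = names.index(parts[1])
            # Collect the distinct values of this column, ignoring missing ones.
            values = sorted({r[j].strip() for r in rows} - {"?"})
            out.append("@attribute %s {%s}" % (parts[1], ",".join(values)))
        else:
            out.append(line)
    return "\n".join(out + lines[data_start:]) + "\n"
```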

  4. Clustering.

    Apply the following clustering methods to your dataset. Describe in your written report the results of your experiments.

    1. K-means.

      Experiment both with Weka and with Matlab (see the Statistics Toolbox), using:

      • different values of k: k = 2 to 6,
      • different distance metrics: Euclidean and Manhattan (= cityblock).
      • different seed values (in Weka): 10, 27, 43; and different 'start' options in Matlab: 'sample', 'uniform', 'cluster'.
      • with and without normalizing each of the attributes before clustering.
      In Weka, use both .arff versions of the dataset (one with numeric and one with nominal values for the attributes "Histtype", "Angioinv", "Lymp_infil", and "grade").
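      To make the experimental settings concrete, the sketch below runs a plain k-means over the same grid of options (k, distance metric, seed, normalization) on a list of numeric rows. It is not a substitute for the Weka and Matlab experiments and will not reproduce their numbers: the initialization, seeding, and tie-breaking are this sketch's own choices, and all function names are hypothetical. It pairs squared Euclidean distance with mean centroids and Manhattan distance with component-wise median centroids.

```python
import random
from statistics import mean, median


def min_max_normalize(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)] for row in rows]


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(rows, k, seed, dist=sq_euclidean, center=mean, max_iter=100):
    """Basic k-means: k randomly sampled instances serve as initial centroids."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    assign = []
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: dist(row, centroids[c]))
                      for row in rows]
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            members = [row for row, a in zip(rows, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [center(col) for col in zip(*members)]
    # Error: sum over all instances of the distance to the assigned centroid.
    error = sum(dist(row, centroids[a]) for row, a in zip(rows, assign))
    return assign, centroids, error


def run_grid(rows):
    """Run every combination of normalization, metric, k, and seed."""
    results = []
    for normalized in (False, True):
        data = min_max_normalize(rows) if normalized else rows
        for name, dist, center in (("euclidean", sq_euclidean, mean),
                                   ("manhattan", manhattan, median)):
            for k in range(2, 7):
                for seed in (10, 27, 43):
                    assign, _, error = kmeans(data, k, seed, dist, center)
                    sizes = [assign.count(c) for c in range(k)]
                    results.append((normalized, name, k, seed, sizes, error))
    return results
```

      Each tuple returned by run_grid corresponds to one row of a results table: the settings, the cluster sizes, and the error value.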

      Include in your written report (ideally summarized in a table):

      • A summary of the results of each experiment (number of instances in each cluster, centroids, ...)
      • The error value ("Within cluster sum of squared errors" in Weka and the sum of values in the "sumd" vector in Matlab) for each clustering
      • Select the best clustering (i.e., lowest error value) obtained with each tool (Weka and Matlab). Let's call the best Weka clustering CWbest, and the best Matlab clustering CMbest.
        1. Produce Scatterplot and MultiDimensional Scaling (MDS) visualizations of CWbest and CMbest.
        2. Investigate what "Cluster Purity" and "Normalized Mutual Information (NMI)" are. See, for example, "Evaluation of clustering", which is part of the online book "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze. Include a brief description of these metrics in your report.
        3. Calculate Purity and NMI for the pair of clusterings CWbest and CMbest (whenever a notion of "class" is needed, assume that the class of an instance is the cluster number that the instance belongs to).
        Based on the above comparisons, what conclusions can you draw about K-means clustering of this dataset? Elaborate on your answer.
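      Once you have looked up the definitions, the two measures can be computed from a pair of label lists (one cluster number per instance for each clustering). The sketch below follows the definitions in the "Evaluation of clustering" chapter as I read them: purity is the fraction of instances that fall in the majority class of their cluster, and NMI is the mutual information divided by the average of the two entropies. Returning 1.0 when both entropies are zero is a convention chosen for this sketch. Note that purity is not symmetric, so state in your report which clustering plays the role of the "class".

```python
from collections import Counter
from math import log


def purity(clusters, classes):
    """Fraction of instances belonging to the majority class of their cluster."""
    joint = Counter(zip(clusters, classes))
    best = {}
    for (k, _), count in joint.items():
        best[k] = max(best.get(k, 0), count)
    return sum(best.values()) / len(clusters)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(clusters, classes):
    """Mutual information normalized by the mean of the two entropies."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    size_k = Counter(clusters)
    size_j = Counter(classes)
    mi = sum((c / n) * log(n * c / (size_k[k] * size_j[j]))
             for (k, j), c in joint.items())
    denom = (entropy(clusters) + entropy(classes)) / 2
    # Both partitions consist of a single group: treat them as identical.
    return mi / denom if denom > 0 else 1.0
```

      For your report, clusters would be the cluster numbers from CWbest and classes the cluster numbers from CMbest (or the other way around), listed in the same instance order.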

    2. Hierarchical Clustering.

      Experiment both with Weka and with Matlab (see the Statistics Toolbox), using:

      • different distance metrics: Euclidean and Manhattan (= cityblock).
      • different link types: single, complete, average, centroid, Ward
      • with and without normalizing each of the attributes before clustering.
      In Weka, use both .arff versions of the dataset (one with numeric and one with nominal values for the attributes "Histtype", "Angioinv", "Lymp_infil", and "grade").

      Include in your written report (ideally summarized in a table):

      • A summary of the results of each experiment
      • Any interesting observations you made about the results of the experiments.
      • Pick your favorite hierarchical clustering experiment. Let CW and CM be the hierarchical clusterings obtained from Weka and Matlab respectively for the options/settings of your favorite experiment. Include in your report a tree visualization (and other visualizations if you wish) of CW and CM.
      Based on the above results and visualizations, what conclusions can you draw about hierarchical clustering of this dataset? Elaborate on your answer.
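      To see how the link type changes the result, it can help to trace a tiny example by hand. The sketch below is a naive agglomerative procedure covering only the single, complete, and average link types (centroid and Ward are omitted) and is meant for a handful of points, not as a replacement for the Weka and Matlab runs. The function names and the format of the merge list are choices made for this sketch.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def agglomerate(rows, linkage="single", dist=euclidean):
    """Merge the two closest clusters repeatedly until one cluster remains.

    Returns a list of (cluster_a, cluster_b, distance, new_size) tuples.
    Instances are clusters 0..n-1; each merge creates the next unused id.
    """
    clusters = {i: [i] for i in range(len(rows))}
    next_id = len(rows)
    merges = []

    def cluster_dist(a, b):
        ds = [dist(rows[i], rows[j]) for i in clusters[a] for j in clusters[b]]
        if linkage == "single":
            return min(ds)        # closest pair of members
        if linkage == "complete":
            return max(ds)        # farthest pair of members
        if linkage == "average":
            return sum(ds) / len(ds)
        raise ValueError("unsupported linkage: " + linkage)

    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = ((p, q) for i, p in enumerate(ids) for q in ids[i + 1:])
        a, b = min(pairs, key=lambda pq: cluster_dist(*pq))
        d = cluster_dist(a, b)
        merges.append((a, b, d, len(clusters[a]) + len(clusters[b])))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

      The merge list contains the information needed to draw a tree: which clusters were joined at each step and at what distance.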

    3. Optional Part

      (Extra points) The dataset used in this project contains demographic information of the patients. This is part of a larger dataset containing microarray data for each of these patients. See a local copy of the microarray files. See for instance, the GSM177885.cel file.

      Investigate on your own what microarray data is, and what the contents of the given GSM177885.cel file mean. Explain in your report. Include also visualizations of the contents of that file, and/or any other interesting observations.


REPORTS AND DUE DATE

  1. Slides. We will discuss the results from the project during class, so you should prepare slides summarizing your findings and be prepared to give an oral presentation.

    Submit the following file with your slides for your oral report by email to me before 12:00 noon the day the project is due (that is, at least 1 hour before class):

    [your-lastname]_proj1_slides.[ext]
    where [ext] is pdf, ppt, or pptx. Please use only lowercase letters in the file name. For instance, the file with my slides for this project would be named ruiz_proj1_slides.pptx

  2. Written Report. Hand in a hardcopy of your written report at the beginning of class the day the project is due.