Worcester Polytechnic Institute (WPI)

------------------------------------------

BCB4003 / BCB503 Biological and Biomedical Database Mining
Project 3 - A term / Fall 2011

PROF. CAROLINA RUIZ 

DUE DATE: Monday, Oct. 10, 2011. Slides (by email) by 10:00 am; Written Report (hardcopy) at the beginning of class (1:00 pm).
** This is an individual project **

------------------------------------------


PROJECT DESCRIPTION

The purpose of this project is:

PROJECT ASSIGNMENT

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

  1. SVMs (100 points)
    In this part of the project, we will explore an application of SVMs to splice site recognition. This application is described in the following paper:
    Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, Gunnar Rätsch. "Support Vector Machines and Kernels for Computational Biology". PLOS Computational Biology. Vol. 4, Issue 10, e1000173. Oct. 2008.
    The data used in this paper for the C. elegans model organism is provided by the authors at http://svmcompbio.tuebingen.mpg.de/splicing.html

    1. (20 points) Read the above paper in great detail (your understanding of the paper will become apparent on both your written and oral reports).
    2. Train several SVMs over each of the 5 datasets listed below, using either Weka or Matlab. Use your good judgement to select combinations of parameter values, including kernel functions (use PolyKernel, NormalizedPolyKernel, RBFKernel, and StringKernel) and the parameters for these kernels. Use 10-fold cross-validation if at all possible (if not, use 4-fold cross-validation). Include in your report a detailed description and analysis of your experiments and results. Report accuracy values, ROC Area, and any other evaluation metrics you deem relevant (including possibly confusion matrices).
      1. (16 points) Use the GC-Content features dataset provided.
      2. Use the Sequences data provided. For this part, we will use l-mers as described in the paper. Create separate input datasets containing the following l-mers (yielding 2x(4^l) attributes):
        • (16 points) single nucleotides: 8 attributes, one count for each DNA letter (A,C,G,T) on each side of the AG acceptor site.
        • (16 points) dimers: 32 attributes, one count for each pair of DNA letters on each side of the AG acceptor site.
        • (16 points) 4-mers: 512 attributes, one count for each sequence of 4 DNA letters on each side of the AG acceptor site. (If the data mining method cannot handle the number of attributes, use CfsSubsetEval from the "Select attributes" tab in Weka to select a subset of attributes before training.)
        • (16 points) single nucleotides, dimers, and 4-mers: 552 attributes, by combining the 3 datasets above.
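
    The l-mer counting described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each sequence is a plain string, the 0-based position of the acceptor AG is known, the AG itself is excluded from both sides, and windows containing symbols other than A, C, G, T are skipped. The names lmer_counts and combined_features are hypothetical and are not part of the paper or the provided data; check the actual layout of the Sequences data before adapting it.

```python
from itertools import product

ALPHABET = "ACGT"


def lmer_counts(seq, ag_index, l):
    """Count l-mers on each side of the AG acceptor dinucleotide.

    seq      : DNA sequence as a string (assumed input format)
    ag_index : 0-based index of the 'A' in the acceptor 'AG' (assumed known)
    l        : l-mer length
    Returns a dict with 2 * 4**l entries keyed like 'up_AC' and 'down_AC'.
    """
    seq = seq.upper()
    if seq[ag_index:ag_index + 2] != "AG":
        raise ValueError("no AG at the given index")
    # Assumption: the AG itself belongs to neither side.
    sides = {"up": seq[:ag_index], "down": seq[ag_index + 2:]}
    counts = {}
    for side, sub in sides.items():
        # Start every possible l-mer at zero so all 4**l attributes exist.
        for mer in map("".join, product(ALPHABET, repeat=l)):
            counts[f"{side}_{mer}"] = 0
        # Slide a window of width l over this side of the sequence.
        for i in range(len(sub) - l + 1):
            key = f"{side}_{sub[i:i + l]}"
            if key in counts:  # skips windows with N or other symbols
                counts[key] += 1
    return counts


def combined_features(seq, ag_index, lengths=(1, 2, 4)):
    """Merge the count dictionaries for several l-mer lengths."""
    features = {}
    for l in lengths:
        features.update(lmer_counts(seq, ag_index, l))
    return features
```

    With lengths (1, 2, 4), combined_features returns 8 + 32 + 512 = 552 attributes per sequence, matching the combined dataset in the last bullet.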

  2. Text Mining (100 points)
    Use PubMed to search for medical abstracts for the following two queries:
    • breast cancer genes
    • prostate cancer genes
    1. (5 points) Download the top 25 abstracts returned for each query.
    2. (15 points) Create a dataset consisting of these 50 abstracts. We will use the Bag of Words representation. That is, a vector of the words occurring in the 50 documents (minus stop words and other irrelevant words) will be used as the features/attributes. Each document will be represented by the vector of frequencies of each of these words in the document. Add an attribute called class with values "breast" (for the 25 documents obtained for the 1st query) and "prostate" (for the 25 documents obtained for the 2nd query). To transform the documents into this bag of words representation, you can either write your own code or use a good, existing software package available to you. Check the resulting list of words to make sure it is a good selection of words. Describe in your report what code you used.
    3. (40 points) Create classification models over this dataset, using Naive Bayes, Bayesian Nets, and Support Vector Machines. Use 10-fold cross-validation. (If the data mining method cannot handle the number of attributes, use CfsSubsetEval from the "Select attributes" tab in Weka to select a subset of attributes before training.) Include in your report a detailed description and analysis of your experiments and results. Report accuracy values, ROC Area, and any other evaluation metrics you deem relevant (including possibly confusion matrices).
    4. (10 points) What words seem more relevant in discriminating between the two classes in the models constructed above? If these are obvious words like "male", "female", "breast", "prostate", ..., eliminate them from the dataset, repeat the classification experiments, and answer this question again.
    5. (10 points) Search for the top 5 words identified above on Gene Ontology. Do you find any relationships between these words?
    6. (20 points) Now apply clustering methods (k-means, EM, hierarchical clustering) over this dataset with k=2 without using the class attribute. After varying parameters as needed, how well do the 2 resulting clusters represent the 2 classes? Explain your answer.
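
    The bag of words step in item 2 can be sketched in Python as follows. This is one possible approach, not a required one. The stop-word list is a tiny placeholder, the tokenization rule (lowercase tokens that start with a letter, so gene names such as BRCA1 keep their digits) is an assumption, and the function names tokenize, bag_of_words, and to_arff are hypothetical. The w_ prefix on word attributes is an illustrative choice to avoid a clash with the class attribute.

```python
import re
from collections import Counter

# Placeholder stop-word list; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "by", "for",
              "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and keep tokens that start with a letter,
    have at least two characters, and are not stop words."""
    words = re.findall(r"[a-z][a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def bag_of_words(docs):
    """docs is a list of (text, label) pairs.

    Returns (vocab, rows): vocab is the sorted word list and each row
    holds the word frequencies for one document followed by its label.
    """
    counts = [Counter(tokenize(text)) for text, _ in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[w] for w in vocab] + [label]
            for c, (_, label) in zip(counts, docs)]
    return vocab, rows


def to_arff(vocab, rows, relation="abstracts",
            classes=("breast", "prostate")):
    """Render the dataset as ARFF text that Weka can load."""
    lines = [f"@relation {relation}"]
    lines += [f"@attribute w_{w} numeric" for w in vocab]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines.append("@data")
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"
```

    Calling bag_of_words on the 50 (abstract, label) pairs and writing the result of to_arff to a file gives a dataset that can be opened in Weka. For the clustering in item 6, the class attribute would be ignored or removed before clustering.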

REPORTS AND DUE DATE

  1. Slides & Class Presentation (20 points)
    We will discuss the results from the project during class so you should prepare slides summarizing your findings, and be prepared to give an oral presentation.

    Submit the following file with your slides for your oral report by email to me before 10:00 am the day the project is due:

    [your-lastname]_proj3_slides.[ext]
    where [ext] is pdf, ppt, or pptx. Please use only lower case letters in the file name. For instance, the file with my slides for this project would be named ruiz_proj3_slides.pptx

  2. Written Report
    Hand in a hardcopy of your written report at the beginning of class the day the project is due.