BCB4003/503 CS4803/583 A Term / Fall 2013

BCB4003/503 CS4083/583 Biological and Biomedical Database Mining
Prof. Carolina Ruiz
Problem Set 5 - A term / Fall 2013
Support Vector Machines and Text Mining

DUE DATE: Friday, Oct. 11th, 2013. Slides (by email) by 10 am, and Written Report (hardcopy) at the beginning of class.
** This is an individual problem set **

Problem Set Description
Problem Set Assignment
Report Submission and Due Date

PROBLEM SET DESCRIPTION

The purpose of this project is to:

gain familiarity with Support Vector Machines (SVMs) and their applications in biology and biomedicine.
gain familiarity with text mining and ontologies, and their applications in biology and biomedicine.

PROJECT ASSIGNMENT

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

Materials
Study in detail

Support Vector Machines materials posted on the course webpage.
Text Mining materials posted on the course webpage.
SVMs (100 points)
In this part of the project, we will explore an application of SVMs to splice site recognition. This application is described in the following paper:
Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, Gunnar Rätsch. "Support Vector Machines and Kernels for Computational Biology". PLOS Computational Biology. Vol 4. Issue 10. e1000173. Oct. 2008.
The data used in this paper for the C. elegans model organism is provided by the authors at http://svmcompbio.tuebingen.mpg.de/splicing.html (near the bottom of that webpage).
- (10 points) Read the above paper in great detail (your understanding of the paper will become apparent on both your written and oral reports).
- Datasets:
  - Dataset 1: Use the GC-Content features dataset provided.
  - Use the Sequences data provided. For this part, we will use l-mers as described in the paper. Create separate input datasets containing the following l-mers (yielding 2x(4^l) attributes):
    - Dataset 2: single nucleotides: 8 attributes, one count for each DNA letter (A,C,G,T) on each side of the AG acceptor site.
    - Dataset 3: dimers: 32 attributes, one count for each pair of DNA letters on each side of the AG acceptor site.
    - Dataset 4: 4-mers: 512 attributes, one count for each sequence of 4 DNA letters on each side of the AG acceptor site. (If the data mining method cannot handle the number of attributes, use CfsSubsetEval from the "Select attributes" tab in Weka to select a subset of attributes before training.)
    - Dataset 5: single nucleotides, dimers, and 4-mers: 552 attributes, by combining the 3 datasets above.
- What you need to do:
  For each of the 5 datasets above:
  1. (5 points) Get familiar with the dataset. Include in your report a brief description as well as visualizations of the dataset.
  2. Train several SVMs over each of the 5 datasets listed above, using either Weka or Matlab. In Weka, use "SMO", available under Classify->functions->SMO. To use SMO, the target attribute must be nominal.
    1. (5 points) Use your good judgement to select combinations of parameter values including kernel functions (use PolyKernel, NormalizedPolyKernel, and RBFKernel), and parameters for these kernels.
    2. Use 10-fold cross-validation if at all possible (if not, use 4-fold cross-validation).
    3. (5 points) Include in your report a detailed description of your experiments and results. In particular,
      - report the model, accuracy values, ROC Area, and any other evaluation metrics you deem relevant (including possibly confusion matrices); summarize and present the results in an organized manner.
      - provide visualizations of the results, including but not limited to ROC curves.
    4. (5 points) Include in your report an in-depth analysis of your experiments and results.
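
The l-mer count datasets (Datasets 2 to 5) described above could be built along the following lines. This is only a minimal sketch in Python, not the procedure used in the paper: the function names, the toy sequence, and the `ag_pos` parameter are illustrative assumptions. In particular, the position of the AG dimer within each sequence must be checked against the Sequences data provided, and the counts here use overlapping l-mers.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def lmer_counts(seq, l):
    """Counts of overlapping l-mers in seq, in lexicographic l-mer order.

    l-mers containing letters other than A, C, G, T (e.g. N) are ignored.
    """
    counts = Counter(seq[i:i + l] for i in range(len(seq) - l + 1))
    return [counts["".join(p)] for p in product(ALPHABET, repeat=l)]


def acceptor_features(seq, ag_pos, l):
    """Return 2 * 4**l counts: l-mers upstream of the AG, then downstream.

    ag_pos is the 0-based index of the 'A' of the AG dimer. This is an
    assumption of the sketch; verify it against the provided data.
    """
    seq = seq.upper()
    if seq[ag_pos:ag_pos + 2] != "AG":
        raise ValueError("no AG dimer at the given position")
    upstream, downstream = seq[:ag_pos], seq[ag_pos + 2:]
    return lmer_counts(upstream, l) + lmer_counts(downstream, l)


# Toy sequence (made up for illustration), with the AG at index 4.
toy = "CCTTAGGTA"

# Dataset 2 style: 8 single-nucleotide counts (A, C, G, T on each side).
features = acceptor_features(toy, ag_pos=4, l=1)
print(features)  # [0, 2, 0, 2, 1, 0, 1, 1]

# Dataset 5 style: concatenate l = 1, 2 and 4, giving 8 + 32 + 512 = 552.
combined = [x for l in (1, 2, 4) for x in acceptor_features(toy, ag_pos=4, l=l)]
print(len(combined))  # 552
```

Each feature vector, together with the nominal class label, can then be written out as one row of a CSV or ARFF file for use in Weka or Matlab.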
Text Mining (100 points)
Use PubMed to search for medical abstracts for each of the following two queries:
- breast cancer
- prostate cancer
1. (5 points) Download the top 100 abstracts returned for each query. You can collaborate on the data collection, as long as everyone in the class contributes.
2. (15 points) Create a dataset consisting of these 200 abstracts. We will use the Bag of Words representation. That is, a vector of the words occurring in the 200 documents (minus stop words and other irrelevant words) will be used as the features/attributes. Each document will be represented by the vector of frequencies of each of these words in the document. Add an attribute called class with values "breast cancer" (for the 100 documents obtained for the 1st query) and "prostate cancer" (for the 100 documents obtained for the 2nd query). For transforming the documents to this bag of words representation, you can either write your own code; use Weka (see the StringToWordVector filter in Weka - to get familiar with it, play with the "Reuters" text arff datasets that come with Weka); or use a good, existing software package available to you. Describe in your report what code you used, and cite any resources used. Check the resulting list of words to make sure they are a good selection of words.
3. (40 points) Create classification models over this dataset, using Naive Bayes, Bayesian Nets, and Support Vector Machines. Use 10-fold cross-validation. (If the data mining method cannot handle the number of attributes, use CfsSubsetEval from the "Select attributes" tab in Weka to select a subset of attributes before training.) Include in your report a detailed description and in-depth analysis of your experiments and results. Report accuracy values, ROC Area, and any other evaluation metrics you deem relevant (including possibly confusion matrices). Include visualizations as well.
4. (10 points) What words seem more relevant in discriminating between the two classes in the models constructed above? If these are obvious words like "male", "female", "breast", "prostate", ..., eliminate them from the dataset, repeat the classification experiments, and answer this question again.
5. (10 points) Search for the top 5 words identified above on Gene Ontology. Do you find any relationships between these words?
6. (20 points) Now apply clustering methods (k-means, EM, hierarchical clustering) over this dataset with k=2 without using the class attribute. After varying parameters as needed, how well do the 2 resulting clusters represent the 2 classes? Explain your answer.
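
For students who choose to write their own code for the Bag of Words step, the representation described above might look roughly like this. It is a minimal sketch in Python under stated assumptions: the stop-word list is a tiny illustrative one, the two example abstracts are made up, and the function names are hypothetical. It is not a replacement for Weka's StringToWordVector filter.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); use a fuller list in practice.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "with", "for", "is", "was", "were"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def bag_of_words(docs):
    """docs is a list of (text, class_label) pairs.

    Returns the sorted vocabulary and one row per document holding the
    word frequencies followed by the class label.
    """
    token_lists = [tokenize(text) for text, _ in docs]
    vocabulary = sorted({w for tokens in token_lists for w in tokens})
    rows = []
    for tokens, (_, label) in zip(token_lists, docs):
        counts = Counter(tokens)
        rows.append([counts[w] for w in vocabulary] + [label])
    return vocabulary, rows


# Two made-up abstracts standing in for the 200 PubMed abstracts.
docs = [
    ("Tamoxifen therapy in breast cancer patients.", "breast cancer"),
    ("PSA screening and prostate cancer risk in men.", "prostate cancer"),
]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

The vocabulary printed here is the list of words to inspect, as item 2 asks, before training any model. Removing obvious discriminating words (item 4) amounts to adding them to the stop-word set and rebuilding the rows.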

REPORTS AND DUE DATE

Slides. We will discuss the results from the problem set during class, so you should prepare a few slides summarizing your findings and including any visualizations or graphs you want to share with the rest of the class. Be prepared to give an oral presentation.
Submit the following file with your slides for your oral report by email to me before the deadline:
[your-lastname]_pbmset5_slides.[ext]
where [ext] is pdf, ppt, or pptx. Please use only lowercase letters in the file name. For instance, the file with my slides for this problem set would be named ruiz_pbmset5_slides.pptx
Written Report. Hand in a hardcopy of your written report at the beginning of class the day the problem set is due.
