WPI Worcester Polytechnic Institute


BCB4003/503 CS4083/583 Biological and Biomedical Database Mining
Prof. Carolina Ruiz
Problem Set 5 - A term / Fall 2013
Support Vector Machines and Text Mining

DUE DATE: Friday, Oct. 11th, 2013. Slides (by email) are due by 10 am; the Written Report (hardcopy) is due at the beginning of class.
** This is an individual problem set **



The purpose of this project is to:


Written Report: Your written report should consist of your answers to each of the parts in the assignment below.


  1. Materials
    Study in detail

  2. SVMs (100 points)
    In this part of the project, we will explore an application of SVMs to splice site recognition. This application is described in the following paper:
    Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, Gunnar Rätsch. "Support Vector Machines and Kernels for Computational Biology". PLOS Computational Biology. Vol. 4. Issue 10. e1000173. Oct. 2008.
    The data used in this paper for the C. elegans model organism is provided by the authors at http://svmcompbio.tuebingen.mpg.de/splicing.html (near the bottom of that webpage).
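
    The assignment does not prescribe a particular kernel. As one illustration of the string kernels used with SVMs on sequence data, the sketch below computes a k-mer spectrum kernel: the inner product of the k-mer count vectors of two sequences. The function names, the choice k=3, and the toy sequences are assumptions made for illustration; they are not taken from the C. elegans data set or from the authors' code.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for k-mers that do not occur in b.
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


# Hypothetical toy sequences (assumption), used only to build a small Gram matrix.
seqs = ["ACGTAGGT", "TTAGGTAA", "CCCCCCCC"]
gram = [[spectrum_kernel(s, t) for t in seqs] for s in seqs]
```

    A Gram matrix like `gram` is the kind of input accepted by SVM implementations that support precomputed kernels.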

  3. Text Mining (100 points)
    Use PubMed to search for medical abstracts for each of the following two queries:
    • breast cancer
    • prostate cancer
    1. (5 points) Download the top 100 abstracts returned for each query. You can collaborate on the data collection, as long as everyone in the class contributes.
    2. (15 points) Create a dataset consisting of these 200 abstracts. We will use the Bag of Words representation. That is, a vector of the words occurring in the 200 documents (minus stop words and other irrelevant words) will be used as the features/attributes. Each document will be represented by the vector of frequencies of each of these words in the document. Add an attribute called class with values "breast cancer" (for the 100 documents obtained for the 1st query) and "prostate cancer" (for the 100 documents obtained for the 2nd query). To transform the documents into this bag of words representation, you can write your own code; use Weka (see the StringToWordVector filter in Weka; to get familiar with it, play with the "Reuters" text arff datasets that come with Weka); or use a good, existing software package available to you. Describe in your report what code you used, and cite any resources used. Check the resulting list of words to make sure they are a good selection of words.
    3. (40 points) Create classification models over this dataset, using Naive Bayes, Bayesian Nets, and Support Vector Machines. Use 10-fold cross-validation. (If the data mining method cannot handle the number of attributes, use CfsSubsetEval from the "Select attributes" tab in Weka to select a subset of attributes before training.) Include in your report a detailed description and in-depth analysis of your experiments and results. Report accuracy values, ROC Area, and any other evaluation metrics you deem relevant (including possibly confusion matrices). Include visualizations as well.
    4. (10 points) Which words seem most relevant in discriminating between the two classes in the models constructed above? If these are obvious words like "male", "female", "breast", "prostate", ..., eliminate them from the dataset, repeat the classification experiments, and answer this question again.
    5. (10 points) Search for the top 5 words identified above in the Gene Ontology. Do you find any relationships between these words?
    6. (20 points) Now apply clustering methods (k-means, EM, hierarchical clustering) over this dataset with k=2 without using the class attribute. After varying parameters as needed, how well do the 2 resulting clusters represent the 2 classes? Explain your answer.
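
    For those who choose to write their own code, the following is a minimal sketch of two of the steps above: building word-frequency vectors (step 2) and measuring how well two clusters match the two classes (step 6). It is not the Weka StringToWordVector filter. The stop-word list, the function names, and the two example sentences are assumptions made for illustration, and a real run would use a fuller stop-word list and the 200 downloaded abstracts.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); replace with a fuller list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one word-frequency vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


def cluster_class_agreement(cluster_ids, class_labels):
    """Fraction of documents matched under the better of the two possible
    cluster-to-class assignments. Assumes cluster ids are 0 or 1 and that
    class_labels contains exactly two distinct values."""
    first, _second = sorted(set(class_labels))
    direct = sum((label == first) == (cid == 0)
                 for cid, label in zip(cluster_ids, class_labels))
    swapped = len(class_labels) - direct
    return max(direct, swapped) / len(class_labels)


# Hypothetical example documents (assumption), not real PubMed abstracts.
docs = ["Breast cancer screening in women", "Prostate cancer screening in men"]
vocabulary, vectors = bag_of_words(docs)
```

    Because cluster numbering is arbitrary, `cluster_class_agreement` checks both ways of pairing clusters with classes and reports the better one; a value near 0.5 means the clusters carry little information about the class.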


  1. Slides. We will discuss the results from the problem set during class, so you should prepare a few slides summarizing your findings and including any visualizations or graphs you want to share with the rest of the class. Be prepared to give an oral presentation.

    Submit the following file with your slides for your oral report by email to me before the deadline:

    where: [ext] is pdf, ppt, or pptx. Please use only lowercase letters in the file name. For instance, the file with my slides for this problem set would be named ruiz_pbmset5_slides.pptx

  2. Written Report. Hand in a hardcopy of your written report at the beginning of class the day the problem set is due.