PROJECT ASSIGNMENT
Written Report:
Your written report should consist of your answers to each of the
parts in the assignment below.
Assignment:
- SVMs (100 points)
In this part of the project, we will explore an application of SVMs to splice site recognition. This application is described in the following paper:
Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, Gunnar Rätsch.
"Support Vector Machines and Kernels for Computational Biology".
PLOS Computational Biology. Vol 4. Issue 10. e1000173. Oct. 2008.
The data used in this paper for the C. elegans model organism
is provided by the authors at
http://svmcompbio.tuebingen.mpg.de/splicing.html
- (20 points)
Read the above paper in great detail (your understanding of the paper
will become apparent in both your written and oral reports).
-
Train several SVMs over each of the 5 datasets listed below,
using either Weka or Matlab.
Use your good judgement to select combinations of parameter values
including kernel functions (use PolyKernel, NormalizedPolyKernel, RBFKernel, and StringKernel), and parameters for these kernels.
Use 10-fold cross-validation if at all possible (if not, use
4-fold cross-validation).
Include in your report detailed description and analysis of your
experiments and results.
Report accuracy values, ROC Area, and any other evaluation metrics you deem relevant (including possibly confusion matrices).
- (16 points)
Use the GC-Content features dataset provided.
-
Use the Sequences data provided.
For this part, we will use l-mers as described in the paper.
Create separate input datasets containing the following l-mers (yielding 2x(4^l) attributes):
- (16 points) single nucleotides:
8 attributes, one count for each DNA letter (A,C,G,T)
on each side of the AG acceptor site.
- (16 points) dimers:
32 attributes, one count for each pair of DNA letters
on each side of the AG acceptor site.
- (16 points) 4-mers:
512 attributes, one count for each sequence of 4 DNA letters
on each side of the AG acceptor site.
(If the data mining method cannot handle the number of attributes, use CfsSubsetEval from the "Select attributes" tab in Weka to select a subset of attributes before training.)
- (16 points) single nucleotides, dimers, and 4-mers:
552 attributes, by combining the 3 datasets above.
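
The l-mer datasets described above can be built with a short script. The following is a minimal Python sketch, not the paper's code. It assumes that overlapping l-mers are counted, that l-mers containing symbols other than A, C, G, T are ignored, and that you know the index of the AG acceptor site in each sequence. The names `lmer_counts`, `lmer_features` and `ag_start`, and the toy sequence, are made up for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def lmer_counts(segment, l):
    """Count overlapping l-mers in one segment, in fixed alphabetical order.

    l-mers containing other symbols (e.g. N) are simply not reported.
    """
    counts = Counter(segment[i:i + l] for i in range(len(segment) - l + 1))
    return [counts["".join(p)] for p in product(ALPHABET, repeat=l)]

def lmer_features(sequence, ag_start, l):
    """Return 2 * 4**l counts: l-mers upstream of the AG, then downstream.

    ag_start is the (assumed known) 0-based index of the 'A' of the AG site.
    """
    upstream = sequence[:ag_start]
    downstream = sequence[ag_start + 2:]
    return lmer_counts(upstream, l) + lmer_counts(downstream, l)

# Toy sequence (not from the dataset) with the AG at index 5.
toy = "ACGTTAGGCA"
singles = lmer_features(toy, 5, 1)   # 8 attributes
dimers = lmer_features(toy, 5, 2)    # 32 attributes
four_mers = lmer_features(toy, 5, 4) # 512 attributes
combined = singles + dimers + four_mers  # 552 attributes
print(singles)
print(len(dimers), len(four_mers), len(combined))
```

Each feature vector can then be written out as one row of an ARFF or CSV file, followed by the class label, for loading into Weka or Matlab.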
- Text Mining (100 points)
Use PubMed to search for medical
abstracts for the following two queries:
- breast cancer genes
- prostate cancer genes
- (5 points)
Download the top 25 abstracts returned for each query.
- (15 points)
Create a dataset consisting of these 50 abstracts. We will use the Bag of Words representation. That is, a vector of the words occurring in the 50 documents (minus stop words, and other irrelevant words) will be used as the features/attributes.
Each document will be represented by the vector of frequencies of each of these words in the document. Add an attribute called class with values "breast" (for the
25 documents obtained for the 1st query) and "prostate" (for the
25 documents obtained for the 2nd query).
For transforming the document to this bag of words representation, you can either write your own code, or use a good, existing software package available to you. Check the resulting list of words to make sure they are a good selection of words.
Describe in your report what code you used.
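
If you write your own code, the Bag of Words transformation can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny placeholder, the tokenizer keeps lowercase letters and digits (so that gene names such as BRCA1 survive), and the two example sentences are made up, not real PubMed abstracts. The names `tokenize` and `bag_of_words` are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "for", "with", "are"}

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]

def bag_of_words(docs):
    """docs is a list of (text, label) pairs.

    Returns the sorted vocabulary and one row per document holding the
    word frequencies followed by the class label.
    """
    counts = [Counter(tokenize(text)) for text, _ in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[w] for w in vocabulary] + [label]
            for c, (_, label) in zip(counts, docs)]
    return vocabulary, rows

# Made-up sentences standing in for downloaded abstracts.
docs = [
    ("BRCA1 mutations are linked to breast cancer.", "breast"),
    ("The androgen receptor gene in prostate cancer.", "prostate"),
]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Printing the vocabulary is a convenient way to check the resulting list of words before training.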
- (40 points)
Create classification models over this dataset, using Naive Bayes, Bayesian Nets, and Support Vector Machines. Use 10-fold cross-validation.
(If the data mining method cannot handle the number of attributes, use CfsSubsetEval from the "Select attributes" tab in Weka to select a subset of attributes before training.)
Include in your report detailed description and analysis of your
experiments and results.
Report accuracy values, ROC Area, and any other evaluation metrics you deem relevant (including possibly confusion matrices).
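
Weka reports these metrics for you. To make clear what they measure, here is a minimal Python sketch that computes a confusion matrix, accuracy, and ROC area for a two-class problem. The labels and scores are made up, the function names are hypothetical, and the ROC area is computed as the fraction of positive/negative pairs that the classifier ranks correctly (ties count as one half), which is one standard way to define it.

```python
def confusion_matrix(actual, predicted, positive):
    """Return (TP, FP, FN, TN) treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def roc_area(actual, scores, positive):
    """Probability that a random positive gets a higher score than a random negative."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: scores are a classifier's confidence in the "breast" class.
actual = ["breast", "breast", "prostate", "prostate"]
scores = [0.9, 0.4, 0.6, 0.2]
predicted = ["breast" if s >= 0.5 else "prostate" for s in scores]

print(confusion_matrix(actual, predicted, "breast"))
print(accuracy(actual, predicted))
print(roc_area(actual, scores, "breast"))
```

Note that accuracy depends on the 0.5 decision threshold used here, while the ROC area depends only on how the scores rank the documents.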
- (10 points)
Which words seem most relevant in discriminating between the two classes in the models constructed above? If these are obvious words like "male", "female", "breast", "prostate", ..., eliminate them from the dataset, repeat the classification experiments, and answer this question again.
- (10 points)
Search for the top 5 words identified above on Gene Ontology.
Do you find any relationships between these words?
- (20 points)
Now apply clustering methods (k-means, EM, hierarchical clustering) over this dataset with k=2 without using the class attribute.
After varying parameters as needed, how well do the 2 resulting clusters represent the 2 classes?
Explain your answer.
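
One way to quantify how well 2 clusters represent the 2 classes is to cross-tabulate cluster assignments against class labels and score the better of the two possible cluster-to-class mappings. The sketch below assumes k=2 with both clusters non-empty and both classes present; the cluster assignments are made up, and `cluster_class_table` and `best_agreement` are hypothetical names.

```python
from collections import Counter

def cluster_class_table(clusters, classes):
    """Count documents for each (cluster, class) pair."""
    return Counter(zip(clusters, classes))

def best_agreement(clusters, classes):
    """Fraction of documents matched under the better one-to-one mapping.

    Only valid for exactly 2 clusters and 2 classes.
    """
    c0, c1 = sorted(set(clusters))
    k0, k1 = sorted(set(classes))
    table = cluster_class_table(clusters, classes)
    straight = table[(c0, k0)] + table[(c1, k1)]  # c0 -> k0, c1 -> k1
    swapped = table[(c0, k1)] + table[(c1, k0)]   # c0 -> k1, c1 -> k0
    return max(straight, swapped) / len(classes)

# Made-up cluster assignments for six documents.
clusters = [0, 0, 1, 1, 1, 0]
classes = ["breast", "breast", "prostate", "prostate", "breast", "prostate"]

table = cluster_class_table(clusters, classes)
print(table)
print(best_agreement(clusters, classes))
```

A value near 1.0 means the clusters recover the classes almost perfectly; a value near 0.5 means the clusters are no better than a random split.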