BCB4003 / BCB503 A Term / Fall 2011

BCB4003 / BCB503 Biological and Biomedical Database Mining
Project 2 - A term / Fall 2011

PROF. CAROLINA RUIZ

DUE DATE: Monday, Oct. 3, 2011. Slides (by email) are due by 12 noon; the Written Report (hardcopy) is due at the beginning of class (1:00 pm).
** This is an individual project **

Project Description
Project Assignment
Report Submission and Due Date

PROJECT DESCRIPTION

The purpose of this project is:

To gain familiarity with the EM algorithm, Bayesian Nets, Markov Chains, and Hidden Markov Models and their applications to biological and biomedical data.
To gain familiarity with Matlab's Bioinformatics Toolbox.

PROJECT ASSIGNMENT

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

EM (15 points)
Study the EM_Example provided on the course Lecture Notes webpage. Following the same process (but with 1,000 data points instead of 100), run at least 3 separate experiments with different values for the means and the standard deviations. What happens as the means of the two clusters approach each other? Include the results of your experiments in your report and elaborate on any interesting results and comparisons.
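As a rough illustration of the kind of experiment described above, here is a minimal sketch of EM for a mixture of two one-dimensional Gaussians on 1,000 synthetic points. It is not the course's EM_Example: the function names, the true means and standard deviations, the random seed, and the initialization (smallest and largest data values as starting means) are all assumptions chosen for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, mu, sigma, weight, iterations=100):
    """Run EM for a two-component 1-D Gaussian mixture.

    mu, sigma, and weight are 2-element lists holding the initial guesses.
    """
    mu, sigma, weight = list(mu), list(sigma), list(weight)
    for _ in range(iterations):
        # E-step: probability that each point belongs to component 0.
        resp0 = []
        for x in data:
            p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
            total_p = p0 + p1
            resp0.append(p0 / total_p if total_p > 0 else 0.5)
        # M-step: re-estimate each component from its weighted points.
        for k in (0, 1):
            resp = resp0 if k == 0 else [1.0 - r for r in resp0]
            total = sum(resp)
            mu[k] = sum(r * x for r, x in zip(resp, data)) / total
            var = sum(r * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / total
            sigma[k] = math.sqrt(max(var, 1e-12))  # guard against zero variance
            weight[k] = total / len(data)
    return mu, sigma, weight


# Hypothetical experiment: 500 points from N(0, 1) and 500 from N(5, 1).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]
data += [rng.gauss(5.0, 1.0) for _ in range(500)]

est_mu, est_sigma, est_weight = em_two_gaussians(
    data, mu=[min(data), max(data)], sigma=[1.0, 1.0], weight=[0.5, 0.5]
)
print("means:", est_mu)
print("standard deviations:", est_sigma)
print("mixing weights:", est_weight)
```

Changing the two true means in the data-generation lines (for example, moving the second mean from 5.0 toward 0.0) is one way to set up the comparison across experiments that the assignment asks for.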
Bayesian Models (30 points)
For this part of the project, we will use the same dataset that we used in Project 1. The only dataset attributes that you should define as "numeric" in your .arff file are age, size, t.tdm, t.rfs, t.os, t.dmfs, NPI, and AOL_os_10y. All other attributes are nominal and should be defined as such in your .arff file. For instance, the nominal attribute Histtype should be defined as: @attribute Histtype {1,2,3,4,5,6,7}
We will use Weka's Naive Bayes and Bayesian Net classifiers to construct models for this dataset. Assume that the classification target is "veridex_risk". During model construction, use the %split test option, with 90% split. That is, 90% of randomly selected data instances from the dataset are used to construct the model and the remaining 10% of the data instances are used to test the model.
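To make the role of the random seed concrete, the following sketch shows a seeded 90%/10% split in plain Python. It only illustrates the idea: the function name `percentage_split` is hypothetical, and Python's random number generator is not the one Weka uses, so the same seed will not select the same instances as Weka does.

```python
import random


def percentage_split(instances, train_fraction=0.9, seed=1):
    """Shuffle with a fixed seed, then cut into training and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in data: instance indices 0..99 instead of real dataset rows.
for seed in (1, 23, 62):
    train, test = percentage_split(range(100), train_fraction=0.9, seed=seed)
    print(seed, len(train), len(test), sorted(test))
```

Because each seed puts different instances in the 10% test set, some variation in accuracy and in the confusion matrix across seeds is to be expected.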
1. (10 points) Construct Naive Bayes models of the dataset using 90% split. Click on "More options ..." to select "Output predictions" (choose, say, plain text) and to choose a value for the Random seed (initially use value = 1). Repeat the experiment 3 times with seeds 1, 23, 62. For each of the 3 experiments, record in your report the conditional probability values output by Weka under "Naive Bayes Classifier", the accuracy, and the confusion matrix obtained with the model over the 10% test set. Report any interesting observations about the results of each experiment and across the 3 experiments.
2. (15 points) Construct Bayesian Nets over this dataset using 90% split. Use K2 as the algorithm to construct the topology of the Bayesian net. Vary the values of the "initAsNaiveBayes", "maxNrOfParents", and "randomOrder" parameters of the K2 algorithm. For each experiment, include in your report the obtained Bayesian net (right-click on the experiment in the left-hand side window, and select "Visualize graph"), the accuracy, the confusion matrix, and any interesting observations on the conditional probability tables (click on the nodes in the graph visualization). Report any interesting observations about the results of each experiment and across the 3 experiments.
3. (5 points) Compare the results obtained with Bayesian Nets and with Naive Bayes.
Markov Chains and Hidden Markov Models (150 points + 100 extra credit points)
1. (50 points) Using the fair/loaded coin example discussed in class (see slide 14 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial), where there are 2 hidden states C (fair coin) and D (loaded coin), each one producing H (heads) or T (tails), together with the following probabilities:
```
Transition probabilities:
    C     D
C  0.9   0.1
D  0.1   0.9

Emission probabilities:

    H     T
C  0.5   0.5
D  0.75  0.25
```
  Assume that both hidden states are equally likely to be the initial state. Represent this by including a "fake" Start state that has no emissions, and has one transition to C and one transition to D, each one with 0.5 probability. [For this problem, it would be very useful for you to explore all the resources on Hidden Markov Models posted on the course webpage.]
  Follow the Forward and the Backward algorithms by hand for the following observed sequence x = TTHH. Show your work and record intermediate results of the dynamic programming algorithms in tables F and B, as the algorithms would. Note that:
  - the F table would have an additional column 0, and additional row 0, corresponding to the fake Start state.
  - as discussed in class, multiplying small probabilities can create underflow errors. If you do run into underflow errors, redo your calculations in log₂ space (see Prof. Kellis' Algorithms for Computational Biology course (MIT) lecture notes and Prof. Mneimneh's Computational Biology course (Hunter College) lecture notes), or use scaling (see Rabiner's tutorial Section V).
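  In log₂ space, products of probabilities become sums of logs, but the sums inside the Forward and Backward recurrences need a helper that adds two probabilities given only their logs. A minimal sketch follows; the function name `log2_add` is hypothetical and is not taken from the cited lecture notes.

```python
import math


def log2_add(a, b):
    """Return log2(2**a + 2**b) without converting back to probabilities."""
    if a == -math.inf:  # log2(0): the first probability is zero
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    # Factor out the larger term so the exponent (lo - hi) is never positive.
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


# Adding 0.25 and 0.125 in log2 space gives log2(0.375).
print(log2_add(math.log2(0.25), math.log2(0.125)), math.log2(0.375))

# Two probabilities of 2**-2000 each would underflow as ordinary floats,
# but their sum is representable as a log2 value.
print(log2_add(-2000.0, -2000.0))
```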
  Once those tables have been completed, calculate the probability that the 3rd hidden state visited (i.e., the state that produced the leftmost H) was C (the fair coin). That is, calculate:
  p(s₃= C | TTHH) = ?
  Remember that p(s₃= C | TTHH) = p(TTHH, s₃=C)/p(TTHH). Don't forget to divide by p(TTHH), whose value you can easily calculate from the Forward table.
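  One way to check a hand calculation is a short script that fills the same F and B tables. The sketch below is an illustration under stated assumptions, not a prescribed solution: the names `COIN_STATES`, `START`, `TRANS`, `EMIT`, `forward`, `backward`, and `posterior` are hypothetical, and the fake Start state is folded into the initial probabilities rather than stored as row and column 0 of the table.

```python
COIN_STATES = ("C", "D")
START = {"C": 0.5, "D": 0.5}  # transitions out of the fake Start state
TRANS = {"C": {"C": 0.9, "D": 0.1}, "D": {"C": 0.1, "D": 0.9}}
EMIT = {"C": {"H": 0.5, "T": 0.5}, "D": {"H": 0.75, "T": 0.25}}


def forward(obs):
    """F table: f[i][s] = p(x_1..x_{i+1}, state at step i+1 is s)."""
    f = [{s: START[s] * EMIT[s][obs[0]] for s in COIN_STATES}]
    for symbol in obs[1:]:
        prev = f[-1]
        f.append({
            s: EMIT[s][symbol] * sum(prev[r] * TRANS[r][s] for r in COIN_STATES)
            for s in COIN_STATES
        })
    return f


def backward(obs):
    """B table: b[i][s] = p(x_{i+2}..x_n | state at step i+1 is s)."""
    b = [{s: 1.0 for s in COIN_STATES}]
    for symbol in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {
            s: sum(TRANS[s][r] * EMIT[r][symbol] * nxt[r] for r in COIN_STATES)
            for s in COIN_STATES
        })
    return b


def posterior(obs, position, state):
    """p(s_position = state | obs), with position counted from 1."""
    f, b = forward(obs), backward(obs)
    p_obs = sum(f[-1].values())  # p(obs), read off the last F column
    return f[position - 1][state] * b[position - 1][state] / p_obs


x = "TTHH"
for column in forward(x):
    print(column)
for column in backward(x):
    print(column)
print("p(x) =", sum(forward(x)[-1].values()))
print("p(s3 = C | x) =", posterior(x, 3, "C"))
```

  The script multiplies raw probabilities, which is safe for a sequence of length 4; for long sequences the log₂ or scaling approach mentioned above would be needed.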
2. (100 points + 100 extra credit points) This part of the assignment is based on a homework assignment from Prof. Subramanian's "From Sequence to Structure: An Introduction to Computational Biology" course (Rice Univ.). Take a look at Prof. Subramanian's useful Markov models and HMMs Matlab demos.
  Follow the instructions in the homework assignment, starting on page 2 (you do NOT need to work on parts (a)-(d) on page 1). Include in your written report answers to parts (a)-(g) on pages 3-4. Credit points are as follows: (a)-(b) 10 points each; (c)-(f) 15 points each; (g) 20 points. Please note that the data files have been changed since the date of the above assignment (2009). Current information (as of Sept. 2011) is included below.
  You can download all the needed data files from http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=22. For simplicity, I include the files below (current as of Sept. 2011):
  - ncbimapBuild_37.2_22_cpg_0K.txt: CpG island locations on the human chromosome 22.
    (Taken from the NCBI's CpG island for human chromosome 22 webpage. Also accessible from "CpG island" on the NCBI's human chromosome 22 webpage.)
  - Human chromosome 22's contigs 1, 2, and 3:
    (You can find links to the original files by looking for "Contigs" on the NCBI's human chromosome 22 webpage.)
```
Region Displayed: 0-51M bp Download/View Sequence/Evidence Download Data
Total Contigs On Chromosome: 4
Contigs in Region: 0
	start 		stop		Symbol		O
	16050001 	16697850	NT_028395.3	+
	16847851 	20509431	NT_011519.10	+
	20609432 	50364777	NT_011520.12	+
	50414778 	51244566	NT_011526.7	+
```
  Extra Credit (100 points): Repeat the same steps above, but using an HMM with 8 hidden states: A+, C+, G+, T+, A-, C-, G-, T-, where the "+" states represent the nucleotides in a CpG island, and the "-" states represent the nucleotides in regular DNA. Each state emits only the corresponding nucleotide. That is, A+ and A- emit A; C+ and C- emit C; etc. Include transitions from each of the states to all the other 7 states.
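  The structure of the 8-state model can be sketched as follows. This is only an illustration under assumptions: the names `CPG_STATES`, `EMISSION`, and `estimate_transitions` are hypothetical, the toy sequence and island labels are made up rather than taken from the chromosome 22 files, and estimating transitions by counting over labeled positions with a pseudocount is one possible choice, not necessarily the procedure in the referenced homework.

```python
NUCLEOTIDES = "ACGT"
# A+, C+, G+, T+ (inside a CpG island), then A-, C-, G-, T- (regular DNA).
CPG_STATES = [n + sign for sign in "+-" for n in NUCLEOTIDES]

# Each state emits only its own nucleotide, with probability 1.
EMISSION = {
    s: {n: 1.0 if s[0] == n else 0.0 for n in NUCLEOTIDES} for s in CPG_STATES
}


def estimate_transitions(sequence, in_island, pseudocount=1.0):
    """Estimate an 8x8 transition table from a labeled sequence.

    in_island[i] is True when sequence[i] lies inside a CpG island.
    The pseudocount keeps every transition, including self-transitions,
    strictly positive.
    """
    counts = {a: {b: pseudocount for b in CPG_STATES} for a in CPG_STATES}
    labels = [n + ("+" if flag else "-") for n, flag in zip(sequence, in_island)]
    for prev, curr in zip(labels, labels[1:]):
        counts[prev][curr] += 1.0
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


# Made-up toy input: positions 1 through 6 are labeled as a CpG island.
toy_sequence = "ACGCGCGTATAT"
toy_in_island = [False] + [True] * 6 + [False] * 5
transitions = estimate_transitions(toy_sequence, toy_in_island)
print(CPG_STATES)
print(transitions["C+"])
```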

REPORTS AND DUE DATE

Slides & Class Presentation (10 points)
We will discuss the results from the project during class so you should prepare slides summarizing your findings, and be prepared to give an oral presentation.
Submit the following file with your slides for your oral report by email to me before 12:00 noon the day the project is due (that is, at least 1 hour before class):
[your-lastname]__proj2_slides.[ext]
where [ext] is pdf, ppt, or pptx. Please use only lower case letters in the file name. For instance, the file with my slides for this project would be named ruiz_proj2_slides.pptx
Written Report
Hand in a hardcopy of your written report at the beginning of class the day the project is due.
