BCB4003/503 CS4803/583 A Term / Fall 2013

BCB4003/503 CS4083/583 Biological and Biomedical Database Mining
Prof. Carolina Ruiz
Problem Set 4 - A term / Fall 2013
Markov Chains and Hidden Markov Models

DUE DATE: Friday, Oct. 4th, 2013 Slides (by email) by 10 am and Written Report (hardcopy) at the beginning of class
** This is an individual problem set **

Problem Set Description
Problem Set Assignment
Report Submission and Due Date

PROBLEM SET DESCRIPTION

The purpose of this project is to:

gain familiarity with Markov chains and hidden Markov models, and their applications to biological and biomedical data.
gain familiarity with Matlab's Bioinformatics Toolbox.

PROBLEM SET ASSIGNMENT

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

Materials

Study in detail the Markov models materials posted on the course webpage.
Read Chapter 3 from Durbin, Eddy, Krogh, and Mitchison. "Biological Sequence Analysis". Cambridge University Press. 1998. I placed this book on reserve in the Gordon Library.
Learn more about CpG Islands by researching this topic online.
Matlab:

For this assignment you need both the Statistics and the Bioinformatics Matlab toolboxes. To see if these toolboxes are available in the Matlab installation that you are using, type "ver" on Matlab's command window. If the Bioinformatics toolbox is not listed, use remote desktop connection from your PC to windows terminal servers sunfire3.wpi.edu, sunfire4.wpi.edu or sunfire5.wpi.edu. The version of Matlab installed on those servers contains the Bioinformatics toolbox.
In case they are helpful, you can search for Bioinformatic toolbox webinars on the MathWorks Recorded Webinars webpage (see the "Refine by Product" menu on the left hand side of that webpage).
Markov Chains and Hidden Markov Models
1. (5 points) Consider the Markov Chain of the rain/no-rain example discussed in class (see slides 4-5 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial), where there are 2 states Y (rain) and N (no rain) together with the following transition probabilities:
```
Transition probabilities:
    Y     N
Y  0.4   0.6
N  0.2   0.8
```
  Assume that the probability that it rains on a randomly selected day of the year is 30%. That is, p(Y)=0.3.
  1. (3 points) Calculate the probability that in 4 consecutive days, it rains on days 2 and 4 and it doesn't rain on days 1 and 3. That is, calculate p(NYNY).
  2. (2 points) Calculate the probability that in 4 consecutive days, it doesn't rain on day 4, given that it did't rain on day 1 but it rained on days 2 and 3. That is, calculate p(N|NYY).
2. (25 points) Consider hidden Markov model of the fair/loaded coin (sometimes called the "dishonest casino") example discussed in class (see slide 14 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial), where there are 2 hidden states C (fair coin) and D (loaded coin), each one producing H (heads) or T (tails), together with the following probabilities:
```
Transition probabilities:
    C     D
C  0.9   0.1
D  0.1   0.9

Emission probabilities:

    H     T
C  0.5   0.5
D  0.75  0.25
```
  Assume that both hidden states are equally likely to be the initial state. Represent this by including a "fake" Start state that has no emissions, and has one transition to C and one transition to D, each one with 0.5 probability.
  1. (20 points) Follow the Forward and the Backward algorithms by hand for the following observed sequence x = TTHH. Show your work and record intermediate results of the dynamic programming algorithms in tables F (Forward) and B (Backward), as the algorithms would. Note that:
    - the F table would have an additional column 0, and additional row 0, corresponding to the fake Start state.
    - as discussed in class, multiplying small probabilities can create underflow errors. If you do run into underflow errors, redo your calculations in log₂ space (see Prof. Kellis' Algorithms for Computational Biology course (MIT) lecture notes and Prof. Mneimneh's Computational Biology course (Hunter College) lecture notes), or scaling (see Rabiner's tutorial Section V).
  2. (5 points) Once that those tables have been completed, calculate the probability that the 3rd hidden state visited (i.e., the state that produced the leftmost H) was C (the fair coin). That is, calculate:
    p(s₃= C | TTHH) = ?
    Remember that p(s₃= C | TTHH) = p(TTHH, s₃=C)/p(TTHH). Don't forget to divide by p(TTHH), whose value you can easily calculate from the Forward table.
3. (100 points) This part of the assignment is based on a homework assignment from Prof. Subramanian's "From Sequence to Structure: An Introduction to Computational Biology" course (Rice Univ.). Take a look at Prof. Subramanian's useful Markov models and HMMs Matlab demos.
  - What you need to do:
    Follow the instructions in Prof. Subramanian's homework assignment. Include in your written report answers to all the questions in that homework assignment. Credit points are as shown in that assignment.
    - See Prof. Subramanian's slides on Markov models
    - See Prof. Subramanian's slides on hidden Markov models pages 63-87.
  - Data Files:
    Note that the data files have been updated since the date of Prof. Subramanian's assignment (2009).
    - Human chromosome 22 contigs:
      Below are the start, stop, and Symbol information for the 4 contigs in the human chromosome 22 (as of Sept. 2013), taken from the NCBI's human chromosome 22 webpage:
```
Region Displayed: 0-51M bp Download/View Sequence/Evidence Download Data
Total Contigs On Chromosome: 4
Contigs in Region: 0
	start 		stop		Symbol		O
	16050001 	16697850	NT_028395.3	+
	16847851 	20509431	NT_011519.10	+
	20609432 	50364777	NT_011520.12	+
	50414778 	51244566	NT_011526.7	+
```
      - You can upload any of these contigs onto Matlab using the "getgenbank" command as described in the homework assignment. Matlab will fetch the correct NCBI file by itself.
      - If you'd like to look at the original files online, you can find them at by looking for "Contigs" on the NCBI's human chromosome 22 webpage.
    - CpG island locations on the human chromosome 22:
      These locations are contained in the file "ncbimapAnnotation Release 104_22_cpg_0K.txt", which specifies the start and stop location of each CpG island.
      - You can download this file by clicking on the "Download Data" link next to "CpG Island" on the NCBI's CpG island for human chromosome 22 webpage.
      - You can upload these start-stop location pairs onto Matlab using the "[cpgStart,cpgStop]=textread('ncbimapAnnotation Release 104_22_cpg_0K.txt','%d %d','headerlines',6)" command as described in the homework assignment. (Note that you can use the file name as is, without using backslashes '\'.)
4. Extra Credit (50 points): Repeat steps (d) through (g) of the HW assignment, but using a HMM with 8 hidden states: A+, C+, G+, T+, A-, C-, G-, T- where the "+" states represent the nucleotides in a CpG island, and the "-" states represent the nucleotides in regular DNA. Each state emits only the corresponding nucleotide. That is, A+ and A- emit A; C+ and C- emit C; etc. Include transitions from each of the states to all the other 7 states.

REPORTS AND DUE DATE

Slides. We will discuss the results from the problem set during class so you should prepare a few slides summarizing your findings and including any visualizations or graphs you want to share with the rest of the class. Be prepared to give an oral presentation.
Submit the following file with your slides for your oral report by email to me before the deadline:
[your-lastname]__pbmset4_slides.[ext]
where: [ext] is pdf, ppt, or pptx. Please use only lower case letters in the name file. For instance, the file with my slides for this problem set would be named ruiz_pbmset4_slides.pptx
Written Report. Hand in a hardcopy of your written report at the beginning of class the day the problem set is due.

BCB4003/503 CS4083/583 Biological and Biomedical Database Mining Prof. Carolina Ruiz Problem Set 4 - A term / Fall 2013 Markov Chains and Hidden Markov Models

PROBLEM SET DESCRIPTION

PROBLEM SET ASSIGNMENT

REPORTS AND DUE DATE

BCB4003/503 CS4083/583 Biological and Biomedical Database Mining
Prof. Carolina Ruiz
Problem Set 4 - A term / Fall 2013
Markov Chains and Hidden Markov Models