Prof. Carolina Ruiz

Problem Set 4 - A term / Fall 2013

Markov Chains and Hidden Markov Models

- gain familiarity with
Markov chains and hidden Markov models,
and their applications to biological and biomedical data.
- gain familiarity with Matlab's Bioinformatics Toolbox.

**Written Report:**
Your written report should consist of your answers to each of the
parts in the assignment below.

**Assignment**:

**Materials**

- Study in detail the Markov models materials posted on the course webpage.
- Read Chapter 3 from Durbin, Eddy, Krogh, and Mitchison. "Biological Sequence Analysis". Cambridge University Press. 1998. I placed this book on reserve in the Gordon Library.
- Learn more about CpG Islands by researching this topic online.
**Matlab:**- For this assignment you need both the Statistics and the Bioinformatics Matlab toolboxes. To see if these toolboxes are available in the Matlab installation that you are using, type "ver" on Matlab's command window. If the Bioinformatics toolbox is not listed, use remote desktop connection from your PC to windows terminal servers sunfire3.wpi.edu, sunfire4.wpi.edu or sunfire5.wpi.edu. The version of Matlab installed on those servers contains the Bioinformatics toolbox.
- In case they are helpful, you can search for Bioinformatic toolbox webinars on the MathWorks Recorded Webinars webpage (see the "Refine by Product" menu on the left hand side of that webpage).

**Markov Chains and Hidden Markov Models**

**(5 points)**Consider the Markov Chain of the rain/no-rain example discussed in class (see slides 4-5 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial), where there are 2 states Y (rain) and N (no rain) together with the following transition probabilities:Transition probabilities: Y N Y 0.4 0.6 N 0.2 0.8

Assume that the probability that it rains on a randomly selected day of the year is 30%. That is, p(Y)=0.3.- (3 points) Calculate the probability that in 4 consecutive days, it rains on days 2 and 4 and it doesn't rain on days 1 and 3. That is, calculate p(NYNY).
- (2 points) Calculate the probability that in 4 consecutive days, it doesn't rain on day 4, given that it did't rain on day 1 but it rained on days 2 and 3. That is, calculate p(N|NYY).

**(25 points)**Consider hidden Markov model of the fair/loaded coin (sometimes called the "dishonest casino") example discussed in class (see slide 14 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial), where there are 2 hidden states C (fair coin) and D (loaded coin), each one producing H (heads) or T (tails), together with the following probabilities:Transition probabilities: C D C 0.9 0.1 D 0.1 0.9 Emission probabilities: H T C 0.5 0.5 D 0.75 0.25

Assume that both hidden states are equally likely to be the initial state. Represent this by including a "fake" Start state that has no emissions, and has one transition to C and one transition to D, each one with 0.5 probability.- (20 points)
Follow the Forward and the Backward algorithms by hand for the following observed sequence
**x**= TTHH. Show your work and record intermediate results of the dynamic programming algorithms in tables F (Forward) and B (Backward), as the algorithms would. Note that:- the F table would have an additional column 0, and additional row 0, corresponding to the fake Start state.
- as discussed in class, multiplying small probabilities can create underflow errors.
If you do run into underflow errors, redo your calculations in
log
_{2}space (see Prof. Kellis' Algorithms for Computational Biology course (MIT) lecture notes and Prof. Mneimneh's Computational Biology course (Hunter College) lecture notes), or scaling (see Rabiner's tutorial Section V).

- (5 points)
Once that those tables have been completed, calculate the probability that the
3rd hidden state visited (i.e., the state that produced the leftmost H)
was C (the fair coin). That is, calculate:
p(s

Remember that p(s_{3}= C | TTHH) = ?_{3}= C | TTHH) = p(TTHH, s_{3}=C)/p(TTHH). Don't forget to divide by p(TTHH), whose value you can easily calculate from the Forward table.

- (20 points)
Follow the Forward and the Backward algorithms by hand for the following observed sequence
**(100 points)**This part of the assignment is based on a homework assignment from Prof. Subramanian's "From Sequence to Structure: An Introduction to Computational Biology" course (Rice Univ.). Take a look at Prof. Subramanian's useful Markov models and HMMs Matlab demos.**What you need to do:**

Follow the instructions in Prof. Subramanian's homework assignment. Include in your written report answers to all the questions in that homework assignment. Credit points are as shown in that assignment.**Data Files:**

Note that the data files have been updated since the date of Prof. Subramanian's assignment (2009).**Human chromosome 22 contigs**:

Below are the start, stop, and Symbol information for the 4 contigs in the human chromosome 22 (as of Sept. 2013), taken from the NCBI's human chromosome 22 webpage:Region Displayed: 0-51M bp Download/View Sequence/Evidence Download Data Total Contigs On Chromosome: 4 Contigs in Region: 0 start stop Symbol O 16050001 16697850 NT_028395.3 + 16847851 20509431 NT_011519.10 + 20609432 50364777 NT_011520.12 + 50414778 51244566 NT_011526.7 +

- You can upload any of these contigs onto Matlab using the "getgenbank" command as described in the homework assignment. Matlab will fetch the correct NCBI file by itself.
- If you'd like to look at the original files online, you can find them at by looking for "Contigs" on the NCBI's human chromosome 22 webpage.

**CpG island locations on the human chromosome 22:**

These locations are contained in the file "ncbimapAnnotation Release 104_22_cpg_0K.txt", which specifies the start and stop location of each CpG island.- You can download this file by clicking on the "Download Data" link next to "CpG Island" on the NCBI's CpG island for human chromosome 22 webpage.
- You can upload these start-stop location pairs onto Matlab using the "[cpgStart,cpgStop]=textread('ncbimapAnnotation Release 104_22_cpg_0K.txt','%d %d','headerlines',6)" command as described in the homework assignment. (Note that you can use the file name as is, without using backslashes '\'.)

**Extra Credit (50 points):**Repeat steps (d) through (g) of the HW assignment, but using a HMM with 8 hidden states: A+, C+, G+, T+, A-, C-, G-, T- where the "+" states represent the nucleotides in a CpG island, and the "-" states represent the nucleotides in regular DNA. Each state emits only the corresponding nucleotide. That is, A+ and A- emit A; C+ and C- emit C; etc. Include transitions from each of the states to all the other 7 states.

**Slides.**We will discuss the results from the problem set during class so you should prepare a few slides summarizing your findings and including any visualizations or graphs you want to share with the rest of the class. Be prepared to give an oral presentation.Submit the following file with your slides for your oral report by email to me before the deadline:

[your-lastname]__pbmset4_slides.[ext]

where: [ext] is pdf, ppt, or pptx. Please use only lower case letters in the name file. For instance, the file with my slides for this problem set would be named ruiz_pbmset4_slides.pptx**Written Report.**Hand in a hardcopy of your written report at the beginning of class the day the problem set is due.