### BCB4003/503 CS4083/583 Biological and Biomedical Database MiningProf. Carolina Ruiz Problem Set 4 - A term / Fall 2013 Markov Chains and Hidden Markov Models

DUE DATE: Friday, Oct. 4th, 2013 Slides (by email) by 10 am and Written Report (hardcopy) at the beginning of class
** This is an individual problem set **

#### PROBLEM SET DESCRIPTION

The purpose of this project is to:

• gain familiarity with Markov chains and hidden Markov models, and their applications to biological and biomedical data.

• gain familiarity with Matlab's Bioinformatics Toolbox.

#### PROBLEM SET ASSIGNMENT

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

• Materials

• Markov Chains and Hidden Markov Models

1. (5 points) Consider the Markov Chain of the rain/no-rain example discussed in class (see slides 4-5 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial), where there are 2 states Y (rain) and N (no rain) together with the following transition probabilities:
```Transition probabilities:
Y     N
Y  0.4   0.6
N  0.2   0.8
```
Assume that the probability that it rains on a randomly selected day of the year is 30%. That is, p(Y)=0.3.
1. (3 points) Calculate the probability that in 4 consecutive days, it rains on days 2 and 4 and it doesn't rain on days 1 and 3. That is, calculate p(NYNY).
2. (2 points) Calculate the probability that in 4 consecutive days, it doesn't rain on day 4, given that it did't rain on day 1 but it rained on days 2 and 3. That is, calculate p(N|NYY).

2. (25 points) Consider hidden Markov model of the fair/loaded coin (sometimes called the "dishonest casino") example discussed in class (see slide 14 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial), where there are 2 hidden states C (fair coin) and D (loaded coin), each one producing H (heads) or T (tails), together with the following probabilities:
```Transition probabilities:
C     D
C  0.9   0.1
D  0.1   0.9

Emission probabilities:

H     T
C  0.5   0.5
D  0.75  0.25
```
Assume that both hidden states are equally likely to be the initial state. Represent this by including a "fake" Start state that has no emissions, and has one transition to C and one transition to D, each one with 0.5 probability.

1. (20 points) Follow the Forward and the Backward algorithms by hand for the following observed sequence x = TTHH. Show your work and record intermediate results of the dynamic programming algorithms in tables F (Forward) and B (Backward), as the algorithms would. Note that:

2. (5 points) Once that those tables have been completed, calculate the probability that the 3rd hidden state visited (i.e., the state that produced the leftmost H) was C (the fair coin). That is, calculate:
p(s3= C | TTHH) = ?
Remember that p(s3= C | TTHH) = p(TTHH, s3=C)/p(TTHH). Don't forget to divide by p(TTHH), whose value you can easily calculate from the Forward table.

3. (100 points) This part of the assignment is based on a homework assignment from Prof. Subramanian's "From Sequence to Structure: An Introduction to Computational Biology" course (Rice Univ.). Take a look at Prof. Subramanian's useful Markov models and HMMs Matlab demos.

• What you need to do:
Follow the instructions in Prof. Subramanian's homework assignment. Include in your written report answers to all the questions in that homework assignment. Credit points are as shown in that assignment.

• Data Files:
Note that the data files have been updated since the date of Prof. Subramanian's assignment (2009).

• Human chromosome 22 contigs:
Below are the start, stop, and Symbol information for the 4 contigs in the human chromosome 22 (as of Sept. 2013), taken from the NCBI's human chromosome 22 webpage:
```Region Displayed: 0-51M bp Download/View Sequence/Evidence Download Data
Total Contigs On Chromosome: 4
Contigs in Region: 0
start 		stop		Symbol		O
16050001 	16697850	NT_028395.3	+
16847851 	20509431	NT_011519.10	+
20609432 	50364777	NT_011520.12	+
50414778 	51244566	NT_011526.7	+
```
• You can upload any of these contigs onto Matlab using the "getgenbank" command as described in the homework assignment. Matlab will fetch the correct NCBI file by itself.
• If you'd like to look at the original files online, you can find them at by looking for "Contigs" on the NCBI's human chromosome 22 webpage.

• CpG island locations on the human chromosome 22:
These locations are contained in the file "ncbimapAnnotation Release 104_22_cpg_0K.txt", which specifies the start and stop location of each CpG island.

• You can upload these start-stop location pairs onto Matlab using the "[cpgStart,cpgStop]=textread('ncbimapAnnotation Release 104_22_cpg_0K.txt','%d %d','headerlines',6)" command as described in the homework assignment. (Note that you can use the file name as is, without using backslashes '\'.)

4. Extra Credit (50 points): Repeat steps (d) through (g) of the HW assignment, but using a HMM with 8 hidden states: A+, C+, G+, T+, A-, C-, G-, T- where the "+" states represent the nucleotides in a CpG island, and the "-" states represent the nucleotides in regular DNA. Each state emits only the corresponding nucleotide. That is, A+ and A- emit A; C+ and C- emit C; etc. Include transitions from each of the states to all the other 7 states.

#### REPORTS AND DUE DATE

1. Slides. We will discuss the results from the problem set during class so you should prepare a few slides summarizing your findings and including any visualizations or graphs you want to share with the rest of the class. Be prepared to give an oral presentation.

Submit the following file with your slides for your oral report by email to me before the deadline:

[your-lastname]__pbmset4_slides.[ext]
where: [ext] is pdf, ppt, or pptx. Please use only lower case letters in the name file. For instance, the file with my slides for this problem set would be named ruiz_pbmset4_slides.pptx

2. Written Report. Hand in a hardcopy of your written report at the beginning of class the day the problem set is due.