DUE DATE: Friday, Oct. 4th, 2013. Slides (by email) are due by 10 am, and the Written Report (hardcopy) is due at the beginning of class.
** This is an individual problem set **
PROBLEM SET DESCRIPTION
The purpose of this project is to:
- gain familiarity with Markov chains and hidden Markov models,
and their applications to biological and biomedical data.
- gain familiarity with Matlab's Bioinformatics Toolbox.
PROBLEM SET ASSIGNMENT
Written Report:
Your written report should consist of your answers to each of the
parts in the assignment below.
Assignment:
- Materials
- Study in detail the
Markov models materials posted on the course webpage.
- Read Chapter 3 from
Durbin, Eddy, Krogh, and Mitchison.
"Biological Sequence Analysis". Cambridge University Press. 1998.
I placed this book on reserve in the Gordon Library.
- Learn more about CpG Islands by researching this topic online.
- Matlab:
- For this assignment you need both
the Statistics and
the Bioinformatics
Matlab toolboxes.
To see if these toolboxes are available in the Matlab
installation that you are using, type "ver" on Matlab's command window.
If the Bioinformatics toolbox is not listed,
use a remote desktop connection from your PC
to one of the Windows terminal servers sunfire3.wpi.edu, sunfire4.wpi.edu, or
sunfire5.wpi.edu.
The version of Matlab installed on those servers contains the Bioinformatics toolbox.
- In case they are helpful,
you can search for Bioinformatics Toolbox webinars on the
MathWorks Recorded Webinars webpage
(see the "Refine by Product" menu on the left hand side of that webpage).
- Markov Chains and Hidden Markov Models
- (5 points)
Consider the Markov Chain of the rain/no-rain example
discussed in class
(see slides 4-5 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial),
where there are 2 states Y (rain) and N (no rain)
together with the following transition probabilities:
Transition probabilities:
      Y     N
Y    0.4   0.6
N    0.2   0.8
Assume that the probability that it rains on a randomly
selected day of the year is 30%. That is, p(Y)=0.3.
- (3 points) Calculate the probability that in 4 consecutive days,
it rains on days 2 and 4 and it doesn't rain on days 1 and 3.
That is, calculate p(NYNY).
- (2 points) Calculate the probability that in 4 consecutive days,
it doesn't rain on day 4, given that it didn't rain on day 1 but it
rained on days 2 and 3.
That is, calculate p(N|NYY).
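As a rough illustration of the calculations involved, the following Python sketch multiplies an initial probability by successive transition probabilities, and uses the Markov property for conditional probabilities. The function names are hypothetical, and p(N)=0.7 is an assumption derived as the complement of the given p(Y)=0.3. The example sequences are deliberately different from the ones asked for above.

```python
# Transition probabilities from the problem statement: TRANS[previous][next].
TRANS = {"Y": {"Y": 0.4, "N": 0.6},
         "N": {"Y": 0.2, "N": 0.8}}

# p(Y) = 0.3 is given; p(N) = 0.7 is assumed here as its complement.
INIT = {"Y": 0.3, "N": 0.7}


def sequence_probability(states):
    """Probability of a state sequence such as "YYN" under the chain."""
    prob = INIT[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= TRANS[prev][nxt]
    return prob


def next_state_probability(history, nxt):
    """p(next | history): by the Markov property only the last state matters."""
    return TRANS[history[-1]][nxt]


# p(YYN) = p(Y) * p(Y|Y) * p(N|Y) = 0.3 * 0.4 * 0.6
print(sequence_probability("YYN"))
# p(Y | NN) depends only on the most recent day, N
print(next_state_probability("NN", "Y"))
```

Checking a hand calculation against a sketch like this is reasonable, but the written report should still show the intermediate products.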
- (25 points)
Consider the hidden Markov model of the fair/loaded coin
(sometimes called the "dishonest casino") example discussed in class
(see slide 14 of Ydo Wexler & Dan Geiger's Markov Chain Tutorial),
where there are 2 hidden states C (fair coin) and D (loaded coin), each one producing
H (heads) or T (tails), together with the following probabilities:
Transition probabilities:
      C     D
C    0.9   0.1
D    0.1   0.9
Emission probabilities:
      H      T
C    0.5    0.5
D    0.75   0.25
Assume that both hidden states are equally likely to be the initial state.
Represent this by including a "fake" Start state that has no emissions,
and has one transition to C
and one transition to D, each one with 0.5 probability.
- (20 points)
Follow the Forward and the Backward algorithms by hand for the following observed sequence
x = TTHH. Show your work and record intermediate results of the dynamic programming
algorithms in tables F (Forward) and B (Backward), as the algorithms would.
Note that:
- (5 points)
Once those tables have been completed, calculate the probability that the
3rd hidden state visited (i.e., the state that produced the leftmost H)
was C (the fair coin). That is, calculate:
p(s3= C | TTHH) = ?
Remember that p(s3= C | TTHH) = p(TTHH, s3=C)/p(TTHH).
Don't forget to divide by p(TTHH), whose value you can easily calculate from the Forward table.
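For checking hand-computed tables, here is a minimal Python sketch of the Forward and Backward recursions for this two-state model. It is only one possible formulation: it assumes there is no explicit End state, so the last Backward column is initialized to 1, and the function names (`forward`, `backward`, `posterior`) are hypothetical. The Start state is folded into the first Forward column through the 0.5 initial probabilities.

```python
STATES = ("C", "D")
START = {"C": 0.5, "D": 0.5}                      # transitions out of the "fake" Start state
TRANS = {"C": {"C": 0.9, "D": 0.1},
         "D": {"C": 0.1, "D": 0.9}}
EMIT = {"C": {"H": 0.5, "T": 0.5},
        "D": {"H": 0.75, "T": 0.25}}


def forward(x):
    """F[i][k] = p(x_1..x_(i+1), state at position i+1 is k), 0-based i."""
    F = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for sym in x[1:]:
        prev = F[-1]
        F.append({k: EMIT[k][sym] * sum(prev[j] * TRANS[j][k] for j in STATES)
                  for k in STATES})
    return F


def backward(x):
    """B[i][k] = p(x_(i+2)..x_n | state at position i+1 is k), 0-based i."""
    B = [{k: 1.0 for k in STATES}]                # assumption: no End state
    for sym in reversed(x[1:]):
        nxt = B[0]
        B.insert(0, {k: sum(TRANS[k][j] * EMIT[j][sym] * nxt[j] for j in STATES)
                     for k in STATES})
    return B


def posterior(x, position, state):
    """p(s_position = state | x), with position counted from 1."""
    F, B = forward(x), backward(x)
    p_x = sum(F[-1].values())                     # p(x) from the last Forward column
    return F[position - 1][state] * B[position - 1][state] / p_x


# Example on a short sequence other than the assigned one.
print(posterior("HT", 1, "C"))
```

A useful sanity check is that the sum over states of F[i][k] * B[i][k] gives the same value, p(x), at every position i.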
- (100 points)
This part of the assignment is based on a
homework assignment from
Prof. Subramanian's "From Sequence to Structure: An Introduction to Computational Biology" course (Rice Univ.).
Take a look at
Prof. Subramanian's useful Markov models and HMMs Matlab demos.
- What you need to do:
Follow the instructions in
Prof. Subramanian's homework assignment.
Include in your written report answers to all the questions
in that homework assignment. Credit points are as shown in that assignment.
- Data Files:
Note that the data files have been updated since the date of
Prof. Subramanian's assignment (2009).
Extra Credit (50 points): Repeat steps (d) through (g) of the HW assignment, but using an HMM with 8 hidden
states: A+, C+, G+, T+, A-, C-, G-, T- where
the "+" states represent the nucleotides in a CpG island, and
the "-" states represent the nucleotides in regular DNA.
Each state emits only the corresponding nucleotide.
That is, A+ and A- emit A; C+ and C- emit C; etc.
Include transitions from each of the states to all the other 7 states.
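The structure of this 8-state model can be sketched in Python as follows. This only lays out the state list and the deterministic emission matrix described above. The uniform transition matrix is a placeholder assumption, not an estimated model, and whether self-transitions are allowed is also an assumption here. The names `STATES`, `EMISSION`, `TRANSITION`, and `island_mask` are hypothetical.

```python
NUCLEOTIDES = "ACGT"

# State order: A+, C+, G+, T+, A-, C-, G-, T-
STATES = [n + "+" for n in NUCLEOTIDES] + [n + "-" for n in NUCLEOTIDES]

# 8 x 4 emission matrix: each state emits only its own nucleotide.
EMISSION = [[1.0 if state[0] == nuc else 0.0 for nuc in NUCLEOTIDES]
            for state in STATES]

# 8 x 8 transition matrix. Placeholder only: uniform rows, self-transitions
# included (an assumption); real values would be estimated from training data.
n_states = len(STATES)
TRANSITION = [[1.0 / n_states] * n_states for _ in STATES]


def island_mask(state_path):
    """Collapse a decoded state path to its +/- labels, e.g. ['C+', 'A-'] -> '+-'."""
    return "".join(state[1] for state in state_path)


print(STATES)
print(island_mask(["C+", "G+", "A-", "T-"]))
```

Because the emissions are deterministic, all of the modeling information in this design sits in the transition matrix, which is why the placeholder values above would need to be replaced before decoding.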
REPORTS AND DUE DATE
- Slides.
We will discuss the results from the problem set during class, so you should prepare a few slides that summarize your findings and include any visualizations or graphs you want to share with the rest of the class. Be prepared to give an oral presentation.
Submit the following file with your slides for your oral report by email to
me before the deadline:
[your-lastname]_pbmset4_slides.[ext]
where [ext] is pdf, ppt, or pptx. Please use only lowercase letters in the
file name. For instance, the file with my slides for this problem set would be
named ruiz_pbmset4_slides.pptx
- Written Report.
Hand in a hardcopy of your written report at the beginning of class the
day the problem set is due.