Written Report: Your written report should consist of your answers to each of the parts in the assignment below.
Assignment:
We will use Weka's Naive Bayes and Bayesian Net classifiers to construct models for this dataset. Assume that the classification target is "veridex_risk". During model construction, use the %split test option, with 90% split. That is, 90% of randomly selected data instances from the dataset are used to construct the model and the remaining 10% of the data instances are used to test the model.
Transition probabilities:
        C      D
  C     0.9    0.1
  D     0.1    0.9
Emission probabilities:
        H      T
  C     0.5    0.5
  D     0.75   0.25
Assume that both hidden states are equally likely to be the initial state. Represent this by including a "fake" Start state that has no emissions, and has one transition to C and one transition to D, each one with 0.5 probability. [For this problem, it would be very useful for you to explore all the resources on Hidden Markov Models posted on the course webpage.]
Follow the Forward and the Backward algorithms by hand for the following observed sequence x = TTHH. Show your work and record intermediate results of the dynamic programming algorithms in tables F and B, as the algorithms would. Note that:
Once those tables have been completed, calculate the probability that the 3rd hidden state visited (i.e., the state that produced the leftmost H) was C (the fair coin). That is, calculate:
p(s3 = C | TTHH) = ?
Remember that p(s3 = C | TTHH) = p(TTHH, s3 = C) / p(TTHH). Don't forget to divide by p(TTHH), whose value you can easily calculate from the Forward table.
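To check hand-computed tables, the Forward and Backward recurrences can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the names STATES, START, TRANS, EMIT, forward, backward and posterior are illustrative choices, not part of the assignment, and the model is assumed to have no explicit End state, so p(x) is taken as the sum of the last Forward column.

```python
# Model parameters copied from the problem statement above.
STATES = ("C", "D")
START = {"C": 0.5, "D": 0.5}                      # "fake" Start state transitions
TRANS = {"C": {"C": 0.9, "D": 0.1},
         "D": {"C": 0.1, "D": 0.9}}
EMIT = {"C": {"H": 0.5, "T": 0.5},
        "D": {"H": 0.75, "T": 0.25}}


def forward(obs):
    """Return table F where F[i][k] = p(x_1..x_(i+1), state at step i+1 is k)."""
    table = [{k: START[k] * EMIT[k][obs[0]] for k in STATES}]
    for symbol in obs[1:]:
        prev = table[-1]
        table.append({
            k: EMIT[k][symbol] * sum(prev[j] * TRANS[j][k] for j in STATES)
            for k in STATES
        })
    return table


def backward(obs):
    """Return table B where B[i][k] = p(symbols after step i+1 | state at step i+1 is k)."""
    table = [{k: 1.0 for k in STATES}]            # assumption: no explicit End state
    for symbol in reversed(obs[1:]):
        nxt = table[0]
        table.insert(0, {
            k: sum(TRANS[k][j] * EMIT[j][symbol] * nxt[j] for j in STATES)
            for k in STATES
        })
    return table


def posterior(obs, position, state):
    """p(s_position = state | obs), with position counted from 1."""
    f_table, b_table = forward(obs), backward(obs)
    p_obs = sum(f_table[-1][k] for k in STATES)   # p(obs) from the last Forward column
    i = position - 1
    return f_table[i][state] * b_table[i][state] / p_obs


if __name__ == "__main__":
    x = "TTHH"
    for name, table in (("F", forward(x)), ("B", backward(x))):
        for i, column in enumerate(table, start=1):
            print(name, i, column)
    print("p(s3 = C | TTHH) =", posterior(x, 3, "C"))
```

Printing both tables column by column makes it easy to compare each intermediate entry against the hand calculation rather than only the final posterior.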
Follow the instructions in the homework assignment, starting on page 2 (you do NOT need to work on parts (a)-(d) on page 1). Include in your written report answers to parts (a)-(g) on pages 3-4. Credit points are as follows: (a)-(b) 10 points each; (c)-(f) 15 points each; (g) 20 points. Please note that the data files have been changed since the date of the above assignment (2009). Current information (as of Sept. 2011) is included below.
You can download all the needed data files from http://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=22. For simplicity, I include the files below (current as of Sept. 2011):
Region Displayed: 0-51M bp
Total Contigs On Chromosome: 4
Contigs in Region: 0

  start       stop        Symbol          O (orientation)
  16050001    16697850    NT_028395.3     +
  16847851    20509431    NT_011519.10    +
  20609432    50364777    NT_011520.12    +
  50414778    51244566    NT_011526.7     +
Extra Credit (100 points): Repeat the same steps above, but using a HMM with 8 hidden states: A+, C+, G+, T+, A-, C-, G-, T- where the "+" states represent the nucleotides in a CpG island, and the "-" states represent the nucleotides in regular DNA. Each state emits only the corresponding nucleotide. That is, A+ and A- emit A; C+ and C- emit C; etc. Include transitions from each of the states to all the other 7 states.
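As a starting point, the 8-state structure described above can be written down as follows. This is a hedged sketch, not a solution: the names NUCLEOTIDES, STATES, EMIT, placeholder_transitions and states_emitting are hypothetical, and the transition values are a uniform placeholder over the 7 other states, NOT probabilities estimated from the chromosome 22 data. Whether self-transitions should also be allowed is left open here.

```python
NUCLEOTIDES = "ACGT"

# Hidden states: A+, C+, G+, T+ (CpG island) followed by A-, C-, G-, T- (regular DNA).
STATES = [base + label for label in "+-" for base in NUCLEOTIDES]

# Each state emits only its own nucleotide with probability 1.
EMIT = {
    state: {base: 1.0 if state[0] == base else 0.0 for base in NUCLEOTIDES}
    for state in STATES
}


def placeholder_transitions(states):
    """Uniform probability from each state to the other states.

    PLACEHOLDER ONLY: real values would have to be estimated from labeled
    sequence data; these numbers are not taken from the assignment.
    """
    share = 1.0 / (len(states) - 1)
    return {src: {dst: share for dst in states if dst != src} for src in states}


TRANS = placeholder_transitions(STATES)


def states_emitting(base):
    """Hidden states able to emit `base`: one '+' state and one '-' state."""
    return [state for state in STATES if EMIT[state][base] == 1.0]


if __name__ == "__main__":
    print(STATES)
    print(states_emitting("G"))
```

Because every emission is deterministic, an observed nucleotide narrows the hidden state to exactly two candidates, so the transition probabilities alone decide whether a position is labeled as island or regular DNA.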
Submit the following file with your slides for your oral report by email to me before 12:00 noon the day the project is due (that is, at least 1 hour before class):
[your-lastname]_proj2_slides.[ext]
where [ext] is pdf, ppt, or pptx. Please use only lower case letters in the file name. For instance, the file with my slides for this project would be named ruiz_proj2_slides.pptx