BCB4003/503 CS4803/583 A Term / Fall 2013

BCB4003/503 CS4803/583 Biological and Biomedical Database Mining
Prof. Carolina Ruiz
Problem Set 3 - A term / Fall 2013
Bayesian Modeling

DUE DATE: Friday, Sept. 20, 2013. Slides (by email) by 11 am; Written Report (hardcopy) at the beginning of class (1:00 pm).
** This is an individual problem set **

Problem Set Description
Problem Set Assignment
Report Submission and Due Date

PROBLEM SET DESCRIPTION

The purpose of this project is to gain experience with Bayesian modeling.

PROBLEM SET ASSIGNMENT

Written Report: Your written report should consist of your answers to each of the parts in the assignment below.

Assignment:

Dataset.
The dataset for this project is the same GSE7390_transbig2006affy_demo.txt dataset that we used for Problem Sets 1 and 2.
Apply the same pre-processing to this dataset that we did in Problem Sets 1 and 2. The only dataset attributes that you should define as "numeric" in your .arff file are age, size, t.tdm, t.rfs, t.os, t.dmfs, NPI, and AOL_os_10y. All other attributes are nominal, and should be defined as such in your .arff file.
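As a rough illustration of the attribute typing described above, the following Python sketch builds an ARFF header in which only the listed attributes are declared numeric and every other attribute is declared nominal. The helper names (arff_attribute_line, arff_header), the relation name, the toy column "grade", and all values shown are assumptions made for illustration; they are not taken from the actual dataset. The sketch also assumes that missing values appear as an empty string, NA, or ?, and that nominal labels contain no spaces or commas (which would need quoting in ARFF).

```python
# Attributes the assignment says to declare as numeric.
NUMERIC_ATTRIBUTES = {
    "age", "size", "t.tdm", "t.rfs", "t.os", "t.dmfs", "NPI", "AOL_os_10y",
}


def arff_attribute_line(name, observed_values):
    """Return one ARFF @attribute declaration for a column."""
    if name in NUMERIC_ATTRIBUTES:
        return f"@attribute {name} numeric"
    # Nominal: list the distinct observed labels, skipping missing-value markers.
    labels = sorted({v for v in observed_values if v not in ("", "NA", "?")})
    return f"@attribute {name} {{{','.join(labels)}}}"


def arff_header(relation, columns):
    """Build an ARFF header from an ordered mapping: column name -> raw string values."""
    lines = [f"@relation {relation}", ""]
    lines += [arff_attribute_line(name, values) for name, values in columns.items()]
    lines += ["", "@data"]
    return "\n".join(lines)


# Hypothetical toy columns; the values are made up for illustration only.
toy_columns = {
    "age": ["57", "42", "NA"],
    "grade": ["3", "2", "3"],
    "veridex_risk": ["Poor", "Good", "Poor"],
}
header = arff_header("transbig_demo", toy_columns)
print(header)
```

Declaring a column such as the hypothetical "grade" as nominal, even though its labels look like numbers, is what makes Weka treat it as a set of categories rather than as a quantity.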
Bayesian Models Materials
Study in detail the Bayesian models materials posted on the course webpage.
Bayesian Models Experiments
We will use Weka's Naive Bayes and Bayesian Net classifiers to construct models for this dataset. Assume that the classification target is "veridex_risk". During model construction, use 10-fold cross-validation.
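As a reminder of what a Naive Bayes model estimates, here is a minimal Python sketch for nominal attributes only. It is not Weka's implementation: it assumes add-one (Laplace) smoothing for the conditional probabilities, leaves the class priors unsmoothed, and ignores numeric attributes and missing values. The attribute names "er" and "grade" and the four toy rows are hypothetical and serve only to make the arithmetic easy to follow.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count classes and (attribute, class, value) co-occurrences in a list of dict rows."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> Counter of values
    domains = defaultdict(set)           # attribute -> set of observed values
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            value_counts[(attr, row[target])][value] += 1
            domains[attr].add(value)
    return class_counts, value_counts, domains


def conditional_probability(model, attr, value, cls):
    """P(attr = value | class = cls) with add-one (Laplace) smoothing."""
    class_counts, value_counts, domains = model
    return (value_counts[(attr, cls)][value] + 1) / (class_counts[cls] + len(domains[attr]))


def posterior(model, instance):
    """Normalized P(class | instance) under the naive independence assumption."""
    class_counts, _, _ = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # unsmoothed class prior
        for attr, value in instance.items():
            score *= conditional_probability(model, attr, value, cls)
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


# Hypothetical toy data, not drawn from the real dataset.
rows = [
    {"er": "1", "grade": "3", "veridex_risk": "Poor"},
    {"er": "0", "grade": "3", "veridex_risk": "Poor"},
    {"er": "1", "grade": "2", "veridex_risk": "Good"},
    {"er": "1", "grade": "1", "veridex_risk": "Good"},
]
model = train_naive_bayes(rows, "veridex_risk")
post = posterior(model, {"er": "1", "grade": "3"})
print(conditional_probability(model, "grade", "3", "Poor"))  # (2 + 1) / (2 + 3)
print(post)
```

In this toy example, P(grade = 3 | Poor) is (2 + 1) / (2 + 3) = 0.6, and the posterior for the instance favors "Poor" with probability 2/3.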
1. (20 points) Construct Naive Bayes models of the dataset. Click on "More options ..." to select "Output predictions" (choose, say, plain text) and to choose a value for the Random seed (initially use value = 1). Repeat the experiment 3 times, with seeds 1, 23, and 62. For each of the 3 experiments, record in your report the conditional probability values output by Weka under "Naive Bayes Classifier", the accuracy (= % of Correctly Classified Instances), and the confusion matrix obtained. Report any interesting observations about the results of each experiment and across the 3 experiments.
2. (60 points) Construct Bayesian Nets over this dataset. Use K2 as the algorithm to construct the topology of the Bayesian net. Run at least 6 experiments, thoughtfully varying the values of the "initAsNaiveBayes", "maxNrOfParents", and "randomOrder" parameters of the K2 algorithm. For each experiment, include in your report:
  1. the obtained Bayesian net (right-click on the experiment in the left-hand side window, and select "Visualize graph"),
  2. the accuracy (= % of Correctly Classified Instances) and the confusion matrix,
  3. any interesting observations on the topology of the network. Analyze the biological meaning of this topology.
  4. any interesting observations on the Conditional Probability Tables (CPTs) of the network (click on the nodes in the graph visualization). Analyze the biological meaning of these CPTs.
  5. any additional interesting observations about the results of each experiment and/or across all of the experiments.
3. (10 points) Compare the results obtained with Bayesian Nets and with Naive Bayes.
4. (10 points) Slides, oral presentation, and class participation during class presentations.
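Parts 1 to 3 above all rely on the accuracy and the confusion matrix of each experiment. The following Python sketch shows one way to recompute accuracy and per-class recall from a confusion matrix when comparing models. It assumes the matrix is laid out with rows as actual classes and columns as predicted classes. The function names and the two matrices are hypothetical; the numbers are placeholders, not results from the dataset.

```python
def accuracy(confusion):
    """Fraction of correctly classified instances: diagonal sum over grand total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_recall(confusion):
    """Recall for each actual class: diagonal entry over its row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(confusion)]


# Hypothetical matrices (rows = actual class, columns = predicted class).
naive_bayes_cm = [[40, 10], [20, 30]]
bayes_net_cm = [[45, 5], [15, 35]]

for name, cm in [("Naive Bayes", naive_bayes_cm), ("Bayesian Net", bayes_net_cm)]:
    print(name, accuracy(cm), per_class_recall(cm))
```

Looking at per-class recall alongside accuracy can show whether a gain in accuracy comes from one class only, which may be worth noting when comparing the two kinds of models.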

REPORTS AND DUE DATE

Slides: We will discuss the results from the problem set during class, so you should prepare a few slides summarizing your findings and including any visualizations or graphs you want to share with the rest of the class. Be prepared to give an oral presentation.
Submit the following file with your slides for your oral report by email to me before the deadline:
[your-lastname]_pbmset3_slides.[ext]
where [ext] is pdf, ppt, or pptx. Please use only lowercase letters in the file name. For instance, the file with my slides for this problem set would be named ruiz_pbmset3_slides.pptx
Written Report: Hand in a hardcopy of your written report at the beginning of class on the day the problem set is due.
