BCB 4002/502 (aka CS 4802/582) - Term A, 2012
Biovisualization
Lectures: SL 411, Tuesday and Friday, 1:00 - 2:50PM

Instructor: Prof. Matthew Ward, FL-231, 508-831-5671, matt@cs
Office Hours: Tuesday: 1:00PM, Thursday and Friday: 10AM, Others by appointment

Course Description: In this course we will study the use of interactive data and information visualization to model and analyze biological information, structures, and processes. Topics will include the fundamental principles, concepts, and techniques of visualization (both scientific and information visualization) and how visualization can be used to study biological data at the genomic, cellular, molecular, organism, and population levels.

Text: The primary text for the course is Interactive Data Visualization: Foundations, Techniques, and Applications, by Ward, Grinstein, and Keim. Supplemental texts that may assist you in understanding some of the more difficult concepts will be placed on reserve in the library.

Exams: There will be no midterm or final exam. Instead, I will hold weekly quizzes (roughly 20 minutes long) covering both readings and lecture material. I will only count your top 5 quiz grades in computing your final grade, so there will be no make-up exams due to illness or other absences. Quizzes will be held at the start of class, so please try to arrive promptly.

Programming Language: Java is the preferred language for projects in this course. You might also consider using Processing (http://processing.org), a programming language that sits above Java and facilitates rapid development of graphics applications. If you want to program in a language other than Java or Processing, please see me before you start your projects to ensure the language has the appropriate graphics support. Note that, depending on your choice of languages, I may not be able to provide much help in debugging.

Grade Policy: 50% quizzes, 50% assignments, although low grades early in the term may be forgiven in cases where students are performing very well at the end of the course. You must obtain a passing grade for both the quiz portion and project portion.

Supplemental Material: All handouts can be found on myWPI. You can find links to useful sites for the course here, and some links to datasets that can be used for some of the projects here.

Notes:

Reading is mandatory; working ahead is encouraged.
Quizzes are based on both lectures and readings, so attending class and keeping up with the reading are strongly encouraged. Over-sleeping is NOT an acceptable excuse for missing a quiz.
Cheating, defined as taking credit for work you did not do, is strictly forbidden. First offenders will receive a zero grade for the assignment or exam in question, and the Office of Student Life will be notified. Repeat offenders will receive an NR for the course, and the case will be brought before the Campus Judiciary System.
All assignments should be submitted using myWPI. Instructions are provided with the assignments. Most of the assignments will have some basic requirements as well as some alternatives for those seeking a challenge. Files MUST include instructions on compiling and running the program and should be WELL documented. Insufficient documentation will result in loss of points (as much as 25% of the assignment). Data files should include a comment line at the start giving your name, the assignment for which it is intended, and the most recent date on which the file was changed. Please, do NOT turn in hardcopies or executables! Any questions regarding the program may be sent to me via e-mail, or you may stop by during my posted office hours.
Assignments are due by 5PM on the dates specified below. There will be a late penalty of 10 percent for each day beyond the due date.
For all projects each person should work independently. It is OK to discuss strategies with others in the class, but there should be no sharing of code.
In order to maintain a classroom environment conducive to effective learning, please refrain from the following activities during class: carrying on conversations (vocal or electronic), browsing the web, listening to music, playing games, eating (unless you brought enough to share with the whole class), or sleeping. Please set cell phones to silent mode. Your consideration for others would be greatly appreciated.

Projects:

The projects for the course are as follows:

Project 0: (due August 30) Getting Started: This is a practice project to make sure you can write, compile, and execute a program that generates graphics. The goal is to be able to generate graphics primitives (points, lines, polygons) at different locations on the screen with different colors. You may write this from scratch, or start with a demo program from a book or the web. If you do start with code that you found, please identify the source of the code and, most importantly, make some non-trivial changes to the code to make it your own. In this case, please describe the change or changes you made in the documentation you submit with your code. For example, you could download one of the Processing examples, read it through so you understand what it is doing, and then change the appearance of the graphical output to use different color schemes, different primitive shapes, different layouts of the primitives, and so on.

Project 1: (due September 6) The Game of Life - the effects of surroundings and contacts: Many processes in biology can be simulated by a set of primitive objects along with rules that dictate their behavior (e.g., when they are formed, when they die, how they move, how they interact with other objects). The Game of Life is a standard programming project for learning how to use arrays and perform simulations. One starts with (generally) a 2-D array where some locations are occupied and others are empty. This could be done randomly, via some explicit patterns, or read in from a file. This is generation 1 of your "habitat". You then create a second array to hold the results of the next generation, which is created by applying rules to each location in the array. For example, if location (i,j) is empty in the current generation, but it has at least 2 neighbors that are not empty, then in the next generation we would set that location to occupied (a birth!). Similarly, if a location is occupied, but all of its neighbors are empty, it might die of loneliness, or if all its neighbors are occupied it might die of overcrowding. You get the idea.

For this project, create a basic 2-D Game of Life using graphics to show the state of each location for a given generation. You should have at least 4 rules for birth and death of cells. Run the simulation for a large number of generations, stopping if the number of changes between generations goes to zero. You can either have the user click a key to go to the next generation or let each display stay on the screen for 1 or 2 seconds before advancing. Again, if you find code for this on the web, please indicate the source and what changes you made to it to make it your own work. Generally, you will learn more and gain a better sense of accomplishment if you develop this from scratch.
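
To make the two-array idea concrete, here is a minimal sketch of one generation step. It uses only the three example rules mentioned above (birth with at least 2 occupied neighbors, death with 0 or with all 8 occupied), so it is not a complete solution; the class and method names are made up for illustration, and treating cells beyond the grid edge as empty is an assumption.

```java
final class LifeSketch {
    // Counts occupied cells among the (up to 8) neighbors of (r, c).
    // Assumption: cells beyond the grid edge are treated as empty.
    static int occupiedNeighbors(boolean[][] grid, int r, int c) {
        int count = 0;
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                if (dr == 0 && dc == 0) continue; // skip the cell itself
                int rr = r + dr, cc = c + dc;
                if (rr >= 0 && rr < grid.length && cc >= 0 && cc < grid[rr].length
                        && grid[rr][cc]) {
                    count++;
                }
            }
        }
        return count;
    }

    // Builds the next generation in a second array; the current one is not modified.
    static boolean[][] nextGeneration(boolean[][] cur) {
        boolean[][] next = new boolean[cur.length][];
        for (int r = 0; r < cur.length; r++) {
            next[r] = new boolean[cur[r].length];
            for (int c = 0; c < cur[r].length; c++) {
                int n = occupiedNeighbors(cur, r, c);
                if (!cur[r][c]) {
                    next[r][c] = n >= 2;         // birth
                } else {
                    next[r][c] = n > 0 && n < 8; // dies of loneliness (0) or overcrowding (8)
                }
            }
        }
        return next;
    }

    // Number of locations whose state differs between two generations.
    static int countChanges(boolean[][] a, boolean[][] b) {
        int changes = 0;
        for (int r = 0; r < a.length; r++) {
            for (int c = 0; c < a[r].length; c++) {
                if (a[r][c] != b[r][c]) changes++;
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        boolean[][] grid = new boolean[5][5];
        grid[2][1] = true; // hypothetical starting pattern
        grid[2][3] = true;
        for (int gen = 1; gen <= 20; gen++) {
            boolean[][] next = nextGeneration(grid);
            int changes = countChanges(grid, next);
            System.out.println("generation " + gen + ": " + changes + " changes");
            if (changes == 0) break; // stop when nothing changes
            grid = next;
        }
    }
}
```

Keeping the rules inside nextGeneration and the drawing code elsewhere makes it easier to add your own rules without touching the graphics.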

For those seeking a challenge, there are many variations on this. For example, you might have more than one type of object (predator/prey), with different sets of rules. You could integrate moving objects, such as migrating cells in a developing organism moving into unoccupied cells (careful about collisions!). You could simulate the propagation of a disease based on contacting an infected cell (especially cool to watch when the objects are moving). You can also use non-binary values, or even vectors, at each grid location to represent different degrees and aspects of the state or health of the object at that location.

Project 2: (due September 17) Tree of Life - how are things related: Trees are a mechanism for conveying hierarchical relationships among objects. Family trees, phylogenetic trees, and evolutionary relations are all examples of such trees. Mostly they are created by using some distance function to establish a hierarchical clustering of the data. Different graphical representations can then be used to convey not only the groupings but also the distances within and between groups.

For this project you will start with a set of named objects and a table of variables or a genetic sequence for each, from which you will compute distances between each pair of objects. For example (you don't have to use this dataset), at http://archive.ics.uci.edu/ml/datasets/Zoo you can find a table of 17 attributes of 101 animals. The distance calculation could just be the sum of the mismatches between the characteristics of 2 animals (for the Zoo dataset, you should ignore the first and last columns of the data in calculating the distances).
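
As an illustration of the mismatch count described above, the following sketch computes a pairwise distance table from rows of integer attributes. The class name, method names, and the tiny data table in main are hypothetical; loading the file and dropping the name and type columns are left to you.

```java
final class MismatchDistance {
    // Distance between two objects = number of attributes on which they differ.
    static int distance(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("rows must have the same number of attributes");
        }
        int mismatches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) mismatches++;
        }
        return mismatches;
    }

    // Symmetric table of distances between every pair of rows.
    static int[][] distanceTable(int[][] rows) {
        int n = rows.length;
        int[][] table = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                table[i][j] = distance(rows[i], rows[j]);
                table[j][i] = table[i][j];
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Made-up attribute rows for three objects (not taken from the Zoo dataset).
        int[][] rows = {
            {1, 0, 0, 4},
            {1, 0, 1, 4},
            {0, 1, 1, 2},
        };
        int[][] table = distanceTable(rows);
        for (int[] line : table) {
            System.out.println(java.util.Arrays.toString(line));
        }
    }
}
```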

In the first phase you will implement an algorithm that performs hierarchical clustering of the data. The easiest algorithm is called bottom-up agglomerative clustering. You start with a list of objects and their distances, and compute which pair of objects is closest. You then create a new object list by removing these two objects and putting in a new object that is their parent. The new distance from each of the other objects to this node is the average distance to the nodes that were merged. You then repeat the process until all objects are merged. You'll need to keep track of all nodes that are merged into a cluster. This can be done with arrays or linked lists. The names of the cluster objects can be omitted or can be assigned default names, such as cluster1, cluster2, and so on.
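
One possible way to organize the merging loop is sketched below. It follows the averaging rule described above and records, for each cluster, its two children and the iteration in which it was formed. The Node class, the array-based bookkeeping, and all names are illustrative assumptions, and the pair search is the simple (slow) version.

```java
final class AgglomerativeSketch {
    // A leaf (left == right == null, step == 0) or a cluster formed at iteration `step`.
    static final class Node {
        final String name;
        final Node left, right;
        final int step;
        Node(String name, Node left, Node right, int step) {
            this.name = name;
            this.left = left;
            this.right = right;
            this.step = step;
        }
    }

    // Merges the closest pair repeatedly and returns the root of the resulting tree.
    static Node cluster(String[] names, double[][] dist) {
        int n = names.length;
        if (n == 0) throw new IllegalArgumentException("need at least one object");
        int total = 2 * n - 1;              // n leaves plus n - 1 clusters
        Node[] nodes = new Node[total];
        double[][] d = new double[total][total];
        boolean[] active = new boolean[total];
        for (int i = 0; i < n; i++) {
            nodes[i] = new Node(names[i], null, null, 0);
            active[i] = true;
            for (int j = 0; j < n; j++) d[i][j] = dist[i][j];
        }
        for (int next = n; next < total; next++) {
            // Find the closest pair among the nodes that have not been merged yet.
            int bestA = -1, bestB = -1;
            for (int a = 0; a < next; a++) {
                if (!active[a]) continue;
                for (int b = a + 1; b < next; b++) {
                    if (!active[b]) continue;
                    if (bestA < 0 || d[a][b] < d[bestA][bestB]) {
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            int step = next - n + 1;
            nodes[next] = new Node("cluster" + step, nodes[bestA], nodes[bestB], step);
            active[bestA] = false;
            active[bestB] = false;
            // Distance to the new node = average of the distances to the two merged nodes.
            for (int k = 0; k < next; k++) {
                if (!active[k]) continue;
                double avg = (d[bestA][k] + d[bestB][k]) / 2.0;
                d[next][k] = avg;
                d[k][next] = avg;
            }
            active[next] = true;
        }
        return nodes[total - 1];
    }

    public static void main(String[] args) {
        String[] names = {"A", "B", "C"};                    // hypothetical objects
        double[][] dist = {{0, 1, 4}, {1, 0, 6}, {4, 6, 0}}; // hypothetical distances
        Node root = cluster(names, dist);
        System.out.println(root.name + " = (" + root.left.name + ", " + root.right.name + ")");
    }
}
```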

In the second phase, you will draw the resulting tree. For a basic view, we can focus just on the structure of the tree. For example, you can line up all the non-cluster objects along the bottom or side of the screen (this makes reading the text names easier) and then position the cluster objects offset up or across from other nodes based on which objects make up their cluster. There are only 3 possibilities - either the cluster is made up of 2 base objects, one base object and one cluster, or 2 clusters. The position of the cluster should be at a distance proportional to the iteration of the clustering algorithm in which it was formed; for the other coordinate you can use the midpoint between the two nodes that make up the cluster. This is just one possible approach - we will discuss others in class.
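
The positioning rule above can be sketched as a short recursive pass over the tree. This version places leaves along one axis in the order they are visited and puts each cluster at the midpoint of its two children, offset by the iteration in which it was formed. The Node class, its fields, and the spacing parameters are assumptions made for illustration; drawing the lines and labels is up to you.

```java
final class TreeLayoutSketch {
    // Leaf (left == null) or cluster; `step` is the clustering iteration (0 for leaves).
    static final class Node {
        final String name;
        final Node left, right;
        final int step;
        double x, y; // filled in by layout()
        Node(String name, Node left, Node right, int step) {
            this.name = name;
            this.left = left;
            this.right = right;
            this.step = step;
        }
        boolean isLeaf() {
            return left == null;
        }
    }

    // Assigns positions to every node under `node`.
    // nextLeaf is the index of the next free leaf slot; the updated index is returned.
    static int layout(Node node, int nextLeaf, double leafSpacing, double stepHeight) {
        if (node.isLeaf()) {
            node.x = nextLeaf * leafSpacing; // leaves are evenly spaced along one axis
            node.y = 0.0;
            return nextLeaf + 1;
        }
        nextLeaf = layout(node.left, nextLeaf, leafSpacing, stepHeight);
        nextLeaf = layout(node.right, nextLeaf, leafSpacing, stepHeight);
        node.x = (node.left.x + node.right.x) / 2.0; // midpoint of the two children
        node.y = node.step * stepHeight;             // offset grows with the iteration
        return nextLeaf;
    }

    public static void main(String[] args) {
        // Hypothetical tree: cluster1 = (A, B), cluster2 = (cluster1, C).
        Node a = new Node("A", null, null, 0);
        Node b = new Node("B", null, null, 0);
        Node c = new Node("C", null, null, 0);
        Node c1 = new Node("cluster1", a, b, 1);
        Node c2 = new Node("cluster2", c1, c, 2);
        layout(c2, 0, 10.0, 5.0);
        System.out.println(c1.name + " at (" + c1.x + ", " + c1.y + ")");
        System.out.println(c2.name + " at (" + c2.x + ", " + c2.y + ")");
    }
}
```

Swap x and y if you prefer the leaves along the side of the screen.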

For those seeking a challenge, there are two obvious enhancements. The first is to write an algorithm to reorder the objects to minimize line crossings. One approach is to compute the tree and then render from the base of the tree (where all nodes are merged into one cluster) to the terminal nodes. The second enhancement is to use the actual distances to determine positioning, so rather than evenly spaced nodes you can have them convey the original or computed distances using positioning. You can also use the color or width of the connecting lines to convey distances.

Project 3: (due September 24) Experiments in Biology - how to analyze tables of numbers: A great deal of biological data is stored in tables of numbers. For example, in microarray data you might store one row of information for each experiment, with each column either being a control variable, such as the presence of a stimulus or whether the cell or tissue is infected, or an output indicator, such as the degree to which a gene is expressed. Entries are generally either binary (yes or no), discrete (a small set of options), or continuous (can be normalized to the range (0.0, 1.0)).

There are so many ways to draw such data (see Chapter 7 of the book) that it is possible that no two members of the class will use the same method. The basic idea is to map each data row to a graphical object (point, line, polygon) and each value in the row to a graphical attribute (position, color, shape, size, ...). Some techniques allow you to use all dimensions at once, such as parallel coordinates and glyphs, while others choose subsets of dimensions for each plot (such as scatterplot matrices) or use dimensionality reduction methods to reduce N dimensions to 2 or 3 dimensions. The key is to make sure everything is scaled to fit on the screen while not leaving too much empty space. If you are using color to convey information, it is important to include a color key to help people interpret the data properly.
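
As a small example of the scaling step, the sketch below normalizes each column of a table to the range 0 to 1 and maps a normalized value to a pixel position, using a parallel coordinates layout as the illustration. The class, the method names, and the margin parameter are assumptions, and the table is assumed to be non-empty and rectangular.

```java
final class ScalingSketch {
    // Rescales every column to [0, 1] using its own minimum and maximum.
    // A constant column is mapped to 0.5 so it still appears on screen.
    static double[][] normalizeColumns(double[][] rows) {
        int nRows = rows.length, nCols = rows[0].length;
        double[][] out = new double[nRows][nCols];
        for (int c = 0; c < nCols; c++) {
            double min = rows[0][c], max = rows[0][c];
            for (int r = 1; r < nRows; r++) {
                min = Math.min(min, rows[r][c]);
                max = Math.max(max, rows[r][c]);
            }
            for (int r = 0; r < nRows; r++) {
                out[r][c] = (max == min) ? 0.5 : (rows[r][c] - min) / (max - min);
            }
        }
        return out;
    }

    // Maps t in [0, 1] to a pixel coordinate between lo and hi.
    static int toPixel(double t, int lo, int hi) {
        return lo + (int) Math.round(t * (hi - lo));
    }

    // Parallel coordinates: screen position of normalized value t on axis c of nCols axes.
    static int[] parallelCoordVertex(double t, int c, int nCols,
                                     int width, int height, int margin) {
        int x = (nCols == 1)
                ? width / 2
                : toPixel(c / (double) (nCols - 1), margin, width - margin);
        int y = toPixel(1.0 - t, margin, height - margin); // screen y grows downward
        return new int[] {x, y};
    }

    public static void main(String[] args) {
        double[][] rows = {{0, 10}, {5, 10}, {10, 10}}; // made-up data
        double[][] norm = normalizeColumns(rows);
        for (double[] row : norm) {
            for (int c = 0; c < row.length; c++) {
                int[] p = parallelCoordVertex(row[c], c, row.length, 400, 300, 20);
                System.out.print("(" + p[0] + ", " + p[1] + ") ");
            }
            System.out.println();
        }
    }
}
```

Connecting the vertices of one row with line segments gives that row's polyline; the margin keeps the plot away from the window edges.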

For those seeking a challenge, again many possibilities exist. For example, for many types of multivariate data visualization, the ordering of the records or dimensions can help reveal patterns of potential interest in the data. Ordering can be done in many ways, including based on one of the dimensions, distance from a particular point, or relationships between dimensions or records. Another interesting extension is to use animation to show subsets of the data at a time. A cool effect can be achieved by having data appear at a time based on one of the dimensions and then fade away as time progresses. Use your imagination!

Project 4: (due October 1) Structures in Biology - 1D sequences: One of the most fundamental data types in bioinformatics is a sequence of characters, whether it be DNA, RNA, or amino acids. Each has a finite alphabet, possibly with some similarity measure or grouping. Sequences can be short (dozens or hundreds of elements), long (thousands or tens of thousands), or huge (millions or billions) in length. Visualization has been used to study individual sequences, usually using color to convey the value of elements, but a far more common use is to visualize relationships between two or more sequences. A particularly powerful mechanism is to compute a good alignment of the sequences (using insertions, deletions, and substitutions) and display this alignment in such a way as to emphasize where the two sequences overlap.

A common way of generating an optimal alignment of two sequences is by using a class of algorithms known as dynamic programming. This algorithm basically computes an alignment score that propagates from the start of the two sequences to the end, penalizing the alignment for mismatches and added gaps and rewarding it when the corresponding elements match. In some applications there may be degrees of match; for our purposes, you can consider matching of two elements to be a binary result (1 = match, 0 = mismatch). For the first phase of this project you need to implement the dynamic programming algorithm on two genetic sequences of your choosing. The input would be the 2 sequences plus the scores to be applied for matches, mismatches, and an inserted gap. The output will be an alignment, where for every position in one sequence there is either an index to the corresponding position in the other sequence or some indicator of a gap (e.g., a -1). An excellent tutorial on this algorithm can be found at http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html.
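
For orientation, here is a minimal sketch of one such algorithm: a global alignment over the full length of both sequences, in the style commonly known as Needleman-Wunsch. It fills a score table and then traces back from the end to produce the index-or-gap output described above. The class name, method signature, and the scores used in main are assumptions; ties are broken in favor of the diagonal, which is only one of several valid choices.

```java
import java.util.Arrays;

final class AlignmentSketch {
    // Returns, for each position in a, the aligned index in b, or -1 for a gap.
    static int[] align(String a, String b, int match, int mismatch, int gap) {
        int n = a.length(), m = b.length();
        int[][] score = new int[n + 1][m + 1];
        // Aligning a prefix against nothing costs one gap per element.
        for (int i = 1; i <= n; i++) score[i][0] = i * gap;
        for (int j = 1; j <= m; j++) score[0][j] = j * gap;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int s = (a.charAt(i - 1) == b.charAt(j - 1)) ? match : mismatch;
                int diag = score[i - 1][j - 1] + s; // pair a[i-1] with b[j-1]
                int up = score[i - 1][j] + gap;     // a[i-1] against a gap
                int left = score[i][j - 1] + gap;   // b[j-1] against a gap
                score[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        // Trace back from the end, recording which move produced each score.
        int[] map = new int[n];
        Arrays.fill(map, -1);
        int i = n, j = m;
        while (i > 0 && j > 0) {
            int s = (a.charAt(i - 1) == b.charAt(j - 1)) ? match : mismatch;
            if (score[i][j] == score[i - 1][j - 1] + s) {
                map[i - 1] = j - 1;
                i--;
                j--;
            } else if (score[i][j] == score[i - 1][j] + gap) {
                i--; // a[i-1] stays mapped to -1
            } else {
                j--; // b[j-1] is unmatched
            }
        }
        return map; // any remaining leading positions of a stay at -1
    }

    public static void main(String[] args) {
        // Hypothetical sequences and scores: match +1, mismatch -1, gap -1.
        int[] map = align("GATTACA", "GATACA", 1, -1, -1);
        System.out.println(Arrays.toString(map));
    }
}
```

The table uses (n+1) x (m+1) integers, which is fine for sequences of a few thousand elements but not for very long ones.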

Once you have a good alignment, you want to generate a visualization of the sequences and their alignment. This can be done in a purely textual manner (yes, a visualization can consist of just text, with additional information embedded in the positions), a purely graphical manner, or a combination of the two. To simplify things, you can limit the size of your sequences to less than a couple thousand to make it easier to fit onto the screen. You should look at how various bioinformatics programs display their alignments for inspiration.

For those seeking a challenge, here are a couple of possibilities. The first is to deal with the issue of scale. If your sequences are very long, to avoid scrolling you need to find ways of compressing things. Even if you are only showing graphics (e.g., the color of a dot or small rectangle), you will run out of space eventually. One solution is to compress the alignments so that each dot or small rectangle represents the degree of match between two small subsequences, rather than individual sequence elements. Again, looking at existing systems may provide other ideas. The second, and perhaps more interesting, challenge is to expand your system beyond two sequences. For example, for 3 sequences you could make a 3-D scoring array and follow the dynamic programming algorithm with a higher dimension. Another would be to find regions of strong homology (similarity) between pairs of sequences and use one or more of these as foundations for partial alignments of the elements before and after the strong matches. Multiple sequence alignment is still an area of research, especially for large numbers of long sequences. Efficiency becomes a big issue.

Project 5: (due October 11) Structures in Biology - 2D and 3D structures: Thus far we have not dealt with the geometry of biological data, i.e., the spatial relations that exist between the components of biological objects from molecules to organisms. Structures can be determined experimentally or via computational methods. Since this is a course on biovisualization, we are not going to concern ourselves too much with the process of accurately predicting structures, other than we can simplistically model things as a set of forces (attraction, repulsion) and constraints (such as avoiding the situation where two objects are at the same location).

In genetic sequence analysis, the primary structure is the 1-D sequence itself. Secondary structure entails the folding of the sequence into more compact structures, such as alpha helices and beta sheets, using forces such as hydrogen bonds to link the bases. Tertiary structure consists of folding the secondary structures into a compact 3-D form, again based on forces and constraints, such as the amount of bending or twisting that can be achieved without expending too much energy.

To start, we note that in RNA sequences, bonds are formed between A and U, as well as G and C. Thus conceivably a pairing of an A with any U is possible. In reality, there are other constraints and forces at work. If you have a subsequence of size N, and it is followed by another subsequence of the same size, but with a reversed complemented pattern (e.g., AUUCG followed by CGAAU), this region may fold into a structure called a hairpin, similar to the double-helix of DNA sequences. A stem and loop structure would occur if the 2 subsequences above were separated by a short sequence that didn't bond with itself. Thus the accumulated forces of adjacent matches are stronger than individual base forces. In a sense, we can treat groups of cooperating bases as a force amplifier. Protein sequences bend and fold in similar ways, though predicting the resulting structure can often be quite complex.
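
The reversed complemented pattern can be checked with a few lines of code. The sketch below considers only the A-U and G-C pairs named above and tests whether a subsequence is followed, after an optional loop, by its reverse complement. The class and method names are hypothetical, and this says nothing about whether the structure would actually form.

```java
final class HairpinSketch {
    // Pairing partner of an RNA base, using only the A-U and G-C bonds.
    static char complement(char base) {
        switch (base) {
            case 'A': return 'U';
            case 'U': return 'A';
            case 'G': return 'C';
            case 'C': return 'G';
            default: throw new IllegalArgumentException("not an RNA base: " + base);
        }
    }

    // Complements each base and reverses the order.
    static String reverseComplement(String rna) {
        StringBuilder sb = new StringBuilder(rna.length());
        for (int i = rna.length() - 1; i >= 0; i--) {
            sb.append(complement(rna.charAt(i)));
        }
        return sb.toString();
    }

    // True if the n bases starting at `start` are followed, after loopLen unpaired
    // bases, by their reverse complement (a candidate stem, with or without a loop).
    static boolean canFormStem(String rna, int start, int n, int loopLen) {
        int second = start + n + loopLen;
        if (start < 0 || n <= 0 || loopLen < 0 || second + n > rna.length()) {
            return false;
        }
        String first = rna.substring(start, start + n);
        return rna.substring(second, second + n).equals(reverseComplement(first));
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AUUCG"));               // the example from the text
        System.out.println(canFormStem("AUUCGCGAAU", 0, 5, 0));       // no loop
        System.out.println(canFormStem("AUUCGAAACGAAU", 0, 5, 3));    // 3-base loop (made-up sequence)
    }
}
```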

For other organisms, you can also treat the process of structure prediction as a complex set of forces. For example, in the development of the cerebral cortex, cells of different types migrate to different locations in the cortex (attractive forces) while at the same time responding to cell collisions (repelling forces).

In this project we will implement a simple spring force model to move basic elements around. In a way this is an extension to the Game of Life project you did earlier in the term, but in 3-D. Start by creating a table of attractions, similar to the distance table you made for Project 2, but matching a sequence against itself. The forces should include both positive and negative entries, but for the positive entries the attraction should be stronger for elements that are closer to each other in the sequence. For each element, you should draw a sphere, using different colors and/or sizes based on the object type. You should support adjusting the scaling of the objects, and drawing lines or tubes between nodes with the strongest attraction forces.

There are two common ways of initializing the positions of nodes: randomly and all in one place. You then should execute many iterations of the spring model calculations, stopping when the amount of change between iterations falls below a threshold you set at the beginning. You might want to visualize the data after each iteration to see how the movement is progressing (this is a good way to debug your program, as strange movement usually implies a bug). You may have to adjust the bounds of your display to accommodate repulsion forces; alternatively, you can add a constraint that keeps all positions within a 3-D box (just another force).
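
The iteration loop might look roughly like the sketch below. It is one simple formulation, not the required one: positive table entries act as springs pulling a pair toward a rest length, and negative entries push a pair apart until they are a given range apart. The rest length, the repulsion range, the step size, and all names are assumptions you would tune or replace. Nodes at exactly the same position are skipped because they have no direction to move in, so an all-in-one-place start needs a small random offset with this version.

```java
final class SpringSketch {
    // Performs one iteration over all pairs and returns the total distance moved.
    static double step(double[][] pos, double[][] attraction,
                       double restLength, double repelRange, double stepSize) {
        int n = pos.length;
        double[][] delta = new double[n][3];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = pos[j][0] - pos[i][0];
                double dy = pos[j][1] - pos[i][1];
                double dz = pos[j][2] - pos[i][2];
                double dist = Math.sqrt(dx * dx + dy * dy + dz * dz);
                if (dist < 1e-9) continue; // coincident nodes: no direction defined
                double a = attraction[i][j];
                // Positive entry: spring toward restLength (pulls if farther, pushes if closer).
                // Negative entry: pushes apart only while closer than repelRange.
                double f = (a >= 0)
                        ? a * (dist - restLength)
                        : a * Math.max(0.0, repelRange - dist);
                double scale = stepSize * f / dist; // move i along the direction to j
                delta[i][0] += scale * dx;
                delta[i][1] += scale * dy;
                delta[i][2] += scale * dz;
            }
        }
        double moved = 0.0;
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < 3; k++) pos[i][k] += delta[i][k];
            moved += Math.sqrt(delta[i][0] * delta[i][0]
                    + delta[i][1] * delta[i][1] + delta[i][2] * delta[i][2]);
        }
        return moved;
    }

    // Iterates until the total movement drops below threshold or maxIter is reached.
    // Returns the number of iterations performed.
    static int relax(double[][] pos, double[][] attraction, double restLength,
                     double repelRange, double stepSize, double threshold, int maxIter) {
        int iterations = 0;
        while (iterations < maxIter) {
            double moved = step(pos, attraction, restLength, repelRange, stepSize);
            iterations++;
            if (moved < threshold) break;
        }
        return iterations;
    }

    public static void main(String[] args) {
        double[][] pos = {{0, 0, 0}, {4, 0, 0}, {0, 3, 0}};            // made-up start positions
        double[][] attraction = {{0, 1, -1}, {1, 0, 0}, {-1, 0, 0}};   // made-up table
        int iterations = relax(pos, attraction, 1.0, 5.0, 0.1, 1e-6, 10000);
        System.out.println("stopped after " + iterations + " iterations");
    }
}
```

If positions start to oscillate or fly apart, the step size is probably too large for the entries in your table.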

Challenge options: TBD.

Schedule:

Week 1 (August 24-30)
   Topics: Overview of Visualization, Biological Data Representations 
   Reading: Chapters 1 and 2

Week 2 (August 31-September 6)
   Topics: Particle-Based Methods and Applications
   Reading: Chapter 4 

Week 3 (September 7-13)
   Topics: Trees and Networks
   Reading: Chapter 8

Week 4 (September 14-20)
   Topics: Tables of Data
   Reading: Chapter 7

Week 5 (September 21-27)
   Topics: Sequences and Text
   Reading: Chapter 9

Week 6 (September 28-October 4)
   Topics: 2-D and 3-D structures
   Reading: Chapter 5

Week 7 (October 5-11)
   Topics: Issues of Design, Evaluation
   Reading: Chapters 12, 13


Matthew Ward 2012-08-15