2001 International Worm Meeting abstract 846
| 1 | Worcester Polytechnic Institute, Department of Biology and Biotechnology, 100 Institute Rd., Worcester, MA 01609 |
| 2 | Worcester Polytechnic Institute, Department of Computer Science, 100 Institute Rd., Worcester, MA 01609 |
With the completion of several genome sequencing projects, one of the major challenges in computational biology lies in the discovery of significant features or motifs in the sequences. We are particularly interested in sequences controlling the expression of genes, which are much less well understood than sequences coding for proteins. Using data on expression patterns gathered by Shawn Lockery and Oliver Hobert and found in ACeDB and WormBase, as well as the published literature, we are attempting to discover DNA motifs important in the control of gene expression in C. elegans. For our initial work, we are searching for motifs only in the non-coding promoter region 5' to the translational start site of genes of interest; we are not yet considering introns or regions 3' to the coding region. We assume that important regulatory motifs correspond to binding sites for transcription factors and should therefore occur repeatedly among genes with similar expression patterns. We hope both to identify important motifs and to use knowledge of these motifs to predict the expression patterns of genes.
It seems likely that combinations of motifs are involved in the control of gene expression. Thus, our analysis involves two major steps: identifying potential DNA motifs of interest, and determining whether particular combinations of motifs are required to produce a specific expression pattern. We divided our data into a training set and a test set in order to evaluate the usefulness of our analysis. To identify putative motifs, we made extensive use of a motif discovery tool called MEME (1). This open-source package uses the expectation-maximization (EM) algorithm to identify the 'best' motifs shared among the input sequences. In preliminary experiments, MEME identified known transcription factor binding sites best when smaller numbers of fairly short non-coding regions were used as input sequences. Therefore, we divided our database into groups of 4-7 genes with similar expression patterns. Each promoter region was further split into two parts: the 1500 bases most proximal to the translational start site, and the remainder of the promoter region. We are currently running MEME on these gene groups to generate putative motifs. We then plan to use a data mining technique that generates rules for classifying genes into expression pattern classes based on the presence of particular combinations and spacing of motifs. We will report on preliminary results of our analysis at the meeting.
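The input preparation described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the even-split grouping strategy, the decision to skip expression patterns with fewer than 4 genes, and the gene names in the usage note are all assumptions made for illustration. Only the 1500-base proximal window and the 4-7 gene group size come from the abstract. The sketch assumes each upstream sequence is given 5' to 3' and ends immediately before the translational start site.

```python
from math import ceil

PROXIMAL_LEN = 1500  # bases nearest the translational start site (from the abstract)


def split_promoter(upstream, proximal_len=PROXIMAL_LEN):
    """Split an upstream sequence into (proximal, distal) parts.

    Assumes `upstream` reads 5'->3' and ends just before the start codon,
    so the proximal part is the last `proximal_len` bases.
    """
    if len(upstream) <= proximal_len:
        return upstream, ""
    return upstream[-proximal_len:], upstream[:-proximal_len]


def group_genes(genes, min_size=4, max_size=7):
    """Partition genes sharing an expression pattern into groups of 4-7.

    Hypothetical strategy: use as few groups as possible and spread the
    genes evenly. Patterns with fewer than `min_size` genes are skipped
    (an assumption; the abstract does not say how these were handled).
    """
    n = len(genes)
    if n < min_size:
        return []
    k = ceil(n / max_size)       # fewest groups that respect max_size
    base, extra = divmod(n, k)   # the first `extra` groups get one more gene
    groups, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        groups.append(genes[start:start + size])
        start += size
    return groups


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text, a format MEME accepts."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

For example, nine hypothetical genes `g0` to `g8` sharing one expression pattern would be split by `group_genes` into one group of five and one of four; the proximal part of each promoter, taken from `split_promoter`, could then be written out with `to_fasta` as one input file per group.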
(1) Bailey, Elkan, and Grundy. MEME, Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme/website/meme-input.html