CS 2102 Homework Assignment #2 Context

The examples for this homework assignment are derived from real, practical examples in the Microbiology and Molecular Biology domain. While I cannot expect students to be well-versed in this domain, it is a given that a college education provides at least a basic knowledge of genomic DNA sequencing. This section explains WHY we asked the questions on this assignment.

Advances in microbiology and molecular biology have made it possible to sequence the genomic DNA of organisms. One of the most widely studied organisms is Escherichia coli (more commonly known as E. Coli). Full information on this organism is widely available from government data centers. For the purpose of this homework, you need only the information I provide (but I encourage you to browse the wealth of information available to anyone with an Internet connection!).

You have been hired as the technical contact to provide the following computations to assist a lead biologist. She has already retrieved from the government databanks the full genomic sequence of the K-12 strain of the E. Coli organism, which consists of 4,639,675 base pairs. The full data set is available directly (rawHTTP) or from within Eclipse via SourceForge. Open the CVS perspective (Window -> Open Perspective -> Other... -> CVS Repository Explorer). You should already have a repository location representing SourceForge (see class Tutorials if you haven't done this yet!). Expand the HEAD branch and right-click on the Project folder, selecting the option "Check Out". You can switch back to the Java perspective with Window -> Open Perspective -> Java.

Note that if you download the file manually, you will also have to move it manually into the Eclipse Project where you are doing the homework.

This is not domain-specific
The GC content for a particular genome is relevant because "The GC content and length of each DNA molecule dictates the strength of the association; the more complementary bases exist, the stronger and longer-lasting the association, characterized by the temperature required to break the hydrogen bond, its melting temperature (also called T_m value)." [http://en.wikipedia.org/wiki/DNA].

If the input file contains the E. Coli sequence, then the report generated by this question can be used to calculate the C+G content [which in published material is 50%]. How does your calculated value compare with the C+G published value of 50%?
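As a rough illustration of the arithmetic, here is a minimal sketch of a C+G percentage calculation. The class name GcContent and the method gcPercent are hypothetical (they are not part of any starter code), and the sketch assumes the sequence is already held in a String of a, c, g and t characters; anything else is ignored.

```java
// Hypothetical helper for illustration only; not part of the assignment's code.
final class GcContent {

    /** Returns the percentage of a/c/g/t bases that are 'c' or 'g' (case-insensitive). */
    static double gcPercent(String sequence) {
        long gc = 0;
        long total = 0;
        for (int i = 0; i < sequence.length(); i++) {
            char base = Character.toLowerCase(sequence.charAt(i));
            if (base == 'a' || base == 'c' || base == 'g' || base == 't') {
                total++;
                if (base == 'c' || base == 'g') {
                    gc++;
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("sequence contains no bases");
        }
        return 100.0 * gc / total;
    }

    public static void main(String[] args) {
        // The first 17 bases of the sample file, as quoted in this handout.
        System.out.println(gcPercent("agcttttcattctgact"));
    }
}
```

The same formula applies if you start from the four per-base counts instead of the raw string: (C + G) divided by (A + C + G + T), times 100.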

There is a need to search for genetic subsequences from within the E. Coli sequence to identify genes within the genome. As you may know, DNA is a double-helix composed of two strands (which we'll call the "+" and "-" strand). By convention, the "+" strand is used as the primary DNA genomic sequence. Thus the sample E. Coli file starts with the sequence "agcttttcattctgact...". The negative strand is complementary and is constructed by matching each base with its complementary pair (a pairs with t, and c pairs with g). Thus, the "-" strand of the E. Coli DNA is "tcgaaaagtaagactga..." and contains 4,639,675 nucleotides.
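The pairing rule above can be sketched in a few lines of Java. The class and method names here (Complement, complementBase, complement) are made up for illustration, and the sketch assumes lowercase input containing only a, c, g and t.

```java
// Hypothetical helper for illustration only.
final class Complement {

    /** Maps one base to its partner: a <-> t, c <-> g. */
    static char complementBase(char base) {
        switch (base) {
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:
                throw new IllegalArgumentException("unexpected base: " + base);
        }
    }

    /** Builds the "-" strand, position by position, from a "+" strand. */
    static String complement(String plusStrand) {
        StringBuilder minus = new StringBuilder(plusStrand.length());
        for (int i = 0; i < plusStrand.length(); i++) {
            minus.append(complementBase(plusStrand.charAt(i)));
        }
        return minus.toString();
    }

    public static void main(String[] args) {
        // Prints tcgaaaagtaagactga, matching the "-" strand quoted above.
        System.out.println(complement("agcttttcattctgact"));
    }
}
```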

For example, within the E. Coli genome, the insC-1 gene exists at location [380530..380940]. If you inspect this substring of the genome (where the first base-pair is number 1. BE CAREFUL!) you will find that the nucleotide sequence reads "gtgatagtct..." This is a gene that can be found simply by searching through the E. Coli genome for the string "gtgatagtct...". To represent this gene, the format "380530..380940 +" is used to declare that it was found on the "+" strand.
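The reason to BE CAREFUL is that Java's String.substring counts from 0 and excludes its end index, while gene locations count from 1 and include both ends. Here is a small sketch of the conversion; the names GenePosition, slice and length are hypothetical, and the toy string stands in for the full genome.

```java
// Hypothetical helper for illustration only.
final class GenePosition {

    /** start and end are 1-based and inclusive, as in "380530..380940". */
    static String slice(String genome, int start, int end) {
        // Shift the start down by one; the exclusive end index already
        // lines up with the 1-based inclusive end.
        return genome.substring(start - 1, end);
    }

    /** Number of bases covered by a 1-based inclusive range. */
    static int length(int start, int end) {
        return end - start + 1;
    }

    public static void main(String[] args) {
        String toy = "agcttttcattctgact";
        System.out.println(slice(toy, 3, 6));       // prints cttt
        System.out.println(length(380530, 380940)); // prints 411
    }
}
```

Going the other way, a match found by indexOf at 0-based position i with length n corresponds to the location (i + 1)..(i + n).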

However, E. Coli has developed a sophisticated way of storing genes "in reverse order" within the "-" strand. Thus, the gene "atgctgatt...cgttaa" appears in reverse order and in complementary form as "ttaacg...aatcagcat" starting at location [5683] on the "+" strand. To represent this gene, the format "6459..5683 -" is used to declare that it was found on the "-" strand (NOTE that the higher index is listed first).
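To see how a "-" strand gene shows up on the "+" strand, you reverse the gene and complement each base. The following is a minimal sketch under the assumption of lowercase a, c, g, t input; the class and method names are hypothetical.

```java
// Hypothetical helper for illustration only.
final class ReverseComplement {

    static char complementBase(char base) {
        switch (base) {
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:
                throw new IllegalArgumentException("unexpected base: " + base);
        }
    }

    /** Walks the gene from its last base to its first, complementing each base. */
    static String reverseComplement(String gene) {
        StringBuilder result = new StringBuilder(gene.length());
        for (int i = gene.length() - 1; i >= 0; i--) {
            result.append(complementBase(gene.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The two ends of the gene quoted above.
        System.out.println(reverseComplement("cgttaa"));    // prints ttaacg
        System.out.println(reverseComplement("atgctgatt")); // prints aatcagcat
    }
}
```

Note that the end of the gene becomes the beginning of the "+" strand text, which is why the gene's last bases "cgttaa" turn into the leading "ttaacg".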

Your task is to use the Scanner to process a File containing the E. Coli genomic sequence ("eColi") and a set of sample DNA fragments ("sequences") stored one per line (there are ten of them). You are then to search through the DNA genome for each sequence, one at a time. Given the sample set of ten sequences, your report should look like this:

6459..5683 - ttaacgctgc...
21181..21399 + atgtgccggc... 
87860..87946 + atgcgcaatt... 
UNMATCHED
3025143..3026510 + atgtctggag... 
3735200..3734376 - ttacatatgc...
UNMATCHED
4494274..4493213 - tcaaaaatcg...
UNMATCHED
UNMATCHED
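
One possible way to organize the search is sketched below; it is not the only approach and not necessarily the intended one. It makes several assumptions that you should check against the actual files: the genome file contains only lowercase a, c, g, t (possibly split across lines), a fragment that is not found directly is retried as its reverse complement, and the ten characters printed are the text as it appears on the "+" strand. The names FragmentSearch, locate and preview are hypothetical.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Hypothetical sketch for illustration only.
final class FragmentSearch {

    static String reverseComplement(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = s.length() - 1; i >= 0; i--) {
            switch (s.charAt(i)) {
                case 'a': sb.append('t'); break;
                case 't': sb.append('a'); break;
                case 'c': sb.append('g'); break;
                case 'g': sb.append('c'); break;
                default: throw new IllegalArgumentException("unexpected base: " + s.charAt(i));
            }
        }
        return sb.toString();
    }

    /** First ten bases (or fewer) followed by "...". */
    static String preview(String s) {
        return s.substring(0, Math.min(10, s.length())) + "...";
    }

    /** Builds one report line for one fragment, using 1-based inclusive positions. */
    static String locate(String genome, String fragment) {
        int idx = genome.indexOf(fragment);
        if (idx >= 0) {
            return (idx + 1) + ".." + (idx + fragment.length()) + " + " + preview(fragment);
        }
        String rc = reverseComplement(fragment);
        idx = genome.indexOf(rc);
        if (idx >= 0) {
            // Higher index first for the "-" strand.
            return (idx + rc.length()) + ".." + (idx + 1) + " - " + preview(rc);
        }
        return "UNMATCHED";
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length < 2) {
            System.out.println("usage: java FragmentSearch <genomeFile> <sequencesFile>");
            return;
        }
        StringBuilder buffer = new StringBuilder();
        try (Scanner in = new Scanner(new File(args[0]))) {
            while (in.hasNext()) {
                buffer.append(in.next()); // joins the genome if it spans several lines
            }
        }
        String genome = buffer.toString();
        try (Scanner in = new Scanner(new File(args[1]))) {
            while (in.hasNextLine()) {
                String fragment = in.nextLine().trim();
                if (!fragment.isEmpty()) {
                    System.out.println(locate(genome, fragment));
                }
            }
        }
    }
}
```

On the toy genome "agcttttcattctgact", locate returns "4..8 + ttttc..." for the fragment "ttttc", "12..8 - cattc..." for "gaatg", and "UNMATCHED" for "gggg". Only the first occurrence of each fragment is reported, because indexOf stops at the first match.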

Notes

I revised the output above, based on the final description of the output for the '-' fragments.