CS 2102 Homework Assignment #2
Context
The examples for this homework assignment are derived from real, practical
examples found within the Microbiology and Molecular Biology domain. While I
cannot expect students to be well-versed in this domain, it is a given that a
college education provides at least a basic, fundamental knowledge of genomic
DNA sequencing. This section of the homework explains WHY we asked the
questions on this assignment.
Advances in microbiology and molecular biology have made it possible to
sequence the genomic DNA of organisms. One of the most widely studied
organisms is Escherichia coli (more commonly known as E. Coli). Full
information on this organism is widely available from
government data centers. For the purpose of this homework, you need only
the information I provide (but I encourage you to browse the wealth of
information available to anyone with an Internet connection!).
You have been hired as the technical contact to provide the following
computations to assist a lead biologist. She has already retrieved from the
government databanks the full genomic sequence of the K-12 strain of the E.
Coli organism, which consists of 4,639,675 base pairs. The full data set is
available directly (raw HTTP) or from within Eclipse via SourceForge. Open
the CVS perspective (Window -> Open Perspective -> Other... -> CVS Repository
Explorer). You should already have a repository location representing
SourceForge (see the class Tutorials if you haven't done this yet!). Expand
the HEAD branch and right-click on the Project folder, selecting the option
"Check Out". You can switch back to the Java perspective via Window -> Open
Perspective -> Java.
Note that if you download the file manually, you will have to move it
yourself into the Eclipse Project where you are doing the homework.
- This is not domain-specific
- The GC content for a particular genome is relevant
because "The GC content and length of each DNA molecule
dictates the strength of the association; the more complementary bases
exist, the stronger and longer-lasting the association, characterized
by the temperature required to break the hydrogen bond, its
melting temperature (also called Tm value)."
[http://en.wikipedia.org/wiki/DNA].
If the input file contains the E. Coli sequence, then the report generated
by this question can be used to calculate the C+G content. How does your
calculated value compare with the published C+G value of 50%?
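To make the C+G calculation concrete, here is a minimal sketch of one way to compute it. The class name GcContent and the method gcFraction are my own illustrative names, not part of the assignment, and I am assuming the sequence uses lowercase or uppercase a, c, g and t, with any other character (such as a line break) ignored.

```java
final class GcContent {
    /**
     * Returns the fraction of nucleotides in the sequence that are c or g.
     * Characters other than a, c, g, t (in either case) are skipped.
     */
    static double gcFraction(String sequence) {
        long gc = 0;     // number of c and g bases seen
        long total = 0;  // number of a, c, g and t bases seen
        for (int i = 0; i < sequence.length(); i++) {
            char base = Character.toLowerCase(sequence.charAt(i));
            if (base == 'c' || base == 'g') {
                gc++;
                total++;
            } else if (base == 'a' || base == 't') {
                total++;
            }
        }
        // An empty sequence has no defined GC content; report 0 by convention.
        return total == 0 ? 0.0 : (double) gc / total;
    }

    public static void main(String[] args) {
        // The first bases of the sample file, used only as a small demonstration.
        System.out.println(gcFraction("agcttttcattctgact"));
    }
}
```

Multiply the result by 100 to compare it against the published percentage.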
- There is a need to search for
genetic subsequences from within the E. Coli sequence to identify genes within the genome.
As you may know, DNA is a double helix composed of two strands (which we'll
call the "+" and "-" strands). By convention, the "+" strand is used as the
primary DNA genomic sequence.
Thus the sample E. Coli file starts with the sequence "agcttttcattctgact...".
The negative strand is complementary and is constructed by matching each
base with its complementary pair (a pairs with t, and c
pairs with g). Thus, the "-" strand of the E. Coli DNA is "tcgaaaagtaagactga..."
and contains 4,639,675 nucleotides.
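The pairing rule just described can be sketched as a small helper. The class name Strands and its complement methods are hypothetical names chosen for illustration, and the sketch assumes the sequence contains only the lowercase letters a, c, g and t.

```java
final class Strands {
    /** Returns the base that pairs with the given base (a with t, c with g). */
    static char complement(char base) {
        switch (base) {
            case 'a': return 't';
            case 't': return 'a';
            case 'c': return 'g';
            case 'g': return 'c';
            default:
                throw new IllegalArgumentException("Not a nucleotide: " + base);
        }
    }

    /** Builds the "-" strand text by complementing each base of the "+" strand in order. */
    static String complement(String plusStrand) {
        StringBuilder minusStrand = new StringBuilder(plusStrand.length());
        for (int i = 0; i < plusStrand.length(); i++) {
            minusStrand.append(complement(plusStrand.charAt(i)));
        }
        return minusStrand.toString();
    }

    public static void main(String[] args) {
        System.out.println(complement("agcttttcattctgact"));
    }
}
```

Applied to the opening bases of the sample file, this yields the "-" strand prefix quoted above.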
For example, within the E. Coli genome, the
insC-1 gene exists at location [380530..380940]. If you inspect this
substring of the genome (where the first base-pair is number 1. BE
CAREFUL!) you will find that the nucleotide sequence reads "gtgatagtct..."
This is a gene that can be found simply by searching through the E. Coli
genome for the string "gtgatagtct...".
To represent this gene, the format "380530..380940 +" is used to declare
that it was found on the "+" strand.
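Because Java strings are indexed from 0 while gene locations are numbered from 1, it is easy to be off by one here. The following is only a sketch of the conversion; PlusStrandSearch and locate are hypothetical names, and the tiny genome in main is made up for demonstration.

```java
final class PlusStrandSearch {
    /**
     * Searches the "+" strand for the fragment and returns "start..end +"
     * using 1-based, inclusive positions, or "UNMATCHED" if it is absent.
     */
    static String locate(String genome, String fragment) {
        int index = genome.indexOf(fragment);  // 0-based; -1 when not found
        if (index < 0) {
            return "UNMATCHED";
        }
        int start = index + 1;                 // convert to 1-based numbering
        int end = index + fragment.length();   // inclusive 1-based end position
        return start + ".." + end + " +";
    }

    public static void main(String[] args) {
        System.out.println(locate("agcttttcattctgact", "cttt"));
    }
}
```

As a check on the arithmetic, a gene reported at 380530..380940 has 380940 - 380530 + 1 = 411 bases and begins at Java index 380529.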
However, E. Coli has developed a sophisticated way of storing genes "in
reverse order" within the "-" strand. Thus, the gene "atgctgatt...cgttaa"
appears in reverse order and in complementary form as "ttaacg...aatcagcat"
starting at location [5683] on the "+" strand. To represent
this gene, the format "6459..5683 -" is used to declare that it was
found on the "-" strand (NOTE that the higher index is listed first).
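A "-" strand gene can be located by searching the "+" strand for its reverse complement. The sketch below assumes the fragment is given as the gene itself (for example "atgctgatt...cgttaa"); if your fragments are already written as they appear on the "+" strand, the reversal step is not needed. MinusStrandSearch and its method names are hypothetical.

```java
final class MinusStrandSearch {
    /** Reverses the gene and complements each base (a with t, c with g). */
    static String reverseComplement(String gene) {
        StringBuilder out = new StringBuilder(gene.length());
        for (int i = gene.length() - 1; i >= 0; i--) {
            switch (gene.charAt(i)) {
                case 'a': out.append('t'); break;
                case 't': out.append('a'); break;
                case 'c': out.append('g'); break;
                case 'g': out.append('c'); break;
                default:
                    throw new IllegalArgumentException("Not a nucleotide: " + gene.charAt(i));
            }
        }
        return out.toString();
    }

    /**
     * Returns "high..low -" (1-based, higher index first) when the reverse
     * complement of the gene occurs in the "+" strand, otherwise "UNMATCHED".
     */
    static String locateOnMinus(String genome, String gene) {
        int index = genome.indexOf(reverseComplement(gene));
        if (index < 0) {
            return "UNMATCHED";
        }
        int low = index + 1;               // 1-based position of the first matched base
        int high = index + gene.length();  // 1-based position of the last matched base
        return high + ".." + low + " -";
    }

    public static void main(String[] args) {
        // A made-up ten-base genome, used only to show the output format.
        System.out.println(locateOnMinus("ccttaacgcc", "cgttaa"));
    }
}
```

The same arithmetic explains the example above: a match of 777 bases starting at 5683 ends at 5683 + 777 - 1 = 6459.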
Your task is to use the Scanner to process a File containing the E. Coli
genomic sequence ("eColi") and a set of sample DNA fragments ("sequences")
stored one per line (there are ten of them). You are then to search through
the DNA genome for each sequence, one at a time. Given the sample set of ten
sequences, your report should look like this:
6459..5683 - ttaacgctgc...
21181..21399 + atgtgccggc...
87860..87946 + atgcgcaatt...
UNMATCHED
3025143..3026510 + atgtctggag...
3735200..3734376 - ttacatatgc...
UNMATCHED
4494274..4493213 - tcaaaaatcg...
UNMATCHED
UNMATCHED
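The input side of this task might be sketched as follows. I am assuming here that the genome may be split across several lines (the sketch simply joins them) and that blank lines in the fragment file should be skipped; SequenceInput and its methods are hypothetical names, and preview reflects my reading of the sample report, which shows the first ten matched bases followed by "...".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class SequenceInput {
    /** Joins every line of the input into a single genome string. */
    static String readGenome(Scanner in) {
        StringBuilder genome = new StringBuilder();
        while (in.hasNextLine()) {
            genome.append(in.nextLine().trim());
        }
        return genome.toString();
    }

    /** Reads one fragment per line, skipping blank lines. */
    static List<String> readFragments(Scanner in) {
        List<String> fragments = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (!line.isEmpty()) {
                fragments.add(line);
            }
        }
        return fragments;
    }

    /** Returns the first ten bases (or fewer) of the matched text plus "...". */
    static String preview(String matched) {
        return matched.substring(0, Math.min(10, matched.length())) + "...";
    }

    public static void main(String[] args) {
        // A Scanner over a String stands in for a Scanner over a File here.
        System.out.println(readFragments(new Scanner("atgtgccggc\nttacatatgc\n")));
    }
}
```

To read the real data, construct the Scanner from a File instead, for example new Scanner(new File("eColi")), which requires importing java.io.File and handling java.io.FileNotFoundException.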
Notes
- I revised the output above, based on the final
description of the output for the '-' fragments.