CS 110X Feb 07 2014

Lecture Path: 12

Expected Readings: 169-173, 175-179
Expected Interactions: Function definition, call Slicing
Clicker: Assessment

What’s in a name? that which we call a rose By any other name would smell as sweet;

William Shakespeare

1 Who’s on First, What’s on Second

1.1 Indexing On Strings

Indexing lets you identify individual elements of a list. The Python str data type shares similar behavior to a list. Let’s see whether indexing works on strings.

>>> x = 'uncopyrightable'
>>> print (x[3])
o
>>> print (x[0])
u

It appears that you can index into a string as well. Note that Python considers the first character of a string to be index 0.

So with all certainty, you can say that s[0] for any string s is the first character in the string.

With similar certainty, you know that s[len(s)-1] is the last character (or rightmost one) in the string. That is, if a string, s, has five characters, the last one is s[4]. You need to remember this point always, with the same bewilderment that you learned how to conjugate the "to be" verb in English. Does it really make logical sense to say "I am" and "He is" and "They are" when the verb is the same in all cases!?

To drive this point home, if you try to access a character with an index value that is too high, you will get an error. Thus for a four-letter string, such as s = 'test', trying to access s[4] causes an error.

>>> s = 'test'
>>> s[4]
Traceback (most recent call last):
  File "<pyshell#21>", line 1, in <module>
    s[4]
IndexError: string index out of range
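The two boundary rules above can be checked with a few lines of code. This is only a minimal sketch: the helper name firstAndLast is made up for this illustration, and it assumes the string is not empty.

```python
def firstAndLast(s):
    """Return the first and last characters of a non-empty string."""
    # hypothetical helper for illustration; assumes len(s) >= 1
    last = len(s) - 1            # highest valid index
    return (s[0], s[last])

word = 'test'
print(firstAndLast(word))        # ('t', 't')
print(len(word))                 # 4, so the valid indexes are 0 through 3
```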

It might not seem that interesting to extract individual characters from a string (though of course, this is a critical skill to know). Let’s talk, then, about extracting substrings from a string.

1.2 Index function also works on strings

With lists, you can use the index(value) function to identify the index position of a value in a list. You can do the same thing with a string in Python.

>>> x = 'uncopyrightable'
>>> x.index('copy')
2

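The index(target) function as shown will find the index position of the first occurrence of the target in the string x. An optional second argument tells index where to start searching, which lets you find later occurrences.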

>>> x = 'this,that,something else'
>>> x.index(',')
4
>>> x.index(',', 5)
9
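Rather than typing the starting position by hand, you can compute it from the previous result. The sketch below assumes the string contains at least two commas; the variable names first and second are chosen just for this example.

```python
x = 'this,that,something else'

first = x.index(',')                 # position of the first comma
second = x.index(',', first + 1)     # resume the search just past it

print(first)     # 4
print(second)    # 9
```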

You’ve seen how to concatenate strings together to form larger strings. The opposite operation is to extract substrings from an existing string. In Python, this technique is known as slicing.

The basic slice operation over a string s uses two values inside brackets, s[left:right], separated by a : character. This operation extracts the substring of s that starts with the character at index left and runs up to, but does not include, the character at index right. Once again, note how this range is "closed" on the left (the character at left is included), but "open" on the right (the character at right is not).

>>> x = 'uncopyrightable'
>>> x[2:11]
'copyright'
>>> x[10:]
'table'
>>> x[:2] + x[11:]
'unable'

And note that if left is omitted, then it is understood to be 0, the start of the string; if right is omitted, then it is understood to be len(s), one past the last index of the string.
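Slicing pairs naturally with index. As a small sketch (the names comma, head, and tail are invented for this example, and the string is assumed to contain a comma), here is one way to cut a string in two at its first comma:

```python
x = 'this,that,something else'

comma = x.index(',')      # 4
head = x[:comma]          # left omitted: starts at index 0
tail = x[comma + 1:]      # right omitted: runs to the end of the string

print(head)               # this
print(tail)               # that,something else

# Slicing at the same position on both sides loses nothing.
print(x[:comma] + x[comma:] == x)   # True
```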

One interesting thing to know about slicing is that it is very robust when you give it left and right values that fall outside the normal range.

>>> x = 'uncopyrightable'
>>> x[20]
Traceback (most recent call last):
  File "<pyshell#43>", line 1, in <module>
    x[20]
IndexError: string index out of range
>>> x[20:30]
''
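This robustness is handy when you cut a string into fixed-size pieces, because the final piece can simply come up short. The function below is a sketch only; the name chunks is made up here, and it assumes size is a positive integer.

```python
def chunks(s, size):
    """Cut s into pieces of length size; the last piece may be shorter."""
    # hypothetical helper for illustration; assumes size >= 1
    pieces = []
    for left in range(0, len(s), size):
        # left + size may run past the end of s; slicing does not mind
        pieces.append(s[left:left + size])
    return pieces

print(chunks('uncopyrightable', 4))   # ['unco', 'pyri', 'ghta', 'ble']
```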

1.3 operation ’in’ on strings

We have one more operation on lists that also applies to strings. The in operator was used to determine if a value occurred in a list. You can use it to determine if a substring can be found in a string.

>>> x = 'this,that,something else'
>>> ',' in x
True
>>> 'bad' in x
False
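One practical use of in is to guard a call to index, which raises a ValueError when the target does not occur in the string. The sketch below uses a made-up function name, positionOf, and returns -1 to signal "not found"; that return value is a choice made for this example.

```python
def positionOf(s, target):
    """Return the index of target in s, or -1 if target does not occur."""
    # hypothetical helper for illustration
    if target in s:
        return s.index(target)
    return -1

x = 'this,that,something else'
print(positionOf(x, ','))      # 4
print(positionOf(x, 'bad'))    # -1
```

Python's built-in find method on strings follows the same convention, so in practice x.find('bad') gives the same answer.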

1.4 Split string into a list of substrings

Python also provides a powerful operation called split that lets you extract substrings that are separated by a common delimiter. For example, you might have a string that contains a number of values separated by a comma, or a space.

>>> x = 'this is a test'
>>> x.split()
['this', 'is', 'a', 'test']

By default, split separates words by whitespace characters. But you can also provide a character to use instead:

>>> x = 'this,that,something else'
>>> x.split(',')
['this', 'that', 'something else']

Split is rather single-minded in its processing. What happens, for example, if you have a series of commas with no characters between them?

>>> x = '1,2,4,,,6,7'
>>> x.split(',')
['1', '2', '4', '', '', '6', '7']

So you have to be careful when processing these values after the fact.
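For instance, converting every field to an integer would fail on the empty strings. One possible way to handle this, sketched below, is to skip empty fields; whether skipping is the right policy depends on what the data means, so treat it as an assumption of this example.

```python
x = '1,2,4,,,6,7'

numbers = []
for field in x.split(','):
    if field != '':                  # skip the gaps between adjacent commas
        numbers.append(int(field))

print(numbers)    # [1, 2, 4, 6, 7]
```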

1.5 Iteration over a string

The final operation that will be useful is iterating over a string’s characters.

The definite for loop has proven useful when given a list of values. You can use it also when processing a string.

What if you wanted to count the vowels in a string entered by the user?

You can easily determine if a string contains a specific letter using the in operator, but that doesn’t seem the right approach. Consider, first, what you can do with a for loop.

x = 'unfathomable'
for char in x:
    print (char)

So this will print all letters in the word, one per line. But how do you count the vowels? I guess you could use a series of comparisons in an if statement:

def countVowels(s):
    """Count number of vowels in the given string"""
    numVowels = 0
    for char in s:
        if char == 'a' or char == 'e' or char == 'i' or char == 'o' or char == 'u':
            numVowels = numVowels + 1
    return (numVowels)

See? Even the web browser doesn’t like this option! Instead, you can use the in operator described earlier.

def countVowels(s):
    """Count number of vowels in the given string"""
    numVowels = 0
    for char in s:
        if char in 'AEIOUaeiou':
            numVowels = numVowels + 1
    return (numVowels)

Note how this covers all cases, because if the char appears within the vowel string, then it is a vowel.
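The same pattern works for any set of characters, not just vowels. As a sketch, the function below (countMatching is a made-up name) takes the characters to look for as a second parameter:

```python
def countMatching(s, letters):
    """Count the characters of s that appear somewhere in letters."""
    # hypothetical generalization of countVowels, for illustration
    count = 0
    for char in s:
        if char in letters:
            count = count + 1
    return count

print(countMatching('unfathomable', 'AEIOUaeiou'))   # 5
print(countMatching('TTGATTACCT', 'GC'))             # 3
```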

1.6 Put This Together To Solve A Real-World Problem

Let’s tackle a real problem. Consider dealing with genomic data, which is stored using a variety of formats. Here is a sample GenBank formatted file.

LOCUS       GXP_170357   743 bp   DNA
DEFINITION  loc=GXL_141619|sym=TPH2|geneid=121278|acc=GXP_170357|
            taxid=9606|spec=Homo sapiens|chr=12|ctg=NC_000012|str=(+)|
            start=70618393|end=70619135|len=743|tss=501,632|
            homgroup=4612|promset=1|descr=tryptophan hydroxylase 2|
            comm=GXT_2756574/AK094614/632/gold; GXT_2799672/NM_173353/501/bronze
ACCESSION   GXP_170357
BASE COUNT  216 a 180 c 147 g 200 t
ORIGIN
        1 TTGATTACCT TATTTGATCA TTACACATTG TACGCTTGTG TCAAAATATC ACATGTGCCT
       61 TATAAATGTG TACAACTATT AGTTATCCAT AAAAATTAAA AATTAAAAAA TCCGTAAAAT
      121 GGTTTAAGCA TTCAGCAGTG CTGATCTTTC TTAAATTATT TTTCTAATTT TGGAAAGAAA
      181 GCACAAAATC TTTGAATTCA CAATTGCTTA AAGACTGAGG TTAACTTGCC AGTGGCAGGC
      241 TTGAGAGATG AGAGAACTAA CGTCAGAGGA TAGATGGTTT CTTGTACAAA TAACACCCCC
      301 TTATGTATTG TTCTCCACCA CCCCCGCCCA AAAAGCTACT CGACCTATGA AACAAATCAC
      361 ACTATGAGCA CAGATAACCC CAGGCTTCAG GTCTGTAATC TGACTGTGGC CATCGGCAAC
      421 CAGAAATGAG TTTCTTTCTA ATCAGTCTTG CATCAGTCTC CAGTCATTCA TATAAAGGAG
      481 CCCGGGGATG GGAGGATTCG CATTGCTCTT CAGCACCAGG GTTCTGGACA GCGCCCCAAG
      541 CAGGCAGCTG ATCGCACGCC CCTTCCTCTC AATCTCCGCC AGCGCTGCTA CTGCCCCTCT
      601 AGTACCCCCT GCTGCAGAGA AAGAATATTA CACCGGGATC CATGCAGCCA GCAATGATGA
      661 TGTTTTCCAG TAAATACTGG GCACGGAGAG GGTTTTCCCT GGATTCAGCA GTGCCCGAAG
      721 AGCATCAGCT ACTTGGCAGC TCA
//

What if you wanted a function that would return the single genomic string given a file encoded as a GenBank file?

You have all of the knowledge, but you likely need more practice to be able to turn this knowledge into the following code.

def loadGenomicSequence(fileName):
    """
    Retrieve single genetic sequence from GenBank-encoded file.

    For more information on GenBank encoding, visit following URL:
    http://www.nmpdr.org/FIG/wiki/view.cgi/FIG/GBK
    """
    inFile = open (fileName, 'r')
    sequence = ''
    decode = False
    for line in inFile:
        if decode:
            # extract substring from 10th character onwards; rstrip()
            # drops the trailing newline so that it cannot end up in
            # the sequence when the last line is a short one
            decades = line[10:].rstrip()

            # concatenate all base pairs on line. If index values
            # exceed the length of the string, the result defaults
            # to '' and the result will be acceptable
            sequence = sequence + decades[0:10] + decades[11:21]
            sequence = sequence + decades[22:32] + decades[33:43]
            sequence = sequence + decades[44:54] + decades[55:65]

        # once you see this start a line, you know bases are coming
        if line[0:6] == 'ORIGIN':
            decode = True

    inFile.close()
    return (sequence)

Try this program on the following sample pathogen description (local file):

This function loads the file and processes each line, one at a time, until it sees a line starting with the word ORIGIN. Then with each subsequent pass through the loop, it grabs the DNA sequence data on each line and groups the decades of base pairs together into a single string.
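As a point of comparison, the same extraction can be sketched with split instead of fixed slice positions. This is an alternative for illustration, not the function above: the name extractSequence is made up, it takes a list of lines rather than a file name so that it can be tried without a file, and the sample lines are a shortened, invented record in the same layout.

```python
def extractSequence(lines):
    """Join the base groups found after the ORIGIN line into one string."""
    # hypothetical alternative for illustration; lines is a list of strings
    sequence = ''
    decode = False
    for line in lines:
        if decode:
            fields = line.split()
            # fields[0] is the position number (or '//' on the final line);
            # whatever follows it are the groups of bases
            for group in fields[1:]:
                sequence = sequence + group

        # once you see this start a line, you know bases are coming
        if line[0:6] == 'ORIGIN':
            decode = True
    return sequence

# Shortened, made-up sample in the GenBank layout.
sample = [
    'LOCUS       SAMPLE   23 bp   DNA\n',
    'ORIGIN\n',
    '        1 TTGATTACCT TATTTGATCA\n',
    '       21 TCA\n',
    '//\n',
]
print(extractSequence(sample))   # TTGATTACCTTATTTGATCATCA
```

Because split discards all whitespace, including the newline at the end of each line, this version does not depend on the exact column at which the bases start.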

1.7 Prepare for Monday

In lab3, I had you try the following scenario:

>>> myList = [1, 2, 4, 5]
>>> yourList = myList
>>> myList[2] = 99
>>> print (yourList)
[1, 2, 99, 5]

What do you think will happen when you print out the value of yourList? See if you can work out the answer before you type the print statement above.