1 One Week To Go
1.1 Accessing Sources Of Data On Internet
1.2 Information Retrieval from Web Site
1.3 Scraping Content From Web Site
1.4 Clicker Question
1.5 Programming to API
1.6 Homework 7
1.7 Sequence Operations
1.8 Version: 2014/02/28

CS 110X Feb 27 2014

Lecture Path: 21

Expected Readings: 339-346
Clicker: Assessment
Homework 7 Assigned: Assignment

The secret of getting ahead is getting started. The secret of getting started is breaking your complex overwhelming tasks into small manageable tasks, and then starting on the first one.
Mark Twain

1 One Week To Go

Let's briefly discuss your experience in LAB6 yesterday.

1.1 Accessing Sources Of Data On Internet

You will often find that you would like to access data and other information available on the Internet. This access follows two basic forms:

1. Retrieving information posted on a web site, possibly by "scraping" content from its HTML pages (Sections 1.2 and 1.3).
2. Programming to an API that a site provides for structured access (Section 1.5).

We will discuss each of these in turn.

1.2 Information Retrieval from Web Site

Often web sites post useful information in human-readable form, such as a table or formatted text. In these cases you would like to retrieve the information directly, rather than saving the web page to your local disk. Indeed, the information may change frequently, which makes it necessary to go directly "to the source".

First, let’s discuss sample code that shows how to retrieve information directly from a website.

>>> import urllib2
>>> url = 'http://web.cs.wpi.edu/cgi-bin/heineman/contest/register/current.cgi'
>>> response = urllib2.urlopen(url)
>>> html = response.read()
>>> print (html)
<html><head><title>Contest</title></head>
<body>
<HTML>
<HEAD>
<TITLE>WPI Programming Contest: Current Registration</TITLE>
<LINK REL=stylesheet TYPE=text/css HREF="/cs-style.css">
</HEAD>
...

If you visit the web site directly, you can see the HTML table that contains the information.

Wouldn’t it be great to be able to retrieve this information and access it from within a Python program?

At the same time, many CSV files are also available online, so could you use a similar approach to grab that data directly from a website? The answer to both of these questions is yes. Let's start with the easier CSV problem.

import csv
import urllib2

def extractAllRecordsFromURL(url):
    """
    Extract all CSV records from file at given URL

    Note that the first element in this list contains the description
    of the columns as defined in the CSV file
    """
    response = urllib2.urlopen(url)
    html = response.read()

    # Trim off excess '\n' that end the file
    while html[-1] == '\n':
        html = html[:-1]

    lines = html.split('\n')
    reader = csv.reader(lines)
    results = csvProcess(reader)
    return (results)

def csvProcess(reader):
    """ Common helper that iterates over CSV reader to collect rows """
    results = []
    for row in reader:
        results.append(row)
    return (results)

First observe that we simply retrieve all data from the URL as a single large string, html. To interpret this string as a CSV file, use split('\n') to separate it into lines, each of which represents a row in the data. We must be careful to avoid creating blank rows when the data ends with one or more '\n' characters, so these are trimmed off before processing.

This code should look familiar from the helper.py module you wrote for HW6. The only change is that a new csvProcess function was created to be shared by both extractAllRecords and extractAllRecordsFromURL. Whenever you are writing code and notice duplication, take the time to extract a common function so the duplicate code does not spread through your programs.

This helper module will be provided for HW7.
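As a quick illustration, here is how you might call it. The URL and data below are hypothetical placeholders, not an actual course data set:

>>> url = 'http://www.example.com/data/rainfall.csv'
>>> records = extractAllRecordsFromURL(url)
>>> records[0]                # first row describes the columns
['Month', 'Rainfall']
>>> records[1]
['January', '3.92']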

1.3 Scraping Content From Web Site

Before we can solve this problem, let's work through a simpler question:

You are given a string of the form '(a,b,c),(d,e,f,g),(h,i)' and you want to create a list of lists from this data. In this case, [['a','b','c'], ['d','e','f','g'], ['h','i']].

To solve this problem, let's work out manually how it should proceed. First you need to find the regions bracketed by "(" and ")" characters, and then for each of these regions you need to convert the comma-separated substring into a list.

def extract(s):
    """ Print each comma-separated '(items)' region found in s """
    while '(' in s:
        start = s.index('(')
        end = s.index(')')
        print s[start+1:end]
        s = s[end+1:]

As long as the given string contains an open parenthesis, find the first '(' and the next ')' and extract just the substring between them. Note that the above code simply prints out these substrings.

>>> s = '(a,b,c),(d,e,f,g),(h,i)'
>>> extract(s)
a,b,c
d,e,f,g
h,i

So instead of just printing these substrings, the final version of the code splits each one on ',' and appends the resulting sublist to a final list.

def extract(s):
    """ Return list of string literal lists for comma-separated '(items)' """
    finalList = []
    while '(' in s:
        start = s.index('(')
        end = s.index(')')
        substr = s[start+1:end]
        innerList = substr.split(',')
        finalList.append(innerList)
        s = s[end+1:]
    return (finalList)
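Running this final version on the original string produces the desired list of lists:

>>> extract('(a,b,c),(d,e,f,g),(h,i)')
[['a', 'b', 'c'], ['d', 'e', 'f', 'g'], ['h', 'i']]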

You can use this same approach when processing more complicated structures, such as HTML tables. The full code for this capability can be found here.
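As a sketch of the idea (a hypothetical illustration, not the actual code provided for the course), the same index-based technique can pull the cell values out of a simple HTML table row, assuming lowercase <td> tags with no attributes:

def extractCells(row):
    """ Return list of cell strings from one '<td>...</td>' HTML table row """
    cells = []
    while '<td>' in row:
        start = row.index('<td>')
        end = row.index('</td>')
        cells.append(row[start+4:end])     # skip past '<td>' (4 characters)
        row = row[end+5:]                  # skip past '</td>' (5 characters)
    return (cells)

>>> extractCells('<tr><td>WPI</td><td>Worcester</td></tr>')
['WPI', 'Worcester']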

1.4 Clicker Question

Takes place now.

1.5 Programming to API

Homework 7 is now released. To get you started on this project, we now discuss how to programmatically access information on the Internet rather than simply "scraping HTML content" as shown earlier.

Many systems offer an Application Programming Interface (API) for retrieving information. For Homework 7, you are to use Forecast.io as a source of weather information.

Using these APIs you can retrieve an incredibly rich amount of information. Here, for example, is how you can find the current weather forecast for Worcester, MA.
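The request is simply a URL. The exact request shown in lecture is not reproduced here, but it follows the general Forecast.io form of an API key followed by a latitude,longitude pair. The coordinates below are approximately Worcester's, and SECRET stands in for the course's API key:

https://api.forecast.io/forecast/SECRET/42.2626,-71.8023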

Note: once this course has ended, this API "SECRET" key will be reset and the above request will no longer operate.

Of course, this raw response is not directly useful to you, since the information is complicated to interpret. It is encoded in an Internet standard known as JavaScript Object Notation, or JSON for short.

Fortunately there are many freely available libraries that allow you to parse and interpret this information from within a Python program. If you look closely at the information returned earlier, you will likely see that it conforms to the standard dictionary format we have already discussed for Python. Now let's discuss how you can take advantage of this.
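For instance, Python's standard json module can convert the response text into a dictionary. Here is a minimal sketch; the keys 'currently' and 'temperature' are assumptions about the shape of a Forecast.io response, not something specified by the course materials:

import urllib2
import json

url = 'https://api.forecast.io/forecast/SECRET/42.2626,-71.8023'
response = urllib2.urlopen(url)
forecast = json.loads(response.read())      # parse JSON text into nested dictionaries

# 'currently' and 'temperature' are assumed keys in the Forecast.io response
print forecast['currently']['temperature']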

I will now install the files I've made available for HW7 to demonstrate this capability and prepare you for the assignment.

TBA live

1.6 Homework 7

I want to spend a little time describing Homework 7, the final assignment for this course.

1.7 Sequence Operations

We have now seen all of the basic Python operations over list structures. Let’s start with the fundamental operations over sequences.
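Here are the standard Python sequence operations, demonstrated on a string (the table shown in lecture may have been organized differently):

>>> s = 'abcdef'
>>> 'c' in s              # membership
True
>>> s + 'gh'              # concatenation
'abcdefgh'
>>> s * 2                 # repetition
'abcdefabcdef'
>>> s[1]                  # indexing
'b'
>>> s[1:4]                # slicing
'bcd'
>>> len(s)                # length
6
>>> min(s), max(s)        # smallest and largest elements
('a', 'f')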

These same operations apply to lists, because lists can be treated as sequences. Lists are special, however, and have additional operations, which we will summarize tomorrow.

1.8 Version: 2014/02/28

(c) 2014, George Heineman