Lab 4: Modeling Early Website Catalogs
1 Lab Objectives
2 Cataloging the World’s URLs
3 Concrete Problems

1 Lab Objectives

  1. Give you practice programming with trees

  2. Expose you to the history of how material has been organized for search on the internet

In this lab, we do the first of three exercises (over the rest of the term) that will expose you to a bit of the history of data structures for organizing and searching the web. This week, we do a slightly simplified, but still reasonable, version of the earliest attempts to "organize" the collection of URLs to help people find information.

2 Cataloging the World’s URLs

The World Wide Web debuted in 1991. In the early days, there weren’t too many websites, but people still needed a way to find what was out there. Google didn’t yet exist as a company, and there were no fast search engines like the ones we have today. Instead, humans maintained catalogs of URLs on the web. If you wanted to find websites about travel, for example, you looked in the catalog for listings of travel websites.

While automatic tools could search the Web and detect when new sites had been created, these tools couldn’t figure out what the sites were about. Thus early catalogs required human effort to tag and organize the newly found URLs.

These early catalogs were hierarchical – there would be a Sports category, with subcategories for each sport (Baseball, Tennis, Auto Racing, etc). Each subcategory could also have subcategories (such as different leagues under Baseball). It was similar to folders and subfolders as you have on your computer today. Individual pages/URLs were like the files (the "lowest level" items that had no items below them).

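In this lab, you will model such a catalog as a tree of categories and URLs, then write methods that add to it and search it.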

3 Concrete Problems

We don’t expect that you will get through all of these problems during lab, but we still wanted to post a complete exercise. You should at least get through creating the classes and creating the initial example of a web hierarchy. Ideally, you would also make headway on the addURL method, as this shows whether you are able to write programs that process trees. If you don’t get that far and want practice problems on programming with trees, you can work on these problems later.

  1. Create a class for URLs, which contains three fields: a string with the URL address, a string with the month that the URL was added to the system, and an integer for the year that the URL was added to the system.

  2. Create a class for WebCategory. Each category has a name (a string like "News", "Sports", etc), a list of subcategories, and a list of URLs (ones that don’t fit any subcategory further down).

  3. Create a class TheWeb that has a list of WebCategories – these are like the top-level folders in your computer’s filesystem.

  4. Create a TheWeb object with at least one top-level category, one of which has at least two subcategories. Your subcategories may start out with no URLs (this lets you move on sooner to writing a method that adds them).

    Pick categories that you find personally interesting. Suggestions include News (with subcategories for different countries or topics), Entertainment (with subcategories for Music, Movies, and TV), Travel, Arts, Sports, and so on. Don’t spend TOO much time on this. Leave yourself enough time to try to write at least one method.

  5. Add a method called addURL to the TheWeb class. It takes the name of a (sub)category and a URL string. It traverses the hierarchy of subcategories until it finds the named category, then adds the given URL to the list of URLs in that category (use today’s month and year for those fields). You may assume that the named category is in the hierarchy.

  6. Add a method called getCatalog to the TheWeb class. It takes the name of a (sub)category and returns a list of all the URLs contained in that category (both directly and in subcategories nested within it).

    For an extra twist on this problem, you could instead return a list of strings that embed both the URL and the sequence of categories that one has to traverse to get there, such as:

      currentWeb.getCatalog("Universities") returns

      ["USA-Massachusetts-Private-https://www.wpi.edu",
       "USA-Massachusetts-Public-https://www.umass.edu",
       "UK-England-https://www.ox.ac.uk"]

    The challenge here is building up the string with the category names to insert in front of the URL.
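If you get stuck, the sketch below shows one possible shape for the classes and the two methods. It is not a reference solution, and it rests on several assumptions: the lab does not name a language, so Java is assumed here; the URL class is called WebURL to avoid clashing with java.net.URL; the helpers addSubcategory, find, and collect and the field topLevel are hypothetical names; and the month is stored as an uppercase English name such as "MARCH". getCatalog implements the "extra twist" version, threading the category path down the recursion as a string prefix.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// One catalogued address plus the month and year it was added.
// The name WebURL is an assumption, chosen to avoid java.net.URL.
class WebURL {
    final String address;
    final String month;
    final int year;

    WebURL(String address, String month, int year) {
        this.address = address;
        this.month = month;
        this.year = year;
    }
}

// A tree node: a name, child categories, and URLs filed directly here.
class WebCategory {
    final String name;
    final List<WebCategory> subcategories = new ArrayList<>();
    final List<WebURL> urls = new ArrayList<>();

    WebCategory(String name) { this.name = name; }

    // Hypothetical convenience for building examples; returns the new child.
    WebCategory addSubcategory(String childName) {
        WebCategory child = new WebCategory(childName);
        subcategories.add(child);
        return child;
    }

    // Depth-first search for the category with the given name, or null.
    WebCategory find(String target) {
        if (name.equals(target)) return this;
        for (WebCategory sub : subcategories) {
            WebCategory hit = sub.find(target);
            if (hit != null) return hit;
        }
        return null;
    }

    // Collects every URL at or below this category. prefix holds the
    // dash-joined names of the categories walked so far.
    void collect(String prefix, List<String> out) {
        for (WebURL u : urls) out.add(prefix + u.address);
        for (WebCategory sub : subcategories) sub.collect(prefix + sub.name + "-", out);
    }
}

class TheWeb {
    final List<WebCategory> topLevel = new ArrayList<>();

    // Searches each top-level tree in turn; null if the name is absent.
    private WebCategory find(String target) {
        for (WebCategory c : topLevel) {
            WebCategory hit = c.find(target);
            if (hit != null) return hit;
        }
        return null;
    }

    // The lab lets us assume the category exists, so there is no null check.
    void addURL(String categoryName, String address) {
        LocalDate today = LocalDate.now();
        WebURL url = new WebURL(address, today.getMonth().toString(), today.getYear());
        find(categoryName).urls.add(url);
    }

    // "Extra twist" variant: each entry is the category path plus the address.
    List<String> getCatalog(String categoryName) {
        List<String> out = new ArrayList<>();
        find(categoryName).collect("", out);
        return out;
    }
}

class CatalogDemo {
    public static void main(String[] args) {
        TheWeb web = new TheWeb();
        WebCategory unis = new WebCategory("Universities");
        web.topLevel.add(unis);
        WebCategory mass = unis.addSubcategory("USA").addSubcategory("Massachusetts");
        mass.addSubcategory("Private");
        mass.addSubcategory("Public");
        unis.addSubcategory("UK").addSubcategory("England");
        web.addURL("Private", "https://www.wpi.edu");
        web.addURL("Public", "https://www.umass.edu");
        web.addURL("England", "https://www.ox.ac.uk");
        System.out.println(web.getCatalog("Universities"));
    }
}
```

Passing the prefix as a parameter means each recursive call extends its own copy of the path, so nothing has to be undone when the traversal backs out of a subcategory. To get the plain version of getCatalog, call collect with an empty prefix and drop the `prefix + sub.name + "-"` step.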