CS525. Advanced Topics in Database Systems
Large-Scale Data Management

Class Meetings

         Term: Spring-2013
         Room: FL-320 (Fuller Labs-320)
         Date/Time: Tuesday and Thursday, 4:00pm - 5:20pm.

Instructor/Office Hours
          Prof. Mohamed Eltabakh, FL-235, meltabakh@cs.wpi.edu
         Office Hours:  Tuesday and Thursday, 3:00pm-4:00pm. Students are also welcome to arrange other meeting times by email.

Course Overview (Catalog Info)
Advances in technology, science, hardware, software, and communication networks have enabled many emerging applications in business enterprises, scientific and engineering disciplines, social networks, and government endeavors, among others, to generate and collect data at an unprecedented scale and complexity. This data needs to be managed and analyzed efficiently. In fact, progress and innovation in these domains and applications are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely and scalable manner. In this course, we focus on studying new technologies and infrastructures developed for large-scale data management, including the MapReduce infrastructure, the Pregel platform, and cloud-enabled computing. We will also cover the query optimizations, access methods, storage layouts, and energy management techniques developed over these infrastructures. Because this is an advanced course, one or more research-oriented projects will be proposed to allow students to explore new directions and research ideas in large-scale data management. This course will be very useful for students pursuing research (either MSc or PhD) in database systems and data management.

Tentative Overview on Topics
The topics are divided into four categories: (i) motivating applications, (ii) state-of-the-art infrastructures such as Hadoop MapReduce, (iii) optimizations on these infrastructures, and (iv) advanced techniques and algorithms on these infrastructures. Here is the tentative overview:

(i) Motivation and Applications

    - Introduction to Large-Scale Data Management
    - Application I: Scientific Data Management
    - Application II: Social-Media Networks
    - Application III: Business Enterprises and Log Processing

(ii) Infrastructures
    - MapReduce Framework
    - Pregel Platform
    - Cloud-Enabled Computing
    - Other Industry-Developed Large-Scale Distributed Platforms

(iii) Optimizations (For each Infrastructure mentioned above)
    - Query Optimizations
    - Access Methods
    - Storage Layouts and Optimizations   
    - Energy Management in Big Data

(iv) Advanced techniques and algorithms
    - Integrating MapReduce Framework with Other Data Management Technologies
    - Machine learning, data mining, and statistical algorithms on MapReduce

Course Objectives
This course has several objectives, including:
   1-  Learning state-of-the-art techniques in large-scale data management that apply to many modern applications.
   2-  Learning how to prepare and present technical papers, which is an essential skill for students and researchers.
   3-  Learning how to review papers. Reviewing technical and scientific papers is a skill that you need to develop. Throughout this course, you will review several papers.
   4-  Working on a semester-long project that can potentially lead to a publication.

The course is organized as a series of seminars presented by the instructor and students. The instructor will present several lectures covering state-of-the-art techniques in various topics. Each student is expected to present two to three papers on a given topic. For a given lecture, all non-presenting students are expected to read the presented paper and to submit a one-page review that highlights (1) the main idea of the paper, (2) two or three strong points, and (3) two or three weak points of the paper. Most of the course material will be taken from conferences in database systems such as SIGMOD, VLDB, and ICDE.
With respect to the project, students will form teams of two or three to work on a semester-long research project. An ideal project will involve implementing some of the techniques covered in class along with some modifications or extensions to them, or performing a comparative study of alternative techniques. However, the project is not limited to the covered material. A good project could result in a publishable paper.


Students are expected to have a strong background in and knowledge of relational database management systems. Prior courses in databases, e.g., CS542, CS4432, or equivalent courses, are recommended. Students are also expected to have strong programming skills in languages such as C or Java.

Course Load & Grading Policy
Projects (6 or 7)
Each project will be done in teams of two.
Presentations (1 or 2)
Each presentation will be done in teams of two. If the number of teams is large, some teams may do one presentation + an extra project.
Reviews
Reviews are done individually. Whenever a team is presenting a paper, other students are expected to read the presented paper and submit a review on it.
Class Participation
Includes discussions in class and attendance.

WPI E-System
In addition to this website, the course is also available at blackboard.wpi.edu.

Discussion Board
Please use the discussion board available at blackboard.wpi.edu for any course-related discussion and exchange of emails.