CS585/DS503. Big Data Management


Textbook

There is no single textbook that covers the diverse material of this course. The course is based on recent research papers from major database conferences and journals, plus selected chapters from different books.


Reference Books For Selected Core Course Material.

0- Database Management Systems, Third Edition
    Raghu Ramakrishnan and Johannes Gehrke
    ISBN: 9780072465631
    URL: webpage
1- Database Systems: The Complete Book, Second Edition
    Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer D. Widom
    ISBN: 9780131873254
    URL: webpage
2- Modern Database Management, Tenth Edition
    Jeffrey A. Hoffer, V. Ramesh, Heikki Topi
    ISBN-13: 978-0-13-608839-4
    URL: webpage
3- Principles of Distributed Database Systems, Third Edition
    Tamer Ozsu, Patrick Valduriez
    ISBN: 978-1-4419-8833-1
    URL: webpage
4- Hadoop: The Definitive Guide, Third Edition
    Tom White
    ISBN: 978-1-4493-1152-0
    URL: webpage
5- MongoDB: The Definitive Guide, Third Edition
    Kristina Chodorow
    ISBN: 978-1-449-38156-1
    URL: webpage

Readings from the Big Data Literature

In general, the highest-quality professional venues from which readings in big data can be drawn are either journals (typically published as a volume once every month), such as:

ACM Transactions on Database Systems (ACM TODS)
IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE)
The Very Large Databases Journal (VLDBJ)
Data and Knowledge Engineering Journal (DKE)
Information Systems Journal (ISJ)

or professional conference proceedings of the same calibre (typically held and published once a year), such as:

ACM Special Interest Group on Management of Data (ACM SIGMOD)
Int. Conf. on Very Large Databases (VLDB)
IEEE Int. Conf. on Data Engineering (ICDE)
IEEE Int. Conference on Big Data (BigData)
Int. Conf. on Information and Knowledge Management (CIKM)
ACM Symposium on Principles of Database Systems (PODS)

In general, quality online search engines and sources for finding database-related literature include:

  • Link: Michael Ley's DBLP bibliography server for computer science bibliography containing links to many on-line papers (e.g., VLDB).
  • Link: ACM Digital Library for ACM conferences (e.g., SIGMOD, VLDB, SOSP) and journals (e.g., TODS).
  • Link: IEEE Xplore for IEEE conferences (e.g., ICDE) and journals (e.g., TKDE).
  • Link: Springer LINK for Springer and Kluwer publications (e.g., Lecture Notes in Computer Science).
  • Link: USENIX Events for USENIX conferences (e.g., OSDI).
  • Note: WPI subscribes to ACM DIGITAL LIBRARIES and thus has the online versions of most journals and conferences of IEEE and ACM. So go and search for papers on topics that excite you!

    Google Scholar can also contain relevant resources: http://scholar.google.com/

    Other appropriate sources for big data systems work also exist, such as vendor-specific or technology-specific venues. In the context of this course, however, make sure your sources have been vetted for quality before using the material.

    Below is a subset of possible readings in specific topic areas of database research. Students can select papers to present from this list, and are also welcome to suggest other papers that are not on it. In either case, students should consult the instructor about the papers they would like to present before presenting them.
    Category 1: Large-Scale Data Analytics using Hadoop and Map-Reduce Framework
    • Chuan Lei, Zhongfang Zhuang, Elke A. Rundensteiner, Mohamed Y. Eltabakh: Redoop Infrastructure for Recurring Big Data Queries. PVLDB 7(13): 1589-1592 (2014)
    • Chuan Lei, Elke Rundensteiner and Mohamed Eltabakh, Redoop: Supporting Recurring Queries in Hadoop, EDBT'2014. pp. 25-36.
    • Chuan Lei, Zhongfang Zhuang, Elke A. Rundensteiner, Mohamed Y. Eltabakh, Shared Execution of Recurring Workloads in MapReduce, PVLDB Proceedings, Volume 8, Issue 7 (2015). The 41st International Conference on Very Large Data Bases, August 31st - September 4th, 2015, Kohala Coast, Hawaii.
    • B. Li, E. Mazur, Y. Diao, A. McGregor, and P. J. Shenoy. A platform for scalable one-pass analytics using mapreduce. In SIGMOD, pages 985–996, 2011.
    • H. Park, R. Ikeda, and J. Widom. Ramp: A system for capturing and tracing provenance in mapreduce workflows. In VLDB. Stanford InfoLab, August 2011.
    • T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. In NSDI, pages 313–328, 2010.
    • V. Kumar, H. Andrade, B. Gedik, and K.-L. Wu. Deduce: at the intersection of mapreduce and stream processing. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT ’10, pages 657–662, New York, NY, USA, 2010. ACM.
    • A. Thusoo, R. Murthy, J. S. Sarma, Z. Shao, N. Jain, P. Chakka, S. Anthony, H. Liu, and N. Zhang. Hive - a petabyte scale data warehousing using hadoop. In ICDE, 2010.
    • D. J. Abadi. Tradeoffs between parallel database systems, hadoop, and hadoopdb as platforms for petabyte-scale analysis. In SSDBM, pages 1–3, 2010.
    • S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in mapreduce. In SIGMOD Conference, pages 975–986, 2010.
    • J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In VLDB, volume 3, pages 518–529, 2010.
    • A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.
    • A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. In VLDB, pages 922–933, 2009.
    • E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow., 2(2):1402–1413, 2009.
    • A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414–1425, 2009.
    • C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD, pages 1099–1110, 2008.
    • H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: Simplified relational data processing on large clusters. In SIGMOD, pages 1029–1040, 2007.
    • J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137–150, 2004.

    Category 2: Cloud and Distributed Computing
    • Umar Farooq Minhas, Shriram Rajagopalan, Brendan Cully, Ashraf Aboulnaga, Kenneth Salem, and Andrew Warfield. RemusDB: Transparent High Availability for Database Systems. Proceedings of the VLDB Endowment (PVLDB), 2011.
    • David J. DeWitt, Eric Robinson, Srinath Shankar, Erik Paulson, Jeffrey Naughton, Andrew Krioukov, and Joshua Royalty. Clustera: An Integrated Computation and Data Management System.
    • Parag Agrawal, Daniel Kifer, and Christopher Olston. Scheduling Shared Scans of Large Data Files. VLDB 2008.
    • Christopher Olston, Benjamin Reed, Adam Silberstein, and Utkarsh Srivastava. Automatic Optimization of Parallel Dataflow Programs. USENIX Annual Conference 2008.
    • Lei Chen, Christopher Olston, and Raghu Ramakrishnan. Parallel Evaluation of Composite Aggregate Queries. ICDE 2008.
    • Matthias Brantner, Daniela Florescu, David A. Graf, Donald Kossmann, and Tim Kraska. Building a Database on S3. SIGMOD 2008.
    • Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, and Raghu Ramakrishnan. Efficient Bulk Insertion Into a Distributed Ordered Table. SIGMOD 2008.
    • Eric Robinson and David J. DeWitt. Turning Cluster Management into Data Management; A System Overview. CIDR 2007.
    • Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization.
    • Khuzaima Daudjee and Kenneth Salem. Lazy Database Replication with Snapshot Isolation. VLDB 2006.
    Category 3: Parallel and Distributed Databases
    • Book: Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer D. Widom, Database Systems: The Complete Book
    • Book: Tamer Ozsu, Patrick Valduriez, Principles of Distributed Database Systems
    Category 4: Scientific Data Management
    • ArrayStore: A Storage Manager for Complex Parallel Array Processing, Emad Soroush, Magdalena Balazinska, and Daniel Wang. SIGMOD'11, June 12-16, 2011, Athens, Greece.
    • Aida Gandara, George Chin, Paulo Pinheiro Da Silva, Chandrika Sivaramakrishnan, Signe White and Terence Critchlow, Knowledge Annotations in Scientific Workflows: An Implementation in Kepler, SSDBM 2011
    • Overview of SciDB, Large Scale Array Storage, Processing and Analysis, The SciDB Development team, SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA
    • Shen-Shyang Ho, Wenqing Tang, W. Timothy Liu, Markus Schneider. A Framework for Moving Sensor Data Query and Retrieval of Dynamic Atmospheric Events, SSDBM 2010
    • Arnab Bhattacharya, Abhishek Bhowmick, Ambuj K. Singh. Finding Top-k Similar Pairs of Objects Annotated with Terms from an Ontology, SSDBM 2010
    • David Koop, Emanuele Santos, Bela Bauer, Matthias Troyer, Juliana Freire, Cláudio T. Silva. Bridging Workflow and Data Provenance using Strong Links, SSDBM 2010
    • Anish Das Sarma, Martin Theobald, Jennifer Widom. LIVE: A Lineage-Supported Versioned DBMS, SSDBM 2010
    • M. Stonebraker et al. Requirements for science data bases and scidb. In CIDR Perspectives, 2009.
    • Mohamed Y. Eltabakh, Mourad Ouzzani, Walid G. Aref: bdbms - A Database Management System for Biological Data. CIDR 2007: 196-206
    • J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. CIDR, pages 262–276, 2005.
    • P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated databases. In SIGMOD, pages 539–550, 2006.
    • Book: Scientific Data Management: Challenges, Technology, and Deployment (Chapman & Hall/CRC Computational Science)

    Category 5: Data Integration
    • Hazem Elmeleegy, Ahmed K. Elmagarmid, Jaewoo Lee: Leveraging query logs for schema mapping generation in U-MAP. SIGMOD 2011: 121-132
    • Mohamed Yakout, Ahmed K. Elmagarmid, Hazem Elmeleegy, Mourad Ouzzani, Alan Qi: Behavior Based Record Linkage. PVLDB 3(1): 439-448 (2010)
    • L. Cabibbo. On keys, foreign keys and nullable attributes in relational mapping systems. In EDBT, pages 263–274, 2009.
    • G. Gottlob, R. Pichler, and V. Savenkov. Normalization and optimization of schema mappings. PVLDB, 2(1):1102–1113, 2009.
    • G. Mecca, P. Papotti, and S. Raunich. Core schema mappings. In SIGMOD, 2009.
    • Hazem Elmeleegy, Mourad Ouzzani, Ahmed K. Elmagarmid: Usage-Based Schema Matching. ICDE 2008: 20-29
    • Y. An, A. Borgida, R. J. Miller, and J. Mylopoulos. A semantic approach to discovering schema mapping expressions. In ICDE, 2007.
    • P. Bohannon, E. Elnahrawy, W. Fan, and M. Flaster. Putting context into schema matching. In VLDB, 2006.
    • C. Yu and L. Popa. Semantic adaptation of schema mappings when schemas evolve. In VLDB, 2005.
    • R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In VLDB, 2000.
    • E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4):334–350, 2001.
    • P. Bernstein, A. Halevy, and R. Pottinger. A vision for management of complex models. SIGMOD Record, 29(4):55–63, 2000.
    • Book: Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer D. Widom, Database Systems: The Complete Book

    Category 6: Other Topics (Streaming, Active Databases, Object-Relational Data Model, Semi-Structured XML Data Model, Keyword Search, OLAP)
    • Lei Cao and E. Rundensteiner, High Performance Stream Query Processing With Correlation-Aware Partitioning, The 40th International Conference on Very Large Data Bases (VLDB), PVLDB Volume 7, No. 4, December 2013.
    • Chuan Lei, Elke A. Rundensteiner, and Joshua Guttman, Robust Stream Query Processing, ICDE 2013.
    • Di Wang, E. Rundensteiner, and R. Ellison, Active Complex Event Processing, VLDB, Sept 2011.
    • Caitlin Kuhlman, Yizhou Yan, Lei Cao, and Elke Rundensteiner, Pivot-based Distributed K-Nearest Neighbor Mining, European Conference on Machine Learning, Principles and Practice of Knowledge Discovery (ECML-PKDD) 2017, Research Track, Springer LNCS.
    • Yizhou Yan, Lei Cao, Caitlin Kuhlman and Elke Rundensteiner, Distributed Local Outlier Detection in Big Data, ACM KDD, Research Track, 2017.
    • Hanson, E.N., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J.B. and Vernon, A., Scalable Trigger Processing. in International Conference on Data Engineering (ICDE), (1999), 266-275.
    • J. Widom and S. Ceri, editors. Active Database Systems: Triggers and Rules For Advanced Database Processing. Morgan Kaufmann, 1996.
    • A. Aiken, J. Widom, and J. Hellerstein. Behavior of database production rules: Termination, confluence, and observable determinism. In SIGMOD, pages 59–68, 1992.
    • Schreier, U., Pirahesh, H., Agrawal, R. and Mohan, C., Alert: An Architecture for Transforming a Passive DBMS into an Active DBMS. in International Conference on Very Large Databases (VLDB), (1991), 469-478.
    • U. Dayal. Active database management systems. SIGMOD Rec., 18(3):150–169, 1989.
    • Guoliang Li, Jianhua Feng, Xiaofang Zhou, Jianyong Wang, Providing built-in keyword search capabilities in RDBMS, The VLDB Journal (The International Journal on Very Large Data Bases), v.20 n.1, p.1-19, February 2011
    • Akanksha Baid, Ian Rae, Jiexing Li, AnHai Doan, Jeffrey Naughton, Toward scalable keyword search over relational data, Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010
    • Joel Coffman, Alfred C. Weaver, A framework for evaluating database keyword search strategies, Proceedings of the 19th ACM international conference on Information and knowledge management, October 26-30, 2010, Toronto, ON, Canada
    • Sonia Bergamaschi, Elton Domnori, Francesco Guerra, Raquel Trillo Lado, Yannis Velegrakis, Keyword search over relational databases: a metadata approach, Proceedings of the 2011 international conference on Management of data, June 12-16, 2011, Athens, Greece
    • Lu Qin, Jeffrey Xu Yu, Lijun Chang, Keyword search in databases: the power of RDBMS, SIGMOD 2009
    • S. Chaudhuri, R. Ramakrishnan, and G. Weikum. Integrating db and ir technologies: What is the sound of one hand clapping? In Proc. of CIDR, 2005.
    • DBXplorer: A System for Keyword-Based Search over Relational Databases, Proceedings of the 18th International Conference on Data Engineering, p.5, February 26-March 01, 2002
    • Vagelis Hristidis, Yannis Papakonstantinou, Discover: keyword search in relational databases, Proceedings of the 28th international conference on Very Large Data Bases, p.670-681, August 20-23, 2002, Hong Kong, China
    • Book: Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer D. Widom, Database Systems: The Complete Book