CS585/DS503: Big Data Management

Textbook
There is no specific textbook that covers the diverse material of this course. The course will be based on recent research papers from major database conferences and journals, plus selected chapters from several books. Presentations given by students and the instructor should be self-contained.

Candidate Textbooks

1- Database Systems: The Complete Book, Second Edition

Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer D. Widom

ISBN: 9780131873254

URL: webpage

2- Modern Database Management, Tenth Edition

Jeffrey A. Hoffer, V. Ramesh, Heikki Topi

ISBN-13: 978-0-13-608839-4

URL: webpage

3- Principles of Distributed Database Systems, Third Edition

Tamer Ozsu, Patrick Valduriez

ISBN: 978-1-4419-8833-1

URL: webpage

4- Hadoop: The Definitive Guide, Third Edition

Tom White

ISBN: 978-1-4493-1152-0

URL: webpage

5- MongoDB: The Definitive Guide, Third Edition

Kristina Chodorow

ISBN: 978-1-449-38156-1

URL: webpage

Reading List
Below is a candidate reading list from which students can select papers to present in specific research areas. Students should consult the instructor about the papers they would like to present, and are welcome to suggest papers that are not on the list.

Category 1: Large-Scale Data Analytics using Hadoop and Map-Reduce Framework

B. Li, E. Mazur, Y. Diao, A. McGregor, and P. J. Shenoy. A platform for scalable one-pass analytics using mapreduce. In SIGMOD, pages 985–996, 2011.
H. Park, R. Ikeda, and J. Widom. Ramp: A system for capturing and tracing provenance in mapreduce workflows. In VLDB. Stanford InfoLab, August 2011.
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. In NSDI, pages 313–328, 2010.
V. Kumar, H. Andrade, B. Gedik, and K.-L. Wu. Deduce: at the intersection of mapreduce and stream processing. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT ’10, pages 657–662, New York, NY, USA, 2010. ACM.
A. Thusoo, R. Murthy, J. S. Sarma, Z. Shao, N. Jain, P. Chakka, S. Anthony, H. Liu, and N. Zhang. Hive - a petabyte scale data warehousing using hadoop. In ICDE, 2010.
D. J. Abadi. Tradeoffs between parallel database systems, hadoop, and hadoopdb as platforms for petabyte-scale analysis. In SSDBM, pages 1–3, 2010.
S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in mapreduce. In SIGMOD Conference, pages 975–986, 2010.
J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In VLDB, volume 3, pages 518–529, 2010.
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.
A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. In VLDB, pages 922–933, 2009.
E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow., 2(2):1402–1413, 2009.
A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414–1425, 2009.
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD, pages 1099–1110, 2008.
H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: Simplified relational data processing on large clusters. In SIGMOD, pages 1029–1040, 2007.
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137–150, 2004.

Category 2: Cloud and Distributed Computing

Umar Farooq Minhas, Shriram Rajagopalan, Brendan Cully, Ashraf Aboulnaga, Kenneth Salem, and Andrew Warfield. RemusDB: Transparent High Availability for Database Systems. Proceedings of the VLDB Endowment (PVLDB), 2011.
David J. DeWitt, Eric Robinson, Srinath Shankar, Erik Paulson, Jeffrey Naughton, Andrew Krioukov, and Joshua Royalty. Clustera: An Integrated Computation and Data Management System.
Parag Agrawal, Daniel Kifer, and Christopher Olston. Scheduling Shared Scans of Large Data Files. VLDB 2008.
Christopher Olston, Benjamin Reed, Adam Silberstein, and Utkarsh Srivastava. Automatic Optimization of Parallel Dataflow Programs. USENIX Annual Conference 2008.
Lei Chen, Christopher Olston, and Raghu Ramakrishnan. Parallel Evaluation of Composite Aggregate Queries. ICDE 2008.
Matthias Brantner, Daniela Florescu, David A. Graf, Donald Kossmann, and Tim Kraska. Building a Database on S3. SIGMOD 2008.
Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, and Raghu Ramakrishnan. Efficient Bulk Insertion Into a Distributed Ordered Table. SIGMOD 2008.
Eric Robinson and David J. DeWitt. Turning Cluster Management into Data Management; A System Overview. CIDR 2007.
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of Virtualization.
Khuzaima Daudjee and Kenneth Salem. Lazy Database Replication with Snapshot Isolation. VLDB 2006.

Category 3: Parallel and Distributed Databases

Book: Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer D. Widom, Database Systems: The Complete Book
Book: Tamer Ozsu, Patrick Valduriez, Principles of Distributed Database Systems

Category 4: Scientific Data Management

ArrayStore: A Storage Manager for Complex Parallel Array Processing, Emad Soroush, Magdalena Balazinska, and Daniel Wang. SIGMOD'11, June 12-16, 2011, Athens, Greece.
Aida Gandara, George Chin, Paulo Pinheiro Da Silva, Chandrika Sivaramakrishnan, Signe White and Terence Critchlow, Knowledge Annotations in Scientific Workflows: An Implementation in Kepler, SSDBM 2011
Overview of SciDB, Large Scale Array Storage, Processing and Analysis, The SciDB Development team, SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA
Shen-Shyang Ho, Wenqing Tang, W. Timothy Liu, Markus Schneider. A Framework for Moving Sensor Data Query and Retrieval of Dynamic Atmospheric Events, SSDBM 2010
Arnab Bhattacharya, Abhishek Bhowmick, Ambuj K. Singh. Finding Top-k Similar Pairs of Objects Annotated with Terms from an Ontology, SSDBM 2010
David Koop, Emanuele Santos, Bela Bauer, Matthias Troyer, Juliana Freire, Cláudio T. Silva. Bridging Workflow and Data Provenance using Strong Links, SSDBM 2010
Anish Das Sarma, Martin Theobald, Jennifer Widom. LIVE: A Lineage-Supported Versioned DBMS, SSDBM 2010
M. Stonebraker et al. Requirements for science data bases and SciDB. In CIDR Perspectives, 2009.
Mohamed Y. Eltabakh, Mourad Ouzzani, Walid G. Aref: bdbms - A Database Management System for Biological Data. CIDR 2007: 196-206
J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. CIDR, pages 262–276, 2005.
P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated databases. In SIGMOD, pages 539–550, 2006.
Book: Scientific Data Management: Challenges, Technology, and Deployment (Chapman & Hall/CRC Computational Science)

Category 5: Data Integration

Hazem Elmeleegy, Ahmed K. Elmagarmid, Jaewoo Lee: Leveraging query logs for schema mapping generation in U-MAP. SIGMOD 2011: 121-132
Mohamed Yakout, Ahmed K. Elmagarmid, Hazem Elmeleegy, Mourad Ouzzani, Alan Qi: Behavior Based Record Linkage. PVLDB 3(1): 439-448 (2010)
L. Cabibbo. On keys, foreign keys and nullable attributes in relational mapping systems. In EDBT, pages 263–274, 2009.
G. Gottlob, R. Pichler, and V. Savenkov. Normalization and optimization of schema mappings. PVLDB, 2(1):1102–1113, 2009.
G. Mecca, P. Papotti, and S. Raunich. Core schema mappings. In SIGMOD, 2009.
Hazem Elmeleegy, Mourad Ouzzani, Ahmed K. Elmagarmid: Usage-Based Schema Matching. ICDE 2008: 20-29
Y. An, A. Borgida, R. J. Miller, and J. Mylopoulos. A semantic approach to discovering schema mapping expressions. In ICDE, 2007.
P. Bohannon, E. Elnahrawy, W. Fan, and M. Flaster. Putting context into schema matching. In VLDB, 2006.
C. Yu and L. Popa. Semantic adaptation of schema mappings when schemas evolve. In VLDB, 2005.
R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In VLDB, 2000.
E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4):334–350, 2001.
P. Bernstein, A. Halevy, and R. Pottinger. A vision for management of complex models. SIGMOD Record, 29(4):55–63, 2000.
Book: Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer D. Widom, Database Systems: The Complete Book

Category 6: Other Topics (Active Databases, Object-Relational Data Model, Semi-Structured XML Data Model, Keyword Search, OLAP)

Hanson, E.N., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J.B. and Vernon, A., Scalable Trigger Processing. in International Conference on Data Engineering (ICDE), (1999), 266-275.
J. Widom and S. Ceri, editors. Active Database Systems: Triggers and Rules For Advanced Database Processing. Morgan Kaufmann, 1996.
A. Aiken, J. Widom, and J. Hellerstein. Behavior of database production rules: Termination, confluence, and observable determinism. In SIGMOD, pages 59–68, 1992.
Schreier, U., Pirahesh, H., Agrawal, R. and Mohan, C., Alert: An Architecture for Transforming a Passive DBMS into an Active DBMS. in International Conference on Very Large Databases (VLDB), (1991), 469-478.
U. Dayal. Active database management systems. SIGMOD Rec., 18(3):150–169, 1989.
Guoliang Li, Jianhua Feng, Xiaofang Zhou, Jianyong Wang, Providing built-in keyword search capabilities in RDBMS, The VLDB Journal, v.20 n.1, p.1-19, February 2011
Akanksha Baid, Ian Rae, Jiexing Li, AnHai Doan, Jeffrey Naughton, Toward scalable keyword search over relational data, Proceedings of the VLDB Endowment, v.3 n.1-2, September 2010
Joel Coffman, Alfred C. Weaver, A framework for evaluating database keyword search strategies, Proceedings of the 19th ACM international conference on Information and knowledge management, October 26-30, 2010, Toronto, ON, Canada
Sonia Bergamaschi, Elton Domnori, Francesco Guerra, Raquel Trillo Lado, Yannis Velegrakis, Keyword search over relational databases: a metadata approach, Proceedings of the 2011 international conference on Management of data, June 12-16, 2011, Athens, Greece
Lu Qin, Jeffrey Xu Yu, Lijun Chang, Keyword search in databases: the power of RDBMS, SIGMOD 2009
S. Chaudhuri, R. Ramakrishnan, and G. Weikum. Integrating db and ir technologies: What is the sound of one hand clapping? In Proc. of CIDR, 2005.
DBXplorer: A System for Keyword-Based Search over Relational Databases, Proceedings of the 18th International Conference on Data Engineering, p.5, February 26-March 01, 2002
Vagelis Hristidis, Yannis Papakonstantinou, Discover: keyword search in relational databases, Proceedings of the 28th international conference on Very Large Data Bases, p.670-681, August 20-23, 2002, Hong Kong, China
Book: Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer D. Widom, Database Systems: The Complete Book