CS525. Advanced Topics in Database Systems
Large-Scale Data Management


No single textbook covers the diverse material of this course. The course is based on recent research papers from major database conferences and journals, plus selected chapters from several books. Presentations given by students and the instructor should be self-contained.

Class Presentations
Students will be divided into teams of two. Each team will select one or two papers from the list below to present in class. Each team will therefore give one or two presentations, with each presentation split between the two students.

Reading List
The instructor will provide a list of candidate papers covering the various topics in large-scale data management. Students may select their papers to present from this list or they may suggest other papers of interest to them.

Map-Reduce Platform
  • J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137–150, 2004.
  • S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
  • T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 3rd edition, 2012.

Map-Reduce High-Level Languages
  • K. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Ozcan, and E. Shekita. Jaql: A scripting language for large scale semi-structured data analysis. In PVLDB, volume 4, 2011.
  • E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow., 2(2):1402–1413, 2009.
  • A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414–1425, 2009.
  • A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.
Map-Reduce Workflow Management
  • Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting lipstick on pig: Enabling database-style workflow provenance. PVLDB, pages 346–357, 2011.
  • H. Lim, H. Herodotou, and S. Babu. Stubby: A Transformation-based Optimizer for MapReduce Workflows. PVLDB, 5(11):1196–1207, 2012.
  • K. Morton, M. Balazinska, and D. Grossman. Paratimer: a progress indicator for mapreduce dags. In Proceedings of the 2010 international conference on Management of data, pages 507–518, 2010.
  • K. Morton, A. Friesen, M. Balazinska, and D. Grossman. Estimating the progress of mapreduce pipelines. In ICDE, pages 681–684, 2010.
  • C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V. B. N. Rao, V. Sankarasubramanian, S. Seth, C. Tian, T. ZiCornell, and X. Wang. Nova: continuous pig/hadoop workflows. In SIGMOD Conference, pages 1081–1090, 2011.
  • Oozie. http://incubator.apache.org/oozie/map-reduce-cookbook.html.
Map-Reduce Indexing and Query Optimization
  • D. J. Abadi. Tradeoffs between parallel database systems, hadoop, and hadoopdb as platforms for petabyte-scale analysis. In SSDBM, pages 1–3, 2010.
  • A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. In VLDB, pages 922–933, 2009.
  • F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology, pages 99–110, 2010.
  • S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in mapreduce. In Proceedings of the 2010 international conference on Management of data, pages 975–986, 2010.
  • Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. Haloop: efficient iterative data processing on large clusters. Proc. VLDB Endow., 3(1-2):285–296, 2010.
  • S. Chen. Cheetah: a high performance, custom data warehouse on top of mapreduce. Proc. VLDB Endow., pages 1459–1468, 2010.
  • J. Dittrich, J.-A. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). In VLDB, volume 3, pages 518–529, 2010.
  • J. Dittrich, J.-A. Quiane-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad. Only Aggressive Elephants are Fast Elephants. PVLDB, 5(11):1591–1602, 2012.
  • I. Elghandour and A. Aboulnaga. Restore: reusing results of mapreduce jobs. Proc. VLDB Endow., 5(6):586–597, 2012.
  • H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB, 4(11):1111–1122, 2011.
  • D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of mapreduce: an in-depth study. Proc. VLDB Endow., pages 472–483, 2010.
  • D. Jiang, A. K. H. Tung, and G. Chen. Map-join-reduce: Toward scalable and efficient data analysis on large clusters. IEEE Trans. on Knowl. and Data Eng., pages 1299–1311, 2011.
  • B. Li, E. Mazur, Y. Diao, A. McGregor, and P. Shenoy. A platform for scalable one-pass analytics using mapreduce. In SIGMOD, pages 985–996, 2011.
  • T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. Mrshare: sharing across multiple queries in mapreduce. Proc. VLDB Endow., pages 494–505, 2010.
  • A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165–178, 2009.
  • R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In Proceedings of the 2010 international conference on Management of data, pages 495–506, 2010.
  • H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1029–1040, 2007.
  • M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving mapreduce performance in heterogeneous environments. In Proceedings of the 8th USENIX conference on Operating systems design and implementation, pages 29–42, 2008.
  • J. Jestes, K. Yi, and F. Li. Building wavelet histograms on large data in mapreduce. PVLDB, pages 109–120, 2011.
  • R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac. Adaptive MapReduce using situation-aware mappers. In EDBT, pages 420–431, 2012.
Map-Reduce Physical Layout Optimizations
  • J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative mapreduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 810–818, 2010.
  • A. Jindal, J.-A. Quiane-Ruiz, and J. Dittrich. Trojan data layouts: right shoes for a running elephant. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC), pages 1–14, 2011.
  • M. Y. Eltabakh, Y. Tian, F. Ozcan, R. Gemulla, A. Krettek, and J. McPherson. CoHadoop: Flexible data placement and its exploitation in hadoop. PVLDB, 4(9):575–585, 2011.
  • H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning system for big data analytics. In CIDR, pages 261–272, 2011.
  • Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu. Llama: leveraging columnar storage for scalable join processing in the mapreduce framework. In Proceedings of the 2011 international conference on Management of data, pages 961–972, 2011.
  • Y. Xu, P. Kostamaa, and L. Gao. Integrating hadoop and parallel dbms. In Proceedings of the 2010 international conference on Management of data, pages 969–974, 2010.
  • A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-Oriented Storage Techniques for MapReduce. PVLDB, 4(7):419–429, 2011.
Map-Reduce Statistical, Mining, and Approximation Algorithms
  • B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and mapreduce. PVLDB, pages 454–465, 2012.
  • S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: integrating R and Hadoop. In SIGMOD, pages 987–998, 2010.
  • R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In KDD, pages 69–77, 2011.
  • R. Grover and M. J. Carey. Extending Map-Reduce for Efficient Predicate-Based Sampling. In ICDE, pages 486–497, 2012.
  • N. Laptev, K. Zeng, and C. Zaniolo. Early accurate results for advanced analytics on MapReduce. Proc. VLDB Endow., 5(10):1028–1039, 2012.
  • N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online Aggregation for Large MapReduce Jobs. PVLDB, 4(11):1135–1145, 2011.
  • The Apache Software Foundation. Mahout. http://mahout.apache.org/.
  • Revolution Analytics. RHadoop. https://github.com/RevolutionAnalytics/RHadoop.

Map-Reduce Online Processing and Provenance Management
  • T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. In NSDI, pages 313–328, 2010.
  • D. Crawl, J. Wang, and I. Altintas. Provenance for MapReduce-based data-intensive workflows. In Proceedings of the 6th workshop on Workflows in support of large-scale science (WORKS), pages 21–30, 2011.
  • R. Ikeda, H. Park, and J. Widom. Provenance for generalized map and reduce workflows. In CIDR, pages 273–283, 2011.
  • N. Khoussainova, M. Balazinska, and D. Suciu. PerfXplain: debugging MapReduce job performance. Proc. VLDB Endow., 5(7):598–609, 2012.
  • H. Park, R. Ikeda, and J. Widom. RAMP: A system for capturing and tracing provenance in MapReduce workflows. In VLDB. Stanford InfoLab, August 2011.