M149: Database Systems - DSIT

People

  • Instructor: Georgia Koutrika
  • TAs: Christos Tsapelas, George Katsogiannis-Meimarakis, Mike Xydas

Syllabus

  • Introduction to DBMS, Relational Algebra, SQL (Operators, Operations, Nested Queries)
  • Storage (Pages, Buffers, Heap Files, Data representation of Tuples in Postgres, Indexes, Column Stores, Memory Networks)
  • Execution Algorithms (Joins, Sorts, Aggregations)
  • Query Optimization (Rewriting, Cardinality Estimation, Cost estimation, Join Orders)
  • Query Execution (Transactions, ACID Properties, Concurrency control, Locks, Parallel and Distributed Execution, Knob tuning)
  • Recovery
  • Integration of Deep Learning in DBMS
  • Natural Language Data Interfaces (NL-to-SQL, SQL-to-NL, Data-to-Text)
  • Data Exploration (Query Recommendations)
  • NoSQL DBMSes

Bibliography

  • Database System Concepts (Avi Silberschatz, Henry F. Korth, S. Sudarshan, 7th edition, ISBN 9780078022159)
  • Mention Memory: incorporating textual knowledge into Transformers through entity mention attention (M. D. Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Fei Sha, W. Cohen, ArXiv abs/2110.06176 (2021))
  • AI Meets Database: AI4DB and DB4AI (Guoliang Li, Xuanhe Zhou, Lei Cao, SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data, June 2021)
  • Bao: Making Learned Query Optimization Practical (Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska, SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data)
  • Zero-shot cost models for out-of-the-box learned cost prediction (Benjamin Hilprecht, Carsten Binnig, Proceedings of the VLDB Endowment, Volume 15, Issue 11, July 2022)
  • Data-to-Text Generation with Content Selection and Planning (Ratish Puduppully, Li Dong, Mirella Lapata, AAAI'19: AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, 27 January – 1 February 2019)
  • Text-to-Text Pre-Training for Data-to-Text Tasks (Mihir Kale, Abhinav Rastogi, Proceedings of the 13th International Conference on Natural Language Generation, December 2020, Dublin, Ireland)
  • Explaining Natural Language Query Results (Daniel Deutch, Nave Frost, Amir Gilad, The VLDB Journal (2020) 29:485–508)
  • Explaining Queries over Web Tables to Non-Experts (Jonathan Berant, Daniel Deutch, Amir Globerson, Tova Milo, Tomer Wolfson, 2019 IEEE 35th International Conference on Data Engineering (ICDE) (2018): 1570-157)
  • A Survey on Deep Learning Approaches for Text-to-SQL (George Katsogiannis, Georgia Koutrika, The VLDB Journal (2023))
  • Automating Exploratory Data Analysis via Machine Learning: An Overview (Tova Milo, Amit Somech, SIGMOD Conference 2020: 2617-2622)
  • Overview of Data Exploration Techniques (Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri, SIGMOD Conference 2015)

DBTalks (Spring 2021)

Name | Title | Date
Sebastian Schelter | Towards Automated Validation and Inspection of Machine Learning Pipelines | April 7th, 18:30 EET
Sihem Amer-Yahia | Humans in Online Labor Markets | April 14th, 18:30 EET
Magdalena Balazinska | Video Data Management: From Data Models to Data Storage and Benchmarking | April 21st, 18:30 EET
Arnab Nandi | Hallucinating Analytics over Real-World Data using Augmented Reality | May 12th, 18:30 EET
Carsten Binnig | Towards Democratizing Data Science | May 19th, 18:30 EET

Talk Details

Connection details will be posted.

Title : Towards Automated Validation and Inspection of Machine Learning Pipelines

Abstract: Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. In this lecture, I will introduce some of the practical problems in this area and give an overview of two recent approaches to tackling such issues. Deequ is a library for automating the verification of data quality at scale. It provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables ‘unit tests’ for data. Deequ efficiently executes the resulting constraint validation workload by translating it to aggregation queries on Apache Spark, and also supports the incremental validation of data quality on growing datasets. mlinspect is a library that enables the lightweight lineage-based inspection of ML preprocessing pipelines. The key idea is to extract a directed acyclic graph representation of the dataflow from ML preprocessing pipelines in Python, and to use this representation to automatically instrument the code with predefined inspections based on a lightweight annotation propagation approach. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation.

Bio : Sebastian Schelter is an Assistant Professor with the University of Amsterdam, conducting research at the intersection of data management and machine learning. He manages the AI for Retail Lab Amsterdam, and has a joint appointment as Research Fellow at Ahold Delhaize, an international retailer based in the Netherlands. His work covers many aspects, such as automating data quality validation, optimizing programs that combine operations from linear and relational algebra or tracking the lineage of machine learning pipelines. In the past, he has been a Faculty Fellow with the Center for Data Science at New York University and a Senior Applied Scientist at Amazon Research, after obtaining his Ph.D. at the database group of TU Berlin with Volker Markl. He is active in open source as an elected member of the Apache Software Foundation, and has extensive experience in building real world systems from his time at Amazon, Twitter, IBM Research, and Zalando.

Title : Humans in Online Labor Markets

Abstract: Online labor markets are increasingly becoming a destination for work. These marketplaces include freelancing platforms such as Qapa and MisterTemp' in France, and TaskRabbit and Fiverr in the USA. On those platforms, workers can find temporary jobs in the physical world such as moving furniture, or in the form of virtual micro-gigs such as helping with designing a website. I will present the results of a study of fairness on those platforms, and discuss how human factors affect algorithm design. The talk will end with a summary of open questions on the Future of Work.

Bio : Sihem Amer-Yahia is a Silver Medal CNRS Research Director and Deputy Director of the Lab of Informatics of Grenoble. She works on exploratory data analysis and fairness in job marketplaces. Before joining CNRS, she was Principal Scientist at QCRI, Senior Scientist at Yahoo! Research and Member of Technical Staff at AT&T Labs. In 2021, Sihem is PC chair for ICDE, EDBT demos, and Associate Editor for SIGMOD. Sihem currently leads the Diversity & Inclusion initiative for the data management community.

Title : Video Data Management: From Data Models to Data Storage and Benchmarking

Abstract: The proliferation of inexpensive high-quality cameras, coupled with recent advances in machine learning and computer vision, has enabled new applications on video data. This in turn has renewed interest in video data management systems. In this talk, we explore several challenges related to video data management. We start by discussing data models. How should we expose video data to make it queryable by applications? We look in particular at the case of 360-degree videos. Second, we explore components of video data storage. How can we store videos in a way that makes them efficiently queryable? Finally, we discuss the problem of benchmarking video data management systems.

Bio : Magdalena Balazinska is Professor and Director of the Paul G. Allen School of Computer Science & Engineering at the University of Washington. Magdalena's research interests are in the field of database management systems. Her current research focuses on data management for data science, big data systems, cloud computing, and image and video analytics. Prior to her leadership of the Allen School, Magdalena was the Director of the eScience Institute, the Associate Vice Provost for Data Science, and the Director of the Advanced Data Science PhD Option. She also served as Co-Editor-in-Chief for Volume 13 of the Proceedings of the Very Large Data Bases Endowment (PVLDB) journal and as PC co-chair for the corresponding VLDB'20 conference. Magdalena is an ACM Fellow. She holds a Ph.D. from the Massachusetts Institute of Technology (2006). Magdalena received the inaugural VLDB Women in Database Research Award (2016) for her work on scalable distributed data systems. She also received an ACM SIGMOD Test-of-Time Award (2017) for her work on fault-tolerant distributed stream processing and a 10-year most influential paper award (2010) for her earlier work on reengineering software clones.

Title : Hallucinating Analytics over Real-World Data using Augmented Reality

Abstract: In addition to the virtual universe, there is a vast amount of data present in the real world. Given recent advances in computer vision, augmented reality, and cloud services, we are faced with a tremendous opportunity to augment the structured data around end-users with insights. Coinciding with these trends, the number of data-rich end-user activities is also rapidly increasing. Thus, it is useful to investigate the process of data exploration and analysis in augmented and mixed reality settings. In this talk, we describe a data exploration platform that utilizes augmented reality to enable querying over real-world data.

Bio : Arnab Nandi is an Associate Professor of Computer Science & Engineering at The Ohio State University. Arnab's work focuses on bridging human interaction and data infrastructure, spanning areas of database systems, human-in-the-loop data analytics, and next-generation query interfaces. At Ohio State, he co-founded the OHI/O Program, which fosters a tech culture through hackathons and informal learning, and The STEAM Factory, an interdisciplinary research and collaboration network. Arnab is a recipient of the US National Science Foundation's CAREER Award, a Google Faculty Research Award, and IEEE's TCDE Early Career Award for his contributions towards user-focused data interaction. Arnab holds a PhD in Computer Science & Engineering from the University of Michigan.

Title : Towards Democratizing Data Science

Abstract: Technology has been the key enabler of the current Big Data movement. Without open-source tools like TensorFlow and Spark, as well as the advent of cheap, abundant computing and storage in the cloud, the trend toward datafication of almost every field in research and industry could never have happened. However, the current Big Data tool set is ill-suited for efficient knowledge discovery by domain experts with only limited IT skills and thus represents a major bottleneck in our data-driven society. In this talk, I will present an overview of my current research efforts to revisit the Big Data stack, from the user interface to the underlying hardware, to make Big Data tools more efficient and easier to use for domain experts and thus enable the democratization of data science.

Bio : Prof. Dr. Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and an Adjunct Associate Professor in the Computer Science department at Brown University. Carsten received his PhD at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP working on in-memory databases. Currently, his research focus is on the design of data management systems for modern hardware as well as modern workloads such as interactive data exploration and machine learning. His work has been recognized with a Google Faculty Award, as well as multiple best paper and best demo awards.

Spring 2020

Name | Title | Date
Ryan Marcus, MIT | Can Machine Learning Improve Query Optimization? | Wednesday May 20, 17:30 – 19:00
Stratos Idreos, Harvard University | The periodic table of data structures | Wednesday May 27, 17:30 – 19:00
Kurt Stockinger, ZHAW | Building Natural Language Interfaces for Databases – From Research to Innovation | Wednesday May 27, 19:00 – 20:00