M149: Database Systems - DSIT
People
- Instructor: Georgia Koutrika
- TAs: Christos Tsapelas, George Katsogiannis-Meimarakis, Mike Xydas
Syllabus
- Introduction to DBMS, Relational Algebra, SQL (Operators, Operations, Nested Queries)
- Storage (Pages, Buffers, Heap Files, Data representation of Tuples in Postgres, Indexes, Column Stores, Memory Networks)
- Execution Algos (Joins, Sorts, Aggregations)
- Query Optimization (Rewriting, Cardinality Estimation, Cost estimation, Join Orders)
- Query Execution (Transactions, ACID Properties, Concurrency control, Locks, Parallel and Distributed Execution, Knob tuning)
- Recovery
- Integration of Deep Learning in DBMS
- Natural Language Data Interfaces (NL-to-SQL, SQL-to-NL, Data-to-Text)
- Data Exploration (Query Recommendations)
- NoSQL DBMSes
Bibliography
- Database System Concepts (Avi Silberschatz, Henry F. Korth, S. Sudarshan, 7th edition, ISBN 9780078022159)
- Mention Memory: incorporating textual knowledge into Transformers through entity mention attention (Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Fei Sha, William Cohen, arXiv:2110.06176, 2021)
- AI Meets Database: AI4DB and DB4AI (Guoliang Li, Xuanhe Zhou, Lei Cao, SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data, June 2021)
- Bao: Making Learned Query Optimization Practical (Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska, SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data)
- Zero-shot cost models for out-of-the-box learned cost prediction (Benjamin Hilprecht, Carsten Binnig, Proceedings of the VLDB Endowment, Volume 15, Issue 11, July 2022)
- Data-to-Text Generation with Content Selection and Planning (Ratish Puduppully, Li Dong, Mirella Lapata, AAAI'19: AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, 27 January - 1 February 2019)
- Text-to-Text Pre-Training for Data-to-Text Tasks (Mihir Kale, Abhinav Rastogi, Proceedings of the 13th International Conference on Natural Language Generation, December 2020, Dublin, Ireland)
- Explaining Natural Language Query Results (Daniel Deutch, Nave Frost, Amir Gilad, The VLDB Journal (2020) 29:485–508)
- Explaining Queries over Web Tables to Non-Experts (Jonathan Berant, Daniel Deutch, Amir Globerson, Tova Milo, Tomer Wolfson, 2019 IEEE 35th International Conference on Data Engineering (ICDE) (2018): 1570-157)
- A Survey on Deep Learning Approaches for Text-to-SQL (George Katsogiannis-Meimarakis, Georgia Koutrika, The VLDB Journal (2023))
- Automating Exploratory Data Analysis via Machine Learning: An Overview (Tova Milo, Amit Somech, SIGMOD Conference 2020: 2617-2622)
- Overview of Data Exploration Techniques (Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri, SIGMOD Conference 2015)
DBTalks (Spring 2021)
Name | Title | Date |
---|---|---|
Sebastian Schelter | Towards Automated Validation and Inspection of Machine Learning Pipelines | April 7th, 18:30 EET |
Sihem Amer-Yahia | Humans in Online Labor Markets | April 14th, 18:30 EET |
Magdalena Balazinska | Video Data Management: From Data Models to Data Storage and Benchmarking | April 21st, 18:30 EET |
Arnab Nandi | Hallucinating Analytics over Real-World Data using Augmented Reality | May 12th, 18:30 EET |
Carsten Binnig | Towards Democratizing Data Science | May 19th, 18:30 EET |
Talk Details
Connection details will be posted.
Title: Towards Automated Validation and Inspection of Machine Learning Pipelines
Abstract: Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. In this lecture, I will introduce some of the practical problems in this area and give an overview of two recent approaches to tackling such issues. Deequ is a library for automating the verification of data quality at scale. It provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables ‘unit tests’ for data. Deequ efficiently executes the resulting constraint validation workload by translating it to aggregation queries on Apache Spark, and also supports the incremental validation of data quality on growing datasets. mlinspect is a library that enables the lightweight lineage-based inspection of ML preprocessing pipelines. The key idea is to extract a directed acyclic graph representation of the dataflow from ML preprocessing pipelines in Python, and to use this representation to automatically instrument the code with predefined inspections based on a lightweight annotation propagation approach. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation.
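The idea of declarative "unit tests" for data that are evaluated as aggregations can be illustrated with a small sketch. This is not Deequ's API and does not use Spark; the names `Check`, `compute_metrics`, and `run_checks` are hypothetical, and the sketch only assumes that each constraint reduces to an aggregate metric on a column plus a predicate on that metric, so that all checks can share one scan of the data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Row = Dict[str, Optional[float]]


@dataclass(frozen=True)
class Check:
    """Hypothetical declarative constraint: an aggregate metric plus a predicate."""
    metric: str                        # "completeness", "min", or "max"
    column: str
    predicate: Callable[[float], bool]


def compute_metrics(rows: Iterable[Row],
                    wanted: Set[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Compute every requested (metric, column) pair in a single pass over the rows."""
    columns = {col for _, col in wanted}
    total = 0
    non_null = {c: 0 for c in columns}
    lo: Dict[str, float] = {}
    hi: Dict[str, float] = {}
    for row in rows:
        total += 1
        for c in columns:
            v = row.get(c)
            if v is None:
                continue  # missing values only affect completeness
            non_null[c] += 1
            lo[c] = v if c not in lo else min(lo[c], v)
            hi[c] = v if c not in hi else max(hi[c], v)
    results: Dict[Tuple[str, str], float] = {}
    for metric, c in wanted:
        if metric == "completeness":
            results[(metric, c)] = non_null[c] / total if total else 1.0
        elif metric == "min":
            results[(metric, c)] = lo.get(c, float("nan"))
        elif metric == "max":
            results[(metric, c)] = hi.get(c, float("nan"))
        else:
            raise ValueError(f"unknown metric: {metric}")
    return results


def run_checks(rows: Iterable[Row], checks: List[Check]) -> Dict[str, bool]:
    """Collect the aggregates all checks need, compute them once, apply predicates."""
    wanted = {(c.metric, c.column) for c in checks}
    metrics = compute_metrics(rows, wanted)
    return {f"{c.metric}({c.column})": c.predicate(metrics[(c.metric, c.column)])
            for c in checks}


# Made-up example data and constraints, for illustration only.
rows = [
    {"age": 31, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 45, "income": -10.0},
]
checks = [
    Check("completeness", "age", lambda v: v >= 0.9),  # at most 10% missing ages
    Check("min", "income", lambda v: v >= 0),          # no negative incomes
    Check("max", "age", lambda v: v <= 120),           # plausible upper bound
]
report = run_checks(rows, checks)
print(report)
```

Separating the declaration of checks from the computation of metrics is what lets many constraints be answered by one aggregation pass; a real system would push those aggregates down to a distributed engine instead of a Python loop.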
Title: Humans in Online Labor Markets
Abstract: Online labor markets are increasingly becoming a destination for work. These marketplaces include freelancing platforms such as Qapa and MisterTemp' in France, and TaskRabbit and Fiverr in the USA. On those platforms, workers can find temporary jobs in the physical world such as moving furniture, or in the form of virtual micro-gigs such as helping with designing a website. I will present the results of a study of fairness on those platforms, and discuss how human factors affect algorithm design. The talk will end with a summary of open questions on the Future of Work.
Bio: Sihem Amer-Yahia is a Silver Medal CNRS Research Director and Deputy Director of the Lab of Informatics of Grenoble. She works on exploratory data analysis and fairness in job marketplaces. Before joining CNRS, she was Principal Scientist at QCRI, Senior Scientist at Yahoo! Research and Member of Technical Staff at AT&T Labs. In 2021, Sihem is PC chair for ICDE, EDBT demos, and Associate Editor for SIGMOD. Sihem currently leads the Diversity & Inclusion initiative for the data management community.
Title: Video Data Management: From Data Models to Data Storage and Benchmarking
Abstract: The proliferation of inexpensive high-quality cameras coupled with recent advances in machine learning and computer vision have enabled new applications on video data. This in turn has renewed interest in video data management systems. In this talk, we explore several challenges related to video data management. We start by discussing data models. How should we expose video data to make it queryable by applications? We look in particular at the case of 360-degree videos. Second, we explore components of video data storage. How can we store videos in a way that makes them efficiently queryable? Finally, we discuss the problem of benchmarking video data management systems.
Bio: Magdalena Balazinska is Professor and Director of the Paul G. Allen School of Computer Science & Engineering at the University of Washington. Magdalena's research interests are in the field of database management systems. Her current research focuses on data management for data science, big data systems, cloud computing, and image and video analytics. Prior to her leadership of the Allen School, Magdalena was the Director of the eScience Institute, the Associate Vice Provost for Data Science, and the Director of the Advanced Data Science PhD Option. She also served as Co-Editor-in-Chief for Volume 13 of the Proceedings of the Very Large Data Bases Endowment (PVLDB) journal and as PC co-chair for the corresponding, prestigious VLDB'20 conference. Magdalena is an ACM Fellow. She holds a Ph.D. from the Massachusetts Institute of Technology (2006). Magdalena received the inaugural VLDB Women in Database Research Award (2016) for her work on scalable distributed data systems. She also received an ACM SIGMOD Test-of-Time Award (2017) for her work on fault-tolerant distributed stream processing and a 10-year most influential paper award (2010) for her earlier work on reengineering software clones.
Title: Hallucinating Analytics over Real-World Data using Augmented Reality
Abstract: In addition to the virtual universe, there is a vast amount of data present in the real world. Given recent advances in computer vision, augmented reality, and cloud services, we are faced with a tremendous opportunity to augment the structured data around end-users with insights. Coinciding with these trends, the number of data-rich end-user activities is also rapidly increasing. Thus, it is useful to investigate the process of data exploration and analysis in augmented and mixed reality settings. In this talk, we describe a data exploration platform that utilizes augmented reality to enable querying over real-world data.
Title: Towards Democratizing Data Science
Abstract: Technology has been the key enabler of the current Big Data movement. Without open-source tools like TensorFlow and Spark, as well as the advent of cheap, abundant computing and storage in the cloud, the trend toward datafication of almost every field in research and industry could never have happened. However, the current Big Data tool set is ill-suited for efficient knowledge discovery by domain experts with only limited IT skills and thus represents a major bottleneck in our data-driven society. In this talk, I will present an overview of my current research efforts to revisit the Big Data stack, from the user interface to the underlying hardware, to make Big Data tools more efficient and easier to use for domain experts and thus enable the democratization of data science.
Spring 2020
Name | Title | Date |
---|---|---|
Ryan Marcus, MIT | Can Machine Learning Improve Query Optimization? | Wednesday May 20, 17:30 – 19:00 |
Stratos Idreos, Harvard University | The Periodic Table of Data Structures | Wednesday May 27, 17:30 – 19:00 |
Kurt Stockinger, ZHAW | Building Natural Language Interfaces for Databases – From Research to Innovation | Wednesday May 27, 19:00 – 20:00 |