Award Abstract # 2008107
III: Small: Helping Novices Learn and Debug Relational Queries

NSF Org: IIS
Division of Information & Intelligent Systems
Recipient: DUKE UNIVERSITY
Initial Amendment Date: August 20, 2020
Latest Amendment Date: July 22, 2021
Award Number: 2008107
Award Instrument: Continuing Grant
Program Manager: Hector Munoz-Avila
IIS, Division of Information & Intelligent Systems
CSE, Directorate for Computer and Information Science and Engineering
Start Date: October 1, 2020
End Date: September 30, 2024 (Estimated)
Total Intended Award Amount: $499,972.00
Total Awarded Amount to Date: $499,972.00
Funds Obligated to Date: FY 2020 = $333,711.00
FY 2021 = $166,261.00
History of Investigator:
  • Jun Yang (Principal Investigator)
    junyang@cs.duke.edu
  • Sudeepa Roy (Co-Principal Investigator)
  • Kristin Stephens-Martinez (Co-Principal Investigator)
Recipient Sponsored Research Office: Duke University
2200 W MAIN ST
DURHAM
NC  US  27705-4640
(919)684-3030
Sponsor Congressional District: 04
Primary Place of Performance: Duke University
Durham
NC  US  27708-0129
Primary Place of Performance Congressional District: 04
Unique Entity Identifier (UEI): TP7EK8DZV6N5
Parent UEI:
NSF Program(s): Info Integration & Informatics
Primary Program Source: 01002021DB NSF RESEARCH & RELATED ACTIVIT
01002122DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 7364, 7923
Program Element Code(s): 736400
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.070

ABSTRACT

In a world where decisions are increasingly driven by data, data analytics skills have become an indispensable part of any education that seeks to prepare its students for the modern workforce. Essential in this skill set is the ability to work with structured data. The standard "tools of the trade" for manipulating structured data include the venerable and ubiquitous SQL language as well as popular libraries heavily influenced by relational query languages, e.g., dplyr for R and DataFrames in pandas and Spark. Learning and debugging relational queries, however, pose challenges to novices. Even computer science students with programming backgrounds are often not used to thinking in terms of logic (e.g., when writing SQL queries) or functional programming (e.g., when writing queries using operators that resemble relational algebra). This project proposes to build a system called HNRQ (Helping Novices Learn and Debug Relational Queries) to address these challenges by explaining why a query is wrong and by helping users fix their queries and learn relational querying in the process.
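As a concrete, entirely hypothetical illustration of these two styles (the table, columns, and data below are invented and not part of the project), the sketch computes the same result, the number of courses in which each student scored at least 80, once declaratively in SQL and once as a chain of DataFrame operators in pandas.

```python
# Hypothetical example: same question answered in SQL and in pandas.
import sqlite3

import pandas as pd

takes = pd.DataFrame(
    {"student": ["Alice", "Alice", "Bob"],
     "course": ["CS101", "CS202", "CS101"],
     "grade": [90, 85, 75]}
)

# SQL style: state *what* is wanted; the engine decides how to compute it.
conn = sqlite3.connect(":memory:")
takes.to_sql("takes", conn, index=False)
sql_result = pd.read_sql_query(
    "SELECT student, COUNT(*) AS n FROM takes "
    "WHERE grade >= 80 GROUP BY student",
    conn,
)

# DataFrame style: chain relational-algebra-like operators
# (selection, grouping, aggregation).
df_result = (
    takes[takes.grade >= 80]
    .groupby("student", as_index=False)
    .agg(n=("course", "count"))
)

# Both print a single row: Alice with n = 2.
print(sql_result)
print(df_result)
```

The two snippets produce the same answer, but they demand different habits of thought, which is precisely where novices tend to struggle.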

The first step in the project is to automatically construct small database instances as counterexamples that illustrate why queries return wrong results, and to allow users to trace query execution over these instances. Going beyond convincing users that their queries are wrong, HNRQ further aims to guide users toward the next level of understanding: by helping them generalize from specific counterexamples to semantic descriptions of what causes the wrong results, and by providing useful hints on how to approach the problems correctly. This ambitious goal will push the boundaries of existing research and will likely lead to novel methodologies for providing explanations and hints. The project will make HNRQ general and practical by embracing the full complexity of real-world query languages and by delivering interactive performance, so that users can experiment with changes to queries and database instances, observe their effects, and obtain automated feedback and hints, all in real time, even for complex queries and large databases. The project plans to evaluate HNRQ not only through user studies but also by measuring its direct impact on learning outcomes. The project is committed to making HNRQ open source and easy for educators around the world to adopt.
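To make the counterexample idea concrete, here is a minimal brute-force sketch, not the project's actual algorithm, that searches tiny sub-instances of a hypothetical takes table for one on which a reference query and a student's (incorrect) query return different results; the schema, data, and queries are assumptions made purely for illustration.

```python
# Brute-force sketch of finding a small counterexample instance on which a
# reference query and a student's query disagree. (HNRQ uses far more
# scalable techniques; everything here is hypothetical.)
import itertools
import sqlite3

# A small pool of rows to draw candidate instances from.
CANDIDATE_ROWS = [
    ("Alice", "CS101", 90),
    ("Alice", "CS202", 85),
    ("Bob",   "CS101", 75),
]

# Reference query vs. a student's query that forgot DISTINCT.
REFERENCE = "SELECT DISTINCT student FROM takes WHERE grade >= 80"
STUDENT = "SELECT student FROM takes WHERE grade >= 80"


def results_on(rows, sql):
    """Evaluate sql on a fresh in-memory instance containing exactly `rows`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE takes(student TEXT, course TEXT, grade INT)")
    conn.executemany("INSERT INTO takes VALUES (?, ?, ?)", rows)
    out = sorted(conn.execute(sql).fetchall())
    conn.close()
    return out


# Search sub-instances from smallest to largest; the first hit is a small
# counterexample the student can trace by hand.
found = None
for size in range(1, len(CANDIDATE_ROWS) + 1):
    for rows in itertools.combinations(CANDIDATE_ROWS, size):
        if results_on(rows, REFERENCE) != results_on(rows, STUDENT):
            found = rows
            break
    if found:
        break

if found:
    print("Counterexample instance:", found)
else:
    print("The two queries agree on every sub-instance of the pool.")
```

On this pool the smallest distinguishing instance is the two Alice rows, exactly the kind of small, easy-to-trace counterexample the project aims to surface automatically.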

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH


Gilad, Amir and Miao, Zhengjie and Roy, Sudeepa and Yang, Jun. "Understanding Queries by Conditional Instances." Proceedings of the 2022 International Conference on Management of Data, 2022. https://doi.org/10.1145/3514221.3517898
Hu, Yihao and Gilad, Amir and Stephens-Martinez, Kristin and Roy, Sudeepa and Yang, Jun. "Qr-Hint: Actionable Hints Towards Correcting Wrong SQL Queries." Proceedings of the ACM on Management of Data, v.2, 2024. https://doi.org/10.1145/3654995
Meng, Hanze and Miao, Zhengjie and Gilad, Amir and Roy, Sudeepa and Yang, Jun. "Characterizing and Verifying Queries Via CINSGEN." SIGMOD/PODS '23: International Conference on Management of Data, 2023. https://doi.org/10.1145/3555041.3589721
Miao, Zhengjie and Chen, Tiangang and Bendeck, Alexander and Day, Kevin and Roy, Sudeepa and Yang, Jun. "I-Rex: Interactive Relational Query Explainer for SQL." Proceedings of the VLDB Endowment, v.13, 2020. https://doi.org/10.14778/3415478.3415528
Roy, Sudeepa and Gilad, Amir and Hu, Yihao and Meng, Hanze and Miao, Zhengjie and Stephens-Martinez, Kristin and Yang, Jun. "How Database Theory Helps Teach Relational Queries in Database Education (Invited Talk)." v.290, 2024. https://doi.org/10.4230/LIPICS.ICDT.2024.2
Shen, Fangzhu and Heravi, Kayvon and Gomez, Oscar and Galhotra, Sainyam and Gilad, Amir and Roy, Sudeepa and Salimi, Babak. "Causal What-If and How-To Analysis Using HypeR." 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023. https://doi.org/10.1109/ICDE55515.2023.00293
Wang, Tingyu and Tao, Yuchao and Gilad, Amir and Machanavajjhala, Ashwin and Roy, Sudeepa. "Explaining Differentially Private Query Results with DPXPlain." Proceedings of the VLDB Endowment, v.16, 2023. https://doi.org/10.14778/3611540.3611596
Xiu, Haibo and Agarwal, Pankaj K and Yang, Jun. "PARQO: Penalty-Aware Robust Plan Selection in Query Optimization." Proceedings of the VLDB Endowment, v.17, 2024.
Yang, Jun and Gilad, Amir and Hu, Yihao and Meng, Hanze and Miao, Zhengjie and Roy, Sudeepa and Stephens-Martinez, Kristin. "What Teaching Databases Taught Us about Researching Databases: Extended Talk Abstract." 2024. https://doi.org/10.1145/3663649.3664375

PROJECT OUTCOMES REPORT

Disclaimer

This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.

In a world where decisions are increasingly driven by data, data analytics skills have become an indispensable part of any education seeking to prepare students for the modern workforce. Essential in this skill set is the ability to work with structured data using the venerable and ubiquitous SQL language as well as popular libraries heavily influenced by relational query languages, e.g., dplyr for R and DataFrames in pandas and Spark. Learning and debugging relational queries, however, pose challenges to novices: even those with a programming background are often not used to thinking in terms of relational logic or operators.

This project, named HNRQ (Helping Novices Learn and Debug Relational Queries), has built a suite of powerful software tools for database educators and students alike. In an educational setting, we are often given a reference query defined by the teacher and a potentially incorrect query written by a student. First, if the two queries return different results on some test database, the RATest tool automatically constructs a small instance that illustrates the difference between the queries but is much simpler for the student to understand. Second, CInsGen finds “conditional instances,” abstract instances that illustrate all possible ways to satisfy a complex query or to differentiate two queries. Compared with concrete instances, conditional instances hide unnecessary details and articulate general conditions, making it easier to spot logical differences between queries. Third, Qr-Hint provides actionable hints for fixing a working query so that it becomes semantically equivalent to the reference query. These hints purposefully guide the student through a sequence of steps that incrementally transform the working query until it is correct. Together, these three tools offer help that is specifically tailored to students’ individual mistakes, yet they do so automatically, without revealing the reference query or requiring extensive personal tutoring.

Finally, in settings where no reference query is known, i-Rex is a novel debugger that helps students understand SQL query evaluation and debug SQL queries. It allows students to trace query evaluation and study the lineage among input, output, and intermediate result rows. It has a “pinning” feature that focuses on the relevant parts of an execution to examine, as well as pagination and “teleporting” features that allow the system to reproduce relevant parts of an execution without starting from the beginning, significantly improving the scalability of debugging on massive databases.

This project has also begun to investigate the challenges and opportunities that the rise of Generative AI poses for database education. Specifically, it has produced preliminary results on how to leverage large language models to help students decompose complex queries into simpler steps and describe them, and on how to verify the correctness of automatically generated SQL code.
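As a much simplified illustration of the kind of step-by-step inspection that i-Rex supports, the hypothetical sketch below decomposes a query into named intermediate results and prints each one, so a student can see where rows drop out before the final answer. The schema, data, and query are invented, and this is not i-Rex's implementation, which additionally tracks lineage among input, output, and intermediate rows.

```python
# Hypothetical sketch: inspect a query step by step by materializing its
# intermediate results, rather than looking only at the final output.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE takes(student TEXT, course TEXT, grade INT);
    INSERT INTO takes VALUES
        ('Alice', 'CS101', 90), ('Alice', 'CS202', 85),
        ('Bob',   'CS101', 75), ('Bob',   'CS202', 65);
""")

# The full query asks for students whose average grade exceeds 80; the
# decomposition exposes the grouped intermediate result as its own step.
steps = {
    "step 1: per-student average":
        "SELECT student, AVG(grade) AS avg_grade FROM takes GROUP BY student",
    "step 2: keep only averages above 80":
        "SELECT student FROM (SELECT student, AVG(grade) AS avg_grade "
        "FROM takes GROUP BY student) AS s WHERE avg_grade > 80",
}

# Print each intermediate result so a student can see where Bob's rows
# disappear (his average of 70 fails the filter in step 2).
for name, sql in steps.items():
    print(f"-- {name}")
    for row in conn.execute(sql):
        print("  ", row)
conn.close()
```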

The HNRQ suite of tools has been deployed in undergraduate and graduate database courses at Duke University, benefiting more than 1,800 students during the project period, and will continue to be used in the future. The project has provided research experiences for learners at many levels, including one postdoctoral fellow, four PhD students, six MS students, twelve undergraduate students, and one high school student. Two alumni of the project are now Assistant Professors. In addition to its educational impact, the research carried out under the HNRQ project has deepened the understanding of many fundamental problems in databases, resulting in numerous research papers and system demonstrations at top publication venues, three keynote speeches at international workshops and conferences, and five invited talks at research labs and universities.


Last Modified: 01/15/2025
Modified by: Jun Yang
