
NSF Org: | DMS Division Of Mathematical Sciences |
Recipient: | |
Initial Amendment Date: | December 28, 2022 |
Latest Amendment Date: | August 23, 2024 |
Award Number: | 2239448 |
Award Instrument: | Continuing Grant |
Program Manager: | Yong Zeng yzeng@nsf.gov (703) 292-7299 DMS Division Of Mathematical Sciences, MPS Directorate for Mathematical and Physical Sciences |
Start Date: | June 1, 2023 |
End Date: | May 31, 2028 (Estimated) |
Total Intended Award Amount: | $450,001.00 |
Total Awarded Amount to Date: | $175,410.00 |
Funds Obligated to Date: | FY 2024 = $90,938.00 |
History of Investigator: | |
Recipient Sponsored Research Office: | 1 NASSAU HALL PRINCETON NJ US 08544-2001 (609) 258-3090 |
Sponsor Congressional District: | |
Primary Place of Performance: | 98 Charlton Street PRINCETON NJ US 08544-2001 |
Primary Place of Performance Congressional District: | |
Unique Entity Identifier (UEI): | |
Parent UEI: | |
NSF Program(s): | STATISTICS |
Primary Program Source: | 01002425DB NSF RESEARCH & RELATED ACTIVIT, 01002526DB NSF RESEARCH & RELATED ACTIVIT, 01002627DB NSF RESEARCH & RELATED ACTIVIT, 01002728DB NSF RESEARCH & RELATED ACTIVIT |
Program Reference Code(s): | |
Program Element Code(s): | |
Award Agency Code: | 4900 |
Fund Agency Code: | 4900 |
Assistance Listing Number(s): | 47.049 |
ABSTRACT
As data-driven technologies continue to be adopted and deployed in high-stakes decision-making environments, the need for fast, interpretable algorithms has never been greater. One popular candidate is the decision tree, a hierarchically organized data structure that is increasingly used to build predictive and causal models. This trend is spurred by the appealing connection between decision trees and rule-based decision-making, particularly in clinical, legal, and business contexts: the tree structure mimics the sequential way a human user thinks and reasons, thereby facilitating human-machine interaction. To keep them fast to compute, decision trees are commonly constructed with an algorithm called recursive partitioning, in which the decision nodes of the tree are learned from the data in a greedy, top-down manner. The overarching goal of this project is to develop a precise understanding of the strengths and limitations of decision trees based on recursive partitioning and, in doing so, to gain insight into how to improve their performance in practice. Beyond this impact, high-school, undergraduate, and graduate research assistants will be vertically integrated into the research and will benefit both academically and professionally. Innovative curricula, workshops, and data and methods competitions involving students, academics, and industry professionals will facilitate outreach and encourage participation from a broad audience.
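The greedy, top-down procedure described above can be made concrete with a minimal sketch. This is an illustration, not the project's own code: the function names (`best_axis_split`, `grow_tree`, `predict`), the depth and leaf-size parameters, and the toy data are all assumptions chosen for the example. At each node, every coordinate and every candidate threshold are scanned, the split minimizing the children's total squared error is kept, and the procedure recurses.

```python
import numpy as np

def best_axis_split(X, y):
    """Scan every coordinate and threshold; return the axis-aligned split
    minimizing the total squared error of the two child means."""
    n, d = X.shape
    best = (np.inf, None, None)  # (sse, feature index, threshold)
    for j in range(d):
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], y[order]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue  # no valid threshold between tied values
            left, right = ys[:i], ys[i:]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, (xs[i] + xs[i - 1]) / 2)
    return best

def grow_tree(X, y, depth=0, max_depth=3, min_leaf=5):
    """Greedy top-down recursive partitioning (CART-style regression tree)."""
    if depth == max_depth or len(y) < 2 * min_leaf:
        return float(y.mean())  # leaf: predict the local mean
    sse, j, t = best_axis_split(X, y)
    if j is None:
        return float(y.mean())
    mask = X[:, j] <= t
    if mask.sum() < min_leaf or (~mask).sum() < min_leaf:
        return float(y.mean())
    return (j, t,
            grow_tree(X[mask], y[mask], depth + 1, max_depth, min_leaf),
            grow_tree(X[~mask], y[~mask], depth + 1, max_depth, min_leaf))

def predict(node, x):
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

# Toy regression: a step in the first covariate plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.1 * rng.standard_normal(200)
tree = grow_tree(X, y)
print(predict(tree, np.array([0.5, 0.0])))  # near the right-hand step level
```

Note that each split is chosen myopically, without lookahead; this greediness is exactly what makes the algorithm fast and, as the project investigates, what makes its statistical behavior subtle.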
This proposal aims to provide a comprehensive study of the statistical properties of greedy recursive partitioning algorithms for training decision trees, demonstrated in two fundamental contexts. The first thrust of the project will develop a theoretical framework for the analysis of oblique decision trees, in which, in contrast to conventional axis-aligned splits involving only a single covariate, the split at each decision node occurs at a linear combination of the covariates. While this methodology has garnered significant attention from the computer science and optimization communities since the mid-1980s, the advantages oblique trees offer over their axis-aligned counterparts remain only empirically justified, and explanations for their success are largely heuristic. Filling this long-standing gap between theory and practice, the PI will investigate how oblique regression trees, constructed by recursively minimizing squared error, can adapt to a rich class of regression models consisting of linear combinations of ridge functions. This provides a quantitative baseline against which a statistician can compare decision trees with other, less interpretable methods, such as projection pursuit regression and neural networks, that target similar model forms. Crucially, to address the combinatorial complexity of finding the optimal splitting hyperplane at each decision node, the PI's framework can accommodate many existing computational tools in the literature. A major component of the research derives from connections between recursive partitioning and sequential greedy approximation algorithms for convex optimization problems (e.g., orthogonal greedy algorithms).
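To make the oblique-split idea concrete, here is a hedged sketch of one common heuristic for the combinatorially hard hyperplane search: sample random unit directions w, project the covariates onto each, and take the best one-dimensional squared-error split of the projections. The function names (`best_threshold_sse`, `oblique_split`), the number of sampled directions, and the single-ridge-function toy data are illustrative assumptions, not the PI's actual framework.

```python
import numpy as np

def best_threshold_sse(z, y):
    """Best squared-error split of responses y along 1-D projections z."""
    order = np.argsort(z)
    zs, ys = z[order], y[order]
    best_sse, best_t = np.inf, None
    for i in range(1, len(zs)):
        if zs[i] == zs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_t = sse, (zs[i] + zs[i - 1]) / 2
    return best_sse, best_t

def oblique_split(X, y, n_directions=200, seed=0):
    """Approximate the search over hyperplane splits w.x <= t by sampling
    random unit directions w (a simple stand-in for the exact combinatorial
    search, which is intractable in general)."""
    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)  # (sse, direction, threshold)
    for _ in range(n_directions):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)
        sse, t = best_threshold_sse(X @ w, y)
        if t is not None and sse < best[0]:
            best = (sse, w, t)
    return best

# Toy data from a single ridge function y = g(w*.x): an oblique split can
# align with the ridge direction w*, while a single axis-aligned split cannot.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
w_star = np.array([1.0, 1.0]) / np.sqrt(2)
y = np.sign(X @ w_star) + 0.05 * rng.standard_normal(300)
sse, w, t = oblique_split(X, y)
print(w, t)  # best sampled direction should align with +/- w_star, t near 0
```

The toy response is the simplest ridge function, a step along a diagonal direction; this is the kind of structure linear-combination splits can adapt to, per the thrust described above.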
The second thrust focuses on the delicate pointwise properties of axis-aligned recursive partitioning, with implications for heterogeneous causal effect estimation, where accurate pointwise estimates over the entire support of the covariates are essential for valid inference (e.g., testing hypotheses and constructing confidence intervals). Motivated by simple settings in which decision trees provably fail to achieve optimal performance, the PI will investigate how the signal-to-noise ratio affects the quality of pointwise estimation. While the focus is on causal effect estimation directly using decision trees, the PI will also investigate implications for multi-step semi-parametric settings, where preliminary unknown functions (e.g., propensity scores) are estimated with machine learning tools, as well as for conditional quantile regression, both of which require estimators with high pointwise accuracy.
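A stylized illustration of why pointwise accuracy matters for heterogeneous effects: fit the outcome separately in the treated and control arms and take the difference at a query point. This is not the PI's method; a fixed dyadic partition (`piecewise_mean` over equal-width bins) stands in for a fitted tree, and the randomized-treatment simulation, bin count, and effect function are all assumptions chosen for the example.

```python
import numpy as np

def piecewise_mean(x, y, edges):
    """Piecewise-constant fit on a fixed partition (a stand-in for a tree):
    mean response within each bin defined by `edges`."""
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
    means = np.full(len(edges) - 1, np.nan)
    for b in range(len(edges) - 1):
        in_bin = idx == b
        if in_bin.any():
            means[b] = y[in_bin].mean()
    return means

def predict_pw(means, edges, x):
    b = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
    return means[b]

# Randomized experiment with a heterogeneous treatment effect tau(x).
rng = np.random.default_rng(2)
n = 4000
x = rng.uniform(0, 1, n)
d = rng.integers(0, 2, n)                 # randomized treatment indicator
tau = np.where(x > 0.5, 1.0, 0.0)         # effect only for x > 0.5
y = x + d * tau + 0.1 * rng.standard_normal(n)

edges = np.linspace(0, 1, 17)             # 16 equal-width bins
mu1 = piecewise_mean(x[d == 1], y[d == 1], edges)   # treated-arm fit
mu0 = piecewise_mean(x[d == 0], y[d == 0], edges)   # control-arm fit
cate = lambda q: predict_pw(mu1, edges, q) - predict_pw(mu0, edges, q)
print(cate(0.8), cate(0.2))   # pointwise effect estimates at two covariate values
```

The difference of fits must be accurate at every query point, not merely on average; a confidence interval for the effect at a specific x inherits the pointwise error of the partition-based fit, which is the sensitivity this thrust studies.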
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.