Award Abstract # 2239448
CAREER: Statistical Learning with Recursive Partitioning: Algorithms, Accuracy, and Applications

NSF Org: DMS
Division Of Mathematical Sciences
Recipient: THE TRUSTEES OF PRINCETON UNIVERSITY
Initial Amendment Date: December 28, 2022
Latest Amendment Date: August 23, 2024
Award Number: 2239448
Award Instrument: Continuing Grant
Program Manager: Yong Zeng
yzeng@nsf.gov
 (703)292-7299
DMS
 Division Of Mathematical Sciences
MPS
 Directorate for Mathematical and Physical Sciences
Start Date: June 1, 2023
End Date: May 31, 2028 (Estimated)
Total Intended Award Amount: $450,001.00
Total Awarded Amount to Date: $175,410.00
Funds Obligated to Date: FY 2023 = $84,472.00
FY 2024 = $90,938.00
History of Investigator:
  • Jason Klusowski (Principal Investigator)
Recipient Sponsored Research Office: Princeton University
1 NASSAU HALL
PRINCETON
NJ  US  08544-2001
(609)258-3090
Sponsor Congressional District: 12
Primary Place of Performance: Princeton University
98 Charlton Street
PRINCETON
NJ  US  08544-2001
Primary Place of Performance Congressional District: 12
Unique Entity Identifier (UEI): NJ1YPQXQG7U5
Parent UEI:
NSF Program(s): STATISTICS
Primary Program Source: 01002324DB NSF RESEARCH & RELATED ACTIVIT
01002425DB NSF RESEARCH & RELATED ACTIVIT
01002526DB NSF RESEARCH & RELATED ACTIVIT
01002627DB NSF RESEARCH & RELATED ACTIVIT
01002728DB NSF RESEARCH & RELATED ACTIVIT
Program Reference Code(s): 1045
Program Element Code(s): 126900
Award Agency Code: 4900
Fund Agency Code: 4900
Assistance Listing Number(s): 47.049

ABSTRACT

As data-driven technologies continue to be adopted and deployed in high-stakes decision-making environments, the need for fast, interpretable algorithms has never been greater. Decision trees, a hierarchically organized data structure, are one such candidate, and it has become increasingly common to use them to build predictive or causal models. This trend is spurred by the appealing connection between decision trees and rule-based decision-making, particularly in clinical, legal, or business contexts, as the tree structure mimics the sequential way a human user may think and reason, thereby facilitating human-machine interaction. To make them fast to compute, decision trees are popularly constructed with an algorithm called recursive partitioning, in which the decision nodes of the tree are learned from the data in a greedy, top-down manner. The overarching goal of this project is to develop a precise understanding of the strengths and limitations of decision trees based on recursive partitioning and, in doing so, gain insights on how to improve their performance in practice. In addition to this impact, high-school, undergraduate, and graduate research assistants will be vertically integrated and will benefit both academically and professionally. Innovative curricula, workshops, and data and methods competitions involving students, academics, and industry professionals will facilitate outreach and encourage participation from a broad audience.
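To make "greedy, top-down" concrete, the following is a minimal CART-style sketch of recursive partitioning for regression in Python. It is an illustration of the general technique, not the award's specific algorithm: each step scans all axis-aligned splits, keeps the one minimizing the total squared error, and recurses on the two resulting cells.

```python
import numpy as np

def best_axis_aligned_split(X, y):
    """Greedy step: scan every covariate j and threshold t, returning the
    (feature, threshold) pair whose one-level split {x_j <= t} minimizes
    the total squared error of the two resulting cells."""
    n, d = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, squared error)
    for j in range(d):
        # Candidate thresholds: all observed values except the largest,
        # so both sides of the split are guaranteed nonempty.
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, t, sse)
    return best

def grow_tree(X, y, depth, min_leaf=5):
    """Recursively partition the data top-down; each leaf predicts the
    mean response of the training points that fall into it."""
    if depth == 0 or len(y) < 2 * min_leaf:
        return {"leaf": y.mean()}
    j, t, _ = best_axis_aligned_split(X, y)
    if j is None:  # no valid split (all covariate values identical)
        return {"leaf": y.mean()}
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": grow_tree(X[mask], y[mask], depth - 1, min_leaf),
            "right": grow_tree(X[~mask], y[~mask], depth - 1, min_leaf)}
```

The greediness is the key point for the analysis described below: each split is chosen myopically, without lookahead, which is what makes the trees fast to compute but also what makes their statistical properties delicate.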

This proposal aims to provide a comprehensive study of the statistical properties of greedy recursive partitioning algorithms for training decision trees, demonstrated in two fundamental contexts. The first thrust of the project will develop a theoretical framework for the analysis of oblique decision trees, where, in contrast to conventional axis-aligned splits involving only a single covariate, the splits at each decision node occur at linear combinations of the covariates. While this methodology has garnered significant attention from the computer science and optimization communities since the mid-80s, the advantages oblique trees offer over their axis-aligned counterparts remain only empirically justified, and explanations for their success are largely based on heuristics. Filling this long-standing gap between theory and practice, the PI will investigate how oblique regression trees, constructed by recursively minimizing squared error, can adapt to a rich class of regression models consisting of linear combinations of ridge functions. This provides a quantitative baseline for a statistician to compare and contrast decision trees with other, less interpretable methods, such as projection pursuit regression and neural networks, that target similar model forms. Crucially, to address the combinatorial complexity of finding the optimal splitting hyperplane at each decision node, the PI's framework can accommodate many existing computational tools in the literature. A major component of the research is derived from connections between recursive partitioning and sequential greedy approximation algorithms for convex optimization problems (e.g., orthogonal greedy algorithms).
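An oblique split partitions the data at a hyperplane {x : w·x <= t} rather than at a single covariate. Because searching all hyperplanes is combinatorially hard, practical implementations pick the direction w heuristically. The sketch below uses one common heuristic, a least-squares fit, purely for illustration; it is one of many computational tools in the literature, not the method proposed in this award.

```python
import numpy as np

def oblique_split(X, y):
    """Illustrative oblique split: choose a direction w from an ordinary
    least-squares fit of y on (centered) X, project the data onto w, and
    then search one-dimensionally for the threshold t minimizing the
    squared error of the split {x : w.x <= t}."""
    Xc = X - X.mean(axis=0)
    w, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    z = X @ w  # scalar projections; splitting on z is splitting on w.x
    best_t, best_sse = None, np.inf
    # Candidate thresholds: observed projections except the largest,
    # so both sides of the split are nonempty.
    for t in np.unique(z)[:-1]:
        left, right = y[z <= t], y[z > t]
        sse = ((left - left.mean()) ** 2).sum() + \
              ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_t, best_sse = t, sse
    return w, best_t, best_sse
```

Once a direction is fixed, the threshold search reduces to the same one-dimensional scan as in the axis-aligned case, which is why frameworks that are agnostic to how w is computed can accommodate many such heuristics.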
The second thrust focuses on the delicate pointwise properties of axis-aligned recursive partitioning, with implications for heterogeneous causal effect estimation, where accurate pointwise estimates over the entire support of the covariates are essential for valid inference (e.g., testing hypotheses and constructing confidence intervals). Motivated by simple settings in which decision trees provably fail to achieve optimal performance, the PI will investigate how the signal-to-noise ratio affects the quality of pointwise estimation. While the focus is on causal effect estimation directly using decision trees, the PI will also investigate implications for multi-step semi-parametric settings, where preliminary unknown functions (e.g., propensity scores) are estimated with machine learning tools, as well as for conditional quantile regression, both of which require estimators with high pointwise accuracy.
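For context on what a tree-based causal estimate looks like pointwise: a commonly used leaf-level estimator (as in causal trees) is the difference in mean outcomes between treated and control units falling in the same leaf, so the estimate at a point x is determined entirely by the leaf containing x. This is a generic sketch assuming a binary treatment indicator, not the award's specific estimator.

```python
import numpy as np

def leaf_treatment_effect(y, treated):
    """Within a single leaf, estimate the treatment effect as the
    difference in mean outcomes between treated (treated == 1) and
    control (treated == 0) units that fall in that leaf. The pointwise
    accuracy of the tree at x is therefore tied to how well the leaf
    containing x isolates units with a similar effect."""
    return y[treated == 1].mean() - y[treated == 0].mean()
```

Because every point in a leaf receives the same estimate, uniform pointwise guarantees over the covariate support hinge on how the partition is chosen, which is precisely the delicate property studied in this thrust.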

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH

Cattaneo, Matias D., Chandak, Rajita, and Klusowski, Jason M. "Convergence rates of oblique regression trees for flexible function libraries." The Annals of Statistics, v.52, 2024. https://doi.org/10.1214/24-AOS2354
Cattaneo, Matias D., Klusowski, Jason M., and Shigida, Boris. "On the implicit bias of Adam." 2024.
