
I am an Assistant Professor of Statistics at Harvard University, currently on leave as a Visiting Professor at MIT's Laboratory for Information and Decision Systems. I work on high-dimensional and overparametrized problems arising in statistics, machine learning, and data science. My research has been supported by the NSF CAREER Award, an NSF DMS Award, the Eric and Wendy Schmidt Fund for Strategic Innovation, a William F. Milton Fund Award, and a Dean's Competitive Fund for Promising Scholarship. Among other honors, in '21 I was invited to speak at the National Academies of Sciences, Engineering, and Medicine symposium on Mathematical Challenges for Machine Learning and Artificial Intelligence (AI). In '23, I was named an International Strategy Forum (ISF) Fellow, an 11-month, non-residential fellowship for rising leaders ages 25-35. From '22 to '24, I was invited to lead the Institute of Mathematical Statistics (IMS) New Researchers Group. Currently, I serve as an Associate Editor for Statistical Science and as an invited Guest Co-Editor for its special issue on statistics and AI. Here are links to my CV, Google Scholar, and Math Genealogy.
Previously, I was an Invited Long-Term Participant at the Simons Institute for the Theory of Computing, UC Berkeley, for its Computational Complexity of Statistical Inference program in Fall '21. During '19-'20, I was a postdoc at the Center for Research on Computation and Society, Harvard John A. Paulson School of Engineering and Applied Sciences, hosted by Prof. Cynthia Dwork. I obtained my Ph.D. in Statistics ('19) from Stanford University, where I was honored to receive the Theodore W. Anderson Theory of Statistics Dissertation Award ('19) and the Ric Weiland Graduate Fellowship ('17). My advisor was Prof. Emmanuel Candès. I obtained my B.Stat. ('12) and M.Stat. ('14) from the Indian Statistical Institute, Kolkata.
Openings
I am currently looking for motivated students with a strong theoretical background who are interested in high-dimensional statistics, machine learning, or AI theory. Prospective graduate students should apply here.
Research Highlights and Representative Publications
My research spans two primary areas: high-dimensional statistics and machine learning theory. Below, I highlight my top three contributions in each area, each accompanied by a representative publication.
High-Dimensional Statistics
1. Establishing a Modern Maximum-Likelihood Framework for High-Dimensional Logistic Regression
A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences 116.29 (2019): 14516-14525 [talk][slides][code][arXiv] [journal] (with Emmanuel Candès).
This paper overhauls classical maximum-likelihood theory, demonstrating that textbook likelihood-based inference, which underlies standard statistical software, yields grossly inaccurate conclusions in moderate to high dimensions. Specifically, classical estimators such as the MLE become heavily biased and exhibit far greater variability than textbook theory predicts. We develop a modern maximum-likelihood theory that resolves these issues and provides reliable uncertainty quantification in high dimensions. I received the Theodore W. Anderson Theory of Statistics Dissertation Award for this research.
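As a quick illustration of the phenomenon, here is a minimal simulation sketch (not the paper's code; the sample size, dimension, and signal strength below are arbitrary illustrative choices):

```python
# Minimal sketch: the unpenalized logistic MLE in a moderately high-dimensional
# regime (p/n = 0.2). Classical theory treats the MLE as nearly unbiased here.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 4000, 800, 100              # p/n = 0.2; k non-null coefficients
beta = np.zeros(p)
beta[:k] = np.sqrt(5.0 / k)           # Var(x' beta) = 5 for standard Gaussian x
X = rng.standard_normal((n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

# Plain Newton-Raphson for the logistic MLE (no regularization).
b = np.zeros(p)
for _ in range(25):
    mu = 1.0 / (1.0 + np.exp(-X @ b))
    grad = X.T @ (y - mu)                       # score
    H = (X * (mu * (1 - mu))[:, None]).T @ X    # observed information
    b += np.linalg.solve(H, grad)

# In this regime the fitted non-null coordinates come out systematically
# inflated relative to the truth, contrary to the classical prediction.
print("true non-null coefficient:   ", beta[0])
print("average fitted non-null coef:", b[:k].mean())
```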
2. Developing Uncertainty Quantification Techniques for Two-stage Estimators Agnostic to Sparsity
A New Central Limit Theorem for the Augmented IPW Estimator: Variance Inflation, Cross-fit Covariance and Beyond (2025). The Annals of Statistics, 53(2), pp. 647-675 [talk][code][arXiv] (with Kuanhao Jiang, Rajarshi Mukherjee and Subhabrata Sen).
This paper establishes new central limit theorems for the augmented IPW (AIPW) estimator, a workhorse of causal inference, in high-dimensional settings. Notably, we do not assume sparsity in any nuisance model. As a key technical contribution, we develop novel leave-one-out proof techniques that track errors in nuisance estimation and quantify their impact on uncertainty in treatment effect estimation. These leave-one-out methods offer broad utility for analyzing prevalent two-stage estimators in causal inference. This research earned my student Kuanhao Jiang the New England Statistical Society's Student Research Award.
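For context, the AIPW estimator of the average treatment effect takes the standard doubly robust form below (written in generic notation, with outcome \(Y_i\), binary treatment \(A_i\), covariates \(X_i\), fitted outcome regressions \(\hat m_1, \hat m_0\), and fitted propensity score \(\hat e\)):
\[
\hat\tau_{\mathrm{AIPW}} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left\{\hat m_1(X_i) - \hat m_0(X_i) + \frac{A_i\,\big(Y_i - \hat m_1(X_i)\big)}{\hat e(X_i)} - \frac{(1-A_i)\,\big(Y_i - \hat m_0(X_i)\big)}{1-\hat e(X_i)}\right\}.
\]
Our central limit theorems describe the fluctuations of \(\hat\tau_{\mathrm{AIPW}}\) when the nuisance estimates \(\hat m_0, \hat m_1, \hat e\) are themselves fit in high dimensions, including the cross-fit variant in which the nuisances used for observation \(i\) are trained on data excluding it.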
3. Introducing Spectrum-Aware Inference for Dependent, Heavy-Tailed Data
Spectrum-Aware Debiasing: A Modern Inference Framework with Applications to Principal Components Regression (2023+). Minor revision at The Annals of Statistics [talk][code][arXiv] (with Yufan Li).
This paper introduces a new framework for high-dimensional inference that simultaneously accommodates structured dependence and heavy tails in the covariates. Furthermore, we introduce a novel strategy, with rigorous guarantees, for debiasing principal components regression estimators in high dimensions. To establish this framework, we uncover new properties of R-transforms for a class of random matrices; these properties allow us to exploit the spectrum of sample covariance matrices for assumption-lean high-dimensional inference. This research earned my student Yufan Li the 2025 Dempster Award.
Machine Learning Theory
1. Quantifying the Prediction Performance of Overparametrized Machine Learning Algorithms
A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-L1-Norm Interpolated Classifiers. The Annals of Statistics 50.3 (2022): 1669-1695. [talk] [arXiv] [journal] (with Tengyuan Liang).
This paper provides the first mathematical characterization of the prediction error of minimum-L1-norm interpolators under overparametrization. These interpolators arise as the implicitly regularized limits of diverse machine learning algorithms; understanding their behavior yields new insights into algorithm performance in overparametrized settings. On the technical front, we introduce novel uniform deviation arguments that prove particularly useful for analyzing high-dimensional problems with non-L2 geometries.
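Concretely, for training data \((x_i, y_i)\) with labels \(y_i \in \{\pm 1\}\) that are linearly separable, the interpolator in question can be written (in generic notation) as the minimum-L1-norm classifier achieving margin at least one on every training point:
\[
\hat\theta \;=\; \arg\min_{\theta \in \mathbb{R}^p} \|\theta\|_1 \quad \text{subject to} \quad y_i\, x_i^{\top}\theta \ge 1, \qquad i = 1, \dots, n.
\]
This max-L1-margin solution is the direction that boosting-type procedures with small step sizes approach on separable data, which is what links the two objects in the paper's title.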
2. Addressing Distribution Shift and Data Integration Challenges for Minimum Norm Interpolation
Generalization Error of Min-Norm Interpolators in Transfer Learning (2024+). Under revision at The Annals of Statistics [code][arXiv] (with Yanke Song and Sohom Bhattacharya).
This paper analyzes the behavior of minimum-norm interpolators under distribution shift. Specifically, we develop principled approaches that leverage distribution shifts between training and test data to improve prediction performance at test time. To establish these results, we derive new anisotropic local laws for sums of random matrices drawn from different distributions. These tools apply broadly to other heterogeneous-data problems in high dimensions.
3. Introducing Bregman-optimal Calibration for Modern Machine Learning Algorithms
Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling. NeurIPS (Spotlight), 2025 [arXiv] (with Yufan Li).
Calibration is critical for reliable machine learning: a calibrated algorithm's predicted probabilities match the true empirical frequencies of the labels. This paper develops angular calibration, a strategy that provably calibrates classifiers in overparametrized settings while preserving desirable accuracy properties, such as low Bregman divergence to the true label distribution. Ours is the first calibration approach that simultaneously achieves provable calibration and accuracy guarantees under overparametrization.
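To make the terminology precise (in generic notation, not tied to the paper's exact setup): a probabilistic classifier \(\hat f\) is perfectly calibrated if
\[
\mathbb{P}\big(Y = 1 \,\big|\, \hat f(X) = q\big) \;=\; q \qquad \text{for every value } q \text{ taken by } \hat f,
\]
and Platt scaling, which also appears in the title, recalibrates a trained score \(s(X)\) by fitting a one-dimensional logistic map \(q = \sigma\big(a\, s(X) + b\big)\) on held-out data.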
Tutorial Lecture
I was recently invited to deliver a tutorial lecture on my work in calibration and data integration to a machine learning audience; see slides here.
Contact
Science Center 712
One Oxford Street
Cambridge, MA 02138
pragya at fas dot harvard dot edu