Research – Pragya Sur

Contributions

My research develops the theoretical and methodological underpinnings of high-dimensional problems in statistics and machine learning. During my Ph.D., I showed that classical likelihood-based inference techniques yield highly inaccurate uncertainty measures in moderate to high dimensions. This renders p-values and confidence intervals from standard statistical packages unreliable (see illustrations here). To remedy this, I introduced a modern maximum likelihood theory (with a focus on generalized linear models) that provides valid inference in high dimensions, resolving the issues with classical procedures. Since then, my focus has centered on modern machine learning (ML), learning under distribution shifts, dependent data, and causal inference.

In recent years, my work has introduced an eigenvalue-based framework for high-dimensional inference under structured dependence and potentially heavy-tailed covariates. Additionally, I have developed novel central limit theorems that quantify uncertainty in two-stage estimation, with applications in causal inference. On the ML front, I have established precise high-dimensional theories that capture the prediction behavior of popular ML algorithms and classifiers. I have developed these theories under both the traditional ML paradigm, where training and test data share the same distribution, and the more modern paradigm, where training and test distributions differ. The technical contributions in my work rely on insights from high-dimensional probability, optimization theory, and statistical physics.

Preprints and Publications

Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling (2025+) In Review at Conference on Learning Theory [arXiv] (with Yufan Li).

Generalization Error of Min-Norm Interpolators in Transfer Learning (2024+) In Review at the Annals of Statistics [code][arXiv] (with Yanke Song and Sohom Bhattacharya).

HEDE: Heritability estimation in high dimensions by Ensembling Debiased Estimators (2024+) In Review at The Annals of Applied Statistics [code][arXiv] (with Yanke Song and Xihong Lin).

ROTI-GCV: Generalized Cross-Validation for right-ROTationally Invariant Data (2025) AISTATS (to appear) [code][arXiv] (with Kevin Luo and Yufan Li).

Predictive Inference in Multi-Environment Scenarios (2024) Statistical Science (to appear) [code][arXiv] (with John Duchi, Suyash Gupta and Kuanhao Jiang).

A New Central Limit Theorem for the Augmented IPW Estimator: Variance Inflation, Cross-fit Covariance and Beyond (2024) The Annals of Statistics (to appear) [talk][code][arXiv] (with Kuanhao Jiang, Rajarshi Mukherjee and Subhabrata Sen). Kuanhao Jiang won the 2022 New England Statistical Society’s Student Research Award for this work.

Universality in block dependent linear models with applications to nonparametric regression (2024) IEEE Transactions on Information Theory, Volume 70, Issue 12, December 2024, Pages 8975-9000 [arXiv] [journal] (with Samriddha Lahiry).

Spectrum-Aware Debiasing: A Modern Inference Framework with Applications to Principal Components Regression (2023+) Reject and Resubmit at the Annals of Statistics [talk][code][arXiv] (with Yufan Li).

High-dimensional Asymptotics of Langevin Dynamics in Spiked Matrix Models (2023) Information and Inference: A Journal of the IMA, Volume 12, Issue 4, December 2023, Pages 2720–2752 [arXiv] [journal] (with Tengyuan Liang and Subhabrata Sen).

Multi-study boosting: Theoretical Considerations for Merging vs. Ensembling (2022) [arXiv] (with Cathy Shyr, Giovanni Parmigiani and Prasad Patil).

A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-L1-Norm Interpolated Classifiers. The Annals of Statistics 50.3 (2022): 1669-1695. [talk] [arXiv] [journal] (with Tengyuan Liang).

A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models. Neural Information Processing Systems (NeurIPS) 2022. [NeurIPS version] (with Lijia Zhou, Frederic Koehler, Danica J. Sutherland and Nathan Srebro).

The Asymptotic Distribution of the MLE in High-dimensional Logistic Models: Arbitrary Covariance. Bernoulli, 28.3 (2022): 1835-1861. [arXiv] [journal] [code] (with Qian Zhao and Emmanuel Candès).

Representation via Representations: Domain generalization via Adversarially Learned Invariant Representations (2020) [arXiv] (with Zhun Deng, Frances Ding, Cynthia Dwork, Rachel Hong, Giovanni Parmigiani and Prasad Patil).

Abstracting Fairness: Oracles, Metrics, and Interpretability. Foundations of Responsible Computing (FORC 2020), LIPIcs, Volume 156. [conference] (with Cynthia Dwork, Christina Ilvento and Guy Rothblum).

The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. The Annals of Statistics, 48, no. 1 (2020): 27-42. [arXiv] [journal] (with Emmanuel Candès).

A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences 116.29 (2019): 14516-14525. [Supplement] [talk] [code] [arXiv] [journal] (with Emmanuel Candès).

The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probability Theory and Related Fields 175.1 (2019): 487-558. [Supplement] [talk] [arXiv] [journal] (with Yuxin Chen and Emmanuel Candès).

Modeling bimodal discrete data using Conway-Maxwell-Poisson mixture models. Journal of Business & Economic Statistics 33.3 (2015): 352-365. [arXiv] [journal] (with Galit Shmueli, Smarajit Bose and Paromita Dubey).

Fitting COM-Poisson mixtures to bimodal count data. Proceedings of the 2013 International Conference on Information, Operations Management and Statistics (ICIOMS 2013). Winner of Best Paper Award (with Smarajit Bose, Galit Shmueli and Paromita Dubey).

Ph.D. Thesis

A modern maximum likelihood theory for high-dimensional logistic regression (2019). Recipient of the Theodore W. Anderson Theory of Statistics Dissertation Award.