Morning Sessions: Thursday, 11:45 am–12:30 pm
Guannan Qu, Assistant Professor, CMU
Locally Interdependent Multi-Agent MDP
Many multi-agent systems in practice are decentralized and have dynamically varying dependencies, yet there have been few attempts in the literature to analyze such systems theoretically. In this paper, we propose and theoretically analyze a decentralized model with dynamically varying dependencies called the Locally Interdependent Multi-Agent MDP. This model can represent problems in many disparate domains such as cooperative navigation, obstacle avoidance, and formation control. Despite the intractability that general partially observable multi-agent systems suffer from, we propose three closed-form policies that are theoretically near-optimal in this setting and are scalable to compute and store. Consequently, we reveal a fundamental property of Locally Interdependent Multi-Agent MDPs: the partially observable decentralized solution is exponentially close to the fully observable solution with respect to the visibility radius. We then discuss extensions of our closed-form policies to further improve tractability. We conclude with simulations investigating some long-horizon behaviors of our closed-form policies.
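As a schematic reading of the exponential-closeness property (the notation below is chosen for exposition and is not taken from the paper):

```latex
% V^{dec}_{R} denotes the value of the decentralized, partially observable
% policy with visibility radius R, and V^{*} the fully observable optimum.
\[
    V^{*} - V^{\mathrm{dec}}_{R} \;\le\; C \,\xi^{R},
    \qquad \text{for some } C > 0,\ \xi \in (0,1).
\]
```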
Bo Dai, Assistant Professor, Georgia Tech
Representation-based Reinforcement Learning
Reinforcement learning often faces a trade-off between model flexibility and computational tractability. Flexible models can capture complex dynamics and policies but often introduce nonlinearity, making planning and exploration challenging. In this talk, we explore how representation learning can help overcome this dilemma. We present algorithms that extract flexible representations which enable practical and provable planning and exploration. We provide theoretical guarantees for our algorithms in MDP and POMDP settings, and empirical results demonstrating the superiority of our approach on various benchmarks.
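One canonical example of the kind of representation at play, shown here only as background (this is the standard linear/low-rank MDP factorization, not necessarily the exact model used in the talk):

```latex
% If transitions and rewards factor through a learned feature map \phi, then
% action values become linear in \phi, which is what makes planning and
% exploration tractable.
\[
    P(s' \mid s, a) = \langle \phi(s,a), \mu(s') \rangle, \quad
    r(s, a) = \langle \phi(s,a), \theta \rangle
    \;\;\Longrightarrow\;\;
    Q^{\pi}(s, a) = \langle \phi(s,a), w_{\pi} \rangle \ \text{for some } w_{\pi}.
\]
```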
Jia Liu, Assistant Professor, Harvard University
An AI-Cyborg System for Adaptive Intelligent Modulation of Biological Systems
Recent advancements in flexible bioelectronics have enabled continuous, long-term stable interrogation and intervention of biological systems. However, effectively utilizing the interrogated data to modulate biological systems to achieve specific biomedical and biological goals remains a challenge. We introduce an AI-driven bioelectronics system that integrates tissue-like, flexible bioelectronics with cyber learning algorithms to create a long-term, real-time bidirectional bioelectronic interface with optimized adaptive intelligent modulation (BIO-AIM). When integrated with biological systems as an AI-cyborg system, BIO-AIM continuously adapts and optimizes stimulation parameters based on stable cell state mapping, allowing for real-time, closed-loop feedback through tissue-embedded flexible electrode arrays. Applied to human pluripotent stem cell-derived cardiac organoids, BIO-AIM identifies optimized stimulation conditions that accelerate functional maturation. The effectiveness of this approach is validated through enhanced extracellular spike waveforms, increased conduction velocity, and improved sarcomere organization.
Chuchu Fan, Associate Professor, MIT
Discrete GCBF Proximal Policy Optimization for Multi-agent Safe Optimal Control
Control policies that can achieve high task performance and satisfy safety constraints are desirable for any system, including multi-agent systems (MAS). One promising technique for ensuring the safety of MAS is the use of distributed control barrier functions (CBFs). However, it is difficult to design distributed CBF-based policies for MAS that can tackle unknown discrete-time dynamics, partial observability, changing neighborhoods, and input constraints, especially when a distributed high-performance nominal policy that can achieve the task is unavailable. To tackle these challenges, we propose DGPPO, a new framework that simultaneously learns both a discrete graph CBF, which handles neighborhood changes and input constraints, and a distributed high-performance safe policy for MAS with unknown discrete-time dynamics. We empirically validate our claims on a suite of multi-agent tasks spanning three different simulation engines. The results suggest that, compared with existing methods, our DGPPO framework obtains policies that achieve high task performance and high safety rates with a constant set of hyperparameters across all environments.
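For background, a standard discrete-time CBF condition of the kind such frameworks enforce (generic notation; the paper's graph CBF is a learned, distributed variant of this idea):

```latex
% If h(x) >= 0 defines the safe set, the controller is required to choose u_t
% so that
\[
    h(x_{t+1}) - h(x_t) \;\ge\; -\alpha\, h(x_t), \qquad \alpha \in (0, 1],
\]
% which gives h(x_{t+1}) >= (1-\alpha) h(x_t) >= 0 whenever h(x_t) >= 0,
% i.e., the safe set is forward invariant.
```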
Yilun Du, Assistant Professor, Harvard University
Learning Compositional Models of the World
I’ll talk about how learning and planning with generative world models enable us to solve a variety of decision-making tasks even when we do not have explicit data for each task.
Stephanie Gil, Assistant Professor, Harvard University
Reinforcement Learning-Based Framework for Autonomous Whale Rendezvous via In-Situ Sensing
Rendezvous with sperm whales for biological observations is made challenging by their prolonged dive patterns. Our algorithmic framework co-develops multi-agent reinforcement learning-based routing (autonomy module) and Synthetic Aperture Radar-based bearing estimation from Very High Frequency (VHF) signals (sensing module) to maximize rendezvous opportunities of autonomous robots with sperm whales. The sensing module is compatible with low-energy VHF tags commonly used for tracking wildlife. The autonomy module leverages in-situ noisy bearing measurements of whale vocalizations, VHF tags, and whale dive behaviors to enable time-critical rendezvous of a robot team with multiple whales in simulation. We conduct experiments at sea in the native habitat of sperm whales, demonstrating rendezvous with an “engineered whale” and with actual whales. Using bearing measurements to the engineered whale from an acoustic sensor and our sensing module, our autonomy module achieves an aggregate successful rendezvous rate of 81.31% for a 500-meter rendezvous distance using three robots in post-processing. Our most recent work demonstrated real-time rendezvous with whales using an autonomous drone that reached within 200 meters of the whales’ location.
Yingying Li, Assistant Professor, UIUC
New Advances on Set Membership Parameter Estimation: Convergence Rates, Unknown Disturbance Bounds, and Nonlinear Systems
Uncertainty characterization and reduction in system identification/model estimation is essential for safety-critical applications that call for robust decision-making, such as robust optimization, robust reinforcement learning, and robust control. Two major methodologies for uncertainty reduction are confidence regions from concentration inequalities and set membership estimation (SME). SME directly estimates the uncertainty sets from data and usually enjoys superior empirical performance under bounded disturbances. This talk provides theoretical insights into SME’s convergence rates for linear dynamical systems and introduces UCB-SME, a novel algorithm that overcomes a major limitation of classical SME: its lack of convergence when a tight disturbance bound is unknown. UCB-SME adaptively learns the disturbance bound and achieves provable convergence to the true parameters. Further, we extend these results from linear dynamical systems to linearly parameterized nonlinear systems with real-analytic feature functions. Our analysis provides convergence rates not only for SME but also for least squares estimation (LSE), a widely used point estimator, which fills a critical gap in the literature on nonlinear system identification.
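To make set membership estimation concrete, below is a minimal sketch for a scalar linear system $x_{t+1} = a^{*} x_t + u_t + w_t$ with $|w_t| \le W$; the toy system, the interval bookkeeping, and all variable names are illustrative assumptions, and the sketch does not implement UCB-SME’s adaptive learning of an unknown disturbance bound.

```python
import numpy as np

rng = np.random.default_rng(0)

a_star, W, T = 0.7, 0.1, 200           # true parameter, disturbance bound, horizon

# Simulate the scalar system x_{t+1} = a* x_t + u_t + w_t with |w_t| <= W.
x = np.zeros(T + 1)
u = rng.uniform(-1.0, 1.0, size=T)      # exciting inputs
w = rng.uniform(-W, W, size=T)          # bounded disturbances
for t in range(T):
    x[t + 1] = a_star * x[t] + u[t] + w[t]

# Set membership estimation: keep every parameter a consistent with all
# observed transitions, i.e. |x_{t+1} - a x_t - u_t| <= W for all t.
# For a scalar parameter each transition gives an interval, so the membership
# set is an intersection of intervals.
lo, hi = -np.inf, np.inf
for t in range(T):
    if abs(x[t]) < 1e-9:                # uninformative transition, skip
        continue
    bound1 = (x[t + 1] - u[t] - W) / x[t]
    bound2 = (x[t + 1] - u[t] + W) / x[t]
    a_lo, a_hi = min(bound1, bound2), max(bound1, bound2)
    lo, hi = max(lo, a_lo), min(hi, a_hi)

print(f"membership set for a*: [{lo:.4f}, {hi:.4f}]  (truth: {a_star})")
```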
Eugene Vinitsky, Assistant Professor, NYU
Goal reaching at scale for human behavioral modeling
Building models of human behavior generally takes place in a data-poor regime, which makes it challenging to apply data-hungry ML approaches. We investigate how to combine human behavioral priors from supervised learning models with the ability of RL to strictly enforce objectives, creating agents that are both human-like and capable. Using a dataset of human driving behavior and a GPU-accelerated simulator, we show that we can construct agents that score highly on measures of human similarity while also matching the excellent collision-avoidance capabilities of human drivers.
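One standard way to formalize “behavioral prior plus strictly enforced objectives”, stated purely as an assumption about the general recipe rather than the talk’s specific method, is KL-regularized RL against a behavior-cloned prior:

```latex
% \pi_{BC} is a behavior-cloning policy fit to human driving logs, R encodes
% the hard objectives (e.g., goal reaching, collision avoidance), and \lambda
% trades off human-likeness against reward.
\[
    \max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} R(s_t, a_t)\right]
    \;-\; \lambda\, \mathbb{E}_{\pi}\!\left[\sum_{t}
        \mathrm{KL}\!\big(\pi(\cdot \mid s_t)\,\|\,\pi_{\mathrm{BC}}(\cdot \mid s_t)\big)\right].
\]
```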
Chi Jin, Assistant Professor, Princeton University
Beyond Equilibrium Learning
While classical game theory primarily focuses on finding equilibria, modern machine learning applications introduce a series of new challenges where standard equilibrium notions are no longer sufficient, and the development of new efficient algorithmic solutions is urgently needed. In this talk, we will demonstrate two such scenarios: (1) a natural goal in multi-agent learning is to learn rationalizable behavior, which avoids iteratively dominated actions. Unfortunately, such rationalizability is not guaranteed by standard equilibria, especially when approximation errors are present. Our work presents the first line of efficient algorithms for learning rationalizable equilibria with sample complexities that are polynomial in all problem parameters, including the number of players; (2) in multiplayer symmetric constant-sum games like Mahjong or Poker, a natural baseline is to achieve an equal share of the total reward. We demonstrate that the self-play meta-algorithms used by existing state-of-the-art systems can fail to achieve this simple baseline in general symmetric games. We will then discuss the new principled solution concept required to achieve this goal. [I will present a lightning talk only (without a poster).]
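As background on what “iteratively dominated actions” means in the simplest case, here is a sketch of iterated elimination of strictly dominated actions in a two-player normal-form game; it checks domination by pure strategies only and is not the sample-efficient learning algorithm from the talk.

```python
import numpy as np

def iterated_elimination(payoff_a, payoff_b):
    """Iteratively remove actions strictly dominated by another pure action.

    payoff_a[i, j], payoff_b[i, j]: payoffs of the row / column player when the
    row player plays i and the column player plays j. Returns the surviving
    action indices for both players (pure-strategy domination only).
    """
    rows = list(range(payoff_a.shape[0]))
    cols = list(range(payoff_a.shape[1]))
    changed = True
    while changed:
        changed = False
        # Row player: action i is dominated if some i' does strictly better
        # against every surviving column action.
        for i in list(rows):
            if any(all(payoff_a[i2, j] > payoff_a[i, j] for j in cols)
                   for i2 in rows if i2 != i):
                rows.remove(i)
                changed = True
        # Column player: symmetric check over surviving rows.
        for j in list(cols):
            if any(all(payoff_b[i, j2] > payoff_b[i, j] for i in rows)
                   for j2 in cols if j2 != j):
                cols.remove(j)
                changed = True
    return rows, cols

# Example: Prisoner's Dilemma -- "cooperate" (action 0) is strictly dominated
# for both players, so only (defect, defect) survives.
A = np.array([[3, 0], [5, 1]])    # row player's payoffs
B = np.array([[3, 5], [0, 1]])    # column player's payoffs
print(iterated_elimination(A, B))  # ([1], [1])
```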
Qiaomin Xie, Assistant Professor, University of Wisconsin-Madison
Stochastic Approximation: Constant Stepsizes Go a Long Way
Stochastic Approximation (SA) provides a foundational framework for numerous reinforcement learning and machine learning problems. While previous studies have primarily focused on mean-squared error bounds under diminishing stepsize schemes, constant stepsize algorithms have gained practical traction due to their robustness and simplicity. Viewing the iterates of these algorithms as a Markov chain, we study their fine-grained probabilistic behavior. In particular, we establish finite-time geometric convergence of the distribution of the iterates, and relate the ergodicity properties of this Markov chain to the characteristics of the SA algorithm and the data.
Using coupling techniques and adjoint relationships, we characterize the limit distribution and quantify its bias as a function of the stepsize. This probabilistic understanding enables variance reduction through tail-averaging and bias reduction via Richardson-Romberg extrapolation. The combination of constant stepsizes with averaging and extrapolation achieves a favorable trade-off between rapid convergence and low long-term error. Empirical results in statistical inference illustrate the effectiveness of this approach compared to traditional diminishing stepsize schemes. Additionally, we extend our analysis to two-timescale linear SA, broadening the applicability of our findings.
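A minimal sketch of the recipe above: run constant-stepsize stochastic approximation at two stepsizes on Markovian data, tail-average each run, and combine them with Richardson-Romberg extrapolation to cancel the leading-order stepsize bias. The toy problem and all constants below are placeholders, not the experiments from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear SA with Markovian data: solve A_bar * theta = b_bar, where
# (a_t, b_t) are driven by a persistent two-state Markov chain. With a constant
# stepsize, the iterates converge to a limit distribution whose mean is biased;
# tail-averaging reduces variance and Richardson-Romberg extrapolation over two
# stepsizes cancels the leading-order bias term.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])              # persistent chain, uniform stationary dist.
a_vals = np.array([0.5, 1.5])           # A_bar = 1.0
b_vals = np.array([1.5, 0.5])           # b_bar = 1.0  ->  theta* = 1.0

def run_lsa(alpha, n_iters):
    """Constant-stepsize linear SA; returns the tail average (second half)."""
    theta, state, tail = 0.0, 0, []
    for t in range(n_iters):
        state = rng.choice(2, p=P[state])
        theta -= alpha * (a_vals[state] * theta - b_vals[state])
        if t >= n_iters // 2:
            tail.append(theta)
    return float(np.mean(tail))

alpha, n = 0.05, 400_000
avg_a  = run_lsa(alpha, n)              # mean ~ theta* + c*alpha + O(alpha^2)
avg_2a = run_lsa(2 * alpha, n)          # mean ~ theta* + 2*c*alpha + O(alpha^2)
rr = 2.0 * avg_a - avg_2a               # cancels the leading-order bias term
print(f"alpha:   {avg_a:.4f}\n2*alpha: {avg_2a:.4f}\nRR:      {rr:.4f}  (theta* = 1)")
```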
Chris Amato, Associate Professor, Northeastern University
What is the ‘Right’ Way to Perform (Cooperative) Multi-Agent Reinforcement Learning? Does it Matter?
Multi-agent reinforcement learning (MARL) has become a popular topic as reinforcement learning methods have improved and more artificially intelligent ‘agents’ (e.g., robots, autonomous vehicles, or software agents) are used in different domains. Unfortunately, many MARL methods are extended from single-agent, fully observable approaches without a full understanding of the consequences in multi-agent, partially observable settings. I’ll discuss issues with the theory and practice of many of the most common cooperative MARL approaches. These issues involve centralized vs. decentralized critics and the use of state in actor-critic methods, the use of state in value factorization methods, and the use of single observations vs. observation histories in both types of approaches. I’ll discuss the theoretically correct way to resolve these issues and when it matters in practice.
Yang Zheng, Assistant Professor, UC San Diego
Benign Nonconvex Landscapes in Optimal and Robust Control
Direct policy search has achieved great empirical success in reinforcement learning. Many recent studies have revisited its theoretical foundation for continuous control, revealing elegant nonconvex geometry in various benchmark problems.
In this talk/poster, we introduce a new and unified Extended Convex Lifting (ECL) framework to reveal hidden convexity in classical optimal and robust control problems from a modern optimization perspective. Our ECL offers a bridge between nonconvex policy optimization and convex reformulations, enabling convex analysis for nonconvex problems. Despite non-convexity and non-smoothness, the existence of an ECL not only reveals that minimizing the original function is equivalent to a convex problem but also certifies a class of first-order non-degenerate stationary points to be globally optimal. Therefore, no spurious stationarity exists in the set of non-degenerate policies.
This ECL framework can cover many benchmark control problems, including the state-feedback linear quadratic regulator (LQR), dynamic output-feedback linear quadratic Gaussian (LQG) control, and $\mathcal{H}_\infty$ robust control. ECL can also handle a class of distributed control problems when the notion of quadratic invariance (QI) holds. We believe that the new ECL framework may be of independent interest for analyzing nonconvex problems beyond control.
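For context, the canonical instance is state-feedback LQR viewed as policy optimization, stated here in standard notation (the ECL machinery itself is more general):

```latex
% Discrete-time LQR as optimization over the gain K: the cost is nonconvex in
% K, yet it admits a convex reformulation after a change of variables.
\[
    \min_{K \,:\, \rho(A + BK) < 1} \; J(K) = \operatorname{tr}(P_K \Sigma_0),
    \quad \text{where } P_K \text{ solves }
    P_K = (A+BK)^{\!\top} P_K (A+BK) + Q + K^{\!\top} R K .
\]
```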
Zhuoran Yang, Assistant Professor, Yale University
Actions Speak What You Want: Provably Sample-Efficient Reinforcement Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks
We study reinforcement learning (RL) for learning a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure. Specifically, at the outset of the game, the leader announces her policy to the follower and commits to it. The follower observes the leader’s policy and, in turn, adopts a quantal response policy by solving an entropy-regularized policy optimization problem induced by the leader’s policy. The goal of the leader is to find her optimal policy, which yields the optimal expected total return, by interacting with the follower and learning from data. A key challenge of this problem is that the leader cannot observe the follower’s reward and needs to infer the follower’s quantal response model from his actions against the leader’s policies. We propose sample-efficient algorithms for both the online and offline settings, in the context of function approximation. Our algorithms are based on (i) learning the quantal response model via maximum likelihood estimation and (ii) model-free or model-based RL for solving the leader’s decision-making problem, and we show that they achieve sublinear regret upper bounds. Moreover, we quantify the uncertainty of these estimators and leverage the uncertainty to implement optimistic and pessimistic algorithms for the online and offline settings, respectively. Besides, when specialized to the linear and myopic setting, our algorithms are also computationally efficient. Our theoretical analysis features a novel performance-difference lemma which incorporates the error of the quantal response model and might be of independent interest. This is joint work with Siyu Chen and Mengdi Wang.
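For readers unfamiliar with quantal response, one common entropy-regularized form is the softmax response below (illustrative notation, not necessarily the paper’s exact parameterization):

```latex
% Given the leader's announced policy, the follower's entropy-regularized
% problem yields, at each state, a softmax over his (soft) action values
% rather than a hard best response:
\[
    \pi^{\mathrm{F}}(a \mid s) \;\propto\;
    \exp\!\big( \eta \, Q^{\mathrm{F}}(s, a) \big),
\]
% where Q^F is the follower's value under the leader's policy and \eta > 0
% controls how close the response is to a deterministic best response.
```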
Afternoon Session: Thursday 4:45–5:30 pm
Guanya Shi, Assistant Professor, CMU
Adaptive, Structured, and Reliable RL and Control for Real-World Agile Robotics: Humanoids and Beyond
Recent breathtaking advances in AI and robotics have brought us closer to building general-purpose robots in the real world, e.g., humanoids capable of performing a wide range of human tasks in complex environments. Two key challenges in realizing such general-purpose robots are: (1) achieving “breadth” in task/environment diversity, i.e., the generalist aspect, and (2) achieving “depth” in task execution, i.e., the agility aspect.
In this talk, I will present recent works that aim to achieve both generalist-level adaptability and specialist-level agility, demonstrated across various real-world robots, including full-size humanoids, quadrupeds, aerial robots, and ground vehicles. The first part of the talk focuses on learning agile and general-purpose humanoid whole-body control using sim2real reinforcement learning. The second part will discuss the limitations of such end2end sim2real pipelines and how combining learning with control can enhance safety, efficiency, and adaptability.
More details on the presented works are available on the CMU LeCAR Lab website: https://lecar-lab.github.io/
Benjamin Eysenbach, Assistant Professor of Computer Science, Princeton University
Exploration with a Single Goal
In this talk, I present empirical evidence of skills and directed exploration emerging from a simple RL algorithm long before any successful trials are observed. For example, in a manipulation task, the agent is given a single observation of the goal state and learns skills, first for moving its end-effector, then for pushing the block, and finally for picking up and placing the block. These skills emerge before the agent has ever successfully placed the block at the goal location and without the aid of any reward functions, demonstrations, or manually-specified distance metrics.
Heng Yang, Assistant Professor, Harvard University
I will give an overview of two strands of research related to (model) predictive control in the Harvard Computational Robotics Group.
First, when the dynamics model is given as a polynomial (e.g., in state-based robot control), I will challenge the common belief that “nonlinear optimal control is hard” and show that “sparsity-rich” convex relaxations can solve many nonlinear, nonconvex, and even nonsmooth optimal control problems to certifiable global optimality.
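A generic form of the problem class referenced here, written only for context (the “sparsity-rich” relaxations exploit, among other things, the chain structure of the horizon):

```latex
% Polynomial optimal control: all problem data are polynomials, so the problem
% can be relaxed to a hierarchy of convex semidefinite programs
% (moment / sums-of-squares relaxations).
\[
    \min_{u_0, \dots, u_{T-1}} \; \sum_{t=0}^{T-1} \ell(x_t, u_t) + \ell_T(x_T)
    \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t), \;\; g(x_t, u_t) \ge 0,
\]
% with \ell, \ell_T, f, g polynomial.
```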
Second, when the dynamics are unknown or hard to model (e.g., in vision-based robot control), I will show that generative AI tools such as diffusion models enable learning a “world model”, in the form of action-conditioned video prediction, from diverse data. Such a world model empowers a generalist robot foundation policy to plan online “in imagination”.
Thomas Walsh, Senior Research Scientist, Sony AI
Deploying Reinforcement Learning Agents in Gran Turismo
This talk briefly describes milestones that Sony AI researchers and engineers reached on their journey to deploying RL agents in the PlayStation game Gran Turismo. The team first achieved success by outracing world-champion eSports drivers in head-to-head competition and publishing their results in Nature and several follow-up publications. But perhaps an equally challenging task was deploying the agents in the game itself, where today players of all calibers around the world can compete against trained RL agents in a commercial video game.
Kaiqing Zhang, Assistant Professor; University of Maryland, College Park
Towards Principled (Multi-)AI Agents for Large-Scale Autonomy
Recent years have witnessed tremendous successes of learning for decision-making in dynamic environments, and in particular, Reinforcement Learning (RL). Prominent application examples include playing Go and video games, robotics, autonomous driving, and, recently, large language models. Most such success stories naturally involve multiple agents. Hence, there has been surging research interest in advancing Multi-Agent Learning in Dynamic Environments, and particularly multi-agent RL (MARL), the focus of my research. In this talk, I will review some of my previous works on this topic, revealing some unique challenges of multi-agent learning in dynamic environments. Time permitting, I will also mention our recent works on applications to robot fleet learning and to multi-agent learning with large-language-model agents.
Ermin Wei, Associate Professor of Electrical and Computer Engineering, Industrial Engineering and Management Sciences, Northwestern University
Improved Lower Bounds for First-Order Methods under Markov Sampling
Unlike its vanilla counterpart with i.i.d. samples, stochastic optimization with Markovian sampling allows the samples to follow a Markov chain. This problem encompasses applications ranging from asynchronous distributed optimization to reinforcement learning. In this work, we lower bound the sample complexity of finding $\epsilon$-approximate stationary points for any first-order method when sampling is Markovian. We show that for samples drawn from Markov processes with countable state space, any algorithm that accesses smooth, non-convex functions through queries to a stochastic gradient oracle requires at least $\epsilon^{-4}$ samples. Moreover, for finite Markov chains, we show an $\epsilon^{-2}$ lower bound and propose a new algorithm that is proven to be nearly minimax optimal.
Zhong-Ping Jiang, Professor, New York University
Small-disturbance input-to-state stability in learning and control
We present a new notion of small-disturbance input-to-state stability to quantify the effects of disturbance inputs on learning and control algorithms. We show how this notion of small-disturbance input-to-state stability can serve as a basic tool for addressing robustness issues arising in reinforcement learning, policy optimization, dynamic games, and learning-based control.
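For reference, the classical ISS estimate reads as follows; roughly, the small-disturbance variant asks for such an estimate only when the disturbance is sufficiently small in magnitude (this gloss is a simplification, not the talk’s formal definition):

```latex
% Classical input-to-state stability: along trajectories of the perturbed
% system,
\[
    |x(t)| \;\le\; \beta\big(|x(0)|, t\big) + \gamma\big(\|u\|_{\infty}\big),
\]
% with \beta of class KL and \gamma of class K.
```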
Lin Yang, Assistant Professor, UCLA
Deployment Efficient Reward-Free Exploration with Linear Function Approximation
We study deployment-efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) before any reward function is revealed, while minimizing the number of exploration policies used during the algorithm. We design a new reinforcement learning (RL) algorithm whose sample complexity is polynomial in the feature dimension and horizon length, while achieving nearly optimal deployment efficiency for linear MDPs under the reward-free exploration setting. More specifically, our algorithm explores a linear MDP in a reward-free manner, while using at most H exploration policies during its execution, where H is the horizon length. Compared to previous algorithms with similar deployment efficiency guarantees, the sample complexity of our algorithm does not depend on the reachability coefficient or the explorability coefficient of the underlying MDP, which can be arbitrarily small for certain MDPs. Our result addresses an open problem proposed in prior work. To achieve such a result, we show how to truncate state-action pairs of the underlying linear MDP in a data-dependent manner, and devise efficient offline policy evaluation and offline policy optimization algorithms in the truncated linear MDP. We further show how to implement reward-free exploration mechanisms in the linear function approximation setting by carefully combining these offline RL algorithms without sacrificing the deployment efficiency.
Manxi Wu, Assistant Professor, UC Berkeley
Centralized and Decentralized Learning with Strategic Agents
This talk examines two distinct learning settings in strategic multi-agent environments: centralized and decentralized learning, each with its own unique challenges. In the centralized setting, we analyze how an information platform facilitates learning by updating beliefs about an unknown payoff-relevant parameter and influencing agents’ strategy adjustments. We present convergence results of coupled belief-strategy dynamics and characterize the stability properties of fixed points, providing conditions under which learning leads to complete information equilibria. In the decentralized setting, we focus on learning in general-sum Markov games, where agents operate independently using actor-critic dynamics without coordination. We establish guarantees on the asymptotic behavior of decentralized learning algorithms. By developing tailored tools for these settings, we advance our understanding of strategic learning dynamics in diverse environments.
Max Simchowitz, Assistant Professor, CMU
The Pitfalls of Imitation Learning when Actions are Continuous
We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action space control system. We show that, even if the dynamics are stable (i.e. contracting exponentially quickly), and the expert is smooth and deterministic, any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to both behavior cloning and offline-RL algorithms, unless they produce highly “improper” imitator policies — those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity — or unless the expert trajectory distribution is sufficiently “spread”.
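A schematic rendering of the negative result, in notation not taken from the paper:

```latex
% Even with exponentially stable dynamics and a smooth, deterministic expert,
% the rollout suboptimality of any smooth deterministic imitator can be
% exponentially worse (in the horizon H) than its training error,
\[
    J(\pi^{\mathrm{expert}}) - J(\hat{\pi}) \;\gtrsim\; e^{c H} \cdot \varepsilon_{\mathrm{train}},
\]
% where \varepsilon_{train} is the error under the expert's state distribution.
```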
We complement these results with experimental evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today’s popular policy parameterizations in robot learning (e.g. action-chunking and Diffusion-policies). We further contextualize our findings by establishing a host of complementary negative and positive results for imitation in control systems.
Mingyi Hong, Associate Professor, University of Minnesota
Efficient Algorithms for Inverse RL and Application to LLM Finetuning
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model for LLM alignment, as opposed to the standard behavior cloning/supervised fine-tuning (SFT) approach of directly training the LLM as a policy. This approach leads to new SFT algorithms that are not only efficient to implement but also robust to the presence of low-quality supervised learning data. Moreover, we discover a connection between the proposed IRL-based approach and a recent line of work called Self-Play Fine-Tuning. Theoretically, we show that the proposed algorithms converge to stationary solutions of the IRL problem. Empirically, we align 1B and 7B models using the proposed methods and evaluate them on a reward-model benchmark and the HuggingFace Open LLM Leaderboard. The proposed methods show significant performance improvement over existing SFT approaches. Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
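A schematic contrast between the two recipes, in illustrative notation rather than the paper’s exact objectives:

```latex
% Standard SFT / behavior cloning fits the policy directly to demonstrations D:
\[
    \max_{\pi} \;\; \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \log \pi(y \mid x) \big].
\]
% An IRL-style formulation instead learns a reward r jointly with an
% entropy-regularized policy that is optimal for it, e.g.
\[
    \max_{r} \; \Big\{ \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[ r(x, y) \big]
    - \max_{\pi} \Big( \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
    + \beta\, \mathcal{H}(\pi) \Big) \Big\},
\]
% so that the reward model and the policy are produced by one procedure.
```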
Jennie Si, Professor, Arizona State University
Exploring how reinforcement learning can assist in control synthesis
Some learning paradigms have proven both critical and robust, driving significant advancements in AI capabilities. While reinforcement learning (RL) has demonstrated exceptional performance in simulation-intensive applications, its potential to enrich control theory and address real-world control problems has yet to be fully demonstrated and realized. This gap highlights the urgent need for solutions to control problems that are pervasive in everyday scenarios and across nearly all industrial sectors. Drawing inspiration from controlling wearable robots, I will discuss how RL could lead to new and powerful control methods.
Vijay Reddi, Associate Professor, Harvard University
A2Perf: Benchmarking Real-World Autonomous Agents
How do we measure the real-world effectiveness of autonomous agents? Autonomous agents span diverse applications, from robotics to digital assistants, but face common challenges. Beyond solving tasks, agents must adapt to new environments, uphold reliability, and optimize hardware usage. Current methods, like reinforcement learning and imitation learning, address these needs but lack a unified benchmarking suite. This lightning talk introduces A2Perf, a benchmarking suite with realistic environments—spanning computer chip floorplanning to autonomous web navigation—that evaluates several critical metrics beyond the usual: performance, generalization, efficiency, and reliability, providing a foundation for real-world comparisons.
Animesh Garg, Assistant Professor, Georgia Tech
Decision making with world models
Well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization.
We introduce Policy Learning with Multi-Task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task.
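A minimal sketch of the general idea of first-order policy optimization through a learned, differentiable world model; the network sizes, the reward head, and the training loop below are placeholders, not the PWM architecture.

```python
import torch
import torch.nn as nn

# Rollouts happen entirely inside the learned model, and the imagined return is
# backpropagated through time into the policy (a first-order gradient, rather
# than a likelihood-ratio / zeroth-order one).
STATE, ACTION, HORIZON = 8, 2, 16

world_model = nn.Sequential(            # predicts next state from (s, a)
    nn.Linear(STATE + ACTION, 64), nn.ELU(), nn.Linear(64, STATE))
reward_head = nn.Sequential(            # predicts reward from (s, a)
    nn.Linear(STATE + ACTION, 64), nn.ELU(), nn.Linear(64, 1))
policy = nn.Sequential(
    nn.Linear(STATE, 64), nn.ELU(), nn.Linear(64, ACTION), nn.Tanh())

# In the setting described above, the world model and reward head would already
# be pre-trained on offline data and frozen; only the policy is updated here.
for p in list(world_model.parameters()) + list(reward_head.parameters()):
    p.requires_grad_(False)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

for step in range(1000):
    s = torch.randn(256, STATE)          # batch of imagined start states
    total_reward = 0.0
    for t in range(HORIZON):
        a = policy(s)
        sa = torch.cat([s, a], dim=-1)
        total_reward = total_reward + reward_head(sa).squeeze(-1)
        s = world_model(sa)              # differentiable imagined transition
    loss = -total_reward.mean()          # first-order gradient through the rollout
    opt.zero_grad()
    loss.backward()
    opt.step()
```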