Benjamin Eysenbach, Assistant Professor of Computer Science, Princeton University
Exploration with a Single Goal
In this talk, I present empirical evidence of skills and directed exploration emerging from a simple RL algorithm long before any successful trials are observed. For example, in a manipulation task, the agent is given a single observation of the goal state and learns skills: first for moving its end-effector, then for pushing the block, and finally for picking up and placing the block. These skills emerge before the agent has ever successfully placed the block at the goal location, and without the aid of any reward functions, demonstrations, or manually specified distance metrics.
Vijay, Associate Professor, and Ikechukwu Uchendu, PhD Student, Harvard University
A2Perf: Real-World Autonomous Agents Benchmark
Autonomous agents and systems cover a number of application areas, from robotics and digital assistants to combinatorial optimization, all sharing common, unresolved research challenges.
It is not sufficient for agents to merely solve a given task; they must generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and on-device deployment, among other requirements.
Several major classes of methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs.
However, there is currently no benchmarking suite that defines the environments, datasets, and metrics which can be used to develop reference implementations and seed leaderboards with baselines, providing a meaningful way for the community to compare progress.
We introduce A2Perf, a benchmarking suite including three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion.
A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, which are all critical to real-world applications.
In addition, we propose a data cost metric to account for the cost incurred in acquiring offline data for imitation learning, reinforcement learning, and hybrid algorithms, which allows us to better compare these approaches.
A2Perf also contains baseline implementations of standard algorithms, enabling apples-to-apples comparisons across methods and facilitating progress in real-world autonomy.
As an open-source and extendable benchmark, A2Perf is designed to remain accessible, documented, up-to-date, and useful to the research community over the long term.
Bo Dai, Assistant Professor, Georgia Tech
Representation-based Reinforcement Learning
Reinforcement learning often faces a trade-off between model flexibility and computational tractability. Flexible models can capture complex dynamics and policies, but the nonlinearity they introduce makes planning and exploration challenging. In this talk, we explore how representation learning can help overcome this dilemma. We present algorithms that extract flexible representations which enable practical and provable planning and exploration. We provide theoretical guarantees for our algorithms in MDP and POMDP settings, and empirical results demonstrating the superiority of our approach on various benchmarks.
Chris Amato, Associate Professor, Northeastern University
What is the ‘Right’ Way to Perform (Cooperative) Multi-Agent Reinforcement Learning? Does it Matter?
Multi-agent reinforcement learning (MARL) has become a popular topic as reinforcement learning methods have improved and more artificially intelligent ‘agents’ (e.g., robots, autonomous vehicles or software agents) are used in different domains. Unfortunately, many MARL methods are extended from single-agent, fully observable approaches without fully understanding the consequences in multi-agent, partially observable settings. I’ll discuss issues with the theory and practice of many of the most common cooperative MARL approaches. These issues involve centralized vs. decentralized critics and the use of state in actor-critic methods, the use of state in value factorization methods, and using single observations vs observation histories in both types of approaches. I’ll discuss the theoretically correct way to solve these issues and when it matters in practice.
Heng Yang, Assistant Professor, Harvard University
Predictive Control: From Global Optimization to Generative World Modeling
I will overview two strands of research related to (model) predictive control in the Harvard Computational Robotics Group.
First, in the case where the dynamics model is given as a polynomial (e.g., in state-based robot control), I will challenge the common belief that “nonlinear optimal control is hard” and show that “sparsity-rich” convex relaxations can solve many nonlinear, nonconvex, and even nonsmooth optimal control problems to certifiable global optimality.
Second, in the case where the dynamics are unknown or hard to model (e.g., in vision-based robot control), I will show that generative AI tools such as diffusion models enable learning a “world model”, in the form of action-conditioned video prediction, from diverse data. Such a world model empowers a generalist robot foundation policy to plan online “in imagination”.
Zhong-Ping Jiang, Professor, New York University and Leilei Cui, Postdoc, MIT
Small-disturbance input-to-state stability in learning and control
We present the new notion of small-disturbance input-to-state stability to quantify the effects of disturbance inputs on the learning and control algorithms. It is shown how this notion of small-disturbance input-to-state stability can serve as a basic tool to address robustness issues arising from reinforcement learning, policy optimization, dynamic games and learning-based control.
Eugene Vinitsky, Assistant Professor, NYU
Blending imitation learning and RL at scale to model human behavior
Building models of human behavior generally operates in a data-poor regime that makes it challenging to apply data-hungry ML approaches. We investigate how to combine human behavioral priors from supervised learning models with the ability of RL to strictly enforce objectives to create agents that are both human-like and capable. Using a dataset of human driving behavior and a GPU-accelerated simulator, we show that we can construct agents that both score highly on measures of human similarity while also having the excellent collision avoidance capabilities of human drivers.
Yingying Li, Assistant Professor, UIUC
New Advances on Set Membership Parameter Estimation: Convergence Rates, Unknown Disturbance Bounds, and Nonlinear Systems
Uncertainty characterization and reduction in system identification/model estimation is essential for safety-critical applications that call for robust decision-making, such as robust optimization, robust reinforcement learning, and robust control. Two major methodologies for uncertainty reduction are confidence regions from concentration inequalities and set membership estimation (SME). SME directly estimates the uncertainty sets from data and usually enjoys superior empirical performance under bounded disturbances. This talk provides theoretical insights into SME’s convergence rates for linear dynamical systems and introduces UCB-SME, a novel algorithm that overcomes a major limitation of classical SME: its lack of convergence when the tight disturbance bound is unknown. Our UCB-SME algorithm adaptively learns the disturbance bound and achieves provable convergence to the true parameters. Further, we extend these results from linear dynamical systems to linearly parameterized nonlinear systems with real-analytic feature functions. Our analysis provides convergence rates not only for SME but also for least squares estimation (LSE), a classical point estimator, filling a critical gap in the literature on nonlinear system identification.
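To make the SME idea concrete, here is a minimal numerical sketch (mine, not the authors' code), assuming a scalar linear system x_{t+1} = a x_t + w_t with a known disturbance bound |w_t| <= w_bar: each transition confines the unknown parameter to an interval, and the membership set is the running intersection of these intervals. The bound-adaptation mechanism of UCB-SME is not reproduced here.

```python
import numpy as np

def sme_interval(xs, w_bar):
    """Set membership estimation for x_{t+1} = a * x_t + w_t with |w_t| <= w_bar.
    Each transition (x_t, x_{t+1}) with x_t != 0 confines a to an interval;
    the membership set is the intersection of all such intervals."""
    lo, hi = -np.inf, np.inf
    for x_t, x_next in zip(xs[:-1], xs[1:]):
        if abs(x_t) < 1e-9:                 # uninformative transition
            continue
        b1 = (x_next - w_bar) / x_t
        b2 = (x_next + w_bar) / x_t
        lo, hi = max(lo, min(b1, b2)), min(hi, max(b1, b2))
    return lo, hi

# Simulate a stable scalar system and watch the uncertainty interval shrink.
rng = np.random.default_rng(0)
a_true, w_bar, T = 0.8, 0.1, 500
x = np.empty(T + 1)
x[0] = 1.0
for t in range(T):
    x[t + 1] = a_true * x[t] + rng.uniform(-w_bar, w_bar)
lo, hi = sme_interval(x, w_bar)
print(f"membership interval for a: [{lo:.4f}, {hi:.4f}]  (true a = {a_true})")
```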
Chuchu Fan, Associate Professor, MIT
Discrete GCBF Proximal Policy Optimization for Multi-agent Safe Optimal Control
Control policies that can achieve high task performance and satisfy safety constraints are desirable for any system, including multi-agent systems (MAS). One promising technique for ensuring the safety of MAS is distributed control barrier functions (CBF). However, it is difficult to design distributed CBF-based policies for MAS that can tackle unknown discrete-time dynamics, partial observability, changing neighborhoods, and input constraints, especially when a distributed high-performance nominal policy that can achieve the task is unavailable. To tackle these challenges, we propose DGPPO, a new framework that simultaneously learns both a discrete graph CBF, which handles neighborhood changes and input constraints, and a distributed high-performance safe policy for MAS with unknown discrete-time dynamics. We empirically validate our claims on a suite of multi-agent tasks spanning three different simulation engines. The results suggest that, compared with existing methods, our DGPPO framework obtains policies that achieve high task performance and high safety rates with a constant set of hyperparameters across all environments.
Jia Liu, Assistant Professor, and Xinhe Zhang, PhD Student, Harvard University
Decoding intrinsic long-term neural dynamics via flexible brain-computer interface
Understanding the intrinsic, long-term neural dynamics in the brain is pivotal for advancing brain-computer interface (BCI) applications and developing AI algorithms that emulate the robustness and efficiency found in biological systems. Prior studies using rigid BCIs revealed drifts in recorded neural activities; however, the origins of these drifts remained elusive since both recording instability and the natural evolution of neural dynamics could contribute to this phenomenon. Our research introduces a flexible BCI capable of chronic, stable tracking of single-neuron activities over extended periods. By leveraging a multi-layered mesh electrode array, we achieved consistent, long-term neuronal recordings while minimizing immune responses and preserving signal integrity. Behavioral studies conducted on mice with flexible BCIs implanted in the visual and motor cortices revealed patterns of neural representational drift. To decode these long-term intrinsic neural dynamics, we developed a multi-timescale dynamical system that captures and models the evolution of neural dynamics over time. This adaptive neural representation drift is then incorporated into the design of an AI lifelong learning algorithm, enabling it to retain prior knowledge while adapting to new tasks without catastrophic forgetting. Our findings lay a strong foundation for designing neural-inspired AI that utilizes the intrinsic long-term neural dynamics, offering a promising framework for robust, efficient and effective AI-based learning systems.
Mingyi Hong, Associate Professor, University of Minnesota
Efficient Algorithms for Inverse RL and Application to LLM Finetuning
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model for LLM alignment, as opposed to the standard behavior cloning/supervised fine-tuning (SFT) approach of directly training the LLM as a policy. This approach leads to new SFT algorithms that are not only efficient to implement, but also robust to the presence of low-quality supervised learning data. Moreover, we discover a connection between the proposed IRL-based approach and a recent line of work called Self-Play Fine-tuning. Theoretically, we show that the proposed algorithms converge to the stationary solutions of the IRL problem. Empirically, we align 1B and 7B models using the proposed methods and evaluate them on a reward model benchmark and the HuggingFace Open LLM Leaderboard. The proposed methods show significant performance improvement over existing SFT approaches. Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
Zhuoran Yang, Assistant Professor, Yale University
Actions Speak What You Want: Provably Sample-Efficient Reinforcement Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks
We study reinforcement learning (RL) for learning a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure. Specifically, at the outset of the game, the leader announces her policy to the follower and commits to it. The follower observes the leader’s policy and, in turn, adopts a quantal response policy by solving an entropy-regularized policy optimization problem induced by the leader’s policy. The goal of the leader is to find her optimal policy, which yields the optimal expected total return, by interacting with the follower and learning from data. A key challenge of this problem is that the leader cannot observe the follower’s reward and needs to infer the follower’s quantal response model from his actions against the leader’s policies. We propose sample-efficient algorithms for both the online and offline settings, in the context of function approximation. Our algorithms are based on (i) learning the quantal response model via maximum likelihood estimation and (ii) model-free or model-based RL for solving the leader’s decision-making problem, and we show that they achieve sublinear regret upper bounds. Moreover, we quantify the uncertainty of these estimators and leverage the uncertainty to implement optimistic and pessimistic algorithms for the online and offline settings, respectively. Besides, when specialized to the linear and myopic setting, our algorithms are also computationally efficient. Our theoretical analysis features a novel performance-difference lemma that incorporates the error of the quantal response model, which might be of independent interest. This is joint work with Siyu Chen and Mengdi Wang.
Yilun Du, Assistant Professor, Harvard University
Learning Compositional Models of the World
I’ll talk about how learning and planning with generative world models enable us to solve a variety of decision-making tasks even when we do not have explicit data for each task.
Lin Yang, Assistant Professor, UCLA
Deployment Efficient Reward-Free Exploration with Linear Function Approximation
We study deployment-efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) before the reward function is revealed, while minimizing the number of exploration policies used during the algorithm. We design a new reinforcement learning (RL) algorithm whose sample complexity is polynomial in the feature dimension and horizon length, while achieving nearly optimal deployment efficiency for linear MDPs under the reward-free exploration setting. More specifically, our algorithm explores a linear MDP in a reward-free manner while using at most H exploration policies during its execution, where H is the horizon length. Compared to previous algorithms with similar deployment efficiency guarantees, the sample complexity of our algorithm does not depend on the reachability coefficient or the explorability coefficient of the underlying MDP, which can be arbitrarily small for certain MDPs. Our result addresses an open problem proposed in prior work. To achieve this result, we show how to truncate state-action pairs of the underlying linear MDP in a data-dependent manner, and devise efficient offline policy evaluation and offline policy optimization algorithms for the truncated linear MDP. We further show how to implement reward-free exploration mechanisms in the linear function approximation setting by carefully combining these offline RL algorithms without sacrificing deployment efficiency.
Guannan Qu, Assistant Professor, CMU
Locally Interdependent Multi-Agent MDP
Many multi-agent systems in practice are decentralized and have dynamically varying dependencies, yet theoretical analyses of such systems are scarce in the literature. In this paper, we propose and theoretically analyze a decentralized model with dynamically varying dependencies called the Locally Interdependent Multi-Agent MDP. This model can represent problems in many disparate domains such as cooperative navigation, obstacle avoidance, and formation control. Despite the intractability that general partially observable multi-agent systems suffer from, we propose three closed-form policies that are theoretically near-optimal in this setting and are scalable to compute and store. Consequently, we reveal a fundamental property of Locally Interdependent Multi-Agent MDPs: the partially observable decentralized solution is exponentially close to the fully observable solution with respect to the visibility radius. We then discuss extensions of our closed-form policies to further improve tractability. We conclude by providing simulations to investigate some long-horizon behaviors of our closed-form policies.
Ermin Wei, Associate Professor of Electrical and Computer Engineering, Industrial Engineering and Management Sciences, Northwestern University
Improved Lower bounds for First-order Methods under Markov Sampling
Unlike its vanilla counterpart with i.i.d. samples, stochastic optimization with Markovian sampling allows the sampling scheme to follow a Markov chain. This problem encompasses applications ranging from asynchronous distributed optimization to reinforcement learning. In this work, we lower bound the sample complexity of finding $\epsilon$-approximate stationary points with any first-order method when sampling is Markovian. We show that for samples drawn from Markov processes with countable state space, any algorithm that accesses smooth, non-convex functions through queries to a stochastic gradient oracle requires at least $\epsilon^{-4}$ samples. Moreover, for finite Markov chains, we show an $\epsilon^{-2}$ lower bound and propose a new algorithm that is proven to be nearly minimax optimal.
Yang Zheng, Assistant Professor, UC San Diego
Benign Nonconvex Landscapes in Optimal and Robust Control
Direct policy search has achieved great empirical success in reinforcement learning. Many recent studies have revisited its theoretical foundations for continuous control, revealing elegant nonconvex geometry in various benchmark problems.
In this talk/poster, we introduce a new and unified Extended Convex Lifting (ECL) framework to reveal hidden convexity in classical optimal and robust control problems from a modern optimization perspective. Our ECL offers a bridge between nonconvex policy optimization and convex reformulations, enabling convex analysis for nonconvex problems. Despite non-convexity and non-smoothness, the existence of an ECL not only reveals that minimizing the original function is equivalent to a convex problem but also certifies a class of first-order non-degenerate stationary points to be globally optimal. Therefore, no spurious stationarity exists in the set of non-degenerate policies.
This ECL framework can cover many benchmark control problems, including state feedback linear quadratic regulator (LQR), dynamic output feedback linear quadratic Gaussian (LQG) control, and $H_\infty$ robust control. ECL can also handle a class of distributed control problems when the notion of quadratic invariance (QI) holds. We believe that the new ECL framework may be of independent interest for analyzing nonconvex problems beyond control.
Guanya Shi, Assistant Professor, CMU
Adaptive, Structured, and Reliable RL and Control for Real-World Agile Robotics: Humanoids and Beyond
Recent breathtaking advances in AI and robotics have brought us closer to building general-purpose robots in the real world, e.g., humanoids capable of performing a wide range of human tasks in complex environments. Two key challenges in realizing such general-purpose robots are: (1) achieving “breadth” in task/environment diversity, i.e., the generalist aspect, and (2) achieving “depth” in task execution, i.e., the agility aspect.
In this talk, I will present recent works that aim to achieve both generalist-level adaptability and specialist-level agility, demonstrated across various real-world robots, including full-size humanoids, quadrupeds, aerial robots, and ground vehicles. The first part of the talk focuses on learning agile and general-purpose humanoid whole-body control using sim2real reinforcement learning. The second part will discuss the limitations of such end2end sim2real pipelines and how combining learning with control can enhance safety, efficiency, and adaptability.
More details on the presented works are available on the CMU LeCAR Lab website: https://lecar-lab.github.io/
Qiaomin Xie, Assistant Professor, University of Wisconsin-Madison
Stochastic Approximation: Constant Stepsizes Go a Long Way
Stochastic Approximation (SA) provides a foundational framework for numerous reinforcement learning and machine learning problems. While previous studies have primarily focused on mean-squared error bounds under diminishing stepsize schemes, constant stepsize algorithms have gained practical traction due to their robustness and simplicity. Viewing the iterates of these algorithms as a Markov chain, we study their fine-grained probabilistic behavior. In particular, we establish finite-time geometric convergence of the iterates distribution, and relate the ergodicity properties of the Markov chain to the characteristics of SA algorithm and data.
Using coupling techniques and adjoint relationships, we characterize the limit distribution and quantify its bias as a function of the stepsize. This probabilistic understanding enables variance reduction through tail-averaging and bias reduction via Richardson-Romberg extrapolation. The combination of constant stepsizes with averaging and extrapolation achieves a favorable trade-off between rapid convergence and low long-term error. Empirical results in statistical inference illustrate the effectiveness of this approach compared to traditional diminishing stepsize schemes. Additionally, we extend our analysis to two-timescale linear SA, broadening the applicability of our findings.
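As a toy illustration of the bias-reduction recipe (the AR(1) data model and all constants below are my own, not the paper's setting), one can run a constant-stepsize linear SA at stepsizes alpha and 2*alpha, tail-average each run, and combine them by Richardson-Romberg extrapolation, 2 x (tail average at alpha) - (tail average at 2*alpha), to cancel the leading O(alpha) bias term.

```python
import numpy as np

def markov_lsa(alpha, T, seed, burn_frac=0.5):
    """Constant-stepsize linear SA, theta <- theta + alpha * (b_t - A_t * theta),
    driven by an AR(1) Markov chain z_t; the target is theta* = E[b]/E[A] = 1.
    Returns the tail-averaged iterate."""
    rng = np.random.default_rng(seed)
    theta, z, acc, count = 0.0, 0.0, 0.0, 0
    t0 = int(burn_frac * T)
    for t in range(T):
        z = 0.9 * z + 0.1 * rng.normal()        # slowly mixing Markovian data
        A_t, b_t = 1.0 + 2.0 * z, 1.0 - 2.0 * z
        theta += alpha * (b_t - A_t * theta)
        if t >= t0:
            acc += theta
            count += 1
    return acc / count

alpha, T = 0.02, 400_000
theta_a  = markov_lsa(alpha, T, seed=0)          # tail average at stepsize alpha
theta_2a = markov_lsa(2 * alpha, T, seed=1)      # tail average at stepsize 2*alpha
theta_rr = 2 * theta_a - theta_2a                # Richardson-Romberg extrapolation
print(theta_a, theta_2a, theta_rr)               # extrapolation removes the leading O(alpha) bias
```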
Kaiqing Zhang, Assistant Professor; University of Maryland, College Park
Towards Principled (Multi-)AI Agents for Large-Scale Autonomy
Recent years have witnessed tremendous successes of learning for decision-making in dynamic environments, and in particular, Reinforcement Learning (RL). Prominent application examples include playing Go and video games, robotics, autonomous driving, and recently large language models. Most such success stories naturally involve multiple agents. Hence, there has been surging research interest in advancing Multi-Agent Learning in Dynamic Environments, and particularly, multi-agent RL (MARL), the focus of my research. In this talk, I will review some of my previous works on this topic, revealing some unique challenges of multi-agent learning in dynamic environments. Time permitting, I will also mention our recent works on applications to robot fleet learning and to multi-agent learning with large-language-model agents.
Max Simchowitz, Assistant Professor, CMU
Simple Policies Cannot Imitate Simple Experts (when the action-space is continuous)
We show that for continuous state-space control problems, such as those encountered in robotics, the deployment error can be exponentially larger than the training error, even if the system dynamics are stable. This is in stark contradistinction to both seminal and recent work on the mathematical foundations of behavior cloning, which finds at worst polynomial error compounding in the problem horizon under circumstances which, we argue, are unrealistic in control systems. We focus on imitation of a “simple” — smooth, stabilizing, deterministic — expert policy in discrete-time, continuous-state control systems. We show that any learning algorithm which returns a similarly “simple” imitator policy suffers from exponentially compounding error. Our negative result rules out both behavior cloning and offline-RL approaches, unless they return highly non-smooth policies or ones with complex stochasticity. We complement this result with evidence of the benefits of “non-simple” policies, explicating the benefits of today’s popular policy parameterizations in robot learning (e.g., Diffusion Policies), as well as a host of other negative and positive results for imitation in control systems.
Manxi Wu, Assistant Professor, UC Berkeley
Decentralized learning in General Markov Games
Markov games provide a powerful framework for modeling strategic multi-agent interactions in dynamic environments. Traditionally, convergence properties of decentralized learning algorithms in these settings have been established only for special cases, such as Markov zero-sum and potential games, which do not fully capture real-world interactions. In this paper, we address this gap by studying the asymptotic properties of learning algorithms in general-sum Markov games. In particular, we focus on a decentralized algorithm where each agent adopts an actor-critic learning dynamic with asynchronous step sizes. This decentralized approach enables agents to operate independently, without requiring knowledge of others’ strategies or payoffs. We introduce the concept of a Markov Near-Potential Function and demonstrate that it serves as an approximate Lyapunov function for the policy updates in the decentralized learning dynamics, which allows us to characterize the convergent set of strategies. We further strengthen our result under specific regularity conditions and with finite Nash equilibria.
Stephanie Gil, Assistant Professor, Harvard University
Reinforcement Learning-Based Framework for Autonomous Whale Rendezvous via In-Situ Sensing
Rendezvous with sperm whales for biological observations is made challenging by their prolonged dive patterns. Our algorithmic framework co-develops multi-agent reinforcement learning–based routing (autonomy module) and Synthetic Aperture Radar-based bearing estimation of Very High Frequency (VHF) signals (sensing module) to maximize rendezvous opportunities of autonomous robots with sperm whales. The sensing module is compatible with low-energy VHF tags commonly used for tracking wildlife. The autonomy module leverages in-situ noisy bearing measurements of whale vocalizations, VHF tags, and whale dive behaviors to enable time-critical rendezvous of a robot team with multiple whales in simulation. We conduct experiments at sea in the native habitat of sperm whales, demonstrating rendezvous with an “engineered whale” and with actual whales. Using bearing measurements to the engineered whale from an acoustic sensor and our sensing module, our autonomy module achieves an aggregate successful rendezvous rate of 81.31% for a 500-meter rendezvous distance using three robots in post-processing. Our most recent work demonstrated real-time rendezvous with whales using an autonomous drone that reached within 200 meters of the whales’ location.
Cathy Wu, Associate Professor, MIT
Model-Based Transfer Learning for Contextual Reinforcement Learning
Deep reinforcement learning (RL) is a powerful approach to complex decision making. However, one issue that limits its practical application is its brittleness: it sometimes fails to train in the presence of small changes in the environment. Motivated by the success of zero-shot transfer, where pre-trained policies perform well on related tasks, we wish to select a set of training tasks so that the resulting policies generalize best. Given the high cost of training, it is critical to select training tasks strategically, but it is not well understood how to do so. We hence introduce Model-Based Transfer Learning (MBTL), which layers on top of existing RL methods to effectively solve contextual RL problems. MBTL models the generalization performance in two parts: 1) the performance set point, modeled using Gaussian processes, and 2) the performance loss (generalization gap), modeled as a linear function of contextual similarity. MBTL combines these two pieces of information within a Bayesian optimization (BO) framework to strategically select training tasks. We show theoretically that the method exhibits sublinear regret in the number of training tasks and discuss conditions that further tighten regret bounds. We experimentally validate our methods using urban traffic and standard continuous control benchmarks. The experimental results suggest that MBTL can achieve up to 50x improved sample efficiency compared with canonical independent training and multi-task training. Further experiments demonstrate the efficacy of BO and the insensitivity to the underlying RL algorithm and hyperparameters. This work lays the foundation for investigating explicit modeling of generalization, thereby enabling principled yet effective methods for contextual RL.
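The selection loop can be sketched roughly as follows (a hedged toy version: the 1-D context space, RBF kernel, known generalization-gap slope, and greedy average-performance acquisition are simplifying assumptions of mine, not the exact MBTL objective): a Gaussian process models training performance as a function of context, a linear-in-distance term models the generalization gap, and Bayesian optimization picks the next training context.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
contexts = np.linspace(0.0, 1.0, 50).reshape(-1, 1)      # 1-D context space (toy)
true_perf = lambda c: 1.0 - 0.5 * (c - 0.3) ** 2          # hidden training performance (toy)
slope = 0.8                                               # generalization-gap slope (assumed known)

trained, observed = [], []
for step in range(5):
    if not trained:
        nxt = len(contexts) // 2                          # arbitrary first training task
    else:
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-3)
        gp.fit(contexts[trained], np.array(observed))
        mu, sd = gp.predict(contexts, return_std=True)
        # Acquisition: if we train on candidate x, predicted zero-shot performance at
        # context c is the (optimistic) set point at x minus slope * |c - x|.
        scores = []
        for i, x in enumerate(contexts[:, 0]):
            zero_shot = (mu[i] + sd[i]) - slope * np.abs(contexts[:, 0] - x)
            scores.append(np.mean(np.maximum(zero_shot, 0.0)))
        nxt = int(np.argmax(scores))
    trained.append(nxt)
    observed.append(true_perf(contexts[nxt, 0]) + 0.01 * rng.normal())  # "train" on the task
    print(f"step {step}: train on context {contexts[nxt, 0]:.2f}")
```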
John Baras, Distinguished University Professor, University of Maryland College Park, and Erfaun Noorani, Technical Staff, MIT
Making RL Robust under Time and Data Constraints
A critical weakness of RL methods is their brittleness (non-robustness). We describe several recently developed methods that can result in provably robust RL. These include: (a) linkage to and use of risk-sensitive optimization methods; (b) a rigorous duality between performance and risk and its use in trade-off analysis; (c) use of knowledge representation and reasoning (KRR) and the associated integration of knowledge graphs (KG) and domain-specific large language models (DS-LLMs). We consider the use of the resulting robust RL methods as key components of composable (i.e., hierarchical or multi-level) decision making. We consider both single-agent and multi-agent (collaborative) problems. We include robustness with respect to finite-time specifications. We demonstrate with a few examples in the domain of trusted autonomy and safety: autonomous robotic vehicles and autonomous robotic manipulators.
Xuezhou Zhang, Assistant Professor, Boston University
Proper Hyper-parameter tuning in RL: Impossibility result and the road ahead
The cost of hyper-parameter optimization (HPO) has historically been overlooked as a metric in algorithm design, both in supervised learning and in RL. While cross-validation has been widely adopted as the standard approach to HPO in supervised learning, no counterpart exists in RL. In fact, sample-efficient model selection has been an active area of research in RL. In this work, however, we present a negative result, showing that there is a fundamental exponential separation in the sample complexity cost of HPO between supervised learning and RL. In particular, we prove that in order to select the best model among K candidates, the sample complexity necessarily blows up by a multiplicative factor of K, in contrast to supervised learning where the blow-up factor is only log(K). This result has profound implications for future research in RL, which we will discuss.
Shahriar Talebi, Postdoc, Harvard University
Geometric Trustworthy AI for Resilient and Adaptable Autonomy
Efficient decision-making in critical dynamical systems marks a transformative approach to optimizing performance and synthesizing feedback. By integrating machine learning and control techniques, this paradigm leverages small but valuable datasets to develop adaptive strategies capable of dynamically responding to changing conditions in real-time. These strategies enable informed decision-making, allowing systems to effectively navigate complex and uncertain environments while achieving desired outcomes across diverse domains, including robotics, manufacturing, healthcare, and finance. By advancing trustworthy AI through geometric techniques, we further enhance resilience and adaptability in autonomous systems. Key contributions include Riemannian policy optimization, geometric learning for filtering without noise statistics, ergodic risk-aware control, scalable multi-agent reinforcement learning, and geometric online stabilization. These innovations not only bolster system robustness but also establish a foundation for scalable and reliable autonomy in real-world applications.
Brian Plancher, Assistant Professor, Barnard College, Columbia University
Quantized and Differentially Encoded Observation Spaces for Edge Reinforcement Learning
Deep reinforcement learning (DRL) has led to many recent breakthroughs for complex AI systems. Applications of these results range from super-human-level video game agents to dexterous, physically intelligent robots. However, training these systems remains incredibly compute- and memory-intensive, often requiring huge training datasets and large experience replay buffers. This poses a challenge for the next generation of field robots, which will need to learn on the edge in order to adapt to their environments. In this poster, we summarize our recent work to address this issue through both quantization and differentially encoded observation spaces. By quantizing continuous action spaces and leveraging lossless differential video encoding schemes to compress image-based observations, we can drastically reduce the memory requirements of DRL replay buffers without impacting training performance. We evaluate our approach on a number of state-of-the-art DRL algorithms and find that quantization reduces overall memory costs by as much as 4.2x and differential image encoding reduces the memory footprint by as much as 16.7x across tasks from the Atari 2600 benchmark, the OpenAI Gym, and the DeepMind Control Suite (DMC). These savings also enable large-scale perceptive DRL that previously required paging between flash and RAM to be run entirely in GPU RAM, improving the latency of DMC tasks by as much as 32%.
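A toy version of the storage scheme (my own sketch; the actual work uses lossless video codecs rather than zlib): image observations are kept as one keyframe plus compressed frame-to-frame differences that reconstruct exactly, and continuous values can be uniformly quantized to uint8.

```python
import numpy as np
import zlib

def quantize(x, lo=-1.0, hi=1.0):
    """Uniformly quantize a float array in [lo, hi] to uint8 (256 levels)."""
    return np.round((np.clip(x, lo, hi) - lo) / (hi - lo) * 255).astype(np.uint8)

def dequantize(q, lo=-1.0, hi=1.0):
    return q.astype(np.float32) / 255 * (hi - lo) + lo

class DiffEncodedBuffer:
    """Toy replay buffer: one keyframe plus losslessly compressed frame differences."""
    def __init__(self, keyframe):
        self.keyframe = keyframe.astype(np.uint8)
        self.shape, self.diffs = keyframe.shape, []
        self._prev = keyframe.astype(np.int16)

    def add(self, frame):
        diff = frame.astype(np.int16) - self._prev          # small, mostly-zero residual
        self.diffs.append(zlib.compress(diff.tobytes()))    # lossless compression
        self._prev = frame.astype(np.int16)

    def frame(self, t):
        """Reconstruct observation t exactly by summing the stored differences."""
        out = self.keyframe.astype(np.int16)
        for d in self.diffs[:t]:
            out += np.frombuffer(zlib.decompress(d), dtype=np.int16).reshape(self.shape)
        return out.astype(np.uint8)

# Usage: frames that change only slightly from step to step compress very well.
rng = np.random.default_rng(0)
frames = [np.full((84, 84), 128, dtype=np.uint8)]
for _ in range(32):
    nxt = frames[-1].copy()
    nxt[rng.integers(0, 84), rng.integers(0, 84)] += 1      # tiny change per step
    frames.append(nxt)
buf = DiffEncodedBuffer(frames[0])
for f in frames[1:]:
    buf.add(f)
assert np.array_equal(buf.frame(10), frames[10])            # lossless reconstruction
raw = sum(f.nbytes for f in frames[1:])
enc = sum(len(d) for d in buf.diffs)
print(f"raw {raw} bytes vs encoded {enc} bytes")
print("quantize/dequantize example:", dequantize(quantize(np.array([0.1234]))))
```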
Runyu Zhang, PhD Student, Harvard University
Efficient and Resilient Coordination of Multi-agent Systems
Efficient and resilient coordination among autonomous agents plays an important role in domains such as energy management, robotic swarms, autonomous vehicles, and beyond. As these systems grow in complexity and scale, the challenge of achieving optimal coordination becomes increasingly difficult. The first part of the poster focuses on tackling the scalability issue by leveraging network structure. I will discuss how leveraging spatially exponentially decaying (SED) structures in networked systems enables scalable and near-optimal decentralized control. We show that under certain mild assumptions, the optimal controller also exhibits a similar SED structure, which leads to theoretical guarantees for efficient distributed strategies and offers practical insights for large-scale coordination. The second part focuses on efficient Nash equilibrium seeking for multi-agent systems. The key element guiding our approach is the concept of a ‘marginalized environment’, which allows more flexible algorithm design and tractable theoretical analysis. We first establish a fundamental observation: the equivalence between first-order stationary points and Nash equilibria. Building on this insight, we introduce a policy-gradient algorithm with provable sample complexity guarantees. Lastly, I will briefly present our ongoing work on safe and robust reinforcement learning and outline a roadmap for future work, including closing the loop between theory development and applications for AI-enabled multi-agent societal systems.
Riley Simmons-Edler, Postdoc, Harvard University
Bio-inspired reinforcement learning
TBD
Wilka Carvalho, Research Fellow, Kempner Institute, Harvard University
Preemptive Solving of Future Problems: Multitask Preplay in Humans and Machines
We live in a world filled with many co-occurring tasks—stove and fridge tasks are commonly co-located in kitchens, coffee shops and shopping centers are commonly co-located in city centers, and colleagues with similar specialties may sit near each other in office buildings. We hypothesize that humans leverage experience on one task to preemptively learn solutions to other tasks that were accessible but not achieved. We formalize this with “Multitask Preplay”, a novel computational theory that replays experience on one task as the starting point to “preplay” behavior aimed at solving other available tasks. We present 4 object acquisition experiments in a small artificial world where human behavioral choices and response times are more consistent with Multitask Preplay than with alternative hypotheses based on planning or the transfer of predictive representations. We then show that these results generalize to Craftax, a rich open-world environment with many more possible states and tasks. Finally, to showcase the utility of Multitask Preplay as a theory for human intelligence, we leverage Craftax to demonstrate that, compared to traditional preplay and predictive representation methods, Multitask Preplay enables learning of behaviors that best transfer to novel worlds that share task co-occurrence structure.
M Ganesh Kumar, Postdoc, Harvard University
A Model of Place Field Reorganization During Reward Maximization
When rodents learn to navigate in a novel environment, a high density of place fields emerges at reward locations, fields elongate against the trajectory, and individual fields change spatial selectivity while demonstrating stable behavior. Why place fields demonstrate these characteristic phenomena during learning remains elusive. We develop a normative framework using a reward maximization objective, whereby the temporal difference (TD) error drives place field reorganization to improve policy learning. Place fields are modeled using Gaussian radial basis functions to represent states in an environment, and directly synapse to an actor-critic for policy learning. Each field’s amplitude, center, and width, as well as downstream weights, are updated online at each time step to maximize cumulative reward. We demonstrate that this framework unifies three disparate phenomena observed in navigation experiments. Furthermore, we show that these place field phenomena improve policy convergence when learning to navigate to a single target and relearning multiple new targets. To conclude, we develop a normative model that recapitulates several aspects of hippocampal place field learning dynamics and unifies mechanisms to offer testable predictions for future experiments.
Preprint: https://www.biorxiv.org/content/10.1101/2024.12.12.627755v1.abstract
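A stripped-down sketch of the model's ingredients (the 1-D track, constants, and the critic-only field updates are simplifications of mine; see the preprint above for the full model): Gaussian place fields with learnable amplitude, center, and width feed an actor-critic, and the TD error updates both the downstream weights and the field parameters online.

```python
import numpy as np

rng = np.random.default_rng(0)
N, gamma = 30, 0.95
eta_w, eta_field = 0.05, 0.002
track_len, goal = 1.0, 0.85                       # 1-D track, reward at x = goal

# Learnable place-field parameters: amplitude a, center c, width s.
a = np.ones(N)
c = np.linspace(0.0, track_len, N)
s = np.full(N, 0.05)
w = np.zeros(N)                                   # critic weights
theta = np.zeros((2, N))                          # actor weights, actions {left, right}

def fields(x):
    return a * np.exp(-(x - c) ** 2 / (2 * s ** 2))

def policy(phi):
    logits = theta @ phi
    p = np.exp(logits - logits.max())
    return p / p.sum()

for episode in range(300):
    x = rng.uniform(0.0, track_len)               # random start location
    for step in range(300):
        phi = fields(x)
        p = policy(phi)
        act = rng.choice(2, p=p)                  # 0: step left, 1: step right
        x_new = np.clip(x + (0.02 if act == 1 else -0.02), 0.0, track_len)
        done = abs(x_new - goal) < 0.02
        r = 1.0 if done else 0.0

        # TD error from the critic V(x) = w . phi(x).
        delta = r + (0.0 if done else gamma * w @ fields(x_new)) - w @ phi

        # Actor-critic updates of the downstream synaptic weights.
        grad_log = -p; grad_log[act] += 1.0
        theta += eta_w * delta * np.outer(grad_log, phi)
        w += eta_w * delta * phi

        # TD-error-driven reorganization of each field (critic pathway only):
        # semi-gradient of V with respect to amplitude, center, and width.
        a += eta_field * delta * w * phi / np.maximum(a, 1e-6)
        c += eta_field * delta * w * phi * (x - c) / s ** 2
        s += eta_field * delta * w * phi * (x - c) ** 2 / s ** 3
        a, s = np.clip(a, 0.1, 5.0), np.clip(s, 0.02, 0.3)

        x = x_new
        if done:
            break

print("fields with centers within 0.1 of the goal:", int(np.sum(np.abs(c - goal) < 0.1)))
```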
Jian Qian, PhD Student, MIT
Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff
TBD
Moses C. Nah, Postdoc, MIT
Modular Robot Control with Motor Primitives
Human motor skills far surpass modern robot technology despite the limitations of the neuromuscular system. This poster introduces a modular control framework inspired by motor primitives, the fundamental building blocks of human motor behavior. Using Elementary Dynamic Actions and the Norton equivalent network model, the framework simplifies robot control by formulating functional units called “modules.” These modules enable a wide range of control tasks through modular combinations, eliminating the need for Inverse Kinematics and avoiding the challenges of kinematic singularity and redundancy. The framework integrates modular Imitation Learning with Dynamic Movement Primitives for efficient motion planning and command-level modularity. The presented modular framework enables a divide-and-conquer strategy to simplify complex robot control tasks. It can also explicitly regulate the dynamics of physical interaction, facilitating contact-rich manipulation. Simulation results and real robot implementations highlight the framework’s potential to bridge the performance gap between humans and robots.
Erhan Can Ozcan, PhD Student, Boston University
A Model-Based Approach for Improving Reinforcement Learning Efficiency Leveraging Expert Observations
This study investigates how to incorporate expert observations (without explicit information on expert actions) into a deep reinforcement learning setting to improve sample efficiency. First, we formulate an augmented policy loss combining a maximum entropy reinforcement learning objective with a behavioral cloning loss that leverages a forward dynamics model. Then, we propose an algorithm that automatically adjusts the weights of each component in the augmented loss function. Experiments on a variety of continuous control tasks demonstrate that the proposed algorithm outperforms various benchmarks by effectively utilizing available expert observations.
Yassir Jedra, Postdoc, MIT
Model-free Low-rank RL via Leveraged Entry-wise Matrix Estimation
We consider the problem of learning an $\epsilon$-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness.
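A numerical sketch of the leveraged estimation step (the rank, sampling budgets, and noiseless setting are placeholder assumptions of mine, not the LoRa-PI specification): uniform entry samples give a rough spectral estimate, its row/column leverage scores select which rows and columns to sample densely, and a CUR reconstruction completes the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 3
Q = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))       # ground-truth low-rank "Q-matrix"

# Phase 1: uniform entry sampling -> rough spectral estimate -> leverage scores.
p_unif = 0.1
mask = rng.random((n, n)) < p_unif
rough = np.where(mask, Q, 0.0) / p_unif                     # zero-filled, unbiased in expectation
U, S, Vt = np.linalg.svd(rough)
lev_rows = np.sum(U[:, :r] ** 2, axis=1)                    # row leverage scores
lev_cols = np.sum(Vt[:r, :] ** 2, axis=0)                   # column leverage scores

# Phase 2: sample the most important rows/columns densely, then CUR completion.
k = 3 * r
rows = np.argsort(-lev_rows)[:k]
cols = np.argsort(-lev_cols)[:k]
C = Q[:, cols]                                              # fully sampled columns
R = Q[rows, :]                                              # fully sampled rows
W = Q[np.ix_(rows, cols)]                                   # intersection block
Q_hat = C @ np.linalg.pinv(W) @ R                           # CUR estimate of the full matrix

err = np.max(np.abs(Q_hat - Q)) / np.max(np.abs(Q))
print(f"relative entry-wise error: {err:.2e}")              # small despite sampling few entries
```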
Vijay Ganesh, Professor, Georgia Tech
Reinforcement Learning with Symbolic Feedback
TBD
Anusha Srikanthan, PhD Student, Harvard University
A Data-Driven Approach to Synthesizing Dynamics-Aware Trajectories for Underactuated Robotic Systems
We consider joint trajectory generation and tracking control for under-actuated robotic systems. A common solution is to use a layered control architecture, where the top layer uses a simplified model of the system dynamics for trajectory generation, and the low layer ensures approximate tracking of this trajectory via feedback control. While such layered control architectures are standard and work well in practice, selecting the simplified model used for trajectory generation typically relies on engineering intuition and experience. In this paper, we propose an alternative data-driven approach to dynamics-aware trajectory generation. We show that a suitable augmented Lagrangian reformulation of a global nonlinear optimal control problem results in a layered decomposition of the overall problem into trajectory planning and feedback control layers. Crucially, the resulting trajectory optimization is dynamics-aware, in that it is modified with a tracking penalty regularizer encoding the dynamic feasibility of the generated trajectory. We show that this tracking penalty regularizer can be learned from system rollouts for independently designed low-layer feedback control policies, and we instantiate our framework in the context of unicycle and quadrotor control problems in simulation. Further, we show that our approach handles the sim-to-real gap through experiments on a quadrotor hardware platform without any additional training. For both the synthetic unicycle example and the quadrotor system, our framework shows significant improvements in both computation time and dynamic feasibility in simulation and hardware experiments.
Alessio Russo, Postdoc, Boston University
Multi-Reward Best Policy Identification
In this study, we investigate the Multi-Reward Best Policy Identification (MR-BPI) problem, where the goal is to determine the best policy for all rewards in a given set R with minimal sample complexity and a prescribed confidence level. We derive a fundamental instance-specific lower bound on the sample complexity required by any Probably Correct (PC) algorithm in this setting. This bound guides the design of an optimal exploration policy attaining minimal sample complexity.
Jingqi Li, PhD Student, UC Berkeley
Certifiable Reachability Learning Using a New Lipschitz Continuous Value Function and Minimax Policy Gradients
We propose a new reachability learning framework for high-dimensional nonlinear systems, focusing on reach-avoid problems. These problems require computing the reach-avoid set, which ensures that all its elements can safely reach a target set despite disturbances within pre-specified bounds. Our framework has two main parts: offline learning of a newly designed reach-avoid value function, and post-learning certification. Compared to prior work, our new value function is Lipschitz continuous and its associated Bellman operator is a contraction mapping, both of which improve the learning performance. To ensure deterministic guarantees for our learned reach-avoid set, we introduce two efficient post-learning certification methods. Both methods can be used online for real-time local certification or offline for comprehensive certification. We validate our framework in a 12-dimensional Crazyflie drone racing hardware experiment and a simulated 10-dimensional highway takeover example.
Zana Bucinca, PhD Student, Harvard University
Offline Reinforcement Learning for Adaptive Support in AI-Assisted Decision-Making
AI decision-support tools typically offer a fixed type of assistance, like AI recommendations and explanations, regardless of the specific decision, individual, or broader context. This fixed design has been shown to hinder both human-AI decision accuracy and human skill improvement in the task. We posit that AI assistance needs to be dynamic, changing in response to contextual factors (e.g., AI uncertainty, task difficulty), individual differences, and specified objectives (e.g., decision accuracy, skill improvement).
To enable such adaptive support, we propose reinforcement learning (RL) as a general approach for modeling human-AI decision-making to optimize human-AI interaction for diverse objectives. RL enables optimizing various objectives in AI-assisted decision-making by tailoring and adaptively providing decision support to humans — the right type of assistance, to the right person, at the right time. We instantiated our approach with two objectives, human-AI accuracy on the decision-making task and human skill improvement (i.e., learning about the task), and learned decision support policies from previous human-AI interaction data.
We compared the optimized policies against several baselines in AI-assisted decision-making. Across two experiments (N = 316 and N = 964), our results consistently demonstrated that people interacting with policies optimized for accuracy achieve significantly higher accuracy — and even human-AI complementarity — compared to those interacting with any other type of AI support. Our results further indicated that human learning was more difficult to optimize than accuracy. While the policies learned the best available actions to optimize learning, participants who interacted with learning-optimized policies showed significant learning improvement only at times.
Our research (1) demonstrates offline RL to be a promising approach to model the dynamics of human-AI decision-making, leading to policies that may optimize various objectives and provide novel insights about the AI-assisted decision-making space, and (2) emphasizes the importance of considering skill improvement and other human-centric objectives beyond accuracy in AI-assisted decision-making, opening up the novel research challenge of optimizing human-AI interaction for such objectives.
Zhaolin Ren, PhD Student, Harvard University
Efficient learning-based optimization and control toolkits for real-world cyberphysical systems
Learning‐based optimization and control hold immense potential to revolutionize and enhance cyberphysical systems. However, real‐world challenges—such as limited data availability, stochastic and nonlinear system dynamics, and the need for scalability in large‐scale and continuous control problems—constrain their application. In my work, I address these challenges by developing scalable, sample‐efficient, and theoretically principled learning‐based optimization and control algorithms tailored for cyberphysical systems.
My research introduces advancements in two key areas: (1) gradient‐free optimization, learning, and control, and (2) representation learning for sample‐efficient control and reinforcement learning. For gradient‐free optimization, I present novel contributions such as a provably efficient parallel Bayesian optimization framework, which achieves state‐of‐the‐art performance in both synthetic and realistic environments. Additionally, I develop a real‐time, adaptive, Bayesian‐learning‐based feedback control system to enhance cardiac tissue maturation, marking one of the first successful applications of AI‐cyborg integration in bioengineering.
In the domain of representation learning, I propose finite‐dimensional spectral representations for value functions, enabling scalable reinforcement learning with provable guarantees. I extend this approach to address the sim‐to‐real gap in robotics—improving the performance of real‐world controllers—and to multi‐agent networked systems, where I design scalable algorithms for continuous state‐action spaces with rigorous convergence guarantees. By integrating insights from optimization, machine learning, and reinforcement learning, my work establishes new and efficient optimization and control toolkits for deploying learning‐based solutions in diverse cyberphysical systems, including bioengineering and robotics.
Leo Bo Liu, PhD Student, Harvard University
Odors as “natural language”: sparse neural networks reinforced in mammalian olfactory systems and large language models
Sparse connectivity is a hallmark of brain neural networks and a key focus in AI for efficient computation. We explore sparse neural networks through two related topics: bilateral alignment in mammalian olfactory systems and pruning large language models for on-device AI assistants.
For the first topic, inspired by dual nostrils creating two cortical odor representations, we studied sparse inter-hemispheric projections for bilateral alignment. Using a local and biologically plausible Hebbian rule, we found sparse projections suffice and their density scales inversely with cortical neuron numbers. Moreover, Hebbian updates approximate global stochastic gradient descent (SGD) since their directions overlap, suggesting biologically plausible rules can align with global optimization.
Inspired by a similar scaling observed in Transformer attention matrices, the second topic examines pruning Transformers in Meta Llama-2 and Llama-3 models. Over 50% of parameters were pruned while maintaining performance, with Llama-3 producing fewer factual errors at sparsity limits but requiring more parameters due to training differences.
Together, these findings provide insights into sparse network design, from biological systems to efficient AI models.
Haoxing Tian, PhD Student, Boston University
One-Shot Averaging for Distributed TD($\lambda$) Under Markov Sampling
We consider a distributed setup for reinforcement learning, where each agent has a copy of the same Markov Decision Process but transitions are sampled from the corresponding Markov chain independently by each agent. We show that in this setting, we can achieve a linear speedup for TD($\lambda$), a family of popular methods for policy evaluation, in the sense that $N$ agents can evaluate a policy $N$ times faster provided the target accuracy is small enough. Notably, this speedup is achieved by “one shot averaging,” a procedure where the agents run TD($\lambda$) with Markov sampling independently and only average their results after the final step. This significantly reduces the amount of communication required to achieve a linear speedup relative to previous work.
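A minimal sketch of the procedure on a toy chain (the chain, tabular features, and constants are my own): each agent runs TD($\lambda$) with eligibility traces on its own independently sampled trajectory, and the value estimates are averaged exactly once, after the final step.

```python
import numpy as np

S, gamma, lam, alpha, T, N = 5, 0.9, 0.8, 0.05, 20_000, 10
P = np.full((S, S), 1.0 / S)                      # transition matrix of the evaluated policy
r = np.arange(S, dtype=float)                     # reward depends on the current state

def td_lambda(seed):
    """One agent: TD(lambda) with accumulating eligibility traces, tabular values."""
    rng = np.random.default_rng(seed)
    v, e, s = np.zeros(S), np.zeros(S), 0
    for _ in range(T):
        s_next = rng.choice(S, p=P[s])
        delta = r[s] + gamma * v[s_next] - v[s]
        e = gamma * lam * e
        e[s] += 1.0
        v += alpha * delta * e
        s = s_next
    return v

# Each agent samples its own Markov trajectory; results are averaged one time at the end.
v_agents = np.array([td_lambda(seed) for seed in range(N)])
v_one_shot = v_agents.mean(axis=0)

# Closed-form ground truth for comparison: v* = (I - gamma * P)^{-1} r.
v_true = np.linalg.solve(np.eye(S) - gamma * P, r)
print("single-agent error  :", np.abs(v_agents[0] - v_true).max())
print("one-shot avg error  :", np.abs(v_one_shot - v_true).max())   # averaging typically shrinks the noise
```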
Fengjun Yang, PhD Student, University of Pennsylvania
Coordinating Planning and Tracking in Layered Control Policies via Actor-Critic Learning
We propose a reinforcement learning (RL)-based algorithm to jointly train (1) a trajectory planner and (2) a tracking controller in a layered control architecture. Our algorithm arises naturally from a rewrite of the underlying optimal control problem that lends itself to an actor-critic learning approach. By explicitly learning a dual network to coordinate the interaction between the planning and tracking layers, we demonstrate the ability to achieve an effective consensus between the two components, leading to an interpretable policy. We theoretically prove that our algorithm converges to the optimal dual network in the Linear Quadratic Regulator (LQR) setting and empirically validate its applicability to nonlinear systems through simulation experiments on a unicycle model.
Zhiyu Zhang, Postdoc, Harvard University
Fast TRAC: A Parameter-Free Optimizer for Lifelong Reinforcement Learning
A key challenge in lifelong reinforcement learning (RL) is the loss of plasticity, where previous learning progress hinders an agent’s adaptation to new tasks. While regularization and resetting can help, they require precise hyperparameter selection at the outset and environment-dependent adjustments. Building on the principled theory of online convex optimization, we present a parameter-free optimizer for lifelong RL, called TRAC, which requires no tuning or prior knowledge about the distribution shifts. Extensive experiments on Procgen, Atari, and Gym Control environments show that TRAC works surprisingly well, mitigating loss of plasticity and rapidly adapting to challenging distribution shifts, despite the underlying optimization problem being nonconvex and nonstationary.
Neharika Jali, PhD Student, CMU
Efficient Reinforcement Learning for Routing Jobs in Heterogeneous Queueing System
We present a policy-gradient-based algorithm that leverages the underlying queueing structure to efficiently route jobs arriving at a central queue to a system of heterogeneous servers, a setting often observed in modern datacenters.
Link: https://proceedings.mlr.press/v238/jali24a/jali24a.pdf
Yuanyuan Shi, Assistant Professor, UCSD
Stability-constrained Reinforcement Learning for Real-world Power Systems
Learning-based controllers, especially reinforcement learning (RL), have shown promising performance for future power grid control with rapid fluctuations and complex dynamics introduced by numerous distributed energy resources (DERs). Nevertheless, the lack of performance guarantees can be a significant barrier to the adoption of RL by the system operators. My research develops RL controllers with stability and steady‑state optimality guarantees, with applications to voltage control in the distribution grids. Specifically, the key structure we identify is monotonic functions. These are parameterized by monotone neural networks to guarantee stability by design and trained through RL for transient performance optimization. We further synthesize the RL controller with the gradient‑based controller to obtain steady‑state optimality. Central to the deployment of these learning‑based control algorithms in real-world power systems is their robustness to uncertainties. We have investigated policy adaptation to handle the model uncertainty between the power grid simulator and the real system, and robustness of the RL policies to handle external uncertainties such as sensor noise and cyber-attacks.
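To illustrate the structural idea, here is a toy sketch (a linearized voltage model and randomly initialized monotone networks of my own; in the actual work the networks are trained with RL and combined with a gradient-based controller for steady-state optimality): constraining the feedback to be a monotone function of the voltage deviation, via non-negative network weights, is the property exploited for stability by design.

```python
import numpy as np

class MonotoneNet:
    """Scalar map f(x) that is non-decreasing with f(0) = 0, built from
    non-negative weights and a monotone activation (tanh)."""
    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = np.abs(rng.normal(size=hidden))             # non-negative -> monotone
        self.b1 = rng.normal(size=hidden)
        self.w2 = np.abs(rng.normal(size=hidden)) / hidden    # keep the slope moderate

    def _raw(self, x):
        return self.w2 @ np.tanh(self.w1 * x + self.b1)

    def __call__(self, x):
        return self._raw(x) - self._raw(0.0)                  # enforce f(0) = 0

# Toy linearized voltage model (an assumption of this sketch, not a real feeder):
# v = X q + v_ext with X positive definite; the incremental control law
# q <- q - eta * f(v - v_ref), with f monotone per bus, drives v toward v_ref.
rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))
X = A @ A.T + n * np.eye(n)                                   # positive-definite sensitivity
v_ext = 1.0 + 0.05 * rng.normal(size=n)
v_ref = 1.0
f = [MonotoneNet(seed=i) for i in range(n)]                   # untrained monotone controllers

q = np.zeros(n)
for step in range(500):
    v = X @ q + v_ext
    q = q - 0.05 * np.array([f[i](v[i] - v_ref) for i in range(n)])
print("max voltage deviation after control:", float(np.abs(X @ q + v_ext - v_ref).max()))
```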
Shreya Saxena, Assistant Professor, Luke Gong, Swartz Postdoctoral Fellow, and Amelia Johnson, Postgraduate Associate, Yale University
Multi-agent reinforcement learning for modeling animal social behaviors in cooperative tasks
Understanding cooperative behavior in animals has been a longstanding focus in cognition and neuroscience. In social tasks, each animal must develop behavioral policies influenced by the actions of others, who are simultaneously learning and adapting their own strategies. This scenario closely parallels the framework of multi-agent reinforcement learning (MARL). However, the diversity of animal traits and external influences introduces significant challenges in uncovering the behavioral and neural mechanisms underlying animal social cooperation. We aim to develop customized and task-specific artificial MARLs as a novel approach to model and analyze social cooperative behaviors observed in animal experiments.
In this study, we employed various deep MARL models to emulate animal behavior during social cooperation tasks. Specifically, we have tailored virtual environments to replicate experimental paradigms and trained agents using different learning methods to simulate learning in cooperative tasks. We demonstrated that certain learning methods could replicate behavioral metrics recorded experimentally, validating the feasibility of this approach. Our next steps focus on incorporating additional features of animal experimental conditions, like the partial observability of the environment due to gaze dynamics, into the models. By examining the agents’ learning processes, we aim to understand how they utilize social and non-social information to create representations of other agents and accomplish tasks. In turn, animal behavior may provide key insights into efficient cooperative learning using inverse models.
In summary, leveraging MARL to model the acquisition of social behavior policies in environments akin to experimental paradigms will offer a promising avenue for advancing our understanding of animal learning in social contexts. Additionally, comparing the activity of these deep reinforcement learning networks to recorded neural data from animals could provide insights into how the brain processes socially salient information during task learning. Ultimately, this framework holds potential as a rapid and efficient tool for theory- and goal-driven hypothesis testing in neuroscience, which is often extremely time-consuming to validate through traditional experimental methods.
Pulkit Agrawal, Associate Professor, and Idan Shenfeld, PhD Student, MIT
Value Augmented Sampling for Language Model Alignment and Personalization
Aligning Large Language Models (LLMs) to cater to different human preferences, learn new skills, and unlearn harmful behavior is an important problem. Search-based methods, such as Best-of-N or Monte-Carlo Tree Search, are performant but impractical for LLM adaptation due to their high inference cost. On the other hand, using Reinforcement Learning (RL) for adaptation is computationally efficient, but it performs worse due to the optimization challenges in co-training the value function and the policy. We present a new framework for reward optimization, Value Augmented Sampling (VAS), that can maximize different reward functions using data sampled from only the initial, frozen LLM. VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function, making the optimization stable; it outperforms established baselines such as PPO and DPO on standard benchmarks and achieves results comparable to Best-of-128 at a much lower inference cost. Unlike existing RL methods that require changing the weights of the LLM, VAS does not require access to the weights of the pre-trained LLM. Thus, it can even adapt LLMs (e.g., ChatGPT) that are available only as APIs. In addition, our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one at deployment time, paving the road ahead for the future of aligned, personalized LLMs.
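A rough sketch of value-guided decoding in the spirit described above (not the authors' implementation; base_logits, value_fn, and beta are placeholders): per-token logits from the frozen model are shifted by a scaled value estimate before sampling.

    import numpy as np

    def value_augmented_step(base_logits, value_fn, state, beta=1.0, top_k=20,
                             rng=np.random.default_rng()):
        """Sample one token: shift the frozen model's logits by beta * Q(state, token)
        for the top_k candidate tokens, leaving the base model's weights untouched."""
        candidates = np.argsort(base_logits)[-top_k:]        # restrict to likely tokens
        scores = base_logits[candidates].astype(float)
        scores = scores + beta * np.array([value_fn(state, int(tok)) for tok in candidates])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        return int(rng.choice(candidates, p=probs))

    # Toy usage with a made-up 100-token vocabulary and a dummy value function.
    rng = np.random.default_rng(0)
    fake_logits = rng.normal(size=100)
    dummy_q = lambda state, tok: 1.0 if tok % 2 == 0 else -1.0   # "prefers" even token ids
    print(value_augmented_step(fake_logits, dummy_q, state=None, beta=2.0, rng=rng))

Composing several rewards then amounts to adding several value heads with their own beta weights at this step, which is the deployment-time control the abstract highlights.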
Pulkit Agrawal, Associate Professor, and Zhang-Wei Hong, PhD Student, MIT
Going Beyond Heuristics by Imposing Policy Improvement as a Constraint
In many reinforcement learning (RL) applications, incorporating heuristic rewards alongside the task reward is crucial for achieving desirable performance. Heuristics encode prior human knowledge about how a task should be done, providing valuable hints for RL algorithms. However, such hints may not be optimal, limiting the performance of learned policies. The currently established way of using heuristics is to modify the heuristic reward so that the optimal policy learned with it remains the same as the optimal policy for the task reward (i.e., optimal policy invariance). However, these methods often fail in practical scenarios with limited training data. We found that while optimal policy invariance ensures convergence to the best policy under the task reward, it does not guarantee better performance than policies trained with biased heuristics in the finite-data regime encountered in practice. In this paper, we introduce a new principle tailored to finite-data settings. Instead of enforcing optimal policy invariance, we train a policy that combines task and heuristic rewards and ensure that it outperforms the policy trained with the heuristic reward alone. As such, we prevent policies from merely exploiting heuristic rewards without improving the task reward. Our experiments on robotic locomotion, helicopter control, and manipulation tasks demonstrate that our method consistently outperforms the heuristic policy, regardless of the quality of the heuristic rewards.
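In symbols (notation mine, summarizing the abstract), the principle replaces optimal policy invariance with a policy-improvement constraint against the heuristic-trained policy pi_heur:

    \max_{\pi} \; J_{\mathrm{task+heur}}(\pi)
    \quad \text{s.t.} \quad J_{\mathrm{task}}(\pi) \;\ge\; J_{\mathrm{task}}(\pi_{\mathrm{heur}}),
    \qquad \text{where } J_{r}(\pi) = \mathbb{E}_{\pi}\Big[\textstyle\sum_{t\ge 0} \gamma^{t}\, r(s_t, a_t)\Big].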
Arthur Castello Branco de Oliveira, Postdoc, Northeastern University
Comments on the Convergence of the LQR Problem in a Data-Driven Context
Gradient dominance is an established condition for exponential convergence and robustness of training algorithms. Despite this, the gradient of the continuous-time (CT) LQR cost function does not satisfy a Polyak-Lojasiewicz (PL) inequality globally, only a weaker version of it. In this poster we explore the consequences of this fact, compare the convergence of the CT LQR gradient flow to its discrete-time (DT) counterpart, where a global PL inequality is satisfied, and examine the consequences, or lack thereof, of this weaker condition when implementing an overparameterized formulation.
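For reference, the gradient-dominance (PL) condition for a cost J over gains K with optimal value J* reads as follows, with the weaker variant alluded to above holding only with a constant that depends on the sublevel set (my paraphrase of the standard definitions, not the poster's exact statements):

    \|\nabla J(K)\|^{2} \;\ge\; 2\mu \,\big(J(K) - J^{*}\big) \quad \text{for all stabilizing } K \qquad \text{(global PL)},
    \|\nabla J(K)\|^{2} \;\ge\; 2\mu(\mathcal{S}) \,\big(J(K) - J^{*}\big) \quad \text{for } K \in \mathcal{S}, \ \mu(\mathcal{S}) > 0 \ \text{on each sublevel set } \mathcal{S} \qquad \text{(weaker, set-dependent PL)}.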
Aviral Kumar, Assistant Professor, CMU
Algorithms for Online RL Fine-Tuning at Scale
One of the key advantages of RL lies in fine-tuning models obtained through other training procedures, such as imitation learning or offline RL for robotic policies, or large-scale Internet pre-training for foundation models. In this talk, I will discuss two of our recent works on developing techniques for online RL fine-tuning. The first focuses on using expressive policy classes, such as autoregressive token-based policies and diffusion policies, in online RL fine-tuning. We develop policy-agnostic RL (https://policyagnosticrl.github.io/), an approach that can fine-tune any arbitrary policy class in a stable manner at large scale, entirely via TD-learning. This yields some of the first results showing online fine-tuning of large, generalist robotic VLA policies on real robots. Second, I will talk about how one could alleviate the challenges of unlearning and forgetting that often arise when running online RL fine-tuning, especially with TD-learning methods. I will present a mental model of this problem and a very simple approach based on a warm-up phase to address it (https://zhouzypaul.github.io/wsrl/).
Navid Azizan, Assistant Professor, and Zeyang Li, PhD Student, MIT
Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium
Multi-agent reinforcement learning (MARL) has achieved notable success in cooperative tasks, demonstrating impressive performance and scalability. However, deploying MARL agents in real-world applications presents critical safety challenges. Current safe MARL algorithms are largely based on the constrained Markov decision process (CMDP) framework, which enforces constraints only on discounted cumulative costs and lacks an all-time safety assurance. Moreover, these methods often overlook the feasibility issue—where the system will inevitably violate state constraints within certain regions of the constraint set—resulting in either suboptimal performance or increased constraint violations. To address these challenges, we propose a novel theoretical framework for safe MARL with state-wise constraints, where safety requirements are enforced at every state the agents visit. To resolve the feasibility issue, we leverage a control-theoretic notion of the feasible region, the controlled invariant set (CIS), characterized by the safety value function. We develop a multi-agent method for identifying CISs, ensuring convergence to a Nash equilibrium on the safety value function. By incorporating CIS identification into the learning process, we introduce a multi-agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state-wise constrained cooperative Markov games, achieving an optimal balance between feasibility and performance. Furthermore, for practical deployment in complex high-dimensional systems, we propose Multi-Agent Dual Actor-Critic (MADAC), a safe MARL algorithm that approximates the proposed iteration scheme within the deep RL paradigm. Empirical evaluations on safe MARL benchmarks demonstrate that MADAC consistently outperforms existing methods, delivering much higher rewards while reducing constraint violations.
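For context, with a safety margin h(s) whose nonnegativity encodes the state constraints, one common way to define the safety value function and the controlled invariant set used here is (a standard formulation, not necessarily the paper's exact one):

    V_{h}(s) \;=\; \max_{\pi}\ \inf_{t \ge 0}\ h(s_t), \qquad s_0 = s,\ a_t \sim \pi(\cdot \mid s_t),
    \qquad \mathcal{S}_{\mathrm{CIS}} \;=\; \{\, s \;:\; V_{h}(s) \ge 0 \,\}.

States outside this set are exactly those from which some constraint violation is unavoidable, which is the feasibility issue the abstract addresses.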
Animesh Garg, Assistant Professor, Georgia Tech
Decision making with world models
Well-regularized world models can yield smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control: the world model is first pre-trained on offline data, and policies are then extracted from it via first-order optimization in less than 10 minutes per task.
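A minimal sketch of the core recipe, first-order policy optimization through a learned, differentiable world model (the toy dynamics, reward, and sizes are illustrative, and this is not the PWM codebase):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Frozen, differentiable world model (pretrained weights are faked here by random init).
    world_model = nn.Sequential(nn.Linear(4 + 1, 64), nn.Tanh(), nn.Linear(64, 4))
    for p in world_model.parameters():
        p.requires_grad_(False)

    policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1), nn.Tanh())
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

    def reward(s, a):
        # Illustrative smooth reward: stay near the origin with small actions.
        return -(s.pow(2).sum(-1) + 0.1 * a.pow(2).sum(-1))

    for step in range(200):
        s = torch.randn(256, 4)                    # batch of imagined start states
        total = 0.0
        for t in range(16):                        # short imagined rollout
            a = policy(s)
            total = total + (0.99 ** t) * reward(s, a)
            s = world_model(torch.cat([s, a], dim=-1))   # gradients flow through the model
        loss = -total.mean()                       # first-order (backprop-through-model) update
        opt.zero_grad()
        loss.backward()
        opt.step()

Because the imagined rollout is fully differentiable, the policy gradient comes from backpropagation through the (frozen) model rather than from high-variance likelihood-ratio estimates, which is why the smoothness of the learned model matters.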