Human-Timescale Adaptation
in an Open-Ended Task Space

Adaptive Agent Team
ICML 2023 (oral)

Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans.  

In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. 

Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. 
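To make the interplay of these ingredients concrete, the following minimal Python sketch shows how an attention-based memory policy might be rolled out across multiple trials of the same task, so that adaptation happens purely in-context rather than through weight updates. All names here (`policy.initial_memory`, `policy.act`, `env.reset_trial`, `env.step`) are illustrative assumptions, not the actual AdA implementation.

```python
# Hypothetical sketch: a memory-based policy rolled out over several trials
# of the same task. Interface names are illustrative, not AdA's actual API.

def run_task_episode(policy, env, num_trials, max_steps_per_trial):
    """Roll out several trials on one task, carrying memory across trials."""
    memory = policy.initial_memory()          # attention-based context, empty at first
    trial_returns = []
    for _ in range(num_trials):
        obs = env.reset_trial()               # same task and world, fresh trial
        total_reward = 0.0
        for _ in range(max_steps_per_trial):
            # The policy attends over its whole within-episode history, so
            # information gathered in early trials can be exploited later,
            # without any gradient updates at test time.
            action, memory = policy.act(obs, memory)
            obs, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        trial_returns.append(total_reward)
    # If the agent adapted, later trials should score higher than the first.
    return trial_returns
```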

We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.

Adaptive Agents Team (in alphabetical order)

Full-time Contributors: Jakob Bauer, Kate Baumli, Feryal Behbahani, Avishkar Bhoopchand, Michael Chang, Adrian Collister, Edward Hughes, Sheleem Kashem, Jack Parker-Holder, Yannick Schroecker, Jakub Sygnowski, Alexander Zacherl, Lei Zhang. Part-time Contributors: Nathalie Bradley-Schmieg, Natalie Clay, Vibhavari Dasagi, Lucy Gonzalez, Karol Gregor, Maria Loks-Thompson, Hannah Openshaw, Shreya Pathak, Nicolas Perez-Nieves, Nemanja Rakicevic, Tim Rocktäschel, Sarah York. Advisors: Satinder Singh, Karl Tuyls

We train a large Transformer model with meta-RL in XLand. During training, tasks are uniformly sampled, and subsequently filtered to produce an ever-changing training pool of tasks at the frontier of the agent’s capabilities. After training on these tasks, the agent is capable of adapting to unseen hand-authored tasks as effectively and efficiently as humans.
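The filtering step can be sketched roughly as follows; the scoring function, thresholds and helper names are illustrative assumptions rather than the actual curriculum criteria used in the paper.

```python
import random

def refresh_task_pool(task_space, evaluate_fn, pool_size, low=0.1, high=0.9):
    """Build a training pool of tasks at the frontier of the agent's ability.

    evaluate_fn(task) is assumed to return a normalised score in [0, 1] from a
    quick rollout with the current policy; `low`/`high` are illustrative
    thresholds, not the paper's actual filtering criteria.
    """
    pool = []
    while len(pool) < pool_size:
        task = random.choice(task_space)   # uniform sampling over the task space
        score = evaluate_fn(task)          # cheap estimate of current performance
        if low < score < high:             # keep tasks that are neither trivial nor hopeless
            pool.append(task)
    return pool
```

Because the pool is rebuilt as the policy improves, the training distribution keeps shifting towards tasks the agent can almost, but not yet, solve.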

We train our agent in XLand, a vast, smooth and diverse task space of adaptation problems. Different tasks have different adaptation requirements, such as experimentation, tool use or division of labour.

Please read our paper for detailed results, methods and ablations, including a comparison to human performance.

Results Reel: Full Episodes

These are unedited recordings of the full episodes from which we took the abridged footage shown in the Results Reel video. As agent behaviour is stochastic, these episodes represent 'common behavioural patterns' seen on these tasks and should not be interpreted as representative of the agent's behaviour in every episode.

Human Adaptation Performance

The following two videos show what "good adaptation" in this domain looks like when performed by experienced human players. See figure E.1 in the paper for a comparison between agent and human scores on these two tasks.

Same-Signature Tasks

These four tasks were built to highlight the necessity for adaptation in our task space. On the first timestep, all of the tasks look exactly the same to AdA: they are set in the same world, with the same set of objects and the same number of fully hidden production rules. Only by interacting with the environment can the agent infer which rules are actually in play. Since each set of rules creates different dynamics, the difficulty varies greatly, and both humans and the agent need to behave differently to solve each task.

Prompting Experiments

These videos show how our agent can be prompted with (human) expert demonstrations to improve its behaviour. We do not use any demonstrations during training, yet the agent can make effective use of these prompts and adapt accordingly. The first video shows AdA's naive behaviour on this navigation task: it tends to default to exploring only the left path for multiple trials and receives no reward. As the second video shows, prompting AdA with a single-trial (human) expert demonstration that navigates to the goal object corrects this bias: when AdA takes over control in the second trial, it reliably navigates to the goal object. For details, see Section 3.8, "AdA can leverage prompting with first-person demonstrations".
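As a rough illustration of this prompting protocol, the hypothetical sketch below prefills the agent's memory with a demonstration trial before handing over control; the interface names are assumed for illustration and do not correspond to the actual AdA code.

```python
# Hypothetical sketch: trial 1 replays an expert demonstration into the
# agent's memory, subsequent trials are controlled by the agent itself.
# Names (policy.observe, policy.act, env.reset_trial, env.step) are assumptions.

def prompted_episode(policy, env, demo_actions, num_agent_trials, max_steps):
    """Prefill memory with a demonstration trial, then let the agent act."""
    memory = policy.initial_memory()

    # Trial 1: replay the expert's first-person demonstration so that its
    # observations and actions enter the agent's attention-based memory.
    obs = env.reset_trial()
    for action in demo_actions:
        memory = policy.observe(obs, action, memory)   # write demo step into memory
        obs, _, done = env.step(action)
        if done:
            break

    # Later trials: the agent takes over control, conditioned on the demo.
    returns = []
    for _ in range(num_agent_trials):
        obs = env.reset_trial()
        total_reward = 0.0
        for _ in range(max_steps):
            action, memory = policy.act(obs, memory)
            obs, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        returns.append(total_reward)
    return returns
```

Because the demonstration is consumed through the same memory the agent uses for its own experience, no fine-tuning or weight update is needed at prompting time.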

Additional Evaluation Tasks

The following is a selection of additional evaluation tasks not shown in the results reel. We could not upload every single task we created, so here we intentionally selected tasks that AdA cannot solve or on which it shows little adaptation. Please see the paper for detailed results and a comparison to human performance.

Single-Agent Tasks
(see table E.1 for descriptions and figure E.1 for agent and human results)

Multi-Agent Tasks
(see table E.2 for descriptions)