by Mrinal Kalakrishnan, Robot Learning Lead

Aug 15, 2022

TL;DR: We collaborated with Google Research to help develop PaLM-SayCan, an AI research breakthrough that unifies natural language understanding with the physical capabilities of robots. Although we’re currently only testing this technology in the lab, PaLM-SayCan shows it is possible for robots to fulfill complex, natural language instructions by combining the reasoning abilities of large-scale language models with learned robot skills.

Video: Towards robots that can understand us (1:50)

The limits of my language mean the limits of my world.

At Everyday Robots, we’re building a new type of robot: one that can learn by itself, to help anyone with (almost) anything. These helper robots will shift the paradigm in robotics — moving the field away from robots that need to be painstakingly coded to do a specific thing in a specific space, to robots that can learn to take on a multitude of tasks in everyday spaces.

Our work is driven by a desire to build robots that are genuinely useful to people. To do that, we need to develop experiences that feel natural, like a helper that you can ask to do useful tasks for you at work, and at home.

When it comes to robotics, usefulness is determined not just by what robots can do, but also by how we tell them to do it: for example, how we instruct a helper robot to clean up a spilled cup of coffee. That’s where natural language understanding comes into play.

Let’s start at the beginning: natural language understanding is the ability of a computer to read and understand words and take appropriate action, much as a person would. This may sound straightforward, but it’s actually very complex because, in the simplest terms, computers “speak” in code, while people use words. This tension is why, historically, chatbots haven’t been able to grasp tone, or why trolls can get around harmful content restrictions with simple misspellings.

Recent leaps in natural language understanding, in particular large-scale language models, are beginning to change that. Think of these models as computers that have read billions and billions of words in order to understand what we actually mean when we say things like “best restaurant in Dallas”, and then use that understanding to generate and organize text-based responses (e.g. sorting and filtering by price, location, or cuisine).

So far, the benefits of language models have primarily been restricted to our digital lives, like predicting the next sentence of an email or telling us the weather or the news with a simple “Hey Google”. But helper robots can change that, unlocking the benefits of computers that can understand us in our physical lives. 

Unlocking understanding, in partnership with Google Research

Over the past seven years, our collaboration with Google Research has resulted in great strides in robot learning. We’ve made advances in areas like reinforcement learning, imitation learning, and learning from simulation. All of our work so far has focused on training our robots to execute short skills, such as picking up objects or opening doors. Now, with Google Research’s latest breakthrough, we’re building on that foundation by augmenting these low-level skills with a language model.

It’s called PaLM-SayCan, and it unifies natural language understanding with the physical capabilities of robots. The result is a robot that can be tasked with a problem in the form of a long, abstract natural language instruction, and can execute lower-level skills to solve it.

This work is particularly exciting because it shows us that enhancing underlying language models can improve a robot’s overall performance. When comparing PaLM-SayCan to a less powerful baseline model, we saw a 14% improvement in a helper robot’s ability to map out a viable approach to a task and a 13% improvement in its ability to carry out the low-level skills the task requires. Most notably, with PaLM-SayCan, helper robots showed a 26% increase in their ability to plan long-horizon tasks that require eight (or more) steps.

Here’s how it works: 

Step one: The language model understands a task

The language model uses prompt engineering and a set of constrained responses to break a high-level task (like “bring me a snack”) into small, actionable steps. It scores every possible first step by how likely that step is to make progress towards the overall task. For example, if you’ve asked for a snack, high-scoring first steps could be things like “find a banana”, “find a candy bar”, or “find a watermelon”.
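
To make this concrete, here is a minimal sketch of the scoring idea (not the actual PaLM-SayCan code): each candidate skill is scored by how likely a language model finds it as a continuation of a prompt built from the instruction. The prompt format and the lm_log_likelihood helper are hypothetical stand-ins for a real language model query.

```python
from typing import Callable, Dict, List


def score_candidate_steps(
    instruction: str,
    candidate_skills: List[str],
    lm_log_likelihood: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score each candidate skill by how plausible the language model finds it
    as the robot's next step (higher is better)."""
    prompt = f"Human: {instruction}\nRobot: I will"
    return {
        skill: lm_log_likelihood(prompt, f" {skill}")
        for skill in candidate_skills
    }


# Example with the snack scenario from the text:
# score_candidate_steps(
#     "bring me a snack",
#     ["find a banana", "find a candy bar", "find a watermelon"],
#     lm_log_likelihood,  # hypothetical language model query
# )
```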

Step two: The hypotheses “meet” the real world

But those actions aren’t necessarily “correct”, as the language model doesn’t know the scene (e.g. what snacks are in the drawer) or the capabilities of the robot (e.g. the robot can pick up a banana, but not a large watermelon).

Step three: The helper robot reports real-world conditions

The helper robot is then queried on the feasibility of each of its learned low-level skills (e.g. pick up the banana, pick up the candy bar) in the current context (e.g. if there is no candy bar in the drawer, it cannot be picked up, but if there is a banana in the drawer, it can). This is called the affordance function.
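
For illustration, here is one shape an affordance function could take. In the real system these feasibility estimates come from the robot’s learned value functions; the hand-written check against a list of visible objects below is purely a stand-in to show the interface.

```python
from typing import Dict, List


def affordance_scores(
    candidate_skills: List[str],
    objects_in_scene: List[str],
) -> Dict[str, float]:
    """Return, for each skill, an estimated probability that the robot could
    actually complete it in the current context."""
    scores = {}
    for skill in candidate_skills:
        # e.g. "pick up the banana" is only feasible if a banana is present.
        target = skill.replace("pick up the ", "")
        scores[skill] = 1.0 if target in objects_in_scene else 0.0
    return scores


# affordance_scores(["pick up the banana", "pick up the candy bar"],
#                   objects_in_scene=["banana", "apple"])
# -> {"pick up the banana": 1.0, "pick up the candy bar": 0.0}
```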

Step four: The winner takes it all

Then, we have a winner. The helper robot performs the skill with the highest combined language model and affordance function score (i.e. pick up the banana).
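
In the SayCan formulation, the two scores are combined by multiplying the language model’s probability for a skill (is this step useful?) with the affordance probability (can the robot actually do it here?). A minimal sketch of that selection step, reusing the hypothetical scores from the earlier sketches:

```python
import math
from typing import Dict


def pick_best_skill(
    lm_log_likelihoods: Dict[str, float],
    affordances: Dict[str, float],
) -> str:
    """Pick the skill whose combined score (language model probability
    multiplied by affordance probability) is highest."""
    def combined(skill: str) -> float:
        return math.exp(lm_log_likelihoods[skill]) * affordances[skill]

    return max(lm_log_likelihoods, key=combined)
```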

Step five: Iterate, iterate, iterate.

This iterative process is then repeated until the high-level task is completed (i.e. you get your snack: a banana!).
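
Tying the sketches together, the overall loop might look something like the code below. The robot object, the affordance_fn callback, and the “done” terminating skill are assumptions for illustration; pick_best_skill is the selection sketch from step four.

```python
def run_task(instruction, candidate_skills, lm_log_likelihood, affordance_fn, robot):
    """Repeatedly choose and execute the most promising next skill until the
    task is judged complete."""
    steps_so_far = []
    while True:
        # Condition the language model on the instruction and the steps taken so far.
        prompt = f"Human: {instruction}\nRobot: " + ", ".join(steps_so_far)
        lm_scores = {
            skill: lm_log_likelihood(prompt, f" {skill}")
            for skill in candidate_skills
        }
        affordances = {skill: affordance_fn(skill) for skill in candidate_skills}
        best = pick_best_skill(lm_scores, affordances)
        if best == "done":           # a terminating "skill" signals completion
            return steps_so_far
        robot.execute(best)          # run the learned low-level skill
        steps_so_far.append(best)
```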

Towards a robot-enabled future

Although we are at the beginning of this journey (currently, PaLM-SayCan is only being applied to prototypes), our collaboration with Google Research has shown that it is possible to translate the ability of language models to understand, analyze and suggest solutions into real-world applications. 

PaLM-SayCan is one of the many ways we’re integrating cutting-edge machine learning into our work. And, although we’re a long way from robots that can translate any instruction into concrete tasks, the possibilities and promise of this experiment are very exciting indeed. 

Find out more: