i-Sim2Real: Reinforcement Learning of Robotic Policies in Tight Human-Robot Interaction Loops

Saminda Abeyruwan*, Laura Graesser*, David B. D’Ambrosio, Avi Singh, Anish Shankar, Alex Bewley, Deepali Jain, Krzysztof Choromanski, Pannag R. Sanketi

* indicates equal contribution

Robotics at Google

arXiv | video

High-Speed, Dynamic Table Tennis with Deep Reinforcement Learning!

Rallies of up to 340 hits!

After some post-submission bug fixes to our control code, we achieved a rally of 340 hits and several rallies of over 200 hits (compared to a maximum of 150 hits in the paper evaluations). Here is one such long rally, in which a player rallies with the robot for over four minutes without any interruptions.


Abstract

Sim-to-real transfer is a powerful paradigm for robotic reinforcement learning. The ability to train policies in simulation enables safe exploration and large-scale data collection quickly and at low cost. However, prior work on sim-to-real transfer of robotic policies typically does not involve any human-robot interaction, because accurately simulating human behavior is an open problem. In this work, our goal is to leverage the power of simulation to train robotic policies that are proficient at interacting with humans upon deployment. But there is a chicken-and-egg problem: how do we gather examples of a human interacting with a physical robot, so as to model human behavior in simulation, without already having a robot that is able to interact with a human? Our proposed method, Iterative-Sim-to-Real (i-S2R), attempts to address this. i-S2R bootstraps from a simple model of human behavior and alternates between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined. For all training we apply a new evolutionary search algorithm called Blackbox Gradient Sensing (BGS). We evaluate our method in a real-world robotic table tennis setting, where the objective for the robot is to play cooperatively with a human player for as long as possible. Table tennis is a high-speed, dynamic task that requires the two players to react quickly to each other's moves, making it a challenging test bed for research on human-robot interaction. We present results on an industrial robotic arm that is able to cooperatively play table tennis with human players, achieving rallies of 22 successive hits on average and 150 at best. Further, for 80% of players, rally lengths are 70% to 175% longer than with the sim-to-real plus fine-tuning (S2R+FT) baseline.
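All policies in this work are trained with Blackbox Gradient Sensing (BGS), an evolutionary search method. As a rough illustration of this family of methods, the sketch below implements a generic antithetic-sampling evolutionary search update; it is a simplified stand-in rather than the paper's BGS implementation, and the `return_fn` interface and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def es_step(params, return_fn, sigma=0.05, lr=0.01, num_directions=16):
    """One antithetic evolutionary-search update (a simplified stand-in for BGS).

    `return_fn(params)` is assumed to run one or more episodes with the given
    policy parameters and return the total reward. Hyperparameters are illustrative.
    """
    grad = np.zeros_like(params)
    for _ in range(num_directions):
        eps = np.random.randn(*params.shape)
        # Evaluate the perturbed policy in both directions (antithetic sampling).
        r_plus = return_fn(params + sigma * eps)
        r_minus = return_fn(params - sigma * eps)
        grad += (r_plus - r_minus) * eps
    grad /= 2.0 * sigma * num_directions
    return params + lr * grad  # ascend the estimated reward gradient
```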

Method

Step 1: Static data collection.

During initial data collection, a human throws balls at an unresponsive robot. 
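One simple way to picture the bootstrapped human behavior model is as a distribution fit to the recorded launch states of these static throws. The sketch below is a toy illustration under that assumption; the class and field names are hypothetical and do not reflect the paper's actual model.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BallLaunch:
    # Illustrative ball state as it leaves the human's side of the table.
    position: np.ndarray  # (x, y, z) in meters
    velocity: np.ndarray  # (vx, vy, vz) in m/s

class HumanBehaviorModel:
    """Toy behavior model: a Gaussian fit to recorded ball launch states."""

    def fit(self, launches):
        states = np.stack(
            [np.concatenate([l.position, l.velocity]) for l in launches])
        self.mean = states.mean(axis=0)
        self.cov = np.cov(states, rowvar=False)

    def sample(self):
        # Draw a launch state to start a simulated rally.
        s = np.random.multivariate_normal(self.mean, self.cov)
        return BallLaunch(position=s[:3], velocity=s[3:])
```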

Step 2: Training in simulation with the bootstrapped model and deployment in the real world.

A policy is trained in simulation via RL, using a human behavior model fit to the initial ball throws. Upon deployment, however, performance is not particularly strong, because of the sim-to-real gap.
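In simulation, each training episode can be seeded with a ball launch drawn from the behavior model, and the resulting episode return can be plugged into the evolutionary-search update sketched above. The environment interface (`reset(ball_launch=...)`, `step` returning observation, reward, and done) and the `policy` function are assumptions for illustration.

```python
def simulated_return(params, behavior_model, sim_env):
    """Hypothetical simulated episode: the behavior model supplies the incoming
    ball, and the reward encourages keeping the rally going."""
    launch = behavior_model.sample()
    obs = sim_env.reset(ball_launch=launch)
    total_reward, done = 0.0, False
    while not done:
        action = policy(params, obs)  # `policy` (e.g. a small MLP) is assumed
        obs, reward, done = sim_env.step(action)
        total_reward += reward
    return total_reward

# Example of one simulated training update with the es_step sketch above:
# params = es_step(params, lambda p: simulated_return(p, behavior_model, sim_env))
```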

Step 3: Fine-tuning and interactive data collection. 

Fine-tuning in the real world with reinforcement learning improves performance. However, the robot still struggles to carry out a long rally, as it was trained only on static ball throws, not interactive data. We collect interactive data during this stage to update our human behavior models for further training in sim.
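Schematically, each real-world rollout can serve double duty: it provides the return used for fine-tuning, and it yields the ball launches the human actually produced during live play, which are logged for the next behavior-model update. The real-robot environment interface below, including `observed_ball_launches`, is a hypothetical placeholder.

```python
def real_return(params, real_env, interaction_log):
    """Hypothetical real-robot episode that doubles as interactive data collection."""
    obs = real_env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(params, obs)  # same assumed policy function as in sim
        obs, reward, done = real_env.step(action)
        total_reward += reward
    # Log the ball launches the human produced during this rally so the
    # behavior model can later be refit on interactive data.
    interaction_log.extend(real_env.observed_ball_launches())
    return total_reward
```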

Step 4: Re-training and further data collection. This step is repeated until the human behavior model and the robot policy converge.

After a few rounds of updating the human behavior model, re-training in simulation, and fine-tuning in the real world, the robot is able to hold long rallies with the human player.
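Putting the pieces together, the outer loop alternates between re-training in simulation against the current behavior model, fine-tuning on the real robot while logging interactive data, and refitting the behavior model. The sketch below strings together the illustrative components above; iteration and update counts are made up and not the paper's training schedule.

```python
def i_s2r(initial_throws, sim_env, real_env, param_dim,
          num_iterations=4, sim_updates=1000, real_updates=50):
    """Sketch of the i-S2R outer loop; all counts are illustrative."""
    behavior_model = HumanBehaviorModel()
    behavior_model.fit(initial_throws)  # Step 1: bootstrap from static throws
    params = np.zeros(param_dim)
    all_launches = list(initial_throws)

    for _ in range(num_iterations):
        # Steps 2 and 4: (re-)train in simulation against the current behavior model.
        for _ in range(sim_updates):
            params = es_step(
                params, lambda p: simulated_return(p, behavior_model, sim_env))

        # Step 3: fine-tune on the real robot while collecting interactive data.
        interaction_log = []
        for _ in range(real_updates):
            params = es_step(
                params, lambda p: real_return(p, real_env, interaction_log))

        # Refit the human behavior model on static plus interactive data.
        all_launches.extend(interaction_log)
        behavior_model.fit(all_launches)

    return params, behavior_model
```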

i-S2R in Action

A 147-hit rally obtained with i-S2R. 

An 82-hit rally obtained with i-S2R with a different player.

Quantitative Results

(Left) Aggregated across all players, i-S2R rally length is about 9% higher than S2R+FT. Note, however, that simple aggregation puts extra weight on higher-skilled players, who are able to hold longer rallies. (Center) The normalized rally length distribution (see the paper for normalization details) shows a larger improvement of i-S2R over S2R+FT in the mean, median, and 25th and 75th percentiles. (Right) The histogram of rally lengths for i-S2R and S2R+FT (250 rallies per model) shows that a large fraction of S2R+FT rallies are short (fewer than five hits), while i-S2R achieves longer rallies more frequently. Boxplot details: the white circle is the mean, the horizontal line is the median, and the box bounds are the 25th and 75th percentiles.

When broken down by player skill, i-S2R achieves significantly longer rallies than S2R+FT and is comparable to S2R-Oracle+FT for beginner and intermediate players; the advanced player is an exception.

Generalization to New Players

Our learned policy generalizes to new players: the player evaluating the policy in this video was not involved in training it.

i-S2R retains about 70% of its original performance when evaluated on a new player. On the other hand, the S2R+FT baseline achieves only about 30% of its original performance. 

Summary Video