Large datasets have been shown to facilitate robot intelligence. By collecting diverse datasets for tasks such as grasping and stacking, robots are able to learn from this data to grasp and stack challenging, novel objects they haven’t seen before.

Large-datasets facilitate robot intelligence by enabling robots to interact with challenging objects that they have not encountered before.

While these results are impressive, they are still limited in critical ways compared to human intelligence. Today, robot intelligence is narrow-minded - they usually only find one way to solve a problem. By contrast, humans are really good at reasoning about creative ways to solve a problem and physically manipulating objects to make it happen.

Robot intelligence is narrow-minded (left) while human intelligence allows for creative problem solving that is enabled by rich manipulation ability (right).

How can we help our robots cross this gap in problem solving ability? We assert that one way is to let our robots learn from data that captures human intelligence. In this blog post, we describe how we built a data collection platform that enables collecting datasets that captures human intelligence.

What kind of data captures human intelligence?

Diversity. The data should be diverse in the kinds of problem-solving strategies demonstrated. Consider the example below, where we would like to fit an item into a container. If the item is small, you could toss it in, and if it’s already near the container you could probably push it in. If it’s large, you would have to stuff it in. As humans, we have a good sense of when we should try these different approaches – robots should learn from all of these strategies – it might need any of them in a given situation.

Dexterity. The data should contain instances of dexterous manipulation so that the robot can learn fine-grained manipulation behaviors. We want our robots to understand how they can physically manipulate objects to achieve desired outcomes.

Large-Scale. Finally, there should be a large amount of data. This is important – we are very good at problem solving in countless situations, but robots aren’t able to do this yet. The more data we show them, the more likely that they’ll acquire this general problem-solving ability too.

Collecting data that captures human intelligence

There are several methods that have been used to collect robotic data in the past. Here, we evaluate the ability of each method to collect desirable data for generalization.

Prior data collection methodologies include autonomous data collection (left), human supervision with web interfaces (middle), and human teleoperation with motion interfaces (right).
  • Autonomous Data Collection: Many data collection mechanisms and algorithms such as Self-Supervised Learning12 and Deep Reinforcement Learning3 use random exploration to collect their data. While this allows the robot to autonomously collect data, the data is strongly correlated and lacks diverse problem-solving strategies. This is because data is collected purely at random at first, and over time, methods converge to specific solution strategies.

  • Human Supervision with Web Interfaces: By contrast, human supervision allows for direct specification of task solutions. Prior mechanisms4 have allowed humans to leverage graphical web interfaces to guide robots through tasks. While such data collection schemes allow for diverse data to be collected at scale through humans, the interfaces limit the dexterity of the robot motions that can be demonstrated. For example, in the middle video above, a user has specified a program for the robot to execute, and the robot takes care of picking up the cups using simple top-down grasps. The human does not have much of a say in how the task is done.

  • Human Teleoperation with Motion Interfaces: Others have developed motion interfaces to enable a direct one-to-one mapping between human motion and the end effector of the arm. One such example5 is a person using a Virtual Reality headset and controller to guide the arm through a pick-and-place task. By offering users full control over how the arm accomplishes the task, these interfaces allow for data that is both diverse and dexterous. However, they do not allow for large-scale data collection, since the special hardware needed to develop such interfaces is not widely available.

Comparison of data collection methodologies. RoboTurk is the only mechanism that is able to collect data that is diverse and dexterous at scale.

Our goal was to develop a data collection mechanism that captures human intelligence by collecting data that has diverse problem-solving strategies, dexterous object manipulation, and that could be collected at scale. To address this challenge, we developed RoboTurk.

RoboTurk

RoboTurk is a platform that allows remote users to teleoperate simulated and real robots in real-time with only a smartphone and a web browser. Our platform supports many simultaneous users, each controlling their own robot remotely. A new user can get started in less than 5 minutes - all they need to do is download our smartphone application and go to our website, and they are ready to start collecting data.

RoboTurk is a platform that allows remote users to teleoperate robots in real-time with only a smartphone and a web browser. The platform supports many simultaneous users, each controlling their own robot (left). New users can get started in less than 5 minutes by downloading our smartphone app and visiting our website (right).

Our platform enables people to control robots in real-time from anywhere - libraries, cafes, homes, and even the top of a mountain.

RoboTurk enables remote teleoperation and data collection from anywhere - even in the Alps!

User Interface to enable Dexterity

Users receive a video stream of the robot workspace in their web browser and use their phone to guide the robot through a task. The motion of the phone is coupled to the motion of the robot, allowing for natural and dexterous control of the arm.

Users receive a video stream of the robot workspace in their web browser and use their phone to guide the robot through a task. The motion of the phone is coupled to the motion of the robot, allowing for natural and dexterous control of the arm.

We conducted a user study and showed that our user interface compares favorably with virtual reality controllers, which use special external tracking for the controllers, and significantly outperforms other interfaces such as a keyboard and a 3D mouse. This demonstrates that our user interface is both natural for humans to efficiently complete tasks and scalable to ensure that anyone with a smartphone can participate in data collection.

User study to compare different interfaces for teleoperation. Our phone interface allows humans to complete tasks just as efficiently as Virtual Reality interfaces but without the need for special hardware.

Diversity through Worldwide Teleoperation

Enabling remote data collection with consumer-grade hardware allows many different people to easily provide data, naturally resulting in datasets that are diverse. To test the capability of RoboTurk to enable remote data collection, we tried controlling robot simulations hosted on servers in China from our lab in California, a distance of over 5900 miles! We found that is possible to collect quality demonstrations using RoboTurk regardless of the distance between user and server.

Comparing teleoperation efficiency from Stanford to Oregon versus from Stanford to China. Large distances do not impede the ability of operators to collect successful task demonstrations.

More recently, we tried teleoperating our physical robot arms located at Stanford from Macau. We found that our system provided real-time teleoperation of our robot arms even at a distance of over 11,000 km, all on a cellular network connection.

Real-time teleoperation of our Stanford robot arms from Macau, on a cellular network connection.

Large-Scale Data Collection

Our Pilot Dataset, which was collected in just 22 hours, has over 2000 task demonstrations.

RoboTurk enables collect large amounts of data in a matter of hours. In our first publication, we used RoboTurk to collect a Pilot Dataset consisting of over 2000 task demonstrations in just 22 hours of total system usage. We also leveraged the demonstrations for policy learning and showed that using more demonstrations enables higher quality policies to be learned.

The demonstrations we collected enable fast policy learning, with more data leading to higher quality policies (left). A policy trained using the data is able to efficiently complete the task (right).

In summary, RoboTurk is able to collect data that embodies human intelligence:

  • Diversity. RoboTurk can be used to collect diverse data by leveraging many simultaneous human users for data collection.

  • Dexterity. RoboTurk offers full 6-DoF control of the robot arm through a natural phone interface, allowing for dexterity in the data.

  • Large-Scale. RoboTurk allows for large-scale data collection by allowing people to collect data from anywhere using just a smartphone and web browser. Our pilot dataset was collected in just 22 hours of system operation.

RoboTurk enables diversity through many users (left), dexterity through fine-grained 6-DoF control (middle), and can be used to collect data at scale (right).

Collecting Data on Physical Robots

In our initial publication, we used RoboTurk to collect a large dataset using robot manipulation tasks developed using MuJoCo and robosuite. However, there are several interesting tasks that cannot be modeled in simulation, and we did not want to restrict ourselves to those that could. Thus, we extended RoboTurk to enable data collection with real robot arms, and used it to collect the largest robot manipulation dataset collected via teleoperation.

We collected data on three Sawyer robot arms - each of which had a front-facing webcam and a top-down Kinect depth camera mounted in the workspace of the robot arm.

The dataset consists of RGB images from a front-facing RGB camera (which is also the teleoperator video stream view) at 30Hz, RGB and Depth images from a top-down Kinectv2 sensor also at 30Hz, and robot sensor readings at 100Hz.

We collected our dataset using 54 different participants over the course of 1 week. Every user participated in a supervised hour of remote data collection, including a brief 5 minute tutorial at the beginning of the session. Afterwards, they were given the option to collect data without supervision for all subsequent collection. The users who participated in our data collection study collected the data from a variety of locations. All locations were remote - no data collection occurred in front of the actual robot arms.

Tasks

We designed three robotic manipulation tasks for data collection, shown above. These tasks were chosen with care in order to make sure that the data collected would be useful for robot generalization. Each task admits diverse solution strategies, which encouraged our diverse set of users to experiment with different solution strategies, requires dexterous manipulation to solve, and the robot needs to learn to generalize to several scenarios. We also note that the tasks would be incredibly difficult to simulate, making physical data collection necessary.

  • Object Search. The goal of this task is to search for a set of target objects within a cluttered bin and fit them into a specific box. There are three target object categories: plush animals, plastic water bottles, and paper napkins. A target category is randomly selected and relayed to the operator, who must use the robot arm to find all three objects corresponding to the target category and place each item into its corresponding hole. This task requires precise manipulation due to the bin containing many rigid and deformable objects in clutter, the need to search for hidden objects, and tight object placement.
In the Object Search task, the goal is to search for target objects (left) and fit them into a specific box (right).
  • Tower Creation. In this task, an assortment of cups and bowls are arranged on the table. The goal of the task is to create the tallest tower possible by stacking the cups and bowls on top of each other. This task requires physical reasoning: operators must use a geometric understanding of objects and dexterous placement to carefully craft their towers while maintaining tower stability.
In the Tower Creation task, the goal is to stack cups and bowls (left) to build the tallest tower possible (right).
  • Laundry Layout. This task starts with a hand towel, a pair of jeans, or a t-shirt placed on the table. The goal is to use the robot arm to straighten the item so that it lies flat on the table with no folds. On every task reset we randomly place the item into a new configuration. This task requires generalization over several different item configurations.
In the Laundry Layout task, the goal is to layout towels (left), jeans (middle), and t-shirts (right).

Data Collection

We collected over 111 hours of total robot manipulation data in just 1 week across 54 users on our 3 manipulation tasks, with over 2000 successful demonstrations in total. This makes our dataset 1-2 orders of magnitude larger than most other datasets in terms of interaction time. The number of task demonstrations in our dataset also compares favorably with the number of demonstrations in large datasets such as MIME6, but the tasks that we collected data on are more difficult to complete, as they take on the order of minutes to complete successfully, as opposed to seconds. Some other notable datasets collected by humans include DAML7, Deep Imitation5, and JIGSAWS8.

Our dataset is the largest robot manipulation dataset ever collected using teleoperation.

Here is an assortment of randomly sampled demonstrations from our dataset.

Platform Evaluation

Diverse Solution Strategies

On the Tower Creation task, our users surprised us by building intricate structures out of the simple sets of cups and bowls. We also saw a great deal of diversity in the towers that people chose to build.

Our users surprised us by building intricate structures out of the simple sets of cups and bowls.

Some notable emergent solution strategies that were observed include building an inverted cone and alternating cups and bowls for stability, as well as flipping over a bowl for the base of the tower and grouping 3 cups together to form a stable platform. In particular, we had no idea that it was even possible to control the robot to flip a bowl over - it truly speaks to the power of human creativity coupled with the dexterity that the interface enables.

Notable strategies included building an inverted cone (left) and alternating cups and bowls for stability (right).

Notable strategies included flipping over a bowl for the base of the tower (left) and grouping 3 cups together to form a stable platform (right).

The users themselves were diverse - their skill levels varied significantly. This can be seen from the large variation in average task completion time per user on the Object Search and Laundry Layout tasks in the plot below. User variation naturally emerges from collecting across 54 different people and ensures data diversity. Note that most users were determined to use all 5 of their allotted minutes for the Tower Creation task.

Average task completion times per user, sorted from fastest to slowest. Users exhibit large variation in skill level, ensuring data diversity.

Diverse and Dexterous Manipulation

Next, we present some qualitative examples of diverse and dexterous behaviors in the Object Search task.

In the examples below, the operators used three different strategies to manipulate the plastic water bottle into a favorable place in order to grasp it successfully:

  • move to grasp (left): the operator moves the bottle into a convenient position to grasp it
  • flip to grasp (middle): the operator flips the water bottle to orient it for a grasp
  • approach from angle (right): the operator angles the arm underneath the bottle and the cloth in order to grasp the bottle successfully
The operators carefully manipulated objects in order to grasp them successfully.

In the examples below, the operators decided to extract items from the clutter in order to successfully grasp them.

The operators extracted items from the clutter in order to successfully grasp them.

The examples below show three different strategies we observed for placing target objects into the correct container:

  • clever grasp (left): by using a strategic grasp, the operator is able to simply drop the bottle into the container
  • stuff (middle): the operator stuffs the napkin into the container
  • strategic object use (right): the operator uses one object to poke the other object into the container.
The operators used different strategies to fit items into the containers.

Scaling to New Users

All 54 of our users were new, non-expert users. We found that users with no experience started generating useful data in a matter of minutes.

  • On Object Search, new users were able to successfully pick and place a target object for the first time within 2 minutes of interaction time on average.
  • On Laundry Layout, new users were able to successfully layout their first towel in less than 4 minutes of interaction on average.

This corroborates the results of our user exit survey - a majority (60.8%) of users reported that they felt comfortable using the system within 15 minutes, while 96% felt comfortable within an hour.

Furthermore, we witnessed significant user improvement over time. As shown below, users learned to complete the task more efficiently over time as they collected more demonstrations. Furthermore, users moved the orientation of the phone more with increasing experience, suggesting that they learned to leverage full 6-DoF control to generate dexterous task solutions of increasing quality.

Users improved significantly over time. They completed tasks faster and controlled the phone orientation more, allowing them to take advantage of full 6-DoF control to generate better task solutions.

Leveraging the Dataset

We provide some examples applications for our dataset. However, we emphasize that our dataset can be useful for several other applications as well, such as multimodal density estimation, policy learning, and hierarchical task planning.

Reward Learning

Consider the problem of learning a policy to imitate a specific video demonstration. Prior work has approached this problem by learning an embedding space over visual observations and then crafting a reward function to imitate a reference trajectory based on distances in the embedding space. This reward function can then be used with reinforcement learning to learn a policy that imitates the trajectory. Taking inspiration from this approach, we trained a modified version of Time Contrastive Networks (TCN)9 on Laundry Layout demonstrations and investigate some interesting properties of the embedding space.

Learned embedding distances to a desired target frame provide a meaningful reward function for imitation learning as well as a useful metric for task progress.

In the figure above, we consider the frame embeddings along a single Laundry Layout demonstration. We plot the negative L2 distance of the frame embeddings with respect to the embedding of a target frame near the end of the video, where the target frame depicts a successful task completion with the towel lying flat on the table. The figure demonstrates that distances in this embedding space with a suitable target frame yield a reasonable reward function that could be used to imitate task demonstrations purely from visual observations.

Furthermore, embedding distances capture task semantics to a certain degree and could even be used to measure task progress. For example, in frames 3 and 5, the towel is nearly flat on the table, and the embedding distance to frame 6 is correspondingly small. By contrast, in frames 2 and 4, the robot is holding the towel a significant distance away from the table, and the distance to frame 6 is correspondingly large.

Here is a video that shows how the reward function varies along this demonstration.

The learned reward function decreases when the towel moves away from the table and increases when the towel returns to the table. The reward steadlily increases as the towel becomes more flat, and the task comes closer to completion.

Behavioral Cloning

To demonstrate that the data collected by our platform can be used for policy learning, we leveraged a subset of data to train a policy on some Laundry Layout task instances using behavioral cloning. The trained policy is shown below.

This policy trained with behavioral cloning is able to solve some Laundry Layout task instances.

Download our datasets!

Our simulation dataset is available on our website and our real robot dataset will be available shortly!

Summary

  • RoboTurk is a platform to collect datasets that embody human intelligence. The data contains diverse problem-solving strategies and dexterous object manipulation, and is large-scale.

  • We introduce three challenging manipulation tasks: Object Search, Tower Creation, and Laundry Layout. These tasks admit diverse solutions and strategies and require dexterous manipulation to solve. Significant generalization capability is also required for robots to solve these tasks due to the large variation in task instance.

  • We present the largest known human teleoperated robot manipulation dataset consisting of over 111 hours of data across 54 users. The dataset was collected in 1 week on 3 Sawyer robot arms using the RoboTurk platform.

  • We evalaute our platform and show that the data collected consists of diverse and dexterous task solutions, and that first-time users start generating useful data in minutes and improve significantly over time.

  • The dataset has several applications such as multimodal density estimation, video prediction, reward function learning, policy learning and hierarchical task planning, and more.

This blog post is based on the following papers:

  1. Levine, S., Pastor, P., Krizhevsky, A., & Quillen, D. (2016, October). Learning hand-eye coordination for robotic grasping with large-scale data collection. In International Symposium on Experimental Robotics (pp. 173-184). Springer, Cham. 

  2. Dasari, S., Ebert, F., Tian, S., Nair, S., Bucher, B., Schmeckpeper, K., … & Finn, C. (2019). RoboNet: Large-Scale Multi-Robot Learning. arXiv preprint arXiv:1910.11215. 

  3. Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., & Levine, S. (2018, May). Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 6284-6291). IEEE. 

  4. Alexandrova, S., Tatlock, Z., & Cakmak, M. (2015, May). RoboFlow: A flow-based visual programming language for mobile manipulation tasks. In 2015 IEEE International Conference on Robotics and Automation (ICRA) (pp. 5537-5544). IEEE. 

  5. Zhang, T., McCarthy, Z., Jow, O., Lee, D., Chen, X., Goldberg, K., & Abbeel, P. (2018, May). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1-8). IEEE.  2

  6. Sharma, P., Mohan, L., Pinto, L., & Gupta, A. (2018). Multiple interactions made easy (mime): Large scale demonstrations data for imitation. arXiv preprint arXiv:1810.07121. 

  7. Yu, T., Finn, C., Xie, A., Dasari, S., Zhang, T., Abbeel, P., & Levine, S. (2018). One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557. 

  8. Yixin Gao, S. Swaroop Vedula, Carol E. Reiley, Narges Ahmidi, Balakrishnan Varadarajan, Henry C. Lin, Lingling Tao, Luca Zappella, Benjam ́ın B ́ejar, David D. Yuh, Chi Chiung Grace Chen, Ren ́e Vidal, Sanjeev Khudanpur and Gregory D. Hager, The JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling, In Modeling and Monitoring of Computer Assisted Interventions (M2CAI) – MICCAI Workshop, 2014. 

  9. Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., … & Brain, G. (2018, May). Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1134-1141). IEEE.