This AI Could Go From ‘Art’ to Steering a Self-Driving Car

DALL-E drew laughs for creating images of a daikon radish in a tutu. But it builds on an important advance in computer vision with serious applications.
DALL-E generated these images for the prompt “a knight made of spaghetti.” Courtesy of OpenAI

You’ve probably never wondered what a knight made of spaghetti would look like, but here’s the answer anyway—courtesy of a clever new artificial intelligence program from OpenAI, a company in San Francisco.

The program, DALL-E, released earlier this month, can concoct images of all sorts of weird things that don’t exist, like avocado armchairs, robot giraffes, or radishes wearing tutus. OpenAI generated several images, including the spaghetti knight, at WIRED’s request.

DALL-E is a version of GPT-3, an AI model trained on text scraped from the web that is capable of producing surprisingly coherent text. Where GPT-3 learned from text alone, DALL-E was trained on images paired with written descriptions; given a new prompt, it generates a passable mashup image.

Images created by DALL-E in response to “an illustration of a baby daikon radish in a tutu walking a dog.” Courtesy of OpenAI

Pranksters were quick to see the funny side of DALL-E, noting for instance that it can imagine new kinds of British food. But DALL-E is built on an important advance in AI-powered computer vision, one that could have serious, and practical, applications.

Called CLIP, it consists of a vast artificial neural network—an algorithm inspired by the way the brain learns—fed hundreds of millions of images and accompanying text captions from the web and trained to predict which caption belongs with a given image.
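OpenAI’s paper describes that training objective as a contrastive one: within a batch of image-caption pairs, the network learns to pull matching pairs together and push mismatched pairs apart. Here is a minimal sketch of that idea in PyTorch, assuming the image and text encoders already produce embedding vectors; the function and variable names are illustrative, not OpenAI’s actual training code:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    A sketch of the objective described in the CLIP paper, not OpenAI's
    training code; `temperature` is a typical illustrative value.
    """
    # Normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity of every image in the batch to every caption.
    logits = image_features @ text_features.t() / temperature

    # The true image-caption pairs sit on the diagonal.
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Images must pick out their caption, and captions their image.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```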

Researchers at OpenAI found that CLIP could recognize objects as accurately as algorithms trained in the usual way—using curated data sets where images are neatly matched to labels.

As a result, CLIP can recognize more things, and it can grasp what certain things look like without needing copious examples. CLIP helped DALL-E produce its artwork, automatically selecting the best images from the ones it generated. OpenAI has released a paper describing how CLIP works as well as a small version of the resulting program. It has yet to release a paper or any code for DALL-E.
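OpenAI has not published DALL-E’s reranking code, but the released CLIP package can reproduce the basic trick: score a set of candidate images against a text prompt and keep the best matches. A rough sketch, with hypothetical file names standing in for DALL-E’s generations:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical candidate files; for DALL-E these would be its own generations.
prompt = "a knight made of spaghetti"
candidates = ["gen_0.png", "gen_1.png", "gen_2.png"]

images = torch.stack([preprocess(Image.open(p)) for p in candidates]).to(device)
text = clip.tokenize([prompt]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)

# Cosine similarity between the prompt and each candidate image.
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.t()).squeeze(1)

# Keep the images CLIP judges to match the prompt best.
for score, path in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.3f}  {path}")
```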


Both DALL-E and CLIP are “super impressive,” says Karthik Narasimhan, an assistant professor at Princeton specializing in natural language processing. He says CLIP builds upon previous work that has sought to train large AI models using images and text simultaneously, but does so at an unprecedented scale. “CLIP is a large-scale demonstration of being able to use more natural forms of supervision—the way that we talk about things,” he says.

He says CLIP could be commercially useful in many ways, from improving the image recognition used in web search and video analytics, to making robots or autonomous vehicles smarter. CLIP could be used as the starting point for an algorithm that lets robots learn from images and text, such as instruction manuals, he says. Or it could help a self-driving car recognize pedestrians or trees in an unfamiliar setting.

Vladimir Haltakov, an engineer working on autonomous driving at BMW, has been playing with the smaller version of CLIP for some time. The company has collected images from millions of kilometers of autonomous driving, he says, but it’s sometimes difficult to find a particular image that could help in training. He says the algorithm could help him search through the data using a text prompt. “Being able to describe what you are looking for can be very helpful during development,” he says.
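Haltakov did not share his code, but the general approach is straightforward with the small CLIP model OpenAI released: embed the image collection once, then rank it against any text query. A sketch, with hypothetical file names and a hypothetical query; a real fleet archive would be indexed in batches:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def index_images(paths):
    """Embed a collection of images once so text queries are cheap."""
    batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    with torch.no_grad():
        features = model.encode_image(batch)
    return features / features.norm(dim=-1, keepdim=True)

def search(query, paths, index, k=5):
    """Return the k images whose CLIP embedding best matches the text."""
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        text_features = model.encode_text(tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (index @ text_features.t()).squeeze(1)
    best = scores.topk(min(k, len(paths)))
    return [(paths[i], s.item()) for s, i in zip(best.values, best.indices)]

# Hypothetical dashcam frames standing in for a real driving archive.
paths = ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"]
index = index_images(paths)
print(search("a pedestrian crossing at dusk", paths, index))
```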

Some AI programmers and hackers have begun experimenting with CLIP using the code released by OpenAI. Justin Pinkney, a deep-learning consultant and the creator of Toonify, an app that uses AI to convert photos of people into cartoon caricatures, calls the program “very impressive” and “extremely versatile.” He says CLIP could prove useful for building a data set of images for a specific task, and he wants to see whether it can help guide AI systems that generate imagery. “It’s pretty astonishing that it seems to have learned things like what celebrities look like, what characterizes different styles of painting and artists,” he says.

DALL-E’s answer to “a photo of food of the United Kingdom.” Courtesy of OpenAI

Travis Hoppe, a scientist interested in the intersection of AI and art, used CLIP to build a tool that finds images to accompany a piece of poetry using the image site Unsplash. He says he wishes OpenAI would also release code for DALL-E, but he adds, “I have a feeling they won’t.”

Ilya Sutskever, chief scientist at OpenAI, says there may be commercial applications, but the company is currently focused on research. OpenAI has not decided whether it will release the full version of either program.

Andrei Barbu, a research scientist at MIT’s Center for Brains, Minds, and Machines who studies computer vision and AI, thinks CLIP may prove useful in commercial settings. He says it would be especially useful for cases where it is impractical to create lots of labeled images for training.

Barbu is also frustrated that OpenAI has not yet released the full version of CLIP, or any of the code for DALL-E—continuing a trend among some of the more prominent commercial AI labs. “It’s a little awkward from the point of view of researchers,” Barbu says. “A lot of these amazing things come out, but none of us can actually do anything with them, none of us can build anything on top of them, nor can we even reproduce them.”

