Facebook's AI Can Caption Photos for the Blind on Its Own

Through the power of "deep learning," Facebook is figuring out how to make the social network accessible to nearly everyone.
Facebook's accessibility team, from left: Matt King, Jeff Wieland, and Shaomei Wu.
Facebook

Matt King is blind, so he can't see the photo. And though it was posted to his Facebook feed with a rather lengthy caption, that's no help. Thanks to text-to-speech software, his laptop reads the caption aloud, but it's in German. And King doesn't understand German.

But then he runs an artificial intelligence tool under development at Facebook, and after analyzing the photo, the tool goes a long way towards describing it. The scene is outdoors, the AI says. It includes grass and trees and clouds. It's near some water. King can't completely imagine the photo—a shot of a friend with a bicycle during a ride through European countryside—but he has a decent idea of what it looks like.

"My dream is that it would also tell me that it includes Christoph with his bike," King says. "But from my perspective as a blind user, going from essentially zero percent satisfaction from a photo to somewhere in the neighborhood of half ... is a huge jump."

The 49-year-old King is part of the Facebook Accessibility Team. This means he works to hone the world's most popular social network so that it can properly serve people with disabilities, including people who are deaf, people without full use of their hands, and, yes, people who are blind, like King himself. Though that AI tool is merely a prototype, Facebook plans to eventually share it with the world at large. And that's no small thing. About 50,000 people actively use the social network through Apple VoiceOver, a popular text-to-speech system, and the overall population of blind Facebookers is undoubtedly much larger.

Like other social networks, Facebook is an extremely visual medium. But with help from a tool like Apple VoiceOver, someone like King, who lost the last of his sight in college, can connect with friends and colleagues over Facebook much like anyone else can. As Jessie Lorenz, the executive director of the nonprofit Independent Living Resource Center, told WIRED earlier this year: “I can ask other parents about a playdate or a repair man or a babysitter, just like anyone else would. Blindness becomes irrelevant in situations like that.”

King tunes his text-to-speech tool to read Facebook posts at a rapid-fire pace—so fast that no one else in the room can understand it. That means he can browse his News Feed as quickly as the typical Facebooker. And in some cases, even without Facebook's experimental AI system, he can start to understand what's in a photo. Some photos include decent captions, and others offer meta-data describing who took them and when. But the AI system, bootstrapped with help from an accessibility researcher named Shaomei Wu and various Facebook AI engineers, pushes things significantly further. It can provide context using nothing but the photo itself.

"The team started with trying to make sure that all the products that [Facebook] builds are usable by people with disabilities," says Jeff Wieland, the founder and head of Facebook's accessibility team. "Long-term, we really want to get to the point where we're building innovative technologies for people with disabilities."

'That's Really Where We Want to Go'

Facebook's photo-reading system is based on what's called deep learning, a technique the company has long used to identify faces and objects in photos posted to its social network. Using vast neural networks—interconnected machines that approximate the web of neurons in the human brain—the company can teach its services to identify photos by analyzing enormous numbers of similar images. To identify your face, for instance, it feeds all known pictures of you into the neural network, and over time, the system develops a pretty good idea of what you look like. This is how Facebook seems to recognize you and your friends when you upload a photo and start adding tags.

Google uses similar neural networks to help you locate photos inside its new Google Photos app, and the same basic technology can drive all sorts of other online tasks, from speech recognition to language translation. It's only natural that Facebook would use this technology to describe photos for the blind—though the technology is far from perfect.

"For object recognition and face recognition, we've basically reached human performance," says Yoshua Bengio, a professor at the University of Montreal and one of the founding fathers of deep learning. "But there are still problems involving complex images, lighting, understanding the whole scene, and so on."

At the moment, Facebook's system merely provides a basic description of each photo. It can identify certain objects. It can tell you whether the photo was taken indoors or outdoors. It can say whether the people in the photo are smiling. But as King explains, this kind of thing can be quite useful. It's particularly useful when friends and family upload new profile pics, which typically arrive without a caption.

That said, there's ample room to improve the system. Deep learning neural nets are also pretty good at grasping natural language—the way humans naturally speak—and companies such as Google and Microsoft have published research papers showing how these neural nets can be used to automatically generate more complete photo captions—captions that describe the scene in full. This would be the next logical step for Facebook. "We're returning a list. We're not returning a story," Wieland says. "But that's really where we want to go."

The Entire Internet

The work is part of a broader effort to bring Facebook to people with disabilities. The Accessibility Team, which Wieland founded after working at the User Experience Lab that tracks how Facebook is used across the 'net, also facilitates closed captioning for the deaf. It promotes the use of mouth-controlled joysticks and other tools for those who can't use their hands. And it works to ensure that the social network can be used in the developing world, where Internet connections are slower and less reliable than those in the States.

At the same time, Wieland's team is hoping to push other companies in similar directions. In recent months, it helped found the Teaching Accessibility Initiative, a consortium of tech companies, including Yahoo and Microsoft, that aims to share practices in this area. And it's working to modify React, Facebook's open source app development tool, for use with text-to-speech readers and other software that aids people with disabilities. Because it's open source, anyone can use React, and according to data from GitHub, it has become an extremely popular means of building new apps. "It's one way we can make the entire Internet accessible," Wieland says.

The possibilities within and beyond the company are enormous. As King notes, deep learning can be applied to speech recognition as well as image recognition, to moving images as well as stills. "AI is applicable to all those situations," he says. "And it's applicable to everyone."