This Technique Can Make It Easier for AI to Understand Videos

A staggering amount of video is shared online. Researchers are teaching artificial intelligence to process more—while using less power.
Illustration: Ariel Davis

Between dubious viral memes, gaffe-prone presidential debates, and surreal TikTok remixes, you could spend the rest of your life trying to watch all the video footage posted on YouTube in a single day. Researchers want to let artificial intelligence algorithms watch and make sense of it instead.

A group from MIT and IBM developed an algorithm capable of accurately recognizing actions in videos while consuming a small fraction of the processing power previously required, potentially changing the economics of applying AI to large amounts of video. The method adapts an AI approach used to process still images to give it a crude concept of passing time.

The work is a step towards having AI recognize what’s happening in video, perhaps helping to tame the vast amounts now being generated. On YouTube alone, over 500 hours of video were uploaded every minute during May 2019.

Companies would like to use AI to automatically generate detailed descriptions of videos, letting users discover clips that haven’t been annotated. And, of course, they would love to sell ads based on what’s happening in a video, perhaps showing pitches for tennis lessons as soon as someone starts live-streaming a match. Facebook and Google also hope to use AI to automatically spot and filter illegal or malicious content, although this may prove an ongoing game of cat and mouse. It will be a challenge to do all this without significantly increasing the carbon footprint of AI.

Tech companies like to flaunt their use of AI, but it’s still not used much to analyze video. YouTube, Facebook, and TikTok use machine learning algorithms to sort and recommend clips, but they appear to rely primarily on the metadata associated with a video, such as the description, tags, and when and where it was uploaded. All are working on methods that analyze the contents of videos, but these approaches require a lot more computer power.

“Video understanding is so important,” says Song Han, an assistant professor at MIT who led the new work. “But the amount of computation is prohibitive.”

The energy consumed by AI algorithms is rising at an alarming rate, too. By some estimates, the amount of computer power used in cutting-edge AI experiments doubles about every three and a half months. In July, researchers at the Allen Institute for Artificial Intelligence called on the field to publish details of the energy efficiency of its algorithms, to help address this looming environmental problem.

This could be especially important as companies tap AI to analyze video. There have been big advances in image recognition in recent years, largely thanks to deep learning, a statistical technique for extracting meaning from complex data. Deep learning algorithms can identify the objects shown in an image from its raw pixels.

But deep learning is less adept at interpreting video. Analyzing a single frame won’t reveal what’s happening unless it is compared with the frames that come before and after—a person holding a door may be opening it or closing it, for example. And while Facebook researchers developed a version of deep learning in 2015 that incorporates changes over time, the approach is relatively unwieldy.

By Han’s estimates, training a deep learning algorithm to interpret video can take 50 times as much data and eight times as much processing power as training one to interpret a still image.

Together with two colleagues, Han developed a solution, dubbed the Temporal Shift Module. Conventional deep learning algorithms for video recognition perform a 3D operation (known as a convolution) on multiple video frames at once. Han’s approach instead uses a more efficient 2D algorithm of the kind commonly applied to still images. The Temporal Shift Module captures the relationship between the pixels in one frame and those in the next without performing the full 3D operation: as the 2D algorithm processes each frame in turn, it folds in a small amount of information from the adjacent frames, giving it a sense of things unfolding over time and allowing it to detect the actions shown.
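
To make the idea concrete, here is a minimal sketch in Python of a channel-shifting step of this kind: a slice of each frame’s internal features is pulled from the next frame, another slice from the previous frame, and the rest stay put, so an ordinary 2D convolution applied afterward sees a little of what happens before and after. The function name, tensor layout, and shift fraction below are illustrative assumptions, not the team’s released implementation.

    import numpy as np

    def temporal_shift(frames, shift_fraction=0.125):
        """Shift a fraction of each frame's feature channels to its temporal neighbors."""
        T, C, H, W = frames.shape               # frames laid out as (time, channels, height, width)
        fold = int(C * shift_fraction)          # number of channels to shift in each direction
        out = np.zeros_like(frames)             # positions at the sequence edges stay zero-filled

        out[:-1, :fold] = frames[1:, :fold]                   # pull these channels from the next frame
        out[1:, fold:2 * fold] = frames[:-1, fold:2 * fold]   # pull these channels from the previous frame
        out[:, 2 * fold:] = frames[:, 2 * fold:]              # leave the remaining channels unchanged
        return out

    # Example: 8 frames, 64 feature channels, 56 x 56 feature maps
    features = np.random.rand(8, 64, 56, 56).astype(np.float32)
    shifted = temporal_shift(features)          # same shape, ready for a standard 2D convolution

Because moving slices of data between neighboring frames requires no arithmetic of its own, this temporal mixing adds almost nothing to the cost of the 2D convolutions that follow.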

The trio tested their algorithm on a video dataset known as Something-Something, which was created by paying thousands of people to perform simple tasks, from pouring tea to opening jars.

When compared with other algorithms using a benchmark provided with the dataset, the algorithm claimed first place, in terms of accuracy, on the first version of the dataset and second place on the second version. In both cases, it also required just a fraction (between one-ninth and one-quarter) of the computer power of other approaches.

The researchers created a demo version of their algorithm that runs on a single GPU chip, allowing a computer to interpret actions through a webcam in real time. They also tested a version of their algorithm on a cluster of 1,536 GPU chips within Summit, a supercomputer operated by Oak Ridge National Lab. The new algorithm, combined with the machine’s massive hardware, reduced the time needed to learn to recognize activities in a video dataset called Kinetics by over 99.5 percent, from 49 hours and 55 minutes to just 14 minutes and 13 seconds.
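
A quick check of that figure, using only the numbers quoted above:

    # Sanity check of the reported speedup, based on the figures in the text
    baseline_minutes = 49 * 60 + 55     # 49 hours 55 minutes of training time
    summit_minutes = 14 + 13 / 60       # 14 minutes 13 seconds on Summit
    reduction = 1 - summit_minutes / baseline_minutes
    print(f"{reduction:.1%}")           # prints 99.5%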

Some tech companies have taken notice of Han’s work, which was first posted online late last year. He says China’s Baidu has already incorporated the code into its deep learning framework, known as PaddlePaddle. Juan Carlos Niebles, a professor at Stanford who specializes in computer vision, says the MIT work is part of a broader push to help AI gain better video understanding. “Efficient processing for video is of critical importance,” Niebles says. “[It will] be very useful for many applications, including entertainment, sports, robotics, remote visual sensing and so on—where processing time is important or the amount of video being generated is massive.” A spokesperson for Facebook said its researchers aren’t familiar with the work. Google declined to comment.

Han says his algorithm could also make many devices smarter by letting them run advanced video analysis on modest hardware. The Pixel 4 smartphone may recognize simple swipes; future devices might understand complex hand gestures and body language. More ominously, surveillance cameras may not only recognize your face but also keep a record of what you are doing.

Xiaolong Wang, who specializes in using deep learning on video and who will become an assistant professor at UC San Diego next year, says the new work is impressive, but he warns that AI algorithms do not truly understand what’s going on in a video. “Recognizing the action ‘swimming’ can be rather easy if the model can recognize the human is in the water,” Wang points out.

Wang adds that this shortcoming shows how far AI programs are from truly understanding the world the way humans do. “There is still a long way to go before deep learning models can reason about time,” he says.

