Seeing video is not believing —

Meta announces Make-A-Video, which generates video from text [Updated]

Using a text description or an existing image, Make-A-Video can render video on demand.

Still image from an AI-generated video of a teddy bear painting a portrait.

Today, Meta announced Make-A-Video, an AI-powered video generator that can create novel video content from text or image prompts, similar to existing image synthesis tools like DALL-E and Stable Diffusion. It can also make variations of existing videos, though it's not yet available for public use.

On Make-A-Video's announcement page, Meta shows example videos generated from text, including "a young couple walking in heavy rain" and "a teddy bear painting a portrait." It also showcases Make-A-Video's ability to take a static source image and animate it. For example, a still photo of a sea turtle, once processed through the AI model, can appear to be swimming.

The key technology behind Make-A-Video, and the reason it has arrived sooner than some experts anticipated, is that it builds on existing text-to-image synthesis work of the kind used in image generators like OpenAI's DALL-E. In July, Meta announced its own text-to-image AI model called Make-A-Scene.

Instead of training the Make-A-Video model on labeled video data (for example, captioned descriptions of the actions depicted), Meta took image synthesis data (still images paired with captions) and combined it with unlabeled video training data, so the model learns a sense of where a text or image prompt might exist in time and space. It can then predict what comes after the image and display the scene in motion for a short period.
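
To make that concrete, here is a toy, hypothetical sketch of the two-part data strategy in PyTorch. None of the names, models, or shapes below come from Meta's code, and the real system is far more elaborate; the point is only that captioned stills supply a text-to-appearance mapping while unlabeled clips supply a self-supervised "what comes next" signal.

```python
# Toy sketch (hypothetical, not Meta's code) of the two-part data strategy:
# captioned still images teach appearance from text; unlabeled video clips
# teach motion via a self-supervised next-frame objective.
import torch
import torch.nn as nn

# Stand-in data: (text embedding, image) pairs and unlabeled 4-frame clips.
captioned_images = [(torch.randn(16), torch.randn(3 * 32 * 32)) for _ in range(8)]
unlabeled_clips = [torch.randn(4, 3 * 32 * 32) for _ in range(8)]

text_to_image = nn.Linear(16, 3 * 32 * 32)        # learns appearance from captions
next_frame = nn.Linear(3 * 32 * 32, 3 * 32 * 32)  # learns motion from raw video
loss_fn = nn.MSELoss()

# Stage 1: supervised image synthesis from (caption, image) pairs.
opt = torch.optim.Adam(text_to_image.parameters(), lr=1e-3)
for text_emb, image in captioned_images:
    loss = loss_fn(text_to_image(text_emb), image)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: self-supervised temporal training on unlabeled clips; the model
# learns to predict each frame from the previous one, no captions required.
opt = torch.optim.Adam(next_frame.parameters(), lr=1e-3)
for clip in unlabeled_clips:
    loss = loss_fn(next_frame(clip[:-1]), clip[1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The appeal of this split is that captioned still images and raw, unlabeled video are both far easier to come by than carefully captioned video.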

"Using function-preserving transformations, we extend the spatial layers at the model initialization stage to include temporal information," Meta wrote in a white paper. "The extended spatial-temporal network includes new attention modules that learn temporal world dynamics from a collection of videos."

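As a rough illustration of what "function-preserving" means, the sketch below (written for this article in PyTorch, not taken from Meta's implementation, and assuming a pseudo-3D convolution as one common way to realize such an extension) follows a 2D spatial convolution, which would come from a pre-trained image model, with a 1D temporal convolution initialized to the identity. At initialization, the combined layer behaves exactly like the original spatial layer; only after training on video does the temporal part learn motion.

```python
# Illustrative sketch (not Meta's code): a 2D spatial conv followed by a 1D
# temporal conv initialized as the identity, so the new spatio-temporal layer
# initially reproduces the spatial layer's output ("function-preserving").
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Spatial conv: in practice, loaded from a pre-trained text-to-image
        # model; randomly initialized here for illustration.
        self.spatial = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2)
        # Temporal conv over the frame axis, initialized to the identity so
        # that adding it does not change the output until trained on video.
        self.temporal = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)
        with torch.no_grad():
            self.temporal.weight.zero_()
            self.temporal.bias.zero_()
            self.temporal.weight[:, :, kernel_size // 2] = torch.eye(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width).
        b, c, t, h, w = x.shape
        # Run the spatial conv on every frame independently.
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # Run the temporal conv at every spatial location independently.
        z = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
        z = self.temporal(z)
        return z.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


clip = torch.randn(1, 8, 4, 16, 16)  # batch, channels, frames, height, width
layer = Pseudo3DConv(channels=8)
print(layer(clip).shape)             # torch.Size([1, 8, 4, 16, 16])
```

The same identity-style initialization idea can be applied to added temporal attention layers; the convolution above just shows it in its simplest form.
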
Meta has not said how or when Make-A-Video might become available to the public or who would have access to it, but it does provide a sign-up form people can fill out if they are interested in trying it in the future.

Meta acknowledges that the ability to create photorealistic videos on demand presents certain social hazards. At the bottom of the announcement page, Meta says that all AI-generated video content from Make-A-Video contains a watermark to "help ensure viewers know the video was generated with AI and is not a captured video."

If history is any guide, competing open source text-to-video models may follow (some, like CogVideo, already exist), which could render Meta's watermark safeguard irrelevant.

Update: Yesterday was a busy day in AI news. Aside from Make-A-Video, another text-to-video model called Phenaki emerged, and it can apparently create several-minute-long videos from detailed text prompts, albeit at low resolutions. Its authors are anonymous for now because of ICLR's blind submission process, but you can read its white paper online. Also, an overview of DreamFusion, a new text-to-3D model from several Google researchers, made its debut.

Meanwhile, AI researcher Simon Willison examined the data set used to train Meta's Make-A-Video model and discovered that it includes over 10 million videos scraped from Shutterstock without permission; Andy Baio noticed that 3.3 million additional videos came from YouTube. Willison also created a site that lets you search through the video data set, and Baio wrote up some ethical commentary on the practice of using commercial media in "non-commercial" academic research that then gets baked into commercial AI products.
