When it comes to text-to-video tools, most people think of Sora, Dream Machine, Runway, or Kling. Meta's Movie Gen aims to take things to the next level. The tool lets you generate highly realistic videos from text prompts, and, best of all, they come with synchronized sound. You can also edit existing videos and turn your personal images into unique videos. Here are the capabilities of this tool:
- video generation
- personalized video generation
- precise video editing
- audio generation
Here is what you can do: the 30B-parameter model can generate up to 16 seconds of video at 16 frames per second. The model is optimized for text-to-image and text-to-video tasks, and it can reason about object motion, camera motion, and subject-object interactions.
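To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python (just arithmetic on the figures above, nothing model-specific):

```python
# Frame budget implied by the published numbers:
# up to 16 seconds of video at 16 frames per second.
DURATION_S = 16   # maximum clip length in seconds
FPS = 16          # frames per second

total_frames = DURATION_S * FPS
print(f"Frames per clip: {total_frames}")      # 256

# At 1080p (1920x1080, 3 channels, 8 bits each), the raw,
# uncompressed pixel data for one full-length clip would be roughly:
bytes_per_frame = 1920 * 1080 * 3
raw_gb = total_frames * bytes_per_frame / 1e9
print(f"Uncompressed size: ~{raw_gb:.1f} GB")  # ~1.6 GB
```

That roughly 1.6 GB of raw pixels per clip hints at why the training setup described further down leans on compression techniques rather than operating on raw frames.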
You can also combine an image of yourself with a text prompt to generate a video of you doing things you never actually did in real life. The tool also supports precise video editing: you can add, remove, and replace elements. What's neat is that Movie Gen preserves the original content and only modifies the pixels that are relevant to the edit.
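Meta hasn't shared how the editing works internally, but the "only touch the relevant pixels" idea can be illustrated with a simple masked blend. Everything here (the mask, the stand-in model output) is a placeholder of my own, not Movie Gen's method:

```python
import numpy as np

def apply_localized_edit(frame: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep the original frame wherever the mask is 0 and take the
    edited content only where the mask is 1, so untouched regions
    are preserved exactly."""
    m = mask[..., None].astype(frame.dtype)  # broadcast the mask over RGB channels
    return frame * (1 - m) + edited * m

frame = np.random.rand(1080, 1920, 3)    # original frame
edited = np.random.rand(1080, 1920, 3)   # stand-in for a model's edited output
mask = np.zeros((1080, 1920))
mask[400:700, 800:1200] = 1              # region the edit is allowed to change
result = apply_localized_edit(frame, edited, mask)
```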
You also get a 13B-parameter audio generation model that takes a video and a text prompt and generates high-fidelity audio up to 45 seconds long. Its audio extension technique can produce audio for videos of arbitrary length. For example, the tool can generate the sound of an ATV engine and rustling leaves.
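Meta hasn't released code for the audio extension technique, but the general idea of covering a long video by generating audio in overlapping windows and crossfading them together can be sketched as below. The `generate_segment` stub and the window/overlap sizes are placeholders of my own, not Movie Gen's API:

```python
import numpy as np

SAMPLE_RATE = 48_000   # assumed output sample rate (placeholder)
WINDOW_S = 30          # generate audio in 30-second windows (placeholder)
OVERLAP_S = 2          # overlap neighbouring windows by 2 seconds

def generate_segment(duration_s: float, seed: int) -> np.ndarray:
    """Stand-in for a video- and text-conditioned audio model.
    Here it just returns quiet noise so the sketch runs."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-0.1, 0.1, int(duration_s * SAMPLE_RATE))

def extend_audio(total_s: float) -> np.ndarray:
    """Cover a video of arbitrary length with overlapping audio windows,
    crossfading each new window into the tail of the previous one."""
    hop_s = WINDOW_S - OVERLAP_S
    out = np.zeros(int(total_s * SAMPLE_RATE))
    fade = np.linspace(0.0, 1.0, int(OVERLAP_S * SAMPLE_RATE))

    t, seed = 0.0, 0
    while t < total_s:
        seg = generate_segment(min(WINDOW_S, total_s - t), seed)
        start = int(t * SAMPLE_RATE)
        seg = seg[: len(out) - start]                    # clamp to the output buffer
        if t > 0 and len(seg) >= len(fade):
            seg[: len(fade)] *= fade                     # fade the new window in...
            out[start:start + len(fade)] *= 1.0 - fade   # ...while fading the old one out
        out[start:start + len(seg)] += seg
        t += hop_s
        seed += 1
    return out

audio = extend_audio(total_s=95.0)                       # e.g. a 95-second video
print(f"{len(audio) / SAMPLE_RATE:.0f} s of audio")      # 95 s of audio
```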
As the company explains:
> "There are lots of optimizations we can do to further decrease inference time and improve the quality of the models by scaling up further."
Here is a summary of the key points:
- Movie Gen is a set of advanced models from Meta that can create high-quality 1080p videos with synchronized audio in different aspect ratios.
- It has features such as text-to-video generation, personalized videos, precise video editing, and even video-to-audio and text-to-audio generation.
- The biggest model has 30 billion parameters and can generate videos up to 16 seconds long at 16 frames per second.
- The Movie Gen Video model can produce HD videos from text prompts and also lets you edit or personalize those videos based on a photo.
- The Movie Gen Audio model, with 13 billion parameters, generates rich sound effects and music that sync with video. You can even use it to generate ambient sounds.
- With video personalization, you can create videos based on your image combined with a text prompt.
- According to Meta, these models outperform alternatives like Runway Gen 3, LumaLabs, and OpenAI's Sora (we'll have to wait for hands-on testing to confirm).
- These models were trained on a dataset of 100 million video-text pairs and 1 billion image-text pairs, using Transformer-based architectures and smart compression techniques.
- With the Spatial Upsampler, video resolution can be bumped up to 1080p without noticeable quality loss (a rough sketch of the general idea follows below).
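Meta hasn't released the Spatial Upsampler itself, but the common pattern behind the name (interpolate to the target resolution, then let a small network sharpen the result) can be sketched in PyTorch. This is a generic illustration under my own assumptions, not Movie Gen's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSpatialUpsampler(nn.Module):
    """Toy spatial upsampler: bilinear interpolation to the target size
    followed by a small convolutional refinement that predicts a
    residual correction."""

    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, frames: torch.Tensor, out_hw=(1080, 1920)) -> torch.Tensor:
        # frames: (batch, channels, height, width) lower-resolution video frames
        up = F.interpolate(frames, size=out_hw, mode="bilinear", align_corners=False)
        return up + self.refine(up)

# Example: upscale a small batch of lower-resolution frames to 1080p.
frames = torch.rand(4, 3, 432, 768)
hd = SimpleSpatialUpsampler()(frames)
print(hd.shape)   # torch.Size([4, 3, 1080, 1920])
```

In a real system the refinement network would be trained on pairs of low- and high-resolution frames; the point of the sketch is just the interpolate-then-refine structure.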