Text-to-video models are improving all the time, and the latest ones can produce highly realistic videos. Step-Video-T2V is a new pre-trained text-to-video model with 30B parameters. It can generate videos up to 204 frames long. As the researchers explain:
A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos.
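To give a rough feel for what "trained using Flow Matching" means, here is a minimal Python sketch of the general idea. It is not the researchers' code. It assumes the common straight-line variant, where a noisy input is a linear blend of noise and data and the model learns the velocity pointing from noise to data. The names `interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`, and `predict_velocity` are hypothetical, and plain lists of floats stand in for latent video frames.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Velocity of the straight path: constant, pointing from noise to data."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at time t.

    predict_velocity(x_t, t) stands in for the model (assumed interface).
    """
    x_t = interpolate(noise, data, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, data)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, noise, steps=10):
    """Turn noise into a sample by following the predicted velocity
    from t=0 towards t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real system `predict_velocity` would be the trained network (here, the DiT) conditioned on the text prompt, and the loss would be averaged over many randomly drawn noise samples, videos, and time steps.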
Here are just a few examples of videos generated by the model:
You can read the paper here.