Text-to-video model explained

A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text.^[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.^[2]

Models

Several text-to-video models exist, including open-source ones. CogVideo, which accepts Chinese-language input,^[3] is the earliest text-to-video model "of 9.4 billion parameters" to be developed; a demo version with open-source code was first presented on GitHub in 2022. That year, Meta Platforms released a partial text-to-video model called "Make-A-Video",^[4] ^[5] ^[6] and Google Brain (later Google DeepMind) introduced Imagen Video, a text-to-video model with a 3D U-Net.^[7] ^[8] ^[9] ^[10] ^[11]

In March 2023, a research paper titled "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation" was published, presenting a novel approach to video generation.^[12] The VideoFusion model decomposes the per-frame noise of the diffusion process into two components: a base noise that is shared across frames to ensure temporal coherence, and a residual noise that varies from frame to frame. By using a pre-trained image diffusion model as a base generator, the model efficiently generates high-quality and coherent videos. Fine-tuning the pre-trained model on video data addresses the domain gap between image and video data, enhancing the model's ability to produce realistic and consistent video sequences.^[13] In the same month, Adobe introduced Firefly AI as part of its features.^[14]
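The noise decomposition attributed to VideoFusion above can be illustrated with a minimal sketch. It assumes that the base noise is the component shared by all frames and that the residual noise is drawn independently per frame. The function name `decomposed_noise`, the parameter `share`, and the square-root weighting (chosen so that each frame's noise keeps unit variance) are assumptions made for illustration, not code from the paper.

```python
import numpy as np

def decomposed_noise(num_frames, frame_shape, share=0.5, seed=0):
    """Build per-frame noise from a shared base and a per-frame residual.

    `share` in [0, 1] is a hypothetical mixing weight: 1.0 gives every
    frame the same noise, 0.0 gives fully independent noise per frame.
    """
    rng = np.random.default_rng(seed)
    # One base noise sample, reused by every frame (temporal coherence).
    base = rng.standard_normal(frame_shape)
    # One residual noise sample per frame (frame-to-frame variation).
    residual = rng.standard_normal((num_frames, *frame_shape))
    # Square-root weights keep the variance of the mixed noise at 1.
    noise = np.sqrt(share) * base + np.sqrt(1.0 - share) * residual
    return noise, base, residual

noise, base, residual = decomposed_noise(num_frames=8, frame_shape=(16, 16), share=0.7)
print(noise.shape)
```

A larger `share` makes the noise of neighbouring frames more strongly correlated, which is the intuition behind sharing a base component to keep frames coherent.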

In January 2024, Google announced development of a text-to-video model named Lumiere, which is anticipated to integrate advanced video editing capabilities.^[15] Matthias Niessner and Lourdes Agapito at the AI company Synthesia work on 3D neural rendering techniques that synthesise realistic video using 2D and 3D neural representations of shape, appearance, and motion for controllable video synthesis of avatars.^[16] In June 2024, Luma Labs launched its Dream Machine video tool.^[17] ^[18] That same month,^[19] Kuaishou extended its Kling AI text-to-video model to international users. In July 2024, TikTok owner ByteDance released Jimeng AI in China through its subsidiary Faceu Technology.^[20] In September 2024, the Chinese AI company MiniMax debuted its video-01 model, joining other established AI model companies such as Zhipu AI, Baichuan, and Moonshot AI, which contribute to China’s involvement in AI technology.^[21]

Alternative approaches to text-to-video models include Google's Phenaki, Hour One, Colossyan, Runway's Gen-3 Alpha,^[22] ^[23] and OpenAI's Sora,^[24] which was unreleased as of August 2024 and available only to alpha testers.^[25] Several additional text-to-video models, such as Plug-and-Play, Text2LIVE, and TuneAVideo, have emerged.^[26] Google is also preparing to launch a video generation tool named Veo for YouTube Shorts in 2025.^[27] FLUX.1 developer Black Forest Labs has announced its text-to-video model SOTA.^[28]

Architecture and Training

Several architectures have been used to create text-to-video models. As with text-to-image models, they can be built on recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks, which have been used for Pixel Transformation Models and Stochastic Video Generation Models; these aid consistency and realism, respectively.^[29] Transformer models are an alternative. Generative adversarial networks (GANs), variational autoencoders (VAEs), which can aid in the prediction of human motion,^[30] and diffusion models have also been used to develop the image generation aspects of the model.^[31]

Text-video datasets used to train models include, but are not limited to, WebVid-10M, HDVILA-100M, CCV, ActivityNet, and Panda-70M.^[32] ^[33] These datasets contain millions of original videos of interest, generated videos, captioned videos, and textual information that help train models for accuracy. Text datasets used to train models include, but are not limited to, PromptSource, DiffusionDB, and VidProM. These datasets provide the range of text inputs needed to teach models how to interpret a variety of textual prompts.

The video generation process involves synchronizing the text inputs with video frames, ensuring alignment and consistency throughout the sequence. Because of resource limitations, the quality of this predictive process tends to decline as the length of the video increases.

Limitations

Despite the rapid improvement in the performance of text-to-video models, a primary limitation is that they are computationally heavy, which limits their capacity to produce high-quality, lengthy outputs.^[34] ^[35] These models also require a large amount of specific training data to generate high-quality, coherent outputs, which raises the issue of accessibility.
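A rough back-of-the-envelope calculation shows why longer videos become expensive so quickly. The sketch below assumes a model that represents each frame as a 32×32 grid of latent tokens and applies full self-attention across all tokens of all frames; both the grid size and the attention pattern are illustrative assumptions, and real systems often use cheaper factorized or windowed attention. The function name `attention_cost` is hypothetical.

```python
def attention_cost(num_frames, height=32, width=32):
    """Return (token count, pairwise attention interactions) for a clip.

    Assumes one token per latent position and full spatiotemporal
    self-attention, whose cost grows with the square of the token count.
    """
    tokens = num_frames * height * width
    pairs = tokens * tokens
    return tokens, pairs

for frames in (16, 32, 64):
    tokens, pairs = attention_cost(frames)
    print(f"{frames} frames: {tokens} tokens, {pairs} attention pairs")
```

Under these assumptions, doubling the number of frames doubles the token count but quadruples the number of attention interactions, so clip length is limited by compute and memory long before it is limited by the prompt.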

Moreover, models may misinterpret textual prompts, resulting in video outputs that deviate from the intended meaning. This can occur due to limitations in capturing semantic context embedded in text, which affects the model’s ability to align generated video with the user’s intended message.^[33] Various models, including Make-A-Video, Imagen Video, Phenaki, CogVideo, GODIVA, and NUWA, are currently being tested and refined to enhance their alignment capabilities and overall performance in text-to-video generation.
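One way to make prompt alignment measurable is to embed the prompt and each generated frame in a shared vector space and compare them with cosine similarity, as CLIP-based scores do. The sketch below is only an illustration: it uses small hand-made vectors in place of a real text and image encoder, and the function names `alignment_score` and `temporal_consistency` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_score(text_emb, frame_embs):
    """Mean similarity between the prompt embedding and every frame."""
    return float(np.mean([cosine(text_emb, f) for f in frame_embs]))

def temporal_consistency(frame_embs):
    """Mean similarity between each frame and the one that follows it."""
    pairs = zip(frame_embs[:-1], frame_embs[1:])
    return float(np.mean([cosine(a, b) for a, b in pairs]))

# Toy embeddings standing in for encoder outputs (assumed, not real data).
prompt = [1.0, 0.0, 0.0]
on_topic = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
off_topic = [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1], [0.1, 1.0, 0.0]]

print(alignment_score(prompt, on_topic) > alignment_score(prompt, off_topic))
print(temporal_consistency(on_topic))
```

A video whose frames drift away from the prompt receives a lower alignment score, while abrupt changes between neighbouring frames lower the consistency score.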

Ethics

The deployment of text-to-video models raises ethical considerations related to content generation. These models have the potential to create inappropriate or unauthorized content, including explicit material, graphic violence, misinformation, and likenesses of real individuals without consent.^[36] Ensuring that AI-generated content complies with established standards for safe and ethical usage is essential, as content generated by these models may not always be easily identified as harmful or misleading. The ability of AI to recognize and filter out NSFW or copyrighted content remains an ongoing challenge, with implications for both creators and audiences.

Impacts and Applications

Text-to-video models offer a broad range of applications that may benefit various fields, from education and promotion to the creative industries. These models can streamline content creation for training videos, movie previews, gaming assets, and visualizations, making it easier to generate high-quality, dynamic content.^[37] These features provide users with economic and personal benefits.

Comparison of existing models

Model	Company	Year	Status	Key features	Capabilities	Pricing	Video length	Supported languages
Synthesia	Synthesia	2019	Released	AI avatars, multilingual support for 60+ languages, customization options^[38]	Specialized in realistic AI avatars for corporate training and marketing	Subscription-based, starting around $30/month	Varies based on subscription	60+
InVideo AI	InVideo	2021	Released	AI-powered video creation, large stock library, AI talking avatars	Tailored for social media content with platform-specific templates	Free plan available, Paid plans starting at $16/month	Varies depending on content type	Multiple (not specified)
Fliki	Fliki AI	2022	Released	Text-to-video with AI avatars and voices, extensive language and voice support	Supports 65+ AI avatars and 2,000+ voices in 70 languages	Free plan available, Paid plans starting at $30/month	Varies based on subscription	70+
Runway Gen-2	Runway AI	2023	Released	Multimodal video generation from text, images, or videos^[39]	High-quality visuals, various modes like stylization and storyboard	Free trial, Paid plans (details not specified)	Up to 16 seconds	Multiple (not specified)
Pika Labs	Pika Labs	2024	Beta	Dynamic video generation, camera and motion customization^[40]	User-friendly, focused on natural dynamic generation	Currently free during beta	Flexible, supports longer videos with frame continuation	Multiple (not specified)
Runway Gen-3 Alpha	Runway AI	2024	Alpha	Enhanced visual fidelity, photorealistic humans, fine-grained temporal control^[41]	Ultra-realistic video generation with precise key-framing and industry-level customization	Free trial available, custom pricing for enterprises	Up to 10 seconds per clip, extendable	Multiple (not specified)
OpenAI Sora	OpenAI	2024 (expected)	Alpha	Deep language understanding, high-quality cinematic visuals, multi-shot videos^[42]	Capable of creating detailed, dynamic, and emotionally expressive videos; still under development with safety measures	Pricing not yet disclosed	Expected to generate longer videos; duration specifics TBD	Multiple (not specified)

Notes and References

Artificial Intelligence Index Report 2023. Stanford Institute for Human-Centered Artificial Intelligence. p. 98. "Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022."
Melnik, Andrew; Ljubljanac, Michal; Lu, Cong; Yan, Qi; Ren, Weiming; Ritter, Helge (2024-05-06). "Video Diffusion Models: A Survey". arXiv:2405.03150 [cs.CV].
Wodecki, Ben (2023-08-11). "Text-to-Video Generative AI Models: The Definitive List". AI Business. Informa. Retrieved 2024-11-18.
Davies, Teli (2022-09-29). "Make-A-Video: Meta AI's New Model For Text-To-Video Generation". Weights & Biases. Retrieved 2022-10-12.
Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
"Meta's Make-A-Video AI creates videos from text". www.fonearena.com. Retrieved 2022-10-12.
"google: Google takes on Meta, introduces own video-generating AI". The Economic Times. 2022-10-06. Retrieved 2022-10-12.
Monge, Jim Clyde (2022-08-03). "This AI Can Create Video From Text Prompt". Medium. Retrieved 2022-10-12.
"Nuh-uh, Meta, we can do text-to-video AI, too, says Google". The Register. Retrieved 2022-10-12.
"Papers with Code - See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
"Papers with Code - Text-driven Video Prediction". paperswithcode.com. Retrieved 2022-10-12.
Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv:2303.08320 [cs.CV].
Luo, Zhengxiong; Chen, Dayou; Zhang, Yingya; Huang, Yan; Wang, Liang; Shen, Yujun; Zhao, Deli; Zhou, Jingren; Tan, Tieniu (2023). "VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation". arXiv:2303.08320 [cs.CV].
"Adobe launches Firefly Video model and enhances image, vector and design models". Adobe Newsroom. Adobe Inc. 2024-10-10. Retrieved 2024-11-18.
Yirka, Bob (2024-01-26). "Google announces the development of Lumiere, an AI-based next-generation text-to-video generator". Tech Xplore. Retrieved 2024-11-18.
"Text to Speech for Videos". Synthesia.io. Retrieved 2023-10-17.
Nuñez, Michael (2024-06-12). "Luma AI debuts 'Dream Machine' for realistic video generation, heating up AI media race". VentureBeat. Retrieved 2024-11-18.
Fink, Charlie. "Apple Debuts Intelligence, Mistral Raises $600 Million, New AI Text-To-Video". Forbes. Retrieved 2024-11-18.
Franzen, Carl (2024-06-12). "What you need to know about Kling, the AI video generator rival to Sora that's wowing creators". VentureBeat. Retrieved 2024-11-18.
"ByteDance joins OpenAI's Sora rivals with AI video app launch". Reuters. 2024-08-06. Retrieved 2024-11-18.
"Chinese ai 'tiger' minimax launches text-to-video-generating model to rival OpenAI's sora". Yahoo! Finance. 2024-09-02. Retrieved 2024-11-18.
Kemper, Jonathan (2024-07-01). "Runway's Sora competitor Gen-3 Alpha now available". THE DECODER. Retrieved 2024-11-18.
"Generative AI's Next Frontier Is Video". Bloomberg.com. 2023-03-20. Retrieved 2024-11-18.
"OpenAI teases 'Sora,' its new text-to-video AI model". NBC News. 2024-02-15. Retrieved 2024-11-18.
Kelly, Chris (2024-06-25). "Toys R Us creates first brand film to use OpenAI's text-to-video tool". Marketing Dive. Retrieved 2024-11-18.
Jin, Jiayao; Wu, Jianhang; Xu, Zhoucheng; Zhang, Hang; Wang, Yaxin; Yang, Jielong (2023-08-04). "Text to Video: Enhancing Video Generation Using Diffusion Models and Reconstruction Network". 2023 2nd International Conference on Computing, Communication, Perception and Quantum Technology (CCPQT). IEEE. pp. 108–114. doi:10.1109/CCPQT60491.2023.00024. ISBN 979-8-3503-4269-7. https://ieeexplore.ieee.org/document/10336607
Forlini, Emily Dreibelbis (2024-09-18). "Google's veo text-to-video AI generator is coming to YouTube shorts". PC Magazine. Retrieved 2024-11-18.
"Announcing Black Forest Labs". Black Forest Labs. 2024-08-01. Retrieved 2024-11-18.
Bhagwatkar, Rishika; Bachu, Saketh; Fitter, Khurshed; Kulkarni, Akshay; Chiddarwar, Shital (2020-12-17). "A Review of Video Generation Approaches". 2020 International Conference on Power, Instrumentation, Control and Computing (PICC). IEEE. pp. 1–5. doi:10.1109/PICC51425.2020.9362485. ISBN 978-1-7281-7590-4. https://ieeexplore.ieee.org/document/9362485
Kim, Taehoon; Kang, ChanHee; Park, JaeHyuk; Jeong, Daun; Yang, ChangHee; Kang, Suk-Ju; Kong, Kyeongbo (2024-01-03). "Human Motion Aware Text-to-Video Generation with Explicit Camera Control". 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE. pp. 5069–5078. doi:10.1109/WACV57701.2024.00500. ISBN 979-8-3503-1892-0. https://ieeexplore.ieee.org/document/10484108
Singh, Aditi (2023-05-09). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8. https://ieeexplore.ieee.org/document/10303174
Miao, Yibo; Zhu, Yifan; Dong, Yinpeng; Yu, Lijia; Zhu, Jun; Gao, Xiao-Shan (2024-09-08). "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models". arXiv:2407.05965 [cs.CV].
Zhang, Ji; Mei, Kuizhi; Wang, Xiao; Zheng, Yu; Fan, Jianping (August 2018). "From Text to Video: Exploiting Mid-Level Semantics for Large-Scale Video Classification". 2018 24th International Conference on Pattern Recognition (ICPR). IEEE. pp. 1695–1700. doi:10.1109/ICPR.2018.8545513. ISBN 978-1-5386-3788-3. https://ieeexplore.ieee.org/document/8545513
Bhagwatkar, Rishika; Bachu, Saketh; Fitter, Khurshed; Kulkarni, Akshay; Chiddarwar, Shital (2020-12-17). "A Review of Video Generation Approaches". 2020 International Conference on Power, Instrumentation, Control and Computing (PICC). IEEE. pp. 1–5. doi:10.1109/PICC51425.2020.9362485. ISBN 978-1-7281-7590-4. https://ieeexplore.ieee.org/document/9362485
Singh, Aditi (2023-05-09). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8. https://ieeexplore.ieee.org/document/10303174
Miao, Yibo; Zhu, Yifan; Dong, Yinpeng; Yu, Lijia; Zhu, Jun; Gao, Xiao-Shan (2024-09-08). "T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models". arXiv:2407.05965 [cs.CV].
Singh, Aditi (2023-05-09). "A Survey of AI Text-to-Image and AI Text-to-Video Generators". 2023 4th International Conference on Artificial Intelligence, Robotics and Control (AIRC). IEEE. pp. 32–36. arXiv:2311.06329. doi:10.1109/AIRC57904.2023.10303174. ISBN 979-8-3503-4824-8. https://ieeexplore.ieee.org/document/10303174
"Top AI Video Generation Models of 2024". Deepgram. Retrieved 2024-08-30.
"Runway Research Gen-2: Generate novel videos with text, images or video clips". runwayml.com. Retrieved 2024-08-30.
Sharma, Shubham (2023-12-26). "Pika Labs' text-to-video AI platform opens to all: Here's how to use it". VentureBeat. Retrieved 2024-08-30.
"Runway Research Introducing Gen-3 Alpha: A New Frontier for Video Generation". runwayml.com. Retrieved 2024-08-30.
"Sora OpenAI". openai.com. Retrieved 2024-08-30.