Text2Video-Zero

FEI, Yang

Introduction

Text-to-video generation has become quite popular thanks to advances in AI-powered text-to-image generation. Text2Video-Zero is the first tool to enable zero-shot text-to-video generation, proposing an approach that is low-cost yet produces high-quality video output. In this article, I will walk through the different parameters of this AI-powered tool to guide you through creating your own video. To make the tool more accessible, I will demonstrate the steps using the Hugging Face Space at https://huggingface.co/spaces/PAIR/Text2Video-Zero.


Model

Text2Video-Zero modifies the text-to-image model architecture by replacing self-attention with cross-frame attention, having each frame attend to the first frame, so that the generated frames stay temporally consistent as a video. The choice of pre-trained text-to-image model is therefore vital when you start producing videos with Text2Video-Zero, so I will discuss three popular pre-trained models for this demo.

The first is runwayml/stable-diffusion-v1-5. After several experiments on Hugging Face, I found that this pre-trained Stable Diffusion model can render diverse subjects such as animals, humans, and robots; it is quite a comprehensive model for generating realistic video elements. Moreover, since the authors used a Stable Diffusion model during their research, this model is well adapted to the method and consistently produces above-baseline results. In short, it generates good-quality videos, though not always the best. Here are some nice examples produced by this model (a sketch for running it locally follows these examples):

"a dog reading a book"

"an iron man flying in the sky"

"a cute cat walking on grass"

Secondly, I would recommend the dreamlike-art/dreamlike-photoreal-2.0 model, which is also well suited to this project. Like runwayml/stable-diffusion-v1-5, it is a large pre-trained model. Although not used as a benchmark in the Text2Video-Zero paper, dreamlike-photoreal-2.0 can generate high-quality, visually appealing videos, which is why it is a popular model on Hugging Face. Here are three examples of videos produced by this model, with a note on switching checkpoints after them:

"a dog reading a book"

"an iron man flying in the sky"

"a cute cat walking on grass"

Thirdly, I recommend the timlenardo/tdmx-edge-of-realism-dreambooth model, another pre-trained text-to-image model that generates highly realistic and visually stunning videos with Text2Video-Zero. Its capability to generate high-quality images translates into remarkably consistent videos; more importantly, the videos are coherent over time and pleasing to watch. Below are some gorgeous videos produced by this model:

"a dog reading a book"

"an iron man flying in the sky"

"a cute cat walking on grass"

Prompt

Prompt engineering has become very popular and important in recent times with the advancement of AI-powered chatbots. While I won't go into detail here, it is worth briefly explaining prompt engineering and how it applies to Text2Video-Zero. Prompt engineering involves carefully crafting the text prompts provided to AI systems in order to get better, more accurate, and more diverse results. The prompts used in Text2Video-Zero work the same way as those for other text-to-image models.

I will provide two examples that demonstrate how modifying the text prompt changes the generated video. First, the adjectives used in a prompt can significantly impact the result. I generated two videos using the timlenardo/tdmx-edge-of-realism-dreambooth model, a high-quality text-to-image model: the left video used the prompt "a cute cat walking on grass" while the right video used "a cat walking on grass". All other options were kept the same.

"a cute cat walking on grass"

"a cat walking on grass"

The difference is clear: the left video shows the cat's face from the front, while the right video depicts it from another angle. This demonstrates how small changes in prompt wording can guide the model to produce noticeably different results. Carefully choosing descriptive words is an important part of prompt engineering for getting better-quality, more accurate videos from Text2Video-Zero.

What's more, using a complete sentence versus a plain description also impacts the coherence of Text2Video-Zero's output. For example, the prompt "a dog is reading a book" generates a video where the dog's pose and surroundings change inconsistently, whereas the prompt "a dog reading a book" produces a more consistent video showing the dog in the same position reading.

"a dog is reading a book"

"a dog reading a book"

The left video varies randomly from frame to frame, but the right video, which simply describes the subject and action, depicts the dog continuously reading. This shows that prompt phrasing affects visual coherence, and that complete sentences can sometimes confuse the model.
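To run this kind of comparison fairly, it helps to fix the random seed so that only the prompt changes between runs. A minimal sketch, assuming the pipeline loaded earlier; the seed value 42 is an arbitrary choice:

import torch

for prompt in ["a dog is reading a book", "a dog reading a book"]:
    # Reset the generator to the same seed so the two runs differ only in the prompt.
    generator = torch.Generator(device="cuda").manual_seed(42)
    frames = pipe(prompt=prompt, generator=generator).images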

Advanced Options

Text2Video-Zero provides some advanced options to further control video generation. I will discuss three key ones: the video length and the two global translation parameters δx and δy.

Video Length - This specifies the number of frames in the generated 3-4 second video. More frames result in smoother, higher-quality videos, as the examples below show (a code sketch follows them).

video_length = 8

video_length = 12

video_length = 16
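In the diffusers pipeline, this option appears to correspond to the video_length argument; treating that mapping as an assumption, the comparison above can be reproduced with a sketch like this:

# Generate the same scene at three different frame counts.
for n_frames in (8, 12, 16):
    frames = pipe(prompt="a cute cat walking on grass", video_length=n_frames).images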

As you can see, using more frames consistently produces better visual results.

Global Translation (δx, δy) - These values control the global scene motion and camera movement described in the text prompt. Adjusting δx and δy guides the model to add appropriate left-right and up-down camera motions fitting the caption narrative. Positive deltas lead to rightward and downward shifts, while negative values induce leftward and upward movements. Increasing the magnitude results in a greater intensity of motion.

δx = 12, δy = 12

δx = 12, δy = 16

δx = 12, δy = 20

The first three videos demonstrate the effect of changing the δy parameter: with δx fixed at 12, the δy values used are 12, 16, and 20.

δx = 16, δy = 12

δx = 20, δy = 12

δx = 20, δy = 20

The next three videos showcase the impact of adjusting the δx parameter, increasing it from the earlier baseline of 12 to 16 and 20; the last video raises both δx and δy to 20.
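In the diffusers pipeline, δx and δy appear to correspond to the motion_field_strength_x and motion_field_strength_y arguments, both of which default to 12 like the Space's controls; treating that mapping as an assumption, a sketch:

# Strengthen the vertical camera motion while keeping the
# horizontal motion at its default strength.
frames = pipe(prompt="a cute cat walking on grass",
              motion_field_strength_x=12,
              motion_field_strength_y=20).images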

Carefully tuning these advanced parameters lets you generate higher-quality videos with logical camera movements aligned to the descriptive text prompt. Text2Video-Zero enables granular control over video characteristics like length, motion, and scene composition through both prompt wording and configurable model settings.

Conclusion

Collectively, Text2Video-Zero offers an impressive AI-powered tool for generating realistic videos from text prompts. Hopefully this tutorial provides guidance for using Text2Video-Zero effectively.

References

Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., & Shi, H. (2023). Text2Video-Zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.