Rerender A Video

Ho Shao Ping

Introduction

Text-to-image generation models are capable of producing high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. Rerender A Video is a novel zero-shot text-guided video-to-video translation framework that adapts image models to videos: it supports customizing a specific subject with LoRA and introduces extra spatial guidance with ControlNet to render high-quality, temporally coherent videos that don't flicker. In this article, I will introduce the different parameters of this AI-powered tool to guide you through creating your own video. To give a better understanding, I will demonstrate the steps using the Hugging Face Space at https://huggingface.co/spaces/Anonymous-sub/Rerender.


Limitations

Only the Hugging Face demo is available at the moment, which is very limited and only showcases the keyframe generation ability of Rerender A Video. The Run Propagation button does nothing for now; it is meant to generate the whole video once the full code is released. The maximum number of keyframes is also 8, and the maximum frame resolution is 512x768. The second limitation of note is that videos with large or fast motion are unstable. Because the model relies on optical flow, which is only an estimate of movement, it may produce incorrect results if the subject in the video changes position too much or too quickly. The third limitation, common to almost any kind of video generation, is that it takes quite a long time: the running time for a video of size 512x640 is about 1 minute per keyframe on a T4 GPU. From my experience, it takes 2-3 minutes to Run All for 8 keyframes and under a minute for the first keyframe. I will be using the videos below to showcase the limitations of rerendering.

The first video is of a person breakdancing, with the prompt "man dancing in CG style"; the second is of a person looking at the camera without much movement, with the prompt "a woman in CG style".

The breakdancing video doesn't work well even though the model understands that the subject is breakdancing, while the woman in the other video is rendered correctly because she doesn't move much.
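Because the method leans on optical flow, you can get a rough sense of whether a clip is "too fast" before uploading it. Below is a minimal sketch of my own (not part of Rerender A Video) that uses OpenCV's Farnebäck flow to estimate the average per-frame motion; clips with a large mean flow magnitude, like the breakdancing one, are the ones likely to come out unstable.

```python
import cv2
import numpy as np

def mean_flow_magnitude(video_path: str, max_frames: int = 60) -> float:
    """Estimate the average optical-flow magnitude (pixels/frame) of a clip."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError(f"cannot read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    for _ in range(max_frames):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense Farneback flow: a 2-channel (dx, dy) displacement per pixel.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        magnitudes.append(np.linalg.norm(flow, axis=2).mean())
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes))

# Fast, jerky motion (e.g. breakdancing) yields a much larger value
# than a mostly static talking-head clip.
print(mean_flow_magnitude("breakdance.mp4"))
```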

Prompt

Prompts change the inputted video a lot and may take a lot of trial and error to get right. The model may understand some prompts better than others even if they mean the same thing. In the following examples, the prompts I used are "a one eyed man in CG style", "a man with one eye in CG style" and "a man with an eyepatch in CG style".

The model seems to understand what "one eyed man" and "man with one eye" mean, as it shows the man wearing an eyepatch in one of the keyframes. However, it doesn't put the eyepatch in the first keyframe, which is the most important one. So if your prompt doesn't give you what you want, consider wording it differently.
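Since the first keyframe drives everything that follows, it can be cheaper to iterate on wording with a plain Stable Diffusion + ControlNet image pipeline before committing to the demo. Here is a hedged sketch of my own using the diffusers library (the checkpoints and the edge-conditioned setup are my assumptions, not necessarily what the Space runs); it fixes the seed so that only the wording changes between outputs.

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Assumed models; the demo's exact checkpoints may differ.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

edge_map = load_image("first_frame_canny.png")  # precomputed Canny edges

prompts = [
    "a one eyed man in CG style",
    "a man with one eye in CG style",
    "a man with an eyepatch in CG style",
]
for i, prompt in enumerate(prompts):
    # Same seed for every wording, so only the prompt differs.
    generator = torch.Generator("cuda").manual_seed(0)
    image = pipe(prompt, image=edge_map, generator=generator).images[0]
    image.save(f"variant_{i}.png")
```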

Negative prompts are also very helpful for solving issues. The negative prompt can be accessed in the Advanced options for the 1st frame translation. I encountered a bug where the subject would be generated into a room for some reason, despite the prompt "a beautiful women in cg style" making no mention of a room. So I had to add "room" to the negative prompt, which quickly fixed the problem.

The first time it happened I was confused and thought it was just random; the second time I realized it might be a bug, so the third time around I fixed it by adding "room" to the negative prompt.
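Negative prompts are a standard Stable Diffusion feature, so the same trick works outside the demo too. A minimal sketch with diffusers (my own illustration; the checkpoint is an assumption):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)
image = pipe(
    "a beautiful woman in CG style",
    # Concepts listed here are steered *away* from during denoising,
    # which is why adding "room" suppressed the unwanted background.
    negative_prompt="room, indoor walls",
    generator=generator,
).images[0]
image.save("no_room.png")
```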


Advanced Options

Rerender A Video provides some advanced options to fine-tune generation for the first keyframe. I will discuss a few key ones.

ControlNet strength - ControlNet improves temporal consistency but might limit the effect of the prompt. I suggest lowering the ControlNet strength to 0.2 if the first keyframe doesn't give you the result you want. In the following example, I used the prompt "a man with an eyepatch in CG style".

The first video has ControlNet strength 0.7, which only shows an eyepatch for one keyframe, and the same goes for the second video, which has ControlNet strength 0.5. Only in the last video, with ControlNet strength 0.2, is the eyepatch present in all keyframes.

Denoising strength - Since image generation works by adding and removing noise, the more noise you add and remove, the more the input is changed. A denoising strength of 0 fully recovers the input (the outputted keyframes are the same as the input), while a denoising strength of 1.05 fully rerenders the input (the outputted keyframes are completely different from the input). In the following example, the prompt "beautiful women in CG style" is used.

The first video is the original video; the second has a denoising strength of 0.2, which looks very similar to the original; and the third has a denoising strength of 0.5, which looks very different from the original.

Seed - The prompt is not all that affects the output of the model. A seed in generative AI is a starting point, or initial input, used to generate an output. With the same prompt, using the same seed will yield the same result, and a different seed will yield a different one. After several generations on Hugging Face, I have found that the subjects you focus on in your prompt do not change when different seeds are used. In the examples below, I used the prompt "a man in CG style" with different seeds.

As you can see, the man's face doesn't change but the background does. In the first example, the background has become a room, even though in the original video the background is blurry and doesn't look like solid walls. In the second example, the background even seems to morph into a wreath that is glued to his head, while the rest appears to become a curtain or cloth. The third example goes back to a room with solid walls. This shows that elements the prompt doesn't describe will change from seed to seed, and those changes may even be added onto the subject the prompt does describe.
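These three knobs correspond to standard parameters of the underlying Stable Diffusion + ControlNet stack, so you can also explore them locally. Here is a minimal, hedged sketch with diffusers (the checkpoints and the img2img-with-ControlNet setup are my assumptions about roughly what the demo does per keyframe, not its actual code):

```python
import torch
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet, torch_dtype=torch.float16).to("cuda")

frame = load_image("first_frame.png")        # the input video frame
edges = load_image("first_frame_canny.png")  # its Canny edge map

image = pipe(
    "a man with an eyepatch in CG style",
    image=frame,
    control_image=edges,
    # ControlNet strength: lower values loosen the structural guidance,
    # giving the prompt more room (the 0.2 setting from the text above).
    controlnet_conditioning_scale=0.2,
    # Denoising strength: 0 roughly returns the input, 1 fully rerenders it.
    strength=0.5,
    # Seed: fixing the generator makes the run reproducible; changing it
    # mostly reshuffles what the prompt leaves unspecified (the background).
    generator=torch.Generator("cuda").manual_seed(1234),
).images[0]
image.save("keyframe.png")
```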

Conclusion

Rerender A Video is just a demo which can only generate keyframes, but it shows great potential: the keyframes are very consistent, and it will be a powerful tool once keyframe propagation works as intended.

Reference

Yang, S., Zhou, Y., Liu, Z., & Loy, C. C. (2023, June 14). Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation. arXiv. https://arxiv.org/abs/2306.07954