FateZero
Yifan, WANG
Introduction to FateZero
Generating and editing visual content with text prompts has become an exciting capability enabled by recent advances in AI. However, due to the large amount of randomness in the generation process, applying such models to real-world visual content editing, especially for videos, remains challenging. In this tutorial, we present FateZero, a zero-shot text-based video editing tool that allows users to make targeted edits to real-world videos without any per-prompt training or use-specific masks. To make the tool more accessible, I will demonstrate the steps using the Hugging Face Space at https://huggingface.co/spaces/chenyangqi/FateZero.
Introduction video
The Idea Behind FateZero
We introduce several techniques to enable consistent video editing with pre-trained models. First, in contrast to straightforward DDIM inversion, our approach captures the intermediate attention maps produced during inversion conditioned on the source prompt, which effectively retain both structural and motion information. These maps are fused directly into the editing process rather than being regenerated during denoising. To further minimize semantic leakage from the source video, we fuse the self-attention maps with a blending mask derived from the cross-attention features of the source prompt. Finally, we reformulate the self-attention mechanism in the denoising UNet as spatial-temporal attention to ensure frame consistency.
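To make the attention-fusion idea concrete, here is a minimal PyTorch sketch. The names and shapes (AttentionStore, blending_mask, fuse_self_attention) are illustrative assumptions, not FateZero's actual API; they only mirror the logic described above.

```python
# Minimal sketch of attention caching during inversion and masked fusion during
# editing. All helper names are hypothetical; see the official FateZero code for
# the real implementation.
import torch


class AttentionStore:
    """Caches attention maps produced at each step of DDIM inversion."""

    def __init__(self):
        self.self_attn = {}   # step -> list of self-attention maps
        self.cross_attn = {}  # step -> list of cross-attention maps

    def save(self, step, self_maps, cross_maps):
        self.self_attn[step] = [m.detach().cpu() for m in self_maps]
        self.cross_attn[step] = [m.detach().cpu() for m in cross_maps]


def blending_mask(source_cross_attn, token_idx, threshold=0.3):
    """Derive a binary mask of the edited region from the source prompt's
    cross-attention map for the word being edited (e.g. 'cat')."""
    attn = source_cross_attn[..., token_idx]        # (frames, h*w)
    attn = attn / attn.amax(dim=-1, keepdim=True)   # normalize per frame
    return (attn > threshold).float()               # 1 = editable region


def fuse_self_attention(src_attn, edit_attn, mask):
    """Keep the source self-attention outside the edited region and the newly
    generated self-attention inside it."""
    mask = mask.unsqueeze(-1)                       # broadcast over key dim
    return mask * edit_attn + (1.0 - mask) * src_attn


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real UNet attention maps.
    frames, tokens, hw = 8, 77, 32 * 32
    store = AttentionStore()
    store.save(step=0,
               self_maps=[torch.rand(frames, hw, hw)],
               cross_maps=[torch.rand(frames, hw, tokens)])

    mask = blending_mask(store.cross_attn[0][0], token_idx=5)
    fused = fuse_self_attention(store.self_attn[0][0],
                                torch.rand(frames, hw, hw), mask)
    print(fused.shape)  # torch.Size([8, 1024, 1024])
```

The key design choice is that the attention maps are cached once during inversion and then reused at the matching step of the editing pass, so structure and motion come from the source video rather than from a fresh, random generation.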
Significance of FateZero
Although the demonstration is brief, our tool is the first to show zero-shot, text-driven video style and local attribute editing using only a trained text-to-image model. We also achieve improved zero-shot shape-aware editing based on a one-shot text-to-video model. Extensive experiments demonstrate superior temporal consistency and editing capability compared to previous works.
Car ➜ Porsche Car
+Ukiyo-e Style
Usage of FateZero
In the tool, we first upload the unedited video and choose a model. Second, we enter the desired change as a target prompt (e.g., turning the video into Ukiyo-e style). Then, we adjust the FateZero attention-fusion parameters in the output interface. For cross-att replace steps, more steps mean replacing more cross-attention, which better preserves the semantic layout; for self-att replace steps, more steps mean replacing more spatial-temporal self-attention, which better preserves geometry and motion. Note, however, that more replace steps also mean more computation time.
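The sketch below shows one way to interpret these two sliders: each is treated as a fraction of the denoising schedule during which the source video's attention maps are injected into the editing pass. The function and parameter names (should_replace, cross_replace, self_replace) are assumptions for illustration, not the Space's exact arguments.

```python
# Hedged sketch of how the two "replace steps" settings could gate attention
# injection across the denoising schedule. Names are illustrative assumptions.
def should_replace(step, total_steps, replace_fraction):
    """True while the current step lies in the first `replace_fraction`
    of the denoising schedule (steps run from 0 to total_steps - 1)."""
    return step < replace_fraction * total_steps


def editing_loop(total_steps=50, cross_replace=0.8, self_replace=0.6):
    """Toy denoising loop that reports which source attention maps would be
    injected at each step of the editing pass."""
    for step in range(total_steps):
        inject_cross = should_replace(step, total_steps, cross_replace)
        inject_self = should_replace(step, total_steps, self_replace)
        # inject_cross -> reuse source cross-attention (preserves semantic layout)
        # inject_self  -> reuse source spatial-temporal self-attention
        #                 (preserves geometry and motion)
        if step % 10 == 0:
            print(f"step {step:2d}: cross={inject_cross}, self={inject_self}")
        # ...one UNet denoising step of the editing pass would run here...


editing_loop()
```

In practice, higher fractions keep the edit closer to the source layout and motion, at the cost of more computation and less freedom for the target prompt to change the content.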
Cat ➜ Black Cat, Grass
A man ➜ A Spider-Man
Conclusion
In summary, FateZero offers an impressive AI-powered tool for real-world visual content editing. Hopefully, this tutorial provides a clear introduction and practical guidance for using FateZero effectively.
Reference
Qi, C., et al. (2023, March). FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. arXiv. https://arxiv.org/abs/2303.09535