DragGAN - Interactive Point-based Manipulation on the Generative Image Manifold
Muzi ZHANG
Introduction
DragGAN is an interactive image-editing method built on GANs that lets users modify features of a generated image such as shape, body pose, and facial expression. Users place one or more pairs of points, and DragGAN performs a series of iterations that "drag" each handle point to its corresponding target point, without modifying pixels outside a region specified by the user.
How to use
DragGAN includes a GUI, and users can edit an image simply by clicking on it; DragGAN then runs multiple iterations so that the image is gradually transformed towards the target.
During the transformation, you can also stop at any step and modify the handle and target point(s), so that the transformation continues towards a different target.
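Under the hood, each such edit is an optimization loop over the latent code. The following is a minimal Python sketch of that loop, assuming PyTorch; generator, motion_supervision_loss, and track_points are hypothetical placeholder names, not the official API (a sketch of motion_supervision_loss appears in the Technical details section below):

import torch

def drag(generator, w, handles, targets, mask, steps=200, lr=2e-3):
    # Only the latent code w is optimized; the generator weights stay frozen.
    w = w.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    with torch.no_grad():
        _, feat0 = generator(w)            # features of the initial image
    for _ in range(steps):
        _, feat = generator(w)             # image and intermediate features
        loss = sum(motion_supervision_loss(feat, feat0, p, t, mask)
                   for p, t in zip(handles, targets))
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Re-locate each handle point in the updated features, since the
        # content under it has moved after the optimization step.
        with torch.no_grad():
            _, feat = generator(w)
            handles = track_points(feat, handles, targets)
        if all((h - t).norm() < 1 for h, t in zip(handles, targets)):
            break                          # every handle reached its target
    return w.detach()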
Since DragGAN is based on the StyleGAN2 model and performs image editing by modifying StyleGAN2's mapped latent vector 𝒘, rather than by operating directly on the input image, the tool can only handle images generated by StyleGAN2. If users want to upload their own image, an extra GAN inversion step that maps the image back to a latent vector is required, which may affect the details of the original image.
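One common way to perform such an inversion is to optimize 𝒘 directly so that the generated image reproduces the uploaded photo. Below is a minimal sketch of this idea, assuming PyTorch and the lpips perceptual-distance package; generator.mean_latent and generator.synthesize are hypothetical names standing in for whatever inversion pipeline is actually used:

import torch
import lpips  # learned perceptual distance (pip install lpips)

def invert(generator, target_img, steps=500, lr=1e-2):
    # Start from an average latent code and optimize it to match the photo.
    w = generator.mean_latent().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    percep = lpips.LPIPS(net='vgg')
    for _ in range(steps):
        img = generator.synthesize(w)
        # Perceptual loss plus a small pixel-wise term for low-level detail.
        loss = percep(img, target_img).mean() + 0.1 * (img - target_img).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()  # details the generator cannot reproduce are lost here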
Technical details
This tool is based on the StyleGAN2 architecture [Karras et al. 2020].
Unlike common GAN models, StyleGAN2 first maps the 512-dimensional latent vector 𝒛 ∼ N(0, 𝑰) to an intermediate latent code 𝒘 via a mapping network; 𝒘 is used to control the style of the image. This vector is then fed to every layer of the synthesis network, together with an additional noise vector.
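In code, this two-stage structure looks roughly as follows. The G.mapping / G.synthesis split mirrors NVIDIA's official stylegan2-ada-pytorch release; exact signatures and the pickle layout may differ in other codebases:

import pickle
import torch

# Assumes a pretrained generator pickle from the stylegan2-ada-pytorch release
# (loading it requires that repo's modules to be importable).
with open('ffhq.pkl', 'rb') as f:
    G = pickle.load(f)['G_ema']

z = torch.randn(1, G.z_dim)                # latent z ~ N(0, I), typically 512-d
w = G.mapping(z, None)                     # mapping network: z -> w, replicated
                                           # once per synthesis layer
img = G.synthesis(w, noise_mode='const')   # w modulates every layer; per-layer
                                           # noise adds stochastic detail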
DragGAN performs its optimization on the latent vector 𝒘, so that the handle points move closer to their target points.
To move a handle point 𝒑𝑖 to its target 𝒕𝑖, DragGAN supervises a small region around 𝒑𝑖 to move towards 𝒕𝑖 with the following motion supervision loss:

L = Σ𝑖 Σ_{𝒒𝑖 ∈ Ω1(𝒑𝑖, 𝑟1)} ‖F(𝒒𝑖 + 𝒅𝑖) − F(𝒒𝑖)‖1 + 𝜆 ‖(F − F0) · (1 − M)‖1

where F(𝒒) denotes the feature values of F at pixel 𝒒, Ω1(𝒑𝑖, 𝑟1) is the set of pixels within distance 𝑟1 of 𝒑𝑖, 𝒅𝑖 is a normalized vector pointing from 𝒑𝑖 to 𝒕𝑖 (𝒅𝑖 = 0 if 𝒕𝑖 = 𝒑𝑖), F0 is the feature map corresponding to the initial image, M is a binary mask marking the user-specified editable region, and 𝜆 weights the term that keeps the unmasked region unchanged. Note that the first term is summed over all handle points {𝒑𝑖}. As the components of 𝒒𝑖 + 𝒅𝑖 are not integers, F(𝒒𝑖 + 𝒅𝑖) is obtained via bilinear interpolation; gradients are not back-propagated through F(𝒒𝑖), so minimizing the loss moves the features at 𝒒𝑖 a small step towards 𝒕𝑖 rather than the other way around.
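A minimal PyTorch sketch of this loss for a single handle point may look as follows; the tensor shapes and patch construction are illustrative assumptions, not the paper's exact implementation:

import torch
import torch.nn.functional as nnf

def motion_supervision_loss(F, F0, p, t, mask, r1=3, lam=20.0):
    # F, F0: current / initial feature maps, shape [1, C, H, W].
    # p, t: handle and target points as float tensors (x, y).
    # mask: [1, 1, H, W], 1 inside the user-editable region.
    _, _, H, W = F.shape
    d = (t - p) / (torch.norm(t - p) + 1e-8)         # unit vector from p to t
    # Collect all pixels q within distance r1 of the handle point p.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    coords = torch.stack([xs, ys], dim=-1).float()   # [H, W, 2] as (x, y)
    q = coords[(coords - p).norm(dim=-1) <= r1]      # [N, 2] patch around p
    # F(q) at integer positions, detached so gradients flow only through F(q+d).
    F_q = F.detach()[0, :, q[:, 1].long(), q[:, 0].long()]        # [C, N]
    # F(q + d) at non-integer positions via bilinear interpolation (grid_sample
    # expects coordinates normalized to [-1, 1]).
    g = q + d
    g = torch.stack([2 * g[:, 0] / (W - 1) - 1, 2 * g[:, 1] / (H - 1) - 1], -1)
    F_qd = nnf.grid_sample(F, g.view(1, -1, 1, 2), align_corners=True)[0, :, :, 0]
    motion = (F_qd - F_q).abs().sum()                # drag the patch along d
    keep = ((F - F0) * (1 - mask)).abs().mean()      # pin the unmasked region
    return motion + lam * keep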
References
Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. 2022. Efficient Geometry-aware 3D Generative Adversarial Networks. In CVPR.
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and Improving the Image Quality of StyleGAN. In CVPR.
Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. 2023. Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold. In ACM SIGGRAPH. https://arxiv.org/abs/2305.10973