Scaling up GANs for Text-to-Image Synthesis

1POSTECH, 2Carnegie Mellon University, 3Adobe Research
in CVPR 2023 (Highlight)

GigaGAN: Large-scale GAN for Text-to-Image Synthesis

Can GANs also be trained on a large dataset for a general text-to-image synthesis task? We present our 1B-parameter GigaGAN, achieving lower FID than Stable Diffusion v1.5, DALL·E 2, and Parti-750M. It generates 512px outputs in 0.13 seconds, orders of magnitude faster than diffusion and autoregressive models, and it inherits the disentangled, continuous, and controllable latent space of GANs. We also train a fast upsampler that can generate 4K images from the low-res outputs of text-to-image models.

Disentangled Prompt Interpolation

GigaGAN comes with a disentangled, continuous, and controllable latent space.
In particular, it can achieve layout-preserving fine style control by applying a different prompt at fine scales.


Changing texture with prompting. At coarse layers, we use the prompt "A teddy bear on tabletop" to fix the layout. Then at fine layers, we use "A teddy bear with the texture of [fleece, crochet, denim, fur] on tabletop".

Changing style with prompting. At coarse layers, we use the prompt "A mansion" to fix the layout. Then at fine layers, we use "A [modern, Victorian] mansion in [sunny day, dramatic sunset]".
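
One way to picture this control, as a hedged sketch rather than the actual GigaGAN interface: the cross-attention at coarse layers sees the embedding of the layout prompt, while fine layers see the embedding of the re-texturing prompt. The helper below simply builds that per-layer conditioning list; the layer count and split point are illustrative.

  import torch

  def per_layer_prompts(emb_layout, emb_texture, num_layers=14, crossover=8):
      """Coarse layers attend to the layout prompt; fine layers attend to the texture/style prompt."""
      return [emb_layout if i < crossover else emb_texture for i in range(num_layers)]

  emb_layout = torch.randn(1, 77, 768)    # e.g. "A teddy bear on tabletop"
  emb_texture = torch.randn(1, 77, 768)   # e.g. "A teddy bear with the texture of crochet on tabletop"
  layer_embs = per_layer_prompts(emb_layout, emb_texture)
  print(len(layer_embs))   # one text embedding per synthesis layer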

Upscaling to 16-megapixel photos with GigaGAN

Our GigaGAN framework can also be used to train an efficient, higher-quality upsampler. This upsampler can be applied to real images or to the outputs of other text-to-image models such as diffusion models. GigaGAN can synthesize ultra-high-resolution 4K images in 3.66 seconds.
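
As a rough illustration of what a GAN-based upsampling stage looks like (a toy sketch only, not the GigaGAN upsampler; text conditioning and the asymmetric design described in the paper are omitted), the module below maps a low-resolution image to a 4x larger one using pixel-shuffle upsampling blocks.

  import torch
  import torch.nn as nn

  class ToyUpsampler(nn.Module):
      """Toy 4x super-resolution generator: low-res image in, higher-res image out."""

      def __init__(self, ch=64):
          super().__init__()
          self.inp = nn.Conv2d(3, ch, 3, padding=1)
          # Two pixel-shuffle stages, each doubling the spatial resolution.
          self.up = nn.Sequential(
              nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2), nn.LeakyReLU(0.2),
              nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2), nn.LeakyReLU(0.2),
          )
          self.out = nn.Conv2d(ch, 3, 3, padding=1)

      def forward(self, low_res):
          return torch.tanh(self.out(self.up(self.inp(low_res))))

  up = ToyUpsampler()
  print(up(torch.randn(1, 3, 128, 128)).shape)   # torch.Size([1, 3, 512, 512])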

Abstract

The recent success of text-to-image synthesis has taken the world by storm and captured the general public's imagination. From a technical standpoint, it also marked a drastic change in the favored architecture to design generative image models. GANs used to be the de facto choice, with techniques like StyleGAN. With DALL·E 2, auto-regressive and diffusion models became the new standard for large-scale generative models overnight. This rapid shift raises a fundamental question: can we scale up GANs to benefit from large datasets like LAION? We find that naïvely increasing the capacity of the StyleGAN architecture quickly becomes unstable. We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating GANs as a viable option for text-to-image synthesis. GigaGAN offers three major advantages. First, it is orders of magnitude faster at inference time, taking only 0.13 seconds to synthesize a 512px image. Second, it can synthesize high-resolution images, for example, 16-megapixel images in 3.66 seconds. Finally, GigaGAN supports various latent space editing applications such as latent interpolation, style mixing, and vector arithmetic operations.

GigaGAN architecture

GigaGAN generator

The GigaGAN generator consists of a text encoding branch, a style mapping network, and a multi-scale synthesis network, augmented with stable attention and adaptive kernel selection. In the text encoding branch, we first extract text embeddings using a pretrained CLIP model and learned attention layers T. The embedding is passed to the style mapping network M to produce the style vector w, similar to StyleGAN. The synthesis network then uses the style code as modulation and the text embeddings as attention to produce an image pyramid. Furthermore, we introduce sample-adaptive kernel selection to adaptively choose convolution kernels based on the input text conditioning.
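
To make the kernel-selection step concrete, here is a minimal PyTorch sketch of the idea (not the released implementation): each layer keeps a bank of candidate convolution kernels, a small linear head predicts softmax weights over the bank from the style vector w, and the resulting per-sample kernel is applied with a grouped convolution. Module names, the bank size, and the omission of weight modulation/demodulation are all simplifications.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class AdaptiveKernelConv(nn.Module):
      """Sketch of sample-adaptive kernel selection conditioned on the style vector w."""

      def __init__(self, in_ch, out_ch, w_dim, bank_size=8, ksize=3):
          super().__init__()
          # Bank of candidate kernels: (bank, out_ch, in_ch, k, k).
          self.bank = nn.Parameter(0.02 * torch.randn(bank_size, out_ch, in_ch, ksize, ksize))
          # Predicts softmax logits over the bank from the style vector.
          self.selector = nn.Linear(w_dim, bank_size)
          self.padding = ksize // 2

      def forward(self, x, w):
          B, C, H, W = x.shape
          probs = F.softmax(self.selector(w), dim=-1)                  # (B, bank)
          # Per-sample kernel = probability-weighted sum over the bank.
          kernel = torch.einsum("bk,koixy->boixy", probs, self.bank)   # (B, out, in, k, k)
          out_ch = kernel.shape[1]
          # Grouped convolution so each sample in the batch uses its own kernel.
          x = x.reshape(1, B * C, H, W)
          kernel = kernel.reshape(B * out_ch, C, *kernel.shape[-2:])
          y = F.conv2d(x, kernel, padding=self.padding, groups=B)
          return y.reshape(B, out_ch, H, W)

  conv = AdaptiveKernelConv(in_ch=64, out_ch=64, w_dim=128)
  x, w = torch.randn(4, 64, 32, 32), torch.randn(4, 128)
  print(conv(x, w).shape)   # torch.Size([4, 64, 32, 32])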

GigaGAN discriminator

Similar to the generator, our discriminator consists of two branches for processing the image and the text conditioning. The text branch processes the text similarly to the generator. The image branch receives an image pyramid and makes independent predictions for each image scale. Moreover, predictions are made not only at the input resolution but at all subsequent scales of the downsampling layers. We also employ additional losses to encourage effective convergence. Please see our paper for full details.
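
As a toy illustration of the multi-scale prediction scheme (not the released architecture; text conditioning and the auxiliary losses are omitted), the sketch below runs a small convolutional head on each level of an image pyramid and returns one patch-wise real/fake logit map per scale.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MultiScaleDiscriminator(nn.Module):
      """Toy discriminator that predicts real/fake at several pyramid levels."""

      def __init__(self, in_ch=3, base_ch=64, num_scales=3):
          super().__init__()
          # One small convolutional head per pyramid level.
          self.heads = nn.ModuleList([
              nn.Sequential(
                  nn.Conv2d(in_ch, base_ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(base_ch, base_ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(base_ch, 1, 3, padding=1),   # patch-wise logits
              )
              for _ in range(num_scales)
          ])

      def forward(self, img):
          logits = []
          x = img
          for head in self.heads:
              logits.append(head(x))        # prediction at this scale
              x = F.avg_pool2d(x, 2)        # next (coarser) pyramid level
          return logits

  disc = MultiScaleDiscriminator()
  preds = disc(torch.randn(2, 3, 256, 256))
  print([p.shape for p in preds])   # one logit map per scale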

Latent space editing applications

Prompt interpolation

GigaGAN enables smooth interpolation between prompts, as shown in the interpolation grid. The four corners are generated from the same latent z but with different text prompts.
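
A hedged sketch of how such a grid can be produced: the latent z is held fixed while the two prompt embeddings are linearly blended. The generate callable below is a hypothetical stand-in for a text-conditional generator; a tiny stub keeps the snippet runnable.

  import torch

  def prompt_interpolation(generate, z, emb_a, emb_b, steps=5):
      """Generate a row of images from prompt A to prompt B with a fixed latent z."""
      images = []
      for alpha in torch.linspace(0.0, 1.0, steps):
          emb = torch.lerp(emb_a, emb_b, alpha)    # blend the two prompt embeddings
          images.append(generate(z, emb))
      return images

  # Stub generator so the sketch runs end to end; a real model would return an image tensor.
  generate = lambda z, emb: torch.tanh(z[:, :3, None, None] + emb.mean())
  z = torch.randn(1, 128)
  emb_a, emb_b = torch.randn(1, 77, 768), torch.randn(1, 77, 768)
  print(len(prompt_interpolation(generate, z, emb_a, emb_b)))   # 5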


Disentangled prompt mixing

GigaGAN retains a disentangled latent space, enabling us to combine the coarse style of one sample with the fine style of another. Moreover, GigaGAN can directly control the style with text prompts.

Interpolation end reference image.

Coarse-to-fine style swapping

Our GAN-based architecture retains a disentangled latent space, enabling us to blend the coarse style of one sample with the fine style of another.
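
In a StyleGAN-like synthesis stack, this amounts to routing per-layer style codes: the first (coarse) layers take w from sample A and the remaining (fine) layers take w from sample B. The sketch below is illustrative only; the layer count and crossover point are arbitrary.

  import torch

  def mix_styles(w_a, w_b, num_layers=14, crossover=6):
      """Per-layer styles: w_a drives layout (coarse layers), w_b drives appearance (fine layers)."""
      styles = [w_a if i < crossover else w_b for i in range(num_layers)]
      return torch.stack(styles, dim=1)            # (batch, num_layers, w_dim)

  w_a, w_b = torch.randn(1, 512), torch.randn(1, 512)
  print(mix_styles(w_a, w_b).shape)   # torch.Size([1, 14, 512]), fed layer-wise to the synthesis network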


Related works

  • Nupur Kumari, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Ensembling Off-the-shelf Models for GAN Training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  • Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  • Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 2021.

  • Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and Improving the Image Quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  • Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. In IEEE International Conference on Computer Vision (ICCV), 2017.

  • Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative Adversarial Text to Image Synthesis. In International Conference on Machine Learning (ICML), 2016.

Acknowledgements

We thank Simon Niklaus, Alexandru Chiculita, and Markus Woodson for building the distributed training pipeline. We thank Nupur Kumari, Gaurav Parmar, Bill Peebles, Phillip Isola, Alyosha Efros, and Joonghyuk Shin for their helpful comments. We also want to thank Chenlin Meng, Chitwan Saharia, and Jiahui Yu for answering many questions about their fantastic work. We thank Kevin Duarte for discussions regarding upsampling beyond 4K. Part of this work was done while Minguk Kang was an intern at Adobe Research.

BibTeX

@inproceedings{kang2023gigagan,
  author    = {Kang, Minguk and Zhu, Jun-Yan and Zhang, Richard and Park, Jaesik and Shechtman, Eli and Paris, Sylvain and Park, Taesung},
  title     = {Scaling up GANs for Text-to-Image Synthesis},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023},
}