Unconditionally generating high-quality, coherent, and long videos is a complex task. In a recent paper submitted to the arXiv* preprint server, researchers addressed this challenge by using a pre-trained StyleGAN image generator for frame synthesis and focusing on improved motion generation.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Despite previous efforts, unconditional video generation remains challenging because it requires videos that are high-resolution, temporally coherent, and long. Current methods rely on powerful image generators such as style-based generative adversarial networks (StyleGAN) to secure single-frame quality and then shift their focus to motion generation. However, autoregressive motion models suffer from training limitations and motion collapse when generating long videos. To overcome these issues and produce long, high-resolution videos, the researchers introduced a non-autoregressive motion generation framework built on a learning-based GAN inversion network.
GAN inversion and unconditional video generation techniques
A GAN inversion network is an encoder-decoder network: the encoder's goal is to find the latent vector in a pre-trained GAN's latent space from which the generator can reconstruct the input image. In this work, the motion generator operates in that latent space and uses StyleGAN-generated latents as initial content codes to guide the modulation process.
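As a rough illustration of this encoder-decoder setup, the sketch below pairs a toy convolutional encoder with a stand-in for a frozen, pre-trained StyleGAN2 generator; the module names, dimensions, and reconstruction loss are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of a GAN inversion encoder-decoder loop (illustrative names).
# A convolutional encoder E maps an image to a latent w in the generator's
# latent space; a frozen, pre-trained generator G reconstructs the image from w.
import torch
import torch.nn as nn

LATENT_DIM = 512  # assumed width of the StyleGAN2 latent space

class InversionEncoder(nn.Module):
    def __init__(self, latent_dim=LATENT_DIM):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.to_latent = nn.Linear(128, latent_dim)

    def forward(self, image):
        h = self.features(image).flatten(1)
        return self.to_latent(h)          # predicted latent w

# `pretrained_generator` is only a stand-in for a frozen StyleGAN2 generator.
pretrained_generator = nn.Sequential(nn.Linear(LATENT_DIM, 3 * 256 * 256), nn.Tanh())

encoder = InversionEncoder()
image = torch.randn(1, 3, 256, 256)        # dummy input frame
w = encoder(image)                         # invert: image -> latent
reconstruction = pretrained_generator(w).view(1, 3, 256, 256)
loss = nn.functional.mse_loss(reconstruction, image)   # reconstruction objective
```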
Unconditional video generation seeks to replicate the distribution of real videos and create videos from random noise vectors. A range of techniques has arisen, frequently inspired by the success of GANs in image generation, including video GAN, temporal GAN, and motion-content decomposition with MoCoGAN. Some methods aim to reduce computational cost, while others explore higher-resolution and longer-duration generation. This work draws inspiration from these studies but distinguishes itself with a non-autoregressive motion generator built on GAN inversion, which improves motion consistency and semantics. Diffusion models have also made progress in unconditional video generation, but they struggle to maintain temporal consistency and lag behind GAN-based models in inference speed.
Proposed model: StyleInV
The StyleInV model employs an inversion encoder that takes an input image and produces a vector in the latent space of a pre-trained StyleGAN2 generator, from which the input image can be faithfully reconstructed. Notably, when StyleGAN is trained on video datasets, its latent space clusters by content subject. This clustering effect holds across various video datasets and motivated the design of StyleInV, in which motion latents are generated by modulating a GAN inversion network with temporal styles. The temporal style for a given timestamp consists of two components: a motion code and the latent code of the initial frame.
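A minimal sketch of this modulated-inversion idea is given below, using stand-in encoder and generator callables; the names, latent widths, and the simple concatenation of the initial-frame latent with the motion code are illustrative assumptions, not the paper's exact design.

```python
# High-level sketch of frame generation with a temporal-style modulated encoder.
# The encoder maps the initial frame x0 and a temporal style s_t (motion code plus
# initial-frame latent) to a latent w_t; a frozen StyleGAN2 renders frame t from w_t.
import torch

LATENT_DIM, MOTION_DIM = 512, 128   # assumed widths, for illustration only

def generate_frame(encoder, generator, x0, w0, motion_code):
    """Render the frame at one timestamp from the initial frame and a motion code."""
    s_t = torch.cat([w0, motion_code], dim=-1)   # temporal style: initial latent + motion code
    w_t = encoder(x0, s_t)                       # modulated inversion -> motion latent
    return generator(w_t)                        # pre-trained generator decodes the frame

# Stand-in callables so the sketch runs end to end; real models replace these.
encoder = lambda x0, s_t: torch.randn(x0.shape[0], LATENT_DIM)
generator = lambda w: torch.randn(w.shape[0], 3, 256, 256)

x0 = torch.randn(1, 3, 256, 256)                 # initial frame
w0 = torch.randn(1, LATENT_DIM)                  # its latent from the inversion encoder
frame_t = generate_frame(encoder, generator, x0, w0, torch.randn(1, MOTION_DIM))
```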
The researchers computed a dynamic embedding of each timestamp with an acyclic positional encoding (APE) module adopted from previous work; the embedding is designed so that the zero-timestamp embedding remains stable. The latent code of the initial frame is concatenated with the motion code to drive a content-adaptive affine transformation, and the resulting temporal styles are injected into the inversion encoder through adaptive instance normalization (AdaIN) layers. The modulated encoder is therefore a function of the initial frame and the timestamp: given the first frame and a target timestamp, it outputs the latent of the corresponding frame.
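The following sketch shows one plausible form of such an AdaIN-style injection, in which the temporal style drives a content-adaptive scale and shift over the encoder's feature maps; the layer name, dimensions, and the `1 + scale` parameterization are assumptions for illustration only.

```python
# Minimal AdaIN-style modulation layer: the temporal style (motion code concatenated
# with the initial-frame latent) predicts a per-channel scale and shift that are
# applied to instance-normalized encoder features.
import torch
import torch.nn as nn

class AdaINModulation(nn.Module):
    def __init__(self, style_dim: int, num_channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * num_channels)  # content-adaptive affine transform

    def forward(self, features: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        scale, shift = self.affine(style).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return (1 + scale) * self.norm(features) + shift

layer = AdaINModulation(style_dim=512 + 128, num_channels=256)
feat = torch.randn(1, 256, 16, 16)            # intermediate encoder feature map
s_t = torch.randn(1, 512 + 128)               # temporal style for timestamp t
modulated = layer(feat, s_t)                  # temporal style injected via AdaIN
```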
During training, a raw inversion encoder is first trained on all video frames. This network then serves as the initialization for the convolutional layers of the StyleInV encoder, while the remaining parameters are initialized randomly.
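A minimal sketch of this initialization scheme, assuming standard PyTorch state_dict handling and purely illustrative toy modules, might look as follows.

```python
# Copy the convolutional weights of a pre-trained inversion encoder into a StyleInV
# encoder; newly added modulation parameters keep their random initialization.
import torch
import torch.nn as nn

def init_from_inversion_encoder(styleinv_encoder: nn.Module, inversion_encoder: nn.Module):
    """Copy every pre-trained tensor whose name and shape match; leave the rest random."""
    pretrained = inversion_encoder.state_dict()
    target = styleinv_encoder.state_dict()
    transferable = {
        name: weight for name, weight in pretrained.items()
        if name in target and target[name].shape == weight.shape
    }
    target.update(transferable)              # overwrite only the shared backbone weights
    styleinv_encoder.load_state_dict(target)
    return sorted(transferable)              # layers initialized from the pre-trained encoder

# Toy usage: the StyleInV encoder shares a conv backbone and adds extra parameters.
raw_encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))
styleinv_encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.Linear(640, 512))
print(init_from_inversion_encoder(styleinv_encoder, raw_encoder))
```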
Lastly, the proposed encoder-decoder framework uses the inversion encoder as the encoder and a pre-trained StyleGAN as the generator. This configuration allows the generator to be fine-tuned for different styles while retaining the motion generation capability. Fine-tuning is performed on an image dataset while certain components are kept fixed, which preserves the distribution of the latent space; perceptual and identity losses are employed to reduce artifacts and preserve identity. The style-transferred video retains the motion pattern of the original model while adopting the new style. Importantly, this fine-tuning process is independent of the video generation training.
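The sketch below outlines what one such fine-tuning step could look like, with the encoder frozen and the generator updated under a combined perceptual and identity objective; the loss networks are replaced by MSE stand-ins here, and the weighting, shapes, and module names are illustrative assumptions.

```python
# One illustrative fine-tuning step for style transfer: the inversion/motion encoder
# stays frozen while the generator is tuned on a styled image dataset.
import torch
import torch.nn as nn

def finetune_step(generator, frozen_encoder, optimizer, styled_images,
                  perceptual_loss, identity_loss, lambda_id=0.1):
    """Tune the generator toward the styled images while keeping the encoder fixed."""
    for p in frozen_encoder.parameters():
        p.requires_grad_(False)                      # latent space and motion model stay fixed

    with torch.no_grad():
        latents = frozen_encoder(styled_images)      # invert styled images to latents
    outputs = generator(latents)                     # regenerate them with the generator being tuned

    loss = perceptual_loss(outputs, styled_images) + lambda_id * identity_loss(outputs, styled_images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with flattened "images" and MSE stand-ins for the real loss networks.
frozen_encoder = nn.Linear(3 * 64 * 64, 512)
generator = nn.Linear(512, 3 * 64 * 64)
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)
styled_images = torch.randn(4, 3 * 64 * 64)
print(finetune_step(generator, frozen_encoder, optimizer, styled_images,
                    nn.functional.mse_loss, nn.functional.mse_loss))
```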
Experiments and results of StyleInV
Four video datasets were used for the experiments: DeeperForensics (a large-scale dataset for real-world face forgery detection), SkyTimelapse, FaceForensics (a video dataset for human face forgery detection), and TaiChi (a video reconstruction dataset). Different cropping strategies were employed for DeeperForensics and FaceForensics. Four state-of-the-art models served as baselines: MoCoGAN-HD, dynamic-aware implicit GAN (DIGAN), StyleGAN-V, and Long-Video-GAN. The first two were trained with a clip length of 16 frames, consistent with StyleGAN-V's default setting, and an optimized setting was also explored for DIGAN and MoCoGAN-HD on DeeperForensics.
Quantitative evaluation was conducted using the Fréchet Inception Distance (FID) and the Fréchet Video Distance (FVD). The FID results demonstrated competitive quantitative performance for the proposed method, while the qualitative evaluation highlighted motion-consistency issues in the other methods; the proposed method consistently produced stable results across all datasets.
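Both metrics reduce to a Fréchet distance between Gaussian statistics of extracted features (Inception features for FID, video features such as I3D activations for FVD). A minimal sketch of that shared computation, with feature extraction omitted and random stand-in features, is shown below.

```python
# Fréchet distance between two sets of feature statistics, the shared core of FID
# (image features) and FVD (video features). Real evaluations feed in features from
# pre-trained Inception/I3D networks over thousands of samples.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(cov_mean):        # numerical noise can create tiny imaginary parts
        cov_mean = cov_mean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * cov_mean))

# Example with random stand-in features only, to show the call signature.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 64)), rng.normal(size=(256, 64))))
```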
Conclusion
In summary, researchers introduced an innovative approach to unconditional video generation using a pre-trained StyleGAN image generator. The proposed motion generator model, StyleInV, produces latents within the StyleGAN2 latent space by modulating an encoder network, inheriting its informative initial latent priors. It offers non-autoregressive training and supports fine-tuning for style transfer. Extensive experiments showcased StyleInV's superiority in generating long, high-resolution videos, surpassing state-of-the-art benchmarks.