New VAE For Stable Diffusion
The Stable Diffusion architecture uses three components: a diffusion process that denoises random Gaussian noise, an encoder that reduces the input to a lower-dimensional latent space, and a decoder that maps latents back to images, with the encoder and decoder together forming a VAE. The VAE's encoder block compresses the input into its latent representation and passes it to the diffusion model; after reverse diffusion has denoised the samples, the decoder block transforms them from the latent space back into a full-resolution image.
VAE encoder
To generate a new image from a text description, the Stable Diffusion algorithm starts from random noise in the latent-space representation of the image and iteratively removes that noise to produce the desired result. At each de-noising step, embeddings from the text encoder guide the process, steering the model toward an image whose faces and shapes match the prompt.
The Stable Diffusion v1 models are trained to generate images at 512×512 pixels, so the size arguments matter: the height and width must be multiples of eight, because the VAE downsamples by a factor of eight in each spatial dimension. Going well below 512 pixels tends to produce lower-quality images, while going above 512 in both dimensions tends to produce repeated image regions.
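Below is a minimal text-to-image sketch using the Diffusers library; the model id, prompt, and fp16 setting are illustrative assumptions, not a prescribed configuration.

```python
# Minimal text-to-image sketch with diffusers; model id and prompt are examples.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Height and width must be multiples of 8, since the VAE downsamples
# by a factor of 8 in each spatial dimension.
image = pipe(
    "a photograph of an astronaut riding a horse",
    height=512,
    width=512,
    num_inference_steps=50,
).images[0]
image.save("astronaut.png")
```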
Latent representation
Stable Diffusion can be viewed as a form of lossy compression: the VAE encodes images from pixel space into latent space and decodes them back with little visible degradation. For a 512×512 input, the latent representation is a low-resolution 64×64 grid with four channels of floating-point values.
Latent diffusion reduces compute and memory complexity by applying the diffusion process in this lower-dimensional latent space rather than in pixel space. The latent representation is produced by a model consisting of an encoder and a decoder: the encoder transforms the image into the low-dimensional latent representation, while the decoder transforms the latent representation back into the original image.
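A sketch of that round trip with the Diffusers AutoencoderKL follows; the model id and the 512×512 input shape are assumptions for a v1-style checkpoint.

```python
# A sketch of the VAE round trip: image space -> latent space -> image space.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)

image = torch.randn(1, 3, 512, 512)  # stand-in for a normalized RGB image
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # shape (1, 4, 64, 64)
    decoded = vae.decode(latents).sample              # shape (1, 3, 512, 512)
# The full pipeline also multiplies latents by vae.config.scaling_factor.
```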
Variational inference
Variational inference is a statistical technique for approximating an intractable posterior distribution; it is what puts the "variational" in variational autoencoder. In the generation pipeline, a latent seed is used to produce the initial random latent image representation, while the text prompt is separately converted into text embeddings that condition the denoising process.
Stable Diffusion itself is a text-to-image latent diffusion model developed by CompVis, Stability AI, and LAION. It was trained on 512×512 images from the LAION-5B database, the largest freely available multi-modal dataset. The released weights can be loaded through the Diffusers library, with the VAE component modeling the distribution of latents.
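The two inputs described above can be produced directly; a hedged sketch follows, where the model id and tensor shapes are assumptions for SD v1.x.

```python
# Sketch of the two conditioning inputs: a seeded random latent and text embeddings.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# 1. A latent "seed": fixing the generator makes the starting noise reproducible.
generator = torch.Generator("cpu").manual_seed(42)
latents = torch.randn((1, 4, 64, 64), generator=generator)

# 2. The text prompt is encoded separately into embeddings that condition the U-Net.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
tokens = tokenizer(
    "a photograph of an astronaut", padding="max_length",
    max_length=tokenizer.model_max_length, return_tensors="pt",
)
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 768)
```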
CLIP embeddings
Stable Diffusion ships with a safety checker: a classifier that maps generated images into CLIP latent space and compares them against pre-computed embeddings of unsafe concepts (the list appears in Appendix E of the paper). Images whose similarity exceeds the threshold are deemed unsafe and blacked out.
CLIP embeddings have also been used to guide 3D model generation, and newer systems combine a CLIP encoder with a diffusion model to guide the generation process. In recent years CLIP has featured heavily in multimodal AI research, and researchers continue to improve it. To train CLIP, researchers at OpenAI used 400 million image-text pairs; after training, the model can predict which caption belongs to which image. The authors have since released larger versions of CLIP for open-source use.
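In principle, concept matching in CLIP space looks like the following sketch; the labels and file name are placeholders, not the actual unsafe-concept list.

```python
# Sketch of comparing an image against text concepts in CLIP embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")          # placeholder file name
labels = ["a photo of a cat", "a photo of a dog"]  # placeholder concepts
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # scaled cosine similarities
probs = logits.softmax(dim=-1)                 # per-concept probabilities
```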
Score matching
Diffusion model training can be framed as denoising score matching, an approach that combines the variational-autoencoder view of diffusion with the strong sample quality of score-based models. Parameterizing the reverse-process variance as an interpolation between bounds is also more stable than learning the variance directly, and this choice has been reported to significantly improve log-likelihood over models with a fixed variance parameter.
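In practice the score matching objective reduces to predicting the added noise; a minimal sketch of that simplified loss follows, with `model` and `alphas_cumprod` as placeholders for a real U-Net and noise schedule (the learned-variance term is omitted for brevity).

```python
# Simplified denoising objective: epsilon_theta predicts the noise added at a
# random timestep t, trained with mean squared error against the true noise.
import torch
import torch.nn.functional as F

def denoising_loss(model, x0, alphas_cumprod):
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))  # random timesteps
    noise = torch.randn_like(x0)                               # true noise epsilon
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise       # forward diffusion q(x_t | x_0)
    noise_pred = model(x_t, t)                                 # epsilon_theta(x_t, t)
    return F.mse_loss(noise_pred, noise)
```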
The key to Stable Diffusion's efficiency is this low-dimensional latent space, which greatly reduces compute and memory requirements. The autoencoder downsamples each spatial dimension by a factor of eight, so pixel-space diffusion models need roughly 64 times more memory per channel. This is why a 16GB Colab GPU can generate 512×512 images quickly.
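A back-of-the-envelope check of that claim: counting the latent's four channels against RGB's three, the raw element count drops by about 48×, while the spatial reduction alone is 8 × 8 = 64×.

```python
# Memory arithmetic for a 512x512 RGB image versus its (4, 64, 64) latent.
pixel_elems = 3 * 512 * 512   # 786,432 values in pixel space
latent_elems = 4 * 64 * 64    # 16,384 values in latent space
print(pixel_elems / latent_elems)  # 48.0x fewer values (8 * 8 = 64x per channel)
```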
Image quality
Stable Diffusion is an open-source text-to-image diffusion model that can produce photo-realistic images from text inputs. During training, the model learns to de-noise corrupted latents: its built-in U-Net is trained to predict the noise that was added at each step, which is what lets coherent structure, such as faces and shapes, emerge from pure noise at inference time.
To recap, the Stable Diffusion architecture comprises three parts: an encoder, a decoder, and a reverse diffusion process. The encoder maps the input to the lower-dimensional latent space, the reverse diffusion process denoises latents step by step, and the decoder transforms the final latent representation back into an image.
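The reverse diffusion loop at the heart of the pipeline can be sketched as follows; the scheduler choice, the 50-step count, and the random placeholder embeddings are assumptions (a real run feeds CLIP text embeddings).

```python
# A hedged sketch of the reverse diffusion loop for an SD v1.x checkpoint.
import torch
from diffusers import UNet2DConditionModel, PNDMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
scheduler = PNDMScheduler.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="scheduler"
)

scheduler.set_timesteps(50)
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma
text_embeddings = torch.randn(1, 77, 768)  # placeholder; a real run uses CLIP output

for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
# `latents` is then passed to the VAE decoder to produce the final image.
```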