Stable Diffusion is an open-source neural network that generates images from a text description or modifies an existing image. It was released in 2022. Representatives of CompVis, Runway, EleutherAI, and LAION worked on it together.
Stable Diffusion architecture
It uses a latent diffusion model (LDM), a kind of diffusion model (DM), developed by the CompVis team at LMU Munich. Diffusion models were first presented in 2015. They are trained to remove successive applications of Gaussian noise from training images, which can be thought of as a sequence of denoising autoencoders.
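To make "successive applications of Gaussian noise" concrete, here is a minimal sketch of the noising side of the process on a toy four-pixel "image". The function names and the linear noise schedule are assumptions chosen for illustration, not the schedule Stable Diffusion actually uses. The model is trained to undo steps like these.

```python
import math
import random


def add_noise_step(pixels, beta, rng):
    """One noising step: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise."""
    keep = math.sqrt(1.0 - beta)
    spread = math.sqrt(beta)
    return [keep * p + spread * rng.gauss(0.0, 1.0) for p in pixels]


def forward_diffusion(pixels, betas, seed=0):
    """Return the original pixels followed by each progressively noisier version."""
    rng = random.Random(seed)
    trajectory = [list(pixels)]
    for beta in betas:
        trajectory.append(add_noise_step(trajectory[-1], beta, rng))
    return trajectory


# Toy "image": four pixel values scaled to the range [-1, 1].
image = [1.0, -1.0, 0.5, -0.5]

# Hypothetical linear schedule: ten steps, noise level rising from 0.05 to 0.5.
betas = [0.05 + 0.05 * i for i in range(10)]

trajectory = forward_diffusion(image, betas)

# Fraction of the original signal amplitude that survives all ten steps.
signal_kept = math.prod(math.sqrt(1.0 - b) for b in betas)
```

Each step keeps a little less of the original image and adds a little more noise, so after enough steps almost nothing of the picture remains. Generation runs the learned reverse of this chain, starting from pure noise.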
Stable Diffusion consists of:
- Variational autoencoder (VAE) – compresses the image from pixel space into a lower-dimensional latent space.
- U-Net – removes noise step by step during backward (reverse) diffusion to obtain a clean latent representation.
- Text encoder (optional) – converts the prompt into a representation that guides the denoising.
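To give a feel for how much the VAE shrinks the data the U-Net has to process, here is a small sketch that computes the latent size for a given image size. The downsampling factor of 8 and the 4 latent channels are assumptions based on how the Stable Diffusion v1 autoencoder is commonly described; the article itself does not state them, and the helper names are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    channels: int
    height: int
    width: int


def latent_shape(image_height, image_width, downsample=8, latent_channels=4):
    """Shape of the latent the U-Net works on (assumed factor 8, 4 channels)."""
    if image_height % downsample or image_width % downsample:
        raise ValueError("image sides must be multiples of the downsample factor")
    return LatentShape(
        latent_channels,
        image_height // downsample,
        image_width // downsample,
    )


def compression_ratio(image_height, image_width, image_channels=3):
    """How many pixel-space values correspond to one latent value."""
    latent = latent_shape(image_height, image_width)
    pixel_values = image_channels * image_height * image_width
    latent_values = latent.channels * latent.height * latent.width
    return pixel_values / latent_values


shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
print(shape, ratio)
```

Under these assumptions a 512x512 RGB image becomes a 4x64x64 latent, which is why denoising in the latent space is so much cheaper than denoising the pixels directly.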
How it works
Open the program. In the “Enter your prompt” field, type a description of the image, for example, “Delicate blooming flower. Rich colors. Photo for Instagram”. After that, click “Generate image”.
The finished result is a set of 4 images that differ from each other in certain details. Generation usually takes 2 to 3 minutes. If you want a different number of images, you can change it in the “Advanced options” window.
From time to time, you may see the message “This application is too busy! Try again soon”. This happens because the neural network is popular and the service has a large number of visitors.
Let’s take a closer look at what each field is responsible for when creating an image:
Number of images. How many images to generate per request. You can choose any of the offered values.
Steps. The number of denoising steps the AI takes to generate the result. The default setting is 30-50. If you are satisfied with most of the image but not with one detail, for example the eyes, it is better to describe that detail more precisely in the prompt than to increase the number of steps.
Creativity (Guidance Scale). Controls how closely the AI follows what you wrote. As a rough guide: 2-6 – the AI has a lot of freedom and may largely ignore the prompt, 7-11 – it follows the prompt only partially, 12-15 – it tries to use all your text, 16+ – it follows the prompt as literally as it can.
Resolution. The dimensions of the resulting image in pixels.
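The Guidance Scale field is easier to understand with a small calculation. Many Stable Diffusion implementations apply it through a technique called classifier-free guidance: at every step the model predicts the noise twice, once with the prompt and once without it, and the scale decides how far to push the result toward the prompted version. The article does not say how this particular interface implements the setting, so treat the sketch below as an illustration of the general idea; the function name and the toy numbers are made up.

```python
def apply_guidance(uncond_noise, cond_noise, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), per element."""
    if len(uncond_noise) != len(cond_noise):
        raise ValueError("both predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(uncond_noise, cond_noise)
    ]


# Toy noise predictions for three latent values (illustrative numbers only).
uncond = [0.0, 1.0, -1.0]   # prediction made without the prompt
cond = [0.5, 1.0, -2.0]     # prediction made with the prompt

for scale in (0.0, 1.0, 7.5, 15.0):
    print(scale, apply_guidance(uncond, cond, scale))
```

With a scale of 0 the prompt is ignored entirely, with 1 the prompted prediction is used as is, and larger values exaggerate the difference the prompt makes. That is why low settings feel "creative" and high settings feel literal.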
Features of generating an image from text
Text-to-image generation in this neural network is called “txt2img”. It takes the text as a prompt and combines it with other parameters such as the sampler type, output image dimensions, and seed value.
The image is formed by analyzing and interpreting all the data entered by the user. The generated images carry an invisible digital watermark that makes it possible to identify the result as having been made in Stable Diffusion. However, the watermark loses its effectiveness if the image is changed, for example resized.
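One way to picture a txt2img request is as a small bundle of settings. The sketch below groups them in a data class and shows why a seed matters: the starting noise is fully determined by the seed and the image size, so repeating a request with the same seed starts from the same noise. The class, field names, default values, and the sampler label are all hypothetical, and the latent size reuses the assumed factor of 8 and 4 channels.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Txt2ImgRequest:
    prompt: str
    sampler: str = "euler"  # hypothetical sampler label
    width: int = 512
    height: int = 512
    seed: int = 0


def initial_noise(request, downsample=8, channels=4):
    """Draw the starting latent noise; it depends only on seed and image size."""
    rng = random.Random(request.seed)
    count = channels * (request.height // downsample) * (request.width // downsample)
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


prompt = "Delicate blooming flower. Rich colors. Photo for Instagram"
first = initial_noise(Txt2ImgRequest(prompt, seed=42))
repeat = initial_noise(Txt2ImgRequest(prompt, seed=42))
other = initial_noise(Txt2ImgRequest(prompt, seed=43))

print(first == repeat)  # same seed, same starting noise
print(first == other)   # different seed, different starting noise
```

In practice this means that keeping the seed fixed while editing the prompt lets you compare the effect of wording changes, while changing the seed gives a fresh variation of the same prompt.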
Modify the finished image
This useful option, commonly called “img2img”, lets the user upload an image that serves as the basis for further generation.
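How far the result drifts from the uploaded image is usually controlled by a single setting. The sketch below assumes a `strength` value between 0 and 1, a name borrowed from the convention of the Hugging Face diffusers img2img pipeline; the article does not describe what control this particular interface offers. The idea is that the uploaded image is only partly noised, so only part of the denoising steps are run.

```python
def img2img_steps(num_steps, strength):
    """Number of denoising steps actually run for a given strength (0 to 1)."""
    if num_steps < 0:
        raise ValueError("num_steps must not be negative")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_steps * strength), num_steps)


# With 40 steps configured, see how many are run at each strength.
for strength in (0.0, 0.25, 0.5, 1.0):
    print(strength, img2img_steps(40, strength))
```

A low strength keeps the result close to the uploaded picture, while a strength of 1 runs every step and leaves little of the original behind.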
Thus, the Stable Diffusion neural network gives any user a chance to feel like an artist and create great digital art.