How artificial intelligence generates images. ML Engineer Explains

Generative adversarial networks and their shortcomings

Just a few years ago, the state-of-the-art models in image generation tasks were generative adversarial networks (GANs), which were proposed in 2014 by Goodfellow et al. and have been significantly improved over the past nine years. For example, the 2021 StyleGAN 3 retains facial details exactly, even when the face shifts and turns, whereas its predecessors generated "noisy" details in such cases, for example in hair, beards, or patterns on clothes. Professionals and enthusiasts alike marveled at how well GANs could generate photos of non-existent people, animals, or apartments.

However, due to their competitive (adversarial) nature, GAN models are very unstable in training, and the images they generate show limited variety. In addition, they are poorly suited to generating images from text, although examples of this exist.

Results of Image Generation by StyleGAN 3 Model

The boom in diffusion models

Diffusion models, by contrast, produce sufficiently varied images and are quite stable in training. Their main disadvantage is the speed of training and generation. Dozens or even hundreds of GPUs are needed to train a model, and generating an image with an already trained model takes several seconds, unlike a GAN, where generation takes tens of milliseconds.

Generation results from the diffusion model of Ho et al

The boom around diffusion models is fueled by the release of large generative text-to-image models. Surely many readers have seen the results generated by DALL·E 2, Midjourney, Imagen or Stable Diffusion. Some artists and illustrators worry that neural networks will take away their work, while others believe that this will only help in the creative process. Programmers and artists master prompt engineering - the art of selecting text to get more accurate generation results - and share interesting prompts and no less interesting results.

Lofi alien invasion to relax and study to (Midjourney neural network)

17th century painting of The Beatles (Stable Diffusion 2.1 model)

A dragon fruit wearing karate belt in the snow (Imagen model)

How do diffusion models work?

Diffusion models are iterative models that accept random noise as input. To begin, consider the most basic diffusion model, DDPM (Denoising Diffusion Probabilistic Model), presented by Ho et al. This model is trained step by step on a dataset of hundreds of thousands of images: at each step, random noise of a known strength is applied to an image from the dataset, and the model learns to reverse this noise, thus improving image quality. If we then apply the trained model iteratively to a picture of completely random noise, reversing a "weak" noise at each step, the model can generate a completely new image, gradually getting rid of the random noise. This process is called reverse diffusion.
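The training and generation loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (make_schedule, add_noise, training_loss, sample) are made up for this example, predict_noise stands in for a trained neural network, and the linear noise schedule with 1000 steps is assumed as a typical DDPM setting.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed here): beta_t is the noise added at step t."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # fraction of the original signal left at step t
    return betas, alphas, alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Forward diffusion: jump from a clean image x0 straight to its noisy version x_t."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

def training_loss(predict_noise, x0, t, alpha_bars, rng):
    """The model is trained to guess which noise was added to the image."""
    xt, noise = add_noise(x0, t, alpha_bars, rng)
    return np.mean((predict_noise(xt, t) - noise) ** 2)

def sample(predict_noise, shape, betas, alphas, alpha_bars, rng):
    """Reverse diffusion: start from pure noise and remove a little noise per step."""
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        # Mean of the previous, slightly cleaner image given the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # Every step except the last re-injects a small amount of fresh noise.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

In a real system predict_noise would be a large neural network (typically a U-Net) trained by minimizing training_loss over many images and random values of t.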

Illustration of the basic diffusion process (from the CVPR 2022 tutorial)

The random noise from which an image is generated can be combined with a condition: a requirement for the result, expressed as text or as another example image. First, let's look at an example from the SDEdit article, where the user specifies a rough drawing. This drawing is then noised to the point where it cannot be distinguished from, for example, a noised photograph, and then an iterative reverse diffusion process is applied, which restores a high-quality image based on the picture provided.
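The idea of noising a user-provided picture and then denoising it can be sketched as follows. This is a simplified illustration under stated assumptions: it uses DDPM-style discrete steps rather than the SDE solver of the SDEdit article, the function name sdedit and the parameter t0 are made up for this example, and predict_noise stands in for a trained denoising network.

```python
import numpy as np

def sdedit(guide, predict_noise, betas, t0, rng):
    """Noise the guide picture up to step t0, then denoise from t0 back to step 0."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    # Partially destroy the guide: a large t0 hides more of the drawing
    # (more freedom for the model), a small t0 keeps the result close to it.
    x = (np.sqrt(alpha_bars[t0]) * guide
         + np.sqrt(1.0 - alpha_bars[t0]) * rng.standard_normal(guide.shape))
    # Ordinary reverse diffusion, but started from the noised guide
    # instead of pure random noise.
    for t in range(t0, -1, -1):
        eps = predict_noise(x, t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(guide.shape)
    return x
```

The choice of t0 is the main trade-off in this scheme: it balances faithfulness to the user's drawing against the realism of the generated image.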

An illustration of the pattern-driven diffusion process (from the SDEdit article)

Another way to direct the generation toward the desired result is to condition the model on text. To do this, models trained on pairs of images and their captions are used, which are able to understand the meaning of images and texts at the same time. An example of such a model is CLIP (Contrastive Language-Image Pre-training), released by OpenAI. This model is able to map images and texts into a common latent vector space (where a vector is just a column of some values). In this space it becomes possible, for example, to find the images nearest to a text query, since this is just an algebraic operation on vectors.
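To show why search in a common vector space is "just an algebraic operation", here is a toy sketch. The three-dimensional vectors are invented by hand for illustration; in practice they would come from the image and text encoders of a model such as CLIP and have hundreds of dimensions. The function names are also made up for this example.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length so that a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def nearest_images(text_vec, image_vecs, k=1):
    """Return indices and similarities of the k images closest to the text query."""
    sims = normalize(image_vecs) @ normalize(text_vec)
    order = np.argsort(-sims)[:k]  # highest similarity first
    return order, sims[order]

# Hand-made stand-ins for encoder outputs (not real CLIP embeddings).
image_vecs = np.array([
    [0.9, 0.1, 0.0],  # imagine: a photo of a cat
    [0.1, 0.9, 0.1],  # imagine: a photo of a dog
    [0.0, 0.2, 0.9],  # imagine: a photo of a car
])
text_vec = np.array([0.8, 0.2, 0.1])  # imagine: the query "a cat"

indices, scores = nearest_images(text_vec, image_vecs, k=2)
print(indices)  # the cat photo (index 0) ranks first, the dog photo second
```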

The Latent Diffusion Model, introduced in 2021, conditions a model on a vector space of texts to generate images from noise in a directed way. This model uses the properties of the common latent space of texts and images. Stable Diffusion, Imagen and other large text-to-image neural networks work on this principle.

Another important technique that improves generation quality, used in training conditioned diffusion models, is classifier-free guidance. In simple terms, the higher the value of the classifier-free guidance parameter, the more closely the result matches the text query, which often translates into less variability in the results.
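A rough sketch of how classifier-free guidance is commonly implemented is given below. It assumes the convention in which a scale of 1 means plain conditional generation and 0 means the text is ignored (some papers parameterize the scale differently). The function names are made up for this example, and the noise predictions would come from the same network called with and without the text.

```python
import numpy as np

def drop_condition(text_emb, null_emb, drop_prob, rng):
    """Training: occasionally replace the text with an "empty" condition,
    so that one network learns both conditional and unconditional denoising."""
    return null_emb if rng.random() < drop_prob else text_emb

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Generation: push the prediction away from the unconditional one
    and toward the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With guidance_scale above 1 the difference between the two predictions is amplified, which is why results follow the text more closely but vary less.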

Problems of diffusion models

Of course, diffusion models are not a universal solution to the problem of image generation. They are still subject to the same problems as GANs: images that look real at first glance have significant flaws. Generated people can have more than five fingers on a hand or more than 32 teeth. These models are also quite bad at generating text in images and even invent their own “language”.

Artists accuse Midjourney and Stability AI (the company that develops Stable Diffusion) of copyright infringement in the preparation of training data: they claim that the companies downloaded images from the Internet without the artists' consent or proper compensation. There is also a lot of talk about how generative networks, including Stable Diffusion, exacerbate negative stereotypes about race, gender, and other social issues, as they are trained on data from the Internet that is known to be biased.

The story of Adam and Eve, Noah, and Zeus in the style of DC Comics (DALL·E 2 model)

How to try for free

Unlike many previous developments in computer vision, which were often only available to programmers, new technologies in the field of diffusion networks can most often be tried by anyone. The general trend towards open source software and the publication of demo versions of neural networks allows startups such as Hugging Face to aggregate many versions of models, such as Stable Diffusion 2.1. They also develop the diffusers library, which is designed to simplify the use of models in code.
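For readers who want to try the diffusers library, a minimal sketch might look like this. It assumes that the diffusers, transformers and torch packages are installed, that an NVIDIA GPU is available, and that the model weights can be downloaded from the Hugging Face Hub on first use. The function name generate_image and its default values are chosen for this example only.

```python
def generate_image(prompt, guidance_scale=7.5, steps=50,
                   model_id="stabilityai/stable-diffusion-2-1"):
    """Generate one image from a text prompt with Stable Diffusion."""
    # Imported inside the function so that defining it does not require
    # the heavy dependencies to be installed.
    import torch
    from diffusers import StableDiffusionPipeline

    # Downloads the weights on first use (several gigabytes).
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")  # assumption: a CUDA-capable GPU is present

    result = pipe(prompt,
                  guidance_scale=guidance_scale,  # classifier-free guidance strength
                  num_inference_steps=steps)      # number of denoising steps
    return result.images[0]  # a PIL image
```

A call such as generate_image("17th century painting of The Beatles").save("beatles.png") would then write the result to disk. Fewer steps make generation faster at some cost in quality.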

The Google Colab service allows you to run code on GPUs and TPUs, so many enthusiasts use it to publish their own versions of models, for example the Disco Diffusion Warp model, which is able to change the style of a video.

There are also convenient interfaces to the models. For example, the Midjourney neural network has a free trial version for several dozen generations, which is enough to try text-to-image models. OpenAI is also providing trial access to the DALL·E 2 model.

What's next

We can confidently say that we are living through the golden era of neural network image generation. The community is looking forward to future products from Google, which has released the non-public diffusion model Imagen and published a large number of articles on image editing and generation, including the use of other artificial intelligence technologies.

New startups in the field of image creation and editing are emerging that successfully compete with such giants as OpenAI or Google. New articles about diffusion models come out almost weekly, and the scope of their application today is not limited to the 2D computer vision tasks listed above: they are used in medical imaging, video generation and text-based 3D tasks.
