Diffusion + super-resolution models, a powerful combination: the technology behind Google's image generator Imagen
2022-07-28 00:23:00 [OpenCV School]
Source: reposted with authorization from the Almost Human (机器之心) official account. This article explains in detail how Imagen works, analyzing its high-level components and the relationships between them.
In recent years, multimodal learning has received growing attention, especially in the two directions of text-to-image synthesis and image-text contrastive learning. Some AI models have attracted wide public attention through their applications in creative image generation and editing, for example OpenAI's successive text-to-image models DALL-E and DALL-E 2, and NVIDIA's GauGAN and GauGAN2.
Google, not wanting to fall behind, released its own text-to-image model Imagen at the end of May, which appears to push the boundary of caption-conditional image generation even further.
Given just a description of a scene, Imagen can produce high-quality, high-resolution images, regardless of whether that scene is plausible in the real world. The figure below shows several examples of images Imagen generated from text, with the corresponding captions displayed below each image.
These impressive generated images naturally make people wonder: how does Imagen actually work?
Recently, AssemblyAI Developer Educator Ryan O'Connor wrote a long article on the company's blog, "How Imagen Actually Works", explaining in detail how Imagen works: it gives an overview of Imagen and analyzes its high-level components and the relationships between them.
An overview of how Imagen works
In this section, the author presents Imagen's overall structure and gives a high-level explanation of how it works; later sections then analyze each of Imagen's components more thoroughly. The animation below shows Imagen's workflow.
First, the caption is fed into a text encoder. This encoder converts the text caption into a numerical representation that encapsulates the semantic information in the text. The text encoder in Imagen is a Transformer encoder, which uses self-attention to ensure that the text encoding captures how the words in the caption relate to one another.
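As a concrete illustration, here is a minimal sketch of encoding a caption with a frozen T5 encoder via the Hugging Face `transformers` library (an assumption of this sketch; Imagen itself uses the much larger frozen T5-XXL, and the checkpoint name and caption below are only placeholders):

```python
# Encode a caption into one contextual embedding vector per token.
# "t5-small" is used here only so the sketch runs quickly.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small").eval()

caption = "a corgi riding a skateboard in times square"
tokens = tokenizer(caption, return_tensors="pt")

with torch.no_grad():
    # Shape (1, seq_len, d_model): one vector per token, contextualized
    # by self-attention over the whole caption.
    text_embeddings = encoder(**tokens).last_hidden_state
print(text_embeddings.shape)
```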
If Imagen attended only to individual words rather than to the associations between them, it could still produce high-quality images capturing the individual elements of a caption, but it could not depict them in a way that properly reflects the caption's semantics. As the example below shows, ignoring the relationships between words leads to a completely different generation result.
While the text encoder produces a useful representation of the caption fed into Imagen, we still need a method for generating an image from that representation, that is, an image generator. For this, Imagen uses a diffusion model, a type of generative model that has gained popularity in recent years thanks to its SOTA performance on multiple tasks.
Diffusion models are trained by corrupting the training data with added noise and then learning to recover the data by reversing this noising process. Given an input image, the diffusion model iteratively corrupts it with Gaussian noise over a series of time steps, until ultimately only Gaussian noise, or "TV static", remains. The figure below shows the iterative noising process of a diffusion model:
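To make this concrete, here is a minimal sketch of the forward (noising) process from DDPM, assuming a linear beta schedule; the schedule values and image shape are illustrative choices, not Imagen's actual settings:

```python
# Forward diffusion: x_t can be sampled from x_0 in closed form.
import torch

T = 1000  # number of diffusion time steps
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0): the image after t noising steps."""
    eps = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps

x0 = torch.rand(1, 3, 64, 64)    # a dummy 64x64 RGB image in [0, 1]
x_noisy = q_sample(x0, t=500)    # heavily corrupted; t=T-1 is near pure noise
```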
The diffusion model then works backward, learning how to isolate and remove the noise at each time step, counteracting the destruction that just occurred. Once trained, the model can be run in reverse: starting from randomly sampled Gaussian noise, it gradually denoises it to generate an image, as detailed in the figure below:
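The loop below sketches this reverse process as standard DDPM ancestral sampling, reusing `T`, `betas`, `alphas`, and `alpha_bars` from the previous snippet; `model` is a hypothetical noise-prediction network such as a U-Net, not Imagen's actual model:

```python
# Reverse diffusion: start from pure noise and denoise step by step.
import torch

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                     # start from Gaussian noise
    for t in reversed(range(T)):
        eps_hat = model(x, torch.tensor([t]))  # predict the noise at step t
        # Mean of p(x_{t-1} | x_t), derived from the predicted noise.
        coef = betas[t] / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise     # no noise added at the last step
    return x
```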
In short, the trained diffusion model starts from Gaussian noise and iteratively generates images that resemble its training images. Obviously, though, there is no control over what image is actually output: we simply feed Gaussian noise into the model, and it outputs a random image that looks as if it belongs to the training dataset.
The goal, however, is to create images that encapsulate the semantic information of the caption given to Imagen, so we need a way to incorporate the caption into the diffusion process. How is this done?
The text encoder described above produces a representative caption encoding, which is in fact a sequence of vectors. To inject this encoded information into the diffusion model, these vectors are aggregated and the diffusion model is conditioned on them. By conditioning on this vector, the diffusion model learns how to adjust its denoising process to generate images that match the caption well. A visualization of the process is shown below:
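One simple way to condition a denoising network on a pooled caption embedding is sketched below; this toy module is an illustrative stand-in for the idea, not Imagen's actual architecture (which, as noted later, also uses cross-attention over the full sequence of text embeddings):

```python
# A toy conditioned denoiser: the caption embedding and time step are
# projected to the hidden width and broadcast-added to the image features.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, text_dim: int = 512, hidden: int = 256, steps: int = 1000):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)  # pooled caption embedding
        self.time_emb = nn.Embedding(steps, hidden)   # learned time-step embedding
        self.conv_in = nn.Conv2d(3, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, 3, 3, padding=1)

    def forward(self, x_t, t, text_emb):
        # Fuse the two conditioning signals and broadcast them over space.
        cond = self.text_proj(text_emb) + self.time_emb(t)  # (B, hidden)
        h = self.conv_in(x_t) + cond[:, :, None, None]
        return self.conv_out(torch.relu(h))                 # predicted noise

model = ConditionedDenoiser()
eps_hat = model(torch.randn(1, 3, 64, 64), torch.tensor([500]), torch.randn(1, 512))
```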
Because the image generator, or base model, outputs a small 64x64 image, upsampling this output to the final 1024x1024 version requires super-resolution models that intelligently upsample the image.
For super-resolution, Imagen once again uses a diffusion model. The overall process is essentially the same as in the base model, except that the model is conditioned not only on the caption encoding but also on the smaller image being upsampled. A visualization of the whole process is shown below:
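A common way to implement this extra conditioning, sketched below, is to upsample the low-resolution image and concatenate it with the noisy high-resolution image along the channel axis; this is an illustrative convention under that assumption, not a description of Imagen's exact layers:

```python
# Build the conditioned input for a diffusion super-resolution denoiser.
import torch
import torch.nn.functional as F

def make_sr_input(x_t_highres: torch.Tensor, x_lowres: torch.Tensor) -> torch.Tensor:
    """Stack the noisy high-res image with the upsampled low-res condition."""
    # Upsample the 64x64 conditioning image to the target resolution.
    upsampled = F.interpolate(x_lowres, size=x_t_highres.shape[-2:], mode="bilinear")
    return torch.cat([x_t_highres, upsampled], dim=1)  # (B, 6, H, W)

x_t = torch.randn(1, 3, 256, 256)   # noisy high-res image at some time step
low = torch.rand(1, 3, 64, 64)      # output of the 64x64 base model
sr_input = make_sr_input(x_t, low)
```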
The output of this super-resolution model is not actually the final output but a medium-sized image. To enlarge the image to the final 1024x1024 resolution, yet another super-resolution model is used. The two super-resolution architectures are roughly the same, so the second will not be described separately. The output of the second super-resolution model is Imagen's final output.
Why is Imagen better than DALL-E 2?
It is difficult to answer exactly why Imagen is better than DALL-E 2. However, a non-negligible part of the performance gap stems from differences in how captions and prompts are handled. DALL-E 2 uses a contrastive objective (essentially CLIP) to determine how relevant a text encoding is to an image. The text and image encoders adjust their parameters so that the cosine similarity of matching caption-image pairs is maximized, while the cosine similarity of mismatched caption-image pairs is minimized.
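The snippet below sketches such a CLIP-style contrastive objective in its usual symmetric cross-entropy form; the embedding dimension, batch size, and temperature are illustrative values, not CLIP's actual training configuration:

```python
# Matching caption-image pairs sit on the diagonal of the similarity matrix;
# a symmetric cross-entropy pushes diagonal similarities up, off-diagonal down.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, image_emb, temperature: float = 0.07):
    # Normalizing makes the dot product equal to cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature   # (B, B) similarities
    targets = torch.arange(len(text_emb))           # caption i matches image i
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```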
A significant part of the performance gap stems from the fact that Imagen's text encoder is much larger than DALL-E 2's and is trained on more data. As evidence for this hypothesis, we can examine how Imagen's performance changes as the text encoder is scaled. Below is a Pareto curve of Imagen's performance:
The effect of scaling up the text encoder is startlingly large, while the effect of scaling up the U-Net is surprisingly small. This result suggests that a relatively simple diffusion model, conditioned on a strong encoding, can produce high-quality results.
Given that the T5 text encoder is much larger than the CLIP text encoder, together with the fact that natural-language training data is necessarily richer than image-caption pairs, much of the performance gap is likely attributable to this difference.
Beyond this, the author also lists several key takeaways about Imagen, including the following:
- Scaling the text encoder is very effective;
- Scaling the text encoder matters more than scaling the U-Net;
- Dynamic thresholding is critical (see the sketch after this list);
- Noise conditioning augmentation is important in the super-resolution models;
- Using cross attention for text conditioning is important;
- An efficient U-Net is crucial.
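As a concrete illustration of the dynamic thresholding point, the sketch below follows the description in the Imagen paper: at each sampling step the predicted clean image is clipped to [-s, s] and rescaled by s, where s is a high percentile of its absolute pixel values; the percentile and tensor shapes here are illustrative:

```python
# Dynamic thresholding: pull saturated pixels (a side effect of large
# guidance weights) back into range without hard-clipping to [-1, 1].
import torch

def dynamic_threshold(x0_hat: torch.Tensor, percentile: float = 0.995) -> torch.Tensor:
    # Per-sample percentile of absolute pixel values.
    s = torch.quantile(x0_hat.abs().flatten(1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)  # only act when s > 1
    return x0_hat.clamp(-s, s) / s                 # clip to [-s, s], rescale by s

x0_hat = 3.0 * torch.randn(2, 3, 64, 64)  # an over-saturated prediction
x0_safe = dynamic_threshold(x0_hat)
```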
These insights provide valuable directions for researchers working on diffusion models, and they are useful well beyond the text-to-image subfield.
Link to the original text :https://www.assemblyai.com/blog/how-imagen-actually-works/