Diffusion + super-resolution models, a powerful combination: the technology behind Google's image generator Imagen
2022-07-28 00:23:00 [OpenCV School]
Source: reposted with authorization from the Almost Human (机器之心) official account. This article explains in detail how Imagen works, analyzing its high-level components and the relationships between them.
In recent years, multimodal learning has received growing attention, especially in the two directions of text-to-image synthesis and image-text contrastive learning. Some AI models have attracted wide public attention through their applications in creative image generation and editing, for example OpenAI's successive text-to-image models DALL-E and DALL-E 2, and NVIDIA's GauGAN and GauGAN2.
Google, not wanting to fall behind, released its own text-to-image model Imagen at the end of May, which appears to push the boundary of caption-conditional image generation even further.
Given just a description of a scene, Imagen can produce high-quality, high-resolution images, regardless of whether that scene is plausible in the real world. The figure below shows several examples of images Imagen generated from text, with the corresponding captions displayed below each image.
These impressive generated images naturally make people wonder: how does Imagen actually work?
Recently, AssemblyAI Developer Educator Ryan O'Connor wrote a long article on the company's blog, "How Imagen Actually Works", explaining in detail how Imagen works: it gives an overview of Imagen and analyzes its high-level components and the relationships between them.
An overview of how Imagen works
In this section, the author presents Imagen's overall structure and gives a high-level explanation of how it works; later sections then analyze each of Imagen's components more thoroughly. The animation below shows Imagen's workflow.
First, the caption is fed into a text encoder. This encoder converts the text caption into a numerical representation that encapsulates the semantic information in the text. The text encoder in Imagen is a Transformer encoder, which uses self-attention to ensure that the text encoding captures how the words in the caption relate to one another.
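As a concrete illustration, here is a minimal sketch of encoding a caption with a frozen T5 encoder via the Hugging Face `transformers` library (an assumption of this sketch; Imagen itself uses the much larger frozen T5-XXL, and the checkpoint name and caption below are only placeholders):

```python
# Encode a caption into one contextual embedding vector per token.
# "t5-small" is used here only so the sketch runs quickly.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small").eval()

caption = "a corgi riding a skateboard in times square"
tokens = tokenizer(caption, return_tensors="pt")

with torch.no_grad():
    # Shape (1, seq_len, d_model): one vector per token, contextualized
    # by self-attention over the whole caption.
    text_embeddings = encoder(**tokens).last_hidden_state
print(text_embeddings.shape)
```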
If Imagen attended only to individual words rather than to the associations between them, it could still produce high-quality images capturing the individual elements of a caption, but it could not depict them in a way that properly reflects the caption's semantics. As the example below shows, ignoring the relationships between words leads to a completely different generation result.
While the text encoder produces a useful representation of the caption fed into Imagen, we still need a method for generating an image from that representation, that is, an image generator. For this, Imagen uses a diffusion model, a type of generative model that has gained popularity in recent years thanks to its SOTA performance on multiple tasks.
Diffusion models are trained by corrupting the training data with added noise and then learning to recover the data by reversing this noising process. Given an input image, the diffusion model iteratively corrupts it with Gaussian noise over a series of time steps, until ultimately only Gaussian noise, or "TV static", remains. The figure below shows the iterative noising process of a diffusion model:
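To make this concrete, here is a minimal sketch of the forward (noising) process from DDPM, assuming a linear beta schedule; the schedule values and image shape are illustrative choices, not Imagen's actual settings:

```python
# Forward diffusion: x_t can be sampled from x_0 in closed form.
import torch

T = 1000  # number of diffusion time steps
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0): the image after t noising steps."""
    eps = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * eps

x0 = torch.rand(1, 3, 64, 64)    # a dummy 64x64 RGB image in [0, 1]
x_noisy = q_sample(x0, t=500)    # heavily corrupted; t=T-1 is near pure noise
```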
The diffusion model then works backward, learning how to isolate and remove the noise at each time step, counteracting the destruction that just occurred. Once trained, the model can be run in reverse: starting from randomly sampled Gaussian noise, it gradually denoises it to generate an image, as detailed in the figure below:
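The loop below sketches this reverse process as standard DDPM ancestral sampling, reusing `T`, `betas`, `alphas`, and `alpha_bars` from the previous snippet; `model` is a hypothetical noise-prediction network such as a U-Net, not Imagen's actual model:

```python
# Reverse diffusion: start from pure noise and denoise step by step.
import torch

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                     # start from Gaussian noise
    for t in reversed(range(T)):
        eps_hat = model(x, torch.tensor([t]))  # predict the noise at step t
        # Mean of p(x_{t-1} | x_t), derived from the predicted noise.
        coef = betas[t] / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise     # no noise added at the last step
    return x
```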
In short, the trained diffusion model starts from Gaussian noise and iteratively generates images that resemble its training images. Obviously, though, there is no control over what image is actually output: we simply feed Gaussian noise into the model, and it outputs a random image that looks as if it belongs to the training dataset.
The goal, however, is to create images that encapsulate the semantic information of the caption given to Imagen, so we need a way to incorporate the caption into the diffusion process. How is this done?
The text encoder described above produces a representative caption encoding, which is in fact a sequence of vectors. To inject this encoded information into the diffusion model, these vectors are aggregated and the diffusion model is conditioned on them. By conditioning on this vector, the diffusion model learns how to adjust its denoising process to generate images that match the caption well. A visualization of the process is shown below:
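One simple way to condition a denoising network on a pooled caption embedding is sketched below; this toy module is an illustrative stand-in for the idea, not Imagen's actual architecture (which, as noted later, also uses cross-attention over the full sequence of text embeddings):

```python
# A toy conditioned denoiser: the caption embedding and time step are
# projected to the hidden width and broadcast-added to the image features.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, text_dim: int = 512, hidden: int = 256, steps: int = 1000):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)  # pooled caption embedding
        self.time_emb = nn.Embedding(steps, hidden)   # learned time-step embedding
        self.conv_in = nn.Conv2d(3, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, 3, 3, padding=1)

    def forward(self, x_t, t, text_emb):
        # Fuse the two conditioning signals and broadcast them over space.
        cond = self.text_proj(text_emb) + self.time_emb(t)  # (B, hidden)
        h = self.conv_in(x_t) + cond[:, :, None, None]
        return self.conv_out(torch.relu(h))                 # predicted noise

model = ConditionedDenoiser()
eps_hat = model(torch.randn(1, 3, 64, 64), torch.tensor([500]), torch.randn(1, 512))
```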
Because the image generator, or base model, outputs a small 64x64 image, upsampling this output to the final 1024x1024 version requires super-resolution models that intelligently upsample the image.
For super-resolution, Imagen once again uses a diffusion model. The overall process is essentially the same as in the base model, except that the model is conditioned not only on the caption encoding but also on the smaller image being upsampled. A visualization of the whole process is shown below:
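A common way to implement this extra conditioning, sketched below, is to upsample the low-resolution image and concatenate it with the noisy high-resolution image along the channel axis; this is an illustrative convention under that assumption, not a description of Imagen's exact layers:

```python
# Build the conditioned input for a diffusion super-resolution denoiser.
import torch
import torch.nn.functional as F

def make_sr_input(x_t_highres: torch.Tensor, x_lowres: torch.Tensor) -> torch.Tensor:
    """Stack the noisy high-res image with the upsampled low-res condition."""
    # Upsample the 64x64 conditioning image to the target resolution.
    upsampled = F.interpolate(x_lowres, size=x_t_highres.shape[-2:], mode="bilinear")
    return torch.cat([x_t_highres, upsampled], dim=1)  # (B, 6, H, W)

x_t = torch.randn(1, 3, 256, 256)   # noisy high-res image at some time step
low = torch.rand(1, 3, 64, 64)      # output of the 64x64 base model
sr_input = make_sr_input(x_t, low)
```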
The output of this super-resolution model is not actually the final output but a medium-sized image. To enlarge the image to the final 1024x1024 resolution, yet another super-resolution model is used. The two super-resolution architectures are roughly the same, so the second will not be described separately. The output of the second super-resolution model is Imagen's final output.
Why is Imagen better than DALL-E 2?
It is difficult to answer exactly why Imagen is better than DALL-E 2. However, a non-negligible part of the performance gap stems from differences in how captions and prompts are handled. DALL-E 2 uses a contrastive objective (essentially CLIP) to determine how relevant a text encoding is to an image. The text and image encoders adjust their parameters so that the cosine similarity of matching caption-image pairs is maximized, while the cosine similarity of mismatched caption-image pairs is minimized.
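The snippet below sketches such a CLIP-style contrastive objective in its usual symmetric cross-entropy form; the embedding dimension, batch size, and temperature are illustrative values, not CLIP's actual training configuration:

```python
# Matching caption-image pairs sit on the diagonal of the similarity matrix;
# a symmetric cross-entropy pushes diagonal similarities up, off-diagonal down.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_emb, image_emb, temperature: float = 0.07):
    # Normalizing makes the dot product equal to cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature   # (B, B) similarities
    targets = torch.arange(len(text_emb))           # caption i matches image i
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```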
A significant part of the performance gap stems from the fact that Imagen's text encoder is much larger than DALL-E 2's and is trained on more data. As evidence for this hypothesis, we can examine how Imagen's performance changes as the text encoder is scaled. Below is a Pareto curve of Imagen's performance:
The effect of scaling up the text encoder is startlingly large, while the effect of scaling up the U-Net is surprisingly small. This result suggests that a relatively simple diffusion model, conditioned on a strong encoding, can produce high-quality results.
Given that the T5 text encoder is much larger than the CLIP text encoder, together with the fact that natural-language training data is necessarily richer than image-caption pairs, much of the performance gap is likely attributable to this difference.
Beyond this, the author also lists several key takeaways about Imagen, including the following:
- Scaling the text encoder is very effective;
- Scaling the text encoder matters more than scaling the U-Net;
- Dynamic thresholding is critical (see the sketch after this list);
- Noise conditioning augmentation is important in the super-resolution models;
- Using cross attention for text conditioning is important;
- An efficient U-Net is crucial.
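As a concrete illustration of the dynamic thresholding point, the sketch below follows the description in the Imagen paper: at each sampling step the predicted clean image is clipped to [-s, s] and rescaled by s, where s is a high percentile of its absolute pixel values; the percentile and tensor shapes here are illustrative:

```python
# Dynamic thresholding: pull saturated pixels (a side effect of large
# guidance weights) back into range without hard-clipping to [-1, 1].
import torch

def dynamic_threshold(x0_hat: torch.Tensor, percentile: float = 0.995) -> torch.Tensor:
    # Per-sample percentile of absolute pixel values.
    s = torch.quantile(x0_hat.abs().flatten(1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)  # only act when s > 1
    return x0_hat.clamp(-s, s) / s                 # clip to [-s, s], rescale by s

x0_hat = 3.0 * torch.randn(2, 3, 64, 64)  # an over-saturated prediction
x0_safe = dynamic_threshold(x0_hat)
```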
These insights provide valuable directions for researchers working on diffusion models, and they are useful well beyond the text-to-image subfield.
Link to the original text :https://www.assemblyai.com/blog/how-imagen-actually-works/