Diffusion + Super-Resolution, a Powerful Combination: The Technology Behind Google's Image Generator Imagen
2022-07-28 00:23:00 【OpenCV School】
Source: official account Almost Human (reposted with authorization). This article explains in detail how Imagen works, giving an overview of its high-level components and the relationships between them.
In recent years, multimodal learning has drawn growing attention, particularly in the two directions of text-to-image synthesis and image-text contrastive learning. Some AI models have attracted wide public attention for their applications in creative image generation and editing, such as OpenAI's successively released text-to-image models DALL-E and DALL-E 2, and NVIDIA's GauGAN and GauGAN2.
Not wanting to fall behind, Google released its own text-to-image model, Imagen, at the end of May, and it appears to push the boundary of caption-conditional image generation even further.
Given only a description of a scene, Imagen can produce high-quality, high-resolution images, regardless of whether that scene is plausible in the real world. The figure below shows several examples of images Imagen generated from text, with the corresponding captions displayed beneath each image.
These impressive generated images make one wonder: how exactly does Imagen work?
Recently, developer educator Ryan O'Connor wrote a long post on the AssemblyAI blog, 《How Imagen Actually Works》, explaining in detail how Imagen works: it gives an overview of Imagen and analyzes its high-level components and the relationships between them.
An Overview of How Imagen Works
In this section, the author presents Imagen's overall structure with a high-level explanation of how it works, then analyzes each of Imagen's components more thoroughly. The animation below shows Imagen's workflow.
First, the caption is fed into a text encoder, which converts the text caption into a numerical representation that encapsulates its semantic meaning. The text encoder in Imagen is a Transformer encoder; using self-attention, it ensures the text encoding captures how the words in the caption relate to one another.
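As a rough illustration (not Imagen's actual pipeline, which relies on a much larger frozen T5 encoder), a caption can be turned into a sequence of contextual vectors with an off-the-shelf T5 encoder; the checkpoint name "t5-small" below is chosen purely to keep the sketch lightweight:

```python
from transformers import T5Tokenizer, T5EncoderModel

# Illustrative only: "t5-small" stands in for Imagen's far larger frozen T5.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

caption = "a brain riding a rocketship heading towards the moon"
tokens = tokenizer(caption, return_tensors="pt")
# One contextual vector per token, shaped (1, seq_len, d_model);
# self-attention makes each vector reflect the surrounding words.
text_encoding = encoder(**tokens).last_hidden_state
```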
If Imagen attended only to individual words rather than to the associations between them, it could still produce high-quality images capturing the individual elements of a caption, but those images would fail to reflect the caption's semantics properly. As the example below shows, ignoring the relationships between words leads to a completely different generation result.
While the text encoder produces a useful representation of the caption fed into Imagen, we still need a way to generate an image from that representation, i.e., an image generator. For this, Imagen uses a diffusion model, a type of generative model that has surged in popularity in recent years thanks to its SOTA performance on multiple tasks.
A diffusion model is trained by corrupting the training data with added noise, then learning to recover the data by reversing the noising process. Given an input image, the diffusion model iteratively corrupts it with Gaussian noise over a series of time steps, until only Gaussian noise, or "TV static", remains. The figure below shows the diffusion model's iterative noising process:
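A minimal sketch of this forward (noising) process, assuming a linear beta schedule; both the schedule and T=1000 are illustrative choices rather than Imagen's exact settings:

```python
import torch

# A linear beta schedule; both the schedule and T=1000 are illustrative.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Forward (noising) process, jumping directly to step t:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(x0)
    return alphas_cumprod[t].sqrt() * x0 + (1 - alphas_cumprod[t]).sqrt() * eps
```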
The diffusion model then works backward, learning how to isolate and remove the noise at each time step, counteracting the destruction that just occurred. Once trained, the model can start from randomly sampled Gaussian noise and use the learned denoising process to gradually generate an image, as shown in the figure below:
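A sketch of this reverse process as DDPM-style ancestral sampling; `model(x, t)` stands for a hypothetical network trained to predict the added noise, and the schedule mirrors the sketch above:

```python
import torch

# Schedule mirrors the forward-process sketch above.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape):
    """DDPM-style ancestral sampling: start from pure Gaussian noise and
    denoise step by step. `model(x, t)` is a hypothetical network trained
    to predict the noise added at step t."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((shape[0],), t))
        coef = betas[t] / (1.0 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # sigma_t = sqrt(beta_t)
    return x
```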
In short, a trained diffusion model starts from Gaussian noise and iteratively generates images that resemble its training images. Obviously, though, there is no control over what image is actually output: we simply feed Gaussian noise into the model, and it outputs a random image that looks like it belongs to the training dataset.
The goal, however, is to create images that encapsulate the semantic information of the captions fed into Imagen, so a way is needed to fold the caption into the diffusion process. How can this be done?
The text encoder described above produces a representative caption encoding, which is in fact a sequence of vectors. To inject this encoded information into the diffusion model, the vectors are pooled together and the diffusion model is conditioned on them. By conditioning on this vector, the diffusion model learns to adapt its denoising process to generate images that match the caption well. The process is visualized below:
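A toy sketch of what such conditioning can look like; this is not Imagen's actual U-Net, and every module name and size here is invented for illustration:

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy illustration only -- not Imagen's actual U-Net. The pooled
    caption embedding is fused with the timestep embedding, so the
    denoiser can steer its predictions toward the caption."""
    def __init__(self, txt_dim=512, hid=256, num_steps=1000):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, hid)
        self.t_embed = nn.Embedding(num_steps, hid)
        self.net = nn.Sequential(nn.Linear(hid, hid), nn.SiLU(), nn.Linear(hid, hid))

    def forward(self, x, t, text_tokens):
        pooled = text_tokens.mean(dim=1)                # pool the vector sequence
        cond = self.t_embed(t) + self.txt_proj(pooled)  # caption + timestep signal
        return self.net(x + cond)
```

In the real model, the full (unpooled) token sequence additionally feeds cross-attention layers inside the U-Net, a point the key takeaways later in the article single out as important.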
Because the image generator, or base model, outputs a small 64×64 image, upsampling this output to the final 1024×1024 version requires super-resolution models that intelligently upsample the image.
For super-resolution, Imagen again uses diffusion models. The overall process is essentially the same as in the base model, except that the model is conditioned not only on the caption encoding but also on the smaller image being upsampled. The whole process is visualized below:
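A common way to implement this conditioning, assumed here as a sketch rather than Imagen's exact mechanism, is to upsample the low-resolution image and concatenate it channel-wise with the noisy input; `model` is a hypothetical denoiser that also takes the timestep and the caption encoding:

```python
import torch
import torch.nn.functional as F

def super_res_denoise_step(model, x_noisy, low_res, t, text_tokens):
    """Sketch: condition the diffusion upsampler on the low-res image by
    bilinearly upsampling it and concatenating it channel-wise with the
    noisy input, so the denoiser sees it at every step."""
    low_up = F.interpolate(low_res, size=x_noisy.shape[-2:],
                           mode="bilinear", align_corners=False)
    return model(torch.cat([x_noisy, low_up], dim=1), t, text_tokens)
```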
The output of this super-resolution model is not actually the final output, but a medium-sized image. To scale the image up to the final 1024×1024 resolution, yet another super-resolution model is used. The two super-resolution architectures are roughly the same, so it won't be described again. The output of the second super-resolution model is Imagen's final output.
Why Is Imagen Better Than DALL-E 2?
It is difficult to say exactly why Imagen performs better than DALL-E 2. However, a non-negligible portion of the performance gap stems from how the two models handle captions and prompts. DALL-E 2 uses a contrastive objective (essentially CLIP) to determine how closely a text encoding relates to an image: the text and image encoders tune their parameters so that the cosine similarity of matching caption-image pairs is maximized while the cosine similarity of mismatched caption-image pairs is minimized.
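A minimal sketch of such a contrastive objective, in the symmetric cross-entropy form popularized by CLIP; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, img_emb, temperature=0.07):
    """CLIP-style symmetric contrastive objective: matching caption-image
    pairs (the diagonal) get high cosine similarity, mismatched pairs get
    low similarity. The temperature value is illustrative."""
    text_emb = F.normalize(text_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = text_emb @ img_emb.T / temperature   # pairwise cosine similarities
    targets = torch.arange(logits.shape[0])       # i-th caption <-> i-th image
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```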
A significant part of the performance gap instead stems from the fact that Imagen's text encoder is much larger than DALL-E 2's and is trained on more data. As evidence for this hypothesis, we can examine how Imagen's performance scales as the text encoder grows. Below are Pareto curves of Imagen's performance:
The effect of scaling up the text encoder is astonishingly large, while the effect of scaling up the U-Net is surprisingly small. This result suggests that a relatively simple diffusion model, conditioned on strong encodings, can produce high-quality results.
Given that the T5 text encoder is much larger than the CLIP text encoder, combined with the fact that natural-language training data is necessarily richer than image-caption pairs, much of the performance gap is likely attributable to this difference.
In addition, the author lists several key takeaways from Imagen, including the following:
- Scaling up the text encoder is very effective;
- Scaling up the text encoder matters more than scaling up the U-Net;
- Dynamic thresholding is critical (see the sketch after this list);
- Noise conditioning augmentation in the super-resolution models is very important;
- Using cross attention for text conditioning is important;
- An efficient U-Net is crucial.
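As one example, dynamic thresholding can be sketched as follows, following the description in the Imagen paper; treat the percentile p=0.995 as an illustrative setting:

```python
import torch

def dynamic_threshold(x0_hat, p=0.995):
    """Sketch of dynamic thresholding: at each sampling step, clamp the
    predicted image to the p-th percentile of its absolute pixel values,
    then rescale back into [-1, 1]. Treat p=0.995 as an assumed setting."""
    s = torch.quantile(x0_hat.abs().flatten(start_dim=1), p, dim=1)
    s = s.clamp(min=1.0).view(-1, 1, 1, 1)  # never shrink below the static range
    return x0_hat.clamp(-s, s) / s
```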
These insights offer valuable directions to researchers working on diffusion models, and they are useful well beyond the text-to-image subfield.
Link to the original text :https://www.assemblyai.com/blog/how-imagen-actually-works/