2022 T2I (text-to-image) Chinese journal papers quick view, part 1 (ECAGAN: text-to-image generation based on a channel attention mechanism + CAE-GAN: text-to-image generation based on Transforme

2022-07-27 04:56:00 【Medium coke with ice】

2022 T2I Text-to-Image Generation: Quick View of Chinese Journal Papers, Part 1
1. ECAGAN: Text-to-image generation based on a channel attention mechanism
1.1 Major innovations
1.2 Main framework
1.2.1 Low-resolution image generation stage
1.2.2 Image refinement stage
1.3 Loss function
1.4 Experiments
2. CAE-GAN: Text-to-image generation based on Transformer cross-attention
2.1 Major innovations
2.2 Main framework
2.2.1 Cross-attention encoder
2.2.2 Dynamic memory module
2.3 Loss function
2.4 Experiments
Finally

1. ECAGAN: Text-to-image generation based on a channel attention mechanism

Source: Computer Engineering, April 2022
Citation: Zhang Yunfan, Yi Yaohua, Tang Ziwei, Wang Xinyu. Text-to-image generation method based on channel attention mechanism [J]. Computer Engineering, 2022, 48(04): 206-212+222. DOI: 10.19678/j.issn.1000-3428.0062998.

1.1 Major innovations

Text-to-image generation suffers from two problems: generated images lack detail, and images produced in the low-resolution stage contain structural errors (for example, a bird with two heads or missing claws). To address them, the paper builds on the generative adversarial network based on a dynamic attention mechanism (DMGAN), adds a content-aware upsampling module and a channel attention convolution module, and proposes a new text-to-image method, ECAGAN.
The main innovations are:

Content-aware upsampling: a reassembly kernel is computed from the input feature map and then convolved with that feature map, which keeps the result semantically aligned.
Channel attention: the model learns the importance of each channel of the feature map, highlights important channels, suppresses uninformative ones, and enriches the detail of the generated image.
Conditioning augmentation and a perceptual loss are added as auxiliary training signals to make training more robust.

1.2 Main framework

The overall structure still follows the three-stage stacked design of StackGAN++. The network is divided into a low-resolution image generation stage and an image refinement stage: the generator in the low-resolution stage produces a 64×64 image, and the generators in the refinement stage produce 128×128 and 256×256 images.

1.2.1 Low-resolution image generation stage

The text encoding step is the same as in StackGAN, AttnGAN, DMGAN, and similar models: a text encoder produces sentence features and word features. The sentence features are concatenated with random noise, passed through a fully connected (FC) layer, and fed into the content-aware upsampling module (CAUPBlock). Upsampling the input feature map yields the first-stage image.

The content-aware upsampling module consists of an adaptive convolution kernel prediction module and a content-aware feature reassembly module:

In the adaptive convolution kernel prediction module, the feature map passes through a content encoder and is then reshaped and normalized into a reassembly kernel of size $k^2_{up}$. The content-aware feature reassembly module, put simply, takes the dot product of each region of the feature map with the corresponding predicted kernel.

The feature map $R$ is fed into the adaptive kernel prediction module $\psi$, which predicts a kernel $\gamma_{l'}$ for every location $l'$ of the output feature map $R'$. The content-aware feature reassembly module $\xi$ then takes the dot product of the original feature map with the predicted kernel:

$\begin{array}{l} \gamma_{l^{\prime}}=\psi\left(Z\left(R_{l}, k_{\mathrm{encoder}}\right)\right) \\ R_{l^{\prime}}^{\prime}=\xi\left(Z\left(R_{l}, k_{\mathrm{up}}\right), \gamma_{l^{\prime}}\right) \end{array}$

where $Z(R_l, k_{up})$ denotes the $k_{up} \times k_{up}$ sub-region of the feature map centered on point $l$.
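The two formulas for $\gamma_{l'}$ and $R'_{l'}$ can be illustrated with a small pure-Python sketch on a single-channel feature map. This is only a toy under stated assumptions, not the paper's implementation: `toy_psi` is a hypothetical stand-in for the learned predictor $\psi$, the softmax normalization, the zero padding at the borders, and the default values of `scale`, `k_encoder`, and `k_up` are all choices made here for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def neighborhood(R, i, j, k):
    # Z(R_l, k): the k x k sub-region centered on l = (i, j).
    # Assumes odd k; positions outside the map are zero-padded.
    h, w, r = len(R), len(R[0]), k // 2
    return [R[y][x] if 0 <= y < h and 0 <= x < w else 0.0
            for y in range(i - r, i + r + 1)
            for x in range(j - r, j + r + 1)]

def toy_psi(patch, di, dj, k_up):
    # Hypothetical stand-in for the learned predictor psi: it maps the
    # encoder patch to k_up * k_up logits, then normalizes them.
    mean = sum(patch) / len(patch)
    logits = [mean * math.sin(n + di + 2 * dj) for n in range(k_up * k_up)]
    return softmax(logits)  # the predicted kernel sums to 1

def content_aware_upsample(R, scale=2, k_encoder=3, k_up=3, psi=toy_psi):
    h, w = len(R), len(R[0])
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for oi in range(h * scale):
        for oj in range(w * scale):
            i, j = oi // scale, oj // scale  # source location l for output l'
            gamma = psi(neighborhood(R, i, j, k_encoder),
                        oi % scale, oj % scale, k_up)
            patch = neighborhood(R, i, j, k_up)
            # xi: dot product of the sub-region with the predicted kernel
            out[oi][oj] = sum(g * p for g, p in zip(gamma, patch))
    return out
```

Because each predicted kernel sums to 1, a constant input region stays constant after upsampling; only the border, where zero padding enters the sub-region, deviates.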
After upsampling, the feature map is fed to the generator, which produces the image through convolutions equipped with a channel attention mechanism.
The attention convolution module reweights the feature map by channel attention so that the generated image contains more detail. The implementation of channel attention is not repeated here; see the original paper.
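Since the post does not describe how the channel attention is implemented, the following is only a rough sketch of the general idea of weighting feature channels by learned importance. It assumes a gate built from global average pooling, a small 1D convolution across neighboring channels, and a sigmoid; the fixed `conv_weights` are hypothetical placeholders for learned parameters, and the paper's actual module may differ.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(feature_map, conv_weights=(0.25, 0.5, 0.25)):
    # feature_map: list of C channels, each an H x W list of lists.
    # Squeeze: global average pooling gives one descriptor per channel.
    desc = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Assumed gate: 1D convolution across neighboring channels (zero-padded).
    r = len(conv_weights) // 2
    c = len(desc)
    mixed = [sum(wt * desc[i + o - r]
                 for o, wt in enumerate(conv_weights)
                 if 0 <= i + o - r < c)
             for i in range(c)]
    weights = [sigmoid(m) for m in mixed]
    # Rescale every channel by its importance weight in (0, 1).
    scaled = [[[weights[i] * v for v in row] for row in ch]
              for i, ch in enumerate(feature_map)]
    return scaled, weights
```

Channels with a larger pooled response receive a weight closer to 1 and are kept almost unchanged, while channels with a weak or negative response are scaled down.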

1.2.2 Image refinement stage

The image refinement stage is very similar to AttnGAN. A dynamic attention layer computes the correlation between each word in the word vectors and each image sub-region, then derives the attention weights of the image sub-regions from that correlation. Finally, the attention weights control how the feature map is updated, and the new feature map is upsampled to enlarge the image. (This is the second of three methods that can effectively fuse text and image information.)
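The word-to-region attention described above can be sketched as follows. This is a minimal illustration, assuming a plain dot product as the correlation and a softmax over the words for each sub-region; the gated update of the feature map and the upsampling step are left out, and the function name is hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def word_region_attention(words, regions):
    # words: T word vectors of dimension D
    # regions: N image sub-region vectors of dimension D
    weights, contexts = [], []
    for region in regions:
        # Correlation between this sub-region and every word (dot product).
        scores = [sum(a * b for a, b in zip(word, region)) for word in words]
        attn = softmax(scores)
        # Word context for this sub-region: attention-weighted sum of words.
        ctx = [sum(a * word[d] for a, word in zip(attn, words))
               for d in range(len(region))]
        weights.append(attn)
        contexts.append(ctx)
    return weights, contexts
```

Each sub-region ends up with a context vector dominated by the words it correlates with most, which is the signal used to update the feature map.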

1.3 Loss function

The generator loss function takes the form:

$L_{G}=\sum\left(L_{G_{i}}+\lambda_{1} L_{\mathrm{per}}\left(I_{i}^{\prime}, I_{i}\right)\right)+\lambda_{2} L_{\mathrm{CA}}+\lambda_{3} L_{\mathrm{DAMSM}}$

$L_{G_i}$ is the loss of the generator at stage $i$; $L_{per}$ is the perceptual loss; $L_{CA}$ is the conditioning augmentation loss; $L_{DAMSM}$ is the loss of the DAMSM module.

$L_{G_{i}}=\underbrace{-\frac{1}{2} E_{\widehat{x}_{i} \sim p_{G_{i}}}\left[\log _{a} D_{i}\left(\widehat{x}_{i}\right)\right]}_{\text {unconditional loss }} \underbrace{-\frac{1}{2} E_{\widehat{x}_{i} \sim p_{G_{i}}}\left[\log _{a} D_{i}\left(\widehat{x}_{i}, s\right)\right]}_{\text {conditional loss }}$

$\begin{array}{l} L_{D_{i}}= \\ \underbrace{-\frac{1}{2} E_{x_{i} \sim p_{\text {data }}} \log _{a} D_{i}\left(x_{i}\right)-\frac{1}{2} E_{\widehat{x}_{i} \sim p_{G_{i}}} \log _{a}\left(1-D_{i}\left(\widehat{x}_{i}\right)\right)}_{\text {unconditional loss }} \\ \underbrace{-\frac{1}{2} E_{x_{i} \sim p_{\text {data }}} \log _{a} D_{i}\left(x_{i}, s\right)-\frac{1}{2} E_{\widehat{x}_{i} \sim p_{G_{i}}} \log _{a}\left(1-D_{i}\left(\widehat{x}_{i}, s\right)\right)}_{\text {conditional loss }} \end{array}$

$\begin{array}{l} L_{\mathrm{CA}}=D_{\mathrm{KL}}(\mathcal{N}(\boldsymbol{\mu}(\boldsymbol{s}), \boldsymbol{\Sigma}(\boldsymbol{s})) \| \mathcal{N}(0, I))\end{array}$

$\begin{array}{l}L_{\mathrm{per}}\left(I^{\prime}, I\right)=\frac{1}{C_{i} H_{i} W_{i}}\left\|\phi_{i}\left(I^{\prime}\right)-\phi_{i}(I)\right\|_{2}^{2} \end{array}$
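As a rough numerical illustration of the last two formulas, the sketch below evaluates $L_{CA}$ and $L_{per}$ on plain Python lists. It assumes that $\Sigma(s)$ is diagonal, so the KL divergence has the closed form $\frac{1}{2}\sum_j(\mu_j^2+\sigma_j^2-\ln\sigma_j^2-1)$, and that the feature maps $\phi_i(\cdot)$ have already been extracted by some network; the function names and input layout are hypothetical.

```python
import math

def ca_loss(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), assuming a diagonal covariance.
    # mu and sigma are lists with one entry per latent dimension.
    return 0.5 * sum(m * m + s * s - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

def perceptual_loss(feat_fake, feat_real):
    # feat_*: feature maps phi_i(.) as nested lists of shape C x H x W.
    c, h, w = len(feat_fake), len(feat_fake[0]), len(feat_fake[0][0])
    squared = sum((a - b) ** 2
                  for ch_f, ch_r in zip(feat_fake, feat_real)
                  for row_f, row_r in zip(ch_f, ch_r)
                  for a, b in zip(row_f, row_r))
    # Squared L2 distance normalized by C_i * H_i * W_i.
    return squared / (c * h * w)
```

`ca_loss` is zero exactly when the predicted distribution already equals the standard normal, and `perceptual_loss` is zero when the two feature maps coincide.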

1.4 Experiments

[Figure: experimental results]

2. CAE-GAN: Text-to-image generation based on Transformer cross-attention

Source: Computer Science, February 2022
Citation: Tan Xinyue, He Xiaohai, Wang Zhengyong, Luo Xiaodong, Qing Linbo. Text-to-image generation technology based on Transformer cross attention [J]. Computer Science, 2022, 49(02): 107-115.

2.1 Major innovations

The mainstream approach encodes the input text description with a pre-trained text encoder. However, current methods do not take the mapping to the corresponding image into account when encoding the text, and they ignore the semantic gap between the language space and the image space. As a result, the image generated in the initial stage still matches the text semantics poorly, and image quality suffers as well.

Innovation:

A cross-attention encoder translates and aligns text information with visual information to capture the cross-modal mapping between text and image, which improves both the fidelity of the generated image and how well it matches the input text description.

2.2 Main framework

The text first passes through the cross-attention encoder, which produces the cross-attention feature vector $f_c$ and the word feature matrix $W$. After noise is added, the cross-attention feature vector goes through a fully connected layer and an upsampling stage to form the initial image features. The word feature matrix $W$ is then fed into the dynamic memory module and fused with the initial image features, giving new fused image features. The process is:

$\begin{array}{l} \boldsymbol{f}_{c}, \boldsymbol{W}=C_{E}\left(\boldsymbol{s}, \boldsymbol{F}_{R}\right) \\ \boldsymbol{F}_{0}=G_{0}\left(\boldsymbol{f}_{c}+\boldsymbol{z}\right) \\ \boldsymbol{F}_{1}=G_{1}\left(D M\left(\boldsymbol{F}_{0}, \boldsymbol{W}\right)\right) \\ \boldsymbol{F}_{2}=G_{2}\left(D M\left(\boldsymbol{F}_{1}, \boldsymbol{W}\right)\right) \end{array}$

2.2.1 Cross-attention encoder

The cross-attention encoder jointly encodes and aligns language information with visual information. It consists of four parts: text feature extraction, image feature extraction, cross-attention encoding, and self-attention encoding:

Text feature extraction: a bidirectional LSTM encodes the original text into a global sentence feature vector W and word feature vectors s.
Image feature extraction: an InceptionV3 network extracts the image features $f_v$.
Cross-attention encoding: this builds the internal relationship between language features and image features to achieve joint encoding. The word feature vectors s and the image features $f_v$ are mapped to $q_s,k_s,v_s$ and $q_v,k_v,v_v$, that is, $q_s,k_s,v_s=Linear(s)$ and $q_v,k_v,v_v=Linear(f_v)$. The cross-attention score is then computed, and the attention encoding $l_c$ follows from it:
$\begin{array}{l} \text { score }=\lambda_{c} \boldsymbol{q}_{v} \boldsymbol{k}_{s}^{\mathrm{T}} \\ \text { score }^{\prime}=\operatorname{Softmax}(\text { score }) \\ \boldsymbol{s}_{c}=\operatorname{Dropout}\left(\text { score }^{\prime}\right) \\ \boldsymbol{l}=\boldsymbol{s}_{c} \cdot \boldsymbol{v}_{s} \\ \boldsymbol{l}_{c}=\operatorname{Normalization}\left(A_{1} \boldsymbol{l}+B_{1}\right) \end{array}$
Self-attention encoding: this mainly reuses the standard self-attention mechanism, so it is not described in detail here.
$\begin{array}{l} \boldsymbol{q}_{l}, \boldsymbol{k}_{l}, \boldsymbol{v}_{l}=\text { Linear }\left(\boldsymbol{l}_{c}\right) \\ \boldsymbol{s}_{s}=\operatorname{Dropout}\left(\operatorname{Softmax}\left(\lambda_{s} \boldsymbol{q}_{l} \boldsymbol{k}_{l}^{\mathrm{T}}\right)\right) \\ \boldsymbol{l}_{c s}=\boldsymbol{s}_{s} \cdot \boldsymbol{v}_{l} \\ \boldsymbol{f}_{c}=\operatorname{Normalization}\left(A_{2} \boldsymbol{l}_{c s}+B_{2}\right) \end{array}$
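The cross-attention and self-attention steps can be traced with a small pure-Python sketch. It is a simplification under several assumptions that are not taken from the paper: dropout is skipped (as at inference time), the linear maps are identities ($A_1 = A_2 = I$, $B_1 = B_2 = 0$, and the Linear projections are omitted), Normalization is taken to be a per-row layer normalization, and $\lambda_c = \lambda_s = 1/\sqrt{d}$.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def normalize_rows(m, eps=1e-5):
    # Assumed form of "Normalization": zero mean, unit variance per row.
    out = []
    for row in m:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def attention(q, k, v, lam):
    # score = lam * q k^T, softmax over keys, then weight the values.
    scores = [[lam * x for x in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    return matmul(weights, v), weights

def cross_attention_encode(q_v, k_s, v_s):
    lam = 1.0 / math.sqrt(len(k_s[0]))          # assumed scaling factor
    l, s_c = attention(q_v, k_s, v_s, lam)      # image queries attend to words
    l_c = normalize_rows(l)                     # A_1 = I, B_1 = 0 assumed
    l_cs, _ = attention(l_c, l_c, l_c, lam)     # self-attention on l_c
    f_c = normalize_rows(l_cs)                  # A_2 = I, B_2 = 0 assumed
    return f_c, s_c
```

With two image queries and four word keys, `s_c` is a 2×4 matrix of attention weights and `f_c` keeps one row per image query, which mirrors how the visual side selects the relevant words.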

2.2.2 Dynamic memory module

The dynamic memory module is similar to the dynamic memory mechanism inside DM-GAN.

2.3 Loss function

The overall loss function $L$ has three parts:

$L=\sum_{i} L_{G^{i}}+\tau_{1} L_{C A}+\tau_{2} L_{D A M S M}$

where $L_{G^i}$ is the generator loss, $L_{CA}$ is the conditioning augmentation loss, and $L_{DAMSM}$ is the Deep Attentional Multimodal Similarity Model loss (as in AttnGAN):

$\begin{aligned} L_{G^{i}} &=-\frac{1}{2}\left[E_{x \sim p_{G^{i}}} \log D_{i}(x)+E_{x \sim p_{G^{i}}} \log D_{i}(x, s)\right] \\ L_{C A} &=D_{K L}(N(\mu(s), \Sigma(s)) \| N(0, I)) \end{aligned}$

The discriminator loss consists of a conditional loss and an unconditional loss:

$\begin{array}{l} L_{D^{i}}=-\frac{1}{2}\left[L_{D}+L_{C D}\right] \\ L_{D}=E_{x \sim p_{\text {data }}} \log D_{i}(x)+E_{x \sim p_{G^{i}}} \log \left(1-D_{i}(x)\right) \\ L_{C D}=E_{x \sim p_{\text {data }}} \log D_{i}(x, s)+E_{x \sim p_{G^{i}}} \log \left(1-D_{i}(x, s)\right) \end{array}$
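To make the adversarial terms concrete, the sketch below estimates $L_{G^i}$ and $L_{D^i}$ from batches of discriminator outputs, replacing each expectation with a batch mean. The function names and argument layout are hypothetical, and the inputs are assumed to be probabilities strictly between 0 and 1.

```python
import math

def mean_log(probs):
    return sum(math.log(p) for p in probs) / len(probs)

def mean_log_one_minus(probs):
    return sum(math.log(1.0 - p) for p in probs) / len(probs)

def generator_loss(d_fake_uncond, d_fake_cond):
    # L_G^i: the generator wants D_i(x) and D_i(x, s) to be high on fakes.
    return -0.5 * (mean_log(d_fake_uncond) + mean_log(d_fake_cond))

def discriminator_loss(d_real_uncond, d_fake_uncond, d_real_cond, d_fake_cond):
    # L_D: unconditional term, L_CD: term conditioned on the text s.
    l_d = mean_log(d_real_uncond) + mean_log_one_minus(d_fake_uncond)
    l_cd = mean_log(d_real_cond) + mean_log_one_minus(d_fake_cond)
    return -0.5 * (l_d + l_cd)
```

When the discriminator outputs 0.5 everywhere, the generator loss equals $\ln 2$ and the discriminator loss equals $2\ln 2$; a discriminator that separates real from generated samples more confidently gets a lower loss.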

2.4 Experiments

[Figure: experimental results]

Finally

About the author: a graduate student in artificial intelligence, currently focusing mainly on text-to-image generation.


Original site

Copyright notice
This article was written by [Medium coke with ice]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/208/202207262241048389.html