
CVPR 2022 | Text-Guided Entity-Level Image Manipulation: ManiTrans

2022-06-11 11:42:00 Zhiyuan community

This post introduces ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, a collaboration between Yanwei Fu's research group at Fudan University and Huawei Noah's Ark Lab, accepted to CVPR 2022 as an Oral.

Authors: Jianan Wang, Guansong Lu, Hang Xu*, Zhenguo Li, Chunjing Xu and Yanwei Fu*

arXiv: https://arxiv.org/abs/2204.04428

Project page: https://jawang19.github.io/manitrans/

 

Introduction

OpenAI's recently released DALL·E 2 (https://openai.com/dall-e-2/) has drawn wide attention in both academia and industry. DALL·E 2 understands existing images well and can perform entity-level image edits guided by text. In the same spirit, this post presents our CVPR 2022 work, which likewise focuses on text-guided entity-level image manipulation. Inspired by DALL-E [1] and VQGAN [2], we propose a new framework built on the two-stage image generation paradigm, named ManiTrans. It can not only edit the appearance of an entity but also generate a new entity matching the text guidance, and it supports manipulating multiple entities at once.

 

Method

Figure: the ManiTrans framework.

ManiTrans is a two-stage framework consisting of (1) an image autoencoder and (2) a Transformer that fits the joint distribution of text and image.

(1) The image autoencoder comprises three parts: an encoder, a decoder, and an image codebook. The encoder first downsamples the input image; the downsampled feature map is then quantized against the codebook; finally, the decoder takes the quantized feature map and reconstructs the image.
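As a rough illustration of the quantization step, a minimal VQGAN-style vector quantizer in PyTorch might look like the sketch below; the class name, shapes, and codebook size are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQGAN-style quantizer: snap each spatial feature vector
    to its nearest codebook entry (illustrative sketch only)."""

    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (B, C, H, W) encoder output, with C equal to the codebook dim
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)       # (B*H*W, C)
        dists = torch.cdist(flat, self.codebook.weight)   # distance to every code
        idx = dists.argmin(dim=1)                         # nearest-code indices
        z_q = self.codebook(idx).view(B, H, W, C).permute(0, 3, 1, 2)
        return z_q, idx.view(B, H, W)  # quantized features and token index map
```

In a real VQGAN, training also needs a straight-through gradient estimator and a commitment loss; both are omitted here for brevity.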

(2) The Transformer is an autoregressive model: it takes the text sequence and the quantized index sequence of the image as input and predicts the next element in the sequence.
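For intuition, the autoregressive objective over the concatenated text-and-image token sequence can be sketched as follows; the transformer interface here is a placeholder, not the paper's exact model.

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(transformer, text_tokens, image_tokens):
    """Next-token prediction over the concatenated [text; image] sequence.
    `transformer` is assumed to return (B, T, vocab_size) logits."""
    seq = torch.cat([text_tokens, image_tokens], dim=1)   # (B, T)
    logits = transformer(seq[:, :-1])                     # predict token t from tokens < t
    targets = seq[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```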

During training at this stage, to help the Transformer better capture the correspondence between text and image, and to aid stage (1) in decoding the generated image, we also design a semantic alignment loss, whose purpose is to maximize the similarity between the text and the generated image.
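As a loose sketch of such a similarity term: the paper's loss is token-wise, but a simpler pooled cosine-similarity variant conveys the idea; the feature inputs are assumed to come from text and image encoders.

```python
import torch.nn.functional as F

def semantic_alignment_loss(text_feat, image_feat):
    """Pull pooled text and generated-image features together.
    Simplified pooled variant; the paper's loss is token-wise."""
    text_feat = F.normalize(text_feat, dim=-1)
    image_feat = F.normalize(image_feat, dim=-1)
    cos_sim = (text_feat * image_feat).sum(dim=-1)   # cosine similarity per sample
    return (1.0 - cos_sim).mean()                    # minimizing this maximizes similarity
```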

 

To manipulate an entity in an image, three inputs are needed: one visual input, the original image; and two language inputs, the entity to modify (the prompt) and the target text. The manipulation proceeds as follows (see the sketch after this list):

(a) Segment the entities in the original image;

(b) Using the similarity between the prompt and each image entity, locate the entity to be modified in the image and map it to the corresponding positions in the index sequence;

(c) Conditioned on the target text, re-predict the image indices that need to change, i.e., the indices located in (b). When the model only needs to edit the entity's appearance, a grayscale version of the original image is added as a further condition, providing a prior on the structure of the original entity.
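Putting steps (a)-(c) together, a minimal end-to-end sketch follows; every helper function here (segment_entities, match_prompt, tokens_for_region, repredict_tokens, decode_tokens, to_grayscale) is hypothetical.

```python
def manipulate(image, prompt, text, appearance_only=False):
    """Illustrative entity-level manipulation pipeline (hypothetical helpers)."""
    entities = segment_entities(image)            # (a) segment candidate entities
    target = match_prompt(prompt, entities)       # (b) entity most similar to the prompt
    region_idx = tokens_for_region(target.mask)   #     its positions in the index sequence
    conditions = [text]                           # (c) condition on the target text ...
    if appearance_only:
        conditions.append(to_grayscale(image))    # ... plus a grayscale structural prior
    new_tokens = repredict_tokens(image, region_idx, conditions)  # re-predict those indices
    return decode_tokens(new_tokens)              # stage-(1) decoder reconstructs the image
```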

 

Results

Figures: multi-entity manipulation on the COCO dataset; cross-category manipulation on the CUB (birds) and Oxford (flowers) datasets.

For details of the model and more results or analysis, please refer to our paper.

 

Postscript

In recent years, driven by Transformer architectures, pre-training techniques, and growing compute, the field of vision-and-language multimodal understanding has developed rapidly and attracted ever more attention. The recent DALL·E 2 work is all the more impressive and raises expectations for the future of vision and language. In fact, many directions in this field remain underexplored, and text-guided image manipulation is one of them. The work in this paper is not perfect and leaves room for further improvement, but we hope it represents a step forward for text-guided image manipulation. Finally, we wish everyone success in doing work they find interesting and valuable.

 

[1] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. arXiv:2102.12092, 2021.

[2] Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. arXiv:2012.09841, 2020.
