
CVPR 2022 | ManiTrans: Text-Guided Entity-Level Image Manipulation

2022-06-11 11:42:00 【Zhiyuan community】

This article introduces a paper from a collaboration between Yanwei Fu's research group at Fudan University and Huawei Noah's Ark Lab: ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation. The paper was accepted to CVPR 2022 (Oral).

Authors: Jianan Wang, Guansong Lu, Hang Xu*, Zhenguo Li, Chunjing Xu and Yanwei Fu*

arXiv: https://arxiv.org/abs/2204.04428

Project page: https://jawang19.github.io/manitrans/

Introduction

OpenAI recently released DALL-E 2 (https://openai.com/dall-e-2/), which has attracted wide attention in academia and industry. DALL-E 2 shows a good understanding of existing images and can make entity-level image modifications guided by text. Similarly, our CVPR 2022 work presented here focuses on text-guided, entity-level image manipulation. Inspired by DALL-E [1] and VQGAN [2], we propose ManiTrans, a new framework built on the two-stage image generation approach. It can edit the appearance of an entity, generate a new entity that matches the text guidance, and manipulate multiple entities.

Method

ManiTrans framework

ManiTrans is a two-stage framework consisting of (1) an image autoencoder and (2) a Transformer model that fits the joint distribution of text and images.

(1) The image autoencoder learns three components: an encoder, a decoder, and a set of image embeddings (the codebook). It first downsamples the input image, then quantizes the downsampled feature map with the image embeddings, and finally applies the decoder to the quantized feature map to reconstruct the image.
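The quantization step in (1) can be illustrated with a small sketch. It is not the authors' implementation: it assumes the common VQGAN-style rule of replacing each feature vector with its nearest codebook entry, and the names `quantize`, `lookup`, the toy `codebook`, and the toy `feature_map` are all hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(feature_map: List[List[Sequence[float]]],
             codebook: List[Sequence[float]]) -> List[List[int]]:
    """Map every feature vector to the index of its nearest codebook entry."""
    return [
        [min(range(len(codebook)),
             key=lambda k: squared_distance(vec, codebook[k]))
         for vec in row]
        for row in feature_map
    ]


def lookup(indices: List[List[int]],
           codebook: List[Sequence[float]]) -> List[List[List[float]]]:
    """Turn an index grid back into the quantized feature map fed to a decoder."""
    return [[list(codebook[k]) for k in row] for row in indices]


# Toy data (assumed): a 2x2 downsampled feature map and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
feature_map = [
    [(0.1, -0.1), (0.9, 1.2)],
    [(0.2, 0.8), (0.6, 0.7)],
]

indices = quantize(feature_map, codebook)
quantized = lookup(indices, codebook)
print(indices)  # [[0, 1], [2, 1]]
```

Flattening the index grid row by row gives the image index sequence that the second stage works on.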

(2) The Transformer is an autoregressive model. It takes the text sequence and the sequence of image quantization indices as input and predicts the next element in the sequence. During this stage of training, to help the Transformer better capture the correspondence between text and image, and to take into account the decoding of the generated image in (1), we design a semantic alignment loss.

The purpose of semantic alignment loss is to maximize the similarity between the text and the generated image .
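As a rough illustration of "maximizing similarity", the sketch below treats the loss as one minus the cosine similarity between a text embedding and an image embedding. This is an assumption for illustration only: the article does not give the exact formulation, and both the function `semantic_alignment_loss` and the source of the embeddings (some pretrained joint text-image encoder) are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def semantic_alignment_loss(text_emb: Sequence[float],
                            image_emb: Sequence[float]) -> float:
    """Assumed form: minimizing 1 - cos(text, image) maximizes their similarity."""
    return 1.0 - cosine_similarity(text_emb, image_emb)


# Toy embeddings (assumed): aligned, orthogonal, and opposite directions.
aligned = semantic_alignment_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = semantic_alignment_loss([1.0, 0.0], [0.0, 3.0])
opposite = semantic_alignment_loss([1.0, 0.0], [-1.0, 0.0])
```

Under this assumed form the loss ranges from 0 (identical direction) to 2 (opposite direction), so lowering it pulls the generated image's embedding toward the text's.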

Manipulating an entity in an image requires three inputs: one visual input, the original image (image), and two language inputs, the entity to modify (prompt) and the target text (text). The procedure is as follows:

(a) Segment the entities in the original image;

(b) Based on the similarity between the prompt and the image entities, locate the entity to be modified in the image and the corresponding positions in the index sequence;

(c) Conditioned on the target text, re-predict the image indices that need to change, namely those determined in (b). When the model only needs to change the appearance of the entity, the grayscale version of the original image is added as a further condition to provide prior information about the structure of the original entity.
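Steps (a) to (c) can be sketched with toy data as follows. This is a simplified illustration under stated assumptions, not the authors' code: the entity masks are assumed to be already given at token-grid resolution, the prompt-entity similarities are assumed precomputed, and `predict` is a hypothetical stand-in for the text-conditioned Transformer.

```python
from typing import Callable, Dict, List


def select_entity(similarities: Dict[str, float]) -> str:
    """Step (b): pick the segmented entity most similar to the prompt."""
    return max(similarities, key=similarities.get)


def mask_to_positions(mask: List[List[int]]) -> List[int]:
    """Step (b): map a token-grid mask to row-major positions in the index sequence."""
    width = len(mask[0])
    return [r * width + c
            for r, row in enumerate(mask)
            for c, value in enumerate(row) if value]


def manipulate(tokens: List[int], positions: List[int],
               predict: Callable[[List[int], int], int]) -> List[int]:
    """Step (c): re-predict only the selected positions, left to right."""
    edited = list(tokens)
    for pos in sorted(positions):
        # The stand-in predictor sees the already-edited prefix, mimicking
        # autoregressive decoding; unselected positions are left unchanged.
        edited[pos] = predict(edited[:pos], pos)
    return edited


# Toy inputs (assumed): a 2x2 token grid with two segmented entities.
tokens = [5, 5, 7, 7]
masks = {"sky": [[1, 1], [0, 0]], "bird": [[0, 0], [1, 1]]}
similarities = {"sky": 0.1, "bird": 0.8}  # hypothetical prompt-entity scores

target = select_entity(similarities)
positions = mask_to_positions(masks[target])
edited = manipulate(tokens, positions, predict=lambda prefix, pos: 9)
print(target, positions, edited)  # bird [2, 3] [5, 5, 9, 9]
```

Only the indices belonging to the selected entity change, which is what keeps the rest of the image intact after decoding.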

Results

[Figures: multi-entity manipulation; cross-category manipulation on the COCO dataset; cross-category manipulation of birds and flowers on the CUB and Oxford datasets.]

If you are interested in the model details, more results, or further analysis, please see our paper.

Postscript

In recent years, driven by Transformers, pre-training techniques, and growing computing power, multimodal vision-and-language understanding has developed rapidly and has drawn attention from more and more people. The recent DALL-E 2 is even more striking and raises our expectations for the future of vision-language research. In fact, many directions in this field remain underexplored, and text-based image manipulation is one of them. Our work is not perfect and leaves room for improvement, but we hope it represents a step forward for text-guided image manipulation. Finally, we hope everyone can produce work they find interesting and valuable.

[1] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. arXiv:2102.12092, 2021.

[2] Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. arXiv:2012.09841, 2020.
