Multimodal unsupervised image to image translation
2022-07-29 02:37:00 【A fan boy addicted to bicycles】
Preface: GAN-based image translation has become a very active direction. Last time I introduced SketchyGAN, which I could not reproduce and found very disappointing. This time we look at MUNIT, an unsupervised GAN image-translation method from NVIDIA research. The next article will also cover unsupervised image translation, 《Unsupervised Sketch-to-Photo Synthesis》; comparing the similarities and differences between the two may bring inspiration to my current work.
Contents

Partially shared latent space assumption
Bidirectional reconstruction loss
Main contributions
Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any examples of corresponding image pairs. We assume that the image representation can be decomposed into a content code that is domain invariant and a style code that captures domain-specific properties. To translate an image to another domain, we recombine its content code with a random style code sampled from the style space of the target domain.
Sketch-to-photo synthesis is challenging for two reasons:

1. Sketches are inconsistent with photos in shape: sketches drawn by amateurs have large spatial and geometric deformations. Translating a sketch into a photo therefore requires correcting these deformations.

2. Sketches are colorless and lack visual detail. A sketch is drawn with black strokes on white paper, with internal marks that mainly outline the boundaries and characteristic features of objects. To synthesize a photo, shading and colored textures must be filled in correctly.
In this paper, we propose a principled framework for multimodal unsupervised image-to-image translation. As shown in Figure 1(a), our framework makes several assumptions. We first assume that the latent space of images can be decomposed into a content space and a style space. We further assume that images in different domains share a common content space but not the style space. To translate an image to the target domain, we recombine its content code with a random style code in the target style space (Figure 1(b)). The content code encodes the information that should be preserved during translation, while the style code represents the remaining variations that are not contained in the input image. By sampling different style codes, our model is able to produce diverse, multimodal outputs. Extensive experiments demonstrate the effectiveness of our method in modeling multimodal output distributions, and its image quality is superior to state-of-the-art methods. Moreover, the decomposition of content and style spaces allows our framework to perform example-guided image translation, in which the style of the translation output is controlled by a user-provided example image from the target domain.
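The sampling procedure can be sketched numerically. Below is a toy illustration (my own sketch with made-up split/concatenate stand-ins for the encoder and decoder, not the authors' networks) of how recombining one content code with different sampled style codes yields multiple translations of the same input:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_content(x):
    # Stand-in for the content encoder: keeps the domain-invariant part.
    return x[:4]

def decode_target(content, style):
    # Stand-in for the target-domain decoder: combines content and style.
    return np.concatenate([content, style])

x1 = rng.normal(size=8)              # an image from the source domain
c1 = encode_content(x1)              # its domain-invariant content code

# Sampling several style codes from the target-domain prior N(0, I)
# gives several distinct translations of the same input image.
translations = [decode_target(c1, rng.normal(size=4)) for _ in range(3)]
```

All three outputs share the same content but differ in style, which is exactly the multimodality the paper is after.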

Methods
Partially shared latent space assumption
Assume $x_1 \in \mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$ are images from two different domains, sampled from the two marginal distributions $p(x_1)$ and $p(x_2)$. The goal of translation is to estimate the conditional distributions $p(x_2 \mid x_1)$ and $p(x_1 \mid x_2)$.

Suppose each image $x_i$ is generated from a content latent code $c$ shared by both domains and a style latent code $s_i$ specific to its own domain, i.e. $x_i = G_i^*(c, s_i)$. The goal of the network is to learn the underlying generator and encoder functions with neural networks.
This assumption is closely related to the shared latent space assumption proposed in UNIT. While UNIT assumes a fully shared latent space, we assume that only part of the latent space (the content) can be shared across domains, while the rest (the style) is domain specific. This is a more reasonable assumption when the cross-domain mapping is many-to-many.
Encoder-decoder structure
The model consists of two autoencoders (denoted by red and blue arrows respectively), one for each domain. The latent code of each autoencoder is composed of a content code c and a style code s. We train the model with adversarial objectives (dotted lines) that ensure the translated images are indistinguishable from real images in the target domain, and with bidirectional reconstruction objectives (dashed lines) that reconstruct both images and latent codes.

The latent code of each autoencoder is factorized into a content code $c_i$ and a style code $s_i$. Image-to-image translation is performed by swapping encoder-decoder pairs: an image's content code is decoded by the other domain's decoder together with a style code drawn from that domain's style prior. Although the style prior is unimodal, the output image distribution can be multimodal thanks to the nonlinearity of the decoder.
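As a concrete toy version of the swap (my own sketch; the real encoders and decoders are convolutional networks, while here both are trivially invertible split/concatenate functions):

```python
import numpy as np

CONTENT_DIM, STYLE_DIM = 4, 2
rng = np.random.default_rng(1)

def encode(x):
    # Toy encoder for either domain: split the latent into (content, style).
    return x[:CONTENT_DIM], x[CONTENT_DIM:]

def decode(c, s):
    # Toy decoder for either domain: reassemble an "image" from the codes.
    return np.concatenate([c, s])

x1 = rng.normal(size=CONTENT_DIM + STYLE_DIM)   # image in domain 1
c1, s1 = encode(x1)                              # its content and style codes

s2 = rng.normal(size=STYLE_DIM)   # style sampled from domain 2's prior
x12 = decode(c1, s2)              # translation of x1 into domain 2
c12, s12 = encode(x12)            # re-encoding recovers the swapped codes
```

Because the toy encoder and decoder are exact inverses, re-encoding the translation recovers the content and style codes that produced it; this is precisely the property the bidirectional reconstruction losses are designed to enforce in the learned networks.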
The loss function includes a bidirectional reconstruction loss (which ensures that the encoders and decoders are inverses of each other) and an adversarial loss (which matches the distribution of translated images to the image distribution of the target domain).
Bidirectional reconstruction loss
To learn encoder and decoder pairs that are inverses of each other, we use objective functions that encourage reconstruction in both directions: image -> latent -> image and latent -> image -> latent.
Image reconstruction loss. Given an image sampled from the data distribution, we should be able to reconstruct it after encoding and decoding:

$$\mathcal{L}_{\text{recon}}^{x_1} = \mathbb{E}_{x_1 \sim p(x_1)}\big[\, \lVert G_1(E_1^c(x_1), E_1^s(x_1)) - x_1 \rVert_1 \,\big]$$
Latent reconstruction loss. Given a latent code (style and content) sampled from the latent distribution at translation time, we should be able to reconstruct it after decoding and encoding:

$$\mathcal{L}_{\text{recon}}^{c_1} = \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}\big[\, \lVert E_2^c(G_2(c_1, s_2)) - c_1 \rVert_1 \,\big]$$

$$\mathcal{L}_{\text{recon}}^{s_2} = \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}\big[\, \lVert E_2^s(G_2(c_1, s_2)) - s_2 \rVert_1 \,\big]$$

where $q(s_2)$ is the standard normal prior $\mathcal{N}(0, I)$ and $p(c_1)$ is given by $c_1 = E_1^c(x_1)$ with $x_1 \sim p(x_1)$.
The authors use L1 reconstruction losses because they encourage sharp output images.
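Both reconstruction directions can be checked on toy arrays (my own sketch; the split/concatenate encoder and decoder are placeholders, not the paper's networks):

```python
import numpy as np

def l1(a, b):
    # L1 reconstruction loss (mean absolute error), as used in the paper.
    return np.mean(np.abs(a - b))

# Placeholder encoder/decoder pair: split and concatenate a flat vector.
enc = lambda x: (x[:4], x[4:])            # x -> (content, style)
dec = lambda c, s: np.concatenate([c, s]) # (content, style) -> x

rng = np.random.default_rng(2)
x1 = rng.normal(size=6)

# Image reconstruction: image -> latent -> image should return the input.
c1, s1 = enc(x1)
loss_img = l1(dec(c1, s1), x1)

# Latent reconstruction: latent -> image -> latent should return the codes.
s2 = rng.normal(size=2)                   # style sampled from the prior
c_rec, s_rec = enc(dec(c1, s2))
loss_content = l1(c_rec, c1)
loss_style = l1(s_rec, s2)
```

For this perfectly invertible toy pair all three losses are exactly zero; during training, driving these terms toward zero is what pushes the learned encoders and decoders toward being inverses.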
Adversarial loss. GANs are used to match the distribution of translated images to the target data distribution:

$$\mathcal{L}_{\text{GAN}}^{x_2} = \mathbb{E}_{c_1 \sim p(c_1),\, s_2 \sim q(s_2)}\big[\log\big(1 - D_2(G_2(c_1, s_2))\big)\big] + \mathbb{E}_{x_2 \sim p(x_2)}\big[\log D_2(x_2)\big]$$
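The adversarial term can be checked numerically with a made-up logistic discriminator score (a sketch only; the paper actually uses multi-scale convolutional discriminators):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gan_d_loss(d_logits_real, d_logits_fake):
    # Discriminator side of the standard GAN objective:
    # maximize E[log D(real)] + E[log(1 - D(fake))],
    # written here as a loss to minimize (negated).
    return -(np.mean(np.log(sigmoid(d_logits_real))) +
             np.mean(np.log(1.0 - sigmoid(d_logits_fake))))

rng = np.random.default_rng(3)
real = rng.normal(loc=2.0, size=100)    # logits for real target images
fake = rng.normal(loc=-2.0, size=100)   # logits for translated images

loss_confident = gan_d_loss(real, fake)                  # D separates well
loss_chance = gan_d_loss(np.zeros(100), np.zeros(100))   # D at chance level
```

A discriminator at chance (D = 0.5 everywhere) incurs a loss of 2 log 2; a discriminator that separates real from translated images scores lower, which is what the generator tries to prevent.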
Total loss. The encoders, decoders and discriminators are trained jointly by optimizing a weighted sum of the adversarial and bidirectional reconstruction terms:

$$\min_{E_1, E_2, G_1, G_2}\ \max_{D_1, D_2}\ \mathcal{L}_{\text{GAN}}^{x_1} + \mathcal{L}_{\text{GAN}}^{x_2} + \lambda_x\big(\mathcal{L}_{\text{recon}}^{x_1} + \mathcal{L}_{\text{recon}}^{x_2}\big) + \lambda_c\big(\mathcal{L}_{\text{recon}}^{c_1} + \mathcal{L}_{\text{recon}}^{c_2}\big) + \lambda_s\big(\mathcal{L}_{\text{recon}}^{s_1} + \mathcal{L}_{\text{recon}}^{s_2}\big)$$
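Putting the terms together is just a weighted sum. A minimal sketch with placeholder loss values (the default weights below are my assumption, not quoted from the paper; check the official config for the actual values):

```python
# Weighted sum of the MUNIT-style objective: two GAN terms plus image,
# content and style reconstruction terms for both domains. The lambda
# defaults here are assumptions for illustration only.
def total_loss(gan_1, gan_2, img_1, img_2, cont_1, cont_2, sty_1, sty_2,
               lam_x=10.0, lam_c=1.0, lam_s=1.0):
    return (gan_1 + gan_2
            + lam_x * (img_1 + img_2)
            + lam_c * (cont_1 + cont_2)
            + lam_s * (sty_1 + sty_2))

# Example with placeholder per-term loss values:
loss = total_loss(0.5, 0.5, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
```

In practice the generator side (encoders and decoders) minimizes this sum while the discriminators maximize the GAN terms, alternating updates as in standard GAN training.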
Reproducing the code
I have to say that papers from NVIDIA research are very conscientious: the code can be reproduced quickly, unlike SketchyGAN last time, where emailing the author about code problems and opening issues got no reply……
Code address :GitHub - NVlabs/MUNIT: Multimodal Unsupervised Image-to-Image Translation
Use the address :imaginaire/projects/munit at master · NVlabs/imaginaire · GitHub
I went through this code. The authors provide a model pretrained on the shoes dataset; although it works very well on edge maps, when I switched to the Sketchy dataset the results were mediocre. That is reasonable: the authors propose a general framework that is not optimized for sketch data.
I also computed FID and IS; the scores come out higher than those of other unsupervised GAN methods, which is a bit awkward.
Personal thoughts
Unfortunately, my reading of this paper is still rough. The general framework the authors propose is somewhat complex, and beyond being able to use it directly I have not studied it in depth; I will come back to this part when I have time.
