
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion (Chenfei Wu et al.)

2022-07-27 11:47:00 【Xiao Chen who wants money】

NÜWA: a multimodal approach for generating and manipulating visual data (images and videos).

Contributions:

1. A 3D transformer that can take text, images, and video as input.

2. A 3D Nearby Attention mechanism (3DNA) is proposed. 3DNA exploits locality along both the spatial and temporal axes. It reduces computational complexity and also improves the quality of the generated visual results.

3. SOTA results on T2I (text-to-image), T2V (text-to-video), video prediction, and other tasks. The model also shows good zero-shot ability on text-guided image manipulation (first row, fourth column of Figure 1) and on text-guided video manipulation (second row, first column of Figure 1).
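
The locality idea behind 3DNA in contribution 2 can be illustrated with a small sketch. It is not the paper's implementation: the window sizes in `extent`, the grid size, and both function names are assumptions made for illustration, and causal masking during decoding is ignored. It only shows how restricting each position to a local 3D window shrinks the number of query-key pairs compared with full attention.

```python
from math import prod

def nearby_indices(pos, grid, extent):
    """List the grid coordinates inside a local 3D window centred on pos.

    pos    : (i, j, k) position in the (h, w, s) token grid
    grid   : (h, w, s) size of the token grid
    extent : (eh, ew, es) window size per axis (hypothetical parameter)
    """
    ranges = []
    for p, size, e in zip(pos, grid, extent):
        lo = max(0, p - e // 2)            # clip the window at the lower border
        hi = min(size, p + e // 2 + 1)     # clip the window at the upper border
        ranges.append(range(lo, hi))
    return [(i, j, k) for i in ranges[0] for j in ranges[1] for k in ranges[2]]

def attention_pairs(grid, extent):
    """Upper bound on query-key pairs for full attention vs. nearby attention."""
    n = prod(grid)                         # total number of tokens h * w * s
    return n * n, n * prod(extent)

grid = (16, 16, 10)      # assumed toy grid: 16 x 16 spatial tokens, 10 frames
extent = (3, 3, 3)       # assumed window: 3 tokens along each axis

neighbours = nearby_indices((5, 5, 5), grid, extent)
full, nearby = attention_pairs(grid, extent)
print(len(neighbours))   # 27 positions for an interior token
print(full, nearby)      # 6553600 vs. 69120 pairs
```

With these toy sizes each token attends to at most 27 positions instead of all 2560, which is where the complexity reduction comes from.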

Introduction:

Some autoregressive models operate pixel by pixel, which has a drawback: they cannot handle high-dimensional visual data and are limited to low-resolution images and videos.

More recently, VQ-VAE has provided a way to convert visual data into discrete visual tokens, which makes large-scale training on visual synthesis tasks efficient. Its drawback is that VQ-VAE-based methods treat images and videos separately, which is unfriendly for training.

Method:

How are text, images, and video represented in a unified way?

1. All inputs share a common representation $X \in \mathbb{R}^{h \times w \times s \times d}$, where $h$ and $w$ are the height and width of the image, $s$ is the number of tokens along the third axis (for text, the number of word vectors), and $d$ is the dimension of each token.

2. Text is tokenized with a lower-cased byte pair encoding (BPE) and embedded into $\mathbb{R}^{1 \times 1 \times s \times d}$. Text has no extent in the $h$ and $w$ directions, so both are set to 1.

An image input $I \in \mathbb{R}^{h \times w \times c}$ also needs to be encoded, as follows.

$E(I)$ denotes the encoder output: the raw image is fed into the encoder $E$ to obtain $E(I) \in \mathbb{R}^{h \times w \times d_{B}}$. Each position $i$ of $E(I)$ is compared with the entries $B_{j}$ of the codebook $B \in \mathbb{R}^{N \times d_{B}}$, and the index of the nearest entry becomes the discrete token at that position, $z_{i} = \arg\min_{j} \lVert E(I)_{i} - B_{j} \rVert^{2}$. The decoder $G$ then reconstructs $\hat{I} = G(B[z])$ from the quantized tokens. This part is VQ-VAE; the codebook $B$ is then obtained by continued training of $G$ and $D$ (generator and discriminator, as in VQ-GAN). The final $B[z] \in \mathbb{R}^{h \times w \times 1 \times d}$ is used for training, where the 1 means there is no temporal dimension.

3. Video can be regarded as the temporal extension of images. Recent works such as VideoGPT[48] and VideoGen[51] extend the convolutions in the VQ-VAE encoder from 2D to 3D and train video-specific representations. However, this prevents images and videos from sharing a common codebook. The paper shows that simply encoding each frame of a video with a 2D VQ-GAN can also produce temporally consistent video while benefiting from both image and video data. The result is represented as $\mathbb{R}^{h \times w \times s \times d}$, where $s$ is the number of frames.
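
The three steps above can be sketched with NumPy to make the shapes concrete. This is a minimal illustration under stated assumptions, not the paper's code: random arrays stand in for the encoder output $E(I)$ and for the BPE embeddings, the sizes are toy values, the function name `quantize` is made up for this example, and the token dimension $d$ is taken equal to $d_{B}$ for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features : (h, w, d_B) array, stand-in for the encoder output E(I)
    codebook : (N, d_B) array, the codebook B
    returns  : (h, w) integer grid of token indices z
    """
    # Squared Euclidean distance from every position to every codebook entry
    diff = features[:, :, None, :] - codebook[None, None, :, :]
    dist = (diff ** 2).sum(axis=-1)                 # shape (h, w, N)
    return dist.argmin(axis=-1)                     # shape (h, w)

h, w, d_B, N = 4, 4, 8, 32                          # toy sizes (assumptions)
codebook = rng.normal(size=(N, d_B))

# Image: one token grid, with a temporal axis of size 1
image_features = rng.normal(size=(h, w, d_B))       # stand-in for E(I)
z_image = quantize(image_features, codebook)
image_repr = codebook[z_image][:, :, None, :]       # shape (h, w, 1, d_B)

# Video: quantize every frame with the same codebook, stack along the time axis
s = 5
frame_features = rng.normal(size=(s, h, w, d_B))    # stand-in for per-frame E(I)
z_video = np.stack([quantize(f, codebook) for f in frame_features], axis=2)
video_repr = codebook[z_video]                      # shape (h, w, s, d_B)

# Text: a sequence of tokens with no spatial extent
text_len = 7
text_embeddings = rng.normal(size=(text_len, d_B))  # stand-in for BPE embeddings
text_repr = text_embeddings[None, None, :, :]       # shape (1, 1, text_len, d_B)

print(image_repr.shape, video_repr.shape, text_repr.shape)
```

Because images and video frames go through the same `quantize` call with the same `codebook`, both modalities end up in one token vocabulary, which is the point made in step 3.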