NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion
2022-07-27 11:47:00 【Xiao Chen who wants money】
NÜWA is a multimodal model for generating and manipulating visual data (images and videos).

Contributions:
1. A 3D transformer that accepts text, image, and video inputs.
2. A 3D Nearby Attention (3DNA) mechanism. 3DNA exploits the locality of visual data along both the spatial and temporal axes. It reduces computational complexity and also improves the quality of the generated visuals.
3. State-of-the-art (SOTA) results on T2I (text-to-image), T2V (text-to-video), video prediction, and other tasks. The model also shows good zero-shot ability on text-guided image manipulation (first row, fourth column of Figure 1) and on text-guided video manipulation (second row, first column of Figure 1).
Introduction:
Some autoregressive models generate output pixel by pixel, which has a drawback: they cannot handle high-dimensional visual data and are limited to low-resolution images and videos.
More recently, VQ-VAE was introduced as a way to convert visual data into discrete visual tokens, which makes large-scale training for visual synthesis tasks efficient. Its drawback is that VQ-VAE-based methods treat videos and images separately, which is unfriendly for training.
Method:
How are text, images, and videos each normalized into a common representation?
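1. Use a unified representation for every input: $X \in \mathbb{R}^{h \times w \times s \times d}$, where $h$ and $w$ are the height and width of the image (in tokens), $s$ is the number of tokens along the sequence or temporal axis (for text, the number of word vectors, as in NLP), and $d$ is the dimension of each token.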
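2. Text is lower-cased, tokenized with byte pair encoding (BPE), and embedded into $\mathbb{R}^{1 \times 1 \times s \times d}$. Text has no extent along the $h$ and $w$ axes, so those dimensions are set to 1.
An image input $I \in \mathbb{R}^{H \times W \times C}$ also needs to be encoded. The formulas are:

$$z_i = \arg\min_j \lVert E(I)_i - B_j \rVert^2$$
$$\hat{I} = G(B[z])$$

Here $E$ is an encoder: the raw data $I$ is fed into $E$ to obtain $E(I)$, and each feature $E(I)_i$ is compared by distance against the codebook entries $B_j$, where $z \in \{0, 1, \ldots, N-1\}^{h \times w}$ and $B \in \mathbb{R}^{N \times d_B}$. The index of the codebook entry nearest to $E(I)_i$ becomes the token at that position, which discretizes the image, and the decoder $G$ reconstructs $\hat{I}$ from the selected entries. This part is VQ-VAE. $G$ and a discriminator $D$ are then trained together (as in VQ-GAN) to obtain $B$. The final image representation in $\mathbb{R}^{h \times w \times 1 \times d}$ is used for training, where the 1 means there is no temporal dimension.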
3. Video can be regarded as a temporal extension of images. Recent works such as VideoGPT [48] and VideoGen [51] extend the convolutions in the VQ-VAE encoder from 2D to 3D and train a video-specific representation. However, this does not allow images and videos to share a common codebook. The paper shows that simply using a 2D VQ-GAN to encode each frame of a video can also produce temporally consistent videos, while benefiting from both image and video data. The resulting representation is in $\mathbb{R}^{h \times w \times s \times d}$, where $s$ is the number of frames.
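The three steps above can be sketched in NumPy as follows. This is a toy illustration under assumptions, not the paper's implementation: random arrays stand in for the encoder output and the trained codebook, the sizes are arbitrary, and a single hypothetical embedding table `embed` is shared by text and visual tokens purely for brevity.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook row.

    features: (h, w, d_B) array standing in for the encoder output E(I).
    codebook: (N, d_B) array standing in for the codebook B.
    Returns integer token ids z with shape (h, w).
    """
    flat = features.reshape(-1, features.shape[-1])            # (h*w, d_B)
    # Squared Euclidean distance from every feature to every codebook row.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(features.shape[:-1])

rng = np.random.default_rng(0)
h, w, d_B, N, d = 4, 4, 8, 16, 32       # toy sizes, for illustration only
codebook = rng.normal(size=(N, d_B))    # placeholder for a trained codebook
embed = rng.normal(size=(N, d))         # hypothetical token-embedding table

# Image: a single frame, so the temporal axis has length 1.
image_feats = rng.normal(size=(h, w, d_B))
z_image = quantize(image_feats, codebook)          # (h, w)
x_image = embed[z_image][:, :, None, :]            # (h, w, 1, d)

# Video: quantize every frame with the same 2D codebook, then stack along s.
s = 3
frames = rng.normal(size=(s, h, w, d_B))
z_video = np.stack([quantize(f, codebook) for f in frames], axis=-1)  # (h, w, s)
x_video = embed[z_video]                            # (h, w, s, d)

# Text: a token sequence with no spatial extent.
text_ids = np.array([3, 7, 1, 12, 5])               # hypothetical BPE ids
x_text = embed[text_ids][None, None, :, :]          # (1, 1, s_text, d)

print(x_text.shape, x_image.shape, x_video.shape)
```

Because every frame goes through the same `quantize` call and the same codebook, images and videos end up in one token vocabulary, which is the point of the per-frame 2D encoding.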
3DNA
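An algorithm for reducing computation. The original paper explains it clearly, so it is not repeated here (it mainly cuts down the keys K and values V that each position attends to, keeping only a nearby region).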
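As a rough illustration of the locality idea, the sketch below restricts each position of an (h, w, s, d) grid to the keys and values inside a small 3D window around it. It is a simplified self-attention version written for these notes, assuming a symmetric window clipped at the grid boundary; the function name and the `extent` parameter are hypothetical, and the paper's causal masking and conditioning on a second input are omitted.

```python
import numpy as np

def nearby_attention_3d(q, k, v, extent=(3, 3, 3)):
    """Local attention over an (h, w, s, d) grid.

    Each position attends only to keys/values inside a window of size
    `extent` centred on it; the window is clipped at the grid boundary.
    """
    h, w, s, d = q.shape
    eh, ew, es = (e // 2 for e in extent)   # half-widths of the window
    out = np.empty_like(v)
    for i in range(h):
        for j in range(w):
            for t in range(s):
                # Slice out the local neighbourhood of keys and values.
                win = (slice(max(i - eh, 0), i + eh + 1),
                       slice(max(j - ew, 0), j + ew + 1),
                       slice(max(t - es, 0), t + es + 1))
                k_loc = k[win].reshape(-1, d)
                v_loc = v[win].reshape(-1, d)
                # Scaled dot-product scores, then a numerically stable softmax.
                scores = k_loc @ q[i, j, t] / np.sqrt(d)
                weights = np.exp(scores - scores.max())
                weights /= weights.sum()
                out[i, j, t] = weights @ v_loc
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 2, 8))
y = nearby_attention_3d(x, x, x, extent=(3, 3, 3))
print(y.shape)
```

With full attention every one of the h·w·s positions scores against all h·w·s keys, whereas here each position scores against at most the number of cells in the window, which is where the complexity saving comes from.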
Loss
