Tsinghua & Zhiyuan | CogView2: A Faster and Better Text-to-Image Generation Model

2022-06-27 01:13:00 【Zhiyuan community】

Paper title: CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers (arXiv)

This work comes from the team of Tang Jie, vice president of Zhiyuan, with Ding Ming as first author. It is the latest development of the WuDao ("Enlightenment") model. It has drawn a lot of attention on Reddit, and the GitHub repository already has more than 500 stars.

Abstract

The development of Transformer-based text-to-image models is held back by slow generation and by the complexity of handling high-resolution images. In this work, we propose a solution based on hierarchical Transformers and local parallel autoregressive generation. We pretrain a 6-billion-parameter Transformer with a simple and flexible self-supervised task, the Cross-modal general language model (CogLM), and fine-tune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared with the concurrent state-of-the-art DALL·E 2, and it naturally supports interactive text-guided editing of images.

The last part of the paper is especially interesting:

Autoregression or diffusion? Although GPT has achieved great success in text generation, diffusion models are becoming increasingly popular in image generation. Here we compare diffusion models with autoregressive models in terms of speed, which is the biggest drawback of autoregressive models discussed in Section 1. With the same architecture, diffusion models require more FLOPs but have a high degree of parallelism. They can also trade quality against time consumption by manually scheduling the sampling steps. For example, Glide [19] samples with 250 diffusion steps for evaluation and with 27 steps for interactive sampling, which reduces the latency to 15 seconds.
Autoregressive models must generate an image token by token, but our LoPAR can upsample the image with high parallelism. Therefore, by introducing more hierarchies into the model design, we can (potentially) reduce the time cost to the point of being faster than diffusion models.
Comparison between DALL·E 2 and CogView2. DALL·E 2 [27] is a recently released concurrent work on text-to-image generation at 1024 × 1024 resolution. Although its probabilistic model and architecture are quite different from those of CogView2, both share the same spirit: hierarchical generation. Judging from the limited demos of DALL·E 2, CogView2 can synthesize similar scenes, for example "a lion teacher" (Figure 1) versus "a panda scientist" (DALL·E 2), even though CogView2 was trained on only about 5% of the data used for DALL·E 2. Compared with CogView2, the main differences in DALL·E 2 are its third-level super-resolution and its "zeroth"-level image prior generation. Because training a third-level super-resolution is very resource consuming and more engineering oriented, we leave it to future work.
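
To make the speed argument in the autoregression-versus-diffusion passage more concrete, here is a rough sketch that only counts sequential model passes. It rests on a simplified reading of local parallel generation: the token grid is tiled into small windows, and the tokens at the same position in every window are sampled in the same pass. The grid size, window size, and function names are hypothetical and chosen for illustration; they are not taken from the paper or from the CogView2 code.

```python
def passes_token_by_token(height: int, width: int) -> int:
    """Plain autoregressive decoding: one forward pass per image token."""
    return height * width


def passes_local_parallel(window: int) -> int:
    """Simplified local parallel decoding.

    The grid is tiled into window x window blocks, and the tokens at the
    same offset in every block are sampled together, so the number of
    sequential passes depends on the window size, not on the image size.
    """
    return window * window


# Hypothetical sizes, for illustration only.
GRID = 60    # a 60 x 60 token grid
WINDOW = 6   # 6 x 6 local windows

ar = passes_token_by_token(GRID, GRID)
lp = passes_local_parallel(WINDOW)

# Step counts quoted in the passage above for Glide.
glide_eval_steps = 250
glide_interactive_steps = 27

print(f"token-by-token passes:  {ar}")
print(f"local parallel passes:  {lp}")
print(f"reduction factor:       {ar // lp}x")
print(f"Glide step reduction:   {glide_eval_steps / glide_interactive_steps:.1f}x")
```

Counting passes ignores the cost of each pass, so this is not a latency estimate. It only shows why adding hierarchy levels with parallel upsampling can cut the number of sequential steps far below the number of tokens.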

Code: https://github.com/THUDM/CogView2

Readers who want to experiment should note that this model has high hardware requirements; an NVIDIA A100 machine is recommended.
