GPT plus money (OpenAI CLIP,DALL-E)
2022-07-25 12:03:00 【Shangshanxianger】
This post connects images and text; more multimodal articles can be found in this blog's series on large-scale cross-modal pre-training (portal). It mainly collects two papers published by OpenAI: CLIP, which matches images against text categories, and DALL·E, which generates images directly from text descriptions with excellent results.

CLIP
Let's start with CLIP. Looking at the model, there are three steps: contrastive pre-training, creating a dataset classifier from label text, and using it for zero-shot prediction.
The overall structure of the first step is shown in the figure above. It is a two-stream image-text matching architecture: one branch is an image encoder (e.g., ResNet-50 or ViT) and the other is a text encoder (e.g., a Transformer), and together they extract the features. For a batch of text-image pairs, the inner products of the features give a matching matrix; along the rows the matrix acts as a classifier over texts for each image, and along the columns it acts as a similar classifier over images for each text. Training maximizes the probability of the blue diagonal entries, i.e., the inner-product similarity of matched pairs. Contrastive learning has already been covered on this blog, so I won't go into it here.
This step mainly uses a huge amount of training data (sentence-image pairs scraped directly from the Internet) to learn the feature representation. The next two steps are the inference process, which runs as follows:
Similar to the training phase, the image to be classified is first encoded to obtain its features. Each label of the target dataset is then converted into a corresponding sentence (because CLIP's pre-training data consists of sentences, a bare classification label would not fit). As in the figure above, the label dog becomes "A photo of a dog", with dog as the word to fill in; the model predicts that word by computing inner-product similarities, which yields the classification. Since this amounts to scoring generated sentences, it is naturally well suited to zero-shot classification.
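As a concrete illustration, here is a minimal zero-shot classification sketch using the openai/CLIP package, following the usage shown in its README; the image path and label set below are placeholders.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# turn each label into a sentence, e.g. "A photo of a dog"
labels = ["dog", "cat", "plane"]
text = clip.tokenize([f"A photo of a {label}" for label in labels]).to(device)
image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)  # probabilities over the custom labels
print(labels[probs.argmax().item()])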
At the same time, CLIP also lets you freely define your own classifiers! This makes it convenient to combine CLIP with a lot of other work; for example, DALL-E (covered later in this post) uses CLIP to rank its generated samples.
Let's take a look at CLIP's forward pass:
def forward(self, image, text):
    image_features = self.encode_image(image)  # encode the image
    text_features = self.encode_text(text)     # encode the text
    # L2-normalize the features
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # compute the scaled inner-product similarities (logits)
    logit_scale = self.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logit_scale * text_features @ image_features.t()
    # shape = [global_batch_size, global_batch_size]
    return logits_per_image, logits_per_text
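The training loss is not part of this forward method. Below is a minimal sketch of the symmetric cross-entropy objective described in the CLIP paper's pseudocode (an illustration, not the code from the repository): since the matched pairs sit on the diagonal of the logits matrix, the targets are simply arange(batch_size).

import torch
import torch.nn.functional as F

def clip_contrastive_loss(logits_per_image, logits_per_text):
    # for an N x N batch of logits, sample i's matched pair sits at index i
    n = logits_per_image.shape[0]
    targets = torch.arange(n, device=logits_per_image.device)
    loss_i = F.cross_entropy(logits_per_image, targets)  # image -> text direction
    loss_t = F.cross_entropy(logits_per_text, targets)   # text -> image direction
    return (loss_i + loss_t) / 2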
Why does it work so well?
- A huge dataset. The contrastive pre-training uses roughly 400 million image-text pairs collected by the authors from the Internet, with no human annotation required (which saves an enormous amount of labeling effort and broadens generalization). Each image is contrasted against up to 32,768 text candidates in a batch, a scale twice what SimCLR already considered large...
- The training objective. Instead of predicting the entire text description word by word, the label (e.g., dog) is put into a template like "A photo of a dog" and the model only has to pick out the matching text, which speeds up contrastive learning by roughly 4 to 10 times.
- Vision Transformer. Also covered in an earlier post on this blog (portal), so it is not repeated here. The authors' code uses ViT, which is about 3 times faster than a plain ResNet; this lets CLIP spend its compute budget (i.e., burn money on training) on even larger datasets.
For details, please refer to the original paper:
blog:https://openai.com/blog/clip/
paper:https://arxiv.org/pdf/2103.00020.pdf
code:https://github.com/openai/CLIP

DALL-E
Next comes the DALL-E model. CLIP mainly handles tasks such as classification and retrieval, whereas DALL-E can generate very good images directly from text. The motivation is to train a transformer that autoregressively models text and image tokens as a single stream of data, so the main thing to work out is how to convert a 2D image into such a stream as well.
Again, looking directly at the model, it can be divided into three stages as shown in the figure above: dVAE, Transformer, and CLIP.
- Stage One: dVAE, which produces a token for each image patch (giving a single data stream). Concretely, a 256×256 image is split into a 32×32 grid of patches, and a trained discrete variational autoencoder (dVAE) maps each patch into a vocabulary of size 8192, so an image ends up represented by 1024 tokens. This reduces the transformer's context size by a factor of 192 without a significant loss in "visual" quality. (A tokenization sketch follows after this list.)
- Stage Two: a Transformer, with a generative pre-training setup similar to GPT-3. As shown in the figure, the text is first embedded with a BPE encoder into 256 tokens (padded if shorter), which are concatenated with the image tokens and fed into a trained 12-billion-parameter Transformer (64 layers, 62 attention heads per layer, 64 dimensions per head, giving a model width of 3968) that models the joint distribution.
- Sampling and ranking. Finally, candidate images are sampled from the model and ranked with CLIP, which picks out the generated image that best matches the text.
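To make Stage One concrete, here is a minimal sketch of turning an image into 1024 discrete tokens with the released dVAE, loosely following the usage notebook in the official openai/DALL-E repo; the download URL and helper calls are taken from there, and the random tensor stands in for a real preprocessed image.

import torch
from dall_e import map_pixels, load_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", device)

# x: a batch of 256x256 RGB images with values in [0, 1], shape (B, 3, 256, 256)
x = torch.rand(1, 3, 256, 256).to(device)
z_logits = enc(map_pixels(x))                            # (B, 8192, 32, 32) logits over the codebook
image_tokens = torch.argmax(z_logits, dim=1).flatten(1)  # (B, 1024) discrete image tokens

# Stage Two concatenates 256 BPE text tokens with these 1024 image tokens
# into a single sequence of length 1280 for the autoregressive transformer.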
Some noteworthy tricks:
- Gumbel-Softmax. Mapping image patches into the vocabulary is a discrete operation, so a relaxed ELB (evidence lower bound) is optimized instead (see the sketch after this list).
- 1×1 convolutions at the end of the encoder and the beginning of the decoder of the dVAE.
- The output activations of the encoder and decoder are multiplied by a small constant.
- The cross-entropy losses over the text and image tokens are normalized.
- Since the modeling is mainly about the images, the text cross-entropy loss is weighted by 1/8 and the image cross-entropy loss by 7/8.
- The Adam optimizer with exponentially weighted iterate averaging.
- Mixed-precision training, to save GPU memory and improve throughput.
- Distributed optimization.
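To illustrate the first trick, here is a minimal sketch (not DALL-E's actual training code) of how a relaxed sample over the 8192-way codebook can be drawn with PyTorch's built-in gumbel_softmax, keeping the patch-to-token mapping differentiable while the dVAE is trained; the tensor shapes follow the 32×32 patch grid described above.

import torch
import torch.nn.functional as F

vocab_size, codebook_dim = 8192, 512              # sizes taken from the DALL-E setup
codebook = torch.nn.Embedding(vocab_size, codebook_dim)

# z_logits: encoder scores over the codebook for each patch, shape (B, 32, 32, 8192)
z_logits = torch.randn(1, 32, 32, vocab_size, requires_grad=True)

# relaxed (soft) one-hot sample; the temperature tau is annealed during training
z_soft = F.gumbel_softmax(z_logits, tau=1.0, hard=False, dim=-1)
patch_embeddings = z_soft @ codebook.weight       # (B, 32, 32, codebook_dim), differentiable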
For details, please refer to the original paper:
blog:https://openai.com/blog/dall-e/
paper:https://arxiv.org/pdf/2102.12092.pdf
official code (currently only the dVAE part): https://github.com/openai/DALL-E
community reimplementation: https://github.com/lucidrains/DALLE-pytorch
The reimplementation can be trained out of the box and reportedly works quite well; if you have enough GPUs, a pip install is all it takes:
$ pip install dalle-pytorch
import torch
from dalle_pytorch import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    num_visual_tokens = 512,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
)  # configure CLIP's hyperparameters

text = torch.randint(0, 10000, (4, 256))   # dummy token ids; replace with real BPE-encoded text
images = torch.randn(4, 3, 256, 256)       # dummy images
mask = torch.ones_like(text).bool()

loss = clip(text, images, text_mask = mask, return_loss = True)  # train CLIP directly
loss.backward()
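The same repo also covers training the DALL-E model itself (a DiscreteVAE for the image tokens plus the autoregressive transformer). The following sketch mirrors the example in the DALLE-pytorch README; argument names may differ slightly between versions, so treat it as an outline rather than exact code.

import torch
from dalle_pytorch import DiscreteVAE, DALLE

vae = DiscreteVAE(
    image_size = 256,
    num_layers = 3,
    num_tokens = 8192,       # codebook size, as in the paper
    codebook_dim = 512,
    hidden_dim = 64,
    num_resnet_blocks = 1,
    temperature = 0.9        # gumbel-softmax temperature
)

dalle = DALLE(
    dim = 1024,
    vae = vae,               # the dVAE supplies the image tokens
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 12,              # far smaller than the 64-layer, 12B-parameter original
    heads = 16,
    dim_head = 64
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)

loss = dalle(text, images, return_loss = True)
loss.backward()

# after training, sample images for a piece of text and rank them with CLIP
generated = dalle.generate_images(text)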
The next article will cover Prompt, a recently popular technique, in more detail.