GPT plus money (OpenAI CLIP, DALL-E)
2022-07-25 12:03:00 【Shangshanxianger】
Connecting images and text. More multimodal articles can be found in the blogger's series (large-scale cross-modal pre-training portal). This post mainly covers two papers published by OpenAI: CLIP, which matches images against text categories, and DALL·E, which generates images directly from text descriptions, both with excellent performance.

CLIP
First, CLIP. Looking at the model itself, there are three steps: Contrastive Pretraining, Create dataset classifier from label text, and Use for zero-shot prediction.
The overall structure of the first part is shown in the figure above. It is a two-stream image-text matching architecture: on one side an image encoder (e.g. ResNet-50 or ViT), on the other a text encoder (e.g. a Transformer) extracts features. For a batch of text-image pairs, the inner products of the features form a matching matrix: each row of the matrix acts as a classifier over texts for an image, and symmetrically each column acts as a classifier over images for a text. Training maximizes the probability of the blue diagonal entries (i.e. maximizes the inner-product similarity of matched pairs). Contrastive learning itself has been covered by the blogger elsewhere, so it is not repeated here.
This step uses a large amount of training data (sentence-image pairs harvested directly from the internet) to learn the feature representations. The next two steps are the inference process, which flows as follows:
Similar to the training phase, the image to be classified is first encoded into features. Then each label of the target dataset is converted into a corresponding sentence (because CLIP's pretraining data consists of sentences, bare classification labels do not fit the distribution). As shown in the figure above, the label dog becomes "A photo of a dog", with dog effectively masked; the model predicts the word by computing inner-product similarities, which already yields good classification. Since this feels like completing a sentence, it is in fact very well suited to zero-shot classification.
Meanwhile, with CLIP you can also freely define your own classifiers! This makes CLIP convenient to combine with a lot of other work; for example, DALL-E (covered below) uses CLIP to rank its generated samples.
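As a concrete illustration, here is a minimal zero-shot classification sketch following the usage example in the official openai/CLIP repository (dog.jpg and the prompt list are placeholders):

    import torch
    import clip  # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)  # placeholder image file
    texts = clip.tokenize(["A photo of a dog", "A photo of a cat", "A photo of a car"]).to(device)

    with torch.no_grad():
        logits_per_image, logits_per_text = model(image, texts)
        probs = logits_per_image.softmax(dim=-1)  # probabilities over the three prompts

    print(probs)  # the "dog" prompt should score highest for a dog photo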
Now take a look at CLIP's core forward logic:
def forward(self, image, text):
    image_features = self.encode_image(image)  # encode the image
    text_features = self.encode_text(text)     # encode the text

    # normalize the features
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # compute the scaled inner-product (cosine) similarity logits
    logit_scale = self.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logit_scale * text_features @ image_features.t()

    # shape = [global_batch_size, global_batch_size]
    return logits_per_image, logits_per_text
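These logits feed a symmetric cross-entropy loss. Following the pseudocode in the CLIP paper, a minimal self-contained sketch of that training objective (random logits stand in for the forward pass above):

    import torch
    import torch.nn.functional as F

    # stand-ins for the [n, n] logit matrices returned by the forward pass
    n = 8
    logits_per_image = torch.randn(n, n)
    logits_per_text = logits_per_image.t()

    labels = torch.arange(n)  # the i-th image in the batch matches the i-th text
    loss_i = F.cross_entropy(logits_per_image, labels)  # image -> text classification
    loss_t = F.cross_entropy(logits_per_text, labels)   # text -> image classification
    loss = (loss_i + loss_t) / 2                        # symmetric contrastive loss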
Why does it work so well?
- Large dataset. Contrastive pretraining uses roughly 400 million text-image pairs the authors collected from the internet and social media, with no human annotation required (which saves a lot of manpower and broadens generalization). For each image there are up to 32,768 text candidates, twice the scale that even SimCLR, already considered large, worked with…
- Learning objective: predict the matching text rather than the entire description. That is, dog becomes the form "A photo of a dog" and the model only has to predict dog, which accelerates contrastive learning by roughly 4-10x (a prompt-ensembling sketch follows this list).
- Vision Transformer. The blogger has also covered ViT elsewhere and will not repeat it: portal. In the code the authors use ViT, which is about 3x faster than a plain ResNet; this lets CLIP train on larger datasets and spend more time burning money (training).
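On the prompt side, the CLIP authors further improve zero-shot accuracy by averaging text embeddings over many templates (prompt ensembling). A minimal sketch, reusing the clip package from the earlier example; the label and template lists are illustrative:

    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    labels = ["dog", "cat", "car"]  # illustrative target classes
    templates = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

    with torch.no_grad():
        class_embeddings = []
        for label in labels:
            tokens = clip.tokenize([t.format(label) for t in templates]).to(device)
            emb = model.encode_text(tokens)
            emb = emb / emb.norm(dim=-1, keepdim=True)   # normalize each template embedding
            mean_emb = emb.mean(dim=0)                   # ensemble: average over templates
            class_embeddings.append(mean_emb / mean_emb.norm())
        zeroshot_weights = torch.stack(class_embeddings, dim=1)  # one column per class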
For details, refer to the original sources:
blog:https://openai.com/blog/clip/
paper:https://arxiv.org/pdf/2103.00020.pdf
code:https://github.com/openai/CLIP

DALL-E
Next comes the DALL-E model. CLIP mainly handles tasks such as classification and retrieval, whereas DALL-E can generate very good images directly from text. The motivation is to train a transformer to autoregressively model text and image tokens as a single stream of data, so the main thing to work out is how to convert a 2D image into that single stream as well.
Again, it is easiest to look at the model directly. As shown in the figure above, it can be divided into three stages: dVAE, Transformer, and CLIP.
- Stage One. A dVAE generates a token representation for each image patch (producing the single data stream). Concretely, a 256×256 image is split into a 32×32 grid of patches, and a trained discrete variational autoencoder (dVAE) maps each patch to one entry of an 8192-word vocabulary, so each image becomes a sequence of 1024 tokens. This stage reduces the transformer's context size by a factor of 192 without significantly degrading "visual" quality (see the tokenization sketch after this list).
- Stage Two. The Transformer follows a generative pretraining recipe similar to GPT-3. As shown in the figure, the text is first embedded with a BPE encoder into 256 tokens (padded if shorter), concatenated with the 1024 image tokens, and fed into a trained 12-billion-parameter Transformer that models the joint distribution (64 layers, 62 heads per layer, 64 dimensions per head, 3968 dimensions in total).
- Sampling and ranking. Finally, candidate images are sampled from the model, and CLIP ranks the samples to pick the generated image that best matches the text.
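For Stage One, the official openai/DALL-E repository (linked below) ships the trained dVAE. A tokenization sketch adapted from that repo's usage notebook; the random tensor stands in for a preprocessed image, and the shapes are indicative:

    import torch
    import torch.nn.functional as F
    from dall_e import load_model, map_pixels, unmap_pixels  # pip install DALL-E

    device = torch.device("cpu")
    enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", device)
    dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", device)

    x = map_pixels(torch.rand(1, 3, 256, 256))  # stand-in for a preprocessed 256x256 image

    z_logits = enc(x)                  # [1, 8192, 32, 32]: vocabulary logits per patch
    z = torch.argmax(z_logits, dim=1)  # [1, 32, 32]: 1024 discrete tokens per image
    z_onehot = F.one_hot(z, num_classes=8192).permute(0, 3, 1, 2).float()

    x_rec = unmap_pixels(torch.sigmoid(dec(z_onehot)[:, :3]))  # decode back to an image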
Some tricks worth noting:
- Gumbel-Softmax. Mapping image patches to a discrete vocabulary is not differentiable, so a relaxed ELB (evidence lower bound) is optimized instead (a small relaxation sketch follows this list).
- 1×1 convolutions at the end of the dVAE encoder and the beginning of the decoder.
- Multiplying the outgoing activations of the encoder and decoder by a small constant.
- Normalizing the cross-entropy losses for the text and image tokens.
- Since image modeling is the main goal, the text cross-entropy loss is weighted by 1/8 and the image cross-entropy loss by 7/8.
- The Adam algorithm with exponentially weighted iterate averaging.
- Mixed-precision training, to save GPU memory and improve throughput.
- Distributed optimization
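To make the first trick concrete: PyTorch provides a Gumbel-Softmax relaxation, so sampling patch tokens differentiably from dVAE logits can be sketched as follows (shapes match the 8192-entry vocabulary above; the temperature value is illustrative):

    import torch
    import torch.nn.functional as F

    z_logits = torch.randn(1, 8192, 32, 32)  # stand-in for dVAE encoder output logits

    # soft, differentiable samples over the vocabulary; annealing tau toward 0
    # pushes them toward hard one-hot tokens
    z_soft = F.gumbel_softmax(z_logits, tau=1.0, hard=False, dim=1)

    # hard=True returns one-hot samples while keeping gradients
    # via the straight-through estimator
    z_hard = F.gumbel_softmax(z_logits, tau=1.0, hard=True, dim=1)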
For details, refer to the original sources:
blog:https://openai.com/blog/dall-e/
paper:https://arxiv.org/pdf/2102.12092.pdf
official code (only the dVAE part for now): https://github.com/openai/DALL-E
a community reimplementation: https://github.com/lucidrains/DALLE-pytorch
The reimplementation can be trained out of the box and reportedly works very well; if you have enough cards, just pip install it:
$ pip install dalle-pytorch
import torch
from dalle_pytorch import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    num_visual_tokens = 512,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
)  # set up the CLIP hyperparameters

text = torch.randint(0, 10000, (4, 256))  # dummy token ids
images = torch.randn(4, 3, 256, 256)      # dummy images
mask = torch.ones_like(text).bool()

loss = clip(text, images, text_mask = mask, return_loss = True)  # train CLIP directly
loss.backward()
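The same library also wires the dVAE and the transformer together. A training sketch adapted from its README; exact keyword arguments vary across dalle-pytorch versions, so treat this as indicative rather than canonical:

    import torch
    from dalle_pytorch import DiscreteVAE, DALLE

    vae = DiscreteVAE(
        image_size = 256,
        num_layers = 3,
        num_tokens = 8192,   # vocabulary size, matching the paper
        codebook_dim = 512,
        hidden_dim = 64,
        temperature = 0.9    # Gumbel-Softmax temperature
    )

    dalle = DALLE(
        dim = 1024,
        vae = vae,           # the (ideally pretrained) image tokenizer
        num_text_tokens = 10000,
        text_seq_len = 256,
        depth = 12,
        heads = 16
    )

    text = torch.randint(0, 10000, (4, 256))
    images = torch.randn(4, 3, 256, 256)

    loss = dalle(text, images, return_loss = True)
    loss.backward()

    # after training: sample images, then rank the candidates with CLIP as in stage three
    generated = dalle.generate_images(text[:1])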
The next post will describe Prompt, a technique that has recently become popular, in more detail.