当前位置：网站首页>[untitled] multimodal model clip

[untitled] multimodal model clip

2022-07-27 11:47:00 【Xiao Chen who wants money】

Paper and code links

https://arxiv.org/pdf/2103.00020.pdf
https://github.com/openai/CLIP

Introduce

CLIP It's a bi modal task , For example, enter a sentence , Output an image ; Some previous work is to predict text descriptions through images , and CLIP Is to output images through text ;

Bright spot

1、 Two modes , Input is text and image , Text and image enter respectively encoder code ;

2、 Use comparative learning contrastive learning;

3、 Transform the classification model into a picture and text matching problem ;

Model

`There are many graphic pairings on the network , The author used 50w individual query Search the Internet for pictures , Every query 2w A picture , in total 4E A picture .`

Input is N Picture and text , This article and the picture are respectively through the corresponding encoder, obtain embeding, by force of contrast loss, Calculation 2 Between modes cosine similarity, Want to pair loss Maximum （ That is, the diagonal value in the graph is the largest ）, The remaining values are the smallest . yes zero-shot One of the ways of .

among text encoder Use transformer,image encoder Adopted 2 Kind of model , Respectively ：

5 Kind of ResNet：ResNet-50, ResNet-101, EfficientNet-style Of ResNet, Include RN50x4, RN50x16, RN50x64;
3 Kind of ViT：ViT-B/32, ViT-B/16, ViT-L/14;

The pseudocode is as follows ：

summary ：

This is more than Imagenet Simple classification is good , Because if it's just classification ,encoder Only one element will be considered , With ‘ Dog ’ For example , When classifying dogs , Only gather some characteristics about dogs ; But if it is in the way of graphic matching , In addition to the dog's information in the text , It also includes other redundant information . for example , This is a field dog , It can be subdivided and so on .

原网站

版权声明
本文为[Xiao Chen who wants money]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/208/202207271059007439.html