Multimodal model CLIP
2022-07-27 11:47:00 【Xiao Chen who wants money】
Paper and code links
https://arxiv.org/pdf/2103.00020.pdf
https://github.com/openai/CLIP
Introduction
CLIP is a bimodal model that works with both text and images. For example, given a sentence, it can pick out the image that matches it. Some earlier work predicted text descriptions from images, whereas CLIP learns to match text with images, so that text can be used to select images.
Highlights
1. Two modalities: the inputs are text and images, and each is encoded by its own encoder;
2. It uses contrastive learning;
3. It turns the classification problem into an image-text matching problem.
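Point 3 can be illustrated with a minimal sketch: to classify an image, embed one text prompt per class and choose the prompt most similar to the image embedding. The embeddings below are hypothetical toy vectors, and `zero_shot_classify` is an illustrative helper, not a function from the CLIP repository; in practice the vectors would come from the two encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose text embedding is most similar to the image."""
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs, axis=1)
    sims = text_embs @ image_emb              # one cosine similarity per label
    return labels[int(np.argmax(sims))], sims

# Hypothetical embeddings (assumption: real ones come from the encoders).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = np.array([[1.0, 0.1, 0.0],
                      [0.1, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.9, 0.2, 0.1])         # toy vector closest to the "dog" prompt

best, sims = zero_shot_classify(image_emb, text_embs, labels)
print(best, sims)
```

Because the classes are expressed as text, new classes can be added by writing new prompts, without retraining a classification head.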
Model
The Internet contains a large number of image-text pairs. The authors used 500,000 queries to search the Internet for images, collecting up to 20,000 images per query, for a total of 400 million image-text pairs.

The input is a batch of N image-text pairs. The text and the image each pass through their own encoder to produce an embedding, and the cosine similarity between the two modalities is computed for every image-text combination. The contrastive loss pushes the similarity of the matched pairs (the diagonal of the similarity matrix in the figure) to be as large as possible and the remaining values to be as small as possible. This is one way of achieving zero-shot prediction.
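The objective described above can be written out as follows. This is a sketch following the symmetric cross-entropy form in the CLIP paper; the symbols $u_i$, $v_j$, $s_{ij}$ and the temperature $\tau$ are notation introduced here for illustration, not taken from the post.

```latex
% u_i: L2-normalized embedding of image i;  v_j: L2-normalized embedding of text j
% s_{ij}: cosine similarity scaled by a temperature \tau
s_{ij} = \frac{u_i^{\top} v_j}{\tau}

% Image-to-text: each image i should select its own text (column i of row i)
\mathcal{L}_{\mathrm{i2t}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}

% Text-to-image: each text j should select its own image (row j of column j)
\mathcal{L}_{\mathrm{t2i}} = -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(s_{jj})}{\sum_{i=1}^{N} \exp(s_{ij})}

% Final loss: average of the two directions
\mathcal{L} = \frac{1}{2} \left( \mathcal{L}_{\mathrm{i2t}} + \mathcal{L}_{\mathrm{t2i}} \right)
```

Minimizing $\mathcal{L}$ raises the diagonal entries $s_{ii}$ relative to the other entries in the same row and the same column.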
The text encoder is a Transformer. Two families of models are used for the image encoder:
- 5 ResNets: ResNet-50, ResNet-101, and three EfficientNet-style scaled ResNets (RN50x4, RN50x16, RN50x64);
- 3 ViTs: ViT-B/32, ViT-B/16, ViT-L/14.
The pseudocode is as follows:

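A minimal NumPy sketch of the training step described above, assuming the encoders have already produced one feature vector per image and per text. The function names, the random toy features, and the fixed temperature of 0.07 are assumptions for illustration; the paper's pseudocode multiplies by the exponential of a learned temperature parameter, which has the same effect as dividing by a temperature.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def cross_entropy(logits, targets):
    """Mean cross-entropy over rows; targets[i] is the correct column of row i."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of N image-text pairs."""
    image_emb = l2_normalize(image_feats)
    text_emb = l2_normalize(text_feats)
    logits = image_emb @ text_emb.T / temperature   # (N, N) scaled cosine similarities
    targets = np.arange(logits.shape[0])            # matched pairs lie on the diagonal
    loss_i2t = cross_entropy(logits, targets)       # each image picks its text
    loss_t2i = cross_entropy(logits.T, targets)     # each text picks its image
    return (loss_i2t + loss_t2i) / 2, logits

# Toy features (assumption): images are noisy copies of their paired texts.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 8))
image_feats = text_feats + 0.05 * rng.normal(size=(4, 8))
shuffled = image_feats[[1, 2, 3, 0]]                # deliberately mismatched pairs

aligned_loss, logits = clip_style_loss(image_feats, text_feats)
mismatched_loss, _ = clip_style_loss(shuffled, text_feats)
print(aligned_loss, mismatched_loss)
```

The correctly paired batch should give the lower loss of the two, since its largest similarities sit on the diagonal.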
Summary:
This approach works better than plain ImageNet-style classification. With classification alone, the encoder only has to consider a single concept. Take "dog" as an example: when classifying dogs, the encoder only gathers features that identify a dog. With image-text matching, the text carries additional information beyond the fact that the image shows a dog. For example, the text may say that it is a field (rural) dog, which allows finer-grained distinctions.