Multimodal model CLIP
2022-07-27 11:47:00 【Xiao Chen who wants money】
Paper and code links
https://arxiv.org/pdf/2103.00020.pdf
https://github.com/openai/CLIP
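Introduction
CLIP is a model that works across two modalities, text and images. For example, given a sentence, it can select the matching image from a set of candidates (it retrieves images rather than generating them). Some earlier work predicted text descriptions from images, whereas CLIP learns to match text and images, so it can also go from text to images.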
Highlights
1. Two modalities: the input is text and images, and each goes through its own encoder;
2. It uses contrastive learning;
3. It turns the classification problem into an image-text matching problem;
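To make point 3 concrete, here is a minimal sketch of classification done as image-text matching. It assumes the embeddings have already been produced by an image encoder and a text encoder; the vectors, the prompt strings, and the helper names `l2_normalize` and `classify_by_matching` are hypothetical stand-ins for illustration, not part of the CLIP codebase.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify_by_matching(image_emb, text_embs, labels):
    """Return the label whose text embedding is most similar to the image embedding."""
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    sims = text_embs @ image_emb  # one cosine similarity per candidate prompt
    return labels[int(np.argmax(sims))], sims

# Toy stand-ins for encoder outputs (hypothetical values, not real CLIP embeddings).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = np.array([[1.0, 0.1, 0.0],
                      [0.1, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.9, 0.2, 0.1])  # constructed to lie closest to the "dog" prompt

best_label, sims = classify_by_matching(image_emb, text_embs, labels)
print(best_label)
```

Because the classes are just text prompts, adding a new class only means adding another prompt; no classifier head has to be retrained.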
Model
Image-text pairs are abundant on the web. The authors used 500,000 queries to search the Internet for images, collecting up to 20,000 image-text pairs per query, for a total of about 400 million image-text pairs.

The input is a batch of N image-text pairs. The text and the image each pass through their own encoder to obtain embeddings, and the cosine similarity between the two modalities is computed for all N × N combinations. The contrastive loss pushes the similarity of the matched pairs (the diagonal values in the figure) to be as large as possible and the remaining values to be as small as possible. This is one way to achieve zero-shot prediction.
The text encoder is a Transformer. Two families of image encoder are used:
- 5 ResNets: ResNet-50, ResNet-101, and three ResNets scaled up in the style of EfficientNet, namely RN50x4, RN50x16, and RN50x64;
- 3 ViTs: ViT-B/32, ViT-B/16, and ViT-L/14;
The paper gives NumPy-like pseudocode for this training step (Figure 3 in the paper).
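A minimal runnable NumPy sketch of the symmetric contrastive loss described above is given here. It is an illustration under stated assumptions rather than the authors' implementation: the encoders are skipped and the embeddings are toy values, the function names are hypothetical, and the temperature is a fixed value (0.07), whereas the paper treats it as a learned parameter.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_entropy(logits, targets):
    """Mean cross-entropy over rows, using a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N image-text pairs."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # [N, N] matrix of scaled cosine similarities; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature
    # The matching text for image i is text i, so the targets are the diagonal indices.
    targets = np.arange(logits.shape[0])
    loss_images = cross_entropy(logits, targets)    # each image picks its text
    loss_texts = cross_entropy(logits.T, targets)   # each text picks its image
    return (loss_images + loss_texts) / 2, logits

# Toy embeddings (assumption): row i of image_emb is built to match row i of text_emb.
text_emb = np.eye(4, 8)
image_emb = text_emb + 0.05

loss_matched, logits = clip_style_loss(image_emb, text_emb)
# Reversing the image order breaks every pairing, so the loss should rise.
loss_shuffled, _ = clip_style_loss(image_emb[::-1], text_emb)
print(loss_matched < loss_shuffled)
```

Minimizing this loss raises the diagonal of `logits` relative to the off-diagonal entries in both directions, which is the behavior described for the similarity matrix.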

Summary
This works better than plain ImageNet-style classification. With classification alone, the encoder only has to attend to a single concept. Take "dog" as an example: when classifying dogs, the model only gathers the features needed to recognize a dog. With image-text matching, the text carries additional information beyond the fact that a dog is present. For example, the caption may say that this is a rural (field) dog, which allows finer-grained distinctions.