[Paper Introduction] CLIP: Image-Text Pairing Pre-training (Learning Transferable Visual Models From Natural Language Supervision)
2022-06-09 08:03:00 【Zeng xiaofrog】
Paper link: 2103. Learning Transferable Visual Models From Natural Language Supervision
Project website: CLIP: Contrastive Language–Image Pre-training, Connecting Text and Images
Code: https://github.com/openai/CLIP
Summary
- CLIP (Contrastive Language–Image Pre-training) is a language-image pre-training model based on contrastive learning. It builds on a large body of prior work in zero-shot transfer, natural language supervision, and multimodal learning.
- CLIP is a pre-trained model, in the same sense as BERT, GPT, and ViT: it is first trained on a large amount of data, and the trained model can then take a piece of text (or an image) as input and output a vector representation of that text (or image). The difference is that CLIP is multimodal, handling both images and text, whereas BERT and GPT are text-only and ViT is image-only.
Original article: https://blog.csdn.net/me_yundou/article/details/123033447
Abstract
- SOTA computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is required to specify any other visual concept.
- Learning about images directly from raw text is a promising alternative that leverages a much broader source of supervision.
- We demonstrate that a simple pre-training task, predicting which caption (text) goes with which image, is an efficient and scalable way to learn SOTA image representations, training on a dataset of 400 million (image, text) pairs collected from the internet.
- After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks.
- We study performance on over 30 different computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification.
- The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline, without requiring any dataset-specific training. For example, we match the accuracy of the original ResNet50 on ImageNet zero-shot, without using any of the 1.28 million training examples it was trained on.
Chinese analyses of CLIP for reference
- Zhihu @Kakarot: CLIP explained in detail (Part 1) | Bridging text-image pre-training to achieve zero-shot ImageNet classification, rivaling fully supervised ResNet50/101
- CSDN @me_yundou: a Q&A-style walkthrough covering topic interpretation, what CLIP is, CLIP's contributions, CLIP's motivation, and more
Data sources and training method
The OpenAI team collected 400 million (image, text) pairs in order to train the CLIP model they propose. An example of an image-text pair is shown below:
Model structure
Figure 1 summary:
CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of training examples, images (I_1, ..., I_N) and texts (T_1, ..., T_N). The cosine similarity between I_i and T_j measures how well image i corresponds to text j: the greater the cosine similarity, the better the match.
Example caption: "pepper the aussie pup" (Pepper the Australian puppy)
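The training objective described above can be sketched as a symmetric cross-entropy over the N×N cosine-similarity matrix. Below is a minimal NumPy sketch (not the official implementation): the two encoders are assumed to have already produced the feature matrices, and the temperature value is illustrative.

```python
import numpy as np

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched (image, text) pairs."""
    # L2-normalize so that dot products equal cosine similarities
    image_features = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)

    # N x N similarity matrix: entry (i, j) compares image i with text j
    logits = image_features @ text_features.T / temperature

    # The correct pairings lie on the diagonal: image i matches text i
    n = logits.shape[0]
    labels = np.arange(n)

    def cross_entropy(logits, labels):
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # Average the image-to-text and text-to-image directions
    loss_i = cross_entropy(logits, labels)    # each image vs. all texts
    loss_t = cross_entropy(logits.T, labels)  # each text vs. all images
    return (loss_i + loss_t) / 2
```

Minimizing this loss pulls matched image and text embeddings together while pushing the N² − N incorrect pairings apart, which is what makes the similarity matrix in Figure 1 peak on its diagonal.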
Using the CLIP model directly (zero-shot) for image classification
The 1000 ImageNet class labels are converted into text prompts (each label wrapped in a template sentence such as "a photo of a {label}") before being fed to the text encoder; the image is then assigned to the label whose prompt embedding is most similar to the image embedding.
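This zero-shot procedure can be sketched as follows. This is a simplified NumPy illustration: the small label set, the prompt wording, and the pre-computed feature vectors are stand-ins for the real ImageNet labels and encoder outputs.

```python
import numpy as np

# Hypothetical small label set standing in for the 1000 ImageNet classes
labels = ["tabby cat", "golden retriever", "rocket"]

# Each class label is wrapped in a prompt sentence before being encoded
prompts = [f"a photo of a {label}" for label in labels]

def zero_shot_classify(image_feature, text_features):
    """Return the index of the prompt most similar to the image.

    image_feature: (d,) vector from the image encoder
    text_features: (num_classes, d) matrix, one row per encoded prompt
    Both are assumed to be pre-computed; only the ranking step is shown.
    """
    image_feature = image_feature / np.linalg.norm(image_feature)
    text_features = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    similarities = text_features @ image_feature  # cosine similarity per class
    return int(np.argmax(similarities))
```

With the real model, `model.encode_text(clip.tokenize(prompts))` and `model.encode_image(...)` would supply the two feature arguments; no classifier is trained on the target dataset at any point.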
Conclusion
- We investigated whether the success of task-agnostic, web-scale (hundreds of millions of examples) pre-training in NLP could be transferred to another domain. We found that adopting this formula leads to similar behavior in computer vision, and we discussed the social implications of this line of research.
- To optimize their training objective, CLIP models learn to perform a wide variety of tasks during pre-training.
- This task learning can then be leveraged via natural language prompting, enabling zero-shot transfer to many existing datasets.
- At sufficient scale, the performance of this approach can be competitive with task-specific supervised models, although there is still plenty of room for improvement.
Official code example
Example 1—— Predict which text the figure below matches
- “a diagram”, “a dog”, “a cat”

Pairing code
The result is “a diagram”
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]
Example 2: predict multi-image, multi-text pairings (Colab)
Pictures

Text descriptions of the pictures
# images in skimage to use and their textual descriptions
descriptions = {
"page": "a page of text about segmentation",
"chelsea": "a facial photo of a tabby cat",
"astronaut": "a portrait of an astronaut with the American flag",
"rocket": "a rocket standing on a launchpad",
"motorcycle_right": "a red motorcycle standing in a garage",
"camera": "a person looking at a camera on a tripod",
"horse": "a black-and-white silhouette of a horse",
"coffee": "a cup of coffee on a saucer"
}
Part of the code
(some code omitted; see the Colab notebook for the complete code)
for filename in [filename for filename in os.listdir(skimage.data_dir) if filename.endswith(".png") or filename.endswith(".jpg")]:
    name = os.path.splitext(filename)[0]
    if name not in descriptions:
        continue

    image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")

    original_images.append(image)
    images.append(preprocess(image))
    texts.append(descriptions[name])

image_input = torch.tensor(np.stack(images)).cuda()
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).cuda()

with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    text_features = model.encode_text(text_tokens).float()

# Calculating cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T

# Result visualization
count = len(descriptions)

plt.figure(figsize=(20, 14))
plt.imshow(similarity, vmin=0.1, vmax=0.3)
# plt.colorbar()
plt.yticks(range(count), texts, fontsize=18)
plt.xticks([])
for i, image in enumerate(original_images):
    plt.imshow(image, extent=(i - 0.5, i + 0.5, -1.6, -0.6), origin="lower")
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{similarity[y, x]:.2f}", ha="center", va="center", size=12)
for side in ["left", "top", "right", "bottom"]:
    plt.gca().spines[side].set_visible(False)

plt.xlim([-0.5, count - 0.5])
plt.ylim([count + 0.5, -2])

plt.title("Cosine similarity between text and image features", size=20)
Pairing results
