
[Paper introduction] CLIP: Learning Transferable Visual Models From Natural Language Supervision (pre-training on image–natural-language pairs)

2022-06-09 08:03:00 Zeng xiaofrog

Paper: Learning Transferable Visual Models From Natural Language Supervision (arXiv 2103)
Project page: CLIP (Contrastive Language–Image Pre-training): Connecting Text and Images
Code: https://github.com/openai/CLIP

Summary

  • CLIP (Contrastive Language–Image Pre-training, i.e. contrastive-learning-based language–image pre-training) builds on a large body of prior work on zero-shot transfer, natural language supervision, and multimodal learning.

CLIP is a pre-trained model, in the same sense as BERT, GPT, and ViT. These models are first trained on large amounts of data; the trained model then takes a piece of text (or an image) as input and outputs a vector representation of that text (or image). The difference between CLIP and BERT, GPT, or ViT is that CLIP is multimodal, handling both images and text, whereas BERT and GPT work on the text modality only and ViT on the image modality only.
(Source: https://blog.csdn.net/me_yundou/article/details/123033447)
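
As a quick illustration of this "text or image in, vector out" behaviour, here is a minimal sketch using the open-source clip package; the filename dog.jpg is just a placeholder, and the 512-dimensional output corresponds to the ViT-B/32 checkpoint:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # text in -> vector out
    text_vec = model.encode_text(clip.tokenize(["a photo of a dog"]).to(device))
    # image in -> vector out (dog.jpg is a placeholder filename)
    img_vec = model.encode_image(preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device))

print(text_vec.shape, img_vec.shape)  # both (1, 512) for ViT-B/32: a shared embedding space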

Abstract

  • State-of-the-art (SOTA) computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept.
  • Learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision.
  • We demonstrate that the simple pre-training task of predicting which caption (text) goes with which image is an efficient and scalable way to learn SOTA image representations, using a dataset of 400 million (image, text) pairs collected from the internet.
  • After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks.
  • We study performance on more than 30 different computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification.
  • The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline, without requiring any dataset-specific training.
  • For example, we match the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any of the 1.28 million training examples it was trained on.

Reference: a Chinese-language analysis of CLIP

Data source and training method

The OpenAI team collected 400 million (image, text) pairs in order to train their proposed CLIP model. An example image–text pair is shown below:
[Figure: an example (image, text) pair]

Model structure

Overview of Figure 1:

CLIP jointly trains an image encoder and a text encoder to predict the correct pairings in a batch of (image, text) training examples. With image features I_1, ..., I_N and text features T_1, ..., T_N, the cosine similarity between I_i and T_j measures how well image i corresponds to text j: the higher the cosine similarity, the better the match.
[Figure 1: CLIP's contrastive pre-training over a batch of N (image, text) pairs]
Example caption from Figure 1: "pepper the aussie pup" (Pepper, the Australian Shepherd puppy)
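
To make the objective concrete, below is a simplified PyTorch sketch of this symmetric contrastive loss, adapted from the pseudocode in the paper; image_features, text_features, and logit_scale are assumed to come from the two encoders and the learned temperature:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    # L2-normalize so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) matrix of scaled pairwise similarities: row i = image i scored against every text
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # the i-th image belongs with the i-th text, so the correct "class" is on the diagonal
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2

# example usage with random features for a batch of N = 8 pairs, embedding dim d = 512
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale=100.0)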

Using the CLIP model directly for zero-shot image classification

The 1,000 ImageNet class labels are converted into natural-language prompts of the form "a photo of a {object}":
[Figure: zero-shot prediction by comparing the image embedding with the embeddings of the prompted class names]
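
Here is a hedged sketch of this zero-shot recipe with the clip package; the class names and the file test.jpg are placeholders rather than the real 1,000 ImageNet labels:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "airplane"]  # stand-ins for the 1,000 ImageNet label names
prompts = [f"a photo of a {name}" for name in class_names]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("test.jpg")).unsqueeze(0).to(device)  # hypothetical test image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # normalize, then the most similar prompt gives the predicted class
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Zero-shot prediction:", class_names[similarity.argmax(dim=-1).item()])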

Conclusion

  • We investigated whether the success of task-agnostic, web-scale (hundreds of millions of examples) pre-training in NLP can be transferred to another domain. We find that adopting this formula results in similar behavior emerging in computer vision, and we discuss the social implications of this line of research.
  • In order to optimize their training objective, CLIP models learn to perform a wide variety of tasks during pre-training.
  • This task learning can then be leveraged via natural-language prompting, enabling zero-shot transfer to many existing datasets.
  • At sufficient scale, the performance of this approach can be competitive with task-specific supervised models, although there is still plenty of room for improvement.

Official code examples

Example 1: predict which of the texts below matches the image

  • “a diagram”, “a dog”, “a cat”

[Image: CLIP.png, the diagram used in the code below]

Matching code

The result is “a diagram”

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # load the ViT-B/32 CLIP model and its preprocessing pipeline

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)  # preprocess the image and add a batch dimension
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)    # tokenize the three candidate texts

with torch.no_grad():
    image_features = model.encode_image(image)  # image embedding
    text_features = model.encode_text(text)     # text embeddings

    logits_per_image, logits_per_text = model(image, text)  # scaled cosine-similarity logits
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()  # probability of each text matching the image

print("Label probs:", probs)  # prints: [[0.9927937 0.00421068 0.00299572]]

Example 2: matching multiple images with multiple texts (Colab notebook)

Images

[Figure: the eight sample images from skimage.data used in this example]

Text descriptions of the images

# images in skimage to use and their textual descriptions
descriptions = {
    "page": "a page of text about segmentation",
    "chelsea": "a facial photo of a tabby cat",
    "astronaut": "a portrait of an astronaut with the American flag",
    "rocket": "a rocket standing on a launchpad",
    "motorcycle_right": "a red motorcycle standing in a garage",
    "camera": "a person looking at a camera on a tripod",
    "horse": "a black-and-white silhouette of a horse", 
    "coffee": "a cup of coffee on a saucer"
}

Part of the code

# ... setup code omitted (imports, clip.load, and the empty lists original_images, images, texts); see the Colab notebook for the complete version
for filename in [filename for filename in os.listdir(skimage.data_dir) if filename.endswith(".png") or filename.endswith(".jpg")]:
    name = os.path.splitext(filename)[0]
    if name not in descriptions:
        continue

    image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")
    original_images.append(image)
    images.append(preprocess(image))
    texts.append(descriptions[name])


image_input = torch.tensor(np.stack(images)).cuda()
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).cuda()

with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    text_features = model.encode_text(text_tokens).float()

# Calculating cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T


#  Result visualization 
count = len(descriptions)

plt.figure(figsize=(20, 14))
plt.imshow(similarity, vmin=0.1, vmax=0.3)
# plt.colorbar()
plt.yticks(range(count), texts, fontsize=18)
plt.xticks([])
for i, image in enumerate(original_images):
    plt.imshow(image, extent=(i - 0.5, i + 0.5, -1.6, -0.6), origin="lower")
for x in range(similarity.shape[1]):
    for y in range(similarity.shape[0]):
        plt.text(x, y, f"{
      similarity[y, x]:.2f}", ha="center", va="center", size=12)

for side in ["left", "top", "right", "bottom"]:
  plt.gca().spines[side].set_visible(False)

plt.xlim([-0.5, count - 0.5])
plt.ylim([count + 0.5, -2])

plt.title("Cosine similarity between text and image features", size=20)

Matching results

[Figure: heatmap of cosine similarities between the eight text descriptions and the eight images]
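
To read the matches off numerically instead of from the heatmap, one can take, for each image column, the text row with the highest cosine similarity; a short sketch continuing from the similarity and texts variables above:

# similarity has shape (num_texts, num_images); column j is image j scored against every description
best_text_idx = similarity.argmax(axis=0)
for img_idx, txt_idx in enumerate(best_text_idx):
    print(f"image {img_idx} -> '{texts[txt_idx]}'")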
