A concise but complete implementation of CLIP with various experimental improvements from recent papers

Overview

x-clip (wip)

Install

$ pip install x-clip

Usage

import torch
from x_clip import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    num_visual_tokens = 512,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = True   # whether to use fine-grained contrastive learning (FILIP)
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask = torch.ones_like(text).bool()

loss = clip(text, images, text_mask = mask, return_loss = True)
loss.backward()
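
For readers who want to see what a CLIP-style contrastive loss computes, here is a minimal sketch of the standard symmetric objective in plain PyTorch. It is not x-clip's internal code: the function name `clip_contrastive_loss`, the fixed `temperature`, and the random latents are assumptions made for illustration (CLIP itself learns the temperature).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(text_latents, image_latents, temperature=0.07):
    """Symmetric cross-entropy loss over a batch of paired text/image latents."""
    # L2-normalize so the dot product becomes a cosine similarity
    text_latents = F.normalize(text_latents, dim=-1)
    image_latents = F.normalize(image_latents, dim=-1)

    # (batch, batch) matrix; entry [i, j] compares text i with image j
    logits = text_latents @ image_latents.t() / temperature

    # the matching pair for row i sits on the diagonal
    labels = torch.arange(logits.shape[0], device=logits.device)

    loss_t2i = F.cross_entropy(logits, labels)      # text -> image direction
    loss_i2t = F.cross_entropy(logits.t(), labels)  # image -> text direction
    return (loss_t2i + loss_i2t) / 2

torch.manual_seed(0)
text_latents = torch.randn(4, 512)   # hypothetical projected text embeddings
image_latents = torch.randn(4, 512)  # hypothetical projected image embeddings

loss = clip_contrastive_loss(text_latents, image_latents)
aligned = clip_contrastive_loss(text_latents, text_latents)  # perfectly matched pairs
```

Perfectly matched pairs (`aligned`) give a much lower loss than random pairs, which is the signal the encoders are trained on.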

Citations

@misc{radford2021learning,
    title   = {Learning Transferable Visual Models From Natural Language Supervision}, 
    author  = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
    year    = {2021},
    eprint  = {2103.00020},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{yao2021filip,
    title   = {FILIP: Fine-grained Interactive Language-Image Pre-Training}, 
    author  = {Lewei Yao and Runhui Huang and Lu Hou and Guansong Lu and Minzhe Niu and Hang Xu and Xiaodan Liang and Zhenguo Li and Xin Jiang and Chunjing Xu},
    year    = {2021},
    eprint  = {2111.07783},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
Comments
  • Model forward outputs to text/image similarity score

    Any insight on how to turn the image/text embeddings (or the nominal model forward output) into a simple similarity score, as done in the Hugging Face implementation? HF example here

    In the original paper I see that the dot products of the image/text encoder outputs were used, but here I was having trouble with the dimensions of the outputs.

    opened by paulcjh 12
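
One way to get a similarity score like the one asked about, sketched under the assumption that you already hold text and image embeddings projected into the same latent space. The function name `similarity_scores`, the tensor shapes, and the fixed `logit_scale` are illustrative and not part of x-clip's API.

```python
import torch
import torch.nn.functional as F

def similarity_scores(text_latents, image_latents, logit_scale=100.0):
    """Return cosine similarities and per-image probabilities over the texts."""
    text_latents = F.normalize(text_latents, dim=-1)
    image_latents = F.normalize(image_latents, dim=-1)

    # (num_images, num_texts) cosine similarity matrix
    sim = image_latents @ text_latents.t()

    # scale and softmax over the text axis, one distribution per image
    probs = (sim * logit_scale).softmax(dim=-1)
    return sim, probs

torch.manual_seed(0)
text_latents = torch.randn(3, 512)   # hypothetical: 3 candidate captions
image_latents = torch.randn(2, 512)  # hypothetical: 2 images

sim, probs = similarity_scores(text_latents, image_latents)
```

Each row of `probs` sums to 1 and ranks the captions for one image. Both inputs must be 2D with the same last dimension, so token-level outputs need to be pooled and projected first.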
  • Using different encoders in CLIP

    Hi, I am wondering whether it is possible to use different encoders in CLIP, for example a ResNet instead of a ViT for images. Is it also possible to replace the text encoder with a feature encoder? If I have a vector of features for a given image and want to use x-clip, how should I do that? I wrote a code example that doesn't seem to work. Here is what I did:

    import torch
    from x_clip import CLIP
    import torch.nn as nn
    from torchvision import models
    
    class Image_Encoder(torch.nn.Module):
        #output size is (bs,512)
        def __init__(self):
            super(Image_Encoder, self).__init__()
            self.model_pre = models.resnet18(pretrained=False)
            self.base=nn.Sequential(*list(self.model_pre.children()))
            self.base[0]=nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
            self.resnet=self.base[:-1]
    
        def forward(self, x):
            out=self.resnet(x).squeeze()
            return out
    
    
    class features_encoder(torch.nn.Module):
        #output size is (bs,512)
        def __init__(self):
            super(features_encoder, self).__init__()
            self.model =nn.Linear(2048,512)
    
        def forward(self, x):
            out=self.model(x)
            return out
    
    images_encoder=Image_Encoder()
    features_encoder=features_encoder()
    
    clip = CLIP(
        image_encoder = images_encoder,
        text_encoder = features_encoder,
        dim_image = 512,
        dim_text = 512,
        dim_latent = 512
    )
    
    features= torch.randn(4,2048)
    images = torch.randn(4, 3, 256, 256)
    
    loss = clip(features, images, return_loss = True)
    loss.backward()
    

    But I got the following error: forward() takes 2 positional arguments but 3 were given

    Thanks

    opened by ethancohen123 8
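
The error message says an encoder's `forward` was called with one more positional argument than it accepts. Which argument that is depends on the installed x-clip version and is not confirmed here. One possible workaround, sketched with a hypothetical adapter named `TolerantEncoder`, is to wrap the custom encoder so that extra arguments are accepted and ignored:

```python
import torch
import torch.nn as nn

class TolerantEncoder(nn.Module):
    """Hypothetical adapter: passes only the first input to the wrapped encoder."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, x, *args, **kwargs):
        # extra positional/keyword arguments (for example a mask) are dropped
        return self.encoder(x)

features_encoder = TolerantEncoder(nn.Linear(2048, 512))

features = torch.randn(4, 2048)
mask = torch.ones(4, dtype=torch.bool)  # stand-in for whatever extra argument is passed

out = features_encoder(features, mask)          # extra positional argument
out_kw = features_encoder(features, mask=mask)  # extra keyword argument
```

This only addresses the call signature. The output shape the library expects from a custom encoder should be checked separately.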
  • Visual ssl with channels different than 3

    Hi, there seems to be a bug when using visual SSL with a number of channels other than 3. I think the error comes from the visual SSL type, around row 280 here:

    # send a mock image tensor to instantiate parameters
    self.forward(torch.randn(1, 3, image_size, image_size))

    opened by ethancohen123 4
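
A sketch of how the mock forward pass could take the channel count as a parameter instead of hardcoding 3. The class `LazyProjector` and its arguments `channels` and `proj_dim` are hypothetical and only illustrate the pattern; they are not the library's SSL wrappers.

```python
import torch
import torch.nn as nn

class LazyProjector(nn.Module):
    """Hypothetical wrapper that builds its projection on the first forward pass."""

    def __init__(self, net, image_size, channels=3, proj_dim=128):
        super().__init__()
        self.net = net
        self.proj_dim = proj_dim
        self.projector = None
        # the mock image uses the configured channel count, not a fixed 3
        with torch.no_grad():
            self.forward(torch.randn(1, channels, image_size, image_size))

    def forward(self, x):
        feats = self.net(x).flatten(1)
        if self.projector is None:
            # input size is only known once a tensor has passed through the net
            self.projector = nn.Linear(feats.shape[-1], self.proj_dim)
        return self.projector(feats)

# toy single-channel backbone standing in for the visual encoder
net = nn.Sequential(nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.AdaptiveAvgPool2d(1))
model = LazyProjector(net, image_size=64, channels=1)
out = model(torch.randn(2, 1, 64, 64))
```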
  • Allow other types of visual SSL when initiating CLIP

    In the following code as part of CLIP.__init__

            if use_visual_ssl:
                if visual_ssl_type == 'simsiam':
                    ssl_type = SimSiam
                elif visual_ssl_type == 'simclr':
                    ssl_type = partial(SimCLR, temperature = simclr_temperature)
                else:
                    raise ValueError(f'unknown visual_ssl_type')
    
                self.visual_ssl = ssl_type(
                    self.visual_transformer,
                    image_size = visual_image_size,
                    hidden_layer = visual_ssl_hidden_layer
                )
    

    the visual self-supervised learning is hardcoded. I would suggest changing this to accept the visual SSL module as an argument when instantiating CLIP to allow flexibility in the same manner as it does for the image encoder and text encoder.

    Example:

    barlow = BarlowTwins(augmentation_fns)
    clip = CLIP(..., visual_ssl=barlow)
    
    opened by Froskekongen 4
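
The suggestion above could be supported while keeping the existing string options. The sketch below is an assumption about how such a constructor argument might be resolved: `resolve_visual_ssl`, `builders`, and `PlaceholderSSL` are hypothetical names, and the placeholder class stands in for real SSL wrappers.

```python
import torch.nn as nn

class PlaceholderSSL(nn.Module):
    """Stand-in for a real SSL wrapper such as SimSiam or SimCLR."""

    def __init__(self, name):
        super().__init__()
        self.name = name

def resolve_visual_ssl(visual_ssl, builders):
    """Accept either a ready-made module or the name of a registered SSL type."""
    # a ready-made module is used as-is
    if isinstance(visual_ssl, nn.Module):
        return visual_ssl
    # otherwise treat the argument as a registered name
    if visual_ssl in builders:
        return builders[visual_ssl]()
    raise ValueError(f'unknown visual_ssl_type: {visual_ssl!r}')

builders = {
    'simsiam': lambda: PlaceholderSSL('simsiam'),
    'simclr': lambda: PlaceholderSSL('simclr'),
}

custom = PlaceholderSSL('barlow')  # e.g. a user-supplied Barlow Twins module
from_module = resolve_visual_ssl(custom, builders)
from_name = resolve_visual_ssl('simclr', builders)
```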
  • Extract Text and Image Latents

    Hi, in the current implementation we can only extract the text and image embeddings (by setting return_encodings=True), which are obtained before the latent linear layers are applied. Wouldn't it be better to add an option to extract the latent embeddings? This also matters because, with the current code, it is impossible to extract the similarity matrix between a batch of images and a batch of text.

    opened by mmsamiei 2
  • NaN with mock data

    Hi lucidrains,

    Try this and the loss will become NaN within 100 steps (latest GitHub code). The loss looks fine before the NaN.

    import torch
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True    
    torch.backends.cudnn.benchmark = True
    
    import random
    import numpy as np
    seed = 42
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
    num_text_tokens = 10000
    batch_sz = 12
    text_seq_len = 256
    visual_image_size = 256
    
    # mock data
    
    data_sz = 1000
    all_text = torch.randint(0, num_text_tokens, (data_sz, text_seq_len)).cuda()
    all_images = torch.randn(data_sz, 3, visual_image_size, visual_image_size).cuda()
    
    text = torch.zeros((batch_sz, text_seq_len), dtype=torch.long).cuda()
    images = torch.zeros((batch_sz, 3, visual_image_size, visual_image_size)).cuda()
    
    ##########################################################################################
    
    import wandb
    import datetime
    wandb.init(project="Test", name=datetime.datetime.today().strftime('%Y-%m-%d-%H-%M-%S'), save_code=False)
    
    from x_clip import CLIP
    
    clip = CLIP(
        dim_text = 512,
        dim_image = 512,
        dim_latent = 512,
        num_text_tokens = num_text_tokens,
        text_enc_depth = 6,
        text_seq_len = text_seq_len,
        text_heads = 8,
        visual_enc_depth = 6,
        visual_image_size = visual_image_size,
        visual_patch_size = 32,
        visual_heads = 8,
        use_all_token_embeds = False,           # whether to use fine-grained contrastive learning (FILIP)
        decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
        extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
        use_visual_ssl = True,                  # whether to do self supervised learning on images
        visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
        use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
        text_ssl_loss_weight = 0.05,            # weight for text MLM loss
        image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
    ).cuda()
    
    optimizer = torch.optim.Adam(clip.parameters(), lr=1e-4, betas=(0.9, 0.99))
    
    for step in range(999999):
        for i in range(batch_sz):
            data_id = random.randrange(0, data_sz - 1)
            text[i] = all_text[data_id]
            images[i] = all_images[data_id]
    
        loss = clip(
            text,
            images,
            freeze_image_encoder = False,   # whether to freeze image encoder if using a pretrained image net, proposed by LiT paper
            return_loss = True              # needs to be set to True to return contrastive loss
        )
        clip.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(clip.parameters(), 1.0)
        optimizer.step()
    
        now_loss = loss.item()
        wandb.log({"loss": now_loss}, step = step)
        print(step, now_loss)
    
        if 'nan' in str(now_loss):
            break
    
    opened by BlinkDL 1
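
The loop above detects NaN by string matching after the optimizer step has already run. As a debugging aid, the check can be done with `torch.isfinite` before the update, so a bad loss never reaches the weights. This is a minimal sketch: the toy linear model, the MSE loss, and the helper name `train_step` are assumptions standing in for the CLIP model and its loss.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, inputs, targets, max_norm=1.0):
    """Run one update; return the loss value, or None if the loss is not finite."""
    loss = nn.functional.mse_loss(model(inputs), targets)
    if not torch.isfinite(loss):
        return None  # skip the update so NaN/inf never reaches the weights
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
model = nn.Linear(8, 1)  # toy stand-in for the CLIP model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

ok = train_step(model, optimizer, torch.randn(4, 8), torch.randn(4, 1))
bad = train_step(model, optimizer, torch.full((4, 8), float('nan')), torch.randn(4, 1))
```

Here `ok` is a float and `bad` is `None`. To locate the operation that first produces a NaN in the backward pass, `torch.autograd.set_detect_anomaly(True)` can be enabled temporarily, at a noticeable speed cost.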
  • Unable to train to convergence (small dataset)

    Hi, nice work with x-clip. I'm hoping to play around with it and eventually combine it with your DALLE2 work.

    I'm currently having some trouble training on roughly 30k image-text pairs. The loss eventually goes negative and starts producing NaNs. I've lowered the learning rate (1e-4) and I'm clipping gradients (max_norm=0.5).

    Any thoughts on what are sane training params/configs on such a small dataset using x-clip?

    opened by jacobwjs 9
Releases (0.12.0)
Owner
Phil Wang
Working with Attention. It's all we need