Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Last update: Dec 30, 2022

Related tags

Deep Learning DeCLIP

Overview

DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.

Our paper is available on arXiv.

Updates

**Our code, dataset and models will be released soon.**

Introduction

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from these intrinsic supervision signals, our DeCLIP-ResNet50 can achieve 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above the CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework.
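
To make the "single image-text contrastive supervision" and the multi-view idea more concrete, here is a minimal sketch in plain Python. It is not the released DeCLIP code: the function names (`contrastive_loss`, `multi_view_loss`), the temperature value of 0.07, and the choice to average the loss over every image-view/text-view combination are illustrative assumptions. The embeddings are assumed to be non-zero vectors already produced by some image and text encoder.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_rows(logits, targets):
    # Mean cross-entropy; targets[i] is the index of the correct column for row i.
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[t]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Symmetric InfoNCE loss: the i-th image should match the i-th text.
    # temperature=0.07 is an assumed, illustrative value.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    targets = list(range(len(imgs)))
    return 0.5 * (cross_entropy_rows(logits, targets)
                  + cross_entropy_rows(logits_t, targets))

def multi_view_loss(image_views, text_views, temperature=0.07):
    # Hypothetical multi-view variant: average the contrastive loss over
    # every (image view, text view) combination of the same batch.
    losses = [contrastive_loss(iv, tv, temperature)
              for iv in image_views for tv in text_views]
    return sum(losses) / len(losses)

if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[0.9, 0.1], [0.2, 0.8]]
    print(contrastive_loss(images, texts))
```

With two augmented views per image and two per text, `multi_view_loss` would average four contrastive terms where plain CLIP-style training uses one, which is one way to read "multi-view supervision across modalities". As a side calculation from the figures quoted above, 400M pairs divided by 7.1 is roughly 56M pairs.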

Model

Our pretrained visual backbone models (w/o text encoder)

DeCLIP_r50: Google Drive
DeCLIP_vitb32: Google Drive

Citing DeCLIP

@misc{li2021supervision,
      title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm}, 
      author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
      year={2021},
      eprint={2110.05208},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Owner

Sense-GVT
