ZeroVL - The official implementation of ZeroVL

Last update: Nov 04, 2022

Overview

This repository contains the source code necessary to reproduce the results presented in the paper ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources.

Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) require tremendous amounts of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing them and exploring further. To this end, we provide comprehensive training guidance that makes dual-encoder multi-modal representation alignment feasible with limited resources. We also provide a reproducible strong baseline with competitive results, namely ZeroVL, trained on publicly accessible academic datasets in a popular experimental environment.

Performance

Image-text retrieval RSUM scores on the MSCOCO and Flickr30K datasets:

method	computation	data	COCO(zs.)	COCO(ft.)	F30K(zs.)	F30K(ft.)
CLIP	256 V100	400M	400.2	-	540.6	-
ALIGN	1024 TPUv3	1800M	425.3	500.4	553.3	576.0
baseline	8 V100	14.2M	363.5	471.9	476.8	553.0
ZeroVL	8 V100	14.2M	425.0	485.0	536.2	561.6
ZeroVL	8 V100	100M	442.1	500.5	546.5	573.6

zs.: zero-shot setting, ft.: fine-tuned setting.
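
This README does not define RSUM. In the image-text retrieval literature it usually denotes the sum of Recall@1, Recall@5 and Recall@10 (in percent) over both retrieval directions, image-to-text and text-to-image, so the maximum is 600. The sketch below assumes that definition. The function names and the toy ranks are hypothetical and are not taken from this codebase, whose evaluation code may differ in detail.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose ground-truth match is ranked in the top k.

    `ranks` holds, for each query, the 0-based rank of its best-ranked
    ground-truth item (0 means it was retrieved first).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    hits = sum(1 for r in ranks if r < k)
    return 100.0 * hits / len(ranks)


def rsum(i2t_ranks: Sequence[int], t2i_ranks: Sequence[int],
         ks: Sequence[int] = (1, 5, 10)) -> float:
    """Sum of Recall@k over both retrieval directions (assumed RSUM definition)."""
    i2t = sum(recall_at_k(i2t_ranks, k) for k in ks)
    t2i = sum(recall_at_k(t2i_ranks, k) for k in ks)
    return i2t + t2i


# Toy ranks for four queries per direction (made-up values, not real results).
i2t_ranks = [0, 3, 12, 0]   # image-to-text
t2i_ranks = [1, 0, 7, 4]    # text-to-image
print(rsum(i2t_ranks, t2i_ranks))  # 400.0
```

Under this definition, a score such as 425.0 means the six recall values average about 70.8%.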

Installation

Requirements:

Python 3.7
PyTorch 1.8.1
torchvision 0.9.1
CUDA 11.1

Install requirements:

pip3 install -r requirements.txt
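
After installing, it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch, not part of this repository: the helper names `matches` and `collect_versions` are hypothetical, and it assumes that an installed version counts as matching when it equals the listed one after dropping a local build suffix such as `+cu111`.

```python
import sys

# Versions listed in the Requirements section of this README.
EXPECTED = {"python": "3.7", "torch": "1.8.1", "torchvision": "0.9.1", "cuda": "11.1"}


def matches(installed, expected):
    """True if `installed` equals `expected`, ignoring a '+...' build suffix
    and allowing extra trailing components (e.g. '3.7.9' matches '3.7')."""
    if installed is None:
        return False
    base = str(installed).split("+")[0]
    return base == expected or base.startswith(expected + ".")


def collect_versions():
    """Gather installed versions; missing packages are reported as None."""
    versions = {"python": "{}.{}".format(*sys.version_info[:2]),
                "torch": None, "torchvision": None, "cuda": None}
    try:
        import torch
        versions["torch"] = str(torch.__version__)
        versions["cuda"] = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        pass
    try:
        import torchvision
        versions["torchvision"] = str(torchvision.__version__)
    except ImportError:
        pass
    return versions


installed = collect_versions()
for name, expected in EXPECTED.items():
    status = "ok" if matches(installed[name], expected) else "MISMATCH"
    print("{:12s} expected {:6s} found {} [{}]".format(
        name, expected, installed[name], status))
```

A mismatch does not necessarily mean the code will fail. It only flags a difference from the environment listed in this README.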

Getting Started

Check GETTING_STARTED.md for codebase usage.

Model Zoo

We will release pre-trained models soon.

Citing ZeroVL

If you use ZeroVL in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@article{cui2021zerovl,
  title={ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources},
  author={Cui, Quan and Zhou, Boyan and Guo, Yu and Yin, Weidong and Wu, Hao and Yoshie, Osamu},
  journal={arXiv preprint arXiv:2112.09331},
  year={2021}
}

License

ZeroVL is released under the MIT license. See LICENSE for details.
