[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Last update: Dec 13, 2022

Related tags

Overview

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [CVPR'21, Oral]

By Zhicheng Huang*, Zhaoyang Zeng*, Yupan Huang*, Bei Liu, Dongmei Fu and Jianlong Fu

Introduction

This is the official implementation of the paper. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than region-based approaches.

Architecture

Release Progress

VQA Codebase
Pre-training Codebase
Other Downstream Tasks

Installation

conda create -n soho python=3.7
conda activate soho
git clone https://github.com/researchmm/soho.git
cd soho
bash tools/install.sh

Getting Started

Download the training, validation and test data

mkdir -p $SOHO_ROOT/data/coco
cd $SOHO_ROOT/data/coco
# need to update
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train2014.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val2014.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test2015.zip
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/train_data_qa_caption_new_box.json
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/val_data_qa_caption_new_box.json
wget https://vqasc.blob.core.windows.net/t-zhihuawork/code_10/MultiScalePretrain/data/coco/test_data_qa.json

Download the Pre-training models

cd $SOHO_ROOT
mkdir -p $SOHO_ROOT/pretrained
cd $SOHO_ROOT/pretrained
# the following need to update
wget

Training a VQA model

cd $SOHO_ROOT
#use 8 GPUS to train the model
bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py 8

Evaluate a VQA model

bash tools/dist_test_vqa.sh configs/VQA/soho_res18_vqa.py 18 8

Citation

If you find this repo useful in your research, please consider citing the following papers:

@inproceedings{huang2021seeing,
  title={Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Huang, Yupan and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

@article{huang2020pixel,
  title={Pixel-bert: Aligning image pixels with text by deep multi-modal transformers},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  journal={arXiv preprint arXiv:2004.00849},
  year={2020}
}

Acknowledgements

We would like to thank mmcv and mmdetection. Our commons lib is based on mmcv.

[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Related tags

Overview

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [CVPR'21, Oral]

Introduction

Architecture

Release Progress

Installation

Getting Started

Citation

Acknowledgements

Owner

Multimedia Research

you can add any codes in any language by creating its respective folder (if already not available).

An Unsupervised Graph-based Toolbox for Fraud Detection

Code release for Local Light Field Fusion at SIGGRAPH 2019

A library of multi-agent reinforcement learning components and systems

Code release for the ICML 2021 paper "PixelTransformer: Sample Conditioned Signal Generation".

Pytorch Implementation of Auto-Compressing Subset Pruning for Semantic Image Segmentation

This repository contains the code used for Predicting Patient Outcomes with Graph Representation Learning (https://arxiv.org/abs/2101.03940).

a generic C++ library for image analysis

This is the face keypoint train code of project face-detection-project

Runtime type annotations for the shape, dtype etc. of PyTorch Tensors.

Random Erasing Data Augmentation. Experiments on CIFAR10, CIFAR100 and Fashion-MNIST

Deep Learning GPU Training System

This repo is customed for VisDrone.

A framework for attentive explainable deep learning on tabular data

Neural Articulated Radiance Field

Deduplicating Training Data Makes Language Models Better

Syntax-Aware Action Targeting for Video Captioning

Pytorch implementation of Zero-DCE++

To prepare an image processing model to classify the type of disaster based on the image dataset

Official implementation for "Symbolic Learning to Optimize: Towards Interpretability and Scalability"