CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
The implementation of the paper CLIP2Video: Mastering Video-Text Retrieval via Image CLIP.
CLIP2Video is a video-text retrieval model based on CLIP (ViT-B/32), which transfers the image-language pre-training model to video-text retrieval in an end-to-end manner. Our model involves a Temporal Difference Block to capture motion at fine-grained temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.
Introduction
This is the source code of CLIP2Video, a method for video-text retrieval based on temporal correlations. It is built on top of CLIP4Clip by Huaishao Luo et al. and implemented in PyTorch.
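As a rough intuition only (a conceptual sketch, not the repository's actual Temporal Difference Block), the idea of modeling motion through differences between adjacent frame representations can be illustrated as follows:

```python
import torch

def adjacent_frame_differences(frame_features: torch.Tensor) -> torch.Tensor:
    """Conceptual illustration: treat differences between adjacent frame
    features as a coarse motion cue.

    frame_features: (num_frames, dim), e.g. per-frame CLIP ViT-B/32 embeddings.
    Returns: (num_frames - 1, dim) tensor of adjacent-frame differences.
    """
    return frame_features[1:] - frame_features[:-1]

# Example with 12 sampled frames of 512-dim features.
feats = torch.randn(12, 512)
print(adjacent_frame_differences(feats).shape)  # torch.Size([11, 512])
```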
Requirements
pip install -r requirements.txt
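After installation, a quick sanity check (a minimal sketch; it only assumes that PyTorch, which the project is built on, is among the pinned dependencies) is to confirm that torch imports and whether a CUDA device is visible:

```python
import torch

# Verify that PyTorch is installed and report whether a GPU is available.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```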
Download Data and Pre-trained Model
Supported public training sets:
- MSR-VTT (9k)
- MSR-VTT (full)
- MSVD
- VATEX-English Version
Supported public testing protocols:
- MSR-VTT 1k-A protocol (SOTA)
- MSR-VTT full protocol (SOTA)
- MSVD (SOTA)
- VATEX-English version (SOTA)
Download official videos: the official videos of the datasets above can be downloaded from their respective official websites.
Pre-process
To train and test on the above datasets, use sample_frame.py to transform the raw videos into frames:
python sample_frame.py --input_path [raw video path] --output_path [frame path]
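For reference, the sketch below shows what frame extraction for a single video can look like with OpenCV (an illustrative stand-in that assumes opencv-python is installed; it is not the actual sample_frame.py):

```python
import os
import cv2  # assumes opencv-python is installed

def extract_frames(video_path: str, output_dir: str, target_fps: float = 1.0) -> None:
    """Save frames from one video at roughly target_fps frames per second."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(video_fps / target_fps)), 1)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(output_dir, f"{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()

# Example with hypothetical paths:
# extract_frames("raw_videos/video0.mp4", "frames/video0")
```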
(Optional) The splits and captions can be found via the links of the datasets used. For convenience, you can also use the splits provided in data/ directly.
Download CLIP model
To train and test on the above datasets with the pre-trained CLIP model, visit the official CLIP repository and download the ViT-B/32 weights.
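One convenient way to fetch the weights (a sketch assuming the official openai/CLIP pip package is installed; the ViT-B-32.pt file can equally be downloaded by hand from the CLIP repository) is to let clip.load cache the checkpoint locally and then point --clip_path at the cached .pt file:

```python
import clip  # pip install git+https://github.com/openai/CLIP.git

# Downloads ViT-B-32.pt into ./clip_weights if it is not already cached there.
model, preprocess = clip.load("ViT-B/32", device="cpu", download_root="./clip_weights")
print("ViT-B/32 cached under ./clip_weights; pass that .pt file via --clip_path")
```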
Test Model
We provide three models trained on MSVD, MSR-VTT and VATEX-English.
Model Name | checkpoint |
---|---|
CLIP2Video_MSVD | link |
CLIP2Video_MSRVTT9k | link |
CLIP2Video_VATEX | link |
To test the trained models, please refer to test/.
(Optional) If the path of the trained model (--checkpoint) does not exist, the parameters of the base CLIP model (--clip_path) will be loaded instead.
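The fallback described above amounts to something like the following sketch (illustrative helper and paths only, not the repository's actual loading code):

```python
import os
import torch

def load_weights(checkpoint_path: str, clip_path: str) -> dict:
    """Load the fine-tuned checkpoint if it exists; otherwise fall back to
    the base CLIP (ViT-B/32) parameters. Illustrative only."""
    if checkpoint_path and os.path.exists(checkpoint_path):
        return torch.load(checkpoint_path, map_location="cpu")
    # The official ViT-B-32.pt is a TorchScript archive, so load it via torch.jit.
    return torch.jit.load(clip_path, map_location="cpu").state_dict()

# Example with hypothetical paths:
# state_dict = load_weights("checkpoints/pytorch_model.bin", "clip_weights/ViT-B-32.pt")
```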
Main Article Results of CLIP2Video
Text-to-video retrieval (T2V):
Protocol | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
---|---|---|---|---|---|
MSVD | 47.0 | 76.8 | 85.9 | 2 | 9.6 |
MSRVTT-9k | 45.6 | 72.6 | 81.7 | 2 | 14.6 |
MSRVTT-Full | 29.8 | 55.5 | 66.2 | 4 | 45.5 |
VATEX (English) random 1k5 split | 57.3 | 90.0 | 95.5 | 1 | 3.6 |
VATEX (English) HGR split | 61.2 | 90.9 | 95.6 | 1 | 3.4 |
Video-to-text retrieval (V2T):
Protocol | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
---|---|---|---|---|---|
MSVD | 58.7 | 85.6 | 91.6 | 1 | 4.3 |
MSRVTT-9k | 43.5 | 72.3 | 82.1 | 2 | 10.2 |
MSRVTT-Full | 54.6 | 82.1 | 90.8 | 1 | 5.3 |
VATEX (English) random 1k5 split | 76.0 | 97.7 | 99.9 | 1 | 1.5 |
VATEX (English) HGR split | 77.9 | 98.1 | 99.1 | 1 | 1.6 |
(Optional) Clarification of the different results on VATEX:
- In our paper, we do not strictly follow HGR's split; we randomly split the test set ourselves, and this split is provided in data/vatex_data/test1k5_sec_list.txt.
- For the HGR split, we adopt exactly the same split as HGR, which is provided in data/vatex_data/test_list.txt and data/vatex_data/val_list.txt.
We will revise the results to strictly follow the HGR split for a fair comparison in a later version of the paper.
Citation
If you find CLIP2Video useful in your work, you can cite the following paper:
@article{fang2021clip2video,
title={CLIP2Video: Mastering Video-Text Retrieval via Image CLIP},
author={Fang, Han and Xiong, Pengfei and Xu, Luhui and Chen, Yu},
journal={arXiv preprint arXiv:2106.11097},
year={2021}
}
Acknowledgments
Some components of this implementation are adapted from CLIP and CLIP4Clip. We sincerely appreciate their contributions.