CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

Overview

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

The implementation of paper CLIP2Video: Mastering Video-Text Retrieval via Image CLIP.

CLIP2Video is a video-text retrieval model based on CLIP (ViT-B/32), which transfers the image-language pre-training model to video-text retrieval in an end-to-end manner. Our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

Pipeline Blocks

Introduction

This is the source code of CLIP2Video, a method for Video-Text Retrieval based on temporal correlations. It is built on top of the CLIP4Clip by ( Huaishao Luo et al.) in PyTorch.

Requirement

pip install -r requirements.txt 

Download data and Pre-trained Model

Supported public training sets:

  • MSR-VTT(9k)
  • MSR-VTT(full)
  • MSVD
  • VATEX-English Version

Supported public testing protocols:

  • MSR-VTT 1k-A protocol (SOTA)
  • MSR-VTT full protocol (SOTA)
  • MSVD(SOTA
  • VATEX-English version(SOTA

Download official video: Official videos of different data can be found as follows:

Pre-process

To train and test the above datasets: you should use sample_frame.py to transform video into frames.

python sample_frame.py --input_path [raw video path] --output_path [frame path]

(Optional) The splits and captions can be found in the links of used dataset. For the convenience, you can also use the split in data/ directly.

Download CLIP model

To train and test the above datasets based on pre-trained CLIP model, you should visit CLIP and download ViT-B/32.

Test Model

We provide three models trained on MSVD, MSR-VTT and VATEX-English.

Model Name checkpoint
CLIP2Video_MSVD link
CLIP2Video_MSRVTT9k link
CLIP2Video_VATEX link

To test the trained model, please refer test/.

(Optional) If the path of trained model(--checkpoint) doesn't exist, the parameters of basic CLIP (--clip_path) will be loaded.

Main Article Results of CLIP2Video

T2V:

Protocol [email protected] [email protected] [email protected] Median Rank Mean Rank
MSVD 47.0 76.8 85.9 2 9.6
MSRVTT-9k 45.6 72.6 81.7 2 14.6
MSRVTT-Full 29.8 55.5 66.2 4 45.5
Vatex (English) random 1k5 split 57.3 90.0 95.5 1 3.6
Vatex (English) HGR split 61.2 90.9 95.6 1 3.4

V2T:

Protocol [email protected] [email protected] [email protected] Median Rank Mean Rank
MSVD 58.7 85.6 91.6 1 4.3
MSRVTT-9k 43.5 72.3 82.1 2 10.2
MSRVTT-Full 54.6 82.1 90.8 1 5.3
Vatex (English) random 1k5 split 76.0 97.7 99.9 1 1.5
Vatex (English) HGR split 77.9 98.1 99.1 1 1.6

(Optional:) Clarification of different results in VATEX:

  1. In our paper, we do not strictly follow HGR's split, but randomly split the test set by ourselves, which is the split in

    • data/vatex_data/test1k5_sec_list.txt
  2. In HGR split, we adopt the totally same split following HGR, and the split can be seen as:

    • data/vatex_data/test_list.txt
    • data/vatex_data/val_list.txt

We will revise the results strictly following HGR split for fair comparison in the paper later!


Citation

If you find CLIP2Video useful in your work, you can cite the following paper:

@article{fang2021clip2video,
  title={CLIP2Video: Mastering Video-Text Retrieval via Image CLIP},
  author={Fang, Han and Xiong, Pengfei and Xu, Luhui and Chen, Yu},
  journal={arXiv preprint arXiv:2106.11097},
  year={2021}
}

Acknowledgments

Some components of this code implementation are adopted from CLIP and CLIP4Clip. We sincerely appreciate for their contributions.

A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

CLIP4CMR A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval The original data and pre-calculate

24 Dec 26, 2022
[ICCV2021] 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds

3DVG-Transformer This repository is for the ICCV 2021 paper "3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds" Our method "3DV

22 Dec 11, 2022
《Truly shift-invariant convolutional neural networks》(2021)

Truly shift-invariant convolutional neural networks [Paper] Authors: Anadi Chaman and Ivan Dokmanić Convolutional neural networks were always assumed

Anadi Chaman 46 Dec 19, 2022
Source code of AAAI 2022 paper "Towards End-to-End Image Compression and Analysis with Transformers".

Towards End-to-End Image Compression and Analysis with Transformers Source code of our AAAI 2022 paper "Towards End-to-End Image Compression and Analy

37 Dec 21, 2022
Language model Prompt And Query Archive

LPAQA: Language model Prompt And Query Archive This repository contains data and code for the paper How Can We Know What Language Models Know? Install

127 Dec 20, 2022
Parallel Latent Tree-Induction for Faster Sequence Encoding

FastTrees This repository contains the experimental code supporting the FastTrees paper by Bill Pung. Software Requirements Python 3.6, NLTK and PyTor

Bill Pung 4 Mar 29, 2022
Self-Supervised Image Denoising via Iterative Data Refinement

Self-Supervised Image Denoising via Iterative Data Refinement Yi Zhang1, Dasong Li1, Ka Lung Law2, Xiaogang Wang1, Hongwei Qin2, Hongsheng Li1 1CUHK-S

Zhang Yi 72 Jan 01, 2023
The implementation of CVPR2021 paper Temporal Query Networks for Fine-grained Video Understanding, by Chuhan Zhang, Ankush Gupta and Andrew Zisserman.

Temporal Query Networks for Fine-grained Video Understanding 📋 This repository contains the implementation of CVPR2021 paper Temporal_Query_Networks

55 Dec 21, 2022
A full pipeline AutoML tool for tabular data

HyperGBM Doc | 中文 We Are Hiring! Dear folks,we are offering challenging opportunities located in Beijing for both professionals and students who are k

DataCanvas 240 Jan 03, 2023
Code for "On Memorization in Probabilistic Deep Generative Models"

On Memorization in Probabilistic Deep Generative Models This repository contains the code necessary to reproduce the experiments in On Memorization in

The Alan Turing Institute 3 Jun 09, 2022
gtfs2vec - Learning GTFS Embeddings for comparing PublicTransport Offer in Microregions

gtfs2vec This is a companion repository for a gtfs2vec - Learning GTFS Embeddings for comparing PublicTransport Offer in Microregions publication. Vis

Politechnika Wrocławska - repozytorium dla informatyków 5 Oct 10, 2022
ICLR2021 (Under Review)

Self-Supervised Time Series Representation Learning by Inter-Intra Relational Reasoning This repository contains the official PyTorch implementation o

Haoyi Fan 58 Dec 30, 2022
Repo for the ACMMM20 submission: "Personalized breath based biometric authentication with wearable multimodality".

personalized-breath Repo for the ACMMM20 submission: "Personalized breath based biometric authentication with wearable multimodality". Guideline To ex

Manh-Ha Bui 2 Nov 15, 2021
Code for the paper "How Attentive are Graph Attention Networks?"

How Attentive are Graph Attention Networks? This repository is the official implementation of How Attentive are Graph Attention Networks?. The PyTorch

175 Dec 29, 2022
ElasticFace: Elastic Margin Loss for Deep Face Recognition

This is the official repository of the paper: ElasticFace: Elastic Margin Loss for Deep Face Recognition Paper on arxiv: arxiv Model Log file Pretrain

Fadi Boutros 113 Dec 14, 2022
Code & Experiments for "LILA: Language-Informed Latent Actions" to be presented at the Conference on Robot Learning (CoRL) 2021.

LILA LILA: Language-Informed Latent Actions Code and Experiments for Language-Informed Latent Actions (LILA), for using natural language to guide assi

Sidd Karamcheti 11 Nov 25, 2022
No-reference Image Quality Assessment(NIQA) Algorithms (BRISQUE, NIQE, PIQE, RankIQA, MetaIQA)

No-Reference Image Quality Assessment Algorithms No-reference Image Quality Assessment(NIQA) is a task of evaluating an image without a reference imag

Dae-Young Song 26 Jan 04, 2023
PyTorch code for 'Efficient Single Image Super-Resolution Using Dual Path Connections with Multiple Scale Learning'

Efficient Single Image Super-Resolution Using Dual Path Connections with Multiple Scale Learning This repository is for EMSRDPN introduced in the foll

7 Feb 10, 2022
Position detection system of mobile robot in the warehouse enviroment

Autonomous-Forklift-System About | GUI | Tests | Starting | License | Author | 🎯 About An application that run the autonomous forklift paletization a

Kamil Goś 1 Nov 24, 2021
PyTorch implementation of 'Gen-LaneNet: a generalized and scalable approach for 3D lane detection'

(pytorch) Gen-LaneNet: a generalized and scalable approach for 3D lane detection Introduction This is a pytorch implementation of Gen-LaneNet, which p

Yuliang Guo 233 Jan 06, 2023