Align and Prompt: Video-and-Language Pre-training with Entity Prompts

Overview

ALPRO

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven C.H. Hoi

Official PyTorch code for ALPRO. This repository supports pre-training as well as finetuning on

  • Text-Video Retrieval on MSRVTT and DiDeMo.
  • Video Question Answering on MSRVTT and MSVD.

Requirements

Our implementation is tested on Ubuntu 20.04.1 with NVIDIA A100 GPUs. Other platforms and hardware may work but are not guaranteed. To install the required packages:

cd env && bash install_pkg.sh

Data Preparation

  1. Download Annotations and Pre-trained Checkpoints

  2. Download raw videos of downstream datasets.

    • MSRVTT:
      • download train_val_videos.zip and test_videos.zip from e.g. here.

      • check md5sum:

        51f2394d279cf84f1642defd9a651e6f  train_val_videos.zip
        0af68454cec9d586e92805739f3911d0  test_videos.zip
      • unzip all the videos into data/msrvtt_ret/videos (10k in total).

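      • create the following soft link from the repository root (the link target is resolved relative to the directory that contains the link, hence the ../):

        ln -s ../msrvtt_ret/videos data/msrvtt_qa/videos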
    • MSVD:
      • download from official release:

        wget -nc https://www.cs.utexas.edu/users/ml/clamp/videoDescription/YouTubeClips.tar
      • check md5sum:

        9bdb20fcf14d59524a6febca9f6a8d89  YouTubeClips.tar
      • extract all the videos to data/msvd_qa/videos (1,970 videos in total).

        mkdir -p data/msvd_qa/videos
        tar xvf YouTubeClips.tar -C data/msvd_qa/videos --strip-components=1
    • DiDeMo:
      • Follow the instructions and download the videos from the official release here;
      • unzip all the videos into data/didemo_ret/videos.
      • Note that a few videos may be missing. See here to download them. Because they account for only a small portion of the training set, it is safe to ignore them.
      • Convert all the DiDeMo videos into *.mp4 format using e.g. ffmpeg.
      • We obtained 10,463 videos following these steps (with one video [email protected]_5753455690_1e04ccb364 missing).
  3. The directory is expected to be in the structure below:

    .
    |-config_release  # configuration files
    |-data  # text annotations and raw videos
    |---didemo_ret
    |-----txt
    |-----videos
    |---msrvtt_qa/...
    |---msrvtt_ret/...
    |---msvd_qa/...
    |-env  # scripts to install packages
    |-ext  # external resources, e.g. bert tokenizer
    |-output  # checkpoints for pre-trained/finetuned models
    |---downstreams
    |-----didemo_ret
    |-------public
    |---------ckpt # official finetuned checkpoints
    |---------log # inference log
    |---------results_test
    |-----------step_best_1_mean
    |-----msrvtt_qa/...
    |-----msrvtt_ret/...
    |-----msvd_qa/...
    |-run_scripts  # bash scripts to launch experiments
    |-src  # source code
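
Before launching any script, it can help to confirm that the downloaded archives and the directory layout match the steps above. The following is a minimal sketch and not part of the repository: the function names verify_md5, count_videos and check_layout are hypothetical, the directory list is taken from the paths mentioned in this section, and md5sum is assumed to be the GNU coreutils version.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the data preparation steps.
# None of these functions ship with ALPRO.

# Compare a file against an expected MD5 digest.
# Returns 0 on a match and non-zero otherwise.
verify_md5() {
    local expected="$1"
    local file="$2"
    # md5sum -c reads "<digest>  <path>" lines; "-" means read them from stdin.
    # --status suppresses all output so only the exit code is used.
    echo "${expected}  ${file}" | md5sum -c --status -
}

# Count regular files under a directory. -L makes find follow the path
# even when it is a symbolic link, as data/msrvtt_qa/videos is.
count_videos() {
    find -L "$1" -type f | wc -l | tr -d ' '
}

# Print every expected directory that is missing under the repository root.
# Returns 0 only when all of them exist.
check_layout() {
    local root="${1:-.}"
    local status=0
    local dir
    for dir in \
        data/didemo_ret/txt \
        data/didemo_ret/videos \
        data/msrvtt_ret/videos \
        data/msrvtt_qa/videos \
        data/msvd_qa/videos; do
        if [ ! -d "${root}/${dir}" ]; then
            echo "missing: ${dir}"
            status=1
        fi
    done
    return "${status}"
}
```

For example, verify_md5 9bdb20fcf14d59524a6febca9f6a8d89 YouTubeClips.tar checks the MSVD archive, count_videos data/msvd_qa/videos should print 1970 after a complete extraction, and check_layout . lists any directory that still needs to be created. A dangling soft link is reported as missing because the -d test follows links.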

Inference with Official Checkpoints

cd run_scripts
bash inf_msrvtt_ret.sh
# {'text2video': {'r1': 33.9, 'r5': 60.7, 'r10': 73.2, 'medianR': 3.0, 'meanR': 27.404}}
bash inf_didemo_ret.sh
# {'text2video': {'r1': 35.9, 'r5': 67.5, 'r10': 78.8, 'medianR': 3.0, 'meanR': 19.125}}
bash inf_msrvtt_qa.sh
# {'ratios': {'what_ratio': [68.48, 49872], 'who_ratio': [27.99, 20385], 'how_ratio': [2.25, 1640], 'where_ratio': [0.34, 250], 'when_ratio': [0.93, 677]}, 'overall_acc': 42.12, 'what_acc': 36.05, 'who_acc': 52.24, 'how_acc': 85.67, 'where_acc': 42.8, 'when_acc': 78.88}
bash inf_msvd_qa.sh
# {'ratios': {'what_ratio': [61.93, 8150], 'who_ratio': [34.6, 4554], 'how_ratio': [2.81, 370], 'where_ratio': [0.21, 28], 'when_ratio': [0.44, 58]}, 'overall_acc': 45.91, 'what_acc': 37.02, 'who_acc': 58.59, 'how_acc': 81.62, 'where_acc': 46.43, 'when_acc': 72.41}

Downstream Task Finetuning

  • To finetune on downstream tasks with the pre-trained checkpoint output/pretrain/alpro_pretrained_ckpt.pt:

    cd run_scripts
    bash ft_msrvtt_ret.sh
    bash ft_didemo_ret.sh
    bash ft_msrvtt_qa.sh
    bash ft_msvd_qa.sh

    For example, with MSRVTT retrieval:

    cd ALPRO/
    
    export PYTHONPATH="$PYTHONPATH:$PWD"
    echo $PYTHONPATH
    
    CONFIG_PATH='config_release/msrvtt_ret.json'
    
    # change -np to the number of GPUs.
    horovodrun -np 8 python src/tasks/run_video_retrieval.py \
        --config $CONFIG_PATH \
        --output_dir /export/home/workspace/experiments/alpro/finetune/msrvtt_ret/$(date '+%Y%m%d%H%M%S')  # change to your local path to store finetuning ckpts and logs 
  • Run inference with locally-finetuned checkpoints.

     cd ALPRO/
    
     export PYTHONPATH="$PYTHONPATH:$PWD"
     echo $PYTHONPATH
    
     STEP='best'
    
     CONFIG_PATH='config_release/msrvtt_ret.json'
     OUTPUT_DIR='[INPUT_YOUR_OUTPUT_PATH_HERE]'
    
     TXT_DB='data/msrvtt_ret/txt/test.jsonl'
     IMG_DB='data/msrvtt_ret/videos'
    
     horovodrun -np 8 python src/tasks/run_video_retrieval.py \
         --do_inference 1 \
         --inference_split test \
         --inference_model_step $STEP \
         --inference_txt_db $TXT_DB \
         --inference_img_db $IMG_DB \
         --inference_batch_size 64 \
         --output_dir $OUTPUT_DIR \
         --config $CONFIG_PATH
    • OUTPUT_DIR is the path passed to the --output_dir option in the finetuning script.
    • STEP is a string that tells the script to use the checkpoint $OUTPUT_DIR/ckpt/model_step_$STEP.pt for inference.
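
Since inference loads $OUTPUT_DIR/ckpt/model_step_$STEP.pt, a small pre-flight check can catch a wrong OUTPUT_DIR or STEP before horovodrun starts. This is only a sketch: the function names ckpt_path and require_ckpt are hypothetical and are not part of the ALPRO scripts.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of the ALPRO scripts.

# Build the checkpoint path from the naming rule described above:
#   $OUTPUT_DIR/ckpt/model_step_$STEP.pt
ckpt_path() {
    local output_dir="$1"
    local step="$2"
    # Strip one trailing slash so the result contains no "//".
    echo "${output_dir%/}/ckpt/model_step_${step}.pt"
}

# Print the checkpoint path and return 0 if the file exists.
# Otherwise report the expected path on stderr and return 1.
require_ckpt() {
    local path
    path="$(ckpt_path "$1" "$2")"
    if [ ! -f "${path}" ]; then
        echo "checkpoint not found: ${path}" >&2
        return 1
    fi
    echo "${path}"
}
```

One way to use it is to add require_ckpt "$OUTPUT_DIR" "$STEP" || exit 1 just before the horovodrun line, so the launch stops early when the checkpoint is absent.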

Pretraining

  1. Download WebVid2M and CC-3M.

    • Put WebVid2M videos under data/webvid2m;
    • 💡 we downsample WebVid2M videos to 10% of the original FPS to speed up video loading;
    • update data/cc3m/txt/cc3m.json with your local image paths.
  2. Training Prompter:

    cd run_scripts && bash pt_prompter.sh
  3. Training video-language model:

    cd run_scripts && bash pt_alpro.sh

    If you would like to use custom prompter weights, change teacher_weights_path in config_release/pretrain_alpro.json.

  4. To finetune with pre-trained checkpoints, please change e2e_weights_path in the finetuning config files, e.g. config_release/msrvtt_ret.json.
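
Step 1 mentions downsampling WebVid2M videos to 10% of their original FPS but does not say how this was done. One possible approach uses ffprobe and ffmpeg, sketched below. This is an assumption rather than the authors' documented procedure, and the function names as well as the data/webvid2m_lowfps directory in the closing comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: re-encode a video at a fraction of its source frame rate.
# Requires ffmpeg and ffprobe on PATH. Not the authors' documented method.

# Scale a frame rate by a ratio (default 0.1) and print it with three decimals.
# The rate may be "num/den", the form ffprobe reports, or a plain number.
scale_fps() {
    local rate="$1"
    local ratio="${2:-0.1}"
    echo "${rate}" | awk -F/ -v r="${ratio}" '{
        den = (NF > 1 && $2 != 0) ? $2 : 1
        printf "%.3f\n", ($1 / den) * r
    }'
}

# Read the frame rate of the first video stream, e.g. "30/1".
source_fps() {
    ffprobe -v error -select_streams v:0 \
        -show_entries stream=r_frame_rate \
        -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Write a copy of $1 to $2 at 10% of the source frame rate.
downsample_video() {
    local src="$1"
    local dst="$2"
    local rate
    local fps
    rate="$(source_fps "${src}")" || return 1
    [ -n "${rate}" ] || return 1
    fps="$(scale_fps "${rate}")"
    ffmpeg -nostdin -loglevel error -y -i "${src}" -vf "fps=${fps}" "${dst}"
}

# Possible usage, writing to a separate (hypothetical) directory:
#   mkdir -p data/webvid2m_lowfps
#   for f in data/webvid2m/*.mp4; do
#       downsample_video "$f" "data/webvid2m_lowfps/$(basename "$f")"
#   done
```

Writing to a separate directory keeps the original files intact, so the ratio can be changed later without downloading the dataset again.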

Citation

If you find ALPRO useful for your research, please consider citing:

  @inproceedings{li2021align,
    title={Align and Prompt: Video-and-Language Pre-training with Entity Prompts},
    author={Li, Dongxu and Li, Junnan and Li, Hongdong and Niebles, Juan Carlos and Hoi, Steven C.H.},
    booktitle={arxiv},
    year={2021}
  }

Acknowledgement

We thank members at Salesforce Research for their helpful discussions.

The implementation of ALPRO relies on resources from ClipBERT, transformers, and TimeSformer. The code is implemented using PyTorch, with multi-GPU support from Horovod and gradient-checkpoint. We thank the original authors for open-sourcing their work and encourage ALPRO users to cite it when applicable.
