[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Overview

TubeDETR: Spatio-Temporal Video Grounding with Transformers

WebsiteSTVG DemoPaper

PWC PWC PWC

This repository provides the code for our paper. This includes:

  • Software setup, data downloading and preprocessing instructions for the VidSTG, HC-STVG1 and HC-STVG2.0 datasets
  • Training scripts and pretrained checkpoints
  • Evaluation scripts and demo

Setup

Download FFMPEG and add it to the PATH environment variable. The code was tested with version ffmpeg-4.2.2-amd64-static. Then create a conda environment and install the requirements with the following commands:

conda create -n tubedetr_env python=3.8
conda activate tubedetr_env
pip install -r requirements.txt

Data Downloading

Setup the paths where you are going to download videos and annotations in the config json files.

VidSTG: Download VidOR videos and annotations from the VidOR dataset providers. Then download the VidSTG annotations from the VidSTG dataset providers. The vidstg_vid_path folder should contain a folder video containing the unzipped video folders. The vidstg_ann_path folder should contain both VidOR and VidSTG annotations.

HC-STVG: Download HC-STVG1 and HC-STVG2.0 videos and annotations from the HC-STVG dataset providers. The hcstvg_vid_path folder should contain a folder video containing the unzipped video folders. The hcstvg_ann_path folder should contain both HC-STVG1 and HC-STVG2.0 annotations.

Data Preprocessing

To preprocess annotation files, run:

python preproc/preproc_vidstg.py
python preproc/preproc_hcstvg.py
python preproc/preproc_hcstvgv2.py

Training

Download pretrained RoBERTa tokenizer and model weights in the TRANSFORMERS_CACHE folder. Download pretrained ResNet-101 model weights in the TORCH_HOME folder. Download MDETR pretrained model weights with ResNet-101 backbone in the current folder.

VidSTG To train on VidSTG, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR

HC-STVG2.0 To train on HC-STVG2.0, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=OUTPUT_DIR

HC-STVG1 To train on HC-STVG1, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--dataset_config config/hcstvg.json --epochs=40 --eval_skip=40 --output-dir=OUTPUT_DIR

Baselines

  • To remove time encoding, add --no_time_embed.
  • To remove the temporal self-attention in the space-time decoder, add --no_tsa.
  • To train from ImageNet initialization, pass an empty string to the argument --load and add --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 --epochs=20 --lr_drop=20 for VidSTG or --epochs=60 --lr_drop=60 for HC-STVG1.
  • To train with a randomly initalized temporal self-attention, add --rd_init_tsa.
  • To train with a different spatial resolution (e.g. res=352) or temporal stride (e.g. k=4), add --resolution=224 or --stride=5.
  • To train with the slow-only variant, add --no_fast.
  • To train with alternative designs for the fast branch, add --fast=VARIANT.

Available Checkpoints

Training data parameters url size
MDETR init + VidSTG k=4 res=352 Drive 3.0GB
MDETR init + VidSTG k=2 res=224 Drive 3.0GB
ImageNet init + VidSTG k=4 res=352 Drive 3.0GB
MDETR init + HC-STVG2.0 k=4 res=352 Drive 3.0GB
MDETR init + HC-STVG2.0 k=2 res=224 Drive 3.0GB
MDETR init + HC-STVG1 k=4 res=352 Drive 3.0GB
ImageNet init + HC-STVG1 k=4 res=352 Drive 3.0GB

Evaluation

For evaluation only, simply run the same commands as for training with --resume=CHECKPOINT --eval. For this to be done on the test set, add --test (in this case predictions and attention weights are also saved).

Spatio-Temporal Video Grounding Demo

You can also use a pretrained model to infer a spatio-temporal tube on a video of your choice (VIDEO_PATH with potential START and END timestamps) given the natural language query of your choice (CAPTION) with the following command:

python demo_stvg.py --load=CHECKPOINT --caption_example CAPTION --video_example VIDEO_PATH --start_example=START --end_example=END --output-dir OUTPUT_PATH

Note that we also host an online demo at this link, the code of which is available at server_stvg.py and server_stvg.html.

Acknowledgements

This codebase is built on the MDETR codebase. The code for video spatial data augmentation is inspired by torch_videovision.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as followed:

@inproceedings{yang2022tubedetr,
title={TubeDETR: Spatio-Temporal Video Grounding with Transformers},
author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}}
Owner
Antoine Yang
PhD Student in Computer Vision at Inria Paris
Antoine Yang
Microscopy Image Cytometry Toolkit

Cytokit Cytokit is a collection of tools for quantifying and analyzing properties of individual cells in large fluorescent microscopy datasets with a

Hammer Lab 106 Jan 06, 2023
Deeper insights into graph convolutional networks for semi-supervised learning

deeper_insights_into_GCNs Deeper insights into graph convolutional networks for semi-supervised learning References data and utils.py come from Implem

Davidham3 17 Dec 16, 2022
DeOldify - A Deep Learning based project for colorizing and restoring old images (and video!)

DeOldify - A Deep Learning based project for colorizing and restoring old images (and video!)

Jason Antic 15.8k Jan 04, 2023
Memory Defense: More Robust Classificationvia a Memory-Masking Autoencoder

Memory Defense: More Robust Classificationvia a Memory-Masking Autoencoder Authors: - Eashan Adhikarla - Dan Luo - Dr. Brian D. Davison Abstract Many

Eashan Adhikarla 4 Dec 25, 2022
Repository of our paper 'Refer-it-in-RGBD' in CVPR 2021

Refer-it-in-RGBD This is the repository of our paper 'Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images' in CVPR 2021 Pape

Haolin Liu 34 Nov 07, 2022
This is the code for the paper "Motion-Focused Contrastive Learning of Video Representations" (ICCV'21).

Motion-Focused Contrastive Learning of Video Representations Introduction This is the code for the paper "Motion-Focused Contrastive Learning of Video

11 Sep 23, 2022
This is a official repository of SimViT.

SimViT This is a official repository of SimViT. We will open our models and codes about object detection and semantic segmentation soon. Our code refe

ligang 57 Dec 15, 2022
Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection

CP-Cluster Confidence Propagation Cluster aims to replace NMS-based methods as a better box fusion framework in 2D/3D Object detection, Instance Segme

Yichun Shen 41 Dec 08, 2022
RAMA: Rapid algorithm for multicut problem

RAMA: Rapid algorithm for multicut problem Solves multicut (correlation clustering) problems orders of magnitude faster than CPU based solvers without

Paul Swoboda 60 Dec 13, 2022
A toolset of Python programs for signal modeling and indentification via sparse semilinear autoregressors.

SPAAR Description A toolset of Python programs for signal modeling via sparse semilinear autoregressors. References Vides, F. (2021). Computing Semili

Fredy Vides 0 Oct 30, 2021
Using Streamlit to host a multi-page tool with model specs and classification metrics, while also accepting user input values for prediction.

Predicitng_viability Using Streamlit to host a multi-page tool with model specs and classification metrics, while also accepting user input values for

Gopalika Sharma 1 Nov 08, 2021
Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning By Zhenda Xie*, Yutong Lin*, Zheng Zhang, Yue Ca

Zhenda Xie 293 Dec 20, 2022
banditml is a lightweight contextual bandit & reinforcement learning library designed to be used in production Python services.

banditml is a lightweight contextual bandit & reinforcement learning library designed to be used in production Python services. This library is developed by Bandit ML and ex-authors of Facebook's app

Bandit ML 51 Dec 22, 2022
Fashion Landmark Estimation with HRNet

HRNet for Fashion Landmark Estimation (Modified from deep-high-resolution-net.pytorch) Introduction This code applies the HRNet (Deep High-Resolution

SVIP Lab 91 Dec 26, 2022
A Tensorflow implementation of CapsNet based on Geoffrey Hinton's paper Dynamic Routing Between Capsules

CapsNet-Tensorflow A Tensorflow implementation of CapsNet based on Geoffrey Hinton's paper Dynamic Routing Between Capsules Notes: The current version

Huadong Liao 3.8k Dec 29, 2022
Python3 / PyTorch implementation of the following paper: Fine-grained Semantics-aware Representation Enhancement for Self-supervisedMonocular Depth Estimation. ICCV 2021 (oral)

FSRE-Depth This is a Python3 / PyTorch implementation of FSRE-Depth, as described in the following paper: Fine-grained Semantics-aware Representation

77 Dec 28, 2022
LSTM Neural Networks for Spectroscopic Studies of Type Ia Supernovae

Package Description The difficulties in acquiring spectroscopic data have been a major challenge for supernova surveys. snlstm is developed to provide

7 Oct 11, 2022
Wider or Deeper: Revisiting the ResNet Model for Visual Recognition

ademxapp Visual applications by the University of Adelaide In designing our Model A, we did not over-optimize its structure for efficiency unless it w

Zifeng Wu 338 Dec 12, 2022
Pytorch implementation of SenFormer: Efficient Self-Ensemble Framework for Semantic Segmentation

SenFormer: Efficient Self-Ensemble Framework for Semantic Segmentation Efficient Self-Ensemble Framework for Semantic Segmentation by Walid Bousselham

61 Dec 26, 2022
NCVX (NonConVeX): A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning.

The source code is temporariy removed, as we are solving potential copyright and license issues with GRANSO (http://www.timmitchell.com/software/GRANS

SUN Group @ UMN 28 Aug 03, 2022