Efficient 3D Backbone Network for Temporal Modeling

Overview

VoV3D

VoV3D is an efficient and effective 3D backbone network for temporal modeling implemented on top of PySlowFast.

Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification
Youngwan Lee, Hyung-Il Kim, Kimin Yun, and Jinyoung Moon
Electronics and Telecommunications Research Institute (ETRI)
Preprint: https://arxiv.org/abs/2012.00317

Abstract

Recent video classification research has focused on two directions: temporal modeling and efficient 3D architectures. However, temporal modeling methods are not efficient, and efficient 3D architectures pay little attention to temporal modeling. To bridge the gap between them, we propose an efficient temporal modeling 3D architecture, called VoV3D, that consists of a temporal one-shot aggregation (T-OSA) module and a depthwise factorized component, D(2+1)D. T-OSA is devised to build a feature hierarchy by aggregating temporal features with different temporal receptive fields. Stacking T-OSA modules enables the network itself to model short-range as well as long-range temporal relationships across frames without any external modules. Inspired by kernel factorization and channel factorization, we also design a depthwise spatiotemporal factorization module, named D(2+1)D, that decomposes a 3D depthwise convolution into two depthwise convolutions, one spatial and one temporal, to make our network more lightweight and efficient. Using the proposed temporal modeling method (T-OSA) and the efficient factorized component (D(2+1)D), we construct two types of VoV3D networks, VoV3D-M and VoV3D-L. Thanks to its efficient and effective temporal modeling, VoV3D-L has 6x fewer model parameters and 16x less computation while surpassing a state-of-the-art temporal modeling method on both Something-Something and Kinetics-400. Furthermore, VoV3D shows better temporal modeling ability than X3D, a state-of-the-art efficient 3D architecture with comparable model capacity. We hope that VoV3D can serve as a baseline for efficient video classification.
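
To make the two ideas in the abstract more concrete, here is a rough back-of-the-envelope sketch; it is not the authors' implementation. It assumes a kernel size of 3 along every axis, bias-free convolutions, a spatial-then-temporal ordering, and three stacked temporal layers. These values and the helper names (`depthwise_3d_params`, `d2plus1d_params`, `stacked_temporal_receptive_fields`) are illustrative assumptions, not taken from the paper.

```python
def depthwise_3d_params(channels, k=3):
    """Weights in a k x k x k depthwise 3D convolution (one filter per channel, no bias)."""
    return channels * k * k * k


def d2plus1d_params(channels, k=3):
    """Weights when the depthwise 3D conv is split into a 1 x k x k spatial
    depthwise conv and a k x 1 x 1 temporal depthwise conv (no bias).
    The kernel shapes are assumptions made for this sketch."""
    spatial = channels * k * k
    temporal = channels * k
    return spatial + temporal


def stacked_temporal_receptive_fields(num_layers, k=3):
    """Temporal receptive field (in frames) after each of num_layers stacked
    stride-1 temporal convolutions with kernel size k."""
    return [1 + (i + 1) * (k - 1) for i in range(num_layers)]


channels = 64  # hypothetical channel width
print(depthwise_3d_params(channels))          # 64 * 27 = 1728
print(d2plus1d_params(channels))              # 64 * (9 + 3) = 768
print(stacked_temporal_receptive_fields(3))   # [3, 5, 7]
```

Under these assumptions, the factorized form needs fewer than half the depthwise weights, and aggregating the outputs of all three stacked layers gives access to features that span 3, 5, and 7 frames at once.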

Main Result

Our models (X3D and VoV3D) were all trained in the same environment:

  • a machine with 8 V100 GPUs
  • the same training protocol (BASE_LR, LR_POLICY, batch size, etc.)
  • PyTorch 1.6
  • CUDA 10.1

*Please refer to our paper or the config files for details.
*To reproduce our results, train the model with the provided configs on an 8-GPU machine. If you change NUM_GPUS or TRAIN.BATCH_SIZE, you have to adjust BASE_LR accordingly.
*IM and K-400 denote ImageNet and Kinetics-400, respectively.
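
The note on BASE_LR does not say how to adjust it. One widely used heuristic is the linear scaling rule, which scales the learning rate in proportion to the total batch size. The sketch below assumes that rule; `scale_base_lr` is a hypothetical helper, and the reference learning rate of 0.1 is only a placeholder, so read the real value from the config file you train with.

```python
def scale_base_lr(reference_lr, reference_batch_size, new_batch_size):
    """Linear scaling rule (an assumed heuristic, not stated in this README):
    keep the ratio of learning rate to total batch size constant."""
    if reference_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return reference_lr * new_batch_size / reference_batch_size


# The default setup in this README uses 8 GPUs with a total batch size of 64.
reference_lr = 0.1  # placeholder value; take BASE_LR from your config file
new_lr = scale_base_lr(reference_lr, reference_batch_size=64, new_batch_size=32)
print(f"BASE_LR for batch size 32: {new_lr}")
```

Treat the scaled value as a starting point; results obtained with a different batch size may still differ from the numbers reported here.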

Something-Something-V1

Model       Backbone  Pretrain  #Frame  Param.  GFLOPs    Top-1  Top-5  Weight
TSM         R-50      K-400     16      24.3M   33x6      48.3   78.1   link
TSM+TPN     R-50      IM        8       N/A     N/A       50.7   -      link
TEA         R-50      IM        16      24.4M   70x30     52.3   81.9   -
ip-CSN-152  -         -         32      29.7M   74.0x10   49.3   -      -
X3D         M         -         16      3.3M    6.1x6     46.4   75.3   link
VoV3D       M         -         16      3.3M    5.7x6     48.1   76.9   link
VoV3D       M         -         32      3.3M    11.5x6    49.8   78.0   link
VoV3D       M         K-400     32      3.3M    11.5x6    52.6   80.4   link
X3D         L         -         16      5.6M    9.1x6     47.0   76.4   link
VoV3D       L         -         16      5.8M    9.3x6     49.5   78.0   link
VoV3D       L         -         32      5.8M    20.9x6    50.6   78.7   link
VoV3D       L         K-400     32      5.8M    20.9x6    54.9   82.3   link

Something-Something-V2

Model          Backbone  Pretrain  #Frame  Param.  GFLOPs    Top-1  Top-5  Weight
TSM            R-50      K-400     16      24.3M   33x6      63.0   88.1   link
TSM+TPN        R-50      IM        8       N/A     N/A       64.7   -      link
TEA            R-50      IM        16      24.4M   70x30     65.1   89.9   -
SlowFast 16x8  R-50      K-400     64      34.0M   131.4x6   63.9   88.2   link
X3D            M         -         16      3.3M    6.1x6     63.0   87.9   link
VoV3D          M         -         16      3.3M    5.7x6     63.2   88.2   link
VoV3D          M         -         32      3.3M    11.5x6    64.2   88.8   link
VoV3D          M         K-400     32      3.3M    11.5x6    65.2   89.4   link
X3D            L         -         16      5.6M    9.1x6     62.7   87.7   link
VoV3D          L         -         16      5.8M    9.3x6     64.1   88.6   link
VoV3D          L         -         32      5.8M    20.9x6    65.8   89.5   link
VoV3D          L         K-400     32      5.8M    20.9x6    67.3   90.5   link

Kinetics-400

Model                   Backbone  Pretrain  #Frame  Param.  GFLOPs    Top-1  Top-5  Weight
X3D (PySlowFast, 300e)  M         -         16      3.8M    6.2x30    76.0   92.3   link
X3D (our, 256e)         M         -         16      3.8M    6.2x30    75.0   92.1   link
VoV3D                   M         -         16      3.8M    4.4x30    73.9   91.6   link
X3D (PySlowFast)        L         -         16      6.1M    24.8x30   77.5   92.9   link
VoV3D                   L         -         16      6.2M    9.3x30    76.3   92.9   link

*Because X3D-M (PySlowFast) was trained for 300 epochs, we re-trained X3D-M (our, 256e) for 256 epochs, the same schedule as VoV3D-M.

Installation & Data Preparation

Please refer to INSTALL.md for installation and DATA.md for data preparation.
Important: We used a depthwise 3D Conv PyTorch patch to accelerate GPU runtime.

Training & Evaluation

We provide brief examples for getting started. For more details, please refer to the PySlowFast instructions.

Training

From scratch

  • VoV3D-L on Kinetics-400
python tools/run_net.py \
  --cfg configs/Kinetics/vov3d/vov3d_L.yaml \
  DATA.PATH_TO_DATA_DIR path/to/your/kinetics \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 64

You can also set each argument in the config file. To train with our default setting (e.g., 8 GPUs and a batch size of 64), use the command below after setting DATA.PATH_TO_DATA_DIR in the config to your real data path.

python tools/run_net.py --cfg configs/Kinetics/vov3d/vov3d_L.yaml
  • VoV3D-L on Something-Something-V1
python tools/run_net.py \
  --cfg configs/SSv1/vov3d/vov3d_L_F16.yaml \
  DATA.PATH_TO_DATA_DIR path/to/your/ssv1 \
  DATA.PATH_PREFIX path/to/your/ssv1
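
The commands above pass overrides as alternating KEY VALUE pairs after --cfg, with dots in the key selecting a nested config section. The sketch below illustrates that idea on a plain nested dictionary. It is a simplified stand-in, not PySlowFast's actual config code; `apply_overrides` and the sample default values are hypothetical.

```python
import ast


def apply_overrides(config, overrides):
    """Apply a flat [KEY, VALUE, KEY, VALUE, ...] list to a nested dict in place.

    Dotted keys such as "TRAIN.BATCH_SIZE" address nested sections. Values that
    look like Python literals (8, False, 0.1) are converted; anything else,
    such as a file path, is kept as a string.
    """
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must be KEY VALUE pairs")
    for key, raw_value in zip(overrides[0::2], overrides[1::2]):
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        try:
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value  # not a literal, e.g. a path
        node[leaf] = value
    return config


# Hypothetical defaults standing in for the values loaded from a YAML config.
config = {
    "NUM_GPUS": 1,
    "TRAIN": {"BATCH_SIZE": 8},
    "DATA": {"PATH_TO_DATA_DIR": ""},
}
apply_overrides(config, [
    "DATA.PATH_TO_DATA_DIR", "path/to/your/kinetics",
    "NUM_GPUS", "8",
    "TRAIN.BATCH_SIZE", "64",
])
print(config)
```

Command-line overrides take effect after the YAML file is loaded, which is why the same run can be configured either by editing the config file or by appending pairs to the command.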

Fine-tuning with Kinetics-400 pretrained weights

First, download the weights pretrained on Kinetics-400.

Then set TRAIN.CHECKPOINT_FILE_PATH to the path of the downloaded weights.

For example, for Something-Something-V2:

cd VoV3D
mkdir -p output/pretrained
# -P saves the file into output/pretrained instead of the current directory
wget -P output/pretrained https://dl.dropbox.com/s/lzmq8d4dqyj8fj6/vov3d_L_k400.pth

python tools/run_net.py \
  --cfg configs/SSv2/vov3d/finetune/vov3d_L_F16.yaml \
  TRAIN.CHECKPOINT_FILE_PATH path/to/the/pretrained/vov3d_L_k400.pth \
  DATA.PATH_TO_DATA_DIR path/to/your/ssv2 \
  DATA.PATH_PREFIX path/to/your/ssv2

Testing

When testing, you have to set TRAIN.ENABLE to False and TEST.CHECKPOINT_FILE_PATH to path/to/your/checkpoint.

python tools/run_net.py \
  --cfg configs/Kinetics/vov3d/vov3d_L.yaml \
  TRAIN.ENABLE False \
  TEST.CHECKPOINT_FILE_PATH path/to/your/checkpoint

To test with a single clip and a single crop, set both TEST.NUM_ENSEMBLE_VIEWS and TEST.NUM_SPATIAL_CROPS to 1.

python tools/run_net.py \
  --cfg configs/Kinetics/vov3d/vov3d_L.yaml \
  TRAIN.ENABLE False \
  TEST.CHECKPOINT_FILE_PATH path/to/your/checkpoint \
  TEST.NUM_ENSEMBLE_VIEWS 1 \
  TEST.NUM_SPATIAL_CROPS 1

For Kinetics-400 (30 views): TEST.NUM_ENSEMBLE_VIEWS 10 and TEST.NUM_SPATIAL_CROPS 3
For Something-Something (6 views): TEST.NUM_ENSEMBLE_VIEWS 2 and TEST.NUM_SPATIAL_CROPS 3

License

The code and the models in this repo are released under the CC BY-NC 4.0 license. See the LICENSE file.

Citing VoV3D

@article{lee2020vov3d,
  title={Diverse Temporal Aggregation and Depthwise Spatiotemporal Factorization for Efficient Video Classification},
  author={Lee, Youngwan and Kim, Hyung-Il and Yun, Kimin and Moon, Jinyoung},
  journal={arXiv preprint arXiv:2012.00317},
  year={2020}
}

@inproceedings{lee2019energy,
  title = {An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection},
  author = {Lee, Youngwan and Hwang, Joong-won and Lee, Sangrok and Bae, Yuseok and Park, Jongyoul},
  booktitle = {CVPR Workshop},
  year = {2019}
}

@inproceedings{lee2020centermask,
  title={CenterMask: Real-Time Anchor-Free Instance Segmentation},
  author={Lee, Youngwan and Park, Jongyoul},
  booktitle={CVPR},
  year={2020}
}

Acknowledgement

We thank the developers of PySlowFast for such a wonderful framework.
This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. B0101-15-0266, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis and No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network).
