PyTorch implementation of our paper: Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition

Overview

Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition, arxiv

This is a PyTorch implementation of our paper.

1. Requirements

torch>=1.7.0; torchvision>=0.8.0; Visdom(optional)

data prepare: Database with the following folder structure:

│NTURGBD/
├──dataset_splits/
│  ├── @CS
│  │   ├── train.txt
                video name               total frames    label
│  │   │    ├──S001C001P001R001A001_rgb      103          0 
│  │   │    ├──S001C001P001R001A004_rgb      99           3 
│  │   │    ├──...... 
│  │   ├── valid.txt
│  ├── @CV
│  │   ├── train.txt
│  │   ├── valid.txt
├──Images/
│  │   ├── S001C002P001R001A002_rgb
│  │   │   ├──000000.jpg
│  │   │   ├──000001.jpg
│  │   │   ├──......
├──nturgb+d_depth_masked/
│  │   ├── S001C002P001R001A002
│  │   │   ├──MDepth-00000000.png
│  │   │   ├──MDepth-00000001.png
│  │   │   ├──......

It is important to note that due to the RGB video resolution in the NTU dataset is relatively high, so we are not directly to resize the image from the original resolution to 320x240, but first crop the object-centered ROI area (640x480), and then resize it to 320x240 for training and testing.

2. Methodology

We propose to decouple and recouple spatiotemporal representation for RGB-D-based motion recognition. The Figure in the first line illustrates the proposed multi-modal spatiotemporal representation learning framework. The RGB-D-based motion recognition can be described as spatiotemporal information decoupling modeling, compact representation recoupling learning, and cross-modal representation interactive learning. The Figure in the second line shows the process of decoupling and recoupling saptiotemporal representation of a unimodal data.

3. Train and Evaluate

All of our models are pre-trained on the 20BN Jester V1 dataset and the pretrained model can be download here. Before cross-modal representation interactive learning, we first separately perform unimodal representation learning on RGB and depth data modalities.

Unimodal Training

Take training an RGB model with 8 GPUs on the NTU-RGBD dataset as an example, some basic configuration:

common:
  dataset: NTU 
  batch_size: 6
  test_batch_size: 6
  num_workers: 6
  learning_rate: 0.01
  learning_rate_min: 0.00001
  momentum: 0.9
  weight_decay: 0.0003
  init_epochs: 0
  epochs: 100
  optim: SGD
  scheduler:
    name: cosin                     # Represent decayed learning rate with the cosine schedule
    warm_up_epochs: 3 
  loss:
    name: CE                        # cross entropy loss function
    labelsmooth: True
  MultiLoss: True                   # Enable multi-loss training strategy.
  loss_lamdb: [ 1, 0.5, 0.5, 0.5 ]  # The loss weight coefficient assigned for each sub-branch.
  distill: 1.                       # The loss weight coefficient assigned for distillation task.

model:
  Network: I3DWTrans                # I3DWTrans represent unimodal training, set FusionNet for multi-modal fusion training.
  sample_duration: 64               # Sampled frames in a video.
  sample_size: 224                  # The image is croped into 224x224.
  grad_clip: 5.
  SYNC_BN: 1                        # Utilize SyncBatchNorm.
  w: 10                             # Sliding window size.
  temper: 0.5                       # Distillation temperature setting.
  recoupling: True                  # Enable recoupling strategy during training.
  knn_attention: 0.7                # Hyperparameter used in k-NN attention: selecting Top-70% tokens.
  sharpness: True                   # Enable sharpness for each sub-branch's output.
  temp: [ 0.04, 0.07 ]              # Temperature parameter follows a cosine schedule from 0.04 to 0.07 during the training.
  frp: True                         # Enable FRP module.
  SEHeads: 1                        # Number of heads used in RCM module.
  N: 6                              # Number of Transformer blochs configured for each sub-branch.

dataset:
  type: M                           # M: RGB modality, K: Depth modality.
  flip: 0.5                         # Horizontal flip.
  rotated: 0.5                      # Horizontal rotation
  angle: (-10, 10)                  # Rotation angle
  Blur: False                       # Enable random blur operation for each video frame.
  resize: (320, 240)                # The input is spatially resized to 320x240 for NTU dataset.
  crop_size: 224                
  low_frames: 16                    # Number of frames sampled for small Transformer.       
  media_frames: 32                  # Number of frames sampled for medium Transformer.  
  high_frames: 48                   # Number of frames sampled for large Transformer.
bash run.sh tools/train.py config/NTU.yml 0,1,2,3,4,5,6,7 8

or

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 train.py --config config/NTU.yml --nprocs 8  

Cross-modal Representation Interactive Learning

Take training a fusion model with 8 GPUs on the NTU-RGBD dataset as an example.

bash run.sh tools/fusion.py config/NTU.yml 0,1,2,3,4,5,6,7 8

or

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port=1234 fusion.py --config config/NTU.yml --nprocs 8  

Evaluation

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 train.py --config config/NTU.yml --nprocs 1 --eval_only --resume /path/to/model_best.pth.tar 

4. Models Download

Dataset Modality Accuracy Download
NvGesture RGB 89.58 Google Drive
NvGesture Depth 90.62 Google Drive
NvGesture RGB-D 91.70 Google Drive
THU-READ RGB 81.25 Google Drive
THU-READ Depth 77.92 Google Drive
THU-READ RGB-D 87.04 Google Drive
NTU-RGBD(CS) RGB 90.3 Google Drive
NTU-RGBD(CS) Depth 92.7 Google Drive
NTU-RGBD(CS) RGB-D 94.2 Google Drive
NTU-RGBD(CV) RGB 95.4 Google Drive
NTU-RGBD(CV) Depth 96.2 Google Drive
NTU-RGBD(CV) RGB-D 97.3 Google Drive
IsoGD RGB 60.87 Google Drive
IsoGD Depth 60.17 Google Drive
IsoGD RGB-D 66.79 Google Drive

Citation

@inproceedings{zhou2021DRSR,
      title={Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition}, 
      author={Benjia Zhou and Pichao Wang and Jun Wan and Yanyan Liang and Fan Wang and Du Zhang and Zhen Lei and Hao Li and Rong Jin},
      journal={arXiv preprint arXiv:2112.09129},
      year={2021},
}

LICENSE

The code is released under the MIT license.

Copyright

Copyright (C) 2010-2021 Alibaba Group Holding Limited.

Owner
DamoCV
CV team of DAMO academy
DamoCV
"Neural Turing Machine" in Tensorflow

Neural Turing Machine in Tensorflow Tensorflow implementation of Neural Turing Machine. This implementation uses an LSTM controller. NTM models with m

Taehoon Kim 1k Dec 06, 2022
PIXIE: Collaborative Regression of Expressive Bodies

PIXIE: Collaborative Regression of Expressive Bodies [Project Page] This is the official Pytorch implementation of PIXIE. PIXIE reconstructs an expres

Yao Feng 331 Jan 04, 2023
A framework for using LSTMs to detect anomalies in multivariate time series data. Includes spacecraft anomaly data and experiments from the Mars Science Laboratory and SMAP missions.

Telemanom (v2.0) v2.0 updates: Vectorized operations via numpy Object-oriented restructure, improved organization Merge branches into single branch fo

Kyle Hundman 844 Dec 28, 2022
Unsupervised Foreground Extraction via Deep Region Competition

Unsupervised Foreground Extraction via Deep Region Competition [Paper] [Code] The official code repository for NeurIPS 2021 paper "Unsupervised Foregr

28 Nov 06, 2022
Shuwa Gesture Toolkit is a framework that detects and classifies arbitrary gestures in short videos

Shuwa Gesture Toolkit is a framework that detects and classifies arbitrary gestures in short videos

Google 89 Dec 22, 2022
Callable PyTrees and filtered JIT/grad transformations => neural networks in JAX.

Equinox Callable PyTrees and filtered JIT/grad transformations = neural networks in JAX Equinox brings more power to your model building in JAX. Repr

Patrick Kidger 909 Dec 30, 2022
Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Super Resolution Examples We run this script under TensorFlow 2.0 and the TensorLayer2.0+. For TensorLayer 1.4 version, please check release. 🚀 🚀 🚀

TensorLayer Community 2.9k Jan 08, 2023
Re-TACRED: Addressing Shortcomings of the TACRED Dataset

Re-TACRED Re-TACRED: Addressing Shortcomings of the TACRED Dataset

George Stoica 40 Dec 10, 2022
Scales, Chords, and Cadences: Practical Music Theory for MIR Researchers

ISMIR-musicTheoryTutorial This repository has slides and Jupyter notebooks for the ISMIR 2021 tutorial Scales, Chords, and Cadences: Practical Music T

Johanna Devaney 58 Oct 11, 2022
Official PyTorch implementation of U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation

U-GAT-IT — Official PyTorch Implementation : Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Imag

Hyeonwoo Kang 2.4k Jan 04, 2023
PyTorch Implementation of our paper Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation

PyTorch Implementation of our paper Explain Me the Painting: Multi-Topic Knowledgeable Art Description Generation

Zechen Bai 12 Jul 08, 2022
Example Of Fine-Tuning BERT For Named-Entity Recognition Task And Preparing For Cloud Deployment Using Flask, React, And Docker

Example Of Fine-Tuning BERT For Named-Entity Recognition Task And Preparing For Cloud Deployment Using Flask, React, And Docker This repository contai

Nikita 12 Dec 14, 2022
Pytorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Parallel Tacotron2 Pytorch Implementation of Google's Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

Keon Lee 170 Dec 27, 2022
Public Code for NIPS submission SimiGrad: Fine-Grained Adaptive Batching for Large ScaleTraining using Gradient Similarity Measurement

Public code for NIPS submission "SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement" This repo co

Heyang Qin 0 Oct 13, 2021
Repo for EMNLP 2021 paper "Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression"

beyond-preserved-accuracy Repo for EMNLP 2021 paper "Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression" How to implemen

Kevin Canwen Xu 10 Dec 23, 2022
A tutorial showing how to train, convert, and run TensorFlow Lite object detection models on Android devices, the Raspberry Pi, and more!

A tutorial showing how to train, convert, and run TensorFlow Lite object detection models on Android devices, the Raspberry Pi, and more!

Evan 1.3k Jan 02, 2023
Deep learning for spiking neural networks

A deep learning library for spiking neural networks. Norse aims to exploit the advantages of bio-inspired neural components, which are sparse and even

Electronic Vision(s) Group — BrainScaleS Neuromorphic Hardware 59 Nov 28, 2022
Industrial Image Anomaly Localization Based on Gaussian Clustering of Pre-trained Feature

Industrial Image Anomaly Localization Based on Gaussian Clustering of Pre-trained Feature Q. Wan, L. Gao, X. Li and L. Wen, "Industrial Image Anomaly

smiler 6 Dec 25, 2022
This is an official PyTorch implementation of Task-Adaptive Neural Network Search with Meta-Contrastive Learning (NeurIPS 2021, Spotlight).

NeurIPS 2021 (Spotlight): Task-Adaptive Neural Network Search with Meta-Contrastive Learning This is an official PyTorch implementation of Task-Adapti

Wonyong Jeong 15 Nov 21, 2022
Reinforcement Learning for finance

Reinforcement Learning for Finance We apply reinforcement learning for stock trading. Fetch Data Example import utils # fetch symbols from yahoo fina

Tomoaki Fujii 159 Jan 03, 2023