PyTorch Implementation of STANet+ and STANet


V2: Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception (arXiv), pdf: V2

V1: From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach (CVPR 2021), pdf: V1


Introduction

  • This repository contains the source code, results, and evaluation toolbox of STANet+ (V2), the journal extension of our paper STANet (V1) published at CVPR 2021.
  • Compared with our conference version STANet (V1), STANet+ (V2) has been extended in two distinct aspects.
    First, building on the multisource and multiscale perspectives adopted by the CVPR version (V1), we provide deep insight into the relationship between multigranularity perception (Fig. 2) and real human attention behavior in visual-auditory environments.
    Second, without using any complex networks, we provide an elegant framework that complementarily integrates multisource, multiscale, and multigranular information (Fig. 1) to formulate pseudofixations that are highly consistent with real ones. Apart from achieving a significant performance gain, this work also provides a comprehensive solution for mimicking multimodal attention.

Figure 1: STANet+ mainly focuses on devising a weakly supervised approach for the spatial-temporal-audio (STA) fixation prediction task. Its key innovation is that, as one of the first attempts, it automatically converts semantic category tags into pseudofixations via the newly proposed selective class activation mapping (SCAM) and its upgraded version SCAM+, which is additionally equipped with multigranularity perception. The obtained pseudofixations serve as the learning objective for knowledge distillation, teaching two individual fixation prediction networks (i.e., STA and STA+) that jointly enable generic video fixation prediction without requiring any video tags.

Figure 2: Some representative 'fixation shifting' cases, where additional multigranularity information (i.e., long-/cross-term information) was shown before collecting fixations in A_SRC. Comparing A_FIX0, A_FIX1, and A_FIX2 shows that the multigranularity information can draw human attention to the most meaningful objects and make the fixations more focused.

Dependencies

  • Windows 10
  • NVIDIA GeForce RTX 2070 SUPER & NVIDIA GeForce GTX 1080 Ti
  • Python 3.6.4
  • MATLAB R2016b
  • PyTorch 1.8.0
  • soundmodel

Preparation

Download the official pretrained visual and audio models:

Visual: resnext101_32x8d, vgg16
Audio: vggsound, loaded with net = torch.load('vggsound_netvlad').

Download the training and testing datasets:

Training dataset: AVE (Audio-Visual Event Localization).
Testing datasets: AVAD, DIEM, SumMe, ETMD, Coutrot.

Training

Note
We use the short-time Fourier transform (STFT) to convert audio into the audio-stream input. Therefore, you first need to run audiostft.py to convert the audio files (.wav) into audio features (.h5).
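As a rough illustration of this preprocessing, the sketch below computes a log-magnitude STFT with NumPy. The window length (n_fft=512), hop size (256), Hann window, log1p compression, and 16 kHz sample rate are assumptions for illustration only; the actual parameters and output layout of audiostft.py may differ.

```python
import numpy as np


def audio_stft(signal, n_fft=512, hop=256):
    """Log-magnitude STFT of a mono waveform (hypothetical default parameters)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.ndim != 1:
        raise ValueError("expected a mono 1-D waveform")
    if len(signal) < n_fft:
        # Zero-pad very short clips so that at least one frame exists.
        signal = np.pad(signal, (0, n_fft - len(signal)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice overlapping frames and apply the Hann window to each one.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)  # shape: (n_frames, n_fft // 2 + 1)
    # log1p compresses the large dynamic range of the magnitudes.
    return np.log1p(np.abs(spectrum)).astype(np.float32)


# Usage: a 1 s, 1 kHz tone sampled at 16 kHz (assumed sample rate).
sr = 16000
t = np.arange(sr) / sr
spec = audio_stft(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)
```

A .wav file could be read with scipy.io.wavfile.read, and the resulting array written to an .h5 file with h5py (for example via File.create_dataset); the dataset name that audiostft.py actually uses is not specified here.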

Step 1. SCAM training

Coarse: Separately train the Scoarse, SAcoarse, and STcoarse branches. Note that the coarse stage performs coarse localization, so the size is set to 256 to ensure object-wise localization accuracy.
Fine: Separately re-train the Sfine, SAfine, and STfine branches. Note that the fine stage performs fine localization, so the size is set to 356 to ensure regional localization accuracy.
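SCAM builds on class activation mapping (CAM). The selective mechanism of SCAM is not described in this README, so the sketch below shows only the standard CAM of Zhou et al. (2016): the last convolutional feature maps are weighted by the classifier weights of the target class, rectified, normalized, and upsampled to the stage size (256 for coarse, 356 for fine). The function name, toy inputs, and nearest-neighbour upsampling are illustrative choices, not the repository's code.

```python
import numpy as np

# Sizes stated for the two SCAM stages in this README.
STAGE_SIZES = {"coarse": 256, "fine": 356}


def class_activation_map(features, fc_weights, class_idx, out_size):
    """Plain CAM: weight the last conv feature maps by one class's classifier weights.

    features:   (K, h, w) feature maps from the backbone's last conv layer
    fc_weights: (num_classes, K) weights of the final linear classifier
    """
    # Weighted sum over the K channels -> (h, w) activation map.
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam = np.maximum(cam, 0.0)  # keep only positive evidence for the class
    if cam.max() > 0:
        cam = cam / cam.max()  # scale to [0, 1]
    # Nearest-neighbour upsampling to the stage's resolution.
    h, w = cam.shape
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return cam[np.ix_(rows, cols)]


# Usage with toy features: channel 0 fires at cell (2, 3) of an 8x8 grid.
feats = np.zeros((2, 8, 8))
feats[0, 2, 3] = 1.0
weights = np.eye(2)
coarse_cam = class_activation_map(feats, weights, 0, STAGE_SIZES["coarse"])
```

In a real pipeline the features and weights would come from a trained classification backbone rather than toy arrays.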

Step 2. SCAM+ training

S+: Separately train the S+short, S+long, and S+cross branches. Because S+ is a frame-wise relational reasoning network, the network is the same for all three branches; only the source of the input data changes.
SA+: Separately train the SA+long and SA+cross branches.
ST+: Separately train the ST+short, ST+long, and ST+cross branches.
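The branch layout of Step 2 can be written out as a small training plan. The dictionary and the training_jobs helper below are hypothetical and only enumerate the branches listed above; each yielded job would correspond to one separately trained branch.

```python
# Branches listed in Step 2 of SCAM+ training: each stream maps to the
# granularities whose branches are trained separately for it.
SCAM_PLUS_BRANCHES = {
    "S+": ("short", "long", "cross"),
    "SA+": ("long", "cross"),
    "ST+": ("short", "long", "cross"),
}


def training_jobs(branches=SCAM_PLUS_BRANCHES):
    """Yield one (stream, granularity) pair per separately trained branch.

    Hypothetical helper: for S+, the README states the network is shared across
    branches, so a job only needs to switch the input-data source.
    """
    for stream, granularities in branches.items():
        for granularity in granularities:
            yield stream, granularity


for stream, granularity in training_jobs():
    print(f"train {stream}{granularity}")
```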

Step 3. PseudoGT generation

For convenient matrix processing and visualization, MATLAB R2016b is used for the inter-frame smoothing of the coarse localization results and for the post-processing of the pseudoGT data.
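For readers without MATLAB, inter-frame smoothing can be approximated in NumPy. The sketch below assumes a simple moving average over neighbouring frames followed by per-frame min-max normalization; the window radius and both helper names are hypothetical, and the actual MATLAB post-processing may use a different filter.

```python
import numpy as np


def smooth_over_frames(maps, radius=1):
    """Average each frame's map with its temporal neighbours (window = 2 * radius + 1)."""
    maps = np.asarray(maps, dtype=np.float64)  # shape: (T, H, W)
    T = maps.shape[0]
    out = np.empty_like(maps)
    for t in range(T):
        # Clip the window at the start and end of the video.
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        out[t] = maps[lo:hi].mean(axis=0)
    return out


def normalize_per_frame(maps, eps=1e-8):
    """Min-max normalize every frame independently to [0, 1]."""
    mins = maps.min(axis=(1, 2), keepdims=True)
    maxs = maps.max(axis=(1, 2), keepdims=True)
    return (maps - mins) / (maxs - mins + eps)


# Usage: a single bright frame in the middle spreads to its neighbours.
maps = np.zeros((3, 2, 2))
maps[1] = 3.0
smoothed = smooth_over_frames(maps)
```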

Step 4. STA and STA+ training

Train the STA and STA+ models on the AVE video frames, using the generated pseudoGT as the learning target.
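The README does not state which loss is used to fit STA and STA+ to the pseudoGT. A common choice for saliency prediction is the KL divergence between the ground-truth and predicted maps, each normalized to sum to one; the NumPy sketch below shows that formulation as an assumption, not as the authors' loss.

```python
import numpy as np


def kl_saliency_loss(pred, target, eps=1e-7):
    """KL(target || pred) after normalizing both maps into probability distributions."""
    p = np.asarray(pred, dtype=np.float64)
    q = np.asarray(target, dtype=np.float64)
    p = p / (p.sum() + eps)  # predicted saliency distribution
    q = q / (q.sum() + eps)  # pseudoGT distribution
    # eps keeps the log finite where q is zero or p is zero.
    return float(np.sum(q * np.log(eps + q / (p + eps))))


# Usage: a uniform prediction against a single-peak target.
loss = kl_saliency_loss(np.ones((2, 2)), np.array([[1.0, 0.0], [0.0, 0.0]]))
```

In a PyTorch training loop the same expression would be written with tensor operations so that gradients flow back to the network; torch.nn.functional.kl_div is an alternative, but it expects log-probabilities as its first argument.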

Testing

Step 1. Use audiostft.py to convert the audio files (.wav) into audio features (.h5).
Step 2. Test the STA and STA+ networks, then fuse their results to generate the final saliency maps (STANet+).
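The fusion rule used to combine the STA and STA+ outputs is not specified here. One simple option, shown below as an assumption, is a weighted average of the min-max-normalized maps; fuse_saliency and its equal default weight are hypothetical.

```python
import numpy as np


def fuse_saliency(sta_map, sta_plus_map, weight=0.5, eps=1e-8):
    """Fuse two saliency maps by a weighted average of their min-max normalized versions."""

    def norm(m):
        m = np.asarray(m, dtype=np.float64)
        return (m - m.min()) / (m.max() - m.min() + eps)

    fused = weight * norm(sta_map) + (1.0 - weight) * norm(sta_plus_map)
    return norm(fused)  # rescale the fused map back to [0, 1]


# Usage: each network highlights a different pixel; the fused map keeps both.
a = np.array([[0.0, 1.0], [0.0, 0.0]])
b = np.array([[0.0, 0.0], [0.0, 2.0]])
fused = fuse_saliency(a, b)
```

Normalizing before averaging keeps one network's raw output scale from dominating the other's.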

Model weights for STANet+, STANet, and AudioSwitch:
(Baidu Netdisk, code: 6afo).

Evaluation

We use the evaluation code from the STAViS paper for a fair comparison.
You may need to revise algorithms, data_root, and maps_root defined in main.m.
We also provide the saliency maps of the following state-of-the-art methods:

(STANet+, STANet, ITTI, GBVS, SCLI, AWS-D, SBF, CAM, GradCAM, GradCAMpp, SGradCAMpp, xGradCAM, SSCAM, ScoCAM, LCAM, ISCAM, ACAM, EGradCAM, ECAM, SPG, VUNP, WSS, MWS, WSSA).
(Baidu Netdisk, code: 6afo).
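The STAViS evaluation code is written in MATLAB. As a quick Python cross-check, the sketch below implements two standard fixation-prediction metrics, the linear correlation coefficient (CC) and normalized scanpath saliency (NSS). These are textbook definitions rather than a port of main.m, so small numerical differences from the toolbox (for example in how the standard deviation is computed) are possible.

```python
import numpy as np


def cc(sal_map, gt_density):
    """Linear correlation coefficient between a saliency map and a fixation density map."""
    s = np.asarray(sal_map, dtype=np.float64).ravel()
    g = np.asarray(gt_density, dtype=np.float64).ravel()
    s = (s - s.mean()) / (s.std() + 1e-12)
    g = (g - g.mean()) / (g.std() + 1e-12)
    return float(np.mean(s * g))


def nss(sal_map, fixations):
    """Normalized scanpath saliency: mean standardized saliency at fixated pixels."""
    s = np.asarray(sal_map, dtype=np.float64)
    s = (s - s.mean()) / (s.std() + 1e-12)
    return float(s[np.asarray(fixations, dtype=bool)].mean())


# Usage: a ramp-shaped saliency map and one fixation at its brightest pixel.
sal = np.arange(16.0).reshape(4, 4)
fix = np.zeros((4, 4), dtype=bool)
fix[3, 3] = True
print(cc(sal, sal), nss(sal, fix))
```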

Qualitative comparisons:

Qualitative results of our method and eight representative saliency models: ITTI, GBVS, SCLI, SBF, AWS-D, WSS, MWS, WSSA. Our method handles various challenging scenes well and produces more accurate results than the competing methods.

Quantitative comparisons:

Quantitative comparisons between our method and other fully-/weakly-/un-supervised methods on 6 datasets. Bold indicates the best result; ↑ indicates that a higher score means better performance.

References

[1] Tsiami, A., Koutras, P., Maragos, P. STAViS: Spatio-Temporal AudioVisual Saliency Network. CVPR 2020. https://openaccess.thecvf.com/content_CVPR_2020/papers/Tsiami_STAViS_Spatio-Temporal_AudioVisual_Saliency_Network_CVPR_2020_paper.pdf
[2] Tian, Y., Shi, J., Li, B., Duan, Z., Xu, C. Audio-Visual Event Localization in Unconstrained Videos. ECCV 2018. https://openaccess.thecvf.com/content_ECCV_2018/papers/Yapeng_Tian_Audio-Visual_Event_Localization_ECCV_2018_paper.pdf
[3] Chen, H., Xie, W., Vedaldi, A., Zisserman, A. VGGSound: A Large-Scale Audio-Visual Dataset. ICASSP 2020. https://www.robots.ox.ac.uk/~vgg/publications/2020/Chen20/chen20.pdf

Citation

If you find this work useful for your research, please consider citing the following paper:

@InProceedings{Wang_2021_CVPR,  
    author    = {Wang, Guotao and Chen, Chenglizhao and Fan, Deng-Ping and Hao, Aimin and Qin, Hong},
    title     = {From Semantic Categories to Fixations: A Novel Weakly-Supervised Visual-Auditory Saliency Detection Approach},  
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},  
    month     = {June},  
    year      = {2021},  
    pages     = {15119-15128}  
}  


@misc{wang2021weakly,
    title={Weakly Supervised Visual-Auditory Saliency Detection with Multigranularity Perception}, 
    author={Guotao Wang and Chenglizhao Chen and Dengping Fan and Aimin Hao and Hong Qin},
    year={2021},
    eprint={2112.13697},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
Owner
GuotaoWang