The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"

Overview

Hierarchical Token Semantic Audio Transformer

Introduction

The Code Repository for "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection", in ICASSP 2022.

In this paper, we devise a model, HTS-AT, by combining a Swin Transformer with a token-semantic module and adapting it to audio classification and sound event detection tasks. HTS-AT is an efficient, lightweight audio transformer with a hierarchical structure and only 30 million parameters. It achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and matches the SOTA on Speech Command V2. It also achieves better event localization performance than previous CNN-based models.

HTS-AT Architecture

Classification Results on AudioSet, ESC-50, and Speech Command V2 (mAP)

HTS-AT Classification Result

Localization/Detection Results on DESED dataset (F1-Score)

HTS-AT Localization Result

Getting Started

Install Requirements

pip install -r requirements.txt

Download and Processing Datasets

  • config.py
change the variable "dataset_path" to your AudioSet path
change the variable "desed_folder" to your DESED path
change the variable "classes_num" to 527
./create_index.sh
// remember to change the paths in the script
// more information about this script is in https://github.com/qiuqiangkong/audioset_tagging_cnn

python main.py save_idc 
// count the number of samples in each class and save the npy files
Open the Jupyter notebook at esc-50/prep_esc50.ipynb and run it
Open the Jupyter notebook at scv2/prep_scv2.ipynb and run it
python conver_desed.py 
// will produce the npy data files
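
The save_idc step above counts the samples in each class. As a rough illustration of that kind of per-class index (a sketch, not the repository's actual implementation), the snippet below assumes a hypothetical multi-hot target matrix of shape (num_samples, classes_num) and collects, for each class, the indices of the samples that carry its label. The function name build_class_index and the toy data are made up for this example.

```python
import numpy as np


def build_class_index(targets):
    """Return (per-class sample indices, per-class counts) for a multi-hot matrix.

    `targets` is assumed to have shape (num_samples, classes_num), with a
    positive entry wherever a sample carries a given label.
    """
    targets = np.asarray(targets)
    classes_num = targets.shape[1]
    # For each class, list the rows (samples) in which that class is active.
    indices = [np.flatnonzero(targets[:, c] > 0) for c in range(classes_num)]
    counts = np.array([len(idx) for idx in indices])
    return indices, counts


# Toy data (hypothetical): 4 samples, 3 classes.
toy_targets = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
])
class_indices, class_counts = build_class_index(toy_targets)
print(class_counts)
```

An index like this is typically what a balanced sampler draws from, which is presumably why it is computed once and saved to npy files rather than rebuilt on every run.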

Set the Configuration File: config.py

The script config.py contains all the settings you need to assign to run the code. Please read the introductory comments in the file and change the settings accordingly. The most important ones are listed below. If you want to train/test your model on AudioSet, you need to set:

dataset_path = "your processed audioset folder"
dataset_type = "audioset"
balanced_data = True
loss_type = "clip_bce"
sample_rate = 32000
hop_size = 320 
classes_num = 527

If you want to train/test your model on ESC-50, you need to set:

dataset_path = "your processed ESC-50 folder"
dataset_type = "esc-50"
loss_type = "clip_ce"
sample_rate = 32000
hop_size = 320 
classes_num = 50

If you want to train/test your model on Speech Command V2, you need to set:

dataset_path = "your processed SCV2 folder"
dataset_type = "scv2"
loss_type = "clip_bce"
sample_rate = 16000
hop_size = 160
classes_num = 35
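
One pattern in the three presets above is that hop_size is always 1/100 of sample_rate. Assuming hop_size is the spectrogram hop length in samples, every dataset is therefore framed at the same rate of 100 frames per second (a 10 ms hop). The short check below only restates the values listed above; the helper name frames_per_second is illustrative.

```python
def frames_per_second(sample_rate, hop_size):
    """Number of spectrogram frames produced per second of audio."""
    return sample_rate / hop_size


# (sample_rate, hop_size) pairs copied from the presets above.
presets = {
    "audioset": (32000, 320),
    "esc-50": (32000, 320),
    "scv2": (16000, 160),
}

frame_rates = {name: frames_per_second(sr, hop) for name, (sr, hop) in presets.items()}
for name, rate in frame_rates.items():
    print(f"{name}: {rate:.0f} frames/s")
```

If you change sample_rate for a new dataset, scaling hop_size by the same factor keeps the time resolution unchanged.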

If you want to test your model on DESED, you need to set:

resume_checkpoint = "Your checkpoint on AudioSet"
heatmap_dir = "localization results output folder"
test_file = "output heatmap name"
fl_local = True
fl_dataset = "Your DESED npy file"

Train and Evaluation

Notice: Our model runs in DDP mode and requires at least two GPU cards. If you want to use a single GPU for training and evaluation, you need to manually change sed_model.py and main.py
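
Since at least two GPUs are required, it can help to check how many devices CUDA_VISIBLE_DEVICES exposes before launching. The sketch below is a standalone helper, not part of the repository. It only parses the environment variable and does not query the hardware, and the names visible_gpu_ids and enough_for_ddp are made up for this example.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device ids.

    Returns None when the variable is unset (all devices are visible in that
    case), and an empty list when it is set to an empty string.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [d.strip() for d in value.split(",") if d.strip()]


def enough_for_ddp(ids, minimum=2):
    """True if the parsed id list exposes at least `minimum` devices."""
    # None means "unset": the device count cannot be known from the env alone.
    return ids is not None and len(ids) >= minimum


ids = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "1,2,3,4"})
print(ids, enough_for_ddp(ids))
```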

All scripts are run through main.py:

Train: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py train

Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test

Ensemble Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py esm_test 
// See config.py for settings of ensemble testing

Weight Average: python main.py weight_average
// See config.py for settings of weight averaging
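
For readers unfamiliar with weight averaging: the general idea is to take the element-wise mean of the parameters of several checkpoints and use the result as a single model. The sketch below illustrates that idea only. It assumes each checkpoint is a plain dict mapping parameter names to NumPy arrays with identical keys and shapes, which is a simplification and not the repository's checkpoint format or implementation. The actual behaviour is governed by the settings in config.py.

```python
import numpy as np


def average_weights(checkpoints):
    """Element-wise mean of parameter dicts that share the same keys and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Stack the same parameter from every checkpoint, then average over
        # the new leading axis.
        stacked = np.stack(
            [np.asarray(ckpt[name], dtype=np.float64) for ckpt in checkpoints]
        )
        averaged[name] = stacked.mean(axis=0)
    return averaged


# Toy checkpoints (hypothetical parameter names and values).
ckpt_a = {"layer.weight": np.array([1.0, 2.0]), "layer.bias": np.array([0.0])}
ckpt_b = {"layer.weight": np.array([3.0, 4.0]), "layer.bias": np.array([1.0])}
avg = average_weights([ckpt_a, ckpt_b])
print(avg)
```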

Localization on DESED

CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test
// make sure that fl_local=True in config.py
python fl_evaluate.py
// organize and gather the localization results
fl_evaluate_f1.ipynb
// Follow the notebook to produce the results

Model Checkpoints:

We provide model checkpoints for the three datasets (and additionally for the DESED dataset) at this link. Feel free to download and test them.

Citing

@inproceedings{htsat-ke2022,
  author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
  booktitle = {{ICASSP} 2022}
}

Our work is based on Swin Transformer, which is a famous image classification transformer model.
