Dataset and baseline code for the VocalSound dataset (ICASSP2022).

Overview

VocalSound: A Dataset for Improving Human Vocal Sounds Recognition

Introduction

VocalSound Poster

VocalSound is a free dataset consisting of 21,024 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects. The VocalSound dataset also contains meta information such as speaker age, gender, native language, country, and health condition.

This repository contains the official code of the data preparation and baseline experiment in the ICASSP paper VocalSound: A Dataset for Improving Human Vocal Sounds Recognition (Yuan Gong, Jin Yu, and James Glass; MIT & Signify). Specifically, we provide an extremely simple one-click Google Colab script Open In Colab for the baseline experiment, no GPU / local data downloading is needed.

The dataset is ideal for:

  • Build vocal sound recognizer.
  • Research on removing model bias on various speaker groups.
  • Evaluate pretrained models (e.g., those trained with AudioSet) on vocal sound classification to check their generalization ability.
  • Combine with existing large-scale general audio dataset to improve the vocal sound recognition performance.

Citing

Please cite our paper(s) if you find the VocalSound dataset and code useful. The first paper proposes introduces the VocalSound dataset and the second paper describes the training pipeline and model we used for the baseline experiment.

@INPROCEEDINGS{gong_vocalsound,
  author={Gong, Yuan and Yu, Jin and Glass, James},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition}, 
  year={2022},
  pages={151-155},
  doi={10.1109/ICASSP43922.2022.9746828}}
@ARTICLE{gong_psla, 
    author={Gong, Yuan and Chung, Yu-An and Glass, James},
    title={PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation}, 
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},  
    year={2021}, 
    doi={10.1109/TASLP.2021.3120633}
}

Download VocalSound

The VocalSound dataset can be downloaded as a single .zip file:

Sample Recordings (Listen to it without downloading)

VocalSound 44.1kHz Version (4.5 GB)

VocalSound 16kHz Version (1.7 GB, used in our baseline experiment)

(Mirror Links) 腾讯微云下载链接: 试听24个样本16kHz版本44.1kHz版本

If you plan to reproduce our baseline experiments using our Google Colab script, you do NOT need to download it manually, our script will download and process the 16kHz version automatically.

Creative Commons License
The VocalSound dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Dataset Details

data
├──readme.txt
├──class_labels_indices_vs.csv # include label code and name information
├──audio_16k
│  ├──f0003_0_cough.wav # female speaker, id=0003, 0=first collection (most spks only record once, but there are exceptions), cough
│  ├──f0003_0_laughter.wav
│  ├──f0003_0_sigh.wav
│  ├──f0003_0_sneeze.wav
│  ├──f0003_0_sniff.wav
│  ├──f0003_0_throatclearing.wav
│  ├──f0004_0_cough.wav # data from another female speaker 0004
│   ... (21024 files in total)
│   
├──audio_44k
│    # same recordings with those in data/data_16k, but are no downsampled
│   ├──f0003_0_cough.wav
│    ... (21024 files in total)
│
├──datafiles  # json datafiles that we use in our baseline experiment, you can ignore it if you don't use our training pipeline
│  ├──all.json  # all data
│  ├──te.json  # test data
│  ├──tr.json  # training data
│  ├──val.json  # validation data
│  └──subtest # subset of the test set, for fine-grained evaluation
│     ├──te_age1.json  # age [18-25]
│     ├──te_age2.json  # age [26-48]
│     ├──te_age3.json  # age [49-80]
│     ├──te_female.json
│     └──te_male.json
│
└──meta  # Meta information of the speakers [spk_id, gender, age, country, native language, health condition (no=no problem)]
   ├──all_meta.json  # all data
   ├──te_meta.json  # test data
   ├──tr_meta.json  # training data
   └──val_meta.json  # validation data

Baseline Experiment

Option 1. One-Click Google Colab Experiment Open In Colab

We provide an extremely simple one-click Google Colab script for the baseline experiment.

What you need:

  • A free google account with Google Drive free space > 5Gb
    • A (paid) Google Colab Pro plan could speed up training, but is not necessary. Free version can run the script, just a bit slower.

What you don't need:

  • Download VocalSound manually (The Colab script download it to your Google Drive automatically)
  • GPU or any other hardware (Google Colab provides free GPUs)
  • Any enviroment setting and package installation (Google Colab provides a ready-to-use environment)
  • A specific operating system (You only need a web browser, e.g., Chrome)

Please Note

  • This script is slightly different with our local code, but the performance is not impacted.
  • Free Google Colab might be slow and unstable. In our test, it takes ~5 minutes to train the model for one epoch with a free Colab account.

To run the baseline experiment

  • Make sure your Google Drive is mounted. You don't need to do it by yourself, but Google Colab will ask permission to acess your Google Drive when you run the script, please allow it if you want to use Google Drive.
  • Make sure GPU is enabled for Colab. To do so, go to the top menu > Edit > Notebook settings and select GPU as Hardware accelerator.
  • Run the script. Just press Ctrl+F9 or go to runtime menu on top and click "run all" option. That's it.

Option 2. Run Experiment Locally

We also provide a recipe for local experiments.

Compared with the Google Colab online script, it has following advantages:

  • It can be faster and more stable than online Google Colab (free version) if you have fast GPUs.
  • It is basically the original code we used for our paper, so it should reproduce the exact numbers in the paper.

Step 1. Clone or download this repository and set it as the working directory, create a virtual environment and install the dependencies.

cd vocalsound/ 
python3 -m venv venv-vs
source venv-vs/bin/activate
pip install -r requirements.txt 

Step 2. Download the VocalSound dataset and process it.

cd data/
wget https://www.dropbox.com/s/c5ace70qh1vbyzb/vs_release_16k.zip?dl=0 -O vs_release_16k.zip
unzip vs_release_16k.zip
cd ../src
python prep_data.py

# you can provide a --data_dir augment if you download the data somewhere else
# python prep_data.py --data_dir absolute_path/data

Step 3. Run the baseline experiment

chmod 777 run.sh
./run.sh

# or slurm user
#sbatch run.sh

We test both options before this release, you should get similar accuracies.

Accuracy (%) Colab Script Open In Colab Local Script ICASSP Paper
Validation Set 91.1 90.2 90.1±0.2
All Test Set 91.6 90.6 90.5±0.2
Test Age 18-25 93.4 92.3 91.5±0.3
Test Age 26-48 90.8 90.0 90.1±0.2
Test Age 49-80 92.2 90.2 90.9±1.6
Test Male 89.8 89.6 89.2±0.5
Test Female 93.4 91.6 91.9±0.1
Model Implementation torchvision EfficientNet PSLA EfficientNet PSLA EfficientNet
Batch Size 80 100 100
GPU Google Colab Free 4X Titan 4X Titan
Training Time (30 Epochs) ~2.5 Hours ~1 Hour ~1 Hour

Contact

If you have a question, please bring up an issue (preferred) or send me an email [email protected].

Owner
Yuan Gong
Postdoc, MIT CSAIL
Yuan Gong
An audio digital processing toolbox based on a workflow/pipeline principle

AudioTK Audio ToolKit is a set of audio filters. It helps assembling workflows for specific audio processing workloads. The audio workflow is split in

Matthieu Brucher 238 Oct 18, 2022
Official implementation of A cappella: Audio-visual Singing VoiceSeparation, from BMVC21

Y-Net Official implementation of A cappella: Audio-visual Singing VoiceSeparation, British Machine Vision Conference 2021 Project page: ipcv.github.io

Juan F. Montesinos 12 Oct 22, 2022
A library for augmenting annotated audio data

muda A library for Musical Data Augmentation. muda package implements annotation-aware musical data augmentation, as described in the muda paper. The

Brian McFee 214 Nov 22, 2022
Python interface to the WebRTC Voice Activity Detector

py-webrtcvad This is a python interface to the WebRTC Voice Activity Detector (VAD). It is compatible with Python 2 and Python 3. A VAD classifies a p

John Wiseman 1.5k Dec 22, 2022
Python I/O for STEM audio files

stempeg = stems + ffmpeg Python package to read and write STEM audio files. Technically, stems are audio containers that combine multiple audio stream

Fabian-Robert Stöter 72 Dec 23, 2022
Anaphones are like anagrams, but for sounds.

Anaphones Anaphones are like anagrams but for sounds (phonemes). Examples include: salami-awesomely, atari-tiara, and beefy-phoebe. Anaphones can be a

James Murphy 18 Nov 02, 2022
Analysis of voices based on the Mel-frequency band

Speaker_partition_module Analysis of voices based on the Mel-frequency band. Goal: Identification of voices speaking (diarization) and calculation of

1 Feb 06, 2022
A python library for working with praat, textgrids, time aligned audio transcripts, and audio files.

praatIO Questions? Comments? Feedback? A library for working with praat, time aligned audio transcripts, and audio files that comes with batteries inc

Tim 224 Dec 19, 2022
Stream Music 🎵 𝘼 𝙗𝙤𝙩 𝙩𝙝𝙖𝙩 𝙘𝙖𝙣 𝙥𝙡𝙖𝙮 𝙢𝙪𝙨𝙞𝙘 𝙤𝙣 𝙏𝙚𝙡𝙚𝙜𝙧𝙖𝙢 𝙂𝙧𝙤𝙪𝙥 𝙖𝙣𝙙 𝘾𝙝𝙖𝙣𝙣𝙚𝙡 𝙑𝙤𝙞𝙘𝙚 𝘾𝙝𝙖𝙩𝙨 𝘼𝙫𝙖𝙞𝙡?

Stream Music 🎵 𝘼 𝙗𝙤𝙩 𝙩𝙝𝙖𝙩 𝙘𝙖𝙣 𝙥𝙡𝙖𝙮 𝙢𝙪𝙨𝙞𝙘 𝙤𝙣 𝙏𝙚𝙡𝙚𝙜𝙧𝙖𝙢 𝙂𝙧𝙤𝙪𝙥 𝙖𝙣𝙙 𝘾𝙝𝙖𝙣𝙣𝙚𝙡 𝙑𝙤𝙞𝙘𝙚 𝘾𝙝𝙖𝙩𝙨 𝘼𝙫𝙖𝙞𝙡?

Sadew Jayasekara 15 Nov 12, 2022
An audio-solving python funcaptcha solving module

funcapsolver funcapsolver is a funcaptcha audio-solving module, which allows captchas to be interacted with and solved with the use of google's speech

Acier 8 Nov 21, 2022
Pyrogram bot to automate streaming music in voice chats

Pyrogram bot to automate streaming music in voice chats Help If you face an error, want to discuss this project or get support for it, join it's group

Roj 124 Oct 21, 2022
SinGlow: Generative Flow for SVS tasks in Tensorflow 2

SinGlow is a part of my Singing voice synthesis system. It can extract features of sound, particularly songs and musics. Then we can use these features (or perfect encoding) for feature migrating tas

Haobo Yang 8 Aug 22, 2022
Stevan KZ 1 Oct 27, 2021
Audio pitch-shifting & re-sampling utility, based on the EMU SP-1200

Pitcher.py Free & OS emulation of the SP-12 & SP-1200 signal chain (now with GUI) Pitch shift / bitcrush / resample audio files Written and tested in

morgan 13 Oct 03, 2022
Guide & Examples to create deeplearning gstreamer plugins and use them in your pipeline

upai-gst-dl-plugins Guide & Examples to create deeplearning gstreamer plugins and use them in your pipeline Introduction Thanks to the work done by @j

UPAI.IO 11 Dec 11, 2022
This library provides common speech features for ASR including MFCCs and filterbank energies.

python_speech_features This library provides common speech features for ASR including MFCCs and filterbank energies. If you are not sure what MFCCs ar

James Lyons 2.2k Jan 04, 2023
Learn chords with your MIDI keyboard !

miditeach miditeach is a music learning tool that can be used to practice your chords skills with a midi keyboard 🎹 ! Features Midi keyboard input se

Alexis LOUIS 3 Oct 20, 2021
📺Headless全自动B站直播录播、切片、上传一体工具

DDRecorder Headless全自动B站直播录播、切片、上传一体工具 感谢 FortuneDayssss/BilibiliUploader 安装指南(Windows) 在Release下载zip包解压。 修改配置文件config.json 双击运行DDRecorder.exe (这将使用co

322 Dec 27, 2022
Xbot-Music - Bot Play Music and Video in Voice Chat Group Telegram

XBOT-MUSIC A Telegram Music+video Bot written in Python using Pyrogram and Py-Tg

Fariz 2 Jan 20, 2022
A Music Player Bot for Discord Servers

A Music Player Bot for Discord Servers

Halil Acar 2 Oct 25, 2021