Source code for "MusCaps: Generating Captions for Music Audio" (IJCNN 2021)

Overview

MusCaps: Generating Captions for Music Audio

Ilaria Manco1 2, Emmanouil Benetos1, Elio Quinton2, Gyorgy Fazekas1
1 Queen Mary University of London, 2 Universal Music Group

This repository is the official implementation of "MusCaps: Generating Captions for Music Audio" (IJCNN 2021). In this work, we propose an encoder-decoder model to generate natural language descriptions of music audio. We provide code to train our model on any dataset of (audio, caption) pairs, together with code to evaluate the generated descriptions on a set of automatic metrics (BLEU, METEOR, ROUGE, CIDEr, SPICE, SPIDEr).

Setup

The code was developed in Python 3.7 on Linux CentOS 7 and training was carried out on an RTX 2080 Ti GPU. Other GPUs and platforms have not been fully tested.

Clone the repo

git clone https://github.com/ilaria-manco/muscaps
cd muscaps

You'll need to have the libsndfile library installed. All other requirements, including the code package, can be installed with

pip install -r requirements.txt
pip install -e .
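If installation or audio loading fails, it can help to confirm that libsndfile is visible to the system's dynamic loader. The following is a minimal sketch using only the Python standard library; it is not part of the repository, and the helper name `has_libsndfile` is an assumption made for illustration.

```python
from ctypes.util import find_library


def has_libsndfile() -> bool:
    """Return True if the system loader can locate a libsndfile shared library."""
    # find_library returns a library name or path string, or None if nothing is found.
    return find_library("sndfile") is not None


if __name__ == "__main__":
    if has_libsndfile():
        print("libsndfile found")
    else:
        print("libsndfile not found; install it with your system package manager")
```

A negative result only means the loader could not find the library on its default search path; a copy installed in a non-standard location may still work.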

Project structure

root
├─ configs                      # Config files
│   ├─ datasets
│   ├─ models  
│   └─ default.yaml              
├─ data                         # Folder to save data (input data, pretrained model weights, etc.)
│   ├─ audio_encoders   
│   ├─ datasets            
│   │   └─ dataset_name     
│   └─ ...
├─ muscaps
│   ├─ caption_evaluation_tools # Translation metrics eval on audio captioning
│   ├─ datasets                 # Dataset classes
│   ├─ models                   # Model code
│   ├─ modules                  # Model components
│   ├─ scripts                  # Python scripts for training, evaluation etc.
│   ├─ trainers                 # Trainer classes
│   └─ utils                    # Utils
└─ save                         # Saved model checkpoints, logs, configs, predictions    
    └─ experiments
        ├─ experiment_id1
        └─ ...

Dataset

The dataset used in our experiments is private and cannot be shared, but details on how to prepare an equivalent music captioning dataset are provided in the data README.

Pre-trained audio feature extractors

For the audio feature extraction component, MusCaps uses CNN-based audio tagging models like musicnn. In our experiments, we use @minzwon's implementation and pre-trained models, which you can download from the official repo. For example, to obtain the weights for the HCNN model trained on the MagnaTagATune dataset, run the following commands

mkdir -p data/audio_encoders
cd data/audio_encoders/
wget https://github.com/minzwon/sota-music-tagging-models/raw/master/models/mtat/hcnn/best_model.pth
mv best_model.pth mtt_hcnn.pth

Training

Dataset, model and training configurations are set in the respective YAML files in configs. Some fields can be overridden with command-line arguments (see the training script for details).
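The override pattern described above can be sketched as follows. This is an illustration only, not the repository's training script: the flag names are taken from the training command in this README, while the default values, the `defaults` dictionary (standing in for values loaded from a YAML file) and the helper names `apply_overrides` and `parse_args` are assumptions.

```python
import argparse
from typing import Any, Dict, List, Optional


def apply_overrides(config: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config in which every non-None override replaces the default."""
    merged = dict(config)
    for key, value in overrides.items():
        if value is not None:
            merged[key] = value
    return merged


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the optional overrides; flags left unset stay None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--feature_extractor", default=None)
    parser.add_argument("--pretrained_model", default=None)
    parser.add_argument("--device_num", type=int, default=None)
    return parser.parse_args(argv)


# Hypothetical defaults standing in for values read from a config file.
defaults = {"feature_extractor": "musicnn", "pretrained_model": "msd", "device_num": 0}

# Only the flags given on the command line replace the defaults.
args = parse_args(["--feature_extractor", "hcnn", "--pretrained_model", "mtt"])
config = apply_overrides(defaults, vars(args))
print(config)
```

Leaving every flag's default as `None` makes it possible to tell "not given on the command line" apart from an explicit value, so the file-based configuration is kept unless the user overrides it.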

To train the model with the default configs, simply run

cd muscaps/scripts/
python train.py <baseline/attention> --feature_extractor <musicnn/hcnn> --pretrained_model <msd/mtt> --device_num <gpu_number>

This will generate an experiment_id and create a new folder in save/experiments where the output will be saved.

If you wish to resume training from a saved checkpoint, run

python train.py <baseline/attention> --experiment_id <experiment_id> --device_num <gpu_number>
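To find which experiment IDs are available for resuming, one option is to list the subfolders of save/experiments. The sketch below assumes the folder layout shown in the project structure (one subfolder per experiment); the helper name `list_experiment_ids` is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List


def list_experiment_ids(save_dir: str) -> List[str]:
    """Return the sorted names of the experiment folders under <save_dir>/experiments."""
    experiments = Path(save_dir) / "experiments"
    if not experiments.is_dir():
        # Nothing has been trained yet, or the path is wrong.
        return []
    # Each experiment is assumed to live in its own subdirectory; plain files are ignored.
    return sorted(p.name for p in experiments.iterdir() if p.is_dir())


if __name__ == "__main__":
    for experiment_id in list_experiment_ids("save"):
        print(experiment_id)
```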

Evaluation

To evaluate a model saved under <experiment_id> on the captioning task, run

cd muscaps/scripts/
python caption.py <experiment_id> --metrics True
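Of the metrics listed in the overview, SPIDEr is a derived score, defined as the arithmetic mean of CIDEr and SPICE. The snippet below is a minimal illustration of that relationship rather than the evaluation code used in this repository; the function name `spider` is an assumption and the input scores are made up.

```python
def spider(cider: float, spice: float) -> float:
    """Combine CIDEr and SPICE into SPIDEr by taking their arithmetic mean."""
    return (cider + spice) / 2.0


# Made-up scores, for illustration only.
print(spider(0.5, 0.25))  # prints 0.375
```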

Cite

@misc{manco2021muscaps,
      title={MusCaps: Generating Captions for Music Audio}, 
      author={Ilaria Manco and Emmanouil Benetos and Elio Quinton and Gyorgy Fazekas},
      year={2021},
      eprint={2104.11984},
      archivePrefix={arXiv}
}

Acknowledgements

This repo reuses some code from the following repos:

Contact

If you have any questions, please get in touch: [email protected].
