Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Overview

Official code for our Interspeech 2021 paper, Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*.

Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a bias-controlled image dataset that features similar image classes to those present in ImageNet.

*Note: please see the arXiv version for additional results on the test set.

Setup

  1. Clone this repository and any submodules: git clone --recurse-submodules git@github.com:iapalm/Spoken-ObjectNet.git
  2. Follow the directions in data.md to set up ObjectNet images and the Spoken ObjectNet-50k corpus
  3. This code was tested with PyTorch 1.9 with CUDA 10.2 and Python 3.8.8.
  4. To train the models with the code as-is, we use 2 GPUs with 11 GB of memory each. A single GPU can be used, but the batch size or other parameters should be reduced.
  5. Note on speed: this code works as-is on the Spoken ObjectNet audio captions, but it could be made much faster. A main bottleneck is resampling the audio wav files from 48 kHz to 16 kHz, which is done on the fly with librosa. We suggest pre-processing the audio files into the desired format first, and then removing the resampling line or the on-the-fly spectrogram conversion entirely. We estimate this would make the code about 5x faster.
  6. On our servers, the zero-shot evaluation takes around 20-30 minutes and training takes around 4-5 days. As mentioned in the previous point, this could be improved with audio pre-processing.
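
The audio pre-processing suggested in point 5 could be sketched as follows. This is a minimal sketch, not part of the repository: it assumes librosa and soundfile are installed, and the helper names (plan_outputs, resample_file, preprocess) and any directory names passed to them are hypothetical. The data loader would still need to be pointed at the converted files, with its own resampling step removed.

```python
from pathlib import Path

TARGET_SR = 16000  # target sampling rate in Hz, as stated in the setup notes


def plan_outputs(src_dir, dst_dir):
    """Pair every .wav under src_dir with a mirrored output path under dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    return [(p, dst_dir / p.relative_to(src_dir)) for p in sorted(src_dir.rglob("*.wav"))]


def resample_file(src, dst, sr=TARGET_SR):
    """Load one file, resample it to sr, and write the result to dst."""
    # Imported lazily so the path planning above works without audio libraries.
    import librosa
    import soundfile as sf

    # librosa resamples while loading; by default it also mixes down to mono.
    audio, _ = librosa.load(str(src), sr=sr)
    dst.parent.mkdir(parents=True, exist_ok=True)
    sf.write(str(dst), audio, sr)


def preprocess(src_dir, dst_dir, sr=TARGET_SR):
    """Resample every file once, skipping outputs that already exist."""
    converted = 0
    for src, dst in plan_outputs(src_dir, dst_dir):
        if not dst.exists():
            resample_file(src, dst, sr)
            converted += 1
    return converted
```

Calling preprocess with the 48 kHz caption directory and an empty output directory converts each file once; running it again skips files that were already converted, so an interrupted run can be resumed.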

Running Experiments

We support 3 experiments that can be used as baselines for future work:

  • (1) Zero-shot evaluation of the ResDAVEnet-VQ model trained on Places-400k [2].
  • (2) Fine-tuning the ResDAVEnet-VQ model trained on Places-400k on Spoken ObjectNet with a frozen image branch.
  • (3) Training the ResDAVEnet-VQ model from scratch on Spoken ObjectNet with a frozen image branch.
  • Note: fine-tuning the image branch on Spoken ObjectNet is not permitted, but fine-tuning the audio branch is allowed.

Zero-shot transfer from Places-400k

  • Download and extract the directory containing the model weights from this link. Keep the folder named RDVQ_00000 and move it to the exps directory.
  • In scripts/train.sh, change data_dt to data/Spoken-ObjectNet-50k/metadata/SON-test.json to evaluate on the test set instead of the validation set.
  • Run the following command for zero-shot evaluation: source scripts/train.sh 00000 RDVQ_00000 "--resume True --mode eval"
  • The results are printed in exps/RDVQ_00000_transfer/train.out
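
Since the evaluation takes a while to run, it may be worth checking that the files named in the steps above are in place first. The following is a small sketch of such a check, not something the repository provides; the function name missing_paths is hypothetical, and the paths are the ones mentioned in these instructions, taken relative to the repository root.

```python
from pathlib import Path

# Paths named in the zero-shot instructions, relative to the repository root.
REQUIRED = [
    "scripts/train.sh",
    "exps/RDVQ_00000",
    "data/Spoken-ObjectNet-50k/metadata/SON-test.json",
]


def missing_paths(repo_root, required=REQUIRED):
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing before evaluation:", ", ".join(missing))
    else:
        print("All expected paths are present.")
```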

Fine-tune the model from Places-400k

  • Download and extract the directory containing the args.pkl file which specifies the fine-tuning arguments. The directory at this link contains the args.pkl file as well as the model weights.
  • The model weights of the fine-tuned model are provided for easier evaluation. Run the following command to evaluate the model using those weights: source scripts/train.sh 00000 RDVQ_00000_finetune "--resume True --mode eval"
  • Otherwise, to fine-tune the model yourself, first move the model weights to a new folder model_dl, then make a new folder model to save the new weights, and then run the following command: source scripts/train.sh 00000 RDVQ_00000_finetune "--resume True". This still requires the args.pkl file mentioned previously.
  • Please note the value of data_dt in scripts/train.sh. The code saves the best-performing model during training, which is why data_dt should be set to the validation set during training. During evaluation, the code loads the best-performing model, which is why data_dt should be set to the test set during evaluation.
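
The reason data_dt matters during training can be illustrated with a generic best-checkpoint rule. This is a hedged sketch of the general technique, not the repository's actual code: the class name BestCheckpointTracker and the scores are made up for illustration. Whichever split is scored each epoch decides which weights are kept, so scoring the test set during training would select the model on test data.

```python
class BestCheckpointTracker:
    """Remember which epoch scored highest on the held-out split."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_epoch = None

    def update(self, epoch, score):
        """Return True when this epoch beats the best score so far (save it)."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            return True
        return False


# Hypothetical per-epoch retrieval scores measured on the data_dt split.
scores = [0.10, 0.18, 0.17, 0.21, 0.20]
tracker = BestCheckpointTracker()
saved_epochs = [epoch for epoch, s in enumerate(scores) if tracker.update(epoch, s)]
print(saved_epochs, tracker.best_epoch)  # prints: [0, 1, 3] 3
```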

Train the model from scratch on Spoken ObjectNet

  • Run the following command to train the model from scratch: source scripts/train.sh 00000 RDVQ_scratch_frozen "--lr 0.001 --freeze-image-model True"
  • The model weights can be evaluated with source scripts/train.sh 00000 RDVQ_scratch_frozen "--resume True --mode eval"
  • We also provide the trained model weights at this link.
  • Please note the value of data_dt in scripts/train.sh. The code saves the best-performing model during training, which is why data_dt should be set to the validation set during training. During evaluation, the code loads the best-performing model, which is why data_dt should be set to the test set during evaluation.

Contact

If you find any problems or have any questions, please open an issue and we will try to respond as soon as possible. You can also try emailing the first corresponding author.

References

[1] Palmer, I., Rouditchenko, A., Barbu, A., Katz, B., Glass, J. (2021) Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset. Proc. Interspeech 2021, 3650-3654, doi: 10.21437/Interspeech.2021-245

[2] David Harwath*, Wei-Ning Hsu*, and James Glass. Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech. Proc. International Conference on Learning Representations (ICLR), 2020

Spoken ObjectNet - Bibtex:

@inproceedings{palmer21_interspeech,
  author={Ian Palmer and Andrew Rouditchenko and Andrei Barbu and Boris Katz and James Glass},
  title={{Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3650--3654},
  doi={10.21437/Interspeech.2021-245}
}