# ResDAVEnet-VQ

Official PyTorch implementation of *Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech*
## What is in this repo?

- Multi-GPU training of ResDAVEnet-VQ
- Quantitative evaluation
  - Image-to-speech and speech-to-image retrieval
  - ZeroSpeech 2019 ABX phone-discriminability test
  - Word detection
- Qualitative evaluation
  - Visualize time-aligned word/phone/code transcripts
  - F1/recall/precision scatter plots for model/layer comparison
If you find the code useful, please cite

```
@inproceedings{Harwath2020Learning,
  title={Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech},
  author={David Harwath and Wei-Ning Hsu and James Glass},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=B1elCp4KwH}
}
```
## Pre-trained models

| Model | R@10 | Link | MD5 sum |
|---|---|---|---|
| {} | 0.735 | gDrive | e3f94990c72ce9742c252b2e04f134e4 |
| {}->{2} | 0.760 | gDrive | d8ebaabaf882632f49f6aea0a69516eb |
| {}->{3} | 0.794 | gDrive | 2c3a269c70005cbbaaa15fc545da93fa |
| {}->{2,3} | 0.787 | gDrive | d0764d8e97187c8201f205e32b5f7fee |
| {2} | 0.753 | gDrive | d68c942069fcdfc3944e556f6af79c60 |
| {2}->{2,3} | 0.764 | gDrive | 09e704f8fcd9f85be8c4d5bdf779bd3b |
| {2}->{2,3}->{2,3,4} | 0.793 | gDrive | 6e403e7f771aad0c95f087318bf8447e |
| {3} | 0.734 | gDrive | a0a3d5adbbd069a2739219346c8a8f70 |
| {3}->{2,3} | 0.760 | gDrive | 6c92bcc4445895876a7840bc6e88892b |
| {2,3} | 0.667 | gDrive | 7a98a661302939817a1450d033bc2fcc |
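To check that a downloaded checkpoint is intact, you can compare its MD5 digest against the table above. A minimal sketch (the file name below is a placeholder, not the actual name served by the gDrive link):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder file name; compare the printed digest with the "MD5 sum" column.
print(md5sum("RDVQ_pretrained_model.tar.gz"))
```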
## Data preparation

### Download the MIT Places Image/Audio Data

We use the MIT Places scene recognition database (Places Image) and the paired MIT Places Audio Caption Corpus (Places Audio), which together contain roughly 400K image/spoken-caption pairs, as the visually-grounded speech data for training ResDAVEnet-VQ.
### Optional data preprocessing

Data specification files can be found at `metadata/{train,val}.json` inside the Places Audio directory; however, they do not include the time-aligned word transcripts needed for analysis. Those with alignments can be downloaded here.
Open the `*.json` files and update the values of `image_base_path` and `audio_base_path` to reflect the paths where the image and audio datasets are stored.
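If you prefer to do this programmatically, the sketch below rewrites both specification files. It assumes `image_base_path` and `audio_base_path` are top-level keys and that the files live under `metadata/`; the paths are placeholders to adjust for your machine.

```python
import json

# Placeholder locations of the downloaded datasets; adjust to your setup.
NEW_PATHS = {
    "image_base_path": "/data/PlacesImages/",
    "audio_base_path": "/data/PlacesAudio/",
}

for spec_file in ("metadata/train.json", "metadata/val.json"):
    with open(spec_file) as f:
        spec = json.load(f)
    spec.update(NEW_PATHS)  # assumes these are top-level keys
    with open(spec_file, "w") as f:
        json.dump(spec, f, indent=2)
```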
To speed up data loading, we save the image and audio data into HDF5 binary files and use the h5py Python interface to access them. The corresponding PyTorch Dataset class is `ImageCaptionDatasetHDF5` in `./dataloaders/image_caption_dataset_hdf5.py`. To prepare the HDF5 datasets, run

```bash
./scripts/preprocess.sh
```
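After preprocessing, you can sanity-check the resulting files with `h5py`. A quick sketch (the file name is a placeholder; the group/dataset names inside depend on `preprocess.sh`, so they are listed rather than assumed):

```python
import h5py

# Placeholder path to one of the HDF5 files written by ./scripts/preprocess.sh
with h5py.File("places_train.hdf5", "r") as f:
    # Print every group/dataset name, with shape and dtype where available.
    def describe(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
        else:
            print(f"{name}/ (group)")
    f.visititems(describe)
```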
(We do support on-the-fly feature processing with the `ImageCaptionDataset` class in `./dataloaders/image_caption_dataset.py`, which takes a data specification file as input (e.g., `metadata/train.json`). However, this can be very slow.)

`ImageCaptionDataset` and `ImageCaptionDatasetHDF5` are interchangeable, but most scripts in this repo assume the preprocessed HDF5 dataset is available. Users would have to modify the code correspondingly to use `ImageCaptionDataset`.
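For orientation, both classes follow the standard `torch.utils.data.Dataset` interface, so they plug into a `DataLoader` the same way. The sketch below is illustrative only: the constructor argument and the batch layout are assumptions, so check the class definitions before relying on it.

```python
from torch.utils.data import DataLoader
from dataloaders.image_caption_dataset_hdf5 import ImageCaptionDatasetHDF5

# NOTE: the constructor argument below is an assumption for illustration;
# see ./dataloaders/image_caption_dataset_hdf5.py for the real signature.
dataset = ImageCaptionDatasetHDF5("places_train.hdf5")
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

batch = next(iter(loader))  # the batch layout is defined by the dataset class
```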
## Interactive Qualitative Evaluation

See `run_evaluations.ipynb`.
## Quantitative Evaluation

### ZeroSpeech 2019 ABX Phone Discriminability Test

Users need to download the dataset and the Docker image by following the instructions here. To extract ResDAVEnet-VQ features, see `./scripts/dump_zs19_abx.sh`.
### Word detection

See `./run_unit_analysis.py`. It needs both the HDF5 dataset and the original JSON dataset to get the time-aligned word transcripts.

Example:

```bash
python run_unit_analysis.py --hdf5_path=$hdf5_path --json_path=$json_path \
  --exp_dir=$exp_dir --layer=$layer --output_dir=$out_dir
```
### Cross-modal retrieval

See `./run_ResDavenetVQ.py`. Set `--mode=eval` for retrieval evaluation.

Example:

```bash
python run_ResDavenetVQ.py --resume=True --mode=eval \
  --data-train=$data_tr --data-val=$data_dt \
  --exp-dir="./exps/pretrained/RDVQ_01000_01100_01110"
```
## Training

See `./scripts/train.sh`.

To train a model from scratch with the 2nd and 3rd layers quantized, run

```bash
./scripts/train.sh 01100 RDVQ_01100 ""
```

To train a model with the 2nd and 3rd layers quantized, and initialize weights from a pre-trained model (e.g., `./exps/RDVQ_00000`), run

```bash
./scripts/train.sh 01100 RDVQ_01100 "--seed-dir ./exps/RDVQ_00000"
```