Cross-modal Retrieval using Transformer Encoder Reasoning Networks
This project reimplements the idea from "Transformer Reasoning Network for Image-Text Matching and Retrieval". To solve the cross-modal retrieval task, representative features are extracted from each modality with its own pipeline and then projected into a shared embedding space. Because these features are sequences of vectors, Transformer-based models are a natural fit. The highlight contributions of this repo are:
- A reimplementation of the TERN module, which exploits Transformer encoders on top of bottom-up attention features and BERT features.
- Use of facebookresearch's FAISS for efficient similarity search and clustering of dense vectors.
- Experiments with various metric learning loss objectives from KevinMusgrave's PyTorch Metric Learning (see the sketch right after this list).
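As a rough illustration of the last point, here is a minimal sketch of plugging a loss from pytorch-metric-learning into a cross-modal setup; the pairing scheme (matched image/caption embeddings share a label) and the margin value are assumptions for illustration, not necessarily what the training script uses.

```python
# Minimal sketch: matched image/caption pairs share a label so the metric
# learning loss pulls them together; all numbers here are illustrative only.
import torch
from pytorch_metric_learning import losses

loss_func = losses.TripletMarginLoss(margin=0.2)

img_emb = torch.randn(8, 1024, requires_grad=True)   # image embeddings
cap_emb = torch.randn(8, 1024, requires_grad=True)   # caption embeddings
labels = torch.arange(8)                              # i-th image matches i-th caption

# Treat both modalities as one batch so matched pairs become positives.
loss = loss_func(torch.cat([img_emb, cap_emb]), torch.cat([labels, labels]))
loss.backward()
```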
The figure below shows an overview of the architecture.
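As a code-level complement to that figure, below is a rough sketch (not the exact TERN implementation) of the shared-embedding idea: each modality is reasoned over by its own Transformer encoder, then pooled and projected into a common space. All layer sizes and pooling choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEncoder(nn.Module):
    """Toy TERN-style encoder: per-modality Transformer reasoning + shared projection."""
    def __init__(self, vis_dim=2048, txt_dim=768, d_model=1024, embed_dim=1024):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)   # bottom-up region features -> d_model
        self.txt_proj = nn.Linear(txt_dim, d_model)   # BERT token features -> d_model
        self.vis_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.txt_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, embed_dim)     # shared projection head

    def forward(self, regions, tokens):
        v = self.vis_encoder(self.vis_proj(regions)).mean(dim=1)  # pool over regions
        t = self.txt_encoder(self.txt_proj(tokens)).mean(dim=1)   # pool over tokens
        return F.normalize(self.head(v), dim=-1), F.normalize(self.head(t), dim=-1)

# regions: (batch, 36 boxes, 2048), tokens: (batch, 20 tokens, 768)
img_emb, cap_emb = SharedSpaceEncoder()(torch.randn(2, 36, 2048), torch.randn(2, 20, 768))
```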
Datasets
- I trained TERN on the Flickr30k dataset, which contains 31,000 images collected from Flickr, each paired with 5 reference sentences provided by human annotators. For each sample, the visual and text features are pre-extracted and stored as numpy files (a minimal loading sketch follows this list).
- Some samples from the dataset:
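Since the features are stored as numpy files, loading a sample boils down to something like the snippet below; the directory layout, file names, and array shapes are hypothetical and should be taken from the config file instead.

```python
import numpy as np

# Hypothetical paths and shapes, for illustration only.
region_feats = np.load("data/flickr30k/bottom_up/36979.npy")  # e.g. (36, 2048) region features
text_feats = np.load("data/flickr30k/bert/36979_cap0.npy")    # e.g. (seq_len, 768) token features
print(region_feats.shape, text_feats.shape)
```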
Execution
- Installation
```bash
pip install -r requirements.txt
apt install libomp-dev
pip install faiss-gpu
```
- Specify dataset paths and configuration in the config file
- For training
```bash
PYTHONPATH=. python tools/train.py
```
- For evaluation (a FAISS-based retrieval sketch follows this list)
```bash
PYTHONPATH=. python tools/eval.py \
    --top_k=<top k similarity> \
    --weight=<model checkpoint>
```
- For inference, see the tools/inference.py script
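For the retrieval step itself (e.g. the top-k search during evaluation), FAISS handles the similarity search. Below is a minimal sketch under the assumption of L2-normalized 1024-d embeddings and an exact inner-product index, which may differ from what tools/eval.py actually builds.

```python
import faiss
import numpy as np

d = 1024                                             # embedding dimension (assumption)
img_emb = np.random.rand(1000, d).astype("float32")  # stand-in image embeddings
txt_emb = np.random.rand(5, d).astype("float32")     # stand-in text query embeddings
faiss.normalize_L2(img_emb)                          # cosine similarity via inner product
faiss.normalize_L2(txt_emb)

index = faiss.IndexFlatIP(d)                         # exact inner-product index
index.add(img_emb)
scores, ids = index.search(txt_emb, 10)              # top-10 image ids for each text query
```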
Notebooks
- Run inference with TERN on the Flickr30k dataset
- Use FasterRCNN to extract bottom-up attention embeddings
- Use BERT to extract text embeddings (sketched below)
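For reference, per-token text features of the kind TERN consumes can be extracted roughly like this with Hugging Face transformers; the exact model variant and pooling used in the notebook may differ.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("Two dogs are running along the street", return_tensors="pt")
with torch.no_grad():
    token_feats = model(**inputs).last_hidden_state  # (1, seq_len, 768) per-token features
```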
Results
- Validation results on the Flickr30k dataset (trained for 100 epochs):
Model | Weights | i2t R@k | t2i R@k
---|---|---|---
TERN | link | 0.5174 | 0.7496
- Some visualizations
  - Query text: Two dogs are running along the street
  - Query text: The woman is holding a violin
  - Query text: Young boys are playing baseball
  - Query text: A man is standing, looking at a lake
Paper References
@misc{messina2021transformer,
title={Transformer Reasoning Network for Image-Text Matching and Retrieval},
author={Nicola Messina and Fabrizio Falchi and Andrea Esuli and Giuseppe Amato},
year={2021},
eprint={2004.09144},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{anderson2018bottomup,
title={Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
author={Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
year={2018},
eprint={1707.07998},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{JDH17,
title={Billion-scale similarity search with GPUs},
author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}},
journal={arXiv preprint arXiv:1702.08734},
year={2017}
}