An efficient PyTorch implementation of the winning entry of the 2017 VQA Challenge.


Bottom-Up and Top-Down Attention for Visual Question Answering

The implementation follows the VQA system described in "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" and "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge".


| Model | Validation Accuracy | Training Time |
| --- | --- | --- |
| Reported Model | 63.15 | 12 - 18 hours (Tesla K40) |
| Implemented Model | 63.58 | 40 - 50 minutes (Titan Xp) |

The accuracy was calculated using the VQA evaluation metric.
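For readers unfamiliar with the metric, the sketch below shows the commonly cited form of VQA accuracy: a predicted answer scores min(n / 3, 1), where n is the number of human annotators who gave that answer. It is a simplified illustration, not the evaluation code used in this repository; the function names are hypothetical, and the official evaluation additionally normalizes answer strings and averages over subsets of annotators, which is omitted here.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one predicted answer against the human answers for one question.

    An answer counts as fully correct once at least three annotators gave it.
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


def mean_vqa_accuracy(predictions, all_human_answers):
    """Average the per-question scores and report them as a percentage."""
    scores = [
        vqa_accuracy(predicted, answers)
        for predicted, answers in zip(predictions, all_human_answers)
    ]
    return 100.0 * sum(scores) / len(scores)
```

Under this definition an answer given by only two of ten annotators scores 2/3 rather than 0, which is why training targets for VQA are often soft scores rather than a single correct label.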


This is part of a project done at CMU for the course 11-777 Advanced Multimodal Machine Learning, and is joint work by Hengyuan Hu, Alex Xiao, and Henry Huang.

As part of our project, we implemented bottom-up attention as a strong VQA baseline. We were planning to integrate object detection with VQA and were very glad to see that Peter Anderson and Damien Teney et al. had already done that beautifully. We hope this clean and efficient implementation can serve as a useful baseline for future VQA explorations.

Implementation Details

Our implementation follows the overall structure of the papers but with the following simplifications:

  1. We don't use extra data from Visual Genome.
  2. We use only a fixed number of objects per image (K=36).
  3. We use a simple, single stream classifier without pre-training.
  4. We use the simple ReLU activation instead of gated tanh.

The first two points greatly reduce the training time. Our implementation takes around 200 seconds per epoch on a single Titan Xp while the one described in the paper takes 1 hour per epoch.

We made the third simplification because we feel the two-stream classifier and pre-training in the original paper are overcomplicated and unnecessary.

For the non-linear activation unit, we tried gated tanh but couldn't make it work. We also tried the gated linear unit (GLU), which works better than ReLU. In the end we chose ReLU for its simplicity, and because the gain from GLU is too small to justify doubling the number of parameters.
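The three activation choices discussed above can be sketched as follows. This is an illustration under stated assumptions, not the code in this repository: the class names and layer sizes are hypothetical, and it is written against a current PyTorch and Python 3 rather than the PyTorch v0.3 and Python 2.7 that the repository targets.

```python
import torch
import torch.nn as nn


class ReLULayer(nn.Module):
    """Plain linear layer followed by ReLU (one weight matrix)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.relu(self.proj(x))


class GatedTanhLayer(nn.Module):
    """Gated tanh: tanh(W x + b) * sigmoid(W' x + b') (two weight matrices)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.value = nn.Linear(in_dim, out_dim)
        self.gate = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return torch.tanh(self.value(x)) * torch.sigmoid(self.gate(x))


class GLULayer(nn.Module):
    """Gated linear unit: (W x + b) * sigmoid(W' x + b')."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # One projection of twice the width; F.glu splits it into value and gate.
        self.proj = nn.Linear(in_dim, 2 * out_dim)

    def forward(self, x):
        return nn.functional.glu(self.proj(x), dim=-1)


def count_parameters(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())
```

For equal input and output sizes, `count_parameters(GLULayer(512, 1024))` is exactly twice `count_parameters(ReLULayer(512, 1024))`, which is the parameter doubling mentioned above; the same holds for the gated tanh layer.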

With these simplifications we would expect the performance to drop. For reference, the best result on the validation set reported in the paper is 63.15. The reported result without extra data from Visual Genome is 62.48, the result using only 36 objects per image is 62.82, the result using a two-stream classifier without pre-training is 62.28, and the result using ReLU is 61.63. These numbers are taken from Table 1 of the paper "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge". With all of the above simplifications combined, our first implementation got around 59-60 on the validation set.

To shrink the gap, we added some simple but powerful modifications:

  1. Add dropout to alleviate overfitting
  2. Double the number of neurons
  3. Add weight normalization (batch normalization does not seem to work well here)
  4. Switch to Adamax optimizer
  5. Gradient clipping
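The list above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's training code: the helper names, the layer sizes, the dropout rate, the clipping threshold, and the choice of a binary cross-entropy loss on soft answer scores are all assumptions made for the example, and the calls follow a current PyTorch API rather than v0.3.

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_, weight_norm


def make_classifier(in_dim, hid_dim, num_answers, dropout=0.5):
    """Weight-normalized MLP with dropout (all sizes are hypothetical)."""
    return nn.Sequential(
        weight_norm(nn.Linear(in_dim, hid_dim)),
        nn.ReLU(),
        nn.Dropout(dropout),
        weight_norm(nn.Linear(hid_dim, num_answers)),
    )


def train_step(model, optimizer, features, targets, max_grad_norm=0.25):
    """Run one optimization step with gradient clipping and return the loss."""
    model.train()
    optimizer.zero_grad()
    logits = model(features)
    # Assumed loss: soft answer scores in [0, 1] as multi-label targets.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
    loss.backward()
    # Rescale gradients so that their global L2 norm is at most max_grad_norm.
    clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

A typical call pairs the model with `torch.optim.Adamax(model.parameters())` and passes one batch of features and soft targets to `train_step` per iteration.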

These small modifications bring the number back to ~62.80. We further changed the concatenation-based attention module in the original paper to a projection-based module. This new attention module is inspired by the paper "Modeling Relationships in Referential Expressions with Compositional Modular Networks", with some modifications (implemented in attention.NewAttention). With the help of this new attention, we boost the performance to ~63.58, surpassing the reported best result with no extra data and lower computational cost.
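To make the difference between the two attention styles concrete, here is a rough sketch of each. It is not the code in attention.NewAttention: the class names, the hidden size, and the exact layer layout are assumptions, and weight normalization and dropout are left out for brevity. Both modules take image features `v` of shape [batch, K, v_dim] and a question vector `q` of shape [batch, q_dim], and return one attention weight per object.

```python
import torch
import torch.nn as nn


class ConcatAttention(nn.Module):
    """Scores each object from the concatenation of its feature and the question."""

    def __init__(self, v_dim, q_dim, hid_dim):
        super().__init__()
        self.hidden = nn.Linear(v_dim + q_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, v, q):
        num_objects = v.size(1)
        # Repeat the question vector once per object: [batch, K, q_dim].
        q_tiled = q.unsqueeze(1).expand(-1, num_objects, -1)
        joint = torch.relu(self.hidden(torch.cat([v, q_tiled], dim=2)))
        # Softmax over the K objects, giving weights of shape [batch, K, 1].
        return torch.softmax(self.score(joint), dim=1)


class ProjectionAttention(nn.Module):
    """Projects image and question separately, then fuses them by element-wise product."""

    def __init__(self, v_dim, q_dim, hid_dim):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid_dim)
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, v, q):
        v_hid = torch.relu(self.v_proj(v))               # [batch, K, hid_dim]
        q_hid = torch.relu(self.q_proj(q)).unsqueeze(1)  # [batch, 1, hid_dim]
        joint = v_hid * q_hid                            # broadcast over the K objects
        return torch.softmax(self.score(joint), dim=1)
```

The attended image feature is then the weighted sum `(weights * v).sum(dim=1)`, which has shape [batch, v_dim] for either module.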



Make sure you are on a machine with an NVIDIA GPU and Python 2, with about 70 GB of free disk space.

  1. Install PyTorch v0.3 with CUDA and Python 2.7.
  2. Install h5py.

Data Setup

All data should be downloaded to a 'data/' directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/ from the repository root. The features are provided by and downloaded from the original authors' repo. If the script does not work, it should be easy to examine the script and modify the steps outlined in it according to your needs. Then run tools/ from the repository root to process the data to the correct format.


Simply run python to start training. The training and validation scores will be printed every epoch, and the best model will be saved under the directory "saved_models". The default flags should give you the result provided in the table above.
