Official repository for "Intriguing Properties of Vision Transformers" (2021)

Overview

Intriguing Properties of Vision Transformers

Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, & Ming-Hsuan Yang

Paper Link

Abstract: Vision transformers (ViT) have demonstrated impressive performance across various machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robust performance to occlusions is not due to a bias towards local textures, and ViTs are significantly less biased towards textures compared to CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show effective features of ViTs are due to flexible and dynamic receptive fields possible via self-attention mechanisms. Our code will be publicly released.

Citation

@misc{naseer2021intriguing,
      title={Intriguing Properties of Vision Transformers}, 
      author={Muzammal Naseer and Kanchana Ranasinghe and Salman Khan and Munawar Hayat and Fahad Shahbaz Khan and Ming-Hsuan Yang},
      year={2021},
      eprint={2105.10497},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

We are in the process of cleaning our code. We will update this repo shortly. Here are the highlights of what to expect :)

  1. Pretrained ViT models trained on Stylized ImageNet (along with distilled ones). We will provide code to use these models for auto-segmentation.
  2. Training and Evaluations for our proposed off-the-shelf ensemble features.
  3. Code to evaluate any model on our proposed occulusion stratagies (random, foreground and background).
  4. Code for evaluation of permutation invaraince.
  5. Pretrained models to study the effect of varying patch sizes and positional encoding.
  6. Pretrained adversarial patches and code to evalute them.
  7. Training on Stylized Imagenet.

Requirements

pip install -r requirements.txt

Shape Biased Models

Our shape biased pretrained models can be downloaded from here. Code for evaluating their shape bias using auto segmentation on the PASCAL VOC dataset can be found under scripts. Please fix any paths as necessary. You may place the VOC devkit folder under data/voc of fix the paths appropriately.

Running segmentation evaluation on models:

./scripts/eval_segmentation.sh

Visualizing segmentation for images in a given folder:

./scripts/visualize_segmentation.sh

Off the Shelf Classification

Training code for off-the-shelf experiment in classify_metadataset.py. Seven datasets (aircraft CUB DTD fungi GTSRB Places365 INAT) available by default. Set the appropriate dir path in classify_md.sh by fixing DATA_PATH.

Run training and evaluation for a selected dataset (aircraft by default) using selected model (DeiT-T by default):

./scripts/classify_md.sh

Occlusion Evaluation

Evaluation on ImageNet val set (change path in script) for our proposed occlusion techniques:

./scripts/evaluate_occlusion.sh

Permutation Invariance Evaluation

Evaluation on ImageNet val set (change path in script) for the shuffle operation:

./scripts/evaluate_shuffle.sh

Varying Patch Sizes and Positional Encoding

Pretrained models to study the effect of varying patch sizes and positional encoding:

DeiT-T Model Top-1 Top-5 Pretrained
No Pos. Enc. 68.3 89.0 Link
Patch 22 68.7 89.0 Link
Patch 28 65.2 86.7 Link
Patch 32 63.1 85.3 Link
Patch 38 55.2 78.8 Link

References

Code borrowed from DeiT and DINO repositories.

Comments
  • Question about links of pretrained models

    Question about links of pretrained models

    Hi! First of all, thank the authors for the exciting work! I noticed that the checkpoint link of the pretrained 'deit_tiny_distilled_patch16_224' in vit_models/deit.py is different from the one of the shape-biased model DeiT-T-SIN (distilled), as given in README.md. I thought deit_tiny_distilled_patch16_224 has the same definition with DeiT-T-SIN (distilled). Do they have differences in model architecture or training procedure?

    opened by ZhouqyCH 3
  • Two questions on your paper

    Two questions on your paper

    Hi. This is heonjin.

    Firstly, big thanks to you and your paper. well-read and precise paper! I have two questions on your paper.

    1. Please take a look at Figure 9. image On the 'no positional encoding' experiment, there is a peak on 196 shuffle size of "DeiT-T-no-pos". Why is there a peak? and I wonder why there is a decreasing from 0 shuffle size to 64 of "DeiT-T-no-pos".

    2. On the Figure 14, image On the Aircraft(few shot), Flower(few shot) dataset, CNN performs better than DeiT. Could you explain this why?

    Thanks in advance.

    opened by hihunjin 2
  • Attention maps DINO Patchdrop

    Attention maps DINO Patchdrop

    Hi, thanks for the amazing paper.

    My question is about how which patches are dropped from the image with the DINO model. It looks like in the code in evaluate.py on line 132 head_number = 1. I want to understand the reason why this number was chosen (the other params used to index the attention maps seem to make sense). Wouldn't averaging the attention maps across heads give you better segmentation?

    Thanks,

    Ravi

    opened by rraju1 1
  • Support CPU when visualizing segmentations

    Support CPU when visualizing segmentations

    Most of the code to visualize segmentation is ready for GPU and CPU, but I bumped into this one place where there is a hard-coded .cuda() call. I changed it to .to(device) to support CPU.

    opened by cgarbin 0
  • Expand the instructions to install the PASCAL VOC dataset

    Expand the instructions to install the PASCAL VOC dataset

    I inspected the code to understand the expected directory structure. This note in the README may help other users put the dataset in the right place from the start.

    opened by cgarbin 0
  • Add note to use Python 3.8 because of PyTorch 1.7

    Add note to use Python 3.8 because of PyTorch 1.7

    PyTorch 1.7 requires Python 3.8. Refer to the discussion in https://github.com/pytorch/pytorch/issues/47354.

    Suggest adding this note to the README to help reproduce the environment because running pip install -r requirements.txt with the wrong version of Python gives an obscure error message.

    opened by cgarbin 0
  • Amazing work, but can it work on DETR?

    Amazing work, but can it work on DETR?

    ViT family show strong robustness on RandomDrop and Domain shift Problem. The thing is , I 'm working on object detection these days,detr is an end to end object detection methods which adopted Transformer's encoder decoder part, but the backbone I use , is Resnet50, it can still find the properties that your paper mentioned. Above all I want to ask two questions: (1).Do these intriguing properties come from encoder、decoder part? (2).What's the difference between distribution shift and domain shift(I saw distribution shift first time on your paper)?

    opened by 1184125805 0
Owner
Muzammal Naseer
PhD student at Australian National University.
Muzammal Naseer
Dialect classification

Dialect-Classification This repository presents the data that was used in a talk at ICKL-5 (5th International Conference on Kurdish Linguistics) at th

Kurdish-BLARK 0 Nov 12, 2021
Bootstrapped Unsupervised Sentence Representation Learning (ACL 2021)

Install first pip3 install -e . Training python3 training/unsupervised_tuning.py python3 training/supervised_tuning.py python3 training/multilingual_

yanzhang_nlp 26 Jul 22, 2022
Parallel and High-Fidelity Text-to-Lip Generation; AAAI 2022 ; Official code

Parallel and High-Fidelity Text-to-Lip Generation This repository is the official PyTorch implementation of our AAAI-2022 paper, in which we propose P

Zhying 77 Dec 21, 2022
Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!

Peter Lin 6.5k Jan 04, 2023
PyTorch-based framework for Deep Hedging

PFHedge: Deep Hedging in PyTorch PFHedge is a PyTorch-based framework for Deep Hedging. PFHedge Documentation Neural Network Architecture for Efficien

139 Dec 30, 2022
Graph neural network message passing reframed as a Transformer with local attention

Adjacent Attention Network An implementation of a simple transformer that is equivalent to graph neural network where the message passing is done with

Phil Wang 49 Dec 28, 2022
TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation

TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation Zhaoyun Yin, Pichao Wang, Fan Wang, Xianzhe Xu, Hanling Zhang, Hao Li

DamoCV 25 Dec 16, 2022
Pomodoro timer that acknowledges the inexorable, infinite passage of time

Pomodouroboros Most pomodoro trackers assume you're going to start them. But time and tide wait for no one - the great pomodoro of the cosmos is cold

Glyph 66 Dec 13, 2022
On Generating Extended Summaries of Long Documents

ExtendedSumm This repository contains the implementation details and datasets used in On Generating Extended Summaries of Long Documents paper at the

Georgetown Information Retrieval Lab 76 Sep 05, 2022
Neuron class provides LNU (Linear Neural Unit), QNU (Quadratic Neural Unit), RBF (Radial Basis Function), MLP (Multi Layer Perceptron), MLP-ELM (Multi Layer Perceptron - Extreme Learning Machine) neurons learned with Gradient descent or LeLevenberg–Marquardt algorithm

Neuron class provides LNU (Linear Neural Unit), QNU (Quadratic Neural Unit), RBF (Radial Basis Function), MLP (Multi Layer Perceptron), MLP-ELM (Multi Layer Perceptron - Extreme Learning Machine) neu

Filip Molcik 38 Dec 17, 2022
The official implementation of paper "Finding the Task-Optimal Low-Bit Sub-Distribution in Deep Neural Networks" (IJCV under review).

DGMS This is the code of the paper "Finding the Task-Optimal Low-Bit Sub-Distribution in Deep Neural Networks". Installation Our code works with Pytho

Runpei Dong 3 Aug 28, 2022
Robot Servers and Server Manager software for robo-gym

robo-gym-server-modules Robot Servers and Server Manager software for robo-gym. For info on how to use this package please visit the robo-gym website

JR ROBOTICS 4 Aug 16, 2021
Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more

Apache MXNet (incubating) for Deep Learning Master Docs License Apache MXNet (incubating) is a deep learning framework designed for both efficiency an

ROCm Software Platform 29 Nov 16, 2022
Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising

Kai Zhang 1.2k Dec 29, 2022
The code is for the paper "A Self-Distillation Embedded Supervised Affinity Attention Model for Few-Shot Segmentation"

SD-AANet The code is for the paper "A Self-Distillation Embedded Supervised Affinity Attention Model for Few-Shot Segmentation" [arxiv] Overview confi

cv516Buaa 9 Nov 07, 2022
DeepProbLog is an extension of ProbLog that integrates Probabilistic Logic Programming with deep learning by introducing the neural predicate.

DeepProbLog DeepProbLog is an extension of ProbLog that integrates Probabilistic Logic Programming with deep learning by introducing the neural predic

KU Leuven Machine Learning Research Group 94 Dec 18, 2022
Our CIKM21 Paper "Incorporating Query Reformulating Behavior into Web Search Evaluation"

Reformulation-Aware-Metrics Introduction This codebase contains source-code of the Python-based implementation of our CIKM 2021 paper. Chen, Jia, et a

xuanyuan14 5 Mar 05, 2022
Phy-Q: A Benchmark for Physical Reasoning

Phy-Q: A Benchmark for Physical Reasoning Cheng Xue*, Vimukthini Pinto*, Chathura Gamage* Ekaterina Nikonova, Peng Zhang, Jochen Renz School of Comput

29 Dec 19, 2022
2021 credit card consuming recommendation

2021 credit card consuming recommendation

Wang, Chung-Che 7 Mar 08, 2022
Repo for parser tensorflow(.pb) and tflite(.tflite)

tfmodel_parser .pb file is the format of tensorflow model .tflite file is the format of tflite model, which usually used in mobile devices before star

1 Dec 23, 2021