[CVPR 2022] "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy" by Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, Zhangyang Wang

Last update: Nov 26, 2022

Code for the paper: [CVPR 2022] The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy.

Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, Zhangyang Wang.

Overview

Vision transformers (ViTs) have gained increasing popularity because they are commonly believed to have higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is questionable whether this potential has been fully unleashed in practice, as learned ViTs often suffer from over-smoothing, which is likely to yield redundant models.

Recent works have made preliminary attempts to identify and alleviate such redundancy, e.g., by regularizing embedding similarity or re-injecting convolution-like structures. However, a “head-to-toe assessment” of the extent of redundancy in ViTs, and of how much could be gained by thoroughly mitigating it, has been absent from the field.

This paper, for the first time, systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space. In view of these findings, we advocate a principle of diversity for training ViTs and present corresponding regularizers that encourage representation diversity and coverage at each of those levels, thereby enabling the model to capture more discriminative information.
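
As a rough illustration of what a patch-embedding-level diversity term might look like, the sketch below penalizes the mean absolute cosine similarity between distinct patch tokens. It is an assumption-laden example, not necessarily the regularizer implemented in this repository: the function name `patch_cosine_similarity`, the `(batch, num_patches, dim)` input layout, and the weight `lam` in the usage note are all hypothetical.

```python
import torch
import torch.nn.functional as F


def patch_cosine_similarity(tokens: torch.Tensor) -> torch.Tensor:
    """Mean absolute cosine similarity between distinct patch embeddings.

    tokens: tensor of shape (batch, num_patches, dim), with num_patches >= 2.
    Returns a scalar tensor in [0, 1]; lower means more diverse patches.
    """
    # Scale every patch embedding to unit length so dot products are cosines.
    normed = F.normalize(tokens, dim=-1)
    # Pairwise cosine similarity matrix per sample: (batch, N, N).
    sim = normed @ normed.transpose(-2, -1)
    # Exclude the diagonal, where each patch is compared with itself.
    n = tokens.shape[1]
    off_diag = ~torch.eye(n, dtype=torch.bool, device=tokens.device)
    return sim[:, off_diag].abs().mean()


if __name__ == "__main__":
    # Random tokens standing in for the output of a ViT block (hypothetical sizes).
    tokens = torch.randn(2, 196, 384, requires_grad=True)
    penalty = patch_cosine_similarity(tokens)
    penalty.backward()  # the term is differentiable, so it can be added to a loss
    print(float(penalty))
```

Under these assumptions, such a term would be added to the task loss, e.g. `loss = cross_entropy + lam * patch_cosine_similarity(tokens)`, where `lam` is a tuning weight chosen for illustration only.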

Extensive experiments on ImageNet with a number of ViT backbones validate the effectiveness of our proposals, which largely eliminate the observed ViT redundancy and significantly boost model generalization. For example, our diversified DeiT obtains accuracy gains of 0.70% to 1.76% on ImageNet with substantially reduced similarity.

Prerequisites

Install PyTorch 1.7.0+, torchvision 0.8.1+, and pytorch-image-models (timm) 0.3.2:

conda install -c pytorch pytorch torchvision
pip install timm==0.3.2
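
After installation, a quick check along the following lines can confirm that the environment matches the versions listed above. This is only a sketch: the helper names `parse_version`, `satisfies`, and `report` are made up for illustration, and the version parsing is deliberately simple (it keeps the leading numeric components and ignores suffixes such as `+cu110`).

```python
from importlib.metadata import PackageNotFoundError, version

# (minimum version, exact match required?) taken from the prerequisites above.
REQUIREMENTS = {
    "torch": ("1.7.0", False),
    "torchvision": ("0.8.1", False),
    "timm": ("0.3.2", True),
}


def parse_version(text: str) -> tuple:
    """Turn a string like '1.7.0+cu110' into the tuple (1, 7, 0)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        leading = ""
        for ch in piece:
            if not ch.isdigit():
                break
            leading += ch
        if not leading:
            break
        numbers.append(int(leading))
    # Pad to three components so that '1.7' compares equal to '1.7.0'.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def satisfies(installed: str, required: str, exact: bool = False) -> bool:
    """Check an installed version against a minimum (or exact) requirement."""
    have, want = parse_version(installed), parse_version(required)
    return have == want if exact else have >= want


def report() -> dict:
    """Map each required package to 'ok', 'mismatch', or 'missing'."""
    status = {}
    for name, (required, exact) in REQUIREMENTS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            status[name] = "missing"
            continue
        status[name] = "ok" if satisfies(installed, required, exact) else "mismatch"
    return status


if __name__ == "__main__":
    for name, state in report().items():
        print(f"{name}: {state}")
```

The check reads package metadata only, so it does not import PyTorch itself and runs quickly even in an incomplete environment.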

Training on ImageNet

./script/run_deit_small_diverse.sh [data/imagenet] (DeiT-Small, 12 layers)
./script/run_deit_small_24layer_diverse.sh [data/imagenet] (DeiT-Small, 24 layers)

Citation

TBD

Acknowledgement

https://github.com/facebookresearch/deit
