The code for our paper CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.

Overview

CrossFormer

This repository is the code for our paper CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.

Introduction

Existing vision transformers fail to build attention among objects/features of different scales (cross-scale attention), while such ability is very important to visual tasks. CrossFormer is a versatile vision transformer which solves this problem. Its core designs contain Cross-scale Embedding Layer (CEL), Long-Short Distance Attention (L/SDA), which work together to enable cross-scale attention.

CEL blends every input embedding with multiple-scale features. L/SDA split all embeddings into several groups, and the self-attention is only computed within each group (embeddings with the same color border belong to the same group.).

Further, we also propose a dynamic position bias (DPB) module, which makes the effective yet inflexible relative position bias apply to variable image size.

Now, experiments are done on four representative visual tasks, i.e., image classification, objection detection, and instance/semantic segmentation. Results show that CrossFormer outperforms existing vision transformers in these tasks, especially in dense prediction tasks (i.e., object detection and instance/semantic segmentation). We think it is because image classification only pays attention to one object and large-scale features, while dense prediction tasks rely more on cross-scale attention.

Prerequisites

  1. Libraries (Python3.6-based)
pip3 install numpy scipy Pillow pyyaml torch==1.7.0 torchvision==0.8.1 timm==0.3.2
  1. Dataset: ImageNet

  2. Requirements for detection/instance segmentation and semantic segmentation are listed here: detection/README.md or segmentation/README.md

Getting Started

Training

## There should be two directories under the path_to_imagenet: train and validation

## CrossFormer-T
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-S
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/small_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-B
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/base_patch4_group7_224.yaml 
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-L
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/large_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

Testing

## Take CrossFormer-T as an example
python -u -m torch.distributed.launch --nproc_per_node 1 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --eval --resume path_to_crossformer-t.pth

Training scripts for objection detection: detection/README.md.

Training scripts for semantic segmentation: segmentation/README.md.

Results

Image Classification

Models trained on ImageNet-1K and evaluated on its validation set. The input image size is 224 x 224.

Architectures Params FLOPs Accuracy Models
ResNet-50 25.6M 4.1G 76.2% -
RegNetY-8G 39.0M 8.0G 81.7% -
CrossFormer-T 27.8M 2.9G 81.5% Google Drive/BaiduCloud, key: nkju
CrossFormer-S 30.7M 4.9G 82.5% Google Drive/BaiduCloud, key: fgqj
CrossFormer-B 52.0M 9.2G 83.4% Google Drive/BaiduCloud, key: 7md9
CrossFormer-L 92.0M 16.1G 84.0% TBD

More results compared with other vision transformers can be seen in the paper.

Objection Detection & Instance Segmentation

Models trained on COCO 2017. Backbones are initialized with weights pre-trained on ImageNet-1K.

Backbone Detection Head Learning Schedule Params FLOPs box AP mask AP
ResNet-101 RetinaNet 1x 56.7M 315.0G 38.5 -
CrossFormer-S RetinaNet 1x 40.8M 282.0G 44.4 -
CrossFormer-B RetinaNet 1x 62.1M 389.0G 46.2 -
ResNet-101 Mask-RCNN 1x 63.2M 336.0G 40.4 36.4
CrossFormer-S Mask-RCNN 1x 50.2M 301.0G 45.4 41.4
CrossFormer-B Mask-RCNN 1x 71.5M 407.9G 47.2 42.7

More results and pretrained models for objection detection: detection/README.md.

Semantic Segmentation

Models trained on ADE20K. Backbones are initialized with weights pre-trained on ImageNet-1K.

Backbone Segmentation Head Iterations Params FLOPs IOU MS IOU
CrossFormer-S FPN 80K 34.3M 209.8G 46.4 -
CrossFormer-B FPN 80K 55.6M 320.1G 48.0 -
CrossFormer-L FPN 80K 95.4M 482.7G 49.1 -
ResNet-101 UPerNet 160K 86.0M 1029.G 44.9 -
CrossFormer-S UPerNet 160K 62.3M 979.5G 47.6 48.4
CrossFormer-B UPerNet 160K 83.6M 1089.7G 49.7 50.6
CrossFormer-L UPerNet 160K 125.5M 1257.8G 50.4 51.4

MS IOU means IOU with multi-scale testing.

More results and pretrained models for semantic segmentation: segmentation/README.md.

Citing Us

@article{crossformer2021,
  title     = {CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention},
  author    = {Wenxiao Wang and Lu Yao and Long Chen and Deng Cai and Xiaofei He and Wei Liu},
  journal   = {CoRR},
  volume    = {abs/2108.00154},
  year      = {2021},
}

Acknowledgement

Part of the code of this repository refers to Swin Transformer.

Owner
cheerss
cheerss
End-to-end beat and downbeat tracking in the time domain.

WaveBeat End-to-end beat and downbeat tracking in the time domain. | Paper | Code | Video | Slides | Setup First clone the repo. git clone https://git

Christian J. Steinmetz 60 Dec 24, 2022
Continuous Time LiDAR odometry

CT-ICP: Elastic SLAM for LiDAR sensors This repository implements the SLAM CT-ICP (see our article), a lightweight, precise and versatile pure LiDAR o

385 Dec 29, 2022
In this project we use both Resnet and Self-attention layer for cat, dog and flower classification.

cdf_att_classification classes = {0: 'cat', 1: 'dog', 2: 'flower'} In this project we use both Resnet and Self-attention layer for cdf-Classification.

3 Nov 23, 2022
Weighted QMIX: Expanding Monotonic Value Function Factorisation

This repo contains the cleaned-up code that was used in "Weighted QMIX: Expanding Monotonic Value Function Factorisation"

whirl 82 Dec 29, 2022
Homepage of paper: Paint Transformer: Feed Forward Neural Painting with Stroke Prediction, ICCV 2021.

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [Paper] [PaddlePaddle Implementation] Homepage of paper: Paint Transformer: Fee

442 Dec 16, 2022
Facilitating Database Tuning with Hyper-ParameterOptimization: A Comprehensive Experimental Evaluation

A Comprehensive Experimental Evaluation for Database Configuration Tuning This is the source code to the paper "Facilitating Database Tuning with Hype

DAIR Lab 9 Oct 29, 2022
Pytorch implementation for ACMMM2021 paper "I2V-GAN: Unpaired Infrared-to-Visible Video Translation".

I2V-GAN This repository is the official Pytorch implementation for ACMMM2021 paper "I2V-GAN: Unpaired Infrared-to-Visible Video Translation". Traffic

69 Dec 31, 2022
Simple node deletion tool for onnx.

snd4onnx Simple node deletion tool for onnx. I only test very miscellaneous and limited patterns as a hobby. There are probably a large number of bugs

Katsuya Hyodo 6 May 15, 2022
Intel® Nervana™ reference deep learning framework committed to best performance on all hardware

DISCONTINUATION OF PROJECT. This project will no longer be maintained by Intel. Intel will not provide or guarantee development of or support for this

Nervana 3.9k Dec 20, 2022
EMNLP 2021 paper Models and Datasets for Cross-Lingual Summarisation.

This repository contains data and code for our EMNLP 2021 paper Models and Datasets for Cross-Lingual Summarisation. Please contact me at

9 Oct 28, 2022
Deep learning toolbox based on PyTorch for hyperspectral data classification.

Deep learning toolbox based on PyTorch for hyperspectral data classification.

Nicolas 304 Dec 28, 2022
ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees

ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees This repository is the official implementation of the empirica

Kuan-Lin (Jason) Chen 2 Oct 02, 2022
Finetune SSL models for MOS prediction

Finetune SSL models for MOS prediction This is code for our paper under review for ICASSP 2022: "Generalization Ability of MOS Prediction Networks" Er

Yamagishi and Echizen Laboratories, National Institute of Informatics 32 Nov 22, 2022
This repo tries to recognize faces in the dataset you created

YÜZ TANIMA SİSTEMİ Bu repo oluşturacağınız yüz verisetlerini tanımaya çalışan ma

Mehdi KOŞACA 2 Dec 30, 2021
Simple Pixelbot for Diablo 2 Resurrected written in python and opencv.

Simple Pixelbot for Diablo 2 Resurrected written in python and opencv. Obviously only use it in offline mode as it is against the TOS of Blizzard to use it in online mode!

468 Jan 03, 2023
3D-Transformer: Molecular Representation with Transformer in 3D Space

3D-Transformer: Molecular Representation with Transformer in 3D Space

55 Dec 19, 2022
Source code of CIKM2021 Long Paper "PSSL: Self-supervised Learning for Personalized Search with Contrastive Sampling".

PSSL Source code of CIKM2021 Long Paper "PSSL: Self-supervised Learning for Personalized Search with Contrastive Sampling". It consists of the pre-tra

2 Dec 21, 2021
Codes for CyGen, the novel generative modeling framework proposed in "On the Generative Utility of Cyclic Conditionals" (NeurIPS-21)

On the Generative Utility of Cyclic Conditionals This repository is the official implementation of "On the Generative Utility of Cyclic Conditionals"

Chang Liu 44 Nov 16, 2022
PyTorch implementation of some learning rate schedulers for deep learning researcher.

pytorch-lr-scheduler PyTorch implementation of some learning rate schedulers for deep learning researcher. Usage WarmupReduceLROnPlateauScheduler Visu

Soohwan Kim 59 Dec 08, 2022
Image to Image translation, image generataton, few shot learning

Semi-supervised Learning for Few-shot Image-to-Image Translation [paper] Abstract: In the last few years, unpaired image-to-image translation has witn

yaxingwang 49 Nov 18, 2022