
Overview

Self-Supervised Learning with Vision Transformers

By Zhenda Xie*, Yutong Lin*, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao and Han Hu

This repo is the official implementation of "Self-Supervised Learning with Swin Transformers".

An important feature of this codebase is the inclusion of Swin Transformer as one of the backbones, so that we can evaluate the transfer performance of the learned representations on down-stream tasks such as object detection and semantic segmentation. This evaluation is usually not included in previous works due to the use of ViT/DeiT, which has not been well tamed for down-stream tasks.

It currently includes code and models for the following tasks:

Self-Supervised Learning and Linear Evaluation: Included in this repo. See get_started.md for a quick start.

Transferring Performance on Object Detection/Instance Segmentation: See Swin Transformer for Object Detection.

Transferring Performance on Semantic Segmentation: See Swin Transformer for Semantic Segmentation.

Highlights

  • Includes down-stream evaluation: the first work to evaluate the transfer performance of Transformer-based SSL on down-stream tasks
  • Small tricks: significantly fewer tricks than previous works, such as MoCo v3 and DINO
  • High accuracy on ImageNet-1K linear evaluation: 72.8 vs 72.5 (MoCo v3) vs 72.5 (DINO) using DeiT-S/16 and 300 epoch pre-training

Updates

05/13/2021

  1. Self-Supervised models with DeiT-Small on ImageNet-1K (MoBY-DeiT-Small-300Ep-Pretrained, MoBY-DeiT-Small-300Ep-Linear) are provided.
  2. The supporting code and config for self-supervised learning with DeiT-Small are provided.

05/11/2021

Initial Commits:

  1. Self-Supervised Pre-training models on ImageNet-1K (MoBY-Swin-T-300Ep-Pretrained, MoBY-Swin-T-300Ep-Linear) are provided.
  2. Supporting code and models for self-supervised pre-training, ImageNet-1K linear evaluation, COCO object detection, and ADE20K semantic segmentation are provided.

Introduction

MoBY: a self-supervised learning approach combining MoCo v2 and BYOL

MoBY (the name stands for MoCo v2 with BYOL), initially described in this arXiv paper, is a combination of two popular self-supervised learning approaches: MoCo v2 and BYOL. It inherits the momentum design, the key queue, and the contrastive loss from MoCo v2, and the asymmetric encoders, asymmetric data augmentations, and momentum scheduler from BYOL.
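
For intuition, the two inherited ingredients look roughly like the sketch below: a MoCo v2-style contrastive loss against a key queue, and a BYOL-style momentum (EMA) update of the target encoder. This is a minimal PyTorch-style illustration, not the repo's actual code; the function names, the `tau=0.2` temperature, and the `(K, C)` queue layout are assumptions.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(q, k, queue, tau=0.2):
    # q: online-branch outputs, k: target-branch outputs,
    # queue: (K, C) memory of past keys, as in the MoCo v2 key queue.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    l_pos = torch.einsum('nc,nc->n', q, k).unsqueeze(-1)   # (N, 1) positive logits
    l_neg = torch.einsum('nc,kc->nk', q, queue)            # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)


@torch.no_grad()
def momentum_update(online_encoder, target_encoder, m):
    # EMA update of the target (key) encoder; m follows a momentum schedule as in BYOL.
    for p_o, p_t in zip(online_encoder.parameters(), target_encoder.parameters()):
        p_t.data.mul_(m).add_(p_o.data, alpha=1.0 - m)
```

The asymmetry comes from the online branch carrying an extra prediction head that the target branch does not have, and from the two branches seeing differently augmented views.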

MoBY achieves reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.3% top-1 accuracy using DeiT-S and Swin-T, respectively, with 300-epoch training. The performance is on par with recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, but is achieved with much lighter tricks.

[Figure: MoBY teaser]

Swin Transformer as a backbone

Swin Transformer (the name Swin stands for Shifted window), initially described in this arXiv paper, capably serves as a general-purpose backbone for computer vision. It achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.
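
The core idea is that self-attention is computed within non-overlapping local windows, and alternating blocks cyclically shift the feature map so that attention windows straddle the previous window boundaries. The snippet below is a rough illustration of that partitioning, not the repo's actual code; the function names and the `(B, H, W, C)` layout are assumptions.

```python
import torch


def window_partition(x, window_size):
    # Split a (B, H, W, C) feature map into non-overlapping windows of shape
    # (window_size, window_size), flattened along the batch dimension.
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)


def cyclic_shift(x, shift_size):
    # Roll the feature map before partitioning (and roll back afterwards),
    # so that the next block's windows cross the previous window boundaries.
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
```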

We include Swin Transformer as one of the backbones to evaluate the transfer performance on down-stream tasks such as object detection. This differentiates this codebase from other approaches studying SSL on Transformer architectures.

ImageNet-1K linear evaluation

| Method | Architecture | Epochs | Params | FLOPs | img/s | Top-1 Accuracy | Pre-trained Checkpoint | Linear Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised | Swin-T | 300 | 28M | 4.5G | 755.2 | 81.2 | Here | - |
| MoBY | Swin-T | 100 | 28M | 4.5G | 755.2 | 70.9 | TBA | TBA |
| MoBY¹ | Swin-T | 100 | 28M | 4.5G | 755.2 | 72.0 | TBA | TBA |
| MoBY | DeiT-S | 300 | 22M | 4.6G | 940.4 | 72.8 | GoogleDrive/GitHub/Baidu | GoogleDrive/GitHub/Baidu |
| MoBY | Swin-T | 300 | 28M | 4.5G | 755.2 | 75.3 | GoogleDrive/GitHub/Baidu | GoogleDrive/GitHub/Baidu |
  • ¹ denotes the result of MoBY with a trick adopted from MoCo v3, which replaces the LayerNorm layers before the MLP blocks with BatchNorm (see the sketch after this list).

  • Access code for baidu is moby.
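
A minimal sketch of what such a replacement could look like is shown below. The `BatchNormTokens` wrapper and the `norm2` attribute name are assumptions for illustration; they are not necessarily how this repo implements the trick.

```python
import torch.nn as nn


class BatchNormTokens(nn.Module):
    # BatchNorm1d over the channel dimension of (B, L, C) token sequences,
    # used as a drop-in replacement for a LayerNorm placed before an MLP block.
    def __init__(self, dim):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):              # x: (B, L, C)
        return self.bn(x.transpose(1, 2)).transpose(1, 2)


# Hypothetical usage: in timm-style transformer blocks the pre-MLP norm is
# commonly called `norm2`; swap it for the BatchNorm wrapper.
# for block in model.blocks:
#     block.norm2 = BatchNormTokens(embed_dim)
```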

Transferring to Downstream Tasks

COCO Object Detection (2017 val)

| Backbone | Method | Model | Schd. | box mAP | mask mAP | Params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | Mask R-CNN | Sup. | 1x | 43.7 | 39.8 | 48M | 267G |
| Swin-T | Mask R-CNN | MoBY | 1x | 43.6 | 39.6 | 48M | 267G |
| Swin-T | Mask R-CNN | Sup. | 3x | 46.0 | 41.6 | 48M | 267G |
| Swin-T | Mask R-CNN | MoBY | 3x | 46.0 | 41.7 | 48M | 267G |
| Swin-T | Cascade Mask R-CNN | Sup. | 1x | 48.1 | 41.7 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | MoBY | 1x | 48.1 | 41.5 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | Sup. | 3x | 50.4 | 43.7 | 86M | 745G |
| Swin-T | Cascade Mask R-CNN | MoBY | 3x | 50.2 | 43.5 | 86M | 745G |

ADE20K Semantic Segmentation (val)

| Backbone | Method | Model | Crop Size | Schd. | mIoU | mIoU (ms+flip) | Params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | UPerNet | Sup. | 512x512 | 160K | 44.51 | 45.81 | 60M | 945G |
| Swin-T | UPerNet | MoBY | 512x512 | 160K | 44.06 | 45.58 | 60M | 945G |

Citing MoBY and Swin

MoBY

@article{xie2021moby,
  title={Self-Supervised Learning with Swin Transformers}, 
  author={Zhenda Xie and Yutong Lin and Zhuliang Yao and Zheng Zhang and Qi Dai and Yue Cao and Han Hu},
  journal={arXiv preprint arXiv:2105.04553},
  year={2021}
}

Swin Transformer

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Getting Started

See get_started.md for a quick start on self-supervised pre-training and linear evaluation.
