MPViT:Multi-Path Vision Transformer for Dense Prediction

Overview

MPViT : Multi-Path Vision Transformer for Dense Prediction

This repository inlcudes official implementations and model weights for MPViT.

[Arxiv] [BibTeX]

MPViT : Multi-Path Vision Transformer for Dense Prediction
🏛️ ️️ 🏫 Youngwan Lee, 🏛️ ️️Jonghee Kim, 🏫 Jeff Willette, 🏫 Sung Ju Hwang
ETRI 🏛️ ️, KAIST 🏫

Abstract

We explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the Transformer encoders via multiple paths and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to the diverse and multi-scale feature representations, our MPViTs scaling from Tiny(5M) to Base(73M) consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks.

Main results on ImageNet-1K

🚀 These all models are trained on ImageNet-1K with the same training recipe as DeiT and CoaT.

model resolution [email protected] #params FLOPs weight
MPViT-T 224x224 78.2 5.8M 1.6G weight
MPViT-XS 224x224 80.9 10.5M 2.9G weight
MPViT-S 224x224 83.0 22.8M 4.7G weight
MPViT-B 224x224 84.3 74.8M 16.4G weight

Main results on COCO object detection

🚀 All model are trained using ImageNet-1K pretrained weights.

☀️ MS denotes the same multi-scale training augmentation as in Swin-Transformer which follows the MS augmentation as in DETR and Sparse-RCNN. Therefore, we also follows the official implementation of DETR and Sparse-RCNN which are also based on Detectron2.

Please refer to detectron2/ for the details.

Backbone Method lr Schd box mAP mask mAP #params FLOPS weight
MPViT-T RetinaNet 1x 41.8 - 17M 196G model | metrics
MPViT-XS RetinaNet 1x 43.8 - 20M 211G model | metrics
MPViT-S RetinaNet 1x 45.7 - 32M 248G model | metrics
MPViT-B RetinaNet 1x 47.0 - 85M 482G model | metrics
MPViT-T RetinaNet MS+3x 44.4 - 17M 196G model | metrics
MPViT-XS RetinaNet MS+3x 46.1 - 20M 211G model | metrics
MPViT-S RetinaNet MS+3x 47.6 - 32M 248G model | metrics
MPViT-B RetinaNet MS+3x 48.3 - 85M 482G model | metrics
MPViT-T Mask R-CNN 1x 42.2 39.0 28M 216G model | metrics
MPViT-XS Mask R-CNN 1x 44.2 40.4 30M 231G model | metrics
MPViT-S Mask R-CNN 1x 46.4 42.4 43M 268G model | metrics
MPViT-B Mask R-CNN 1x 48.2 43.5 95M 503G model | metrics
MPViT-T Mask R-CNN MS+3x 44.8 41.0 28M 216G model | metrics
MPViT-XS Mask R-CNN MS+3x 46.6 42.3 30M 231G model | metrics
MPViT-S Mask R-CNN MS+3x 48.4 43.9 43M 268G model | metrics
MPViT-B Mask R-CNN MS+3x 49.5 44.5 95M 503G model | metrics

Deformable-DETR

All models are trained using the same training recipe.

Please refer to deformable_detr/ for the details.

backbone box mAP epochs link
ResNet-50 44.5 50 -
CoaT-lite S 47.0 50 link
CoaT-S 48.4 50 link
MPViT-S 49.0 50 link

Main results on ADE20K Semantic segmentation

All model are trained using ImageNet-1K pretrained weight.

Please refer to semantic_segmentation/ for the details.

Backbone Method Crop Size Lr Schd mIoU #params FLOPs weight
MPViT-S UperNet 512x512 160K 48.3 52M 943G weight
MPViT-B UperNet 512x512 160K 50.3 105M 1185G weight

Getting Started

We use pytorch==1.7.0 torchvision==0.8.1 cuda==10.1 libraries on NVIDIA V100 GPUs. If you use different versions of cuda, you may obtain different accuracies, but the differences are negligible.

Acknowledgement

This repository is built using the Timm library, DeiT, CoaT, Detectron2, mmsegmentation repositories.

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network and No. 2014-3-00123, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis).

License

Please refer to MPViT LSA.

Citing MPViT

@article{lee2021mpvit,
      title={MPViT: Multi-Path Vision Transformer for Dense Prediction}, 
      author={Youngwan Lee and Jonghee Kim and Jeff Willette and Sung Ju Hwang},
      year={2021},
      journal={arXiv preprint arXiv:2112.11010}
}
Owner
Youngwan Lee
Researcher at ETRI & Ph.D student in Graduate school of AI at KAIST.
Youngwan Lee
data/code repository of "C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer"

C2F-FWN data/code repository of "C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer" (https://arxiv.org/abs/

EKILI 46 Dec 14, 2022
Benchmark for Answering Existential First Order Queries with Single Free Variable

EFO-1-QA Benchmark for First Order Query Estimation on Knowledge Graphs This repository contains an entire pipeline for the EFO-1-QA benchmark. EFO-1

HKUST-KnowComp 14 Oct 24, 2022
Robotics environments

Robotics environments Details and documentation on these robotics environments are available in OpenAI's blog post and the accompanying technical repo

Farama Foundation 121 Dec 28, 2022
EM-POSE 3D Human Pose Estimation from Sparse Electromagnetic Trackers.

EM-POSE: 3D Human Pose Estimation from Sparse Electromagnetic Trackers This repository contains the code to our paper published at ICCV 2021. For ques

Facebook Research 62 Dec 14, 2022
Official implementation of SynthTIGER (Synthetic Text Image GEneratoR) ICDAR 2021

🐯 SynthTIGER: Synthetic Text Image GEneratoR Official implementation of SynthTIGER | Paper | Datasets Moonbin Yim1, Yoonsik Kim1, Han-cheol Cho1, Sun

Clova AI Research 256 Jan 05, 2023
UV matrix decompostion using movielens dataset

UV-matrix-decompostion-with-kfold UV matrix decompostion using movielens dataset upload the 'ratings.dat' file install the following python libraries

2 Oct 18, 2022
Pytorch Implementation of Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

NANSY: Unofficial Pytorch Implementation of Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations Notice Papers' D

Dongho Choi 최동호 104 Dec 23, 2022
Seg-Torch for Image Segmentation with Torch

Seg-Torch for Image Segmentation with Torch This work was sparked by my personal research on simple segmentation methods based on deep learning. It is

Eren Gölge 37 Dec 12, 2022
This repository contains part of the code used to make the images visible in the article "How does an AI Imagine the Universe?" published on Towards Data Science.

Generative Adversarial Network - Generating Universe This repository contains part of the code used to make the images visible in the article "How doe

Davide Coccomini 9 Dec 18, 2022
MIMO-UNet - Official Pytorch Implementation

MIMO-UNet - Official Pytorch Implementation This repository provides the official PyTorch implementation of the following paper: Rethinking Coarse-to-

Sungjin Cho 248 Jan 02, 2023
Learn about quantum computing and algorithm on quantum computing

quantum_computing this repo contains everything i learn about quantum computing and algorithm on quantum computing what is aquantum computing quantum

arfy slowy 8 Dec 25, 2022
Multi-Scale Aligned Distillation for Low-Resolution Detection (CVPR2021)

MSAD Multi-Scale Aligned Distillation for Low-Resolution Detection Lu Qi*, Jason Kuen*, Jiuxiang Gu, Zhe Lin, Yi Wang, Yukang Chen, Yanwei Li, Jiaya J

Jia Research Lab 115 Dec 23, 2022
SafePicking: Learning Safe Object Extraction via Object-Level Mapping, ICRA 2022

SafePicking Learning Safe Object Extraction via Object-Level Mapping Kentaro Wad

Kentaro Wada 49 Oct 24, 2022
Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation)

Official PyTorch Implementation for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'2021, Oral Presentation) HOTR: End-to-

Kakao Brain 114 Nov 28, 2022
A fast Evolution Strategy implementation in Python

Evostra: Evolution Strategy for Python Evolution Strategy (ES) is an optimization technique based on ideas of adaptation and evolution. You can learn

Mika 251 Dec 08, 2022
Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

CNTK Chat Windows build status Linux build status The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes

Microsoft 17.3k Dec 29, 2022
Generative Adversarial Text to Image Synthesis

Text To Image Synthesis This is a tensorflow implementation of synthesizing images. The images are synthesized using the GAN-CLS Algorithm from the pa

Hao 575 Jan 08, 2023
joint detection and semantic segmentation, based on ultralytics/yolov5,

Multi YOLO V5——Detection and Semantic Segmentation Overeview This is my undergraduate graduation project which based on ultralytics YOLO V5 tag v5.0.

477 Jan 06, 2023
Expressive Power of Invariant and Equivaraint Graph Neural Networks (ICLR 2021)

Expressive Power of Invariant and Equivaraint Graph Neural Networks In this repository, we show how to use powerful GNN (2-FGNN) to solve a graph alig

Marc Lelarge 36 Dec 12, 2022
Differentiable scientific computing library

xitorch: differentiable scientific computing library xitorch is a PyTorch-based library of differentiable functions and functionals that can be widely

98 Dec 26, 2022