ConTNet
Introduction
ConTNet (Convolution-Transformer Network) is proposed mainly in response to two issues: (1) ConvNets lack a large receptive field, which limits their performance on downstream tasks. (2) Transformer-based models are not robust enough and require special training settings or hundreds of millions of images as a pretraining dataset, which limits their adoption. ConTNet stacks convolution and transformer layers alternately. It is robust and can be optimized like ResNet, unlike recently proposed transformer-based models (e.g., ViT, DeiT), which are sensitive to hyper-parameters and need many tricks when trained from scratch on a midsize dataset (e.g., ImageNet).
Main Results on ImageNet
name | resolution | acc@1 | #params(M) | FLOPs(G) | model |
---|---|---|---|---|---|
Res-18 | 224x224 | 71.5 | 11.7 | 1.8 | |
ConT-S | 224x224 | 74.9 | 10.1 | 1.5 | |
Res-50 | 224x224 | 77.1 | 25.6 | 4.0 | |
ConT-M | 224x224 | 77.6 | 19.2 | 3.1 | |
Res-101 | 224x224 | 78.2 | 44.5 | 7.6 | |
ConT-B | 224x224 | 77.9 | 39.6 | 6.4 | |
DeiT-Ti* | 224x224 | 72.2 | 5.7 | 1.3 | |
ConT-Ti* | 224x224 | 74.9 | 5.8 | 0.8 | |
Res-18* | 224x224 | 73.2 | 11.7 | 1.8 | |
ConT-S* | 224x224 | 76.5 | 10.1 | 1.5 | |
Res-50* | 224x224 | 78.6 | 25.6 | 4.0 | |
DeiT-S* | 224x224 | 79.8 | 22.1 | 4.6 | |
ConT-M* | 224x224 | 80.2 | 19.2 | 3.1 | |
Res-101* | 224x224 | 80.0 | 44.5 | 7.6 | |
DeiT-B* | 224x224 | 81.8 | 86.6 | 17.6 | |
ConT-B* | 224x224 | 81.8 | 39.6 | 6.4 | |
Note: * indicates training with strong augmentations.
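To make the trade-offs in the table easier to read, the short sketch below computes the accuracy gain and the relative parameter and FLOPs cost of a ConTNet variant against a baseline. The numbers are copied from the table above; the `RESULTS` dictionary, the `compare` helper, and the choice of which models to pair are illustrative assumptions, not part of the released code.

```python
# Rows copied from the ImageNet table above:
# name -> (top-1 accuracy in %, parameters in M, FLOPs in G).
RESULTS = {
    "Res-18": (71.5, 11.7, 1.8),
    "ConT-S": (74.9, 10.1, 1.5),
    "Res-50": (77.1, 25.6, 4.0),
    "ConT-M": (77.6, 19.2, 3.1),
    "Res-101": (78.2, 44.5, 7.6),
    "ConT-B": (77.9, 39.6, 6.4),
    "Res-50*": (78.6, 25.6, 4.0),
    "ConT-M*": (80.2, 19.2, 3.1),
    "DeiT-B*": (81.8, 86.6, 17.6),
    "ConT-B*": (81.8, 39.6, 6.4),
}


def compare(model, baseline):
    """Hypothetical helper: accuracy gain and cost ratios of `model` vs `baseline`."""
    acc_m, params_m, flops_m = RESULTS[model]
    acc_b, params_b, flops_b = RESULTS[baseline]
    return {
        "acc_gain": round(acc_m - acc_b, 1),           # percentage points
        "param_ratio": round(params_m / params_b, 2),  # < 1 means fewer parameters
        "flops_ratio": round(flops_m / flops_b, 2),    # < 1 means fewer FLOPs
    }


if __name__ == "__main__":
    # The pairings below are chosen for illustration only.
    for model, baseline in [("ConT-S", "Res-18"),
                            ("ConT-M*", "Res-50*"),
                            ("ConT-B*", "DeiT-B*")]:
        print(model, "vs", baseline, compare(model, baseline))
```

For example, `compare("ConT-S", "Res-18")` gives a gain of 3.4 points at about 0.86x the parameters and 0.83x the FLOPs, while `compare("ConT-B", "Res-101")` shows that without strong augmentations ConT-B trails Res-101 by 0.3 points.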
Main Results on Downstream Tasks
Object detection results on COCO.
method | backbone | #params(M) | FLOPs(G) | AP | APs | APm | APl |
---|---|---|---|---|---|---|---|
RetinaNet | Res-50 | 32.0 | 235.6 | 36.5 | 20.4 | 40.3 | 48.1 |
RetinaNet | ConTNet-M | 27.0 | 217.2 | 37.9 | 23.0 | 40.6 | 50.4 |
FCOS | Res-50 | 32.2 | 242.9 | 38.7 | 22.9 | 42.5 | 50.1 |
FCOS | ConTNet-M | 27.2 | 228.4 | 40.8 | 25.1 | 44.6 | 53.0 |
Faster R-CNN | Res-50 | 41.5 | 241.0 | 37.4 | 21.2 | 41.0 | 48.1 |
Faster R-CNN | ConTNet-M | 36.6 | 225.6 | 40.0 | 25.4 | 43.0 | 52.0 |
Instance segmentation results on Cityscapes based on Mask R-CNN.
backbone | APbb | APsbb | APmbb | APlbb | APmk | APsmk | APmmk | APlmk |
---|---|---|---|---|---|---|---|---|
Res-50 | 38.2 | 21.9 | 40.9 | 49.5 | 34.7 | 18.3 | 37.4 | 47.2 |
ConT-M | 40.5 | 25.1 | 44.4 | 52.7 | 38.1 | 20.9 | 41.0 | 50.3 |
Semantic segmentation results on Cityscapes.
model | mIOU |
---|---|
PSP-Res50 | 77.12 |
PSP-ConTM | 78.28 |
Bib Citing
@article{yan2021contnet,
title={ConTNet: Why not use convolution and transformer at the same time?},
author={Haotian Yan and Zhe Li and Weijian Li and Changhu Wang and Ming Wu and Chuang Zhang},
year={2021},
journal={arXiv preprint arXiv:2104.13497}
}