Convolution-Free Backbone: Pyramid Vision Transformer Improves Accuracy on Object Detection, Segmentation, and Other Tasks (with Source Code)
2022-07-05 20:10:00 【Computer Vision Research Institute】
Computer Vision Research Institute Column
Author: Edison_G
Embedding a pyramid structure into the Transformer architecture generates multi-scale features, which are ultimately used for dense prediction tasks.
Paper: https://arxiv.org/pdf/2102.12122.pdf
Code: https://github.com/whai362/PVT
Background
Transformer, built on self-attention, triggered a revolution in natural language processing and has recently inspired Transformer-based architecture designs that achieve competitive results on many computer vision tasks.
Here are some Transformer-based object detection techniques we shared earlier:
Link: ResNet super-variant: JD.com AI's newly open-sourced computer vision module! (with source code)
Link: End-to-end object detection and tracking with Transformer (with source code)
Link: YOLOS: Rethinking Transformer through object detection (with source code)
In the work shared today, the researchers design a novel Transformer module as a backbone for dense prediction tasks. It makes an innovative exploration of Transformer architecture design by fusing the feature pyramid structure with the Transformer, so that the network outputs multi-scale features and integrates more conveniently with downstream tasks.
Preface
Although convolutional neural networks (CNNs) have achieved great success in computer vision, the work shared today explores a simpler, convolution-free backbone network that can serve many dense prediction tasks:
Object detection
Semantic segmentation
Instance segmentation
Unlike the recently proposed Vision Transformer (ViT), which was designed specifically for image classification, the researchers introduce the Pyramid Vision Transformer (PVT), which overcomes the difficulty of porting Transformer to various dense prediction tasks. Compared with the current state of the art, PVT offers several advantages:
Unlike ViT, which typically produces low-resolution outputs and incurs high computational and memory costs, PVT can be trained on dense partitions of the image to obtain the high output resolution that is crucial for dense prediction, and it uses a progressive shrinking pyramid to reduce the computation on large feature maps;
PVT inherits the advantages of both CNN and Transformer, making it a unified, convolution-free backbone for various vision tasks that can directly replace CNN backbones;
Extensive experiments show that PVT improves the performance of many downstream tasks, including object detection and instance and semantic segmentation.
For example, with an equal number of parameters, PVT+RetinaNet achieves 40.4 AP on the COCO dataset, exceeding ResNet50+RetinaNet (36.3 AP) by 4.1 absolute AP (see the figure below). The researchers hope PVT can serve as an alternative and useful backbone for pixel-level prediction and facilitate future research.
Basic Review
CNN Backbones
CNNs are the workhorse of deep neural networks for visual recognition. The standard CNN was originally introduced in "Gradient-based learning applied to document recognition" to recognize handwritten digits. The model contains convolution kernels with specific receptive fields that capture useful visual context. To provide translation equivariance, the weights of the convolution kernels are shared across the whole image. Recently, with the rapid growth of computing resources (e.g., GPUs), it has become possible to train stacked convolutional blocks on large-scale image classification datasets (e.g., ImageNet). For example, GoogLeNet demonstrated that a convolution operator with multiple kernel paths can achieve very competitive performance.
The effectiveness of multi-path convolutional blocks was further verified by the Inception series, ResNeXt, DPN, MixNet, and SKNet. In addition, ResNet introduced skip connections into convolutional blocks, making it possible to create and train very deep networks that achieve impressive results across computer vision. DenseNet introduced a densely connected topology that connects each convolutional block to all preceding blocks. More recent advances can be found in recent survey papers.
New Framework
The framework embeds the pyramid structure into the Transformer to generate multi-scale features, which are ultimately used for dense prediction tasks. The figure above shows the proposed PVT architecture. Like a CNN backbone, PVT contains four stages that generate features at different scales, and all stages share a similar structure: Patch Embedding + Transformer Encoder.
In the first stage, given an input image of size H×W×3, the procedure is as follows:
First, divide it into HW/4² patches, each of size 4×4×3;
Then, feed the flattened patches to a linear projection, obtaining embedded patches of size (HW/4²)×C1;
Finally, feed the embedded patches, together with positional embeddings, to a Transformer encoder, whose output is reshaped to (H/4)×(W/4)×C1.
In the same way, the output of the previous stage serves as the input of the next, yielding features F2, F3, and F4. With the feature pyramid F1, F2, F3, F4, the proposed scheme can be easily integrated with most downstream tasks (such as image classification, object detection, and semantic segmentation).
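To make the stage structure concrete, here is a minimal PyTorch sketch of one PVT-style patch-embedding step. This is a sketch under common implementation conventions, not the official code: the class name `PatchEmbed` and the use of a strided Conv2d to split and project patches are our assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a feature map into P x P patches and project each patch to an
    embed_dim-dimensional vector (a minimal sketch, not the official code)."""
    def __init__(self, patch_size=4, in_chans=3, embed_dim=64):
        super().__init__()
        # A Conv2d with kernel_size == stride == patch_size is mathematically
        # equivalent to non-overlapping patch splitting followed by a shared
        # linear projection of each flattened patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                  # (B, C1, H/4, W/4)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)  # (B, HW/4^2, C1): a token sequence
        return self.norm(x), (H, W)       # keep (H, W) to reshape back later

# Stage 1 on a 224x224 RGB image: 224*224/4^2 = 3136 tokens of dimension 64
tokens, (H, W) = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape, (H, W))  # torch.Size([1, 3136, 64]) (56, 56)
```

The returned (H, W) pair is what lets each stage reshape its token sequence back into an (H/4)×(W/4)×C1 feature map for the next stage.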
Feature Pyramid for Transformer
Unlike CNNs, which obtain multi-scale features with strided convolutions, PVT controls the feature scale through patch embedding, following a progressive shrinking strategy.
Assume the patch size of the i-th stage is Pi. At the beginning of each stage, the input feature is evenly split into Hi-1Wi-1/Pi² patches; each patch is then flattened and projected to a Ci-dimensional embedding. After the linear projection, the embedded patches can be viewed as a feature of size (Hi-1/Pi) × (Wi-1/Pi) × Ci. In this way, the feature scale of each stage can be flexibly adjusted, making it possible to build a feature pyramid for Transformer (a worked example follows).
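As a quick worked example of the shrinking rule, assuming patch sizes P1=4 and P2=P3=P4=2 (the settings reported in the PVT paper), a 224×224 input shrinks as follows:

```python
# Per-stage spatial resolution under the progressive shrinking strategy,
# assuming patch sizes P1=4 and P2=P3=P4=2 (the PVT paper's settings).
H = W = 224
for i, P in enumerate([4, 2, 2, 2], start=1):
    H, W = H // P, W // P
    print(f"stage {i}: {H} x {W}")
# stage 1: 56 x 56, stage 2: 28 x 28, stage 3: 14 x 14, stage 4: 7 x 7
```

These strides of 4, 8, 16, and 32 match the familiar CNN feature-pyramid levels, which is what lets PVT plug into detectors and segmenters that expect ResNet-style multi-scale features.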
Transformer Encoder
For stage i of the Transformer encoder, there are Li encoder layers, each consisting of an attention layer and an MLP. Because the method must process high-resolution features, a spatial-reduction attention (SRA) layer replaces the traditional multi-head attention (MHA).
Like MHA, SRA receives a query Q, key K, and value V as input and outputs refined features. The difference is that SRA reduces the spatial scale of K and V before computing attention, as shown in the figure below (a code sketch follows).
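Below is a minimal sketch of SRA, assuming the spatial reduction is implemented with a strided Conv2d whose kernel and stride both equal the reduction ratio R, a common implementation pattern; the parameter names and shapes here are illustrative, not the official code.

```python
import torch
import torch.nn as nn

class SRA(nn.Module):
    """Spatial-reduction attention (sketch): shrink K and V by a factor R
    per spatial dimension before standard multi-head attention."""
    def __init__(self, dim=64, num_heads=1, sr_ratio=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        # Reduce the (H, W) grid to (H/R, W/R) before building K and V.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                                   # N == H * W
        q = self.q(x).reshape(B, N, self.num_heads,
                              self.head_dim).transpose(1, 2)

        # Spatially reduce the tokens that K and V are computed from.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # (B, N/R^2, C)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                    # (B, h, N/R^2, d)

        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)                         # (B, h, N, N/R^2)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# 56x56 tokens from stage 1, reduced 8x per side before attention
x = torch.randn(1, 56 * 56, 64)
print(SRA(dim=64, num_heads=1, sr_ratio=8)(x, 56, 56).shape)  # (1, 3136, 64)
```

Because K and V lose a factor of R² tokens, the attention matrix shrinks from N×N to N×(N/R²), which is what makes attention affordable on the high-resolution early stages.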
Detailed settings of the PVT series are shown above.
Experiments
The table above compares performance on the ImageNet dataset. From it we can see:
Compared with CNNs, under the same parameter and computation budget, PVT-Small achieves a 20.2% error rate, better than the 21.5% of ResNet50;
Compared with other Transformers (such as ViT and DeiT), the proposed PVT achieves comparable performance with less computation.
The table above compares performance on semantic segmentation. Under different parameter configurations, PVT outperforms both ResNet and ResNeXt. This shows that, benefiting from the global attention mechanism, PVT extracts better features for semantic segmentation than CNNs.
THE END
Please contact the official account for reprint authorization.