Transformer deployment | Next-ViT achieves real-time industrial deployment on TensorRT, surpassing ResNet and CSWin
2022-07-28 04:27:00 [Zhiyuan Community]
Due to complex attention mechanisms and model designs, most existing ViTs cannot run as efficiently as CNNs in real industrial deployment scenarios, e.g., on TensorRT and CoreML.
This poses an obvious challenge: can a visual neural network be designed to infer as fast as a CNN while performing as powerfully as a ViT?
Recently, many works have tried to design CNN-Transformer hybrid architectures to solve this problem, but the overall performance of these works is far from satisfactory. To this end, the authors of this paper propose Next-ViT, a next-generation vision Transformer for effective deployment in real industrial scenarios, which dominates both CNNs and ViTs from the latency/accuracy trade-off perspective.
In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are developed to capture local and global information, respectively, through deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCBs and NTBs in an efficient hybrid paradigm, improving performance on various downstream tasks.
Extensive experiments show that Next-ViT significantly outperforms existing CNN, ViT, and CNN-Transformer hybrid architectures on the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.4 mAP on COCO detection (from 40.4 to 45.8) and by 8.2% mIoU on ADE20K segmentation (from 38.8% to 47.0%), with almost the same inference latency. Meanwhile, it achieves performance comparable to CSWin while improving inference speed by 3.6×. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP on COCO detection (from 42.6 to 47.2) and by 3.5% mIoU on ADE20K (from 45.2% to 48.7%).
1 Introduction
In recent years, ViTs have received increasing attention in industry and academia and have achieved great success on computer vision tasks such as image classification, object detection, and semantic segmentation. However, from a real-world deployment perspective, CNNs still dominate vision tasks because ViTs are usually much slower than classic CNNs such as ResNets. The inference speed of ViT models is limited by the quadratic complexity of the multi-head self-attention (MHSA) mechanism with respect to token length, inefficient LayerNorm and GELU layers, and complex model designs that cause frequent memory access and copying.
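The quadratic term can be made concrete with a rough operation count (a minimal sketch; the projection and attention constants are illustrative back-of-the-envelope estimates, not profiler measurements):

```python
def mhsa_flops(n_tokens: int, dim: int) -> int:
    """Rough multiply-add count for one multi-head self-attention layer.

    Counts the Q/K/V and output projections (linear in token count) plus
    the two token-by-token matmuls, QK^T and attn @ V (quadratic in token
    count). Head count does not change the total, so it is omitted.
    """
    projections = 4 * n_tokens * dim * dim     # Q, K, V, output: O(n * d^2)
    attention = 2 * n_tokens * n_tokens * dim  # QK^T and attn @ V: O(n^2 * d)
    return projections + attention

# Quadrupling the token count (14x14 -> 28x28 feature map) quadruples
# the projection cost but multiplies the attention cost by 16, so the
# attention term dominates at the high resolutions used in detection
# and segmentation.
low_res = mhsa_flops(196, 384)
high_res = mhsa_flops(784, 384)
```

This is why dense-prediction tasks, which run attention over large feature maps, are where ViT latency hurts most.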
Much work has been done to free ViTs from this high-latency dilemma. For example, Swin and PVT try to design more efficient spatial attention mechanisms to alleviate the quadratic growth of MHSA's computational complexity. Other works combine efficient convolution blocks with powerful Transformer blocks to design CNN-Transformer hybrid architectures and obtain a better accuracy/latency trade-off. Notably, almost all existing hybrid architectures use convolution blocks in the shallow stages and stack Transformer blocks only in the last few stages. However, the authors observe that this hybrid strategy can cause performance saturation on downstream tasks (e.g., segmentation and detection). In addition, they find that in existing work, convolution blocks and Transformer blocks cannot be both efficient and high-performing at the same time. Although the accuracy/latency trade-off improves compared with ViTs, the overall performance of existing hybrid architectures is still far from satisfactory.
To solve the above problems, this work develops three key components for designing an efficient vision Transformer network.
First, it introduces the Next Convolution Block (NCB), which uses a novel, deployment-friendly multi-head convolutional attention (MHCA) to capture short-range dependencies in visual data.
Second, it builds the Next Transformer Block (NTB), which not only excels at capturing long-range dependencies but also acts as a lightweight mixer of high- and low-frequency signals to enhance modeling capacity.
Finally, the Next Hybrid Strategy (NHS) is designed to stack NCBs and NTBs in a novel hybrid paradigm at each stage, greatly reducing the proportion of Transformer blocks while preserving the vision Transformer network's high accuracy across a variety of downstream tasks.
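The stacking idea behind NHS can be sketched in a few lines: instead of reserving whole stages for Transformer blocks, each stage repeats a run of NCBs ended by a single NTB. This is a minimal illustration of that paradigm; the per-stage block counts below are placeholders, not Next-ViT's published configuration.

```python
def build_stage(n_ncb: int, repeats: int = 1) -> list:
    """One stage under the (NCB x N + NTB x 1) paradigm: a run of
    convolution blocks ending in a single Transformer block, repeated."""
    return (["NCB"] * n_ncb + ["NTB"]) * repeats

def build_next_vit(stage_cfgs) -> list:
    """stage_cfgs: one (n_ncb, repeats) pair per stage."""
    return [build_stage(n, r) for n, r in stage_cfgs]

# Illustrative 4-stage layout: every stage gets some global modeling,
# yet NTBs remain a minority of the total depth.
stages = build_next_vit([(2, 1), (2, 1), (4, 2), (2, 1)])
ntb_ratio = sum(s.count("NTB") for s in stages) / sum(len(s) for s in stages)
```

Because every stage ends with at least one NTB, shallow features also receive global context, which is the property the authors credit for avoiding the downstream-task saturation seen in convolution-early/Transformer-late hybrids.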
Based on the above methods, a next-generation vision Transformer (Next-ViT for short) is proposed. To provide a fair comparison, the authors take the view that latency on specific hardware should serve as direct efficiency feedback. TensorRT and CoreML represent common, easy-to-deploy solutions for server-side and mobile devices, respectively, providing convincing, hardware-grounded performance guidance. With this direct and accurate guidance, Figure 1 redraws the accuracy/latency trade-offs of several existing competitive models. As shown in Figure 1(a)(d), Next-ViT achieves the best latency/accuracy trade-off on ImageNet-1K classification. More importantly, Next-ViT shows even more significant latency/accuracy advantages on downstream tasks.

As shown in Figure 1(b)(c), compared with ResNet on TensorRT, Next-ViT is 5.4 mAP better on COCO detection (from 40.4 to 45.8) and 8.2% mIoU better on ADE20K segmentation (from 38.8% to 47.0%). Next-ViT also achieves performance comparable to CSWin while improving inference speed by 3.6×.
As shown in Figure 1(e)(f), compared with EfficientFormer on CoreML, Next-ViT is 4.6 mAP better on COCO detection (from 42.6 to 47.2) and 3.5% mIoU better on ADE20K (from 45.2% to 48.7%).
The main contributions are summarized as follows:
- A powerful convolution block and Transformer block, namely NCB and NTB, are developed with deployment-friendly mechanisms. Next-ViT stacks NCBs and NTBs to build an advanced CNN-Transformer hybrid architecture.
- An innovative CNN-Transformer hybrid strategy is designed to improve both efficiency and performance.
- Next-ViT, a powerful family of vision Transformer architectures, is presented. Extensive experiments demonstrate its advantages: it achieves SOTA latency/accuracy trade-offs on TensorRT and CoreML for image classification, object detection, and semantic segmentation.