A model that can do three segmentation tasks at once, with better performance and efficiency than MaskFormer! Meta & UIUC propose a universal segmentation model that outperforms task-specific models! Code is open source!
2022-07-07 18:34:00 【I love computer vision】
This article shares the CVPR 2022 paper 『Masked-attention Mask Transformer for Universal Image Segmentation』: a model that handles three segmentation tasks at once, with better performance and efficiency than MaskFormer! Meta & UIUC propose a universal segmentation model that outperforms task-specific models! The code is open source!
Details are as follows:
Paper link: https://arxiv.org/abs/2112.01527[1]
Code link: https://bowenc0221.github.io/mask2former/[2]
01
Abstract
Image segmentation groups pixels with different semantics, e.g., category or instance membership, and each choice of semantics defines a task. While only the semantics differ between tasks, current research focuses on designing a specialized architecture for each of them.
The authors present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance, or semantic). Its key component is masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets.
Most notably, Mask2Former sets a new state of the art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO), and semantic segmentation (57.7 mIoU on ADE20K).
02
Motivation
Image segmentation studies the problem of grouping pixels. Different semantics for grouping pixels, e.g., category or instance membership, lead to different types of segmentation tasks, such as panoptic, instance, or semantic segmentation. While these tasks differ only in semantics, current methods develop specialized architectures for each of them. Per-pixel classification architectures based on fully convolutional networks (FCNs) dominate semantic segmentation, while mask classification architectures that predict a set of binary masks dominate instance segmentation. Although such specialized architectures have advanced each individual task, they lack the flexibility to generalize to the other tasks. For example, FCN-based architectures struggle with instance segmentation. As a result, duplicated research and hardware optimization effort is spent on each task-specific architecture.
To address this fragmentation, recent work has attempted to design universal architectures that can handle all segmentation tasks with the same architecture (i.e., universal image segmentation). These architectures are typically based on an end-to-end set prediction objective (e.g., DETR) and successfully tackle multiple tasks without modifying the architecture, loss, or training procedure. Note that, despite sharing the same architecture, universal models are still trained separately for different tasks and datasets. Beyond being flexible, universal architectures have recently shown state-of-the-art results on semantic and panoptic segmentation. However, recent work still focuses on advancing specialized architectures, which raises the question: why haven't universal architectures replaced specialized ones?
Although existing universal architectures are flexible enough to handle any segmentation task, as shown in the figure above, in practice their performance lags behind the best specialized architectures. Besides the inferior performance, universal architectures are also harder to train: they typically require more advanced hardware and a much longer training schedule. For example, training MaskFormer takes 300 epochs to reach 40.1 AP, and it can only fit a single image into a GPU with 32G of memory. In contrast, the specialized Swin-HTC++ obtains better performance in only 72 epochs. Both the performance and the training-efficiency issues hamper the deployment of universal architectures.
In this work, the authors propose Masked-attention Mask Transformer (Mask2Former), a universal image segmentation architecture that outperforms specialized architectures across different segmentation tasks while remaining easy to train on every task. The model is built on a simple meta-architecture consisting of a backbone feature extractor, a pixel decoder, and a Transformer decoder. The authors propose key improvements that enable better results and efficient training.
First, masked attention is used in the Transformer decoder, which restricts attention to localized features centered on the predicted segments; these segments can be either objects or regions, depending on the specific semantics of the grouping. Compared with the standard cross-attention used in Transformer decoders, the proposed masked attention converges faster and yields better performance.
Second, multi-scale high-resolution features are used to help the model segment small objects/regions.
Third, several optimization improvements are proposed, such as switching the order of self-attention and cross-attention, making the query features learnable, and removing dropout; all of these improve performance without extra computation. Finally, computing the mask loss on randomly sampled points saves 3× training memory without affecting performance. These improvements not only boost model performance but also make training significantly easier, making universal architectures more accessible to users with limited compute.
The authors evaluate Mask2Former on three image segmentation tasks (panoptic, instance, and semantic segmentation) using four popular datasets (COCO, Cityscapes, ADE20K, and Mapillary Vistas). For the first time, on all of these benchmarks, this single architecture performs on par with or better than specialized architectures. Using the exact same architecture, Mask2Former sets a new state of the art of 57.8 PQ on COCO panoptic segmentation, 50.1 AP on COCO instance segmentation, and 57.7 mIoU on ADE20K semantic segmentation.
03
Method
3.1. Mask classification preliminaries
A mask classification architecture groups pixels into N segments by predicting N binary masks together with N corresponding category labels. Mask classification is general enough to address any segmentation task by assigning different semantics (e.g., categories or instances) to different segments. However, the challenge is to find good representations for each segment. For example, Mask R-CNN uses bounding boxes as the representation, which limits its application to semantic segmentation.
Inspired by DETR, each segment in an image can be represented as a C-dimensional feature vector (an "object query") that is processed by a Transformer decoder and trained with a set prediction objective. A simple meta-architecture consists of three components: a backbone that extracts low-resolution features from the image; a pixel decoder that gradually upsamples the low-resolution features from the backbone output to produce high-resolution per-pixel embeddings; and a Transformer decoder that operates on image features to process the object queries. The final binary mask predictions are decoded from the per-pixel embeddings with the object queries. One successful instantiation of this meta-architecture is MaskFormer.
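Below is a rough PyTorch sketch of this meta-architecture (not the official MaskFormer/Mask2Former code; the module names, shapes, and defaults are assumptions for illustration). The point to notice is that each binary mask is decoded as a dot product between an object query embedding and the per-pixel embeddings:

```python
import torch
import torch.nn as nn

class MetaSegmentationModel(nn.Module):
    """Minimal sketch: backbone -> pixel decoder -> Transformer decoder -> per-query masks."""

    def __init__(self, backbone, pixel_decoder, transformer_decoder,
                 num_queries=100, hidden_dim=256, num_classes=80):
        super().__init__()
        self.backbone = backbone                        # image -> low-resolution features
        self.pixel_decoder = pixel_decoder              # low-res feats -> per-pixel embeddings (+ pyramid)
        self.transformer_decoder = transformer_decoder  # refines object queries on image features
        self.query_feat = nn.Embedding(num_queries, hidden_dim)
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)   # +1 for the "no object" class

    def forward(self, images):
        feats = self.backbone(images)                        # (B, C, H/32, W/32)
        pixel_embed, ms_feats = self.pixel_decoder(feats)    # (B, C, H/4, W/4), list of pyramid levels
        B = images.shape[0]
        queries = self.query_feat.weight.unsqueeze(0).expand(B, -1, -1)   # (B, N, C)
        queries = self.transformer_decoder(queries, ms_feats)             # (B, N, C)
        class_logits = self.class_head(queries)              # (B, N, num_classes + 1)
        # Each binary mask = dot product of a query embedding with the per-pixel embeddings
        mask_logits = torch.einsum("bnc,bchw->bnhw", queries, pixel_embed)
        return class_logits, mask_logits
```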
3.2. Transformer decoder with masked attention
Mask2Former adopts the meta-architecture above, with the proposed Transformer decoder (top right of the figure) replacing the standard one. The key component of this Transformer decoder is a masked attention operator, which extracts localized features by constraining cross-attention within the foreground region of each query's predicted mask, instead of attending to the full feature map. To handle small objects, an efficient multi-scale strategy is proposed to exploit high-resolution features: it feeds successive feature maps from the pixel decoder's feature pyramid into successive Transformer decoder layers. Finally, optimization improvements are added that boost model performance without introducing extra computation.
3.2.1 Masked attention
Context features are important for image segmentation. However, recent studies suggest that the slow convergence of Transformer-based models stems from the global context in the cross-attention layers, since cross-attention needs many training epochs to learn to attend to localized object regions. The authors hypothesize that local features are sufficient to update the query features, and that context information can be gathered through self-attention. They therefore propose masked attention, a variant of cross-attention that only attends within the foreground region of each query's predicted mask.
Standard cross-attention (with a residual path) computes
$$\mathbf{X}_l = \mathrm{softmax}(\mathbf{Q}_l \mathbf{K}_l^{\mathrm{T}})\,\mathbf{V}_l + \mathbf{X}_{l-1}$$
Here, $l$ is the layer index, $\mathbf{X}_l \in \mathbb{R}^{N \times C}$ denotes the $N$ $C$-dimensional query features at the $l$-th layer, and $\mathbf{Q}_l = f_Q(\mathbf{X}_{l-1}) \in \mathbb{R}^{N \times C}$. $\mathbf{X}_0$ denotes the input query features of the Transformer decoder. $\mathbf{K}_l, \mathbf{V}_l \in \mathbb{R}^{H_l W_l \times C}$ are the image features under the transformations $f_K(\cdot)$ and $f_V(\cdot)$, respectively, where $H_l$ and $W_l$ are the spatial resolution of the image features; $f_Q$, $f_K$, and $f_V$ are linear transformations.
The proposed masked attention modulates the attention matrix as
$$\mathbf{X}_l = \mathrm{softmax}(\mathcal{M}_{l-1} + \mathbf{Q}_l \mathbf{K}_l^{\mathrm{T}})\,\mathbf{V}_l + \mathbf{X}_{l-1}$$
where the attention mask $\mathcal{M}_{l-1}$ at feature location $(x, y)$ is
$$\mathcal{M}_{l-1}(x, y) = \begin{cases} 0 & \text{if } \mathbf{M}_{l-1}(x, y) = 1 \\ -\infty & \text{otherwise} \end{cases}$$
Here $\mathbf{M}_{l-1} \in \{0, 1\}^{N \times H_l W_l}$ is the binarized output (thresholded at 0.5) of the mask prediction from the $(l-1)$-th Transformer decoder layer, resized to the same resolution as $\mathbf{K}_l$.
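To make the mechanism concrete, here is a minimal PyTorch sketch of one masked cross-attention step (illustrative only, not the released Mask2Former code; the function name, the 1/√C scaling, and the empty-mask fallback are assumptions of this sketch):

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(x_prev, img_feats, prev_mask_logits, f_q, f_k, f_v):
    """X_l = softmax(M_{l-1} + Q_l K_l^T) V_l + X_{l-1}.

    x_prev:           (B, N, C)    query features X_{l-1}
    img_feats:        (B, HW, C)   flattened image features at this layer's resolution
    prev_mask_logits: (B, N, H, W) mask prediction of the previous decoder layer,
                                   resized to the same H x W as the image features
    f_q, f_k, f_v:    linear projections (the transformations f_Q, f_K, f_V)
    """
    B, N, C = x_prev.shape
    q, k, v = f_q(x_prev), f_k(img_feats), f_v(img_feats)

    # Attention mask M_{l-1}: 0 where the binarized (threshold 0.5) previous prediction
    # is foreground, -inf everywhere else
    fg = (prev_mask_logits.sigmoid() > 0.5).flatten(2)                  # (B, N, HW), bool
    attn_bias = torch.zeros_like(fg, dtype=x_prev.dtype)
    attn_bias.masked_fill_(~fg, float("-inf"))
    # If a query predicts an empty mask, fall back to full cross-attention to avoid NaNs
    attn_bias = attn_bias.masked_fill(~fg.any(-1, keepdim=True), 0.0)

    attn = F.softmax(attn_bias + q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, N, HW)
    return attn @ v + x_prev                                                # residual: + X_{l-1}
```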
3.2.2 High-resolution features
High-resolution features improve model performance, especially for small objects, but they are computationally demanding. The authors therefore propose an efficient multi-scale strategy that introduces high-resolution features while keeping the increase in computation under control. Instead of always using the high-resolution feature map, a feature pyramid consisting of both low- and high-resolution features is used, and one resolution of the multi-scale features is fed to one Transformer decoder layer at a time.
Specifically, the feature pyramid produced by the pixel decoder is used, with resolutions of 1/32, 1/16, and 1/8 of the original image. For each resolution, a sinusoidal positional embedding and a learnable scale-level embedding are added. These are fed, from the lowest resolution to the highest, to the corresponding Transformer decoder layers, as shown on the left of the figure above. This three-layer scheme is repeated L times, so the final Transformer decoder has 3L layers.
More specifically, the first three layers receive feature maps of resolution H/32 × W/32, H/16 × W/16, and H/8 × W/8, where H and W are the height and width of the original image. This pattern is repeated in a round-robin fashion for all subsequent layers.
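As a rough illustration of this round-robin feeding (a sketch operating on torch tensors with assumed helpers `decoder_layers`, `level_embed`, and `pos_embed_fn`; not the official implementation):

```python
def run_decoder(queries, ms_feats, decoder_layers, level_embed, pos_embed_fn):
    """Feed the 1/32, 1/16, 1/8 pyramid levels to successive decoder layers in turn.

    queries:        (B, N, C) query features
    ms_feats:       [(B, C, H/32, W/32), (B, C, H/16, W/16), (B, C, H/8, W/8)]
    decoder_layers: list of 3L decoder layers (masked attention + self-attention + FFN)
    level_embed:    learnable scale-level embeddings, shape (3, C)
    pos_embed_fn:   returns a (HW, C) sinusoidal positional embedding for a given H, W
    """
    num_scales = len(ms_feats)                       # 3 resolutions
    for i, layer in enumerate(decoder_layers):       # 3L layers in total
        s = i % num_scales                           # lowest -> highest resolution, repeated
        feat = ms_feats[s]
        B, C, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)     # (B, HW, C)
        tokens = tokens + pos_embed_fn(H, W).to(feat) + level_embed[s]
        queries = layer(queries, tokens)
    return queries
```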
3.2.3 Optimization improvements
A standard Transformer decoder layer consists of three modules that process the query features in the following order: a self-attention module, a cross-attention module, and a feed-forward network (FFN). In addition, the query features ($\mathbf{X}_0$) fed into the Transformer decoder are zero-initialized and associated with learnable positional embeddings, and dropout is applied to both the residual connections and the attention maps.
To optimize the Transformer decoder design, the authors make the following three changes. First, the order of self-attention and cross-attention (i.e., the proposed masked attention) is switched to make computation more effective: the query features of the first self-attention layer do not yet depend on the image features, so applying self-attention there cannot produce any meaningful features.
Second, the query features ($\mathbf{X}_0$) are also made learnable (learnable query positional embeddings are still kept), and these learnable query features are directly supervised before being used in the Transformer decoder to predict masks ($\mathbf{M}_0$). The authors find that these learnable query features behave like a region proposal network and are able to generate mask proposals. Finally, the authors find that dropout is not necessary and usually degrades performance, so it is removed entirely.
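The resulting layer ordering can be summarized with a small sketch built from standard PyTorch modules (not the released code; the post-norm placement and the additive `attn_mask` convention of `nn.MultiheadAttention` are assumptions of this sketch):

```python
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Masked cross-attention first, then self-attention, then FFN; no dropout anywhere."""

    def __init__(self, dim=256, num_heads=8, ffn_dim=2048):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, img_tokens, attn_mask=None):
        # 1) Masked cross-attention to image features; attn_mask is an additive float mask
        #    of shape (B * num_heads, N, HW), -inf outside the previously predicted foreground
        out, _ = self.cross_attn(queries, img_tokens, img_tokens, attn_mask=attn_mask)
        queries = self.norm1(queries + out)
        # 2) Self-attention among the queries gathers context between segments
        out, _ = self.self_attn(queries, queries, queries)
        queries = self.norm2(queries + out)
        # 3) Feed-forward network
        return self.norm3(queries + self.ffn(queries))
```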
3.3. Improving training efficiency
One limitation of training universal architectures is the large memory consumption caused by high-resolution mask prediction, which makes them less accessible than the more memory-friendly specialized architectures. For example, MaskFormer can only fit a single image into a GPU with 32G of memory. Inspired by PointRend and Implicit PointRend, the authors compute the mask loss on sampled points, both in the matching step and in the final loss computation.
More specifically, in the matching loss used to construct the cost matrix for bipartite matching, the same set of K points is uniformly sampled for all prediction and ground-truth masks. In the final loss between a prediction and its matched ground truth, different sets of K points are sampled for different prediction/ground-truth pairs via importance sampling. K is set to 12544, i.e., 112×112 points. This new training strategy effectively reduces training memory by 3 times, from 18GB to 6GB per image.
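A minimal sketch of the point-sampled mask loss (uniform sampling only; the actual method additionally uses importance sampling for the final loss, and this is not the PointRend/Mask2Former implementation — the function names and the assumption that predictions are already matched one-to-one with ground truth are for illustration):

```python
import torch
import torch.nn.functional as F

def sample_point_logits(masks, point_coords):
    """Sample mask values at normalized (x, y) point coordinates in [0, 1].

    masks:        (B, N, H, W)
    point_coords: (B, K, 2)
    """
    grid = 2.0 * point_coords.unsqueeze(1) - 1.0                    # grid_sample expects [-1, 1]
    sampled = F.grid_sample(masks, grid, align_corners=False)       # (B, N, 1, K)
    return sampled.squeeze(2)                                       # (B, N, K)

def point_sampled_mask_loss(pred_logits, gt_masks, num_points=12544):
    """Mask loss on K=12544 (112x112) uniformly sampled points instead of full masks,
    which is what cuts training memory roughly 3x (18GB -> 6GB per image in the paper)."""
    B = pred_logits.shape[0]
    point_coords = torch.rand(B, num_points, 2, device=pred_logits.device)
    pred_pts = sample_point_logits(pred_logits, point_coords)       # (B, N, K)
    gt_pts = sample_point_logits(gt_masks.float(), point_coords)    # (B, N, K)
    return F.binary_cross_entropy_with_logits(pred_pts, gt_pts)
```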
04
Experiments
In the table above, Mask2Former is compared with state-of-the-art panoptic segmentation models on the COCO panoptic segmentation dataset. Mask2Former consistently outperforms MaskFormer by more than 5 PQ while converging 6 times faster.
In the table above, Mask2Former is compared with state-of-the-art models on the COCO instance segmentation dataset. With a ResNet backbone and large-scale jittering (LSJ) augmentation, Mask2Former outperforms Mask R-CNN while requiring 8× fewer training iterations. With a Swin-L backbone, Mask2Former outperforms the state-of-the-art HTC++.
In the table above, Mask2Former is compared with state-of-the-art semantic segmentation models on the ADE20K dataset. Mask2Former outperforms MaskFormer across different backbones, showing that the proposed improvements even boost semantic segmentation results.
The table above shows the ablation results for the modules proposed in this paper; each proposed component improves both segmentation accuracy and efficiency.
The table above studies the effect of computing losses on full masks versus sampled points, in terms of performance and training memory. Computing the final training loss with sampled points reduces training memory by 3 times without affecting performance, and computing the matching loss with sampled points improves performance on all three tasks.
In the figure above, the mask predictions of selected learnable queries are visualized before they are fed into the Transformer decoder. The quality of these proposals is further quantified by computing the class-agnostic average recall with 100 predictions on COCO val2017. These learnable queries already achieve very good results.
To demonstrate that Mask2Former generalizes beyond the COCO dataset, further experiments are conducted on other popular image segmentation datasets. The table above shows the results on Cityscapes.
The ultimate goal is a single model trained once for all image segmentation tasks. In the table above, the authors find that a Mask2Former trained only on panoptic segmentation performs only slightly worse than the exact same model trained with task-specific annotations, e.g., for semantic segmentation. This suggests that, although Mask2Former generalizes to different tasks, it still needs to be trained for each specific task.
05
Summary
The authors propose Mask2Former for universal image segmentation. Built on a simple meta-framework with a new Transformer decoder that uses the proposed masked attention, Mask2Former obtains top results on all three major image segmentation tasks (panoptic, instance, and semantic) across four popular datasets, outperforming even the best specialized models designed for each benchmark, while remaining easy to train. Compared with designing a specialized model for each task, Mask2Former saves 3 times the research effort, and it is accessible to users with limited computational resources.
References
[1]https://arxiv.org/abs/2112.01527
[2]https://bowenc0221.github.io/mask2former/