Interpretation of the Mask R-CNN paper
2022-07-05 01:36:00 【Xiaobai learns vision】
Introduction to Mask R-CNN
Mask R-CNN is an incremental extension of Faster R-CNN. Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. To make up for this deficiency, the authors propose a simple, quantization-free layer called RoIAlign, which preserves exact spatial locations. Despite being a small change, RoIAlign has a large impact: it improves mask accuracy by a relative 10% to 50%, with the larger gains appearing under stricter localization metrics. Second, the authors find it essential to decouple mask prediction from class prediction, so a binary mask is predicted for each class independently. With these improvements, the final Mask R-CNN model outperforms all previous single-model results on the COCO instance segmentation task. The model runs at about 200 ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine.
The idea behind Mask R-CNN is simple and clear. Faster R-CNN has two outputs for each candidate object: a class label and a bounding-box offset. Mask R-CNN adds a third branch that outputs the object mask. The mask output differs from the class and box outputs in that it requires a much finer extraction of the object's spatial layout. The following sections introduce the main elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main piece missing from Fast/Faster R-CNN.
How Mask R-CNN works
Mask R-CNN adopts the same two-stage procedure as Faster R-CNN. The first stage, the RPN (Region Proposal Network), proposes candidate object bounding boxes. The second stage is essentially Fast R-CNN: it uses RoIPool to extract features from each candidate box and performs classification and bounding-box regression. In addition, Mask R-CNN outputs a binary mask for each RoI. For a detailed comparison between Faster R-CNN and other frameworks, readers are referred to Huang et al. (2016), "Speed/accuracy trade-offs for modern convolutional object detectors".
A mask encodes the spatial layout of an object. Unlike class labels or box offsets, which are collapsed into short output vectors, the spatial structure of a mask can be extracted naturally through the pixel-to-pixel correspondence provided by convolutions.
RoIAlign: RoIPool is the standard operation for extracting a small feature map (for example 7×7) from each RoI. Mask R-CNN replaces it with RoIAlign.
Network architecture: To demonstrate the generality of Mask R-CNN, the authors instantiate it with multiple architectures. To distinguish between them, the paper separates two parts: the convolutional backbone architecture, which extracts features over the whole image, and the head architecture, which performs bounding-box recognition (classification and regression) and mask prediction for each RoI.
The modifications made to the Faster R-CNN network are:
(1) The RoI Pooling layer is replaced with RoIAlign;
(2) A parallel FCN branch (the mask branch) is added.
Technical points
1. Stronger backbone network
ResNeXt-101+FPN is used as the feature extraction network, achieving state-of-the-art results.
2. The added RoIAlign layer
RoIPool is the standard operation for extracting a small feature map (e.g. 7x7) from each RoI. It solves the problem of turning RoIs of different sizes into features of the same size. RoIPool first quantizes the floating-point RoI to the discrete grid of the feature map, then divides the quantized RoI into spatial bins (which are themselves quantized), and finally applies max pooling to each bin to produce the result.
Quantization is performed on a continuous coordinate x by computing [x/16], where 16 is the stride of the feature map and [ . ] denotes rounding. These quantizations introduce misalignment between the RoI and the extracted features. Because classification is robust to small translations, the effect on classification is relatively small. For predicting pixel-accurate masks, however, the negative effect is very large.
To solve this problem, the authors propose the RoIAlign layer, which aligns the extracted features with the input. The change is simple: avoid any quantization of the RoI boundaries or bins, for example by using x/16 instead of [x/16]. Bilinear interpolation is used to compute the exact values of the input features at 4 sampling locations in each RoI bin, and the results are aggregated (using max or average).
An example illustrates the misalignment described above, using a Faster R-CNN detection framework. The input is an 800*800 image containing a 665*665 bounding box (framing a dog). After the backbone network extracts the image features, the feature map has a stride of 32, so the side lengths of the image and the bounding box on the feature map are 1/32 of their input values. 800 divides evenly by 32, giving 25. But 665 divided by 32 gives 20.78, which has a fractional part, so RoI Pooling quantizes it directly to 20. Next, the features inside the box must be pooled to a 7*7 size, so the bounding box is divided evenly into 7*7 rectangular regions. The side length of each region is 20/7 = 2.86, which again has a fractional part, so RoI Pooling quantizes it to 2. After these two quantizations, the candidate region has shifted noticeably (the green part in the original figure). More importantly, a deviation of 0.1 pixel on this feature map corresponds to 3.2 pixels in the original image. The deviation of about 0.8 from the first quantization alone therefore amounts to roughly 25 pixels in the original image (0.78 × 32), which is a substantial error.
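The arithmetic in this example can be checked with a few lines of Python. This is only an illustrative sketch: the function name roi_pool_quantization is hypothetical, and it assumes that quantization truncates downward (floor), as the numbers in the example suggest.

```python
import math

def roi_pool_quantization(box_size, stride, bins):
    """Reproduce the two rounding steps of RoI Pooling along one side of a box.

    Assumption: quantization rounds down (floor), matching the worked example.
    """
    exact = box_size / stride               # continuous side length on the feature map
    quantized = math.floor(exact)           # first quantization: snap the box to the grid
    exact_bin = quantized / bins            # continuous bin size after the first rounding
    quantized_bin = math.floor(exact_bin)   # second quantization: snap each bin
    return exact, quantized, exact_bin, quantized_bin

stride = 32
exact, quantized, exact_bin, quantized_bin = roi_pool_quantization(665, stride, 7)

# Error introduced by the first rounding, in feature-map and image pixels.
drift_feature = exact - quantized
drift_image = drift_feature * stride

print(f"box side: {exact:.2f} -> {quantized}")
print(f"bin side: {exact_bin:.2f} -> {quantized_bin}")
print(f"drift: {drift_feature:.2f} feature px = {drift_image:.1f} image px")
```

The second rounding loses additional coverage: 7 bins of side 2 span only 14 of the 20 remaining feature-map cells along each side.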
The specific procedure and its key points:
- Traverse every candidate region, keeping its floating-point boundaries without quantization.
- Divide the candidate region into k x k bins, without quantizing the boundaries of each bin.
- Fix four sampling positions in each bin, compute the values at these four positions by bilinear interpolation, and then apply max pooling.
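The three steps above can be sketched in plain Python for a single-channel feature map. This is a minimal illustration, not the paper's implementation: the function names, the choice of sample points at the 1/4 and 3/4 positions of each bin, and the convention that integer indices mark feature-map cell positions are all assumptions made here.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at continuous (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbours stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature, box, stride, k):
    """Pool a box (x1, y1, x2, y2 in image pixels) into a k x k grid.

    No coordinate is rounded anywhere: box and bin boundaries stay floats.
    """
    # Step 1: map the box onto the feature map, keeping floating-point values.
    x1, y1, x2, y2 = (c / stride for c in box)
    # Step 2: split into k x k bins with floating-point sizes.
    bin_w, bin_h = (x2 - x1) / k, (y2 - y1) / k
    pooled = []
    for i in range(k):
        row = []
        for j in range(k):
            # Step 3: four fixed sample points per bin, then max pooling.
            samples = [
                bilinear(feature, y1 + (i + fy) * bin_h, x1 + (j + fx) * bin_w)
                for fy in (0.25, 0.75)
                for fx in (0.25, 0.75)
            ]
            row.append(max(samples))
        pooled.append(row)
    return pooled

# Toy feature map whose value equals its column index.
feature = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(feature, box=(0, 0, 64, 64), stride=32, k=2)
print(pooled)
```

Replacing max(samples) with sum(samples) / len(samples) gives the average variant mentioned in the text.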
3. Improved segmentation loss
The loss changes from the original per-pixel softmax with multinomial cross-entropy to a per-pixel sigmoid with binary cross-entropy. The framework predicts a binary mask for each class independently, without introducing competition between classes, and the class of the output mask is selected by the prediction of the network's RoI classification branch. This differs from FCNs, which perform multi-class classification on every pixel and therefore couple classification and segmentation. The experiments show that the coupled approach performs poorly on instance segmentation.
In more detail: during training, the authors define a multi-task loss on each sampled RoI, L=L_{cls}+L_{box}+L_{mask}. The first two terms are the same as in Fast R-CNN and are not discussed further. The mask branch has a Km^2-dimensional output for each RoI, which encodes K binary masks of resolution m\times m, one for each of the K classes. A per-pixel sigmoid is applied to this output, and L_{mask} is defined as the average binary cross-entropy loss. For an RoI whose ground-truth class is k, L_{mask} is defined only on the k-th mask (the other mask outputs do not contribute to the loss). This definition lets the network generate a mask for every class without competition between classes.
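A minimal sketch of the mask loss for one RoI follows, in plain Python. The function names and the nested-list layout (K masks, each m rows of m raw scores) are assumptions for illustration. It applies a per-pixel sigmoid and averages the binary cross-entropy over the mask of the ground-truth class only.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mask_loss(logits, target, k):
    """Average binary cross-entropy on the k-th mask of one RoI.

    logits: K masks, each an m x m grid of raw scores (hypothetical layout).
    target: m x m ground-truth mask with values 0 or 1.
    k:      index of the RoI's ground-truth class.
    """
    eps = 1e-12  # guards against log(0)
    total, count = 0.0, 0
    # Only logits[k] is read; the other K-1 masks do not contribute.
    for logit_row, target_row in zip(logits[k], target):
        for z, t in zip(logit_row, target_row):
            p = sigmoid(z)
            total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
            count += 1
    return total / count

target = [[1, 0], [0, 1]]
logits_a = [[[0.0, 0.0], [0.0, 0.0]], [[5.0, -5.0], [-5.0, 5.0]]]
logits_b = [[[0.0, 0.0], [0.0, 0.0]], [[-9.0, 9.0], [9.0, -9.0]]]

loss_a = mask_loss(logits_a, target, k=0)
loss_b = mask_loss(logits_b, target, k=0)
print(loss_a, loss_b)
```

The two calls differ only in the mask of class 1, so with k=0 they return the same loss. With a per-pixel softmax across classes, the scores of class 1 would have changed the result.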
4. Mask representation
A mask encodes the spatial layout of an input object. The authors use an FCN to predict a mask for each RoI, which preserves the spatial structure information.