Interpretation of the Mask R-CNN Paper

2022-07-05 01:36:00 Xiaobai learns vision

Introduction to Mask R-CNN

Mask R-CNN is a staged extension of Faster R-CNN. Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs; to make up for this, the authors propose a simple, quantization-free layer named RoIAlign that preserves exact spatial locations. Beyond its simplicity, RoIAlign has a major impact: it improves mask accuracy by a relative 10% to 50%, with the largest gains under stricter localization metrics. Second, the authors find it essential to decouple mask prediction from class prediction: they predict a binary mask for each class independently, letting the classification branch choose the category. With these improvements, the final Mask R-CNN model outperforms all previous single-model results on the COCO instance segmentation task, runs at about 200 ms per frame on a GPU, and trains on COCO in one to two days on an 8-GPU machine.

The idea behind Mask R-CNN is simple and clear. Faster R-CNN has two outputs for each candidate object: a class label and a bounding-box offset. Mask R-CNN adds a third branch that outputs the object mask. The mask output differs from the existing class and box outputs in that it requires extracting a much finer spatial layout of the object. The sections below describe the main elements of Mask R-CNN in detail, including the pixel-to-pixel alignment that Fast/Faster R-CNN lacks.
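
As a rough illustration of these three outputs, here is a minimal PyTorch sketch of per-RoI heads. The layer sizes, feature resolutions, and names are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class RoIHeads(nn.Module):
    """Illustrative per-RoI heads: class label, box offset, and object mask."""

    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        # Box head: flatten the 7x7 RoI feature, then predict scores and offsets.
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(in_channels * 7 * 7, 1024), nn.ReLU()
        )
        self.cls_score = nn.Linear(1024, num_classes)      # class label
        self.bbox_pred = nn.Linear(1024, num_classes * 4)  # bounding-box offsets
        # Mask branch: a small FCN that keeps the spatial grid, one mask per class.
        self.mask_fcn = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1),
        )

    def forward(self, feat7, feat14):
        # feat7: [N, C, 7, 7] and feat14: [N, C, 14, 14] RoI features.
        x = self.fc(feat7)
        return self.cls_score(x), self.bbox_pred(x), self.mask_fcn(feat14)

heads = RoIHeads()
cls, box, mask = heads(torch.randn(2, 256, 7, 7), torch.randn(2, 256, 14, 14))
print(cls.shape, box.shape, mask.shape)  # [2, 81], [2, 324], [2, 81, 28, 28]
```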

How Mask R-CNN Works

Mask R-CNN adopts the same two-stage procedure as Faster R-CNN. The first stage is the RPN (Region Proposal Network), which proposes candidate object bounding boxes. The second stage is essentially Fast R-CNN: it uses RoIPool to extract features from each candidate box and performs classification and bounding-box regression. Mask R-CNN additionally generates a binary mask for each RoI. For a detailed comparison of Faster R-CNN with other frameworks, readers are referred to Huang et al. (2016), "Speed/accuracy trade-offs for modern convolutional object detectors".

A mask encodes the spatial layout of an object. Unlike class labels or box offsets, the spatial structure of a mask can be extracted naturally in Mask R-CNN through the pixel-to-pixel correspondence that convolutions provide.

RoIAlign: RoIPool is the standard operation for extracting a small feature map (e.g., 7×7) from each RoI.

Network Architecture: to demonstrate the generality of Mask R-CNN, the authors instantiate it with multiple architectures. To distinguish them, the paper separates the convolutional backbone architecture, which extracts features over the entire image, from the head architecture, which performs bounding-box recognition (classification and regression) and mask prediction separately on each RoI.

The modifications to the Faster R-CNN network are:

(1) replace the RoI Pooling layer with RoIAlign;

(2) add a parallel FCN layer (the mask branch).

Technical points

1. Enhanced backbone network

Using ResNeXt-101 with FPN as the feature-extraction network achieves state-of-the-art results.
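
For hands-on experimentation, torchvision ships a pretrained Mask R-CNN, though with a ResNet-50 + FPN backbone rather than the ResNeXt-101 + FPN used here, so the sketch below is a convenience, not a reproduction of the paper's configuration (older torchvision versions take pretrained=True instead of weights):

```python
import torch
import torchvision

# Pretrained Mask R-CNN; torchvision ships a ResNet-50 + FPN backbone
# (not the ResNeXt-101 + FPN configuration reported in the paper).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 800, 800)  # dummy RGB image with values in [0, 1]
with torch.no_grad():
    out = model([image])[0]      # dict with boxes, labels, scores, masks
print(out["boxes"].shape, out["masks"].shape)  # masks: [N, 1, H, W] soft masks
```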

2. The RoIAlign layer

RoIPool is the standard operation for extracting a small feature map (e.g., 7×7) from each RoI; it turns RoIs of different sizes into features of one fixed size. RoIPool first quantizes the floating-point RoI to the discrete granularity of the feature map, then subdivides the quantized RoI into spatial bins (which are themselves quantized), and finally max-pools the feature values covered by each bin.
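
A minimal NumPy sketch of this quantizing RoIPool, assuming a single-channel feature map and ignoring edge cases; it is meant to expose the two rounding steps, not to match any library's implementation:

```python
import numpy as np

def naive_roi_pool(feat, roi, out_size=7, stride=16):
    """Quantizing RoIPool on a single-channel feature map.

    feat: [H, W] array; roi: (x1, y1, x2, y2) in input-image coordinates.
    The two rounding steps below are exactly what RoIAlign removes.
    """
    # Quantization 1: snap the floating-point RoI to the feature-map grid,
    # i.e. [x / stride] with [.] denoting rounding.
    x1, y1, x2, y2 = (round(c / stride) for c in roi)
    h, w = max(y2 - y1, 1), max(x2 - x1, 1)
    out = np.zeros((out_size, out_size), dtype=feat.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Quantization 2: snap every bin boundary to integer coordinates.
            ys = y1 + i * h // out_size
            ye = y1 + max((i + 1) * h // out_size, i * h // out_size + 1)
            xs = x1 + j * w // out_size
            xe = x1 + max((j + 1) * w // out_size, j * w // out_size + 1)
            out[i, j] = feat[ys:ye, xs:xe].max()  # max-pool inside the bin
    return out

pooled = naive_roi_pool(np.random.rand(50, 50), roi=(0.0, 0.0, 665.0, 665.0))
print(pooled.shape)  # (7, 7)
```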

These quantizations are performed, for example, by computing [x/16] on a continuous coordinate x, where 16 is the stride of the feature map and [·] denotes rounding. They introduce misalignments between the RoI and the extracted features. Because classification is robust to small translations, the impact there is relatively small; however, it has a very large negative effect on predicting masks with pixel-level accuracy.

To solve this, the authors propose the RoIAlign layer, which aligns the extracted features with the input. The change is simple: avoid any quantization of the RoI boundaries or bins, for example by using x/16 directly instead of [x/16]. Bilinear interpolation is then used to compute the exact values of the input features at four sampled locations in each RoI bin, and the results are aggregated (using max or average).

A concrete example illustrates this misalignment. As shown in the figure, consider a Faster R-CNN detection pipeline with an 800×800 input image containing a 665×665 bounding box (around a dog). After the backbone network extracts features, the feature map stride is 32, so the edge lengths of the image and the bounding box shrink to 1/32 of their input size. 800 divides evenly by 32 into 25, but 665 divided by 32 gives 20.78, a fractional value, which RoI Pooling quantizes directly to 20. Next, the features inside the box must be pooled to a 7×7 size, so the box is divided evenly into 7×7 rectangular bins. Each bin then has a side length of 2.86, again fractional, and RoI Pooling quantizes it once more, to 2. After these two quantizations, the candidate region has clearly drifted from its true position (the green region in the figure). Worse still, a 0.1-pixel deviation on this feature map scales to 3.2 pixels in the original image, so the 0.8-pixel deviation here corresponds to roughly 26 pixels in the original image, a substantial error.
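
The numbers above can be checked with a few lines of plain Python; this only reproduces the arithmetic of the example, nothing more:

```python
# Reproducing the arithmetic of the example above (all sizes in pixels).
stride = 32
box = 665 / stride           # 20.78125 on the feature map
q1 = int(box)                # first quantization by RoI Pooling -> 20
bin_side = q1 / 7            # 2.857... per pooled bin
q2 = int(bin_side)           # second quantization -> 2
print(box, q1, bin_side, q2)

# A deviation of d pixels on this feature map is d * stride pixels in the image:
print(0.1 * stride)          # 3.2
print((box - q1) * stride)   # 25.0 pixels of drift from the first rounding alone
```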

The specific method and its key points are (see the sketch after this list):

  • Traverse each candidate region, keeping its floating-point boundaries unquantized.
  • Divide the candidate region into k × k bins, leaving each bin's boundaries unquantized as well.
  • Compute four fixed sampling locations inside each bin, evaluate them with bilinear interpolation, then max-pool the results.
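
As an illustration, torchvision exposes both operators (torchvision.ops.roi_pool and torchvision.ops.roi_align), so the difference is easy to inspect on a toy feature map; the box and scale values below are arbitrary:

```python
import torch
from torchvision.ops import roi_align, roi_pool

feat = torch.arange(1.0, 17.0).reshape(1, 1, 4, 4)  # tiny 4x4 feature map
# One RoI per row: (batch_index, x1, y1, x2, y2) in input coordinates;
# spatial_scale maps input coordinates onto this feature map (1/16, as in [x/16]).
rois = torch.tensor([[0.0, 0.0, 0.0, 35.0, 35.0]])

# roi_pool rounds the box (35 / 16 = 2.1875 -> 2) and the bin boundaries.
pooled = roi_pool(feat, rois, output_size=2, spatial_scale=1.0 / 16)
# roi_align keeps the fractional box and bilinearly samples four points per bin.
aligned = roi_align(feat, rois, output_size=2, spatial_scale=1.0 / 16, sampling_ratio=2)
print(pooled.squeeze(), aligned.squeeze(), sep="\n")
```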

3. Improved segmentation loss

The segmentation loss changes from a per-pixel softmax with multinomial cross-entropy to a per-pixel sigmoid with binary cross-entropy. The framework predicts a binary mask for each class independently, introducing no competition among classes, and relies on the RoI classification branch to predict the category that selects the output mask. This differs from FCNs, which perform a multi-class classification at each pixel and thus couple segmentation with classification; experiments show that coupling yields poor performance for object instance segmentation.
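
To make the contrast concrete, here is a hedged PyTorch comparison of the two loss styles on dummy tensors; the shapes and class count (K = 80) are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 80, 28, 28)  # [N, K, m, m] per-pixel mask logits

# FCN-style loss: a softmax over the K classes at every pixel, so classes compete.
target_cls = torch.randint(0, 80, (2, 28, 28))
fcn_loss = F.cross_entropy(logits, target_cls)

# Mask R-CNN-style loss: an independent sigmoid per pixel and per class, so there
# is no inter-class competition; the RoI classification branch picks the mask.
target_bin = torch.randint(0, 2, (2, 80, 28, 28)).float()
mask_loss = F.binary_cross_entropy_with_logits(logits, target_bin)
print(fcn_loss.item(), mask_loss.item())
```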

In more detail: during training, the authors define a multi-task loss on each sampled RoI, L = L_{cls} + L_{box} + L_{mask}; the first two terms are standard. For each RoI, the mask branch produces a K·m²-dimensional output that encodes K binary masks of resolution m × m, one for each of the K classes. The authors therefore apply a per-pixel sigmoid and define L_{mask} as the average binary cross-entropy loss. For an RoI whose ground-truth class is k, L_{mask} is defined only on the k-th mask; the other mask outputs do not contribute to the loss. This definition allows the network to generate a mask for every class without competition among classes.
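
A minimal sketch of this loss definition, assuming mask logits of shape [N, K, m, m] and integer ground-truth classes (the tensor names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    """mask_logits: [N, K, m, m]; gt_classes: [N] ints in [0, K); gt_masks: [N, m, m].

    Only the k-th mask of an RoI whose ground-truth class is k contributes;
    the remaining K - 1 mask outputs are ignored by the loss.
    """
    n = mask_logits.shape[0]
    picked = mask_logits[torch.arange(n), gt_classes]  # [N, m, m]
    return F.binary_cross_entropy_with_logits(picked, gt_masks.float())

logits = torch.randn(8, 80, 28, 28)
loss = mask_loss(logits, torch.randint(0, 80, (8,)), torch.randint(0, 2, (8, 28, 28)))
print(loss.item())
```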

4. Mask representation

A mask encodes the spatial layout of its input object. The authors use an FCN to predict an m × m mask from each RoI, which preserves the spatial structure explicitly rather than collapsing it into a vector representation that lacks spatial dimensions.
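
A small shape experiment, with illustrative sizes, shows why a convolutional head preserves layout while a fully connected head does not:

```python
import torch
import torch.nn as nn

roi_feat = torch.randn(1, 256, 14, 14)  # RoIAlign output for one RoI

# FCN head: every layer keeps an explicit spatial grid.
fcn = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
    nn.Conv2d(256, 1, 1),
)
print(fcn(roi_feat).shape)  # [1, 1, 28, 28]: the m x m layout is preserved

# fc head: flattening destroys the pixel-to-pixel correspondence.
fc = nn.Sequential(nn.Flatten(), nn.Linear(256 * 14 * 14, 28 * 28))
print(fc(roi_feat).shape)   # [1, 784]: a vector with no spatial dimensions
```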

Copyright notice: this article was created by [Xiaobai learns vision]. Please include a link to the original when reposting. Original article: https://yzsam.com/2022/02/202202141026251707.html