Interpretation of the Mask R-CNN Paper
2022-07-05 01:36:00 【Xiaobai learns vision】
Introduction to Mask R-CNN
Mask R-CNN is an incremental improvement built on Faster R-CNN. Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. To fix this misalignment, the authors propose a simple, quantization-free layer called RoIAlign, which faithfully preserves exact spatial locations. Although it looks like a minor change, RoIAlign has a large impact: it improves mask accuracy by a relative 10% to 50%, with larger gains under stricter localization metrics. Second, the authors find it essential to decouple mask prediction from class prediction, so a binary mask is predicted for each class independently. With these improvements, the final Mask R-CNN model outperforms all previous single-model results on the COCO instance segmentation task. The model runs at about 200 ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine.
The idea behind Mask R-CNN is simple and clear. Faster R-CNN has two outputs for each candidate object: a class label and a bounding-box offset. Mask R-CNN adds a third branch that outputs the object mask. The mask output differs from the class and box outputs in that it requires a much finer extraction of the object's spatial layout. The following sections introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, the main piece missing from Fast/Faster R-CNN.
How Mask R-CNN works
Mask R-CNN adopts the same two-stage procedure as Faster R-CNN. The first stage, called the RPN (Region Proposal Network), proposes candidate object bounding boxes. The second stage is in essence Fast R-CNN: it uses RoIPool to extract features from each candidate box and performs classification and bounding-box regression. Mask R-CNN additionally outputs a binary mask for each RoI. Readers are referred to Huang et al. (2016), "Speed/accuracy trade-offs for modern convolutional object detectors", for a detailed comparison between Faster R-CNN and other frameworks.
A mask encodes the spatial layout of an object. Unlike class labels or box offsets, the spatial structure of a mask can be extracted naturally through the pixel-to-pixel correspondence that convolutions provide.
RoIAlign: RoIPool is the standard operation for extracting a small feature map (for example, 7×7) from each RoI.
Network architecture: To demonstrate the generality of Mask R-CNN, the authors instantiate it with multiple architectures. For clarity, they distinguish between the convolutional backbone architecture, which extracts features over the entire image, and the network head, which performs bounding-box recognition (classification and regression) and mask prediction for each RoI.
The modifications made to the Faster R-CNN network are:
(1) The RoI Pooling layer is replaced with RoIAlign;
(2) A parallel FCN branch (the mask branch) is added.
Technical points
1. A stronger backbone
ResNeXt-101 + FPN is used as the feature extraction network, achieving state-of-the-art results.
2. The RoIAlign layer
RoIPool is the standard operation for extracting a small, fixed-size feature map (e.g. 7×7) from each RoI; it solves the problem of mapping RoIs of different sizes to features of the same size. RoIPool first quantizes the floating-point RoI to the discrete granularity of the feature map, then divides the quantized RoI into spatial bins, which are themselves quantized, and finally applies max pooling to each bin to produce the result.
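The three RoIPool steps described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function name `roi_pool`, the nested-list feature map, the default stride of 16, and the floor/ceil rule for snapping bin edges are all assumptions made for the example, and the RoI is assumed to lie inside the image.

```python
import math


def roi_pool(feature, roi, stride=16, out_size=2):
    """Max-pool one RoI of a 2D feature map into an out_size x out_size grid.

    feature: list of rows (H x W) of numbers.
    roi: (x1, y1, x2, y2) in input-image pixels; floats are allowed.
    """
    height, width = len(feature), len(feature[0])

    # Quantization 1: snap the floating-point RoI to whole feature-map cells.
    x1, y1, x2, y2 = (math.floor(c / stride + 0.5) for c in roi)
    x1 = min(max(x1, 0), width - 1)
    y1 = min(max(y1, 0), height - 1)
    x2 = min(max(x2, x1 + 1), width)   # keep at least one cell
    y2 = min(max(y2, y1 + 1), height)
    roi_w, roi_h = x2 - x1, y2 - y1

    pooled = []
    for by in range(out_size):
        # Quantization 2: bin edges are snapped to integer cells as well.
        ys = y1 + math.floor(by * roi_h / out_size)
        ye = y1 + math.ceil((by + 1) * roi_h / out_size)
        row = []
        for bx in range(out_size):
            xs = x1 + math.floor(bx * roi_w / out_size)
            xe = x1 + math.ceil((bx + 1) * roi_w / out_size)
            # Max pooling over the cells that fall into this bin.
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


feature = [[y * 4 + x for x in range(4)] for y in range(4)]
print(roi_pool(feature, (0, 0, 64, 64)))
print(roi_pool(feature, (5, 5, 70, 70)))  # shifted RoI, identical output
```

In this sketch the two calls return the same grid even though the second RoI is shifted by several input pixels, which is exactly the information that rounding throws away.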
Quantization is performed, for example, on a continuous coordinate x by computing [x/16], where 16 is the feature map stride and [·] denotes rounding. These quantizations introduce misalignments between the RoI and the extracted features. Because classification is robust to small translations, the effect there is relatively small. However, the misalignment has a large negative effect when predicting pixel-accurate masks.
To address this, the authors propose the RoIAlign layer, which properly aligns the extracted features with the input. The change is simple: avoid any quantization of the RoI boundaries or bins, for example by using x/16 instead of [x/16]. Bilinear interpolation is used to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and the results are aggregated (using max or average).
The following example illustrates the misalignment. As shown in the figure, consider a Faster R-CNN detection framework. The input is an 800×800 image containing a 665×665 bounding box (enclosing a dog). After the backbone network extracts features, the feature map stride is 32, so the side lengths of the image and the bounding box on the feature map are 1/32 of their input values. 800 divides evenly by 32 to give 25, but 665 divided by 32 gives 20.78, which has a fractional part, so RoI Pooling quantizes it to 20. Next, the features inside the box must be pooled to a size of 7×7, so the bounding box is divided evenly into 7×7 rectangular regions. Each region then has a side length of 20/7 ≈ 2.86, again fractional, so RoI Pooling quantizes it to 2. After these two quantizations, the candidate region has drifted noticeably (as shown in the green part of the figure). More importantly, a deviation of 0.1 pixel on this feature map corresponds to 3.2 pixels in the original image, so a deviation of about 0.8 corresponds to roughly 26 pixels in the original image (0.8 × 32 = 25.6), which is a substantial error.
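The arithmetic of this example can be reproduced with a few lines of Python. The function name `quantization_example` and the dictionary keys are hypothetical, chosen only for this illustration; the numbers (800, 665, stride 32, 7 bins) come from the example above.

```python
import math


def quantization_example(image_side=800, box_side=665, stride=32, bins=7):
    """Trace the two RoI Pooling quantizations from the worked example."""
    box_on_feature = box_side / stride            # 20.78125 feature cells
    box_quantized = math.floor(box_on_feature)    # first quantization
    bin_side = box_quantized / bins               # about 2.86 cells per bin
    bin_quantized = math.floor(bin_side)          # second quantization
    return {
        "feature_side": image_side / stride,
        "box_on_feature": box_on_feature,
        "box_quantized": box_quantized,
        "bin_side": bin_side,
        "bin_quantized": bin_quantized,
        # Input pixels dropped by the first quantization alone.
        "lost_pixels_step1": (box_on_feature - box_quantized) * stride,
        # Input pixels actually covered by bins * bin_quantized cells.
        "covered_pixels": bins * bin_quantized * stride,
    }


for name, value in quantization_example().items():
    print(f"{name}: {value}")
```

Under these assumptions the first rounding alone discards 25 input pixels of the box side, and after the second rounding the 7 bins of width 2 span only 14 feature cells, or 448 of the original 665 pixels.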
The specific procedure and key points are:
- Traverse each candidate region, keeping its floating-point boundaries without quantization.
- Divide the candidate region into k × k cells, leaving the boundaries of each cell unquantized.
- Compute four fixed sampling positions in each cell, obtain the values at these positions by bilinear interpolation, and then apply max pooling.
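The steps listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the names `bilinear` and `roi_align` are hypothetical, the feature map is a nested list, the value at index (i, j) is assumed to sit at continuous coordinate (i, j), and the four sampling points per cell are placed on a regular 2 × 2 grid.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a real-valued (y, x)."""
    height, width = len(feature), len(feature[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_align(feature, roi, stride=16, out_size=2, samples=2, reduce=max):
    """Pool one RoI into out_size x out_size cells without any rounding."""
    # Keep floating-point boundaries: x / stride instead of [x / stride].
    x1, y1, x2, y2 = (c / stride for c in roi)
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    pooled = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            values = []
            # samples * samples fixed positions per cell (4 when samples=2).
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feature, y, x))
            row.append(reduce(values))  # max by default, or an average
        pooled.append(row)
    return pooled


ramp = [[float(x) for x in range(8)] for _ in range(8)]
print(roi_align(ramp, (16, 16, 80, 80)))
print(roi_align(ramp, (24, 16, 88, 80)))  # RoI shifted right by 8 pixels
```

On this horizontal ramp, shifting the RoI by 8 input pixels (half a feature cell at stride 16) shifts every pooled value by 0.5, so sub-cell offsets survive in the output. Passing `reduce=lambda v: sum(v) / len(v)` switches the aggregation from max to average.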
3. Improved segmentation loss
The loss changes from a per-pixel softmax with multinomial cross-entropy to a per-pixel sigmoid with binary cross-entropy. The framework predicts a binary mask for each class independently, so no competition among classes is introduced; the class of the output mask is chosen by the network's RoI classification branch. This differs from FCNs, which perform multi-class classification on every pixel and therefore couple segmentation and classification. Experiments show that this coupled approach performs poorly for instance segmentation.
In more detail, during training the authors define a multi-task loss on each sampled RoI, $L = L_{cls} + L_{box} + L_{mask}$; the first two terms are not covered in detail here. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. A per-pixel sigmoid is applied, and $L_{mask}$ is defined as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is defined only on the $k$-th mask (the other mask outputs do not contribute to the loss). This definition allows the network to generate masks for every class without competition among classes.
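A minimal sketch of the mask loss described above, in plain Python. The function name `mask_loss`, the nested-list layout of the K masks, and the zero-based class index are assumptions for illustration; the numerically stable form of binary cross-entropy on raw scores is a standard identity, not something taken from the paper.

```python
import math


def mask_loss(mask_logits, target_mask, k):
    """Average per-pixel binary cross-entropy on the k-th mask only.

    mask_logits: K masks, each an m x m list of raw (pre-sigmoid) scores.
    target_mask: m x m list of 0/1 ground-truth values for class k.
    k: zero-based index of the RoI's ground-truth class.
    """
    total, count = 0.0, 0
    # Only the k-th mask is read; the other K - 1 masks are ignored.
    for logit_row, target_row in zip(mask_logits[k], target_mask):
        for z, t in zip(logit_row, target_row):
            # Stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


logits = [
    [[0.0, 0.0], [0.0, 0.0]],    # class 0: undecided everywhere
    [[5.0, -5.0], [5.0, -5.0]],  # class 1: confident left column
]
target = [[1, 0], [1, 0]]
print(mask_loss(logits, target, k=1))  # small: prediction matches target
print(mask_loss(logits, target, k=0))  # log(2): every pixel at 0.5
```

Because only `mask_logits[k]` enters the sum, changing the scores of any other class leaves the loss unchanged, which is the "no competition between classes" property in concrete form.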
4. Mask representation
A mask encodes the spatial layout of an input object. The authors use an FCN to predict a mask for each RoI, which preserves the spatial structure information.