Detailed reading of the Stereo R-CNN paper -- Experiments: implementation details and result analysis

2022-07-06 10:57:00 【Is it Wei Xiaobai】

I used to read only the method section of a paper and then skim the results on the test data. While writing my own thesis recently, I realized that how the experiments are designed matters just as much, so I will pay more attention to this part when reading papers from now on.

1. Experimental Details

This part of the paper describes the experimental setup in detail.

Network

The network uses anchors with five scales {32, 64, 128, 256, 512} and three aspect ratios {0.5, 1, 2}. The original image is resized so that its shorter side is 600 pixels. For Stereo-RPN, because the left and right feature maps are concatenated, the final classification and regression layers take 1024 input channels instead of 512. Similarly, the R-CNN regression head takes 512 input channels. On a Titan Xp GPU, Stereo R-CNN takes about 0.28 s to run inference on one stereo pair.
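To make the anchor configuration concrete, here is a minimal sketch that enumerates the 5 x 3 = 15 anchor shapes and the resize factor for the shorter image side. It assumes the scale set is the power-of-two sequence {32, 64, 128, 256, 512}, that a ratio means height / width, and that each anchor has area scale x scale; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def make_anchors(scales=(32, 64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales) * len(ratios), 4) anchors as [x1, y1, x2, y2], centred at the origin."""
    anchors = []
    for s in scales:
        area = float(s) ** 2
        for r in ratios:
            # Assumption: ratio = height / width, and the area stays s * s.
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shorter_side_scale(height, width, target=600):
    """Scale factor that brings the shorter image side to `target` pixels."""
    return target / min(height, width)

anchors = make_anchors()
print(anchors.shape)  # (15, 4)
```

For a KITTI-sized image of roughly 375 x 1242 pixels, `shorter_side_scale(375, 1242)` gives a factor of about 1.6.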

Training

This part mainly explains the loss function.

$\begin{aligned} L &= w_{cls}^{p} L_{cls}^{p} + w_{reg}^{p} L_{reg}^{p} + w_{cls}^{r} L_{cls}^{r} + w_{box}^{r} L_{box}^{r} + w_{\alpha}^{r} L_{\alpha}^{r} + w_{dim}^{r} L_{dim}^{r} + w_{key}^{r} L_{key}^{r} \end{aligned}$

Here $(\cdot)^{p}$ and $(\cdot)^{r}$ denote RPN and R-CNN respectively, and the subscripts box, α, dim, and key denote the losses for the stereo boxes, viewpoint, dimensions, and keypoints respectively.
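The total loss is simply a weighted sum of seven terms. A minimal sketch of that sum follows; the dictionary keys and the function name are hypothetical, and how the weights are chosen is not covered in these notes, so they are treated as given inputs.

```python
# Hypothetical key names: "p_*" are the RPN terms, "r_*" are the R-CNN terms.
LOSS_TERMS = ("p_cls", "p_reg", "r_cls", "r_box", "r_alpha", "r_dim", "r_key")

def total_loss(losses, weights):
    """Weighted sum over the seven loss terms of the multi-task objective."""
    return sum(weights[name] * losses[name] for name in LOSS_TERMS)

# Example with placeholder values (not taken from the paper).
example_losses = {name: 1.0 for name in LOSS_TERMS}
example_weights = {name: 1.0 for name in LOSS_TERMS}
print(total_loss(example_losses, example_weights))  # 7.0
```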

During training, the left and right images are also flipped horizontally and exchanged (with the viewpoint angle and keypoints mirrored accordingly) to augment the dataset. Each mini-batch contains 1 stereo pair and 512 sampled RoIs.
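The flip-and-exchange augmentation can be sketched as follows. After a horizontal flip, the former right image plays the role of the left camera, which is why the two views are swapped. This is a minimal sketch under stated assumptions: images are (H, W, C) arrays, boxes are [x1, y1, x2, y2] in pixel indices, a mirrored coordinate is `width - 1 - x`, and the mirroring of viewpoint angles and keypoints is left out. The function name is hypothetical.

```python
import numpy as np

def flip_and_swap(left_img, right_img, left_boxes, right_boxes):
    """Mirror a stereo pair horizontally and exchange the two views.

    Returns (new_left_img, new_right_img, new_left_boxes, new_right_boxes).
    """
    width = left_img.shape[1]

    def mirror_boxes(boxes):
        out = boxes.copy()
        # x1 and x2 trade places after mirroring so that x1 <= x2 still holds.
        out[:, 0] = width - 1 - boxes[:, 2]
        out[:, 2] = width - 1 - boxes[:, 0]
        return out

    # The mirrored right view becomes the new left view, and vice versa.
    new_left_img = right_img[:, ::-1]
    new_right_img = left_img[:, ::-1]
    return new_left_img, new_right_img, mirror_boxes(right_boxes), mirror_boxes(left_boxes)
```

With this convention the disparity between the new left and new right boxes equals the original disparity, so the augmented pair is still a geometrically valid stereo pair.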

Other settings: the network is trained with SGD, with a weight decay of 0.0005 and a momentum of 0.9. The learning rate is initialized to 0.001 and multiplied by 0.1 every 5 epochs. Training runs for 20 epochs in total.
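Reading the schedule as a step decay that multiplies the learning rate by 0.1 every 5 epochs, the rate at each epoch can be written down as a short function. This is a sketch with a hypothetical function name, assuming epochs are counted from 0.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.1, step=5):
    """Step decay: multiply the base rate by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

for epoch in (0, 5, 10, 15):
    print(epoch, learning_rate(epoch))
```

Over the 20 training epochs the rate therefore goes through four plateaus, starting at 0.001 and dropping by a factor of 10 at epochs 5, 10, and 15.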

2. Result Analysis

Stereo Recall and Stereo Detection

The goal of Stereo R-CNN is to detect and associate objects in the left and right images simultaneously. In addition to evaluating the 2D average recall (AR) and 2D average precision (AP) on the left and right images, the paper defines stereo AR and stereo AP metrics, where a queried stereo box counts as a true positive (TP) only if all of the following conditions are met:

1. The maximum IoU of the left box with the left GT boxes is greater than the given threshold;
2. The maximum IoU of the right box with the right GT boxes is greater than the given threshold;
3. The selected left and right GT boxes belong to the same object.
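The three conditions above can be sketched as a small check. This is an illustrative sketch rather than the paper's evaluation code: the function names are hypothetical, boxes are [x1, y1, x2, y2], the default threshold of 0.7 is an assumption, and `left_gts[i]` and `right_gts[i]` are assumed to describe the same object (both arrays non-empty).

```python
import numpy as np

def iou(box, gt_boxes):
    """IoU between one box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], gt_boxes[:, 0])
    y1 = np.maximum(box[1], gt_boxes[:, 1])
    x2 = np.minimum(box[2], gt_boxes[:, 2])
    y2 = np.minimum(box[3], gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    gt_area = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area + gt_area - inter)

def is_stereo_true_positive(left_box, right_box, left_gts, right_gts, threshold=0.7):
    """Check the three stereo true-positive conditions for one queried stereo box."""
    left_iou = iou(left_box, left_gts)
    right_iou = iou(right_box, right_gts)
    left_best = int(np.argmax(left_iou))    # GT selected by the left box
    right_best = int(np.argmax(right_iou))  # GT selected by the right box
    return bool(
        left_iou[left_best] > threshold       # condition 1
        and right_iou[right_best] > threshold # condition 2
        and left_best == right_best           # condition 3: same object
    )
```

A stereo box whose left half matches one car and whose right half matches a different car fails condition 3 even when both IoUs are high, which is what makes the stereo metrics stricter than the per-image ones.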

As shown in Table 1, compared with Faster R-CNN, Stereo R-CNN has similar proposal recall and detection precision on a single image, while also producing high-quality data association between the left and right images without additional computation.

Although the stereo AR in RPN is slightly lower than the left AR, the left, right, and stereo AP after R-CNN are almost the same. This shows that detection performance is consistent across the left and right images, and that almost every true positive box in the left image has a corresponding true positive box in the right image.

In addition, two strategies for fusing the left and right features are tested: element-wise mean and channel concatenation. As reported in Table 1, channel concatenation performs better because it keeps all the information.
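The difference between the two fusion strategies is easy to see from the output shapes. The sketch below uses NumPy arrays in (channels, height, width) layout; the spatial size is an arbitrary placeholder and the function names are hypothetical.

```python
import numpy as np

def fuse_mean(left_feat, right_feat):
    """Element-wise mean: channel count is unchanged, left/right identity is lost."""
    return (left_feat + right_feat) / 2.0

def fuse_concat(left_feat, right_feat):
    """Channel concatenation: channel count doubles, both feature maps are kept."""
    return np.concatenate([left_feat, right_feat], axis=0)

left_feat = np.random.rand(512, 8, 16)   # placeholder spatial size
right_feat = np.random.rand(512, 8, 16)
print(fuse_mean(left_feat, right_feat).shape)    # (512, 8, 16)
print(fuse_concat(left_feat, right_feat).shape)  # (1024, 8, 16)
```

The doubled channel count of the concatenated features is what leads to the 1024 input channels noted for Stereo-RPN in the network details.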

Together, these results show that accurate stereo detection and association provide sufficient box-level constraints for 3D box estimation.

3D Detection and 3D Localization

Average precision for bird's eye view ($AP_{bv}$) and for 3D boxes ($AP_{3d}$) is used to evaluate 3D detection and localization accuracy. The results are listed in Table 2. I will not repeat the detailed comparison here; you can read it directly in the paper.

It is worth noting that the KITTI 3D detection benchmark is difficult for image-based methods, whose 3D performance tends to drop as the distance to the object increases. This can be seen directly in Figure 7: although the method achieves sub-pixel disparity estimation (error below 0.5 pixels), disparity is inversely proportional to depth, so the depth error grows as the object gets farther away. For objects with clear disparity, the method achieves highly accurate depth estimation based on strict geometric constraints. This explains why the improvement over other methods is larger at higher IoU thresholds and in easier difficulty regimes.
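The growth of depth error with distance follows from the stereo relation $z = f b / d$: differentiating with respect to the disparity $d$ gives $|\Delta z| \approx \frac{z^{2}}{f b} |\Delta d|$, so a fixed disparity error costs quadratically more depth accuracy at longer range. The sketch below evaluates this first-order estimate; the focal length and baseline are assumed approximate KITTI values, not figures from these notes, and the function names are hypothetical.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation z = f * b / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(depth_m, disparity_error_px, focal_px, baseline_m):
    """First-order depth error: |dz| ~= z^2 / (f * b) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

FOCAL_PX = 721.5    # assumption: approximate KITTI focal length in pixels
BASELINE_M = 0.54   # assumption: approximate KITTI stereo baseline in metres

for z in (10.0, 20.0, 40.0):
    err = depth_error(z, 0.5, FOCAL_PX, BASELINE_M)
    print(f"depth {z:5.1f} m -> depth error for 0.5 px disparity error: {err:.2f} m")
```

Under these assumptions, doubling the distance quadruples the depth error for the same 0.5 pixel disparity error, which matches the trend described for Figure 7.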