
CVPR 2022 Best Student Paper: Estimating Object Pose in 3D Space from a Single Image

2022-07-05 17:27:00 【PaperWeekly】

Author | Hansheng Chen

Affiliation | Tongji University, Alibaba DAMO Academy

Source | Jiqizhixin (Synced)

Shortly after the CVPR 2022 awards were announced, Hansheng Chen, a research intern from Tongji University and Alibaba DAMO Academy, walks us through the paper that won the Best Student Paper Award.

This article introduces our work, which received the CVPR 2022 Best Student Paper Award. The paper studies how to estimate the pose of an object in 3D space from a single image. Among existing methods, pose estimators based on PnP geometric optimization typically use a deep network to extract 2D-3D correspondences. However, the optimal pose solution is non-differentiable during backpropagation, so it is difficult to train the network stably end-to-end with the pose error as the loss. The 2D-3D correspondences then have to be supervised by surrogate losses, which are not the best training objective for pose estimation.

To solve this problem, we start from theory and propose the EPro-PnP module, which outputs a probability density over poses rather than a single optimal solution. Replacing the non-differentiable optimal pose with a differentiable probability density enables stable end-to-end training. EPro-PnP is highly general: it applies to a variety of tasks and data, can be used to improve existing PnP-based pose estimation methods, and is flexible enough to train entirely new networks. In a more general sense, EPro-PnP essentially brings the softmax commonly used in classification into the continuous domain, and in theory it can be extended to training general models with nested optimization layers.

Paper title:

EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation

Paper link:

https://arxiv.org/abs/2203.13254

Code link:

https://github.com/tjiiv-cprg/EPro-PnP

Preface

We study a classic problem in 3D vision: locating a 3D object from a single RGB image. Specifically, given an image containing the projection of a 3D object, our goal is to determine the rigid transformation from the object coordinate system to the camera coordinate system. This rigid transformation is called the pose of the object and is denoted $y$. It has two components: 1) the position component, which can be represented by a 3x1 translation vector $t$, and 2) the orientation component, which can be represented by a 3x3 rotation matrix $R$.
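As a minimal sketch of what the pose means in practice, the snippet below applies a rotation and a translation to map a point from object coordinates to camera coordinates. The helper names (`rotation_about_y`, `transform_point`) and the numbers are illustrative assumptions, not part of the paper or its code release.

```python
import math


def rotation_about_y(yaw):
    """3x3 rotation matrix for a rotation of `yaw` radians about the y axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def transform_point(R, t, x_obj):
    """Map a 3D point from object coordinates to camera coordinates: R x + t."""
    return [sum(R[r][c] * x_obj[c] for c in range(3)) + t[r] for r in range(3)]


# Hypothetical pose: object turned 90 degrees, placed 5 units in front of the camera.
R = rotation_about_y(math.pi / 2)
t = [0.0, 0.0, 5.0]
x_cam = transform_point(R, t, [1.0, 0.0, 0.0])
```

Here the object-frame point on the x axis ends up one unit closer to the camera than the object origin, which is the kind of geometric relation a pose estimator has to recover from the image alone.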

To address this problem, existing methods fall into two categories: explicit and implicit. Explicit methods, also called direct pose prediction, use a feed-forward network (FFN) to directly output each component of the object pose. Typically, this means: 1) predicting the depth of the object, 2) finding the 2D projection of the object's center on the image, and 3) predicting the orientation of the object (the specific handling of orientation can be complicated).

With image data annotated with ground-truth object poses, a loss function can directly supervise the pose predictions, which makes end-to-end training straightforward. However, such networks lack interpretability and tend to overfit on small datasets. In 3D object detection, explicit methods dominate, especially on large datasets (for example, nuScenes).

Implicit methods estimate pose through geometric optimization; the most typical representative is PnP-based pose estimation. In this kind of method, we first find $N$ 2D points in the image coordinate system (the 2D coordinates of the $i$-th point are denoted $x_i^{2D}$) and the $N$ associated 3D points in the object coordinate system (the 3D coordinates of the $i$-th point are denoted $x_i^{3D}$). Sometimes a correspondence weight is also needed for each pair of points (the weight of the $i$-th pair is denoted $w_i^{2D}$). Under the perspective projection constraint, these $N$ weighted 2D-3D correspondences implicitly define the optimal pose of the object. Specifically, we can find the pose $y^*$ that minimizes the reprojection error:

$$y^* = \arg\min_y \frac{1}{2} \sum_{i=1}^{N} \left\| f_i(y) \right\|^2, \qquad f_i(y) = w_i^{2D} \circ \left( \pi(R x_i^{3D} + t) - x_i^{2D} \right)$$

where $f_i(y)$ denotes the weighted reprojection error, which is a function of the pose $y$; $\pi(\cdot)$ denotes the camera projection function, which includes the intrinsic parameters; and $\circ$ denotes the element-wise product. PnP methods are commonly used for 6-DoF pose estimation when the geometry of the object is known.
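The weighted reprojection cost that PnP minimizes can be sketched in plain Python as follows. This is an illustration under stated assumptions: a simple pinhole camera without distortion, per-axis weights, and made-up intrinsics and points. The function names are hypothetical and do not come from the EPro-PnP repository.

```python
def project(x_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    X, Y, Z = x_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)


def to_camera(R, t, x_obj):
    """Apply the pose: rotate an object-frame point by R, then translate by t."""
    return [sum(R[r][c] * x_obj[c] for c in range(3)) + t[r] for r in range(3)]


def pnp_cost(R, t, pts3d, pts2d, weights, intrinsics):
    """Half the sum of squared, per-axis weighted reprojection errors."""
    cost = 0.0
    for x3d, x2d, w in zip(pts3d, pts2d, weights):
        u, v = project(to_camera(R, t, x3d), *intrinsics)
        ru, rv = w[0] * (u - x2d[0]), w[1] * (v - x2d[1])
        cost += 0.5 * (ru * ru + rv * rv)
    return cost


# Hypothetical toy data: identity rotation, object 4 units in front of the camera.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (made-up values)
t_true = [0.0, 0.0, 4.0]
pts3d = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, -1.0, 1.0), (0.5, -0.5, -1.0)]
pts2d = [project(to_camera(IDENTITY, t_true, p), *intrinsics) for p in pts3d]
weights = [(1.0, 1.0)] * len(pts3d)

cost_at_truth = pnp_cost(IDENTITY, t_true, pts3d, pts2d, weights, intrinsics)
cost_shifted = pnp_cost(IDENTITY, [0.1, 0.0, 4.0], pts3d, pts2d, weights, intrinsics)
```

With noise-free correspondences the cost vanishes at the true pose and grows as the pose moves away from it. A PnP solver searches for the pose that minimizes this quantity.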

PnP-based methods also require a feed-forward network to predict the set of 2D-3D correspondences $X = \{x_i^{3D}, x_i^{2D}, w_i^{2D} \mid i = 1, \dots, N\}$. Compared with direct pose prediction, this combination of deep learning and a traditional geometric vision algorithm is highly interpretable and generalizes stably. However, the way such models were trained in previous work has shortcomings. Many methods construct surrogate loss functions to supervise the intermediate result $X$, which is not the optimal objective for pose.

For example, when the object shape is known, 3D keypoints on the object can be selected in advance, and the network is then trained to find the corresponding 2D projections. This also means that a surrogate loss can only learn some of the variables in $X$, so it is not flexible enough. What if the object shapes in the training set are unknown and everything in $X$ has to be learned from scratch?

The strengths of explicit and implicit methods are complementary. If the network could learn the correspondence set $X$ end-to-end by supervising the pose output by PnP, the advantages of both could be combined. To this end, some recent studies backpropagate through the PnP layer using implicit differentiation. However, the argmin in PnP is discontinuous and non-differentiable at some points, which makes backpropagation unstable and makes direct training hard to converge.

Introduction to the EPro-PnP Method

2.1 The EPro-PnP module

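For stable end-to-end training, we propose end-to-end probabilistic PnP, or EPro-PnP. The basic idea is to treat the implicit pose as a probability distribution, whose probability density $p(y|X)$ is differentiable with respect to $X$. First, the likelihood function of the pose is defined from the reprojection error, where $f_i(y)$ is the weighted reprojection error of the $i$-th correspondence at pose $y$:

$$p(X|y) = \exp\left(-\frac{1}{2}\sum_{i=1}^{N}\|f_i(y)\|^2\right)$$

With an uninformative prior, the posterior probability density of the pose is the normalized likelihood:

$$p(y|X) = \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^{N}\|f_i(y)\|^2\right)}{\int \exp\left(-\frac{1}{2}\sum_{i=1}^{N}\|f_i(y)\|^2\right) dy}$$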

Note that the formula above is very close to the softmax commonly used in classification, $\mathrm{softmax}(z)_k = \exp(z_k) / \sum_j \exp(z_j)$. In fact, EPro-PnP essentially moves softmax from the discrete domain to the continuous domain, replacing the summation $\sum$ with the integral $\int$.
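The softmax analogy can be checked numerically on a toy problem. The sketch below assumes a one-dimensional "pose" and a hypothetical quadratic cost standing in for the summed squared reprojection error. It discretizes the pose onto a fine grid, applies an ordinary softmax to the negative cost, and divides by the grid step to obtain an approximate density.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cost(y):
    """Hypothetical 1D stand-in for half the summed squared reprojection error."""
    return 0.5 * ((y - 1.0) / 0.5) ** 2


step = 0.01
grid = [-4.0 + step * k for k in range(1001)]   # poses from -4 to 6
probs = softmax([-cost(y) for y in grid])       # discrete class probabilities
density = [p / step for p in probs]             # approximate continuous density
peak = grid[max(range(len(grid)), key=density.__getitem__)]
```

As the grid becomes finer, the softmax sum in the denominator turns into the integral that normalizes the pose posterior. For this particular cost the resulting density is a Gaussian centered at 1 with standard deviation 0.5.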

2.2 KL divergence loss

During training, the ground-truth pose $y_{gt}$ of the object is known, so a target pose distribution $t(y)$ can be defined. The KL divergence $D_{KL}(t(y) \,\|\, p(y|X))$ can then be used as the loss for training the network (because $t(y)$ is fixed, it can also be understood as a cross-entropy loss). When the target $t(y)$ approaches a Dirac delta function, the KL-divergence loss simplifies to the following form, where $f_i(y)$ is the weighted reprojection error of the $i$-th correspondence at pose $y$:

$$L_{KL} = \frac{1}{2}\sum_{i=1}^{N}\|f_i(y_{gt})\|^2 + \log\int\exp\left(-\frac{1}{2}\sum_{i=1}^{N}\|f_i(y)\|^2\right) dy$$

Taking its derivative gives:

$$\frac{\partial L_{KL}}{\partial(\cdot)} = \frac{\partial}{\partial(\cdot)}\frac{1}{2}\sum_{i=1}^{N}\|f_i(y_{gt})\|^2 - \mathbb{E}_{y\sim p(y|X)}\,\frac{\partial}{\partial(\cdot)}\frac{1}{2}\sum_{i=1}^{N}\|f_i(y)\|^2$$

Thus the loss consists of two terms. The first term (denoted $L_{tgt}$) tries to reduce the reprojection error at the ground-truth pose $y_{gt}$, while the second term (denoted $L_{pred}$) tries to increase the reprojection error wherever the predicted pose distribution $p(y|X)$ places probability. The two act in opposite directions, as shown in the left panel of the figure. As an analogy, the right panel shows the classification cross-entropy loss commonly used to train classification networks.
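The two opposing gradient terms can be verified on a toy model. The sketch below is an assumption-laden illustration, not the paper's implementation: the pose is one-dimensional, the cost is `0.5 * (w * (y - mu))**2` with a single learnable weight `w`, and the integral is evaluated on a dense grid. It compares the two-term gradient (gradient at the ground truth minus the expected gradient under the predicted distribution) with a finite-difference gradient of the loss.

```python
import math


def cost(y, mu, w):
    """Toy 1D cost: half the squared weighted residual."""
    return 0.5 * (w * (y - mu)) ** 2


def dcost_dw(y, mu, w):
    """Derivative of the toy cost with respect to the weight w."""
    return w * (y - mu) ** 2


STEP = 0.001
GRID = [-10.0 + STEP * k for k in range(20001)]  # dense grid for numerical integration


def kl_loss(y_gt, mu, w):
    """Cost at the ground truth plus the log of the normalizing integral."""
    log_norm = math.log(sum(math.exp(-cost(y, mu, w)) for y in GRID) * STEP)
    return cost(y_gt, mu, w) + log_norm


def kl_loss_grad_w(y_gt, mu, w):
    """Two-term gradient: target term minus expectation under the posterior."""
    dens = [math.exp(-cost(y, mu, w)) for y in GRID]
    expected = sum(d * dcost_dw(y, mu, w) for d, y in zip(dens, GRID)) / sum(dens)
    return dcost_dw(y_gt, mu, w) - expected


y_gt, mu, w = 0.8, 0.5, 2.0
two_term = kl_loss_grad_w(y_gt, mu, w)
eps = 1e-5
finite_diff = (kl_loss(y_gt, mu, w + eps) - kl_loss(y_gt, mu, w - eps)) / (2 * eps)
```

In this toy setting the first term alone would drive `w` toward zero, which trivially shrinks the error at the ground truth. The second term counteracts that by rewarding a concentrated distribution, so the loss is minimized where the spread of the distribution matches the actual error.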

2.3 Monte Carlo pose loss

Note that the second term of the KL loss, $L_{pred}$, contains an integral with no analytical solution, so it must be approximated numerically. Weighing generality, accuracy, and computational efficiency, we use a Monte Carlo method and approximate the pose distribution by sampling.

Specifically, we use an importance sampling algorithm, Adaptive Multiple Importance Sampling (AMIS), to compute $K$ pose samples $y_j$ with importance weights $v_j$. We call this process Monte Carlo PnP.

Accordingly, the second term $L_{pred}$ can be approximated as a function of the weights $v_j$, through which gradients can be backpropagated:

$$L_{pred} \approx \log\frac{1}{K}\sum_{j=1}^{K} v_j, \qquad v_j = \frac{\exp\left(-\frac{1}{2}\sum_{i=1}^{N}\|f_i(y_j)\|^2\right)}{q(y_j)}$$

where $q(y)$ is the proposal density from which the samples are drawn and $f_i(y)$ is the weighted reprojection error of the $i$-th correspondence at pose $y$.
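The Monte Carlo approximation of the log-integral can be illustrated with plain importance sampling. Note the simplifications: this sketch uses a single fixed Gaussian proposal rather than AMIS, a one-dimensional pose, and a hypothetical quadratic cost for which the exact integral is known, so that the estimate can be compared against it.

```python
import math
import random


def cost(y, mu=0.5, w=2.0):
    """Toy 1D cost standing in for half the summed squared reprojection error."""
    return 0.5 * (w * (y - mu)) ** 2


def normal_pdf(y, mean, std):
    """Density of the Gaussian proposal distribution."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def monte_carlo_log_integral(num_samples, seed=0, prop_mean=0.5, prop_std=1.0):
    """Estimate log of the integral of exp(-cost) with importance sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        y = rng.gauss(prop_mean, prop_std)                        # pose sample
        total += math.exp(-cost(y)) / normal_pdf(y, prop_mean, prop_std)  # weight
    return math.log(total / num_samples)


estimate = monte_carlo_log_integral(20000)
exact = math.log(math.sqrt(2.0 * math.pi) / 2.0)  # closed form for this toy cost
```

Each summand is one importance weight, so the estimate is the log of the mean weight. A practical implementation would typically work in log space (log-sum-exp) for numerical stability and adapt the proposal over several iterations, which is what AMIS adds on top of this basic scheme.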

The accompanying figure visualizes the pose sampling process.

2.4 Derivative regularization for the PnP solver

Although the Monte Carlo PnP loss can train the network to produce a high-quality pose distribution, at inference time a PnP optimization solver is still needed to obtain the optimal pose. The commonly used Gauss-Newton algorithm and its variants solve the problem by iterative optimization, with the iteration increment determined by the first and second derivatives of the cost function. To bring the PnP solution $y^*$ closer to the ground truth $y_{gt}$, the derivatives of the cost function can be regularized. The regularization loss is designed as follows:

$$L_{reg} = l(y^* + \Delta y,\; y_{gt})$$

where $\Delta y$ is the Gauss-Newton iteration increment, which depends on the first and second derivatives of the cost function and can be backpropagated through, and $l(\cdot, \cdot)$ denotes a distance metric: smooth L1 for position and cosine similarity for orientation. When $y^*$ and $y_{gt}$ disagree, this loss pushes the iteration increment $\Delta y$ to point toward the ground truth.
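A minimal sketch of the idea, assuming a one-dimensional position and linear residuals of the form `w_i * (y - m_i)`: the Gauss-Newton increment is computed from the first and second derivatives of the cost, and the regularization loss measures how far the stepped estimate lands from the ground truth. The function names and the `beta` parameter of the smooth L1 are illustrative assumptions.

```python
def gauss_newton_increment(y, targets, weights):
    """One Gauss-Newton step for residuals r_i(y) = w_i * (y - m_i) in 1D."""
    grad = sum(w * w * (y - m) for m, w in zip(targets, weights))  # J^T r
    hess = sum(w * w for w in weights)                             # J^T J
    return -grad / hess


def smooth_l1(diff, beta=1.0):
    """Smooth L1 distance: quadratic near zero, linear for large errors."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def regularization_loss(y_star, y_gt, targets, weights):
    """Distance between the stepped estimate y_star + delta and the ground truth."""
    delta = gauss_newton_increment(y_star, targets, weights)
    return smooth_l1(y_star + delta - y_gt)


# Hypothetical example: two residual targets, ground truth at 2.5.
loss_equal = regularization_loss(0.0, 2.5, [1.0, 3.0], [1.0, 1.0])
loss_reweighted = regularization_loss(0.0, 2.5, [1.0, 3.0], [1.0, 3.0])
```

Changing the weights changes where the Gauss-Newton step lands, and therefore changes the loss. In a framework with automatic differentiation, that dependence is what lets the gradient of this loss reach the predicted correspondences and weights.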

Pose Estimation Networks Based on EPro-PnP

We use different networks for the two subtasks of 6-DoF pose estimation and 3D object detection. For 6-DoF pose estimation, we slightly modify the CDPN network from ICCV 2019 and train it with EPro-PnP for ablation studies. For 3D object detection, we design a new deformable correspondence detection head on top of FCOS3D from ICCVW 2021, to show that EPro-PnP can train a network to learn all 2D-3D points and correspondence weights directly, without knowledge of the object shape, which demonstrates the flexibility of EPro-PnP in applications.

3.1 Dense correspondence network for 6-DoF pose estimation

The network structure is shown in the figure above; only the output layer of the original CDPN is modified. The original CDPN crops the region image using the detected 2D box of the object and feeds it into a ResNet34 backbone. It decouples position and orientation into two branches: the position branch uses the explicit method of direct prediction, while the orientation branch uses the implicit method of dense correspondences and PnP.

To study EPro-PnP, the modified network keeps only the dense correspondence branch, which outputs 3-channel 3D coordinates and 2-channel correspondence weights. The weights pass through a spatial softmax and global weight scaling. The spatial softmax normalizes the weights $w_i^{2D}$ so that they behave like an attention map and focus on relatively important regions; experiments show that weight normalization is also key to stable convergence. Global weight scaling reflects how concentrated the pose distribution $p(y|X)$ is. The network can be trained with the EPro-PnP Monte Carlo pose loss alone; derivative regularization can be added, as can an additional 3D coordinate regression loss when the object shape is known.
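The weight post-processing described above can be sketched as follows, assuming a single weight channel stored as a nested list and a positive scalar `global_scale`. How the actual network parameterizes the scale is not stated here, so treat that part as an assumption.

```python
import math


def spatial_softmax(logit_map):
    """Softmax over all pixels of one H x W channel, so the weights sum to 1."""
    m = max(v for row in logit_map for v in row)
    exps = [[math.exp(v - m) for v in row] for row in logit_map]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def scaled_weights(logit_map, global_scale):
    """Normalized attention-like weights multiplied by one global scale."""
    return [[global_scale * w for w in row] for row in spatial_softmax(logit_map)]


# Hypothetical 2 x 2 weight logits predicted for one channel.
logits = [[0.0, 1.0], [2.0, 3.0]]
attention = spatial_softmax(logits)
weights = scaled_weights(logits, global_scale=5.0)
```

The softmax fixes the relative importance of pixels, while the single scale controls the overall magnitude of the weights. Larger weights make the cost steeper around its minimum, which corresponds to a more concentrated pose distribution.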

3.2 Deformable correspondence network for 3D object detection

The network structure is shown in the figure above. Broadly, it builds on the FCOS3D detector and borrows from the design of Deformable DETR. On top of FCOS3D, we keep its centerness and classification layers and replace the original pose prediction layers with object embedding and reference point layers, which generate object queries. Following Deformable DETR, we obtain the 2D sampling locations (that is, $x_i^{2D}$) by predicting offsets relative to the reference point. The sampled features are aggregated into an object feature by an attention operation, which is used to predict object-level results (3D score, weight scale, 3D box size, and so on).

In addition, the sampled feature of each point, after the object embedding is added and self-attention is applied, is used to output the corresponding 3D coordinates $x_i^{3D}$ and correspondence weights $w_i^{2D}$. All of the predicted $X$ can be trained with the EPro-PnP Monte Carlo pose loss alone, which converges without additional regularization and achieves high accuracy. On this basis, derivative regularization and auxiliary losses can be added to further improve accuracy.

Experimental Results

4.1 6-DoF pose estimation task

Experiments use the LineMOD dataset, with a strict comparison against the CDPN baseline; the main results are as follows. Adding the EPro-PnP loss for end-to-end training significantly improves accuracy (+12.70). Adding the derivative regularization loss improves accuracy further. On top of that, initializing from the original CDPN training result and training for more epochs (keeping the total number of epochs consistent with the three full training stages of the original CDPN) improves accuracy again. Part of the advantage of pretraining with CDPN comes from CDPN's additional mask supervision.

The above compares EPro-PnP with various leading methods. Although it is built by improving the comparatively dated CDPN, EPro-PnP comes close to the SOTA in accuracy. Its architecture is also simple: it is based purely on PnP pose estimation and needs no additional explicit depth estimation or pose refinement, so it also has an advantage in efficiency.

4.2 3D object detection task

Experiments use the nuScenes dataset; the comparison with other methods is shown in the figure above. EPro-PnP not only improves significantly over FCOS3D but also surpasses the then-SOTA PGD, another improved version of FCOS3D. More importantly, EPro-PnP is currently the only method on the nuScenes dataset that estimates pose by geometric optimization. Because nuScenes is large, direct pose estimation networks trained end-to-end already perform well; our results show that, with end-to-end training, models based on geometric optimization can achieve even better performance on large datasets.

4.3 Visual analysis

The figure above shows predictions of the dense correspondence network trained with EPro-PnP. The correspondence weight map $w^{2D}$ highlights important regions in the image, similar to an attention mechanism. Analysis of the loss function indicates that the highlighted regions are those with low reprojection uncertainty and high sensitivity to pose changes.

3D object detection results are shown in the figure above. The upper-left view shows the 2D points sampled by the deformable correspondence network: red marks points with a larger horizontal (X) weight component, and green marks points with a larger vertical (Y) weight component. Green points usually lie at the top and bottom of the object, and their main role is to infer the distance of the object from its height. This behavior was not specified by hand; it emerged entirely from unconstrained training. The right view shows the detection results in the top view, where the blue cloud represents the distribution density of the object center position and reflects the localization uncertainty. In general, distant objects have greater localization uncertainty than nearby ones.

Another important advantage of EPro-PnP is that it can express orientation ambiguity by predicting complex multimodal distributions. As shown in the figure above, a Barrier is rotationally symmetric, so its orientation often has two peaks 180° apart; a Cone has no specific orientation, so the prediction spreads over all directions; a Pedestrian is not fully rotationally symmetric, but because the image is unclear and front and back are hard to tell apart, two peaks sometimes appear. Because of this probabilistic property, EPro-PnP needs no special handling of symmetric objects in the loss function.

Summary

EPro-PnP turns the originally non-differentiable optimal pose into a differentiable pose probability density, so that pose estimation networks based on PnP geometric optimization can be trained end-to-end stably and flexibly. EPro-PnP applies to general 3D object pose estimation problems; even when the 3D object geometry is unknown, the object's 2D-3D correspondences can be learned through end-to-end training. EPro-PnP therefore widens the space of possible network designs. For example, the deformable correspondence network we propose could not have been trained before.

In addition, EPro-PnP can be used directly to improve existing PnP-based pose estimation methods, unlocking the potential of existing networks through end-to-end training and improving pose estimation accuracy. In a more general sense, EPro-PnP essentially brings the softmax commonly used in classification into the continuous domain; it can be used for other 3D vision problems based on geometric optimization and, in theory, extended to training general models with nested optimization layers.


Copyright notice
This article was created by [PaperWeekly]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/186/202207051654424739.html
