SiamRPN: Region Proposal Network and Siamese Network

2022-07-26 14:46:00 【The way of code】

Paper: http://openaccess.thecvf.com/content_cvpr_2018/papers/Li_High_Performance_Visual_CVPR_2018_paper.pdf

Abstract

Most top-performing visual object trackers struggle to run at real-time speed. In this paper, we propose the Siamese region proposal network (Siamese-RPN), which can be trained offline, end to end, on large-scale image pairs. The architecture consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork. The region proposal subnetwork has two branches: classification and regression. In the tracking phase, the proposed method is formulated as a one-shot detection task.

We precompute the template branch of the Siamese subnetwork, i.e. the first frame, and formulate it as a convolution layer in the region proposal subnetwork of the detection branch for online tracking. Thanks to these improvements, traditional multi-scale testing and online fine-tuning can be discarded, which also greatly improves speed. Siamese-RPN runs at 160 FPS and achieves leading results on VOT2015, VOT2016, and VOT2017.

1. Introduction

Compared with state-of-the-art, carefully designed correlation-filter-based methods, a tracker based on offline-trained deep learning can achieve better results. The key is the proposed Siamese region proposal network (Siamese-RPN). It consists of a template branch and a detection branch, which are trained offline on large-scale image pairs in an end-to-end manner. Inspired by RPN, the state-of-the-art proposal extraction method, we perform proposal extraction on the feature map. Unlike the standard RPN, we use the correlation feature map of the two branches to extract proposals. In the tracking task there are no predefined categories, so the template branch is needed to encode the target's appearance into the RPN feature map in order to distinguish foreground from background.

In the tracking phase, the authors treat the task as a one-shot detection task. In other words, the bounding box in the first frame is the only exemplar, and similar targets are detected in the other frames.

In summary, the authors make three contributions:

1. They propose the Siamese region proposal network, which can be trained offline end to end on the large amount of data in ILSVRC and YouTube-BB.

2. In the tracking phase, the tracking task is formulated as a local one-shot detection task.

3. The tracker achieves leading performance on VOT2015, VOT2016, and VOT2017, at a speed of up to 160 FPS.

2. Related work

2.1 RPN

RPN stands for Region Proposal Network. It is used to select regions of interest, i.e. proposal extraction. For example, if a region has p > 0.5, the region is considered likely to contain an object from one of the 80 categories, without yet knowing which category. That is all the network does at this stage: it only needs to select the regions that may contain objects. These selected regions are called ROIs (Regions of Interest). RPN also marks the approximate location of each ROI on the feature map, i.e. it outputs a bounding box.

2.2 One-shot learning

The most common example is face detection: given only the information in a single image, that information is matched against the image to be examined. This is one-shot detection, which can also be called one-shot learning.

3. Siamese-RPN framework

3.1 SiamFC

A Siamese network is one whose main structure is split into two branches that, like twins, share the weights of their convolutional layers. The upper branch (z) is the template branch, used to extract the features of the template frame. φ denotes the feature extractor, which here produces deep features: after the fully convolutional network we obtain a 6×6×128 feature map φ(z). The lower branch (x) is the detection (search) branch. Its input is the search region cropped from the current frame according to the result of the previous frame. After feature extraction we obtain a 22×22×128 feature map φ(x). The template feature map is then matched against the feature map of the current frame's search region, which can be viewed as sliding φ(z) over φ(x). The result is a response map, and the point with the highest response is the position of the target in this frame.
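The sliding match described above is a cross-correlation. Below is a minimal sketch in plain Python, assuming single-channel toy feature maps stored as nested lists. The function names `cross_correlate` and `argmax_2d` are illustrative only and do not come from the paper or any library.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (both 2-D lists) and return the response map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for top in range(sh - th + 1):
        row = []
        for left in range(sw - tw + 1):
            # Dot product between the template and the window at (top, left).
            score = sum(
                search[top + dy][left + dx] * template[dy][dx]
                for dy in range(th)
                for dx in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def argmax_2d(response):
    """Return (row, col) of the largest value in a 2-D list."""
    best = max(
        (value, r, c)
        for r, row in enumerate(response)
        for c, value in enumerate(row)
    )
    return best[1], best[2]


# Toy example: a 2x2 template hidden inside a 5x5 search map at row 2, col 3.
template = [[1.0, 2.0], [3.0, 4.0]]
search = [[0.0] * 5 for _ in range(5)]
for dy in range(2):
    for dx in range(2):
        search[2 + dy][3 + dx] = template[dy][dx]

response = cross_correlate(search, template)
peak = argmax_2d(response)
```

With the sizes quoted in the text, sliding a 6×6 template over a 22×22 search map gives a 17×17 response map, since 22 − 6 + 1 = 17. A real tracker would also sum over the 128 channels, which this toy version omits.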

The advantage of a Siamese network is that it turns tracking into a detection/matching task. The network does not need to be updated at any point during tracking, which makes the algorithm fast (80+ FPS). In addition, the follow-up work CFNet combines feature extraction and feature discrimination into a single end-to-end task, and was the first to combine a deep network with correlation filtering.

Siamese trackers also have obvious drawbacks:

1. The template branch is computed only on the first frame, so the template features adapt poorly to changes in the target. When the target changes significantly, the features from the first frame may not be sufficient to characterize it. As for why template features are extracted only from the first frame, I think the likely reasons are:

(1) The first-frame features are the most reliable and robust. Since it is impossible to tell during tracking which frames are reliable, using only the first-frame features is enough to get good accuracy.

(2) Extracting template features only from the first frame makes the algorithm simpler and faster.

2. A Siamese method can only obtain the center position of the target, not its size, so it has to fall back on a simple multi-scale search plus regression. This increases the amount of computation and is still not accurate enough.

Network training principle

As shown in the figure, the target template from the previous frame and the search region from the next frame form many exemplar-candidate pairs. Following the principle of discriminative tracking, only the pair made of the target region in the previous frame and the target in the next frame (the exemplar of frame T and the exemplar of frame T+1) is a positive sample for the model. The large number of remaining exemplar-candidate pairs are all negative samples. This allows the network to be trained end to end.

3.2 Siamese-RPN

On the left is the Siamese network, whose upper and lower branches have exactly the same structure and parameters. The upper branch takes the bounding box of the first frame as input (the template frame), and this information is used to detect the target in the candidate region. The lower branch takes the frame to be detected. Its search region is clearly larger than the template region. In the middle is the RPN structure, which has two parts. The upper part is the classification branch: the template-frame and detection-frame features from the Siamese network each pass through a convolution layer, after which the template feature has 2k×256 channels, where k is the number of anchors. The factor is 2k because there are two classes, foreground and background. The lower part is the bounding box regression branch. Because each box has four quantities [x, y, w, h], the factor there is 4k. On the right is the output.

3.3 Siamese feature extraction subnetwork

The backbone is a pretrained AlexNet in which the groups from conv2 and conv4 are removed. φ(z) is the output for the template frame, and φ(x) is the output for the detection frame.

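3.4 Region proposal subnetwork

The classification branch and the regression branch each correlate the template-frame features with the detection-frame features:

$ A^{cls}_{w \times h \times 2k} = [\phi(x)]_{cls} \star [\phi(z)]_{cls} $

$ A^{reg}_{w \times h \times 4k} = [\phi(x)]_{reg} \star [\phi(z)]_{reg} $

Here $\star$ denotes convolution with the template features as the kernel. $A^{cls}_{w \times h \times 2k}$ contains 2k-channel vectors. Each point holds the positive and negative activations of the anchors at that location, and the branch is trained with a cross-entropy loss. $A^{reg}_{w \times h \times 4k}$ contains 4k-channel vectors. Each point holds the dx, dy, dw, dh between an anchor and the ground-truth (gt) box, and the branch is trained with a smooth L1 loss on the normalized offsets:

$ \delta[0] = \frac{T_x - A_x}{A_w}, \quad \delta[1] = \frac{T_y - A_y}{A_h}, \quad \delta[2] = \ln\frac{T_w}{A_w}, \quad \delta[3] = \ln\frac{T_h}{A_h} $

A_x, A_y, A_w, A_h are the center coordinates, width, and height of the anchor box, and T_x, T_y, T_w, T_h are those of the gt box. The offsets are normalized because object sizes differ between images.

The smooth L1 loss:

$ \mathrm{smooth}_{L1}(x, \sigma) = \begin{cases} 0.5\,\sigma^2 x^2, & |x| < 1/\sigma^2 \\ |x| - 1/(2\sigma^2), & |x| \ge 1/\sigma^2 \end{cases} $

$ L_{reg} = \sum_{i=0}^{3} \mathrm{smooth}_{L1}(\delta[i], \sigma) $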
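The normalized offsets between an anchor and a gt box, and the smooth L1 loss applied to them, can be sketched as follows. This is an illustration only: boxes are assumed to be (center x, center y, width, height) tuples, the function names are made up for this example, and the default `sigma=1.0` is an assumption rather than a value from the article.

```python
import math


def regression_targets(anchor, gt):
    """Normalized offsets (dx, dy, dw, dh) from an anchor box to a gt box.

    Both boxes are (cx, cy, w, h) tuples.
    """
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = gt
    return (
        (tx - ax) / aw,      # center shift in x, in units of anchor width
        (ty - ay) / ah,      # center shift in y, in units of anchor height
        math.log(tw / aw),   # log ratio of widths
        math.log(th / ah),   # log ratio of heights
    )


def smooth_l1(x, sigma=1.0):
    """Quadratic near zero, linear for large |x|."""
    s2 = sigma ** 2
    if abs(x) < 1.0 / s2:
        return 0.5 * s2 * x * x
    return abs(x) - 0.5 / s2


def regression_loss(anchor, gt, sigma=1.0):
    """Sum of smooth L1 over the four normalized offsets."""
    return sum(smooth_l1(d, sigma) for d in regression_targets(anchor, gt))


# Same size, center shifted by half the anchor width and half the anchor height.
targets = regression_targets((10.0, 10.0, 4.0, 2.0), (12.0, 11.0, 4.0, 2.0))
loss = regression_loss((10.0, 10.0, 4.0, 2.0), (12.0, 11.0, 4.0, 2.0))
```

Dividing the center shift by the anchor size and taking the log of the size ratio makes the targets independent of the absolute scale of the object, which is the normalization the text refers to.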

3.5 Training phase: end-to-end training of Siamese-RPN

Because the target changes little between two consecutive frames in tracking, the anchors use only one scale with 5 different aspect ratios (unlike the 3×3 anchors in RPN). An anchor is a positive (foreground) sample when its IoU is greater than 0.6, and a negative (background) sample when its IoU is less than 0.3.
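A rough sketch of single-scale anchors with five aspect ratios, and of the IoU-based labeling rule, is shown below. The base size of 64 and the ratio list are hypothetical values chosen for illustration. Marking anchors with an IoU between the two thresholds as ignored (-1) is also an assumption, since the text only states the two thresholds.

```python
import math


def make_anchors(base_size, ratios):
    """One (w, h) anchor shape per aspect ratio (h / w), all with area base_size**2."""
    anchors = []
    for ratio in ratios:
        w = base_size / math.sqrt(ratio)
        h = base_size * math.sqrt(ratio)
        anchors.append((w, h))
    return anchors


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor_box, gt_box, hi=0.6, lo=0.3):
    """Return 1 (foreground), 0 (background), or -1 (ignored, an assumption)."""
    overlap = iou(anchor_box, gt_box)
    if overlap > hi:
        return 1
    if overlap < lo:
        return 0
    return -1


shapes = make_anchors(64, [0.33, 0.5, 1, 2, 3])  # hypothetical size and ratios
labels = [
    label_anchor((0, 0, 10, 10), (0, 0, 10, 10)),    # identical boxes
    label_anchor((0, 0, 10, 10), (20, 20, 30, 30)),  # no overlap
    label_anchor((0, 0, 10, 10), (0, 0, 10, 5)),     # IoU = 0.5
]
```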

4. Tracking as one-shot detection

Training looks for the parameters W that minimize the average loss L over n training pairs with labels $\ell_i$. As mentioned above, let z denote the template patch, x the detection patch, φ the Siamese feature extraction subnetwork, and ζ the region proposal subnetwork. The one-shot detection task can then be expressed as:

$ \min_{W} \frac{1}{n} \sum_{i=1}^{n} L(\zeta(\phi(x_i; W); \phi(z_i; W)), \ell_i) $

As shown in the figure, the purple part looks like the original Siamese network: the two inputs pass through the same CNN to give two feature maps. The blue part is the RPN. The template-frame features pass through a convolution layer in the RPN, and $[\phi(z)]_{reg}$ and $[\phi(z)]_{cls}$ are used as the kernels for detection.

Put simply, the template branch is precomputed: from the target features of the first frame it outputs a set of weights. These weights carry information about the target and serve as the parameters of the RPN in the detection branch, which then detects the target. The advantages are:

(1) The template branch learns an encoding of the target's features. Searching for the target with this encoding is more robust than matching directly against the first-frame feature map.

(2) Compared with the original Siamese network, the RPN directly regresses the coordinates and size of the target. This is accurate and avoids the time cost of multi-scale testing.

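After passing through the network, we denote the classification and regression feature maps as point sets:

$ A^{cls}_{w \times h \times 2k} = \{(x^{cls}_i, y^{cls}_j, c^{cls}_l)\}, \quad i \in [0, w),\ j \in [0, h),\ l \in [0, 2k) $

$ A^{reg}_{w \times h \times 4k} = \{(x^{reg}_i, y^{reg}_j, dx^{reg}_p, dy^{reg}_p, dw^{reg}_p, dh^{reg}_p)\}, \quad i \in [0, w),\ j \in [0, h),\ p \in [0, k) $

Because the odd channels of the classification feature map represent the positive activation, we collect the top K points over all $A^{cls}_{w \times h \times 2k}$ where l is odd, and denote the point set as:

$ CLS^* = \{(x^{cls}_i, y^{cls}_j, c^{cls}_l)_{i \in I, j \in J, l \in L}\} $

where I, J, L are index sets.

The variables i and j encode the position of the corresponding anchor and l encodes its aspect ratio, so the corresponding anchor set can be derived as:

$ ANC^* = \{(x^{an}_i, y^{an}_j, w^{an}_l, h^{an}_l)_{i \in I, j \in J, l \in L}\} $

In addition, the activations of ANC* on $A^{reg}_{w \times h \times 4k}$ give the corresponding refinement coordinates:

$ REG^* = \{(x^{reg}_i, y^{reg}_j, dx^{reg}_l, dy^{reg}_l, dw^{reg}_l, dh^{reg}_l)_{i \in I, j \in J, l \in L}\} $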
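Collecting the top K points from the positive channels can be sketched like this. The layout `cls_map[l][j][i]` (channel, row, column), the 0-based channel numbering in which odd channels are the positive ones, and the name `top_k_positive` are all assumptions made for this example.

```python
def top_k_positive(cls_map, k_top):
    """Return the k_top highest positive activations as (score, i, j, l) tuples.

    cls_map[l][j][i] is the activation of channel l at row j, column i.
    Only odd channels (assumed to hold positive activations) are considered.
    """
    candidates = [
        (score, i, j, l)
        for l, channel in enumerate(cls_map)
        if l % 2 == 1
        for j, row in enumerate(channel)
        for i, score in enumerate(row)
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:k_top]


# Toy map with k = 2 anchors (4 channels) on a 2x2 grid.
cls_map = [
    [[0.9, 0.9], [0.9, 0.9]],   # channel 0: negative, anchor 0
    [[0.1, 0.8], [0.2, 0.3]],   # channel 1: positive, anchor 0
    [[0.5, 0.5], [0.5, 0.5]],   # channel 2: negative, anchor 1
    [[0.7, 0.05], [0.6, 0.4]],  # channel 3: positive, anchor 1
]
top = top_k_positive(cls_map, 2)
```

Each returned tuple keeps the indices i, j, l, so the matching anchor and regression output can be looked up afterwards. Under the layout assumed here, the anchor index is l // 2.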

To choose the top K points from the classification output, the selection is done in two steps:

Step 1: discard the bounding boxes that are too far from the center, and select only from a fixed square region smaller than the original feature map.

In the figure the center size is 7. On close inspection, every grid cell holds k rectangles (anchors).

Step 2: re-rank the proposals with a cosine window (which suppresses proposals far from the center) and a scale change penalty (which suppresses large changes in scale), then pick the best one. The exact formulas are given in the paper.
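The cosine-window part of the re-ranking can be sketched as below. This is not the paper's exact formula: blending the score with the window by a weighted sum is one common choice in open-source Siamese trackers, and `window_influence=0.4` is a hypothetical value. The scale change penalty is left out.

```python
import math


def hann_window(size):
    """1-D Hann (cosine) window of length `size` (size >= 2), peaking at the center."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (size - 1)) for i in range(size)]


def rerank(scores, window_influence=0.4):
    """Blend a square score map with a 2-D cosine window.

    Points near the center keep or gain score; points near the border lose score.
    """
    size = len(scores)
    win = hann_window(size)
    return [
        [
            scores[r][c] * (1 - window_influence) + win[r] * win[c] * window_influence
            for c in range(size)
        ]
        for r in range(size)
    ]


# Two equally confident candidates on a 7x7 map: one at the center, one in a corner.
scores = [[0.0] * 7 for _ in range(7)]
scores[3][3] = 0.9
scores[0][0] = 0.9
reranked = rerank(scores)
```

After re-ranking, the center candidate outscores the corner candidate even though their raw scores were equal, which reflects the assumption that the target moves little between frames.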

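The anchor boxes corresponding to these points are combined with the regression outputs to give the bounding boxes:

$ x^{pro}_i = x^{an}_i + dx^{reg}_l \cdot w^{an}_l $

$ y^{pro}_j = y^{an}_j + dy^{reg}_l \cdot h^{an}_l $

$ w^{pro}_l = w^{an}_l \cdot e^{dw_l} $

$ h^{pro}_l = h^{an}_l \cdot e^{dh_l} $

Here the superscript an denotes the anchor box and pro denotes the final bounding box after regression. This gives the proposal set.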
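Turning an anchor box and its regression output into a final box can be sketched as follows, assuming boxes are (center x, center y, width, height) tuples and that the regression output uses the usual normalized-offset parameterization. The name `decode_proposal` is illustrative.

```python
import math


def decode_proposal(anchor, deltas):
    """Apply regression outputs (dx, dy, dw, dh) to an anchor (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (
        ax + dx * aw,        # shift center x by a fraction of the anchor width
        ay + dy * ah,        # shift center y by a fraction of the anchor height
        aw * math.exp(dw),   # rescale width
        ah * math.exp(dh),   # rescale height
    )


proposal = decode_proposal((10.0, 10.0, 4.0, 2.0), (0.5, 0.5, 0.0, 0.0))
```

This is the inverse of the normalization used to build the regression targets during training: zero deltas return the anchor unchanged.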

Next comes non-maximum suppression (NMS). As the name suggests, it removes the boxes that are not local maxima. Because anchors usually overlap, proposals for the same object overlap too. NMS is used to resolve the overlap: if the IoU between two proposals exceeds a preset threshold, the proposal with the lower score is discarded.

The IoU threshold needs to be set carefully. If it is too small, proposals for some objects may be lost. If it is too large, an object may keep many proposals. A typical value is 0.6.
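A minimal greedy NMS in plain Python is sketched below, assuming boxes are (x1, y1, x2, y2) corner tuples and using the typical threshold of 0.6 as the default. The helper names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.6):
    """Return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.82, so it is dropped, while the third box does not overlap anything and is kept.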

5. Implementation details

We use a modified AlexNet pretrained on ImageNet [28]. The parameters of the first three convolution layers are fixed, and only the last two convolution layers are fine-tuned in Siamese-RPN. These parameters are obtained by optimizing the loss function of Eq. 5 in the paper with SGD. Training runs for 50 epochs, with the learning rate decreased in log space from 10^-2 to 10^-6. Image pairs are extracted from VID and YouTube-BB by choosing frames with an interval of less than 100, followed by a further cropping step. If the size of the target bounding box is (w, h), the template patch is cropped around its center with size A×A, defined as:

$ (w + p) \times (h + p) = A^2 $

where p = (w + h)/2.

It is then resized to 127×127. The detection patch is cropped from the current frame in the same way, at twice the size of the template patch, and then resized to 255×255.
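The crop sizes can be worked out with a few lines of Python. This sketch assumes the template side A satisfies A² = (w + p)(h + p) with p = (w + h)/2, and that the detection patch is exactly twice as large. The function name is illustrative.

```python
import math


def crop_sizes(w, h):
    """Return (template_side, detection_side) in pixels for a target of size (w, h)."""
    p = (w + h) / 2                       # context margin
    template_side = math.sqrt((w + p) * (h + p))
    return template_side, 2 * template_side


template_side, detection_side = crop_sizes(50, 50)
# Scale factors that map the crops to the 127x127 and 255x255 network inputs.
template_scale = 127 / template_side
detection_scale = 255 / detection_side
```

For a 50×50 target, p = 50, so the template crop is 100×100 and the detection crop is 200×200 before resizing.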

In the inference phase there is no online adaptation, because online tracking is formulated as a one-shot detection task. Our experiments were implemented in PyTorch on a PC with an Intel i7, 12 GB RAM, and an NVIDIA GTX 1060.
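The log-space learning-rate decay mentioned in the training details (50 epochs, from 10^-2 down to 10^-6) can be sketched as below. It assumes one learning rate per epoch, spaced evenly in log space. The exact schedule used by the authors is not given in this article.

```python
def log_space_schedule(start, end, epochs):
    """Learning rates that fall geometrically from `start` to `end` over `epochs` steps."""
    if epochs == 1:
        return [start]
    ratio = (end / start) ** (1 / (epochs - 1))
    return [start * ratio ** e for e in range(epochs)]


learning_rates = log_space_schedule(1e-2, 1e-6, 50)
```

Each epoch multiplies the learning rate by the same constant factor, so the values are evenly spaced on a logarithmic axis.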
