当前位置：网站首页>SA Siam: Twin network for real-time target tracking

SA Siam: Twin network for real-time target tracking

2022-07-26 14:45:00 【The way of code】

Address of thesis ：https://openaccess.thecvf.com/content_cvpr_2018/papers/He_A_Twofold_Siamese_CVPR_2018_paper.pdf

Abstract

1. The core of this article is ： Classify the image in the task Semantic features （Semantic features） Match with the similarity in the task Appearance features （Appearance features） Complementary combination , It is very suitable for target tracking tasks , Therefore, the method in this paper can be simply summarized as ：SA-Siam= Semantic Branch + Appearance branch ;

2.Motivation： The characteristics of target tracking are , We want to distinguish the changing target object from many backgrounds , The difficulty is ： Background and changes . The idea of this paper is to filter out the background with a semantic Branch , At the same time, a branch of appearance features is used to generalize the changes of targets , If an object is judged not to be the background by the semantic Branch , And it is judged by the appearance feature branch that the object is changed from the target object , Then we think this object is the object that needs to be tracked ;

3. The purpose of this article is to improve SiamFC Discrimination in target tracking tasks . At depth CNN In the task of training target classification , The deep features in the network have strong semantic information and are invariant to the appearance changes of the target . These semantic features can be used to complement SiamFC Appearance features used in target tracking tasks . Based on this discovery , We proposed SA-Siam, This is a double twin network , It consists of semantic branches and appearance branches . Each branch uses the twin network structure to calculate the similarity between the candidate image and the target image . In order to maintain the independence of the two branches , The two twin networks have nothing to do in the training process , It will only be combined during the test .

4. The second core of this paper ： For the newly introduced semantic Branch , This paper further proposes the channel attention mechanism . When using the network to extract the features of the target object , Different targets activate different characteristic channels , We should give a high weight to the activated channel , This paper calculates the weights of these different layers through the response of the target object in the specific layer of the network . Experiments confirm that , In this way , It can further improve the discrimination of semantic twin Networks .

It is still used SiamFC During tracking, all frames are compared with the first frame , Is the main defect of this kind of method .

Related work

RPN Detailed introduction

SiamFC Detailed introduction

SiamRPN Detailed introduction

1.SiamFC： about A,B,C Three pictures , hypothesis C Pictures and A The picture is an object , But the appearance has changed ,B and A It doesn't matter .SiamFC Network input two pictures , So after SiamFC After get A and C High similarity ,A and B Low similarity . Through the above SiamFC The function of , Naturally, it can be used in target tracking algorithm .SiamFC The network has outstanding advantages ： No need to be online fine-tune and end-to-end Tracking mode , So that it can do real-time tracking under the premise of ensuring the tracking effect .

2. Integrated tracker ： Most tracking is a model A, Using models A Calculate the current data to get the tracking results , The integrated tracker has multiple models A,B,C, Analyze the current data respectively , Then we fuse the results to get the final tracking results . The semantic features of this paper + The appearance features draw on the idea of integrated tracker . In the integrated tracker , Model A,B,C The lower the correlation , The better the tracking effect , It's easy to understand , If the three of them are very relevant , Then there is no difference between using three and one , on this account , The semantic feature and appearance feature network in this paper are completely irrelevant in the training process .

frame

The proposed double SA-Siam The architecture of the network .A-Net Indicates appearance network . The network and data structures connected by dotted lines are similar to SiamFC Exactly the same .S-Net Represents the Semantic Web . Extract the features of the last two convolution layers . The channel attention module determines the weight of each characteristic channel based on the target and context information . Appearance branch and semantic branch are trained separately , Not until the test time .

1. Appearance branch （ The blue part ）

One goal A Send to the network P in , A search domain larger than the target S Send to the network P in ,A The resulting feature map is similar to S The resulting characteristic graph is convoluted to obtain the correlation coefficient graph , The greater the correlation coefficient , The more likely it is to be the same goal , The network adopts and SiamFC The same network in .

Appearance branch with （z,X） For input . It cloned SiamFC The Internet . The convolution network used to extract appearance features is called A-Net. The response mapping from the appearance branch can be written as ：

In the problem of similarity learning ,A-Net All parameters in are trained from scratch .

By minimizing the logical loss function L（·） To optimize A-Net, as follows ：

among θa Express A-Net Parameters in ,N It's the number of training samples ,Yi yes ground truth Response .

2. Semantic Branch （ Orange part ）

This is the focus of this article . Orange indicates Semantic Web , It is pre trained AlexNet, Fix all parameters during training and testing , Only extract the last conv4 and conv5 Characteristics of , The target template becomes zs,zs and X The same big as that , and z Have the same center , But it contains context information , Because the channel attention model is added to the branch , Determine the weight through the target and surrounding information , Choose a channel that has a greater impact on a specific tracking target . in addition , In order to better carry out subsequent related operations , The author adds the upper and lower branches to the fusion model , Joined the 1×1 The convolution of layer , Convolute each of the two extracted layers , Make the characteristic channels of the target template branch and the detection branch the same , And the total number of channels is the same as that of the appearance network .

When training semantic branch network, only channel attention module and fusion module are trained .

The response mapping from the semantic branch can be written as ：

ξ Is the channel weight ,g() It is the fusion of features , Easy to operate .

Loss function L（·） as follows ：

among θs Represents a trainable parameter ,N It's the number of training samples .

3. combination

The appearance network and semantic network are trained separately , Semantic network only trains channel attention module and fusion module . During the test time , The final response graph is calculated as the weighted average of the graphs from the two branches ：

among λ Is the weighting parameter , To balance the importance of the two branches . In practice ,λ It can be estimated from the verification set . The author obtains through the experiment λ=0.3 best .

4. In the semantic Branch Channel Attention Mechanism

Why do you do this ： The appearance of high-dimensional semantic features on the target （ The deformation of the picture 、 Spin, etc ） Change is robust , Resulting in low discrimination . In order to improve the discrimination of semantic branches , We designed a Channel Attention modular . Intuition , When tracking different objects , Different channels play different roles , Some channels are extremely important for some objects , But for other objects, it can be ignored , It may even introduce noise . If we can adaptively adjust the importance of channels , Then we will get the target and reliable feature expression . In order to achieve this goal , Not only is the goal important to us , The background in a certain range around it is also important for us , Therefore, the template of the input network here is one circle larger than the appearance Branch .

Now let's talk about how to realize this function .

Channel attention passes through the maximized pool layer and multi-layer perceptron （MLP） Generate channels i The weighting coefficient of ξi.

In the above figure , The assumption is conv5 Layer of the first i Channel characteristic diagram , Dimension for 22×22, Divide the graph into 3×3 Share （ The middle one is 6×6, Is the exact goal ）, after max-pooling Become after operation 3×3 Graph , After a two-layer MLP The Internet （Multi-Layer Perceptron Multilayer perceptron , contain 9 Neurons and a hidden layer , Hidden layer adoption ReLU function ） Score after , stay sigmoid once （ In order to make the score coefficient in 0~1 Between ） Get the final score coefficient . It is worth noting that ： The score coefficient calculation operation here is only calculated in the first frame , Subsequent frames follow the results of the first frame , Therefore, the calculation time is negligible .

experiment

Data dimension ： In our implementation , Target image block z The size is 127×127×3, also zs and X All have 255×255×3 The size of the . about z and X,A-Net The output features of have dimensions of 6×6×256 and 22×22×256. come from S-Net Of conv4 and conv5 The function has a size of 24×24×384 and 22×22×256 The tunnel zs and X. These two groups of functions 1×1 ConvNet Each output 128 Channels （ Up to 256 Channels ））, The spatial resolution remains unchanged . The response graph has the same 17×17 dimension .