SiamFC: Fully-Convolutional Siamese Networks for Object Tracking
2022-07-26 14:45:00 【The way of code】
The SiamFC Network

In the figure, z denotes the template image; the algorithm uses the ground truth from the first frame. x denotes the search region, i.e. the candidate search area in each subsequent frame to be tracked. ϕ denotes a feature-mapping operation that embeds the raw image into a particular feature space; in this paper it consists of the convolution and pooling layers of a CNN. 6×6×128 is the feature obtained from z after ϕ: a 128-channel feature map of size 6×6. Likewise, 22×22×128 is the feature of x after ϕ. The × that follows denotes the convolution operation: the 22×22×128 feature is convolved with the 6×6×128 feature acting as the kernel, yielding a 17×17 score map in which each value represents the similarity between a position in the search region and the template.
At its core, the algorithm compares the similarity between the search region and the target template and produces a score map over the search region. In principle this is very close to correlation filtering: the target template is matched point by point against the search region, this point-by-point translation matching is implemented as a convolution, and the point with the largest similarity value in the convolution result is taken as the centre of the new target position.
The ϕ in the figure above is in fact part of a CNN, and the two ϕ branches share the same network structure, which makes this a typical Siamese neural network. Moreover, the whole model contains only conv and pooling layers, so it is also a typical fully-convolutional neural network.
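To make the pipeline concrete, here is a minimal sketch of the cross-correlation step, assuming PyTorch; the random tensors stand in for the real embeddings ϕ(z) and ϕ(x):

```python
import torch
import torch.nn.functional as F

# Feature maps with the shapes from the figure: phi(z) is 6x6x128 and
# phi(x) is 22x22x128 (batch, channels, height, width in PyTorch order).
# Random tensors stand in for the real embeddings here.
z_feat = torch.randn(1, 128, 6, 6)    # embedded template
x_feat = torch.randn(1, 128, 22, 22)  # embedded search region

# Cross-correlation: use the template feature as the convolution kernel
# and slide it over the search feature. (22 - 6) + 1 = 17, giving the
# 17x17 score map described above.
score_map = F.conv2d(x_feat, z_feat)
print(score_map.shape)  # torch.Size([1, 1, 17, 17])

# The position of the maximum response becomes the new target centre.
peak_index = score_map.flatten().argmax()
```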
A loss function is needed to train the model, and the optimal model is obtained by minimizing it. The paper constructs an effective loss by dividing the positions of the search region into positive and negative samples: points within a certain radius of the target are positive samples, and points beyond that radius are negative. For example, in the score map generated at the far right of Figure 1, the red points are positive samples and the blue points are negative samples; they correspond to the red and blue rectangular regions of the search region. The paper uses the logistic loss, whose specific form is as follows.
For each point of the score map, the loss is

$$\ell(y, v) = \log\bigl(1 + \exp(-yv)\bigr)$$

where v is the real-valued score of the point in the score map and y ∈ {+1, −1} is the label corresponding to that point.
The above is the loss at a single point of the score map. For the score map as a whole, the loss is the average of the per-point losses:

$$L(y, v) = \frac{1}{|\mathcal{D}|}\sum_{u \in \mathcal{D}} \ell\bigl(y[u], v[u]\bigr)$$

where u ∈ 𝒟 ranges over the positions of the score map.
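A hedged PyTorch sketch of this loss follows; the function name and the toy centre-based label map are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def siamfc_logistic_loss(v, y):
    """Mean logistic loss over the score map:
    l(y, v) = log(1 + exp(-y * v)), averaged over all positions u in D.
    softplus(-y * v) computes log(1 + exp(-y * v)) stably."""
    return F.softplus(-y * v).mean()

# Toy example: a random 17x17 score map and a +1/-1 label map in which
# points within a small radius of the centre (the target) are positive.
v = torch.randn(17, 17)
ys, xs = torch.meshgrid(torch.arange(17), torch.arange(17), indexing="ij")
dist = ((ys - 8) ** 2 + (xs - 8) ** 2).float().sqrt()
y = torch.where(dist <= 2, torch.tensor(1.0), torch.tensor(-1.0))

loss = siamfc_logistic_loss(v, y)
```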
The overall network structure is similar to AlexNet, but without the final fully connected layers: only the preceding convolution and pooling layers remain.

The full network structure is shown in the table above. The pooling layers use max-pooling, and every convolution layer is followed by a ReLU nonlinearity except the fifth. In addition, during training, batch normalization is applied before every ReLU layer. (Batch normalization is a technique often seen in deep learning: when training a DNN with gradient descent, the data of each mini-batch is normalized layer by layer so that its mean becomes 0 and its variance becomes 1. Its main purpose is to alleviate vanishing/exploding gradients and to speed up training.) Here it also serves to reduce the risk of overfitting.
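Since the table itself is an image, here is a sketch of that backbone, assuming PyTorch; kernel and channel sizes follow the SiamFC paper's published architecture, and the paper's grouped convolutions are omitted for simplicity:

```python
import torch
import torch.nn as nn

# AlexNet-like embedding phi: five conv layers, max-pooling after the
# first two, batch normalization before every ReLU, no ReLU after conv5,
# and no fully connected layers at all.
phi = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=2),
    nn.BatchNorm2d(96), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5),
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 192, kernel_size=3),
    nn.BatchNorm2d(192), nn.ReLU(),
    nn.Conv2d(192, 192, kernel_size=3),
    nn.BatchNorm2d(192), nn.ReLU(),
    nn.Conv2d(192, 128, kernel_size=3),  # conv5: no BN/ReLU after it
)

# The two input sizes from the figure map to the feature sizes it shows.
print(phi(torch.randn(1, 3, 127, 127)).shape)  # [1, 128, 6, 6]
print(phi(torch.randn(1, 3, 255, 255)).shape)  # [1, 128, 22, 22]
```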
AlexNet

AlexNet is an 8-layer structure: the first 5 layers are convolutional and the last 3 are fully connected. It has roughly 60 million learned parameters and 650,000 neurons, and it runs across two GPUs. In layers 2, 4 and 5, the kernels connect only to the previous layer's feature maps on the same GPU; layer 3 connects to both halves of the previous layer, and the fully connected layers are fully connected across the two GPUs.
LRN layers sit after the 1st and 2nd convolutions; max-pooling layers sit after the LRN layers and after the 5th convolution; ReLU sits after every convolution layer and fully connected layer.
Convolution kernel counts and sizes (count, then height × width × depth):

- conv1: 96 kernels of 11×11×3
- conv2: 256 kernels of 5×5×48
- conv3: 384 kernels of 3×3×256
- conv4: 384 kernels of 3×3×192
- conv5: 256 kernels of 3×3×192
- ReLU and dual-GPU training: speed up training (applied to all convolution and fully connected layers).
- Overlapping pooling: improves accuracy and makes overfitting less likely (applied after the first, second and fifth layers).
- Local response normalization (LRN): improves accuracy (applied after the first and second layers).
- Dropout: reduces overfitting (applied in the first two fully connected layers).
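To make the list concrete, here is a single-GPU PyTorch sketch of where these pieces sit in an AlexNet-style network; the padding values and the single-GPU layout are assumptions (the original split channels across two GPUs):

```python
import torch.nn as nn

# AlexNet-style stack: ReLU after every conv/FC layer, LRN after conv1
# and conv2, overlapping 3x3/stride-2 max pooling, and dropout in the
# first two fully connected layers. Expects 3x227x227 input.
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),      # overlapping pooling
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)
classifier = nn.Sequential(
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)
```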
Fine-tuning (fine-tune)
You see someone else's good model, and although your specific problem is different, you want to try it and see whether it gives good results, but you do not have much data. What do you do? No problem: take the other person's ready-made trained model, swap in your own data, adjust the parameters, and train again. This is fine-tuning (fine-tune).
Freeze part of the convolution layers of the pretrained model (usually the convolution layers closest to the input) and train the remaining convolution layers (usually those closest to the output) together with the fully connected layers. In a sense, fine-tuning is a form of transfer learning.
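A minimal sketch of this recipe, assuming PyTorch and torchvision's pretrained AlexNet as the donor model (any pretrained CNN works the same way):

```python
import torchvision.models as models

# Start from a pretrained model.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)

# Freeze the convolutional layers closest to the input...
for param in model.features.parameters():
    param.requires_grad = False

# ...then unfreeze the later conv blocks so the layers closest to the
# output can adapt to the new data; model.classifier (the fully
# connected layers) stays trainable throughout.
for param in model.features[8:].parameters():
    param.requires_grad = True
```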
Perceptron: PLA
The multilayer perceptron is a generalization of the perceptron. The perceptron learning algorithm (PLA: Perceptron Learning Algorithm) describes the structure of a single, standalone neuron.
The neural network structure of the perceptron is represented as follows:

Multilayer Perceptron: MLP
An important feature of the multilayer perceptron is, of course, its multiple layers. We call the first layer the input layer, the last layer the output layer, and the layers in between hidden layers. The MLP does not prescribe the number of hidden layers, so an appropriate number can be chosen as needed, and it places no limit on the number of neurons in the output layer.
The structure of the MLP neural network is shown below. Only one hidden layer is involved here; the input has just three variables [x1, x2, x3] plus a bias b, and the output layer has three neurons. Compared with the single-neuron model of the perceptron algorithm, the MLP integrates many such neurons. A minimal PyTorch sketch of such an MLP follows.
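In this sketch the hidden width of 4 is an arbitrary choice, not from the text:

```python
import torch
import torch.nn as nn

# MLP matching the description above: three inputs [x1, x2, x3], one
# hidden layer, and an output layer with three neurons. nn.Linear adds
# the bias b automatically.
mlp = nn.Sequential(
    nn.Linear(3, 4),  # input layer -> hidden layer
    nn.ReLU(),        # nonlinearity between the layers
    nn.Linear(4, 3),  # hidden layer -> three output neurons
)

out = mlp(torch.randn(1, 3))  # one sample with features [x1, x2, x3]
print(out.shape)              # torch.Size([1, 3])
```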

ReLU function
The ReLU function is given by

$$\mathrm{ReLU}(x) = \max(0, x)$$

Its graph is shown below:

sigmoid function
As its input approaches positive or negative infinity, the sigmoid function flattens toward a smooth asymptote. Because its output range is (0, 1), it is often used to express probabilities in binary classification.

The sigmoid function is given by

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Its graph is shown below:
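Since the plots are images, here is a small matplotlib sketch (the library choice and plotting ranges are my own) that reproduces both activation curves, ReLU above and sigmoid here:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 200)
relu = np.maximum(0, x)          # ReLU(x) = max(0, x)
sigmoid = 1 / (1 + np.exp(-x))   # sigmoid(x), output in (0, 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, relu)
ax1.set_title("ReLU")
ax2.plot(x, sigmoid)
ax2.set_title("sigmoid")
plt.show()
```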
