
[Target tracking] | STARK

2022-07-01 15:25:00 rrr2

This article uses a transformer to integrate temporal context information.
By updating the input dynamic template (in the paper, every 200 frames), temporal context information is obtained and then processed by the transformer.
The transformer also has strong modeling capability for spatial context.

The new architecture consists of three key components: an encoder, a decoder, and a prediction head.

The encoder takes as input the initial target object, the current image, and a dynamically updated template. The input is a triplet consisting of the search region and two templates. Their backbone features are first flattened and concatenated, then fed into the encoder.
The self-attention modules in the encoder learn the relationships between the inputs through their feature dependencies. Because the template image is updated throughout the video sequence, the encoder can capture both spatial and temporal information about the target.
The decoder learns a query embedding to predict the spatial location of the target object.
A corner-based prediction head estimates the bounding box of the target in the current frame. Meanwhile, a learned score head controls when the dynamic template image is updated.

Background

Offline Siamese trackers are purely spatial trackers: they treat target tracking as template matching between the initial template and the current search region.

Gradient-free methods [54,57] use additional networks to update the Siamese tracker template [2,61]. Another representative work, LTMU [8], learns a meta-updater to predict whether the current state is reliable enough to be used for updates in long-term tracking. Although these methods are effective, they keep space and time separated (the paper does not explain why this separation arises or what its consequences are). In contrast, our method integrates spatial and temporal information as a whole and learns both jointly with a transformer.

STARK-S structure

Backbone

The search region x (3×320×320) and the initial template z (3×128×128) are passed through a ResNet backbone (stride s=16), producing features fx (256×20×20) and fz (256×8×8).
Before being sent to the encoder, the two feature maps are flattened and concatenated into a sequence of shape (256 × (400+64)), as sketched below.
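
A rough sketch of this flatten-and-concatenate step (tensor names are illustrative and the batch size is assumed to be 1; this is not the official STARK code):

    import torch

    fx = torch.randn(1, 256, 20, 20)   # search-region features (B, C, H, W)
    fz = torch.randn(1, 256, 8, 8)     # initial-template features

    fx_seq = fx.flatten(2).permute(2, 0, 1)    # (400, B, 256)
    fz_seq = fz.flatten(2).permute(2, 0, 1)    # (64, B, 256)
    seq = torch.cat([fx_seq, fz_seq], dim=0)   # (400 + 64, B, 256), fed to the encoder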

Encoder

 (0): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=2048, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=2048, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )

There are N = 6 encoder layers, each consisting of multi-head self-attention and a feed-forward network (FFN), with sinusoidal positional embeddings added to the inputs; a usage sketch is shown below.
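
A minimal sketch of such an encoder using PyTorch's built-in layers (the head count of 8 is an assumption not stated in the printout above; this mirrors the printed configuration but is not the official STARK code):

    import torch
    import torch.nn as nn

    encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=2048, dropout=0.1)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

    seq = torch.randn(400 + 64, 1, 256)   # stand-in for the concatenated sequence above
    # sinusoidal positional embeddings would be added to 'seq' before the attention layers
    memory = encoder(seq)                 # output keeps the (sequence, batch, channel) shape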

Decoder

Only a single query is fed into the decoder to predict one bounding box for the target object, i.e., only one set of candidate box values is output.

In addition, since there is only one prediction, the Hungarian algorithm used in DETR [24] to associate predictions is removed. Similar to the encoder, the decoder stacks M decoder layers, each consisting of self-attention, encoder-decoder attention, and a feed-forward network. In the encoder-decoder attention module, the target query can attend to all positions in the template and search-region features, learning a robust representation for the final bounding-box prediction.
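
A minimal sketch of a single-query decoder in this spirit (the layer and head counts are assumptions; the official implementation differs in details):

    import torch
    import torch.nn as nn

    decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, dim_feedforward=2048, dropout=0.1)
    decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

    query_embed = nn.Embedding(1, 256)        # one learned target query
    tgt = query_embed.weight.unsqueeze(1)     # (1, B=1, 256)
    memory = torch.randn(400 + 64, 1, 256)    # encoder output sequence
    hs = decoder(tgt, memory)                 # (1, 1, 256): one embedding -> one box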

Head

The uncertainty in coordinate estimation is modeled explicitly, producing more accurate and robust predictions for target tracking. A new prediction head is designed that estimates the probability distribution of the box corners.

We first take the search-region features from the encoder output sequence (400×256), then compute the similarity between these features and the output embedding of the decoder (256).

Finally, the result is reshaped into a feature map f of size 256×20×20 and fed into an FCN.


        if self.head_type == "CORNER":
            # adjust shape
            # encoder output for the search region: (B, HW, C) = ([1, 400, 256])
            enc_opt = memory[-self.feat_len_s:].transpose(0, 1)
            # decoder output embedding: (B, C, N) = ([1, 256, 1])
            dec_opt = hs.squeeze(0).transpose(1, 2)
            # dot product between search features and the target query
            att = torch.matmul(enc_opt, dec_opt)  # (B, HW, N) = ([1, 400, 1])
            # element-wise multiplication re-weights the search features by the attention
            opt = (enc_opt.unsqueeze(-1) * att.unsqueeze(-2)).permute((0, 3, 2, 1)).contiguous()  # (B, HW, C, N) --> (B, N, C, HW)
            bs, Nq, C, HW = opt.size()
            opt_feat = opt.view(-1, C, self.feat_sz_s, self.feat_sz_s)  # (B*N, 256, 20, 20)
            # run the corner head, then convert (x1, y1, x2, y2) to (cx, cy, w, h)
            outputs_coord = box_xyxy_to_cxcywh(self.box_head(opt_feat))
            outputs_coord_new = outputs_coord.view(bs, Nq, 4)
            out = {'pred_boxes': outputs_coord_new}
            return out, outputs_coord_new
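
For reference, a minimal sketch of what a corner-based box head can look like: two convolutional branches predict probability maps for the top-left and bottom-right corners, and a soft-argmax takes their expected coordinates. The layer sizes here are illustrative and differ from the official implementation.

    import torch
    import torch.nn as nn

    class CornerHead(nn.Module):
        """Sketch of a corner-based box head: two corner probability maps + soft-argmax."""
        def __init__(self, channels=256, feat_sz=20):
            super().__init__()
            self.tl_conv = nn.Sequential(nn.Conv2d(channels, channels // 2, 3, padding=1), nn.ReLU(),
                                         nn.Conv2d(channels // 2, 1, 1))
            self.br_conv = nn.Sequential(nn.Conv2d(channels, channels // 2, 3, padding=1), nn.ReLU(),
                                         nn.Conv2d(channels // 2, 1, 1))
            coord = torch.arange(feat_sz).float() / feat_sz               # normalized grid coordinates
            self.register_buffer("coord_x", coord.repeat(feat_sz, 1).flatten())
            self.register_buffer("coord_y", coord.view(-1, 1).repeat(1, feat_sz).flatten())

        def soft_argmax(self, score_map):
            prob = torch.softmax(score_map.flatten(1), dim=1)             # (B, H*W) corner probability
            return (prob * self.coord_x).sum(1), (prob * self.coord_y).sum(1)

        def forward(self, x):                                             # x: (B, C, feat_sz, feat_sz)
            x1, y1 = self.soft_argmax(self.tl_conv(x).squeeze(1))         # expected top-left corner
            x2, y2 = self.soft_argmax(self.br_conv(x).squeeze(1))         # expected bottom-right corner
            return torch.stack([x1, y1, x2, y2], dim=1)                   # (B, 4), normalized xyxy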


Loss function

No classification loss is used. The loss combines the L1 loss and the generalized IoU loss on the predicted box, L = λ_iou · L_iou(b_i, b̂_i) + λ_L1 · L_1(b_i, b̂_i); a sketch is shown below.
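
A minimal sketch of this loss, assuming boxes in normalized xyxy format (the helper name and the use of torchvision's GIoU utility are illustrative choices, not the official training code):

    import torch
    from torchvision.ops import generalized_box_iou

    def stark_s_loss(pred_boxes, gt_boxes, lambda_iou=2.0, lambda_l1=5.0):
        """Combined GIoU + L1 loss; pred_boxes and gt_boxes are (N, 4) in xyxy."""
        giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))   # per-sample GIoU
        loss_iou = (1.0 - giou).mean()                                 # GIoU loss
        loss_l1 = torch.nn.functional.l1_loss(pred_boxes, gt_boxes)    # L1 loss
        return lambda_iou * loss_iou + lambda_l1 * loss_l1
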
At inference time, the features of the first-frame template are fixed, and the search region cropped from each frame is taken as input.

STARK-ST

A dynamic template (3×128×128) is added to the input. (What is the dynamic template for the first frame? Presumably it is initialized from the initial template.)

After the backbone, the dynamic template features (256×8×8) are obtained.
Before being sent to the encoder, the three feature maps are flattened and concatenated into a sequence of shape (256 × (400+64+64)).

Classification head

A classification (score) head is added; given a threshold, it judges whether the current prediction is the target and therefore whether the dynamic template should be updated. A sketch is shown below.
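
A minimal sketch of how the score head can gate the dynamic-template update, assuming the 200-frame interval mentioned above and a confidence threshold of 0.5 (the threshold value and function names are assumptions):

    UPDATE_INTERVAL = 200
    CONF_THRESHOLD = 0.5

    def maybe_update_template(frame_idx, conf_score, new_target_crop, dynamic_template):
        """Replace the dynamic template with the current target crop only when reliable."""
        if frame_idx % UPDATE_INTERVAL == 0 and conf_score > CONF_THRESHOLD:
            return new_target_crop
        return dynamic_template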

Training

Training is split into two parts, the localization network and the classification head, trained in two stages.
Two cited papers are used to argue that training the two tasks separately works better, although many current papers learn classification and localization jointly.
Classification loss: a binary cross-entropy, L_ce = -[y_i · log(P_i) + (1 − y_i) · log(1 − P_i)], where y_i is the ground-truth label and P_i the predicted confidence; a sketch is shown below.
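
A minimal sketch of this classification loss, assuming the score head outputs one raw logit per sample (names are illustrative):

    import torch.nn.functional as F

    def classification_loss(pred_logits, labels):
        """pred_logits: (B,) raw scores; labels: (B,) 1.0 if the target is present, else 0.0."""
        return F.binary_cross_entropy_with_logits(pred_logits, labels.float())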

Experiments

Training settings

The training data includes LaSOT [13], GOT-10K [18], COCO2017 [30], and TrackingNet [36].

The search image and template are 320×320 and 128×128 pixels, corresponding to 5² and 2² times the area of the target box, respectively.

The whole training process of STARK-ST is divided into two stages: 500 epochs for the localization stage and 50 epochs for the classification stage. Each epoch uses 60k training triplets. The loss weights λL1 and λiou are set to 5 and 2. AdamW is used with weight decay 1e-4.
Each GPU holds 16 triplets, so each iteration has a minibatch size of 128 (i.e., 8 GPUs).

The initial learning rates of the backbone and the rest of the network are 1e-5 and 1e-4, respectively. In the first stage the learning rate is decreased by a factor of 10 after 400 epochs; in the second stage it is decreased by a factor of 10 after 40 epochs. A sketch of this setup is shown below.
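
A minimal sketch of this optimizer and schedule (the grouping of parameters by name and the helper function are assumptions, not the official training script):

    import torch

    def build_optimizer(model: torch.nn.Module, stage1: bool = True):
        backbone = [p for n, p in model.named_parameters() if "backbone" in n]
        others = [p for n, p in model.named_parameters() if "backbone" not in n]
        optimizer = torch.optim.AdamW(
            [{"params": backbone, "lr": 1e-5}, {"params": others, "lr": 1e-4}],
            weight_decay=1e-4)
        # drop the learning rate by 10x after epoch 400 (stage 1) or epoch 40 (stage 2)
        step = 400 if stage1 else 40
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step, gamma=0.1)
        return optimizer, scheduler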

Results and comparison

Evaluation is done on three short-term benchmarks (GOT-10K, TrackingNet, and VOT2020) and two long-term benchmarks (LaSOT and VOT2020-LT).
GOT-10K [18] is a recently released large-scale and highly diverse benchmark for generic object tracking. It contains more than 10,000 video segments of real-world moving objects. All methods use the same training and test data protocol provided by the dataset to ensure a fair comparison of deep trackers, and the object classes in the training and test sets have zero overlap. After the tracking results are uploaded, the official website analyzes them automatically. The evaluation metrics include success plots, average overlap (AO), and success rate (SR). AO is the average overlap between all estimated bounding boxes and the ground-truth boxes. SR0.5 is the fraction of successfully tracked frames whose overlap exceeds 0.5, and SR0.75 is the fraction whose overlap exceeds 0.75.
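
A minimal sketch of these metrics, computed from per-frame IoU overlaps between predicted and ground-truth boxes (illustrative only):

    import numpy as np

    def got10k_metrics(ious):
        ious = np.asarray(ious, dtype=float)
        ao = ious.mean()              # average overlap (AO)
        sr50 = (ious > 0.5).mean()    # success rate at IoU > 0.5 (SR0.5)
        sr75 = (ious > 0.75).mean()   # success rate at IoU > 0.75 (SR0.75)
        return ao, sr50, sr75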

Taking the size of the ground-truth box into account, the precision is normalized to obtain Norm. Prec., evaluated over thresholds in [0, 0.5]. That is, the Euclidean distance between the centers of the predicted box and the ground-truth box is measured relative to the scale of the ground-truth box.
AUC: the area under the curve of the success plot.

VOT2020

https://blog.csdn.net/Dr_destiny/article/details/80108255
EAO is VOT's comprehensive evaluation metric for short-term tracking. It reflects both accuracy (A) and robustness (R), but it is not computed directly from A and R.
Accuracy evaluates how precisely the tracker localizes the target: the larger the value, the more accurate the tracker.

Robustness evaluates the stability of tracking: the larger the value, the less stable the tracker.

LaSOT

LaSOT [13] is a large-scale long-term tracking benchmark; its test set contains 280 videos with an average length of 2,448 frames.

VOT2020-LT

VOT2020-LT consists of 50 long videos in which the target object frequently disappears and reappears. In addition, the tracker must report a confidence score for the current target. Precision (Pr) and recall (Re) are computed over a series of confidence thresholds, and the F-score, F = 2·Pr·Re / (Pr + Re), is used as the final ranking metric.

Component comparison

Analysis:
Performance drops when the score head is removed, which indicates that improper use of temporal information can hurt performance; filtering out unreliable templates is therefore important.
The corner-based head is more accurate.
The encoder is more important than the decoder.

Comparison with other frameworks

Learning localization and classification jointly.

As described in Section 3.2, localization is treated as the primary task and trained in the first stage, while classification is trained as a secondary task in the second stage. We also ran an experiment that learns localization and classification jointly in a single stage, shown in Tab. 7. This strategy leads to suboptimal results, 3.9% lower than the STARK strategy. Two potential reasons are: (1) optimizing the score head interferes with training the box head, leading to inaccurate box predictions; (2) the two tasks require different training data. Specifically, the localization task wants every search region to contain the tracked target, so as to provide strong supervision; in contrast, the classification task expects a balanced distribution, with half of the search regions containing the target and the other half not.

Encoder attention. The upper half of Fig. 6 shows a template-search triplet from Cat-20 and the attention maps from the last encoder layer. The visualized attention takes the central pixel of the initial template as the query and all pixels in the triplet as keys and values. It can be seen that the attention focuses on the tracked target, which is roughly separated from the background. Moreover, the features produced by the encoder are strongly discriminative between the target and distractors.

Decoder attention. The lower half of Fig. 6 shows a template-search triplet from BOWS-13 and the attention maps from the last decoder layer. The decoder attends differently to the template and the search region: attention on the template focuses mainly on the top-left corner of the target, while attention in the search region tends to focus on the target's boundaries. In addition, the learned attention is robust to distractors.
