[target tracking] |stark
2022-07-01 15:25:00 【rrr2】
This work uses a transformer to integrate temporal context information.
By updating a dynamic input template (every 200 frames in the paper), temporal context is captured and then processed by the transformer.
The transformer also has strong modeling capacity for spatial context.
The architecture consists of three key components: an encoder, a decoder, and a prediction head.
The encoder takes the initial target template, the current image, and a dynamically updated template as input. The input is therefore a triplet of one search region and two templates. Their backbone features are first flattened and concatenated, then fed to the encoder.
The self-attention modules in the encoder learn the relationships among the inputs through their feature dependencies. Because the template images are updated throughout the video sequence, the encoder can capture both spatial and temporal information about the target.
The decoder learns a query embedding to predict the spatial location of the target object.
A corner-based prediction head estimates the bounding box of the target in the current frame; meanwhile, a learned score head controls when the dynamic template image is updated.
Background
Offline Siamese trackers are purely spatial: they treat target tracking as template matching between the initial template and the current search region.
Gradient-free methods [54,57] use additional networks to update the template of Siamese trackers [2,61]. Another representative work, LTMU [8], learns a meta-updater that predicts whether the current state is reliable enough to trigger an update in long-term tracking. Although these methods are effective, they keep space and time separated (the post does not explain why this separation arises or what its consequences are). In contrast, our method integrates spatial and temporal information as a whole and learns both simultaneously with a transformer.
STARK-S architecture
Backbone
The search region x (3×320×320) and the initial template z (3×128×128) pass through a ResNet backbone (stride 16), yielding features fx (256×20×20) and fz (256×8×8).
Before being fed to the encoder, the features are flattened and concatenated: (256×(400+64)).
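The flatten-and-concatenate step above can be sketched in a few lines of PyTorch; the shapes are taken from the text, while the (search, template) concatenation order and the sequence-first layout are assumptions for illustration:

```python
import torch

# Shapes from the text: backbone stride 16 maps the 320x320 search
# region to 20x20 and the 128x128 template to 8x8, both with C=256.
fx = torch.randn(1, 256, 20, 20)  # search-region features
fz = torch.randn(1, 256, 8, 8)    # initial-template features

# Flatten spatial dims into token sequences of length H*W, then
# concatenate along the sequence axis: 400 + 64 = 464 tokens of dim 256.
seq_x = fx.flatten(2).permute(2, 0, 1)  # (400, B, 256)
seq_z = fz.flatten(2).permute(2, 0, 1)  # (64, B, 256)
seq = torch.cat([seq_x, seq_z], dim=0)  # (464, B, 256)
print(seq.shape)  # torch.Size([464, 1, 256])
```

The (seq_len, batch, channels) layout matches what `nn.MultiheadAttention` expects by default.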
Encoder
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
N = 6 encoder layers (multi-head self-attention plus an FFN).
Sinusoidal positional embeddings are computed and added to the flattened input features.
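A minimal sketch of the sinusoidal positional embedding for a 2D feature grid, DETR-style (half the channels encode y, half encode x); the exact normalization and channel split in the paper's code may differ:

```python
import torch

def sine_position_embedding(h: int, w: int, d_model: int = 256) -> torch.Tensor:
    """Return a (h*w, d_model) sinusoidal position embedding (sketch)."""
    d_half = d_model // 2
    # frequencies: 10000^(2i/d_half) as in "Attention Is All You Need"
    dim_t = 10000 ** (2 * torch.arange(d_half // 2, dtype=torch.float32) / d_half)
    y = torch.arange(h, dtype=torch.float32).unsqueeze(1) / dim_t  # (h, d_half/2)
    x = torch.arange(w, dtype=torch.float32).unsqueeze(1) / dim_t  # (w, d_half/2)
    pos_y = torch.cat([y.sin(), y.cos()], dim=1)  # (h, d_half)
    pos_x = torch.cat([x.sin(), x.cos()], dim=1)  # (w, d_half)
    # broadcast over the grid and concatenate channel-wise
    grid = torch.cat([
        pos_y.unsqueeze(1).expand(h, w, d_half),
        pos_x.unsqueeze(0).expand(h, w, d_half),
    ], dim=-1)  # (h, w, d_model)
    return grid.flatten(0, 1)  # (h*w, d_model)

pe = sine_position_embedding(20, 20)
print(pe.shape)  # torch.Size([400, 256])
```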
Decoder
Only a single query is fed to the decoder to predict one bounding box of the target object, i.e. only one set of box predictions is output.
Moreover, since there is only one prediction, the Hungarian algorithm used in DETR [24] for prediction association is removed. Similar to the encoder, the decoder stacks M decoder layers, each consisting of self-attention, encoder-decoder attention, and a feed-forward network. In the encoder-decoder attention module, the target query can attend to all positions in the template and search-region features, learning a robust representation for the final bounding-box prediction.
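The single-query decoder can be sketched with stock PyTorch modules; the layer hyperparameters (nhead=8, dim_feedforward=2048, 6 layers) are taken from the module dump above, while the learned-query setup is an assumption modeled on DETR:

```python
import torch
import torch.nn as nn

# A decoder stack with a single learned target query (N = 1).
decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

query_embed = nn.Embedding(1, 256)      # one learnable target query
memory = torch.randn(464, 1, 256)       # encoder output (HW_total, B, C)
tgt = query_embed.weight.unsqueeze(1)   # (1, B=1, 256)
hs = decoder(tgt, memory)               # (1, 1, 256): one box embedding
print(hs.shape)
```

With a single query there is nothing to match, which is why Hungarian matching becomes unnecessary.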
Head
The uncertainty in coordinate estimation is modeled explicitly, producing more accurate and robust predictions for target tracking. A new prediction head is designed that estimates the probability distributions of the box corners.
We first extract the search-region features from the encoder's output sequence (400×256), then compute the similarity between these features and the decoder's output embedding (256).
Finally, the result is reshaped into a feature map f (256×20×20) and fed to an FCN.
if self.head_type == "CORNER":
    # adjust shape: encoder output for the search region (B, HW, C) = (1, 400, 256)
    enc_opt = memory[-self.feat_len_s:].transpose(0, 1)
    # decoder output embedding (B, C, N) = (1, 256, 1)
    dec_opt = hs.squeeze(0).transpose(1, 2)
    # dot product -> similarity/attention (B, HW, N) = (1, 400, 1)
    att = torch.matmul(enc_opt, dec_opt)
    # element-wise mul: (B, HW, C, N) --> (B, N, C, HW)
    opt = (enc_opt.unsqueeze(-1) * att.unsqueeze(-2)).permute((0, 3, 2, 1)).contiguous()
    bs, Nq, C, HW = opt.size()
    opt_feat = opt.view(-1, C, self.feat_sz_s, self.feat_sz_s)
    # run the corner head
    outputs_coord = box_xyxy_to_cxcywh(self.box_head(opt_feat))
    outputs_coord_new = outputs_coord.view(bs, Nq, 4)
    out = {'pred_boxes': outputs_coord_new}
    return out, outputs_coord_new
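The corner head's final step, estimating a coordinate as the expectation over a probability map, can be sketched with a soft-argmax; this is an illustrative reconstruction of the "probability distribution of box corners" idea, not the verbatim STARK code:

```python
import torch
import torch.nn.functional as F

def soft_argmax(score_map: torch.Tensor):
    """Expected (x, y) coordinate from a score map of shape (B, H, W):
    softmax over all positions, then the expectation of x and y."""
    b, h, w = score_map.shape
    prob = F.softmax(score_map.flatten(1), dim=1).view(b, h, w)
    xs = torch.arange(w, dtype=torch.float32)
    ys = torch.arange(h, dtype=torch.float32)
    exp_x = (prob.sum(dim=1) * xs).sum(dim=1)  # marginalize over y, weight by x
    exp_y = (prob.sum(dim=2) * ys).sum(dim=1)  # marginalize over x, weight by y
    return exp_x, exp_y

# A map sharply peaked at (x=5, y=12) yields an expectation near that point.
m = torch.full((1, 20, 20), -10.0)
m[0, 12, 5] = 10.0
x, y = soft_argmax(m)
```

Taking an expectation instead of a hard argmax keeps the head differentiable and reflects the uncertainty of the distribution.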
Loss function
No classification loss is used; STARK-S is trained with only the box losses (L1 and generalized IoU, per the loss weights in the training settings below).
During inference, the first-frame template features are fixed, and the search region cropped from each frame is taken as input.
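A minimal sketch of the combined box loss, using the weights given in the training settings (λ_L1 = 5, λ_iou = 2); the GIoU computation is a standard pure-PyTorch implementation, not the paper's own code:

```python
import torch

def giou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Generalized IoU for paired (B, 4) boxes in (x1, y1, x2, y2) form."""
    ix1 = torch.max(pred[:, 0], gt[:, 0]); iy1 = torch.max(pred[:, 1], gt[:, 1])
    ix2 = torch.min(pred[:, 2], gt[:, 2]); iy2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    iou = inter / union
    # smallest enclosing box
    cx1 = torch.min(pred[:, 0], gt[:, 0]); cy1 = torch.min(pred[:, 1], gt[:, 1])
    cx2 = torch.max(pred[:, 2], gt[:, 2]); cy2 = torch.max(pred[:, 3], gt[:, 3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return iou - (c_area - union) / c_area

def box_loss(pred, gt, w_giou=2.0, w_l1=5.0):
    # combined loss: lambda_iou * (1 - GIoU) + lambda_L1 * L1
    return w_giou * (1 - giou(pred, gt)).mean() + w_l1 * torch.abs(pred - gt).sum(1).mean()

pred = torch.tensor([[0.1, 0.1, 0.5, 0.5]])
gt = torch.tensor([[0.1, 0.1, 0.5, 0.5]])
loss = box_loss(pred, gt)  # zero when the boxes match exactly
```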
STARK-ST
A dynamic template (3×128×128) is added to the input (what is the dynamic template in the first frame?).
After the backbone, the dynamic-template features are (256×8×8).
Before being fed to the encoder, all features are flattened and concatenated: (256×(400+64+64)).
Score head
A classification (score) head is added; a threshold on its output decides whether the current crop is the target.
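The inference-time update rule described in the text can be sketched as follows; the function name, threshold value, and state layout are hypothetical, while the 200-frame interval comes from the paper:

```python
import torch

def maybe_update_template(frame_idx, conf_score, crop, state,
                          update_interval=200, tau=0.5):
    """Every update_interval frames, replace the dynamic template only
    when the score head judges the current state reliable (> tau)."""
    if frame_idx % update_interval == 0 and conf_score > tau:
        state["dynamic_template"] = crop  # crop around the predicted box
    return state

state = {"dynamic_template": torch.zeros(3, 128, 128)}
crop = torch.ones(3, 128, 128)
state = maybe_update_template(200, 0.9, crop, state)  # update accepted
state = maybe_update_template(400, 0.2, torch.zeros(3, 128, 128), state)  # rejected
```

Gating the update on the score is what filters out unreliable templates (occlusion, target out of view).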
Training
Training is divided into two stages: the backbone/localization network first, then the classification head.
Two cited papers ([18] and [19]) argue that training them separately works well, although many current papers learn classification and localization jointly.
Classification loss
Experiments
Training settings
The training data includes LaSOT [13], GOT-10K [18], COCO2017 [30], and TrackingNet [36].
The search image and template sizes are 320×320 and 128×128 pixels, corresponding to 5² and 2² times the target-box area, respectively.
The full STARK-ST training is divided into two stages: 500 epochs for localization and 50 for classification. Each epoch uses 60,000 training triplets. The loss weights λL1 and λiou are set to 5 and 2. AdamW is used with weight decay 10⁻⁴.
Each GPU holds 16 triplets, so the minibatch size per iteration is 128.
The initial learning rates for the backbone and the rest of the network are 10⁻⁵ and 10⁻⁴, respectively. In the first stage the learning rate drops 10× after 400 epochs; in the second stage, after 40 epochs.
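The two learning rates and the step schedule above map directly onto PyTorch parameter groups; the tiny stand-in modules are placeholders for the real backbone and transformer:

```python
import torch
import torch.nn as nn

# Stand-ins for the real ResNet backbone and transformer/head parts.
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "transformer": nn.Linear(8, 8),
})

# 1e-5 for the backbone, 1e-4 for everything else, AdamW wd = 1e-4.
param_groups = [
    {"params": model["backbone"].parameters(), "lr": 1e-5},
    {"params": model["transformer"].parameters(), "lr": 1e-4},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)

# Drop both learning rates 10x at epoch 400 (stage 1; use 40 for stage 2).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=400, gamma=0.1)
```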
Results and comparison
We test on three short-term benchmarks (GOT-10K, TrackingNet, and VOT2020) and two long-term benchmarks (LaSOT and VOT2020-LT).
GOT-10K [18] is a recently released large-scale, highly diverse benchmark for generic object tracking in the wild. It contains more than 10,000 video clips of real-world moving objects. All methods follow the same training and test protocol provided by the dataset, ensuring fair comparison of deep trackers. The classes in the training and test sets have zero overlap. After the tracking results are uploaded, the official server analyzes them automatically. The evaluation metrics include success plots, average overlap (AO), and success rates (SR). AO is the average overlap between all estimated bounding boxes and the ground-truth boxes. SR0.5 is the fraction of successfully tracked frames whose overlap exceeds 0.5, and SR0.75 the fraction whose overlap exceeds 0.75.
To account for the size of the ground-truth box, precision is normalized to obtain Norm. Prec., with thresholds in [0, 0.5]: the Euclidean distance between the centers of the predicted and ground-truth boxes is measured relative to the scale of the ground-truth box.
AUC: area under the curve of the success-rate plot.
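The success plot and its AUC can be sketched in a few lines; the 21-point threshold grid is a common convention, assumed here for illustration:

```python
import numpy as np

def success_curve(ious: np.ndarray, thresholds=np.linspace(0, 1, 21)) -> np.ndarray:
    """For each overlap threshold t, the fraction of frames with IoU > t."""
    return np.array([(ious > t).mean() for t in thresholds])

# Per-frame IoUs between predicted and ground-truth boxes (toy values).
ious = np.array([0.9, 0.6, 0.3, 0.0])
curve = success_curve(ious)
auc = curve.mean()  # AUC = mean success rate over the threshold grid
```

SR0.5 and SR0.75 are just two individual points on this curve.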
VOT2020
https://blog.csdn.net/Dr_destiny/article/details/80108255
EAO is VOT's comprehensive metric for short-term tracking. It reflects both accuracy (A) and robustness (R), but is not computed directly from them.
Accuracy evaluates how precisely the tracker localizes the target; larger is more accurate.
Robustness evaluates the tracker's stability; a larger value means less stable (it counts failures).
LaSOT
LaSOT [13] is a large-scale long-term tracking benchmark whose test set contains 280 videos with an average length of 2,448 frames.
VOT2020-LT
VOT2020-LT consists of 50 long videos in which the target object frequently disappears and reappears. In addition, the tracker must report a confidence score for the current target. Precision (Pr) and recall (Re) are computed under a series of confidence thresholds, and the F-score is reported.
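VOT-LT ranks trackers by the best F-score over the confidence thresholds; the combination of Pr and Re is just their harmonic mean:

```python
def f_score(pr: float, re: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * pr * re / (pr + re)

# Sweep hypothetical (Pr, Re) pairs from different confidence thresholds
# and keep the maximum, as the benchmark does.
best = max(f_score(pr, re) for pr, re in [(0.7, 0.6), (0.8, 0.5)])
```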
Component comparison
Analysis
Performance drops when the score head is removed, indicating that improper use of temporal information can hurt performance; filtering out unreliable templates is therefore crucial.
The corner head is more accurate.
The encoder is the more important component (compared with the decoder).
Comparison with other frameworks
Learning localization and classification jointly.
As described in Section 3.2, localization is treated as the primary task and trained in the first stage; classification is trained as a secondary task in the second stage. We also ran an experiment learning localization and classification jointly in a single stage, as shown in Table 7. This strategy leads to suboptimal results, 3.9% lower than the STARK strategy. Two potential reasons: (1) optimizing the score head interferes with training the box head, leading to inaccurate box predictions; (2) the two tasks require different data. Specifically, the localization task wants every search region to contain the target, to provide strong supervision; by contrast, the classification task expects a balanced distribution, with half of the search regions containing the target and half not.
Encoder attention. The top half of Figure 6 shows a template-search triplet from Cat-20 together with attention maps from the last encoder layer. The visualized attention takes the central pixel of the initial template as the query, with all pixels in the triplet as keys and values. It can be seen that attention focuses on the tracked target, which has been roughly separated from the background. Moreover, the features produced by the encoder discriminate strongly between the target and distractors.
Decoder attention. The bottom half of Figure 6 shows a triplet from BOWS-13 and attention maps from the last decoder layer. The decoder attends differently to the templates and the search region: attention on the templates concentrates mainly on the top-left corner of the target, while attention in the search region tends to focus on the target's boundary. The learned attention is also robust to distractors.