[Target tracking] | STARK
2022-07-01 15:25:00 【rrr2】
This paper uses a transformer to integrate temporal context information.
Temporal context is obtained by updating a dynamic template in the input (set to 200 frames in the paper), and the transformer then processes it.
The transformer also models spatial context well.
The new architecture consists of three key components: an encoder, a decoder, and a prediction head.
The encoder takes the initial target object, the current image, and a dynamically updated template as input. That is, the input is a triplet consisting of a search region and two templates. Their backbone features are first flattened and concatenated, then fed to the encoder.
The self-attention modules in the encoder learn the relationships between the inputs through their feature dependencies. Because the template image is updated throughout the video sequence, the encoder can capture both spatial and temporal information about the target.
The decoder learns a query embedding to predict the spatial position of the target object.
A corner-based prediction head estimates the bounding box of the target in the current frame. In addition, a score head is learned to control the update of the dynamic template image.
Background
Offline Siamese trackers are purely spatial trackers: they treat target tracking as template matching between the initial template and the current search region.
Gradient-free methods [54,57] use an additional network to update the template of Siamese trackers [2,61]. Another representative work, LTMU [8], learns a meta-updater that predicts whether the current state is reliable enough to be used for an update in long-term tracking. Although these methods are effective, they separate space and time (the paper does not explain why they are separated or what the consequences of the separation are). In contrast, the method in this paper treats spatial and temporal information as a whole and learns both with a transformer.
STARK-S structure
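Backbone
The search region x (3×320×320) and the initial template z (3×128×128) pass through a ResNet backbone (stride s=16), which gives the features fx (256×20×20) and fz (256×8×8).
Before being fed to the encoder, the features are flattened and concatenated into a sequence of shape 256×(400+64).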
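The shapes above follow from simple arithmetic on the input size and the backbone stride. The sketch below only reproduces that arithmetic; the helper names `feature_side` and `sequence_length` are made up for illustration and are not part of the STARK code.

```python
def feature_side(image_side: int, stride: int = 16) -> int:
    """Side length of the backbone feature map for a square input."""
    return image_side // stride


def sequence_length(image_sides, stride: int = 16) -> int:
    """Number of tokens after flattening and concatenating all feature maps."""
    return sum(feature_side(side, stride) ** 2 for side in image_sides)


search_side, template_side, channels = 320, 128, 256
tokens = sequence_length([search_side, template_side])  # 20*20 + 8*8
print((channels, tokens))  # (256, 464)
```

Passing a third 128-pixel input to `sequence_length` gives the longer sequence of a model with two templates.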
Encoder
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
)
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
)
The encoder stacks N=6 encoder layers, each consisting of multi-head self-attention and a feed-forward network (FFN).
Sinusoidal positional embeddings are also computed for the input sequence.
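As a rough illustration of sinusoidal positional embeddings, here is the standard 1D form in plain Python. This is a simplified sketch: DETR-style trackers typically use a 2D variant over the feature-map grid, and the function name and the `temperature` default are assumptions rather than values taken from this post.

```python
import math


def sine_position_embedding(num_positions, dim, temperature=10000.0):
    """Return a num_positions x dim table of sinusoidal embeddings.

    Even channels hold sin, odd channels hold cos, with frequencies that
    decrease geometrically along the channel axis. dim must be even.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            freq = 1.0 / (temperature ** (i / dim))
            row.append(math.sin(pos * freq))
            row.append(math.cos(pos * freq))
        table.append(row)
    return table


# One embedding per token of the 464-token sequence, 256 channels each.
pos_embed = sine_position_embedding(464, 256)
```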
Decoder
Only a single query is fed to the decoder to predict one bounding box of the target object. That is, only one set of candidate box values is output.
In addition, because there is only one prediction, the Hungarian algorithm [24] that DETR uses for prediction association is removed. Like the encoder, the decoder stacks M decoder layers, each consisting of self-attention, encoder-decoder attention, and a feed-forward network. In the encoder-decoder attention module, the target query can attend to all positions of the template and search-region features, and thereby learns a robust representation for the final bounding box prediction.
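To make the single-query idea concrete, the following sketch computes scaled dot-product attention for one query over a set of encoder tokens. It is a deliberate simplification (one head, no learned projections, plain Python lists), and `single_query_attention` is a hypothetical name, not a function from the STARK repository.

```python
import math


def single_query_attention(query, keys, values):
    """Attend from one query vector to all keys; return (output, weights)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out_dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
    return output, weights


# Two encoder tokens with identical keys receive equal attention.
out, w = single_query_attention([1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Because there is a single query, the output is a single embedding, which matches the one-box prediction described above.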
Head
To produce more accurate and robust predictions for tracking, the uncertainty in coordinate estimation is modeled explicitly: a new prediction head is designed that estimates the probability distribution of the box corners.
The search-region features are first taken from the encoder output sequence (400×256), and the similarity between these features and the decoder output embedding (256) is computed.
The similarity is then multiplied element-wise with the search-region features, and the result is reshaped into a feature map f (256×20×20) and fed to an FCN.

if self.head_type == "CORNER":
    # adjust shapes
    # encoder output for the search region: (B, HW, C), e.g. [1, 400, 256]
    enc_opt = memory[-self.feat_len_s:].transpose(0, 1)
    # decoder output embedding: (B, C, N), e.g. [1, 256, 1]
    dec_opt = hs.squeeze(0).transpose(1, 2)
    # dot product (similarity): (B, HW, N), e.g. [1, 400, 1]
    att = torch.matmul(enc_opt, dec_opt)
    # element-wise multiplication: (B, HW, C, N) --> (B, N, C, HW)
    opt = (enc_opt.unsqueeze(-1) * att.unsqueeze(-2)).permute((0, 3, 2, 1)).contiguous()
    bs, Nq, C, HW = opt.size()
    opt_feat = opt.view(-1, C, self.feat_sz_s, self.feat_sz_s)
    # run the corner head
    outputs_coord = box_xyxy_to_cxcywh(self.box_head(opt_feat))
    outputs_coord_new = outputs_coord.view(bs, Nq, 4)
    out = {'pred_boxes': outputs_coord_new}
    return out, outputs_coord_new
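The snippet above stops at `self.box_head`, so the post does not show how a corner is read out of its probability map. One common way, assumed here, is a soft-argmax: apply a softmax over the score map and take the expected coordinate. The sketch below works on a plain list of rows, returns grid coordinates, and uses the hypothetical name `soft_argmax_2d`.

```python
import math


def soft_argmax_2d(score_map):
    """Expected (x, y) grid position under softmax(score_map).

    score_map is a list of rows; x indexes columns and y indexes rows.
    """
    flat = [v for row in score_map for v in row]
    top = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in flat]
    total = sum(exps)
    width = len(score_map[0])
    exp_x = exp_y = 0.0
    for idx, e in enumerate(exps):
        p = e / total
        exp_x += p * (idx % width)
        exp_y += p * (idx // width)
    return exp_x, exp_y


# A 20x20 map with one strong peak at row 2, column 5.
scores = [[0.0] * 20 for _ in range(20)]
scores[2][5] = 50.0
corner_x, corner_y = soft_argmax_2d(scores)
```

Running this once for the top-left map and once for the bottom-right map would give the four box coordinates; dividing by the map size would normalize them to [0, 1].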

Loss function
STARK-S does not use a classification loss; only the box losses (L1 and IoU, weighted by λL1 and λiou) are used.
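A minimal sketch of such a box loss is given below, using the weights λL1 = 5 and λiou = 2 listed in the training settings of this post. Two things are assumptions: the IoU term is implemented as a generalized IoU loss, and the L1 term is averaged over the four coordinates. Boxes are (x1, y1, x2, y2) tuples, and the function names are illustrative only.

```python
def iou_and_giou(a, b):
    """Return (IoU, generalized IoU) of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclose - union) / enclose
    return iou, giou


def box_loss(pred, target, w_iou=2.0, w_l1=5.0):
    """Weighted sum of a GIoU loss and a mean L1 loss over the coordinates."""
    _, giou = iou_and_giou(pred, target)
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / 4.0
    return w_iou * (1.0 - giou) + w_l1 * l1
```

A perfect prediction gives a loss of zero, and boxes that do not overlap at all still receive a useful gradient signal through the enclosing-box term.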
At inference, the features of the first-frame template are kept fixed, and the search region cropped from each frame is used as input.
STARK-ST


A dynamic template (3×128×128) is added to the input. (Open question: what is the dynamic template for the first frame?)
The backbone produces the dynamic template features (256×8×8).
Before being fed to the encoder, the features are flattened and concatenated into a sequence of shape 256×(400+64+64).
Classification head
A classification (score) head is added. A threshold on its output decides whether the target is present.
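How the score gates the template update can be sketched as a small rule. This sketch assumes that the 200-frame setting mentioned at the top of this post is the update interval and that 0.5 is the score threshold; the threshold value and the name `should_update_template` are not taken from the post.

```python
def should_update_template(frame_idx, score, interval=200, threshold=0.5):
    """Update only on interval frames (never frame 0) with a confident score."""
    on_interval = frame_idx > 0 and frame_idx % interval == 0
    return on_interval and score > threshold


# Simulated scores: confident everywhere except frame 400.
updates = [
    idx for idx in range(0, 601)
    if should_update_template(idx, 0.2 if idx == 400 else 0.9)
]
```

With these simulated scores the template is refreshed at frames 200 and 600 but not at frame 400, where the low score marks the crop as unreliable.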
Training
Training is split into two stages: the main network first, then the classification head.
The paper points to two articles (given as 18 and 19) that say training them separately works better. At present, however, many papers combine classification and localization.
Classification loss
Experiments
Training settings
The training data consists of LaSOT [13], GOT-10K [18], COCO2017 [30] and TrackingNet [36].
The search image and the template are 320×320 pixels and 128×128 pixels, corresponding to 5² and 2² times the area of the target box, respectively.
The whole training process of STARK-ST is divided into two stages: 500 epochs for localization and 50 epochs for classification. Each epoch uses 60,000 training triplets. The loss weights λL1 and λiou are set to 5 and 2. The optimizer is AdamW with a weight decay of 10^-4.
Each GPU holds 16 triplets, so the mini-batch size per iteration is 128 (that is, 8 GPUs).
The initial learning rates of the backbone and of the remaining parts are 10^-5 and 10^-4, respectively. The learning rate drops by a factor of 10 after 400 epochs in the first stage and after 40 epochs in the second stage.
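The learning-rate schedule described above is a single step decay per stage. The sketch below writes it as a plain function, assuming zero-based epoch indices; the names `learning_rate`, `STAGES` and `BASE_LR` are illustrative and do not come from the STARK configuration files.

```python
def learning_rate(epoch, base_lr, drop_epoch, factor=0.1):
    """Step schedule: base_lr before drop_epoch, base_lr * factor from then on."""
    return base_lr * factor if epoch >= drop_epoch else base_lr


# Stage lengths and drop points as listed in the training settings.
STAGES = {
    "localization": {"epochs": 500, "drop_epoch": 400},
    "classification": {"epochs": 50, "drop_epoch": 40},
}
# Initial learning rates for the backbone and for the remaining parts.
BASE_LR = {"backbone": 1e-5, "rest": 1e-4}

stage = STAGES["localization"]
rest_lrs = [
    learning_rate(epoch, BASE_LR["rest"], stage["drop_epoch"])
    for epoch in range(stage["epochs"])
]
```

In PyTorch the same behavior is usually obtained with a step scheduler attached to an optimizer that has separate parameter groups for the backbone and the rest.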
Results and comparison
The method is tested on three short-term benchmarks (GOT-10K, TrackingNet and VOT2020) and two long-term benchmarks (LaSOT and VOT2020-LT).
GOT-10K [18] is a recently released, large-scale and highly diverse benchmark for generic object tracking in the wild. It contains more than 10,000 video clips of real-world moving objects. All methods use the same training and test protocol provided by the dataset, which ensures a fair comparison of deep trackers. The classes in the training set and the test set have zero overlap. After the tracking results are uploaded, the official website analyzes them automatically. The evaluation metrics provided include the success plot, the average overlap (AO) and the success rate (SR). AO is the average overlap between all estimated bounding boxes and the ground-truth boxes. SR0.5 is the rate of successfully tracked frames whose overlap exceeds 0.5, and SR0.75 is the rate of frames whose overlap exceeds 0.75.
Normalized precision (Norm. Prec) is obtained by normalizing the precision with the size of the ground-truth box, with the normalized distance taken in [0, 0.5]. That is, it relates the Euclidean distance between the centers of the predicted box and the ground-truth box to the side lengths of the ground-truth box.
AUC (area under curve): the area under the curve of the success plot.
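The overlap-based metrics described above can be computed from a list of per-frame IoU values. The following sketch is an approximation for illustration, not the official toolkit code: the AUC is taken as the mean success rate over 21 evenly spaced thresholds, which is a common convention but an assumption here.

```python
def average_overlap(ious):
    """AO: mean IoU between predicted and ground-truth boxes."""
    return sum(ious) / len(ious)


def success_rate(ious, threshold):
    """SR: fraction of frames whose IoU exceeds the threshold."""
    return sum(1 for v in ious if v > threshold) / len(ious)


def success_auc(ious, num_thresholds=21):
    """Area under the success plot, approximated by averaging SR over thresholds."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    return sum(success_rate(ious, t) for t in thresholds) / num_thresholds


frame_ious = [0.9, 0.8, 0.6, 0.4, 0.0]
ao = average_overlap(frame_ious)
sr50 = success_rate(frame_ious, 0.5)
sr75 = success_rate(frame_ious, 0.75)
auc = success_auc(frame_ious)
```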
VOT2020

https://blog.csdn.net/Dr_destiny/article/details/80108255
EAO is the overall evaluation metric of VOT for short-term tracking. It reflects both accuracy (A) and robustness (R), but it is not computed directly from accuracy (A) and robustness (R).
Accuracy evaluates how accurately the tracker follows the target. The larger the value, the more accurate the tracker.
Robustness evaluates how stably the tracker follows the target. Under the failure-count definition used in the linked post, the larger the value, the less stable the tracker.
LaSOT
LaSOT [13] is a large-scale long-term tracking benchmark. Its test set contains 280 videos with an average length of 2448 frames.
VOT2020-LT
VOT2020-LT consists of 50 long videos in which the target object frequently disappears and reappears. In addition, the tracker has to report a confidence score for the current target. Precision (Pr) and recall (Re) are computed under a series of confidence thresholds, and the F-score is computed from them.
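The F-score is the harmonic mean of precision and recall. The sketch below also picks the best F-score across thresholds, which is assumed here to be how the final number is reported; the (Pr, Re) pairs are made-up values for illustration, not results from the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_f_score(pr_re_pairs):
    """Highest F-score over a series of (Pr, Re) pairs, one per threshold."""
    return max(f_score(p, r) for p, r in pr_re_pairs)


# Hypothetical (Pr, Re) pairs at three confidence thresholds.
pairs = [(0.9, 0.1), (0.7, 0.7), (0.2, 0.95)]
best = best_f_score(pairs)
```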

Component comparison

Analysis
Performance drops when the score head is removed. This indicates that improper use of temporal information can hurt performance, so it is very important to filter out unreliable templates.
The corner head is more accurate.
The encoder is the more important component.
Comparison with other frameworks
Learning localization and classification jointly.
As described in Section 3.2, localization is treated as the primary task and trained in the first stage, while classification is trained as a secondary task in the second stage. The paper also runs an experiment that learns localization and classification jointly in a single stage. As shown in Tab. 7, this strategy leads to suboptimal results, 3.9% lower than the STARK strategy. Two potential reasons are: (1) the optimization of the score head interferes with the training of the box head, which leads to inaccurate box predictions; (2) the two tasks need different training data. Specifically, the localization task wants every search region to contain the tracked target, so as to provide strong supervision. In contrast, the classification task wants a balanced distribution, with half of the search regions containing the target and the other half not.

Encoder attention. The top half of Fig. 6 shows a template-search triplet from Cat-20 and the attention map from the last encoder layer. The visualized attention is computed with the central pixel of the initial template as the query and all pixels in the triplet as keys and values. The attention concentrates on the tracked target and has already roughly separated it from the background. In addition, the features produced by the encoder are also strongly discriminative between the target and distractors.
Decoder attention. The bottom half of Fig. 6 shows a template-search triplet from BOWS-13 and the attention map from the last decoder layer. The decoder attends differently to the template and to the search region. Specifically, attention on the template is mainly focused on the upper-left corner of the target, while attention in the search region tends to focus on the boundary of the target. In addition, the learned attention is robust to distractors.