
RGB-T tracking [multimodal fusion]: Visible-Thermal UAV Tracking: A Large-Scale Benchmark and New Baseline

2022-07-28 09:22:00 【ZZ's big spike】

Contents

Introduction to RGB-T tracking algorithms
HMFT

This paper presents a large-scale RGB-T tracking dataset and a corresponding baseline, which achieves the best performance on the existing datasets GTOT / RGBT210 / RGBT234.
For details on the datasets, see this blog post: RGB-T tracking [dataset benchmarks]: GTOT / RGBT210 / RGBT234 / VOT-2019-2020 / LasHeR / VTUAV

HMFT: paper / dataset

Introduction to RGB-T tracking algorithms

RGB-T trackers usually follow a pipeline similar to that of RGB trackers and focus on designing the fusion of the two modalities. Existing fusion methods fall into three types: image fusion, feature fusion, and decision fusion.

[Image fusion]: A backbone network with shared weights learns features from both the visible image and the thermal infrared image. The shared weights effectively extract the information common to both images that is useful for locating the target. The drawback is that this approach requires the visible and thermal infrared images to be well aligned.
[Feature fusion]: Most trackers fuse the features of the visible and thermal infrared images. There are two ways to do this: 1. use one modality as an auxiliary modality to refine the other; 2. first concatenate the features of the two modalities directly (usually channel-wise), then learn a new feature that captures the interaction between the modalities through a deep network. The advantage is high flexibility: the images do not need to be strictly aligned.
[Decision fusion]: Each modality independently produces an estimate of the target in the form of a response map; the decisions of the two modalities are then merged into one final score.
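To make the three fusion levels concrete, here is a toy NumPy sketch. It is only an illustration under loud assumptions: `toy_backbone` and `toy_head` are hypothetical stand-ins (simple linear maps) for a real CNN backbone and tracking head, and the equal 0.5 weights in the decision step are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8

def toy_backbone(image, weight):
    # Hypothetical stand-in for a CNN: a 1x1 "convolution" from 3 channels to C.
    return np.einsum("oc,chw->ohw", weight, image)

def toy_head(features, weight):
    # Hypothetical stand-in for a tracking head: collapses channels to an H x W map.
    return np.einsum("c,chw->hw", weight, features)

rgb = rng.random((3, H, W))
thermal = rng.random((3, H, W))

# Image-level fusion: one backbone with shared weights processes both images.
shared_w = rng.standard_normal((C, 3))
feat_v = toy_backbone(rgb, shared_w)
feat_t = toy_backbone(thermal, shared_w)

# Feature-level fusion: channel-wise concatenation, then a learned mixing layer.
concat = np.concatenate([feat_v, feat_t], axis=0)        # (2C, H, W)
mix_w = rng.standard_normal((C, 2 * C))
fused_feat = np.einsum("oc,chw->ohw", mix_w, concat)     # (C, H, W)

# Decision-level fusion: each modality yields its own response map; merge the maps.
head_w = rng.standard_normal(C)
resp_v = toy_head(feat_v, head_w)
resp_t = toy_head(feat_t, head_w)
final_score = 0.5 * resp_v + 0.5 * resp_t                # (H, W)
```

The point of the sketch is only where the two modalities meet: at the shared weights, at the concatenated features, or at the response maps.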

HMFT

This model incorporates all three fusion methods above. As the figure shows, the HMFT framework has two branches, the discriminative branch and the complementary branch, and consists of three main modules: CIF / DFF / ADF.
[Figure: overall HMFT framework]

Discriminative branch:
Complementary branch:

Complementary image fusion [CIF]

This module learns the target-related information that is consistent across the two modalities.
[Figure: CIF module]

Module input: $I_v$ and $I_t$, the RGB image and the thermal image respectively.
The blue part is the network that extracts complementary information [Comp. Backbone], a ResNet50 with shared weights that extracts the common features. $L_{div}$ is a KL-divergence loss that keeps the two modalities consistent by constraining the distributions of their features. During training, the objective is therefore to make the features the backbone outputs for the two modalities as similar as possible, which amounts to capturing the consistent information. The objective is the sum, over the layers $i$, of the KL divergence between $P_v^i$ and $P_t^i$, where $P_v^i$ and $P_t^i$ are the features of the visible and thermal images at the $i$-th layer of ResNet50.
The output is the channel-wise concatenated feature $P_a \in \mathbb{R}^{2C \times H \times W}$; the original features are $P_{v/t} \in \mathbb{R}^{C \times H \times W}$.
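The divergence loss and the concatenated output can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation. Two things are assumptions: each feature map is flattened and softmax-normalised into a distribution, and the divergence is taken in the direction $\mathrm{KL}(P_v^i \,\|\, P_t^i)$. The function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of the same shape.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def divergence_loss(feats_v, feats_t):
    # Sum of per-layer KL divergences between the two modalities' features.
    total = 0.0
    for p_v, p_t in zip(feats_v, feats_t):
        # Assumption: flatten each feature map and normalise it with softmax.
        dist_v = softmax(p_v.reshape(-1))
        dist_t = softmax(p_t.reshape(-1))
        total += kl_divergence(dist_v, dist_t)
    return total

def concat_features(p_v, p_t):
    # Channel-wise concatenation: two (C, H, W) maps -> one (2C, H, W) map.
    return np.concatenate([p_v, p_t], axis=0)
```

With identical features for both modalities the loss is zero, which is the state the training objective pushes towards.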

Discriminative feature fusion [DFF]

This module learns the distinct discriminative information carried by each modality. RGB images provide strong appearance information, while infrared images provide information about the target contour. The two modalities are therefore first modeled separately, and the resulting features are then fused. The process is as follows:
[Figure: DFF module]

Module input: the features $F_v$ and $F_t$ that the backbone outputs independently for the two modalities.
Blue box: $F_v$ and $F_t$ are combined by element-wise summation (Elem. Sum), then passed through global average pooling (GAP) and a fully connected layer (FC) to obtain a global vector $d_g$ that contains information from both modalities, i.e. $d_g = \mathrm{FC}(\mathrm{GAP}(F_v + F_t))$. The formula in the paper writes $D_v$ and $D_t$ here; these correspond to $F_v$ and $F_t$ and are probably a typo.
Orange box: two independent modality-specific fully connected layers $\mathcal{F}_v$ and $\mathcal{F}_t$, followed by a softmax, generate the modality-specific channel-wise weights $w_v, w_t \in \mathbb{R}^{C \times 1 \times 1}$.
![Insert picture description here](https://img-blog.csdnimg.cn/ed63d75e0c5d4442aa089a8109b33a1f.png#pic_center)
Red box: the computed weights $w_v$ and $w_t$ are multiplied channel-wise with the original modality features $F_v$ and $F_t$, and the results are summed.
Module output: the fused feature $D_a^i$.
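The blue, orange, and red boxes can be sketched as one NumPy function. This is a hedged illustration rather than the authors' code. The weight matrices `w_fc`, `w_fc_v`, and `w_fc_t` are hypothetical stand-ins for the fully connected layers (biases omitted), and applying the softmax across the two modalities for each channel is an assumption.

```python
import numpy as np

def dff_fuse(f_v, f_t, w_fc, w_fc_v, w_fc_t):
    """Fuse two (C, H, W) feature maps with modality-specific channel weights.

    w_fc: (D, C) shared FC layer; w_fc_v, w_fc_t: (C, D) modality-specific FC layers.
    """
    # Blue box: element-wise sum -> global average pooling -> FC gives d_g.
    summed = f_v + f_t
    pooled = summed.mean(axis=(1, 2))                    # (C,)
    d_g = w_fc @ pooled                                  # (D,)

    # Orange box: modality-specific FC layers, then softmax.
    logits = np.stack([w_fc_v @ d_g, w_fc_t @ d_g])      # (2, C)
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    # Assumption: softmax is taken across the two modalities, per channel.
    weights = weights / weights.sum(axis=0, keepdims=True)
    w_v = weights[0][:, None, None]                      # (C, 1, 1)
    w_t = weights[1][:, None, None]                      # (C, 1, 1)

    # Red box: channel-wise re-weighting of each modality, then summation.
    fused = w_v * f_v + w_t * f_t
    return fused, w_v, w_t
```

Under this assumption the two weights sum to one for every channel, so the fused feature is a per-channel convex combination of $F_v$ and $F_t$.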

Adaptive decision fusion [ADF]

Based on the feature maps output independently by the CIF and DFF branches, this module computes the confidence of each feature map, derives weights from those confidences, applies the weights to the maps, and then generates the final feature map.
[Figure: ADF module]

Module input: the feature maps $P_a$ and $D_a$ output independently by the CIF and DFF branches.
The MAM module uses a self-attention mechanism to obtain the confidences $M_c$ and $M_d$ of the consistency (complementary) branch and the discriminative branch respectively. Specifically, for an input feature $X$ (that is, $P_a$ or $D_a$ above), a $1 \times 1$ convolution first reduces the feature dimension to cut the amount of computation. A reshape operation then changes the shape of $X$ from $C \times W \times H$ to $C \times WH$, which serves as the embedding for the self-attention mechanism and yields an $HW \times C$ feature. Summing over the channels and reshaping gives the $H \times W \times 1$ modality confidence.
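The MAM steps can be sketched in NumPy as below. Several parts are assumptions made for illustration: the $1 \times 1$ convolution is written as a bias-free linear map `w_reduce`, and the self-attention is plain scaled dot-product attention with queries, keys, and values all equal to the embedded tokens. The original post does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mam_confidence(x, w_reduce):
    """Map a (C, H, W) feature to an (H, W, 1) confidence map.

    w_reduce: (C_r, C) weights of a bias-free 1x1 convolution (hypothetical).
    """
    _, h, w = x.shape
    # 1x1 convolution reduces the channel dimension to save computation.
    reduced = np.einsum("oc,chw->ohw", w_reduce, x)      # (C_r, H, W)
    # Reshape (C_r, H, W) -> (C_r, HW), then transpose to HW tokens of size C_r.
    tokens = reduced.reshape(reduced.shape[0], h * w).T  # (HW, C_r)
    # Assumption: scaled dot-product self-attention with Q = K = V = tokens.
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    attended = softmax(scores, axis=-1) @ tokens         # (HW, C_r)
    # Sum over channels, then reshape to the spatial confidence map.
    return attended.sum(axis=1).reshape(h, w, 1)
```

Applying this once to $P_a$ and once to $D_a$ would give the two confidence maps $M_c$ and $M_d$.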
$M_c$ and $M_d$ are concatenated and fed into a two-layer encoder-decoder network to obtain the respective weights $E_c, E_d \in \mathbb{R}^{H \times W}$. These weights are multiplied element-wise with the response maps $R_c$ and $R_d$ output independently by the CIF and DFF branches (a weighting operation) to obtain $R_F$:
$R_F=R_d \odot E_d+R_c \odot E_c$
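The weighting step can be sketched as follows. The encoder-decoder here is a hypothetical two-layer per-pixel network (`w_enc`, `w_dec`, with a ReLU in between), and normalising $E_c$ and $E_d$ with a per-pixel softmax is an assumption. Only the last line, $R_F = R_d \odot E_d + R_c \odot E_c$, comes directly from the text.

```python
import numpy as np

def adf_fuse(r_c, r_d, m_c, m_d, w_enc, w_dec):
    """Fuse two (H, W) response maps using weights derived from confidences.

    m_c, m_d: (H, W) confidence maps; w_enc: (K, 2); w_dec: (2, K) (hypothetical).
    """
    # Concatenate the two confidence maps along a new channel axis.
    m = np.stack([m_c, m_d], axis=0)                         # (2, H, W)
    # Hypothetical two-layer encoder-decoder applied at every pixel.
    hidden = np.maximum(np.einsum("kc,chw->khw", w_enc, m), 0.0)
    logits = np.einsum("ck,khw->chw", w_dec, hidden)         # (2, H, W)
    # Assumption: the two weights are normalised per pixel with a softmax.
    logits = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(logits)
    e = e / e.sum(axis=0, keepdims=True)
    e_c, e_d = e[0], e[1]
    # Weighted sum of the branch response maps, as in the formula above.
    r_f = r_d * e_d + r_c * e_c
    return r_f, e_c, e_d
```

With normalised weights, every pixel of $R_F$ lies between the corresponding pixels of $R_c$ and $R_d$, so the more confident branch dominates locally.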

Algorithm flow

[Figure: algorithm flow]
For the current tracking frame:

1. The two branches, the discriminative branch and the complementary branch, use feature fusion and image fusion respectively to produce target response maps;
2. ADF fuses the response maps of the two branches into the final response map;
3. The IoU prediction module from DiMP takes 10 proposals, predicts an IoU score for each, and averages the three highest-scoring proposals to output the final predicted bounding box.
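The final box selection (average of the three best of 10 proposals) can be sketched like this. The function name and the `(x, y, w, h)` box layout are assumptions for illustration, and the IoU scores are taken as given rather than predicted by a network.

```python
import numpy as np

def select_final_box(proposals, iou_scores, top_k=3):
    """Average the top_k proposals with the highest predicted IoU scores.

    proposals: (N, 4) boxes, assumed here to be (x, y, w, h); iou_scores: (N,).
    """
    proposals = np.asarray(proposals, dtype=float)
    iou_scores = np.asarray(iou_scores, dtype=float)
    # Indices of the top_k highest scores, best first.
    top_idx = np.argsort(iou_scores)[::-1][:top_k]
    # Coordinate-wise mean of the selected boxes.
    return proposals[top_idx].mean(axis=0)

# Usage: 10 toy proposals whose predicted IoU grows with the index.
boxes = [[i, i, 10, 10] for i in range(10)]
scores = [0.1 * i for i in range(10)]
final_box = select_final_box(boxes, scores)
```

Averaging several high-scoring proposals instead of taking only the best one tends to smooth out noise in any single box estimate.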


Copyright notice
This article was written by [ZZ's big spike]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/209/202207280844090227.html