LEARNING TARGET-ORIENTED DUAL ATTENTION FOR ROBUST RGB-T TRACKING

2022-06-11 06:50:00 【A Xuan is going to graduate~】

Rui Yang, Yabin Zhu, Xiao Wang, Chenglong Li, Jin Tang

Hefei, Anhui Province, China

2019 IEEE International Conference on Image Processing (ICIP)

1. Abstract

RGBT tracking aims to locate targets using complementary visible and thermal infrared data. Existing RGBT trackers fuse the different modalities either through robust feature representation learning or through adaptive modality weighting. However, how to integrate a dual attention mechanism into visual tracking has not yet been studied. This paper proposes two visual attention mechanisms for robust RGBT tracking. Specifically, local attention is implemented by exploiting the common visual attention of the RGB and T data to train a deep classifier. Global attention is also introduced, in the form of a multi-modal target-driven attention estimation network. It provides the classifier with global proposals in addition to the local proposals extracted from the previous tracking result.

2. Introduction

This paper proposes a new RGBT tracking algorithm guided by dual visual attention: local attention and global attention. The training process consists of a forward step and a backward step. In the forward step, paired RGB and T samples are fed into the deep tracking-by-detection network to estimate the corresponding classification scores. In the backward step, the partial derivatives of the classification score with respect to the paired RGB-T input samples are computed, propagating from the last fully connected layer back to the first convolutional layer. The partial derivatives at the first layer are taken as the attention maps of the RGB and thermal inputs. Each pixel value of an attention map indicates how important the corresponding pixel of the input RGB-T sample is to the classification accuracy. During training, the attention maps are added to the loss function as a regularization term, so that the classifier pays more attention to the target region.
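The idea that an attention map is the partial derivative of the classification score with respect to each input pixel can be illustrated with a toy example. The sketch below is not the paper's network: `score` is a hypothetical one-layer sigmoid classifier, and the derivative is approximated by central finite differences rather than by backpropagation.

```python
import math

def score(pixels, weights, bias=0.0):
    """Toy classifier (an assumption): sigmoid of a weighted sum of input pixels."""
    z = sum(w * p for w, p in zip(weights, pixels)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def attention_map(pixels, weights, eps=1e-5):
    """Approximate d(score)/d(pixel) for every pixel by central differences."""
    grads = []
    for i in range(len(pixels)):
        up = list(pixels)
        down = list(pixels)
        up[i] += eps
        down[i] -= eps
        grads.append((score(up, weights) - score(down, weights)) / (2 * eps))
    return grads

# A flattened four-pixel "sample"; values and weights are illustrative only.
rgb_t_sample = [0.2, 0.8, 0.5, 0.1]
toy_weights = [0.0, 2.0, -1.0, 0.5]
attention = attention_map(rgb_t_sample, toy_weights)
```

Pixels whose weight is larger receive a larger attention value, and a pixel the classifier ignores (weight 0) receives none. A real tracker would obtain the same quantity in one backward pass instead of perturbing pixels one at a time.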

Local search strategy

This paper extends the target-driven attention estimation network first proposed in Paper 1 into an RGB-T global attention mechanism, to deal with the problems caused by the local search strategy. Specifically, the RGB image, the T image, and the original target image are taken as input, and the feature maps extracted by the convolutional network are concatenated. These features are fed into an upsampling network to generate the corresponding attention map. High-quality global proposals are extracted from the attention region and sent to the classifier together with the local proposals. The complementarity of the local and global attention maps therefore further improves the robustness and accuracy of the RGB-T tracker.
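How a proposal might be read off an attention region can be sketched as follows. The text does not say how the regions are turned into boxes, so the thresholding rule, the function name `global_proposal`, and the toy map are all assumptions made for illustration.

```python
def global_proposal(attention, threshold):
    """Return (x_min, y_min, x_max, y_max) enclosing all pixels above threshold.

    Hypothetical helper: returns None when no pixel exceeds the threshold.
    """
    coords = [
        (x, y)
        for y, row in enumerate(attention)
        for x, value in enumerate(row)
        if value > threshold
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))

# Toy 3x4 attention map with one high-response region.
toy_attention = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.6, 0.1],
]
box = global_proposal(toy_attention, threshold=0.5)
```

A box obtained this way can lie anywhere in the frame, which is what lets global proposals recover a target that has left the local search window.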

Contributions of this paper:

(1) A local attention mechanism based on visual attention is proposed for RGB-T tracking.

(2) To further improve the robustness of the RGB-T tracker, the target-driven global attention mechanism is extended to a multi-modal form.

3. Method

3.1 Network structure

The network mainly consists of two modules: a local attention network for RGBT tracking and a multi-modal target-driven global attention estimation network.

3.1.1 Local attention network

A general tracking-by-detection framework, such as MDNet, usually trains a binary classifier by treating the target object as the positive class and the background as the negative class. This paper uses MDNet as the core of the RGBT tracker because of its strong feature representation ability. Specifically, for each input RGB and T sample pair, three convolutional layers and two fully connected layers are used to extract features. To reduce the computational burden, the features of the two modalities are concatenated and fed into the domain-specific layer to obtain the score map. The network is optimized with the cross-entropy loss.
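For reference, the standard binary cross-entropy over a mini-batch of N sample pairs is -(1/N) Σ [y_i log P_i + (1 - y_i) log(1 - P_i)]. Assuming this standard form is the one intended, a minimal sketch looks like this; the function name and the clamping constant `eps` are illustrative choices, not taken from the paper.

```python
import math

def cross_entropy_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; labels are 0/1, probs are predicted P(target)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# One positive and one negative RGBT sample pair, both classified confidently.
loss = cross_entropy_loss([1, 0], [0.9, 0.1])
```

Both predictions here are 90% confident and correct, so the loss reduces to -log(0.9).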

In the cross-entropy loss, N is the mini-batch size, y_i is the ground-truth label of the i-th RGBT sample pair, and P_i is the corresponding prediction for that sample pair. To make the classifier pay more attention to the target during tracking, a regularization term is added to MDNet's cross-entropy loss. The motivation is that two attention maps can be obtained for each input pair: the positive attention map A_p and the negative attention map A_n. For each positive sample, the pixel values of A_p that relate to the target object should be large, while the pixel values of A_n should be small.

The regularization term is built from the mean and the variance of these attention maps.

The final loss function combines the cross-entropy loss with the regularization term. A scalar parameter balances the two terms, and the effect of this weighting is also examined in the subsequent experiments.
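The way a balancing scalar combines the two terms can be sketched as below. The exact regularizer is not reproduced in this text, so `attention_regularizer` is only an assumed ratio form for a positive sample, chosen because it shrinks when A_p has a large mean and A_n has a small one. It uses the standard deviation rather than the variance, and the names `lam`, `ratio`, and `total_loss` are hypothetical.

```python
import statistics

def ratio(numerator, denominator, eps=1e-8):
    """Safe division used by the assumed regularizer."""
    return numerator / (denominator + eps)

def attention_regularizer(pos_map, neg_map):
    """Assumed form for a positive sample: std/mean of A_p plus mean/std of A_n."""
    mu_p, sd_p = statistics.mean(pos_map), statistics.pstdev(pos_map)
    mu_n, sd_n = statistics.mean(neg_map), statistics.pstdev(neg_map)
    return ratio(sd_p, mu_p) + ratio(mu_n, sd_n)

def total_loss(ce_loss, reg, lam):
    """Cross-entropy plus the regularizer weighted by the balancing scalar lam."""
    return ce_loss + lam * reg

# Flattened toy attention maps (illustrative values only).
neg_map = [0.1, 0.3, 0.1, 0.3]
reg_strong = attention_regularizer([0.9, 0.8, 0.9, 0.8], neg_map)
reg_weak = attention_regularizer([0.2, 0.1, 0.2, 0.1], neg_map)
loss = total_loss(ce_loss=0.1, reg=reg_strong, lam=0.5)
```

Under this assumed form, a positive attention map with strong, uniform responses is penalized less than a weak one, and `lam` controls how much that preference counts against the classification error.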

Based on Eq. 4, this interactive learning can be carried out with standard backpropagation and the chain rule. In each training iteration of the classifier, an attention map is obtained for every input training sample, so the classifier learns to focus on the target object rather than on the background. In the tracking phase, the trained classifier attends to the target in both the RGB and the thermal images.

Although the local attention mechanism achieves better performance, the improved tracking-by-detection framework still relies on a local search strategy, which makes it sensitive to challenges such as heavy occlusion, out-of-view targets, and fast motion. Therefore, this paper introduces an RGB-T target-driven global attention network to deal with this problem.

3.1.2 Global attention network

This section presents the RGB-T target-driven global attention network, which supplements the local proposals for robust visual tracking. As shown in the network diagram, the input to this module is the RGB image, the thermal infrared image, and the corresponding target objects. A truncated VGG network extracts the feature representation of each input, and the resulting features are concatenated into a single feature map. More precisely, every input image is first resized to 192x256x3, which gives a feature map of size 12x16x512, so the concatenated feature map is 12x16x2048. It is then fed into the upsampling network, which is a reversed VGG network, and the output has the same resolution as the input.
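The sizes quoted above can be checked with a few lines of arithmetic. Two things are inferred from the numbers rather than stated in the text: a total downsampling stride of 16 (192/12 and 256/16), and four concatenated 512-channel feature maps (2048/512), for example the RGB frame, the thermal frame, and one target patch per modality.

```python
def truncated_vgg_shape(height, width, stride=16, channels=512):
    """Feature-map shape after a backbone with the assumed total stride."""
    return (height // stride, width // stride, channels)

def concat_channels(shapes):
    """Concatenate feature maps along the channel axis; spatial sizes must match."""
    h, w, _ = shapes[0]
    if any(s[:2] != (h, w) for s in shapes):
        raise ValueError("spatial sizes differ")
    return (h, w, sum(s[2] for s in shapes))

single = truncated_vgg_shape(192, 256)   # one resized 192x256x3 input
fused = concat_channels([single] * 4)    # four inputs, as inferred above
```

With these assumptions, `single` comes out as 12x16x512 and `fused` as 12x16x2048, matching the sizes given in the text.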

Paper 1: Xiao Wang, Chenglong Li, Rui Yang, Tianzhu Zhang, Jin Tang, and Bin Luo, "Describe and attend to track: Learning natural language guided structural representation and visual attention for object tracking," arXiv preprint arXiv:1811.10014, 2018.
