Some time series modeling strategies (I)
2022-07-26 19:11:00 [Gu daochangsheng]
Temporal Kernel Selection Block
Paper title: BiCnet-TKS: Learning Efficient Spatial-Temporal Representation for Video Person Re-Identification
A CVPR 2021 paper from the Chinese Academy of Sciences.
Paper: link
Code: link
Following [30, 42], we decompose the video network into spatial cues and temporal relations. The efficient BiCnet fully exploits spatial cues, while the Temporal Kernel Selection (TKS) block jointly models short-term and long-term temporal relations. Because temporal relations at different scales matter differently for different sequences (as shown in Figure 2), TKS combines multi-scale temporal relations in a dynamic way, i.e., it assigns different weights to different temporal scales according to the input sequence.

Figure 2: Short-term and long-term temporal relations have different importance for different sequences. (a) A partially occluded sequence: long-term temporal cues are needed to alleviate the occlusion. (b) A fast-moving pedestrian sequence: short-term temporal cues are needed to model detailed motion patterns.
Specifically, TKS takes a sequence of consecutive frame features $F=\{F_t\}_{t=1}^{T}$ as input, where $F_t$ is the feature map of the $t$-th frame, and performs three operations on $F$: Partition, Select, and Excite.
Partition operation. Because person detection is imperfect, adjacent video frames are often spatially misaligned, which can make temporal convolution ineffective for video reID [9]. Following [34], we adopt a partition strategy to alleviate this misalignment. Specifically, given the video feature maps $\{F_t\}_{t=1}^{T}$, we divide each frame into $h \times w$ spatial regions and average-pool each region, building a regional video feature map $X \in \mathbb{R}^{T \times C \times h \times w}$.
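As a concrete illustration, here is a minimal PyTorch sketch of the partition step (the grid size $h \times w$ and the tensor shapes are assumptions for illustration, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def partition(frames, h=4, w=2):
    """Average-pool each frame into an h x w grid of regions.

    frames: (T, C, H, W) feature maps of one video.
    returns: (T, C, h, w) regional video feature map X.
    """
    return F.adaptive_avg_pool2d(frames, (h, w))

# Example: 8 frames of 2048-channel 16x8 feature maps -> (8, 2048, 4, 2)
X = partition(torch.randn(8, 2048, 16, 8))
```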
Select operation. As shown in Figure 4, given $X$, we run $K$ parallel paths $\{\mathcal{F}^{(i)}: X \rightarrow Y^{(i)} \in \mathbb{R}^{T \times C \times h \times w}\}_{i=1}^{K}$, where $\mathcal{F}^{(i)}$ is a 1D temporal convolution with kernel size $2i+1$ [30]. To further improve efficiency, the temporal convolution with a $(2i+1) \times 1 \times 1$ kernel is replaced by a dilated convolution with a $3 \times 1 \times 1$ kernel and dilation $i$. The basic idea of the select operation is to use global information from all temporal paths to determine the weight assigned to each path. Specifically, we first fuse the outputs of all paths by element-wise summation, then perform global average pooling to obtain the global feature $u \in \mathbb{R}^{C \times 1}$:
$$u = \mathrm{GAP}_{T,h,w}\left(\sum_{i=1}^{K} Y^{(i)}\right),$$
where $\mathrm{GAP}_{T,h,w}$ denotes global average pooling over the temporal and spatial dimensions. The channel selection weights $\{g_i \in \mathbb{R}^{C \times 1}\}_{i=1}^{K}$ are then obtained from the global embedding $u$:
$$g_i = \frac{\exp\left(W_i u\right)}{\sum_{j=1}^{K} \exp\left(W_j u\right)}, \quad i \in \{1, \ldots, K\},$$
where $W_i \in \mathbb{R}^{C \times C}$ are the transformation parameters that generate $g_i$ for $Y^{(i)}$. The aggregated feature map $Z \in \mathbb{R}^{T \times C \times h \times w}$ is then obtained by weighting the outputs of the different temporal kernels with the selection weights:
$$Z = \sum_{i=1}^{K} \mathcal{R}\left(g_i\right) \odot Y^{(i)},$$
where $\mathcal{R}$ is a reshaping operation that reshapes $g_i \in \mathbb{R}^{C \times 1}$ to $\mathbb{R}^{1 \times C \times 1 \times 1}$ so that it is broadcast-compatible with $Y^{(i)}$.
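Putting the select operation together, a minimal PyTorch sketch under the definitions above might look as follows (the number of paths $K$, the layer names, and the (B, C, T, h, w) layout are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class TemporalSelect(nn.Module):
    """Select operation: fuse K dilated temporal conv paths with channel-wise softmax weights."""

    def __init__(self, channels, K=2):
        super().__init__()
        # Path i uses a 3x1x1 temporal kernel with dilation i (receptive field 2i+1).
        self.paths = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      dilation=(i, 1, 1), padding=(i, 0, 0))
            for i in range(1, K + 1)
        )
        # One transform W_i per path to produce channel selection logits.
        self.weights = nn.ModuleList(nn.Linear(channels, channels) for _ in range(K))

    def forward(self, x):
        # x: (B, C, T, h, w) regional video feature map.
        ys = [path(x) for path in self.paths]                # K x (B, C, T, h, w)
        u = torch.stack(ys).sum(0).mean(dim=(2, 3, 4))       # global feature u: (B, C)
        logits = torch.stack([w(u) for w in self.weights])   # (K, B, C)
        g = torch.softmax(logits, dim=0)                     # softmax over the K paths
        g = g.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)      # (K, B, C, 1, 1, 1)
        return (g * torch.stack(ys)).sum(0)                  # aggregated feature Z
```

The softmax is taken over the $K$ paths independently for each channel, matching the channel-wise selection weights described above.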

It is worth pointing out that, compared with using scale-level weights for coarse fusion, we choose channel-level weights (Eq. 7) for fusion. This design yields finer-grained fusion, in which each feature channel can be adjusted individually. Moreover, the weights are computed dynamically from the input video, which can be crucial for reID, since different sequences may be dominated by different temporal scales.
Excite operation. The excite operation adjusts $Z$ to modulate the input feature maps. The final feature maps $E=\{E_t\}_{t=1}^{T}$ are given by $E_t = \mathcal{U}(Z_t) + F_t$, where $\mathcal{U}$ is a nearest-neighbor upsampler that upsamples $Z_t$ to match the spatial resolution of $F_t$. The TKS block preserves the input size, so it can be inserted into BiCnet to extract effective spatio-temporal features.
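The excite step then reduces to a nearest-neighbor upsample plus a residual addition; a brief sketch under the same assumed (B, C, T, H, W) layout:

```python
import torch.nn.functional as F

def excite(Z, frames):
    """Upsample the aggregated map Z to the frame resolution and add it residually.

    Z:      (B, C, T, h, w) output of the select operation.
    frames: (B, C, T, H, W) original input feature maps F.
    """
    up = F.interpolate(Z, size=frames.shape[-3:], mode="nearest")  # nearest-neighbor upsampling
    return frames + up
```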
Temporal-wise Dynamic Networks
Paper title: Dynamic Neural Networks: A Survey
A TPAMI 2021 survey from Tsinghua University.
Paper: link
In general, network efficiency can be improved by dynamically allocating less computation, or none at all, to inputs at unimportant temporal positions.
Temporal-wise Dynamic Video Recognition
For video recognition, a video can be viewed as a sequential input of frames, and temporal-wise dynamic networks aim to allocate adaptive computational resources to different frames. This is usually achieved in two ways: 1) dynamically updating the hidden state at each time step of a recurrent model (Section 4.2.1), and 2) performing adaptive pre-sampling of key frames (Section 4.2.2).
4.2.1 Video Recognition with Dynamic RNNs
Video recognition is usually carried out as a recurrent process, in which video frames are first encoded by a 2D CNN and the resulting frame features are fed to an RNN to update its hidden state. RNN-based adaptive video recognition is typically achieved by: 1) processing unimportant frames with relatively cheap computation ("glimpses") [177], [178]; 2) early exiting [61], [62]; and 3) performing dynamic jumps to decide "where to see" [61], [179], [180], [181].
Dynamic update of the hidden state. To reduce redundant computation at each time step, LiteEval [177] chooses between two LSTMs with different computational costs. ActionSpotter [178] decides for each input frame whether to update the hidden state. AdaFuse [182] selectively reuses some feature channels from the previous step to exploit historical information efficiently. Recent work also proposes adaptively determining the numerical precision [183] or the modality [184], [185] when processing sequential input frames.
Temporal early exiting. Humans can often understand a video's content before watching it in full. This kind of early stopping is also implemented in dynamic networks, where the prediction is based on only a portion of the video frames [61], [62], [186]. In addition to the temporal dimension, the model in [62] further realizes early exiting along the network depth.
Frame skipping. Considering that encoding unimportant frames with a CNN still requires substantial computation, a more efficient solution may be to dynamically skip certain frames without looking at them. Existing techniques [179], [180], [187] usually learn a prediction network that outputs the position to jump to at each time step. In addition, [61] allows both early stopping and dynamic jumping, with the jump stride limited to a discrete range, while AdaFrame [181] generates a continuous scalar in [0, 1] as the relative jump location.
4.2.2 Dynamic Key Frame Sampling
An adaptive pre-sampling procedure is performed first, and the prediction is then made by processing the selected subset of key frames or clips.
Temporal attention is a common technique that lets the network focus on salient frames. For face recognition, the neural aggregation network [22] uses soft attention to adaptively aggregate frame features. To improve inference efficiency, hard attention is implemented with RL to iteratively discard unimportant frames for efficient video face verification [188].
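As a rough illustration of soft-attention frame aggregation of the kind used in [22], here is a minimal sketch (the single-layer scoring function and the dimensions are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class SoftFrameAggregation(nn.Module):
    """Aggregate T frame features into one video feature via learned soft attention."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame importance score

    def forward(self, frame_feats):
        # frame_feats: (B, T, D)
        attn = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1), sums to 1 over frames
        return (attn * frame_feats).sum(dim=1)                # (B, D) aggregated video feature

video_feat = SoftFrameAggregation(512)(torch.randn(4, 16, 512))
```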
Sampling modules are also a popular option for dynamically selecting key frames/clips from a video. For example, frames are first uniformly sampled in [189], [190], and a discrete decision is then made for each selected frame to step forward or backward. As for clip-level sampling, SCSample [191] is built on a trained classifier to find the most informative clips for prediction. In addition, the dynamic sampling network (DSN) [192] divides each video into multiple segments and samples one clip from each segment using a sampling module whose weights are shared across segments.
Temporal Deformable Convolutional Encoder
Paper title: Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
An AAAI 2019 paper from Sun Yat-sen University.
Paper: link
The encoder is the module that takes the source sequence (i.e., the video's frame/clip sequence) as input and produces intermediate states encoding the semantic content. Here, we design a temporal deformable convolution block in the encoder of TDConvED, which applies temporal deformable convolution to the input sequence so as to capture the context of frames/clips sampled under free-form temporal deformations, as shown in Figure 3(a). This temporal deformable convolution improves on conventional temporal convolution by capturing temporal dynamics on the natural basis of actions/scenes in the video. Meanwhile, the feed-forward convolutional structure of the encoder enables parallelization over the input sequence and allows fast computation. In addition, to exploit long-range dependencies within the input sequence during encoding, multiple temporal deformable convolution blocks are stacked in the encoder to integrate contextual information from a large number of temporal samples of the input sequence.

Figure 3: Diagrams of (a) the temporal deformable convolution block in the encoder and (b) the shifted convolution block in the decoder.
Formally, consider the $l$-th temporal deformable convolution block in the encoder, whose output sequence is denoted as $\mathbf{p}^{l}=(p_{1}^{l}, p_{2}^{l}, \ldots, p_{N_{v}}^{l})$, where $p_{i}^{l} \in \mathbb{R}^{D_{r}}$ is the output of a temporal deformable convolution centered on the $i$-th frame/clip. Given the output sequence $\mathbf{p}^{l-1}=(p_{1}^{l-1}, p_{2}^{l-1}, \ldots, p_{N_{v}}^{l-1})$ of the $(l-1)$-th block, each output $p_{i}^{l}$ is obtained by feeding a subsequence of $\mathbf{p}^{l-1}$ into a temporal deformable convolution (kernel size $k$) followed by a nonlinearity. Note that the temporal deformable convolution operates in two stages: it first measures the temporal offsets of the sampled frames/clips with a 1D convolution, and then aggregates the features of the sampled frames/clips, as shown in Figure 3. More specifically, let $X=(p_{i+r_{1}}^{l-1}, p_{i+r_{2}}^{l-1}, \ldots, p_{i+r_{k}}^{l-1})$ denote the subsequence of $\mathbf{p}^{l-1}$, where $r_{n}$ is the $n$-th element of $R=\{-k/2, \ldots, 0, \ldots, k/2\}$. The 1D convolution in the $l$-th temporal deformable convolution block is parameterized by a transformation matrix $W_{f}^{l} \in \mathbb{R}^{k \times k D_{r}}$ and a bias $b_{f}^{l} \in \mathbb{R}^{k}$; it takes the concatenation of the $k$ elements of $X$ as input and produces a set of offsets $\Delta r^{i}=\{\Delta r_{n}^{i}\}_{n=1}^{k} \in \mathbb{R}^{k}$:
$$\Delta r^{i}=W_{f}^{l}\left[p_{i+r_{1}}^{l-1}, p_{i+r_{2}}^{l-1}, \ldots, p_{i+r_{k}}^{l-1}\right]+b_{f}^{l},$$
where the $n$-th element $\Delta r_{n}^{i}$ of $\Delta r^{i}$ denotes the measured temporal offset of the $n$-th sample in the subsequence $X$. Next, another 1D convolution is applied to the temporally offset samples to produce the output of the temporal deformable convolution:
$$o_{i}^{l}=W_{d}^{l}\left[p_{i+r_{1}+\Delta r_{1}^{i}}^{l-1}, p_{i+r_{2}+\Delta r_{2}^{i}}^{l-1}, \ldots, p_{i+r_{k}+\Delta r_{k}^{i}}^{l-1}\right]+b_{d}^{l}, \quad (4)$$
where $W_{d}^{l} \in \mathbb{R}^{2D_{r} \times kD_{r}}$ is the transformation matrix of this 1D convolution and $b_{d}^{l} \in \mathbb{R}^{2D_{r}}$ is its bias. Since the temporal offset $\Delta r_{n}^{i}$ is usually fractional, $p_{i+r_{n}+\Delta r_{n}^{i}}^{l-1}$ in Eq. (4) is computed via temporal linear interpolation:
$$p_{i+r_{n}+\Delta r_{n}^{i}}^{l-1}=\sum_{s} B\left(s, i+r_{n}+\Delta r_{n}^{i}\right) p_{s}^{l-1},$$
where $i+r_{n}+\Delta r_{n}^{i}$ denotes an arbitrary (fractional) position, $s$ enumerates all integral positions in the input sequence $\mathbf{p}^{l-1}$, and $B(a, b)=\max(0, 1-|a-b|)$.
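A minimal PyTorch sketch of this two-stage temporal deformable convolution (offset prediction, linear-interpolation sampling, then aggregation) is shown below; the module layout, the boundary handling by clamping, and the parameter names are illustrative assumptions rather than the released TDConvED implementation:

```python
import torch
import torch.nn as nn

class TemporalDeformableConv(nn.Module):
    """Two-stage temporal deformable convolution over a feature sequence (B, T, D)."""

    def __init__(self, dim, k=3):
        super().__init__()
        self.k = k
        # The 1D convs over the concatenated k neighbors reduce to linear maps here.
        self.offsets = nn.Linear(k * dim, k)          # W_f, b_f: predict k temporal offsets
        self.aggregate = nn.Linear(k * dim, 2 * dim)  # W_d, b_d: aggregate sampled features

    def gather_linear(self, p, pos):
        # p: (B, T, D); pos: (B, T, k) fractional positions.
        # Temporal linear interpolation; boundaries are clamped instead of zero-padded for brevity.
        T = p.size(1)
        pos = pos.clamp(0, T - 1)
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=T - 1)
        w = (pos - lo.float()).unsqueeze(-1)                      # interpolation weight
        def take(idx):
            return torch.gather(p.unsqueeze(2).expand(-1, -1, self.k, -1), 1,
                                idx.unsqueeze(-1).expand(-1, -1, -1, p.size(-1)))
        return (1 - w) * take(lo) + w * take(hi)                  # (B, T, k, D)

    def forward(self, p):
        B, T, D = p.shape
        r = torch.arange(self.k, device=p.device) - self.k // 2    # base sampling grid r_n
        base = torch.arange(T, device=p.device).unsqueeze(-1) + r  # (T, k) integer positions
        neigh = self.gather_linear(p, base.float().expand(B, -1, -1))
        dr = self.offsets(neigh.flatten(2))                        # (B, T, k) learned offsets
        sampled = self.gather_linear(p, base.float() + dr)         # deformed sampling
        return self.aggregate(sampled.flatten(2))                  # o_i^l: (B, T, 2D)
```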
Moreover, we use a gated linear unit (GLU) as the nonlinearity to ease gradient propagation. Given the output $o_{i}^{l} \in \mathbb{R}^{2D_{r}}$ of the temporal deformable convolution, whose dimension is twice that of the input elements, GLU is applied on $o_{i}^{l}=[A, B]$ through a simple gating mechanism:
$$g\left(o_{i}^{l}\right)=A \otimes \sigma(B),$$
where $A, B \in \mathbb{R}^{D_{r}}$ and $\otimes$ denotes element-wise multiplication. $\sigma(B)$ acts as a gate that controls which elements of $A$ are more relevant to the current context. In addition, a residual connection from the input of the temporal deformable convolution block to its output is added to enable deeper networks. The final output of the $l$-th temporal deformable convolution block is therefore
$$p_{i}^{l}=g\left(o_{i}^{l}\right)+p_{i}^{l-1}.$$
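A brief sketch of how the GLU nonlinearity and residual connection wrap the deformable convolution, reusing the `TemporalDeformableConv` sketch above (again an illustrative composition, not the authors' code):

```python
import torch
import torch.nn as nn

class TDConvBlock(nn.Module):
    """Temporal deformable convolution followed by GLU and a residual connection."""

    def __init__(self, dim, k=3):
        super().__init__()
        self.deform_conv = TemporalDeformableConv(dim, k)  # outputs (B, T, 2*dim)

    def forward(self, p):
        o = self.deform_conv(p)          # o_i^l = [A, B], dimension 2*D_r
        a, b = o.chunk(2, dim=-1)
        return a * torch.sigmoid(b) + p  # GLU gating plus residual connection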
To ensure that the output sequence length of a temporal deformable convolution block matches the input length, the input is padded with $k/2$ zero vectors on both the left and right sides. By stacking several temporal deformable convolution blocks on the input frame/clip sequence, we obtain the final sequence of context vectors $\mathbf{z}=(z_{1}, z_{2}, \ldots, z_{N_{v}})$, where $z_{i} \in \mathbb{R}^{D_{r}}$ corresponds to the $i$-th frame/clip.
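Reusing the sketches above, stacking blocks into an encoder might then look like the following (the feature dimension 512, the two blocks, and $N_v = 20$ are arbitrary example values):

```python
import torch
import torch.nn as nn

# Reuses TDConvBlock from the sketch above; each block preserves the sequence length.
encoder = nn.Sequential(TDConvBlock(512), TDConvBlock(512))
z = encoder(torch.randn(4, 20, 512))  # context vectors z: (B, N_v, D_r)
```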