Some time series modeling strategies (I)
2022-07-26 19:11:00 [Gu daochangsheng]
Temporal Kernel Selection Block
Paper title: BiCnet-TKS: Learning Efficient Spatial-Temporal Representation for Video Person Re-Identification
A CVPR 2021 paper from the Chinese Academy of Sciences.
Paper: link
Code: link
Following [30, 42], we decompose the video network into spatial cues and temporal relations. The efficient BiCnet fully exploits spatial cues, while the Temporal Kernel Selection (TKS) block jointly models short-term and long-term temporal relations. Because temporal relations at different scales matter differently for different sequences (as shown in Figure 2), TKS combines multi-scale temporal relations in a dynamic way, i.e., it assigns different weights to different temporal scales according to the input sequence.

Figure 2: Short-term and long-term temporal relations have different importance for different sequences. (a) A partially occluded sequence: long-term temporal cues are needed to alleviate the occlusion. (b) A fast-moving pedestrian sequence: short-term temporal cues are needed to model detailed motion patterns.
Specifically, TKS takes a sequence of consecutive frame features $F=\{F_t\}_{t=1}^{T}$ as input, where $F_t$ is the feature map of the $t$-th frame, and performs three operations on $F$: Partition, Select, and Excite.
Partition operation. Because person detection is imperfect, adjacent video frames are often spatially misaligned, which can make temporal convolution ineffective for video reID [9]. Following [34], we adopt a partition strategy to alleviate this misalignment. Specifically, given the video feature maps $\{F_t\}_{t=1}^{T}$, we divide each frame into $h \times w$ spatial regions and average-pool each region, building a regional video feature map $X \in \mathbb{R}^{T \times C \times h \times w}$.
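As a concrete illustration, here is a minimal PyTorch sketch of the partition step (the grid size $h \times w$ and the tensor shapes are assumptions for illustration, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def partition(frames, h=4, w=2):
    """Average-pool each frame into an h x w grid of regions.

    frames: (T, C, H, W) feature maps of one video.
    returns: (T, C, h, w) regional video feature map X.
    """
    return F.adaptive_avg_pool2d(frames, (h, w))

# Example: 8 frames of 2048-channel 16x8 feature maps -> (8, 2048, 4, 2)
X = partition(torch.randn(8, 2048, 16, 8))
```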
Select operation. As shown in Figure 4, given $X$, we run $K$ parallel paths $\{\mathcal{F}^{(i)}: X \rightarrow Y^{(i)} \in \mathbb{R}^{T \times C \times h \times w}\}_{i=1}^{K}$, where $\mathcal{F}^{(i)}$ is a 1D temporal convolution with kernel size $2i+1$ [30]. To further improve efficiency, the temporal convolution with a $(2i+1) \times 1 \times 1$ kernel is replaced by a dilated convolution with a $3 \times 1 \times 1$ kernel and dilation $i$. The basic idea of the select operation is to use global information from all temporal paths to determine the weight assigned to each path. Specifically, we first fuse the outputs of all paths by element-wise summation, then perform global average pooling to obtain the global feature $u \in \mathbb{R}^{C \times 1}$:
$$u = \mathrm{GAP}_{T,h,w}\left(\sum_{i=1}^{K} Y^{(i)}\right),$$
where $\mathrm{GAP}_{T,h,w}$ denotes global average pooling over the temporal and spatial dimensions. The channel selection weights $\{g_i \in \mathbb{R}^{C \times 1}\}_{i=1}^{K}$ are then obtained from the global embedding $u$:
$$g_i = \frac{\exp\left(W_i u\right)}{\sum_{j=1}^{K} \exp\left(W_j u\right)}, \quad i \in \{1, \ldots, K\},$$
where $W_i \in \mathbb{R}^{C \times C}$ are the transformation parameters that generate $g_i$ for $Y^{(i)}$. The aggregated feature map $Z \in \mathbb{R}^{T \times C \times h \times w}$ is then obtained by weighting the outputs of the different temporal kernels with the selection weights:
$$Z = \sum_{i=1}^{K} \mathcal{R}\left(g_i\right) \odot Y^{(i)},$$
where $\mathcal{R}$ is a reshaping operation that reshapes $g_i \in \mathbb{R}^{C \times 1}$ to $\mathbb{R}^{1 \times C \times 1 \times 1}$ so that it is broadcast-compatible with $Y^{(i)}$.
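Putting the select operation together, a minimal PyTorch sketch under the definitions above might look as follows (the number of paths $K$, the layer names, and the (B, C, T, h, w) layout are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class TemporalSelect(nn.Module):
    """Select operation: fuse K dilated temporal conv paths with channel-wise softmax weights."""

    def __init__(self, channels, K=2):
        super().__init__()
        # Path i uses a 3x1x1 temporal kernel with dilation i (receptive field 2i+1).
        self.paths = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      dilation=(i, 1, 1), padding=(i, 0, 0))
            for i in range(1, K + 1)
        )
        # One transform W_i per path to produce channel selection logits.
        self.weights = nn.ModuleList(nn.Linear(channels, channels) for _ in range(K))

    def forward(self, x):
        # x: (B, C, T, h, w) regional video feature map.
        ys = [path(x) for path in self.paths]                # K x (B, C, T, h, w)
        u = torch.stack(ys).sum(0).mean(dim=(2, 3, 4))       # global feature u: (B, C)
        logits = torch.stack([w(u) for w in self.weights])   # (K, B, C)
        g = torch.softmax(logits, dim=0)                     # softmax over the K paths
        g = g.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1)      # (K, B, C, 1, 1, 1)
        return (g * torch.stack(ys)).sum(0)                  # aggregated feature Z
```

The softmax is taken over the $K$ paths independently for each channel, matching the channel-wise selection weights described above.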

It is worth pointing out that, compared with using scale-level weights for coarse fusion, we choose channel-level weights (Eq. 7) for fusion. This design yields finer-grained fusion, in which each feature channel can be adjusted individually. Moreover, the weights are computed dynamically from the input video, which can be crucial for reID, since different sequences may be dominated by different temporal scales.
Excite operation. The excite operation adjusts $Z$ to modulate the input feature maps. The final feature maps $E=\{E_t\}_{t=1}^{T}$ are given by $E_t = \mathcal{U}(Z_t) + F_t$, where $\mathcal{U}$ is a nearest-neighbor upsampler that upsamples $Z_t$ to match the spatial resolution of $F_t$. The TKS block preserves the input size, so it can be inserted into BiCnet to extract effective spatio-temporal features.
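The excite step then reduces to a nearest-neighbor upsample plus a residual addition; a brief sketch under the same assumed (B, C, T, H, W) layout:

```python
import torch.nn.functional as F

def excite(Z, frames):
    """Upsample the aggregated map Z to the frame resolution and add it residually.

    Z:      (B, C, T, h, w) output of the select operation.
    frames: (B, C, T, H, W) original input feature maps F.
    """
    up = F.interpolate(Z, size=frames.shape[-3:], mode="nearest")  # nearest-neighbor upsampling
    return frames + up
```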
Temporal-wise Dynamic Networks
Paper title: Dynamic Neural Networks: A Survey
A TPAMI 2021 survey from Tsinghua University.
Paper: link
In general, network efficiency can be improved by dynamically allocating less computation, or none at all, to inputs at unimportant temporal positions.
Temporal-wise Dynamic Video Recognition
For video recognition, a video can be viewed as a sequential input of frames, and temporal-wise dynamic networks aim to allocate adaptive computational resources to different frames. This is usually achieved in two ways: 1) dynamically updating the hidden state at each time step of a recurrent model (Section 4.2.1), and 2) performing adaptive pre-sampling of key frames (Section 4.2.2).
4.2.1 Video Recognition with Dynamic RNNs
Video recognition is usually carried out as a recurrent process, in which video frames are first encoded by a 2D CNN and the resulting frame features are fed to an RNN to update its hidden state. RNN-based adaptive video recognition is typically achieved by: 1) processing unimportant frames with relatively cheap computation ("glimpses") [177], [178]; 2) early exiting [61], [62]; and 3) performing dynamic jumps to decide "where to see" [61], [179], [180], [181].
Dynamic update of the hidden state. To reduce redundant computation at each time step, LiteEval [177] chooses between two LSTMs with different computational costs. ActionSpotter [178] decides for each input frame whether to update the hidden state. AdaFuse [182] selectively reuses some feature channels from the previous step to exploit historical information efficiently. Recent work also proposes adaptively determining the numerical precision [183] or the modality [184], [185] when processing sequential input frames.
Temporal early exiting. Humans can often understand a video's content before watching it in full. This kind of early stopping is also implemented in dynamic networks, where the prediction is based on only a portion of the video frames [61], [62], [186]. In addition to the temporal dimension, the model in [62] further realizes early exiting along the network depth.
Frame skipping. Considering that encoding unimportant frames with a CNN still requires substantial computation, a more efficient solution may be to dynamically skip certain frames without looking at them. Existing techniques [179], [180], [187] usually learn a prediction network that outputs the position to jump to at each time step. In addition, [61] allows both early stopping and dynamic jumping, with the jump stride limited to a discrete range, while AdaFrame [181] generates a continuous scalar in [0, 1] as the relative jump location.
4.2.2 Dynamic Key Frame Sampling
An adaptive pre-sampling procedure is performed first, and the prediction is then made by processing the selected subset of key frames or clips.
Temporal attention is a common technique that lets the network focus on salient frames. For face recognition, the neural aggregation network [22] uses soft attention to adaptively aggregate frame features. To improve inference efficiency, hard attention is implemented with RL to iteratively discard unimportant frames for efficient video face verification [188].
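As a rough illustration of soft-attention frame aggregation of the kind used in [22], here is a minimal sketch (the single-layer scoring function and the dimensions are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class SoftFrameAggregation(nn.Module):
    """Aggregate T frame features into one video feature via learned soft attention."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame importance score

    def forward(self, frame_feats):
        # frame_feats: (B, T, D)
        attn = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1), sums to 1 over frames
        return (attn * frame_feats).sum(dim=1)                # (B, D) aggregated video feature

video_feat = SoftFrameAggregation(512)(torch.randn(4, 16, 512))
```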
Sampling modules are also a popular option for dynamically selecting key frames/clips from a video. For example, frames are first uniformly sampled in [189], [190], and a discrete decision is then made for each selected frame to step forward or backward. As for clip-level sampling, SCSample [191] is built on a trained classifier to find the most informative clips for prediction. In addition, the dynamic sampling network (DSN) [192] divides each video into multiple segments and samples one clip from each segment using a sampling module whose weights are shared across segments.
Temporal Deformable Convolutional Encoder
Paper title: Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
An AAAI 2019 paper from Sun Yat-sen University.
Paper: link
The encoder is the module that takes the source sequence (i.e., the video's frame/clip sequence) as input and produces intermediate states encoding the semantic content. Here, we design a temporal deformable convolution block in the encoder of TDConvED, which applies temporal deformable convolution to the input sequence so as to capture the context of frames/clips sampled under free-form temporal deformations, as shown in Figure 3(a). This temporal deformable convolution improves on conventional temporal convolution by capturing temporal dynamics on the natural basis of actions/scenes in the video. Meanwhile, the feed-forward convolutional structure of the encoder enables parallelization over the input sequence and allows fast computation. In addition, to exploit long-range dependencies within the input sequence during encoding, multiple temporal deformable convolution blocks are stacked in the encoder to integrate contextual information from a large number of temporal samples of the input sequence.

Figure 3: Diagrams of (a) the temporal deformable convolution block in the encoder and (b) the shifted convolution block in the decoder.
Formally, consider the $l$-th temporal deformable convolution block in the encoder, whose output sequence is denoted as $\mathbf{p}^{l}=(p_{1}^{l}, p_{2}^{l}, \ldots, p_{N_{v}}^{l})$, where $p_{i}^{l} \in \mathbb{R}^{D_{r}}$ is the output of a temporal deformable convolution centered on the $i$-th frame/clip. Given the output sequence $\mathbf{p}^{l-1}=(p_{1}^{l-1}, p_{2}^{l-1}, \ldots, p_{N_{v}}^{l-1})$ of the $(l-1)$-th block, each output $p_{i}^{l}$ is obtained by feeding a subsequence of $\mathbf{p}^{l-1}$ into a temporal deformable convolution (kernel size $k$) followed by a nonlinearity. Note that the temporal deformable convolution operates in two stages: it first measures the temporal offsets of the sampled frames/clips with a 1D convolution, and then aggregates the features of the sampled frames/clips, as shown in Figure 3. More specifically, let $X=(p_{i+r_{1}}^{l-1}, p_{i+r_{2}}^{l-1}, \ldots, p_{i+r_{k}}^{l-1})$ denote the subsequence of $\mathbf{p}^{l-1}$, where $r_{n}$ is the $n$-th element of $R=\{-k/2, \ldots, 0, \ldots, k/2\}$. The 1D convolution in the $l$-th temporal deformable convolution block is parameterized by a transformation matrix $W_{f}^{l} \in \mathbb{R}^{k \times k D_{r}}$ and a bias $b_{f}^{l} \in \mathbb{R}^{k}$; it takes the concatenation of the $k$ elements of $X$ as input and produces a set of offsets $\Delta r^{i}=\{\Delta r_{n}^{i}\}_{n=1}^{k} \in \mathbb{R}^{k}$:
$$\Delta r^{i}=W_{f}^{l}\left[p_{i+r_{1}}^{l-1}, p_{i+r_{2}}^{l-1}, \ldots, p_{i+r_{k}}^{l-1}\right]+b_{f}^{l},$$
where the $n$-th element $\Delta r_{n}^{i}$ of $\Delta r^{i}$ denotes the measured temporal offset of the $n$-th sample in the subsequence $X$. Next, another 1D convolution is applied to the temporally offset samples to produce the output of the temporal deformable convolution:
$$o_{i}^{l}=W_{d}^{l}\left[p_{i+r_{1}+\Delta r_{1}^{i}}^{l-1}, p_{i+r_{2}+\Delta r_{2}^{i}}^{l-1}, \ldots, p_{i+r_{k}+\Delta r_{k}^{i}}^{l-1}\right]+b_{d}^{l}, \quad (4)$$
where $W_{d}^{l} \in \mathbb{R}^{2D_{r} \times kD_{r}}$ is the transformation matrix of this 1D convolution and $b_{d}^{l} \in \mathbb{R}^{2D_{r}}$ is its bias. Since the temporal offset $\Delta r_{n}^{i}$ is usually fractional, $p_{i+r_{n}+\Delta r_{n}^{i}}^{l-1}$ in Eq. (4) is computed via temporal linear interpolation:
$$p_{i+r_{n}+\Delta r_{n}^{i}}^{l-1}=\sum_{s} B\left(s, i+r_{n}+\Delta r_{n}^{i}\right) p_{s}^{l-1},$$
where $i+r_{n}+\Delta r_{n}^{i}$ denotes an arbitrary (fractional) position, $s$ enumerates all integral positions in the input sequence $\mathbf{p}^{l-1}$, and $B(a, b)=\max(0, 1-|a-b|)$.
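A minimal PyTorch sketch of this two-stage temporal deformable convolution (offset prediction, linear-interpolation sampling, then aggregation) is shown below; the module layout, the boundary handling by clamping, and the parameter names are illustrative assumptions rather than the released TDConvED implementation:

```python
import torch
import torch.nn as nn

class TemporalDeformableConv(nn.Module):
    """Two-stage temporal deformable convolution over a feature sequence (B, T, D)."""

    def __init__(self, dim, k=3):
        super().__init__()
        self.k = k
        # The 1D convs over the concatenated k neighbors reduce to linear maps here.
        self.offsets = nn.Linear(k * dim, k)          # W_f, b_f: predict k temporal offsets
        self.aggregate = nn.Linear(k * dim, 2 * dim)  # W_d, b_d: aggregate sampled features

    def gather_linear(self, p, pos):
        # p: (B, T, D); pos: (B, T, k) fractional positions.
        # Temporal linear interpolation; boundaries are clamped instead of zero-padded for brevity.
        T = p.size(1)
        pos = pos.clamp(0, T - 1)
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=T - 1)
        w = (pos - lo.float()).unsqueeze(-1)                      # interpolation weight
        def take(idx):
            return torch.gather(p.unsqueeze(2).expand(-1, -1, self.k, -1), 1,
                                idx.unsqueeze(-1).expand(-1, -1, -1, p.size(-1)))
        return (1 - w) * take(lo) + w * take(hi)                  # (B, T, k, D)

    def forward(self, p):
        B, T, D = p.shape
        r = torch.arange(self.k, device=p.device) - self.k // 2    # base sampling grid r_n
        base = torch.arange(T, device=p.device).unsqueeze(-1) + r  # (T, k) integer positions
        neigh = self.gather_linear(p, base.float().expand(B, -1, -1))
        dr = self.offsets(neigh.flatten(2))                        # (B, T, k) learned offsets
        sampled = self.gather_linear(p, base.float() + dr)         # deformed sampling
        return self.aggregate(sampled.flatten(2))                  # o_i^l: (B, T, 2D)
```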
Moreover, we use a gated linear unit (GLU) as the nonlinearity to ease gradient propagation. Given the output $o_{i}^{l} \in \mathbb{R}^{2D_{r}}$ of the temporal deformable convolution, whose dimension is twice that of the input elements, GLU is applied on $o_{i}^{l}=[A, B]$ through a simple gating mechanism:
$$g\left(o_{i}^{l}\right)=A \otimes \sigma(B),$$
where $A, B \in \mathbb{R}^{D_{r}}$ and $\otimes$ denotes element-wise multiplication. $\sigma(B)$ acts as a gate that controls which elements of $A$ are more relevant to the current context. In addition, a residual connection from the input of the temporal deformable convolution block to its output is added to enable deeper networks. The final output of the $l$-th temporal deformable convolution block is therefore
$$p_{i}^{l}=g\left(o_{i}^{l}\right)+p_{i}^{l-1}.$$
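A brief sketch of how the GLU nonlinearity and residual connection wrap the deformable convolution, reusing the `TemporalDeformableConv` sketch above (again an illustrative composition, not the authors' code):

```python
import torch
import torch.nn as nn

class TDConvBlock(nn.Module):
    """Temporal deformable convolution followed by GLU and a residual connection."""

    def __init__(self, dim, k=3):
        super().__init__()
        self.deform_conv = TemporalDeformableConv(dim, k)  # outputs (B, T, 2*dim)

    def forward(self, p):
        o = self.deform_conv(p)          # o_i^l = [A, B], dimension 2*D_r
        a, b = o.chunk(2, dim=-1)
        return a * torch.sigmoid(b) + p  # GLU gating plus residual connection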
To ensure that the output sequence length of a temporal deformable convolution block matches the input length, the input is padded with $k/2$ zero vectors on both the left and right sides. By stacking several temporal deformable convolution blocks on the input frame/clip sequence, we obtain the final sequence of context vectors $\mathbf{z}=(z_{1}, z_{2}, \ldots, z_{N_{v}})$, where $z_{i} \in \mathbb{R}^{D_{r}}$ corresponds to the $i$-th frame/clip.
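Reusing the sketches above, stacking blocks into an encoder might then look like the following (the feature dimension 512, the two blocks, and $N_v = 20$ are arbitrary example values):

```python
import torch
import torch.nn as nn

# Reuses TDConvBlock from the sketch above; each block preserves the sequence length.
encoder = nn.Sequential(TDConvBlock(512), TDConvBlock(512))
z = encoder(torch.randn(4, 20, 512))  # context vectors z: (B, N_v, D_r)
```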