Video Object Detection
2022-08-04 19:03:00 【InfoQ】
Video object detection is a natural extension of image object detection to the video domain. Because a video is essentially a sequence of continuous images, video object detection and image object detection rest on the same basic principles. Video data consists of a large number of consecutive frames; pixels change little between adjacent frames, so much of the information is redundant. If a video is decomposed frame by frame and fed directly to an image object detection model, the heavy computation severely reduces detection speed and limits the practical value of the results, and the approach cannot handle motion blur, video defocus, unusual poses, or object occlusion.
When people are unsure of an object's identity, they draw on other relevant information, look for different objects that share high semantic similarity with the current one, and associate them. A video contains richer information about the same object instance, such as its appearance at different positions and from different viewpoints, so a video object detector should be more powerful than a still-image detector. The key challenge is to design a model that makes full use of the temporal context of the current frame to improve both detection accuracy and speed.
In 2016, Jifeng Dai's team at Microsoft Research Asia proposed the Deep Feature Flow (DFF) algorithm. Its core idea is to select a sparse set of key frames with a simple fixed-interval rule and to run the time-consuming deep convolutional network only on those frames. For the far more numerous non-key frames, an optical flow network (FlowNet) estimates the flow field between the key frame and the current non-key frame, and this flow field is used to propagate the key frame's deep convolutional feature maps to the current frame. Finally, all feature maps are fed into a task network to obtain the detection results. An optical flow network needs much less computation than a typical feature-extraction convolutional network, so DFF uses FlowNet to compute the feature maps of non-key frames and thereby reduces the model's overall cost. With a key-frame interval of 10 frames, the mAP of DFF drops by 0.8 relative to the baseline, but the speed increases about 5 times to 20.25 FPS.
Object detection in video is affected by many environmental factors, which makes single-frame image detection unreliable. Building on DFF, Jifeng Dai's team proposed FGFA [26] (Flow-Guided Feature Aggregation), which won the ILSVRC competition in 2017. The method first feeds the images into the feature extraction network to obtain feature maps, then warps the feature maps of the K adjacent frames before and after the current frame onto the current frame along the motion path. An adaptive weighting network aggregates the warped features into the features of the current frame, which enhances the current frame's feature representation and improves its quality. FGFA is a typical algorithm that trades speed for accuracy: with a key-frame interval of 10 frames, its mAP reaches 76.3%, which is 2.9% higher than the baseline, but the average per-frame detection time rises from 288 ms to 733 ms.
High-performance object detection relies on expensive convolutional networks to compute features, which poses a major challenge for applications that need to detect objects in video streams in real time. The key question is how to keep performance competitive while reducing the computational cost. In 2018, the CUHK-SenseTime Joint Lab published the Scale-Time Lattice (ST-Lattice) algorithm. It first obtains detection results on sparse key frames, then uses Propagation and Refinement Units to propagate those results across time and scale until the output nodes are reached, which assists detection on non-key frames. Finally, a tube-level classifier corrects the results. On the ImageNet VID dataset, ST-Lattice offers an accuracy/speed trade-off of 79.6 mAP at 20 FPS and 79.0 mAP at 62 FPS.
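The figure above shows the Scale-Time Lattice algorithm in detail. Each node in the figure represents the detection results at a particular scale and time point, and each edge represents one operation from one node to another. The horizontal operation T (blue) denotes temporal propagation. It uses a motion history image (MHI) to compute and retain enough motion information efficiently, so that it can handle large displacements between frames and roughly localize the object at the intermediate time. This operation considers only the object's motion and ignores the offset between the prediction and the ground truth. The vertical operation (green) denotes spatial refinement from low resolution to high resolution. It compensates for the effect of operation T by regressing the bounding-box offsets in a coarse-to-fine manner, which gives more precise localization. For a given video, convolutional feature extraction and detection are performed only on the sparse key frames; the results are then propagated along predefined paths to the bottom level, where the final results cover every time point.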
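To make the MHI used by the temporal propagation step more concrete, here is a minimal sketch of the classic motion history image update in plain Python. The function name `update_mhi`, the decay duration `tau`, and the difference threshold are assumptions chosen for illustration; the motion encoding actually used in ST-Lattice may differ in detail.

```python
def update_mhi(mhi, prev_frame, cur_frame, tau=5, diff_threshold=10):
    """Return the next motion history image (classic MHI update, a sketch).

    Pixels whose intensity changed by more than diff_threshold are set to
    tau; all other pixels decay by 1 toward 0. The parameter values are
    illustrative assumptions, not taken from the ST-Lattice paper.
    """
    new_mhi = []
    for h_row, p_row, c_row in zip(mhi, prev_frame, cur_frame):
        new_row = []
        for h, p, c in zip(h_row, p_row, c_row):
            if abs(c - p) > diff_threshold:
                new_row.append(tau)            # recent motion: reset to tau
            else:
                new_row.append(max(0, h - 1))  # no motion: fade out
        new_mhi.append(new_row)
    return new_mhi


# Toy 1x4 grayscale "video": a bright pixel moves one step right per frame.
frames = [
    [[255, 0, 0, 0]],
    [[0, 255, 0, 0]],
    [[0, 0, 255, 0]],
]
mhi = [[0, 0, 0, 0]]
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, cur, tau=5)
print(mhi)
```

Larger values in the resulting image mark more recent motion, so a single image summarizes where and roughly when the object moved between two key frames.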
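In 2018, Fanyi Xiao and Yong Jae Lee proposed the spatial-temporal memory network (STMN), a new RNN architecture that models how an object's appearance and motion change over time for video object detection. Passing information between frames through the memory introduces localization errors, so the MatchTrans mechanism is used to model object motion: a matching transform aligns the features from frame to frame. With this design the method reached state-of-the-art accuracy at the time.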
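Gedas Bertasius et al. proposed Spatiotemporal Sampling Networks (STSN), which use deformable convolutions to extract spatial features from adjacent frames for video object detection. In 2020, the Learnable Spatio-Temporal Sampling (LSTS) module [30] appeared. It uses ResNet-101 as the backbone for key frames and a lightweight network for non-key frames, pursuing both high accuracy and high speed while propagating high-level features between frames. The paper has two network modules, Sparsely Recursive Feature Updating (SRFU) and Dense Feature Aggregation (DFA). SRFU maintains a memory feature that captures temporal relations and is updated iteratively at key frames. DFA propagates the memory feature of the key frames to enhance and enrich the low-level features of non-key frames. The LSTS module is embedded in both SRFU and DFA so that features are propagated and aligned accurately between frames. The idea of the LSTS module is as follows: sampling positions are first chosen at random on the feature map F_t; the sampled positions are used to compute similarity weights between the embedded features f(F_t) and g(F_{t+k}); the weights are then used to aggregate the sampled features of F_t into the propagated feature F'_{t+k}. During training, the sampling positions are updated iteratively according to the final detection loss.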
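The similarity-weighted sampling idea behind LSTS can be sketched as follows. This is only an illustration under several assumptions that are not in the original text: the spatial axis is reduced to one dimension, the embeddings f and g are taken to be the identity, the weights are normalized with a softmax, and the offsets are fixed rather than learned from the detection loss. The names `sample_linear` and `propagate` are hypothetical.

```python
import math


def sample_linear(features, pos):
    """Linearly interpolate a 1-D sequence of feature vectors at a fractional position."""
    pos = min(max(pos, 0.0), len(features) - 1.0)  # clamp to the valid range
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(features) - 1)
    frac = pos - lo
    return [(1 - frac) * a + frac * b for a, b in zip(features[lo], features[hi])]


def propagate(feat_t, feat_tk, offsets):
    """Aggregate feat_t into an aligned estimate for frame t+k.

    For every location p, features of frame t are sampled at p + offset,
    scored against frame t+k's feature at p by dot product, and combined
    with softmax-normalized weights (the softmax is an assumption).
    """
    out = []
    for p, query in enumerate(feat_tk):
        samples = [sample_linear(feat_t, p + o) for o in offsets]
        scores = [sum(a * b for a, b in zip(s, query)) for s in samples]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(query)
        out.append([sum(w * s[d] for w, s in zip(weights, samples)) for d in range(dim)])
    return out


# Toy example: 4 locations, 2-channel features; the pattern shifts right by one.
feat_t = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
feat_tk = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
offsets = [-1.0, 0.0, 1.0]  # fixed here; LSTS would learn these during training
propagated = propagate(feat_t, feat_tk, offsets)
```

In this toy case the sample taken one step to the left matches the query best, so it receives the largest weight and the propagated feature follows the shifted pattern. Making the offsets trainable parameters is what would let the detection loss move the sampling positions.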
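All of the algorithms above use the deep network ResNet-101 for feature extraction and then Faster R-CNN or R-FCN as the detection network. This two-stage scheme effectively improves the accuracy of video object detection but falls short on speed. The scheme proposed in this paper runs a one-stage detection network on sparse key frames, giving up a small amount of accuracy in exchange for speed. Experimental results show that the proposed algorithm achieves real-time performance while maintaining a reasonable level of accuracy.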
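The sparse key-frame idea that runs through DFF and the scheme described above can be sketched as a simple scheduling loop. The detector and propagation functions here (`fake_detector`, `fake_propagate`) are hypothetical stand-ins so that the sketch runs; they are not the networks used in any of the cited papers.

```python
def detect_video(frames, key_interval, heavy_detector, propagate_result):
    """Run heavy_detector on every key_interval-th frame; propagate elsewhere."""
    results = []
    key_result = None
    for index, frame in enumerate(frames):
        if index % key_interval == 0:
            key_result = heavy_detector(frame)  # expensive path, key frames only
            results.append(key_result)
        else:
            # cheap path: reuse the most recent key-frame result
            results.append(propagate_result(key_result, frame))
    return results


# Hypothetical stand-ins that only count how often each path is taken.
calls = {"heavy": 0, "light": 0}


def fake_detector(frame):
    calls["heavy"] += 1
    return {"frame": frame, "source": "detector"}


def fake_propagate(key_result, frame):
    calls["light"] += 1
    return {"frame": frame, "source": "propagated"}


outputs = detect_video(list(range(25)), 10, fake_detector, fake_propagate)
print(calls)
```

With an interval of 10, only 3 of the 25 frames take the expensive path, which is where the speedup of key-frame methods comes from; the accuracy cost depends on how well the cheap propagation step tracks the scene between key frames.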