ECCV 2022 Oral: New SOTA for Video Instance Segmentation with SeqFormer & IDOL, and the CVPR 2022 Video Instance Segmentation Competition Champion Solution...
2022-08-05 09:04:00 【I love computer vision】
This article introduces two recent ECCV 2022 Oral papers that achieve the best performance on the Video Instance Segmentation (VIS) task under the offline and online paradigms, respectively. The work also took first place in the Video Instance Segmentation track of the 4th Large-scale Video Object Segmentation Challenge at CVPR 2022. The models and code are open source.
SeqFormer:https://arxiv.org/abs/2112.08275
IDOL: https://arxiv.org/abs/2207.10661
Official code: https://github.com/wjf5203/VNext
Ⅰ.SeqFormer: Sequential Transformer for Video Instance Segmentation, ECCV, 2022 (Oral).
SeqFormer: a sequential Transformer for video instance segmentation
Building on the vision Transformer, the paper proposes an offline VIS algorithm, SeqFormer. SeqFormer builds a feature representation for each object in the video and gives that representation the ability to capture global, video-level information. Unlike existing algorithms, SeqFormer proposes a query decomposition mechanism: the Instance Query is decomposed into Box Queries that extract location-specific information about the object on each frame independently, and this information is then aggregated to represent each instance more effectively at the video level. Without any tracking branch or post-processing, SeqFormer reaches 47.4 AP (ResNet-50) and 49.0 AP (ResNet-101) on YouTube-VIS, which is 4.6 and 4.4 AP higher, respectively, than the previous best algorithms.
Ⅱ.In Defense of Online Models for Video Instance Segmentation, ECCV, 2022 (Oral).
IDOL: a new paradigm for online video instance segmentation
This paper is an ECCV 2022 Oral paper. It examines why, in the VIS task, offline algorithms often lead contemporaneous online algorithms by about 10 AP, analyzes the causes of this large gap between online and offline models, and proposes an online algorithm based on contrastive learning, IDOL. The algorithm learns more discriminative instance embeddings and makes full use of the video's historical information to keep its predictions stable, raising the performance of online models to a level on par with, or even above, offline models. IDOL reaches 49.5 AP on YouTube-VIS 2019, surpassing the previous best online / offline algorithms by 13.2 / 2.1 AP, respectively. On the more challenging OVIS dataset, IDOL reaches 30.2 AP, twice the AP of the previous best algorithm. In the recent CVPR 2022 Large-Scale Video Object Segmentation Challenge, Video Instance Segmentation Track, IDOL also surpassed the other online/offline models and won first place.
VNext is a video instance recognition framework built by the authors on Detectron2, and all the code for the two papers above is now integrated into it. VNext aims to provide a unified and efficient framework for the video instance recognition field and to promote its development. Everyone is welcome to explore and experiment with video-related tasks in VNext: https://github.com/wjf5203/VNext .
Video Demo:
SeqFormer: Sequential Transformer for Video Instance Segmentation, ECCV, 2022 (Oral)
SeqFormer: a sequential Transformer for video instance segmentation
01
Motivation
Video instance segmentation is a vision task that has emerged in recent years. It adds a temporal dimension to image instance segmentation: besides segmenting the objects in each frame, the task requires tracking them across frames. How to make good use of the temporal features of video is therefore a major difficulty of the task.
Recent progress on Transformers has brought new solutions to this field. However, previous Transformer-based methods flatten the full three-dimensional video features and feed them directly into the Transformer Decoder, expecting the model to complete segmentation and tracking at the same time. Such a direct solution is effective, but it does not match the intuition about video. The paper argues that the two spatial dimensions and the temporal dimension of a video should be handled in different ways.
Therefore, SeqFormer proposes a query decomposition mechanism in the Decoder. Specifically, SeqFormer decomposes the shared Instance Query onto each frame, then independently locates the object on each frame and extracts the corresponding features, which ensures that the information extracted on each frame is accurate. Finally, the per-frame information is aggregated into a global feature representation of the object. This representation is used to predict the object category and to generate the dynamic convolution parameters that segment the object on each frame.
The paper argues that a feature aggregating global information in this way can represent objects in a video more robustly and efficiently, which further improves Transformer performance on VIS.
02
SeqFormer
SeqFormer's overall architecture has three parts: (1) the backbone network and Transformer Encoder, (2) the Query Decompose Decoder, and (3) the Output Heads that produce the various outputs. The backbone network and the Transformer Encoder both perform frame-level feature extraction.
2.1 Query Decompose Decoder
This part is the core structure of SeqFormer. Given a video in which an object's shape and position change, or the object is even occluded, people can usually still identify the object easily, because they treat its appearances in different frames as one and the same thing. This is the key difference between video and images.
The paper therefore introduces the concepts of Instance Query and Box Query. In the first Decoder layer, the shared Instance Query is decomposed onto each frame and performs attention on each frame independently. Each Box Query predicts the object's box on its frame through the Box Head, and is iteratively refined from one Decoder layer to the next.
The Box Query acts as the Instance Query's anchor on each frame: it locates and attends to the same object, and the extracted information is aggregated back into the Instance Query. With this Query Decompose Decoder, SeqFormer finds the object on each frame and aggregates the global feature.
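The decompose-then-aggregate flow described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses plain dot-product attention instead of whatever attention SeqFormer actually uses, it omits box prediction and layer-by-layer refinement, and the names `decompose_and_aggregate` and `frame_score_w` are hypothetical, not taken from the VNext code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decompose_and_aggregate(instance_query, frame_features, frame_score_w):
    """Toy query decomposition for ONE instance.

    instance_query: (C,)      shared video-level query.
    frame_features: (T, P, C) P feature locations on each of T frames.
    frame_score_w:  (C,)      hypothetical projection that scores how much
                              each frame contributes to the aggregate.
    """
    T = frame_features.shape[0]
    C = instance_query.shape[0]
    # Decompose: every frame starts from the same copy of the instance query.
    box_queries = np.tile(instance_query, (T, 1))                       # (T, C)
    # Each box query attends only to the features of its own frame.
    scores = np.einsum("tc,tpc->tp", box_queries, frame_features) / np.sqrt(C)
    attn = softmax(scores, axis=-1)                                      # (T, P)
    per_frame = np.einsum("tp,tpc->tc", attn, frame_features)           # (T, C)
    # Aggregate: weighted sum of the per-frame features over time.
    frame_weights = softmax(per_frame @ frame_score_w, axis=0)          # (T,)
    video_level = frame_weights @ per_frame                             # (C,)
    return per_frame, frame_weights, video_level

rng = np.random.default_rng(0)
per_frame, weights, video_level = decompose_and_aggregate(
    rng.normal(size=8), rng.normal(size=(4, 6, 8)), rng.normal(size=8)
)
print(per_frame.shape, video_level.shape)
```

Because all box queries start from the same vector, the first layer scores every frame with the same query, which matches the description that the per-frame queries share one initialization.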
The figure visualizes the regions attended to on each frame by the Box Queries that belong to the same Instance Query, after different Decoder layers. (a) shows the attended regions in the first Decoder layer: the Box Queries on all frames share the same initial value, so they attend to the same region. (b) shows the attended regions in the second layer: the model's attention is now distributed around the corresponding object. (c) shows the attended regions in the last Decoder layer, where the attention is more accurate. The whole Decoder thus localizes each object in a coarse-to-fine manner and aggregates a video-level feature representation for each object.
2.2 Output Head
Once the video-level feature representation of each object is obtained, two FFNs produce the object's classification result and the weight parameters of the Mask Head. The Mask Head is a three-layer 1x1 convolutional network. It is convolved over the high-resolution feature map that the Mask Branch produces from the Encoder output, so the same dynamic Mask Head predicts the mask on every frame. Because an object shares the same Mask Head across frames, SeqFormer segments objects very efficiently. In addition, a Mask Head generated from a small number of frames can be convolved over all frames to segment the whole video, which extends the ways SeqFormer can be applied.
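To make the shared dynamic Mask Head concrete, here is a minimal NumPy sketch. It assumes a flat parameter vector `theta` (standing in for the FFN output) that is split into the weights and biases of three 1x1 convolution layers with ReLU in between; the channel widths, the ReLU choice, and the helper names `num_params`, `split_params`, and `dynamic_mask_head` are assumptions for illustration, not details given in the article.

```python
import numpy as np

def layer_shapes(c_in, c_mid):
    # Weight and bias shapes for three 1x1 conv layers: c_in -> c_mid -> c_mid -> 1.
    return [(c_mid, c_in), (c_mid,), (c_mid, c_mid), (c_mid,), (1, c_mid), (1,)]

def num_params(c_in, c_mid):
    return sum(int(np.prod(s)) for s in layer_shapes(c_in, c_mid))

def split_params(theta, c_in, c_mid):
    # Cut the flat vector predicted for one instance into layer parameters.
    params, start = [], 0
    for shape in layer_shapes(c_in, c_mid):
        size = int(np.prod(shape))
        params.append(theta[start:start + size].reshape(shape))
        start += size
    assert start == theta.size, "theta has the wrong length"
    return params

def dynamic_mask_head(theta, mask_features, c_mid=8):
    """mask_features: (T, C, H, W). Returns mask logits of shape (T, H, W)."""
    c_in = mask_features.shape[1]
    w1, b1, w2, b2, w3, b3 = split_params(theta, c_in, c_mid)

    def conv1x1(x, w, b):
        # A 1x1 convolution is the same linear map applied at every pixel.
        return np.einsum("oc,tchw->tohw", w, x) + b[None, :, None, None]

    x = np.maximum(conv1x1(mask_features, w1, b1), 0.0)
    x = np.maximum(conv1x1(x, w2, b2), 0.0)
    return conv1x1(x, w3, b3)[:, 0]

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8, 5, 7))        # 3 frames, 8 channels
theta = rng.normal(size=num_params(8, 8))       # one instance's head parameters
print(dynamic_mask_head(theta, features).shape)
```

The same `theta` is applied to every frame, so the head could just as well be run on frames that were not used to produce it, which is the property the paragraph above relies on.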
03
Demo
The following demo shows some visualizations of SeqFormer on YouTube-VIS 2019 videos.
04
Performance
SeqFormer was evaluated on YouTube-VIS 2019 and 2021:
4.1 YouTube-VIS 2019
On YouTube-VIS 2019, SeqFormer surpasses all previous algorithms in mask AP by a large margin with every backbone. With ResNet-50 it reaches 47.4 mask AP. Combined with Swin-Transformer, SeqFormer raises the performance on this benchmark to a new high of 59.3. Because an offline model can process multiple frames in parallel as a batch, SeqFormer also reaches 72.3 FPS.
4.2 YouTube-VIS 2021
On YouTube-VIS 2021, SeqFormer also consistently achieves state-of-the-art results.
05
Conclusion
SeqFormer aligns the information of an object across the frames of a video and solves the segmentation and tracking problems of video instance segmentation naturally, without any post-processing, raising VIS model performance to a new level. The authors hope that the simple and efficient SeqFormer can bring some inspiration to the VIS field and become a strong baseline for future research.
In Defense of Online Models for Video Instance Segmentation, ECCV, 2022 (Oral)
IDOL: a new paradigm for online video instance segmentation
01
Motivation
In the VIS task, offline algorithms have typically led contemporaneous online algorithms by about 10 AP, yet online algorithms have inherent advantages in real-world tasks such as processing long videos and continuous video. To understand the causes of the gap between online and offline models in VIS, the authors designed two oracle experiments, frame and clip, and studied the existing offline models (IFC & SeqFormer) in detail:
For frame oracles, the ground-truth instance IDs are provided both within each clip and between adjacent clips, so performance depends only on the quality of the estimated segmentation masks. For clip oracles, the ground-truth instance IDs are provided only between adjacent clips, so the method must perform the association within each clip itself. The performance gap between frame oracles and clip oracles therefore reflects how well the black-box association of current offline models works.
Likewise, the paper also compares against the current best online algorithm (CrossVIS):
These experiments lead to the following conclusions:
From the perspective of instance segmentation, per-clip segmentation is not much better than per-frame segmentation in mask quality, and mask quality is not the cause of the poor performance of online methods: CrossVIS is even better than contemporaneous work (i.e. IFC).
The per-clip segmentation of current SOTA offline methods is not always effective and robust. More frames do provide more information, but this only helps in certain situations: per-clip segmentation does not noticeably improve SeqFormer's performance. Moreover, on more challenging datasets such as OVIS, when the clip becomes longer, multi-frame segmentation can even lower the performance of IFC and SeqFormer by 1.8 and 2.2 AP, respectively. Although per-clip segmentation, which uses multiple frames, is in theory an inherent advantage of offline algorithms, it still needs further exploration, especially in how to use the information across frames and how to handle complex patterns, occlusion, and object deformation.
From the perspective of cross-frame matching, a major advantage of offline methods is that they can use a black-box network for matching within a clip. This advantage is very evident on the YouTube-VIS dataset, and the authors show that it is the main cause of the performance gap between the current online and offline paradigms. However, as videos become more complex, the black-box association of offline algorithms also deteriorates rapidly (on OVIS, the performance of IFC/SeqFormer drops by 12.3/20.9 AP, respectively). In addition, when handling longer videos, offline methods need to split the video into multiple clips as input to stay within computational limits, so matching between clips remains unavoidable. Matching/association is therefore the main cause of the performance gap between online and offline models, and it also remains unavoidable and very important for offline models.
Having understood the performance of online and offline algorithms, the authors conclude that the key to improving online algorithms is to improve matching performance.
The paper therefore proposes IDOL. The key idea is to ensure, in the embedding space, the similarity of the same instance across frames and the dissimilarity of different instances in all frames. This yields more discriminative instance features with better temporal consistency, which ensures more accurate cross-frame association results.
Second, previous methods often select positive and negative samples with hand-crafted rules, which introduces false positives in occluded and crowded scenes. To solve this, the paper formulates sample selection as an optimal transport problem from optimization theory, which reduces false positives and further improves the quality of the contrastive learning samples.
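As a rough illustration of sample selection posed as optimal transport, the sketch below treats each ground-truth object as a supplier of `k` positive labels, adds a background supplier for the remaining predictions, and solves the resulting transport problem with Sinkhorn iteration. The article does not say which solver or cost IDOL uses, so the Sinkhorn solver, the constant background cost, and the names `sinkhorn` and `assign_positives` are all assumptions made for this example.

```python
import numpy as np

def sinkhorn(cost, supply, demand, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan with the given row and column sums."""
    K = np.exp(-cost / eps)
    u = np.ones_like(supply)
    v = np.ones_like(demand)
    for _ in range(n_iters):
        u = supply / (K @ v)      # rescale rows towards the supply
        v = demand / (K.T @ u)    # rescale columns towards the demand
    return u[:, None] * K * v[None, :]

def assign_positives(cost_gt, k):
    """cost_gt: (G, N) matching cost between G objects and N predictions.

    Each object supplies k positive labels; a background row supplies the rest.
    Returns owner[n] in {0..G-1} for positives, or G for background.
    """
    G, N = cost_gt.shape
    assert k * G < N, "need some predictions left over for background"
    bg_cost = np.full((1, N), cost_gt.mean())   # hypothetical background cost
    cost = np.vstack([cost_gt, bg_cost])
    supply = np.array([k] * G + [N - k * G], dtype=float)
    demand = np.ones(N)                          # every prediction takes one label
    plan = sinkhorn(cost, supply, demand)
    return plan.argmax(axis=0), plan

# Two objects, six predictions: predictions 0-1 fit object 0, 2-3 fit object 1.
cost_gt = np.ones((2, 6))
cost_gt[0, :2] = 0.0
cost_gt[1, 2:4] = 0.0
owner, plan = assign_positives(cost_gt, k=2)
print(owner)
```

Predictions whose owner is an object index would be used as positives for that object, and the rest as negatives, which gives several positives per object instead of the single match of one-to-one assignment.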
During inference, a one-to-many temporally weighted softmax uses information from historical frames to re-identify instances that disappeared because of occlusion, and strengthens the consistency and completeness of the association.
02
Details
To improve matching performance, the authors propose a framework that learns more discriminative features through contrastive learning. The overall network structure is shown in the figure below:
IDOL performs instance segmentation on each frame of the video individually. For a fair comparison with SeqFormer, IDOL uses the same instance segmentation pipeline as SeqFormer. IDOL involves two stages, training and inference:
Training stage. As shown in the figure above, a frame is randomly sampled from the training set as the key frame, and a neighboring frame from the same video is sampled as the reference frame. The key frame and the reference frame are fed into a weight-sharing backbone and Transformer.
The Transformer uses a fixed number N of learnable object queries to extract features from the feature map, and finally outputs N feature representations that contain the features of each object in the image. For the key frame, these representations are fed into three Output Heads to complete single-frame instance segmentation. To provide richer samples for contrastive learning, the original one-to-one matching between predictions and GT in SeqFormer is replaced by one-to-many matching via optimal transport, which increases the number of features corresponding to each GT.
For the reference frame, the N feature representations produced by the Transformer contain information about each object on the reference frame. Using optimal transport theory, and according to the predicted boxes and classification scores, multiple positive and negative samples on the reference frame are selected for each object on the key frame. In the figure, v is the feature representation of an object on the key frame, and k+ and k- are the feature representations of its positive and negative samples on the reference frame, respectively. These features are passed through an additional contrastive feature head, and a contrastive loss is computed on the resulting embeddings so that the network learns feature representations that better distinguish different objects.
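The article does not spell out the loss, so the following is only a sketch of one common multi-positive contrastive form, L = log(1 + Σ_{k+} Σ_{k-} exp(v·k- − v·k+)), written with the v, k+, k- notation used above. Treat the exact formula and the function name `contrastive_loss` as assumptions rather than the paper's definition.

```python
import numpy as np

def contrastive_loss(v, positives, negatives):
    """Multi-positive contrastive loss for one key-frame embedding.

    v:         (C,)   embedding of an object on the key frame.
    positives: (P, C) embeddings k+ selected on the reference frame.
    negatives: (M, C) embeddings k- selected on the reference frame.
    """
    pos = positives @ v                       # (P,) similarities v . k+
    neg = negatives @ v                       # (M,) similarities v . k-
    diff = neg[None, :] - pos[:, None]        # (P, M) every (k+, k-) pair
    # Stable log-sum-exp over all pairs, then log(1 + exp(.)) via logaddexp.
    m = diff.max()
    lse = m + np.log(np.exp(diff - m).sum())
    return float(np.logaddexp(0.0, lse))

v = np.array([1.0, 0.0])
same_object = np.array([[0.9, 0.1], [1.0, 0.0]])
other_objects = np.array([[0.0, 1.0], [-1.0, 0.0]])
good = contrastive_loss(v, same_object, other_objects)
bad = contrastive_loss(v, other_objects, same_object)   # roles swapped
print(good < bad)
```

The loss is small when every positive is more similar to v than every negative and grows when any negative overtakes a positive, which is the behavior the training stage needs from the embedding.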
Inference stage. IDOL feeds each frame of the video into the trained model in turn. While predicting the segmentation results of each frame, the model also generates a contrastive feature for each segmentation result, which is used to link the segmentation results across frames.
Specifically, a memory list that is updated in real time is initialized first. All objects detected in the first frame are added to the list and given initial ID numbers. On each subsequent frame, the contrastive features of the detected objects are compared with every object in the list by computing a bidirectional, one-to-many, temporally weighted softmax score. According to the scores, each newly detected object is matched to its corresponding entry in the memory list, and the contrastive features in the memory list are updated at the same time for matching in the next frame.
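The memory-based association can be sketched as follows. This is a simplified reading of the description, not IDOL's implementation: each memory entry keeps its past embeddings and is summarized by a decay-weighted average that favors recent frames, detections are scored against memory with a bidirectional softmax, and a detection below a threshold starts a new ID. The decay factor, the threshold, the greedy matching, and the names `bi_softmax` and `Memory` are all hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bi_softmax(det_emb, mem_emb):
    # Average of detection-to-memory and memory-to-detection softmax scores.
    sim = det_emb @ mem_emb.T                                   # (D, M)
    return 0.5 * (softmax(sim, axis=1) + softmax(sim, axis=0))

class Memory:
    def __init__(self, decay=0.5, match_thr=0.5):
        self.history = {}        # instance id -> list of embeddings, oldest first
        self.decay = decay       # hypothetical temporal decay factor
        self.match_thr = match_thr
        self.next_id = 0

    def temporal_embedding(self, inst_id):
        # Weighted average of past embeddings; recent frames weigh more.
        hist = np.stack(self.history[inst_id])
        ages = np.arange(len(hist))[::-1]
        w = self.decay ** ages
        return (w[:, None] * hist).sum(axis=0) / w.sum()

    def update(self, det_emb):
        """det_emb: (D, C) embeddings of this frame's detections. Returns their IDs."""
        mem_ids = list(self.history)
        scores = None
        if mem_ids:
            mem = np.stack([self.temporal_embedding(i) for i in mem_ids])
            scores = bi_softmax(det_emb, mem)
        ids = []
        for d, emb in enumerate(det_emb):
            matched = None
            if scores is not None:
                m = int(scores[d].argmax())
                if scores[d, m] > self.match_thr and mem_ids[m] not in ids:
                    matched = mem_ids[m]
            if matched is None:                  # unseen object: open a new entry
                matched = self.next_id
                self.next_id += 1
                self.history[matched] = []
            self.history[matched].append(emb)    # refresh memory for the next frame
            ids.append(matched)
        return ids

memory = Memory()
print(memory.update(np.array([[5.0, 0.0], [0.0, 5.0]])))   # first frame
print(memory.update(np.array([[0.1, 5.0], [5.0, 0.2]])))   # same objects, swapped order
```

An entry that is not matched in a frame simply stays in memory, so an instance that reappears after occlusion can still be linked to its old ID through its stored embeddings.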
03
Demo
The following demo shows some visualizations of IDOL on OVIS and YouTube-VIS 2019 videos.
04
Performance
The paper compares IDOL with current mainstream online and offline models. "V" means that only the YouTube-VIS training set is used for training. "V+I" means that videos synthesized from the overlapping categories of COCO are also used for joint training. Pre-training IDOL on key-reference frame pairs obtained by randomly cropping COCO images twice is also marked in the tables. As can be seen, IDOL far exceeds the other online algorithms and also exceeds the mainstream offline algorithms.
Table 1: Comparison on the YouTube-VIS 2019 val set
Table 2: Comparison on the YouTube-VIS 2021 val set
Table 3: Comparison on the OVIS 2021 val set
05
Conclusion
Online VIS methods have inherent advantages in handling long/continuous videos, but they lag significantly behind offline models in performance. This work aims to close that gap. The paper first analyzes current online and offline models and finds that the gap mainly comes from matching between frames.
Based on this observation, the paper proposes IDOL, which lets the model learn more discriminative and robust instance features for the VIS task. It significantly outperforms all online and offline methods and achieves new SOTA results on all three datasets. IDOL also won first place in the VIS workshop at CVPR 2022. The authors hope that the paper's analysis of current VIS methods can help future work on both online and offline methods.
06
1st Place Solution for YouTubeVOS Challenge 2022: Video Instance Segmentation
In the Video Instance Segmentation track of the 4th Large-scale Video Object Segmentation Challenge at the CVPR 2022 workshop, the method with IDOL as its baseline won first place, surpassing the runner-up by 4.9%, which also demonstrates the superiority of IDOL across scenarios. For the details of the entry, see the report: https://youtube-vos.org/assets/challenge/2022/reports/VIS_1st.pdf